op-conductor
This page explains what the op-conductor service is and how it works at a high level. It will also get you started on setting it up in your own environment.
Enhancing sequencer reliability and availability
The op-conductor is an auxiliary service designed to enhance the reliability and availability of a sequencer within high-availability setups. By minimizing the risks associated with a single point of failure, the op-conductor ensures that the sequencer remains operational and responsive.
Assumptions
It is important to note that the op-conductor does not incorporate Byzantine fault tolerance (BFT). This means the system operates under the assumption that all participating nodes are honest and act correctly.
Summary of guarantees
The design of the op-conductor provides the following guarantees:
- No Unsafe Reorgs
- No Unsafe Head Stall During Network Partition
- 100% Uptime with No More Than 1 Node Failure
Design
op-conductor serves the following functions:
Raft consensus layer participation
- Leader determination: Participates in the Raft consensus algorithm to determine the leader among sequencers.
- State management: Stores the latest unsafe block, ensuring consistency across the system.
RPC request handling
- Admin RPC: Provides administrative RPCs for manual recovery scenarios, including, but not limited to: stopping the leadership vote and removing itself from the cluster.
- Health RPC: Offers health RPCs for the op-node to determine whether it should allow the publishing of transactions and unsafe blocks.
Sequencer health monitoring
- Continuously monitors the health of the sequencer (op-node) to ensure optimal performance and reliability.
Control loop management
- Implements a control loop to manage the status of the sequencer (op-node), including starting and stopping operations based on different scenarios and health checks.
Conductor state transition
The following is a state machine diagram of how the op-conductor manages the sequencer's Raft consensus.
Setup
At OP Labs, op-conductor is deployed as a Kubernetes StatefulSet because it requires a persistent volume to store the raft log. This guide describes setting up conductor on an existing network without incurring downtime. You can utilize the op-conductor-ops tool to confirm the conductor status between the steps.
Assumptions
This setup guide has the following assumptions:
- 3 deployed sequencers (sequencer-0, sequencer-1, sequencer-2) that are all in sync and in the same VPC network
- sequencer-0 is currently the active sequencer
- You can execute a blue/green style sequencer deployment workflow that involves no downtime (described below)
- conductor and sequencers are running in k8s or some other container orchestrator (VM-based deployments may be slightly different and are not covered here)
Spin up op-conductor
1. Deploy conductor
Deploy a conductor instance per sequencer, with sequencer-1 as the raft cluster bootstrap node:
- Suggested conductor configs (see the sketch below)
- sequencer-1 op-conductor extra config (included at the end of the sketch below)
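The exact format depends on your orchestration tooling. As a minimal sketch, assuming the environment-variable form of the flags listed under Configuration options below (with the OP_CONDUCTOR_ prefix that later steps in this guide reference), and with hostnames, ports, and intervals as illustrative placeholders:

```bash
# Sketch of one conductor instance's environment (sequencer-1 shown).
# Env var names assume the OP_CONDUCTOR_ prefix referenced later in this guide;
# hostnames, ports, and intervals are placeholders for your environment.
export OP_CONDUCTOR_CONSENSUS_ADDR="0.0.0.0"
export OP_CONDUCTOR_CONSENSUS_PORT="50050"
export OP_CONDUCTOR_CONSENSUS_ADVERTISED="sequencer-1-conductor:50050"
export OP_CONDUCTOR_RAFT_SERVER_ID="sequencer-1"            # unique per conductor
export OP_CONDUCTOR_RAFT_STORAGE_DIR="/data/raft"           # backed by a persistent volume
export OP_CONDUCTOR_NODE_RPC="http://sequencer-1-op-node:9545"
export OP_CONDUCTOR_EXECUTION_RPC="http://sequencer-1-op-geth:8545"
export OP_CONDUCTOR_HEALTHCHECK_INTERVAL="1"
export OP_CONDUCTOR_HEALTHCHECK_UNSAFE_INTERVAL="5"
export OP_CONDUCTOR_HEALTHCHECK_MIN_PEER_COUNT="2"

# sequencer-1 only: start paused and bootstrap the raft cluster on first start.
export OP_CONDUCTOR_PAUSED="true"
export OP_CONDUCTOR_RAFT_BOOTSTRAP="true"
```

In this sketch only sequencer-1 carries the paused and bootstrap extras, which matches steps 2 and 11 below.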
2. Pause two conductors
Pause the sequencer-0 and sequencer-2 conductors with a conductor_pause RPC request.
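For example, a conductor_pause request can be sent over JSON-RPC with curl (the conductor RPC URLs are placeholders for wherever your conductor instances listen):

```bash
# Pause the sequencer-0 and sequencer-2 conductors; conductor_pause takes no parameters.
for conductor in sequencer-0-conductor sequencer-2-conductor; do
  curl -s -X POST -H "Content-Type: application/json" \
    -d '{"jsonrpc":"2.0","method":"conductor_pause","params":[],"id":1}' \
    "http://${conductor}:8547"
done
```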
3. Update op-node configuration and switch the active sequencer
Deploy an op-node config update to all sequencers that enables conductor. Use a blue/green style deployment workflow that switches the active sequencer to sequencer-1:
- All sequencer op-node configs (see the sketch below)
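As a sketch of the op-node change, assuming the environment-variable form of op-node's conductor options (verify the exact flag names against your op-node release):

```bash
# Enable conductor mode on each sequencer's op-node.
# Flag/env names are assumed; the conductor URL points at that sequencer's own conductor.
export OP_NODE_CONDUCTOR_ENABLED="true"
export OP_NODE_CONDUCTOR_RPC="http://sequencer-1-conductor:8547"   # per-sequencer value
```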
4. Confirm sequencer switch was successful
Confirm sequencer-1 is active and successfully producing unsafe blocks. Because sequencer-1 was the raft cluster bootstrap node, it is now committing unsafe payloads to the raft log.
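One way to confirm this is to query sequencer-1's op-node directly (assuming its admin RPC namespace is enabled; the URL is a placeholder) and watch the unsafe head advance across repeated calls:

```bash
# Should return true on the active sequencer.
curl -s -X POST -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"admin_sequencerActive","params":[],"id":1}' \
  http://sequencer-1-op-node:9545

# unsafe_l2 in the result should keep advancing.
curl -s -X POST -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"optimism_syncStatus","params":[],"id":1}' \
  http://sequencer-1-op-node:9545
```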
5. Add voting nodes
Add the voting nodes to the cluster using a conductor_addServerAsVoter RPC request to the leader conductor (sequencer-1).
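For example, sent to the leader conductor; the parameters shown (raft server ID, advertised consensus address, and a raft config version) are an assumption, so confirm the exact parameter list for your op-conductor release:

```bash
# Add sequencer-0 and sequencer-2 as raft voters via the leader conductor (sequencer-1).
# URLs, ports, and the parameter list are assumptions; check your op-conductor version.
for id in sequencer-0 sequencer-2; do
  curl -s -X POST -H "Content-Type: application/json" \
    -d "{\"jsonrpc\":\"2.0\",\"method\":\"conductor_addServerAsVoter\",\"params\":[\"${id}\",\"${id}-conductor:50050\",0],\"id\":1}" \
    http://sequencer-1-conductor:8547
done
```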
6. Confirm state
Confirm cluster membership and sequencer state:
sequencer-0 and sequencer-2:
- raft cluster follower
- sequencer is stopped
- conductor is paused
- conductor enabled in op-node config
sequencer-1:
- raft cluster leader
- sequencer is active
- conductor is paused
- conductor enabled in op-node config
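Cluster membership can be checked with op-conductor-ops or directly against the leader conductor, for example:

```bash
# Expect three servers, with sequencer-1 as the leader and the other two as voters.
curl -s -X POST -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"conductor_clusterMembership","params":[],"id":1}' \
  http://sequencer-1-conductor:8547
```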
7. Resume conductors
Resume all conductors with a conductor_resume RPC request to each conductor instance.
8. Confirm state
Confirm all conductors successfully resumed with a conductor_paused RPC request to each instance; it should return false.
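For example, loop over all three conductors and check that conductor_paused now returns false (URLs are placeholders):

```bash
# conductor_paused should return false on every instance after resuming.
for conductor in sequencer-0-conductor sequencer-1-conductor sequencer-2-conductor; do
  curl -s -X POST -H "Content-Type: application/json" \
    -d '{"jsonrpc":"2.0","method":"conductor_paused","params":[],"id":1}' \
    "http://${conductor}:8547"
done
```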
9. Transfer leadership
Trigger a leadership transfer to sequencer-0 using conductor_transferLeaderToServer.
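A sketch of the transfer request, sent to the current leader (sequencer-1); the parameters shown (target raft server ID and consensus address) are an assumption, so confirm the signature for your op-conductor release:

```bash
# Transfer raft leadership from sequencer-1 to sequencer-0.
curl -s -X POST -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"conductor_transferLeaderToServer","params":["sequencer-0","sequencer-0-conductor:50050"],"id":1}' \
  http://sequencer-1-conductor:8547
```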
10. Confirm state
Confirm cluster membership and sequencer state:
sequencer-1 and sequencer-2:
- raft cluster follower
- sequencer is stopped
- conductor is active
- conductor enabled in op-node config
sequencer-0:
- raft cluster leader
- sequencer is active
- conductor is active
- conductor enabled in op-node config
11. Update configuration
Deploy a config change to the sequencer-1 conductor to remove the OP_CONDUCTOR_PAUSED: true flag and the OP_CONDUCTOR_RAFT_BOOTSTRAP flag.
Blue/green deployment
In order to ensure there is no downtime when setting up conductor, you need a deployment script that can update sequencers without network downtime. An example of this workflow might look like:
- Query the current state of the network and determine which sequencer is currently active (referred to as the “original” sequencer below). From the other available sequencers, choose a candidate sequencer.
- Deploy the change to the candidate sequencer and then wait for it to sync up to the original sequencer’s unsafe head. You may want to check peer counts and other important health metrics.
- Stop the original sequencer using admin_stopSequencer, which returns the last inserted unsafe block hash. Wait for the candidate sequencer to sync to this returned hash in case there is a delta.
- Start the candidate sequencer at the original's last inserted unsafe block hash (see the sketch after this list).
- Here you can also execute additional checks for unsafe head progression and decide to roll back the change (stop the candidate sequencer, start the original, roll back the candidate's deployment, etc.)
- Deploy the change to the original sequencer, wait for it to sync to the chain head. Execute health checks.
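As a sketch of the stop/start handover in this workflow, assuming the op-node admin RPC namespace is enabled (URLs are placeholders and jq is used to extract the returned hash):

```bash
# Stop the original sequencer; admin_stopSequencer returns the last inserted unsafe block hash.
HASH=$(curl -s -X POST -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"admin_stopSequencer","params":[],"id":1}' \
  http://original-op-node:9545 | jq -r '.result')

# ...wait here until the candidate's unsafe head reaches $HASH...

# Start the candidate sequencer from that exact block hash.
curl -s -X POST -H "Content-Type: application/json" \
  -d "{\"jsonrpc\":\"2.0\",\"method\":\"admin_startSequencer\",\"params\":[\"${HASH}\"],\"id\":1}" \
  http://candidate-op-node:9545
```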
Post-conductor launch deployments
After conductor is live, a similar canary-style workflow is used to ensure minimal downtime in case there is an issue with a deployment:
- Choose a candidate sequencer from the raft cluster followers.
- Deploy to the candidate sequencer. Run health checks on the candidate.
- Transfer leadership to the candidate sequencer using conductor_transferLeaderToServer. Run health checks on the candidate.
- Test whether the candidate is still the leader using conductor_leader after some grace period (for example, 30 seconds), as sketched below.
  - If not, there is likely an issue with the deployment. Roll back.
- Upgrade the remaining sequencers and run health checks.
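The grace-period leadership check might look like this sketch (URL and timing are placeholders):

```bash
# After transferring leadership to the candidate, wait out a grace period and
# verify it is still the raft leader; if this returns false, roll back.
sleep 30
curl -s -X POST -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"conductor_leader","params":[],"id":1}' \
  http://candidate-conductor:8547
```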
Configuration options
op-conductor is configured via its command-line flags and environment variables.
--consensus.addr (CONSENSUS_ADDR)
- Usage: Address to listen for consensus connections
- Default Value: 127.0.0.1
- Required: yes
--consensus.advertised (CONSENSUS_ADVERTISED)
- Usage: Address to advertise for consensus connections
- Default Value: 127.0.0.1
- Required: yes
--consensus.port (CONSENSUS_PORT)
- Usage: Port to listen for consensus connections
- Default Value: 50050
- Required: yes
--raft.bootstrap (RAFT_BOOTSTRAP)
For bootstrapping a new cluster. This should only be used on the sequencer that is currently active, and the node can only be started once with this flag; afterwards, the flag has to be removed or the raft log must be deleted before re-bootstrapping the cluster.
- Usage: If this node should bootstrap a new raft cluster
- Default Value: false
- Required: no
--raft.server.id (RAFT_SERVER_ID)
- Usage: Unique ID for this server used by raft consensus
- Default Value: None specified
- Required: yes
--raft.storage.dir (RAFT_STORAGE_DIR)
- Usage: Directory to store raft data
- Default Value: None specified
- Required: yes
--node.rpc (NODE_RPC)
- Usage: HTTP provider URL for op-node
- Default Value: None specified
- Required: yes
--execution.rpc (EXECUTION_RPC)
- Usage: HTTP provider URL for execution layer
- Default Value: None specified
- Required: yes
--healthcheck.interval (HEALTHCHECK_INTERVAL)
- Usage: Interval between health checks
- Default Value: None specified
- Required: yes
--healthcheck.unsafe-interval (HEALTHCHECK_UNSAFE_INTERVAL)
- Usage: Interval allowed between unsafe head and now measured in seconds
- Default Value: None specified
- Required: yes
--healthcheck.safe-enabled (HEALTHCHECK_SAFE_ENABLED)
- Usage: Whether to enable safe head progression checks
- Default Value: false
- Required: no
--healthcheck.safe-interval (HEALTHCHECK_SAFE_INTERVAL)
- Usage: Interval between safe head progression measured in seconds
- Default Value: 1200
- Required: no
--healthcheck.min-peer-count (HEALTHCHECK_MIN_PEER_COUNT)
- Usage: Minimum number of peers required to be considered healthy
- Default Value: None specified
- Required: yes
--paused (PAUSED)
The paused state set via RPC is not persisted to configuration, so if you unpause via RPC and then restart, the conductor will start paused again.
- Usage: Whether the conductor is paused
- Default Value: false
- Required: no
--rpc.enable-proxy (RPC_ENABLE_PROXY)
- Usage: Enable the RPC proxy to underlying sequencer services
- Default Value: true
- Required: no
RPCs
Conductor exposes admin RPCs on the conductor namespace.
conductor_overrideLeader
OverrideLeader is used to override the leader status; it only causes Leader() and LeaderWithID() calls to return true and does not impact the actual raft consensus leadership status. It is intended for when the cluster is unhealthy and this node is the only one up, allowing the batcher to connect to the node so that it can download blocks from the manually started sequencer.
conductor_pause
Pause pauses op-conductor.
conductor_resume
Resume resumes op-conductor.
conductor_paused
Paused returns true if the op-conductor is paused.
conductor_stopped
Stopped returns true if the op-conductor is stopped.
conductor_sequencerHealthy
SequencerHealthy returns true if the sequencer is healthy.
conductor_leader
API related to consensus.
conductor_leaderWithID
API related to consensus.
conductor_addServerAsVoter
API related to consensus.
conductor_addServerAsNonvoter
API related to consensus.
conductor_removeServer
API related to consensus.
conductor_transferLeader
API related to consensus.
conductor_transferLeaderToServer
API related to consensus.
conductor_clusterMembership
ClusterMembership returns the current cluster membership configuration.
conductor_active
API called by op-node.
conductor_commitUnsafePayload
API called by op-node.
Next steps
- Check out op-conductor-mon, which monitors multiple op-conductor instances and provides a unified interface for reporting metrics.
- Get familiar with op-conductor-ops to interact with op-conductor.