Slow Lane

The slow lane is MOSAIC’s durable rendering and telemetry path. It carries every step and episode event from workers through gRPC, a publish-subscribe bus, and into SQLite: feeding both the live UI and persistent storage for replay and analytics. Where the Fast Lane optimises for latency, the slow lane optimises for completeness: every event is persisted.

Pipeline Overview

        %%{init: {"flowchart": {"curve": "linear"}} }%%
graph LR
    W["Worker Process<br/>(JSONL stdout)"]
    PROXY["TrainerTelemetryProxy<br/>JSONL → gRPC"]
    DAEMON["TrainerService<br/>(gRPC Daemon)"]
    BUS["RunBus<br/>pub-sub · queue 2 048"]
    LTC["LiveTelemetryController<br/>(background thread)"]
    CM["CreditManager<br/>200 credits/stream"]
    RSR["RenderingSpeedRegulator<br/>100 ms drain · queue 32"]
    LTT["LiveTelemetryTab"]
    SINK["TelemetryDBSink<br/>batch 256"]
    DB[("SQLite / WAL")]

    W --> PROXY --> DAEMON --> BUS
    BUS --> LTC --> CM --> RSR --> LTT
    BUS --> SINK --> DB

    style BUS fill:#fff3e0,stroke:#e65100,color:#333
    style DB fill:#e3f2fd,stroke:#1565c0,color:#333
    

All queue sizes above are governed by Application Constants: see constants_telemetry.py and constants_telemetry_bus.py.

Components

TrainerTelemetryProxy

Lives in gym_gui/services/trainer/trainer_telemetry_proxy.py. Tails the worker’s JSONL stdout stream and translates each line into a gRPC PublishRunSteps / PublishRunEpisodes call on the daemon. When Fast Lane mode is active, the proxy also extracts RGB frames and writes them to a FastLaneWriter: bridging the two lanes.

RunBus

An in-process publish-subscribe event bus (gym_gui/telemetry/run_bus.py) that fans telemetry events to all subscribers. Key topics:

  • STEP_APPENDED: a new step arrived.

  • EPISODE_FINALIZED: an episode completed.

  • CONTROL: pause / resume / stop commands.

Default queue size: RUNBUS_DEFAULT_QUEUE_SIZE = 2048 (RUNBUS_UI_PATH_QUEUE_SIZE = 512, RUNBUS_DB_PATH_QUEUE_SIZE = 1024).

LiveTelemetryController

A QObject (gym_gui/controllers/live_telemetry_controllers.py) that subscribes to the RunBus on a background thread and emits Qt signals on the main thread.

Signals:

Signal

Description

run_tab_requested(run_id, agent_id, tab_name)

First event for a new agent → create a Render Tabs LiveTelemetryTab.

telemetry_stats_updated(run_id, stats)

Aggregate stats changed (steps, episodes, mean reward).

run_completed(run_id)

Training run finished: clean up resources.

Subscription lifecycle:

        sequenceDiagram
    autonumber
    participant C as LiveTelemetryController
    participant RB as RunBus
    participant CM as CreditManager
    participant TAB as LiveTelemetryTab

    note over C: Phase 1: Subscribe to a run
    C->>RB: subscribe_to_runbus(run_id)
    C->>CM: initialize_stream(run_id, "default")

    note over C: Phase 2: First step triggers tab creation
    RB-->>C: STEP_APPENDED event
    C->>C: _process_step_queue()
    C-->>TAB: emit run_tab_requested(run_id, agent_id, tab_name)

    note over TAB: Phase 3: Tab registers itself
    TAB->>C: register_tab(run_id, agent_id, tab)
    C->>CM: grant_credits(run_id, agent_id, 200)
    C->>TAB: flush buffered steps/episodes
    

Queue sizes: LIVE_STEP_QUEUE_SIZE = 64, LIVE_EPISODE_QUEUE_SIZE = 64, LIVE_CONTROL_QUEUE_SIZE = 32.

CreditManager

Credit-based backpressure (gym_gui/telemetry/credit_manager.py) prevents the bus from overwhelming the UI thread.

Method

Behaviour

initialize_stream(run_id, agent_id)

Sets initial credits to INITIAL_CREDITS (200).

consume_credit(run_id, agent_id) bool

Returns True and decrements if credit > 0. Returns False and increments the drop counter otherwise.

grant_credits(run_id, agent_id, amount)

Grants credits up to initial × 2 cap.

When credits reach zero the producer pauses the UI rendering path: the Application Constants CreditDefaults (initial_credits=200, starvation_threshold=10) control the thresholds. Database writes via TelemetryDBSink are never throttled.

RenderingSpeedRegulator

A QObject (gym_gui/telemetry/rendering_speed_regulator.py) that decouples visual frame rendering from table/telemetry updates.

  • Maintains a bounded deque of render payloads (default max RENDER_QUEUE_SIZE = 32).

  • Drains at a configurable interval (default 100 ms → ~10 FPS).

  • Auto-drops oldest payloads when the queue is full: the GUI always shows the freshest available frame.

  • Emits payload_ready(dict) when a frame should be painted.

  • Buffers early payloads submitted before start() is called; an auto-start timer (RENDER_BOOTSTRAP_TIMEOUT_MS = 500 ms) ensures the regulator eventually starts even if no explicit start() call arrives.

regulator = RenderingSpeedRegulator(render_delay_ms=100)
regulator.payload_ready.connect(tab.render_payload)
regulator.start()
# Workers submit at arbitrary rate:
regulator.submit_payload({"rgb": frame_bytes, "reward": 0.5})

Human-Control Path

When a human plays, the slow lane takes a shorter route:

        %%{init: {"flowchart": {"curve": "linear"}} }%%
graph LR
    HIC["HumanInputController<br/>keyboard shortcuts"]
    SC["SessionController"]
    TEL["TelemetryService"]
    STOR["StorageRecorderService<br/>JSONL ring"]
    SQL["TelemetrySQLiteStore"]
    DB[("SQLite / WAL")]

    HIC --> SC --> TEL
    TEL --> STOR
    TEL --> SQL --> DB

    style HIC fill:#9370db,stroke:#6a0dad,color:#fff
    style DB fill:#e3f2fd,stroke:#1565c0,color:#333
    

HumanInputController captures keyboard events, SessionController emits StepRecord objects, and TelemetryService fans them to both the JSONL ring buffer and SQLite for durable persistence. The Log Constants codes LOG401LOG407 trace session lifecycle and human input events.

Agent-Control Path

Remote agents stream JSONL through trainer_telemetry_proxy.py, which calls TrainerService.PublishRunSteps / PublishRunEpisodes. TrainerService fans events onto RunBus; TelemetryDBSink drains the bus into TelemetrySQLiteStore with batch writes (DB_SINK_BATCH_SIZE = 256, DB_SINK_CHECKPOINT_INTERVAL = 4096).

Design Principles

  • WAL + batching: SQLite’s WAL mode with large batch writes keeps the slow lane efficient under high-frequency telemetry.

  • Hot vs cold storage: LiveTelemetryTab shows hot data from RunBus; SQLite provides cold storage for replay and post-hoc analysis.

  • The GUI never blocks: all writes are asynchronous. Credit-based backpressure ensures the UI thread stays responsive.

  • Complementary to the fast lane: the Fast Lane gives real-time visuals while the slow lane guarantees every event is persisted for Architecture Overview analytics (W&B, TensorBoard).

See Also

  • Fast Lane: the zero-serialisation rendering path for live training.

  • Render Tabs: LiveTelemetryTab is the slow-lane widget that receives regulated payloads.

  • Rendering Strategies: the RendererRegistry decides how each slow-lane frame is painted.

  • Application Constants: all queue sizes, batch sizes, and credit thresholds live in the constants package.