MOSAIC VLM Worker¶

The MOSAIC VLM Worker is MOSAIC’s native Vision-Language Model worker for evaluating multimodal agents in RL environments. It extends the LLM Worker with image observation support, enabling VLM models to perceive raw RGB frames from environments like Crafter, BabyAI, and MultiGrid alongside structured text descriptions.

The VLM Worker shares the same architecture, agent strategies, and coordination levels as the LLM Worker. The key difference is that it can include image history in prompts, allowing vision-capable models (GPT-4o, Claude 3, Gemini) to reason directly over visual observations rather than relying solely on text descriptions generated from grid arrays.

Paradigm	Multi-agent VLM coordination and adversarial (also single-agent)
Task Type	VLM evaluation with image observations, multimodal reasoning, cooperative teams, adversarial opponents
Model Support	OpenRouter (unified), OpenAI, Anthropic, Google Gemini, vLLM (local)
Environments	MultiGrid (Soccer 1v1/2v2, Collect), BabyAI, MiniGrid, MiniHack, Crafter, TextWorld, BabaIsAI, PettingZoo
Execution	Subprocess (autonomous or interactive step-by-step)
GPU required	No (API-based) / Optional (vLLM local inference)
Source	`3rd_party/workers/mosaic/vlm_worker/vlm_worker/`
Entry point	`vlm-worker` (CLI)

Overview¶

The VLM Worker converts environment frames into multimodal prompts that combine text descriptions with RGB images. This enables two research directions beyond what the text-only LLM Worker provides:

Visual grounding: Can VLMs identify objects, navigate mazes, or coordinate with teammates using pixel observations instead of symbolic text?
Multimodal vs text-only: Does adding image context improve LLM performance in grid-world environments, or is text sufficient?

Key features:

Image observation support: configurable image history depth (max_image_history ≥ 1 for VLM mode, 0 for text-only fallback)
Same agent strategies as LLM Worker: naive, chain-of-thought, robust variants, few-shot, dummy
Same 3 coordination levels: emergent, basic hints, role-based
Pluggable API backends: OpenRouter, OpenAI, Anthropic, Google Gemini, vLLM
Dual runtime modes: autonomous (batch episodes) or interactive (GUI step-by-step)
JSONL telemetry: streamed to GUI and written to disk

Architecture¶

The VLM Worker follows the same shim pattern as the LLM Worker, with an additional image encoding step in the observation pipeline:

        graph TB
    subgraph "MOSAIC GUI"
        FORM["Operator Config<br/>(per-player model)"]
        DAEMON["Operator Launcher"]
    end

    subgraph "VLM Worker Subprocess"
        CLI["cli.py<br/>(vlm-worker)"]
        CFG["config.py<br/>(VLMWorkerConfig)"]
        RT["runtime.py<br/>(VLMWorkerRuntime /<br/>InteractiveVLMRuntime)"]
        OBS["observations.py<br/>(grid → text + image)"]
        PROMPT["prompts.py<br/>(3 coordination levels)"]
        CLIENT["client.py<br/>(OpenAI / Claude / Gemini)"]
    end

    subgraph "VLM API"
        API["OpenRouter / OpenAI<br/>Anthropic / Gemini / vLLM"]
    end

    FORM -->|"config JSON"| DAEMON
    DAEMON -->|"spawn"| CLI
    CLI --> CFG --> RT
    RT --> OBS
    RT --> PROMPT
    RT --> CLIENT
    CLIENT -->|"chat.completions<br/>(text + images)"| API

    style FORM fill:#4a90d9,stroke:#2e5a87,color:#fff
    style DAEMON fill:#50c878,stroke:#2e8b57,color:#fff
    style CLI fill:#9370db,stroke:#6a0dad,color:#fff
    style CFG fill:#9370db,stroke:#6a0dad,color:#fff
    style RT fill:#9370db,stroke:#6a0dad,color:#fff
    style OBS fill:#dda0dd,stroke:#993399,color:#333
    style PROMPT fill:#dda0dd,stroke:#993399,color:#333
    style CLIENT fill:#9370db,stroke:#6a0dad,color:#fff
    style API fill:#e8e8e8,stroke:#999

VLM vs LLM Worker¶

Aspect	VLM Worker	LLM Worker
Observations	Text + RGB images (multimodal)	Text only
Image history	`max_image_history ≥ 1`	N/A
Use case	Visual grounding, multimodal reasoning	Text-based reasoning, Theory of Mind
CLI command	`vlm-worker`	`llm-worker`
Config class	`VLMWorkerConfig`	`LLMWorkerConfig`
Runtime classes	`VLMWorkerRuntime`, `InteractiveVLMRuntime`	`LLMWorkerRuntime`, `InteractiveLLMRuntime`

Both workers share the same directory structure (agents/, environments/, config/, prompt_builder/), agent strategies, and coordination levels.

Agent Strategies¶

Type	Description
`naive`	Direct observation-to-action mapping with image context.
`cot`	Chain-of-thought reasoning over text and image observations.
`robust_naive`	Naive with retry and fallback on parse failure.
`robust_cot`	Chain-of-thought with retry and fallback.
`few_shot`	In-context learning with example trajectories.
`dummy`	Random actions for baseline comparison (ignores images).

Supported Environments¶

All environments supported by the LLM Worker are also supported by the VLM Worker. Environments that provide RGB render frames benefit most from VLM mode:

Environment	RGB Support	Notes
MultiGrid (Soccer, Collect)	✅	Full grid rendering with agent colors and ball positions
BabyAI / MiniGrid	✅	Partial observability grid renders
Crafter	✅	Rich survival environment with diverse visual elements
MiniHack / NLE	✅	Roguelike tile-based rendering
TextWorld	❌	Text-only (falls back to LLM-style prompts)
PettingZoo	✅	Board game renders (Chess, Connect Four, Go)

Runtime Modes¶

Autonomous mode (batch episodes with image observations):

vlm-worker --run-id test123 \
    --env crafter \
    --client openrouter \
    --model openai/gpt-4o-mini \
    --max-image-history 1 \
    --num-episodes 10 --max-steps 200

Text-only fallback (equivalent to LLM Worker):

vlm-worker --run-id test123 \
    --env minihack \
    --max-image-history 0 \
    --num-episodes 5

Interactive mode (GUI step-by-step):

vlm-worker --run-id test123 --interactive \
    --env multigrid \
    --task MosaicMultiGrid-Soccer-2vs2-IndAgObs-v0

Interactive mode reads JSON commands from stdin and emits telemetry to stdout, identical to the LLM Worker protocol.

Configuration¶

JSON config (launched by GUI or CLI):

{
  "run_id": "vlm_crafter_001",
  "env_name": "crafter",
  "task": "CrafterReward-v1",
  "client_name": "openrouter",
  "model_id": "openai/gpt-4o-mini",
  "agent_type": "cot",
  "max_image_history": 1,
  "num_episodes": 5,
  "max_steps": 200,
  "temperature": 0.7
}

VLM-specific config fields:

Field	Default	Description
`max_image_history`	`1`	Number of past image frames to include in prompts. `0` = text-only fallback, `≥ 1` = VLM multimodal mode.
`max_text_history`	varies	Maximum text history entries alongside images
`render_mode`	`None`	`"rgb_array"` to capture frames for VLM input

All other fields (client_name, model_id, agent_type, coordination_level, observation_mode, etc.) are identical to the LLM Worker.

CLI Reference¶

vlm-worker --run-id <id> [options]

Environment:
  --env {babyai,minihack,crafter,...}    Environment family (default: babyai)
  --task <name>                          Gymnasium environment ID
  --max-steps <int>                      Max steps per episode (default: 100)
  --num-episodes <int>                   Episodes to run (default: 5)
  --seed <int>                           Random seed
  --render-mode {rgb_array,human}        Render mode for image capture

VLM Client:
  --client {openrouter,openai,...}       API backend (default: openrouter)
  --model <model_id>                     Model identifier
  --api-key <key>                        API key (or use env vars)
  --base-url <url>                       Custom endpoint (for vLLM)
  --temperature <float>                  Sampling temperature (default: 0.7)
  --timeout <float>                      Request timeout (default: 60)

Agent:
  --agent-type {naive,cot,...}           Agent strategy (default: naive)
  --max-image-history <int>              Image frames in prompt (default: 1)

Output:
  --telemetry-dir <path>                 Telemetry output directory
  --no-jsonl                             Disable JSONL output
  --verbose                              Enable DEBUG logging
  --interactive                          GUI step-by-step mode
  --config <path.json>                   Load config from JSON file