MOSAIC VLM Worker¶
The MOSAIC VLM Worker is MOSAIC’s native Vision-Language Model worker for evaluating multimodal agents in RL environments. It extends the LLM Worker with image observation support, enabling VLM models to perceive raw RGB frames from environments like Crafter, BabyAI, and MultiGrid alongside structured text descriptions.
The VLM Worker shares the same architecture, agent strategies, and coordination levels as the LLM Worker. The key difference is that it can include image history in prompts, allowing vision-capable models (GPT-4o, Claude 3, Gemini) to reason directly over visual observations rather than relying solely on text descriptions generated from grid arrays.
Paradigm |
Multi-agent VLM coordination and adversarial (also single-agent) |
Task Type |
VLM evaluation with image observations, multimodal reasoning, cooperative teams, adversarial opponents |
Model Support |
OpenRouter (unified), OpenAI, Anthropic, Google Gemini, vLLM (local) |
Environments |
MultiGrid (Soccer 1v1/2v2, Collect), BabyAI, MiniGrid, MiniHack, Crafter, TextWorld, BabaIsAI, PettingZoo |
Execution |
Subprocess (autonomous or interactive step-by-step) |
GPU required |
No (API-based) / Optional (vLLM local inference) |
Source |
|
Entry point |
|
Overview¶
The VLM Worker converts environment frames into multimodal prompts that combine text descriptions with RGB images. This enables two research directions beyond what the text-only LLM Worker provides:
Visual grounding: Can VLMs identify objects, navigate mazes, or coordinate with teammates using pixel observations instead of symbolic text?
Multimodal vs text-only: Does adding image context improve LLM performance in grid-world environments, or is text sufficient?
Key features:
Image observation support: configurable image history depth (
max_image_history ≥ 1for VLM mode,0for text-only fallback)Same agent strategies as LLM Worker: naive, chain-of-thought, robust variants, few-shot, dummy
Same 3 coordination levels: emergent, basic hints, role-based
Pluggable API backends: OpenRouter, OpenAI, Anthropic, Google Gemini, vLLM
Dual runtime modes: autonomous (batch episodes) or interactive (GUI step-by-step)
JSONL telemetry: streamed to GUI and written to disk
Architecture¶
The VLM Worker follows the same shim pattern as the LLM Worker, with an additional image encoding step in the observation pipeline:
graph TB
subgraph "MOSAIC GUI"
FORM["Operator Config<br/>(per-player model)"]
DAEMON["Operator Launcher"]
end
subgraph "VLM Worker Subprocess"
CLI["cli.py<br/>(vlm-worker)"]
CFG["config.py<br/>(VLMWorkerConfig)"]
RT["runtime.py<br/>(VLMWorkerRuntime /<br/>InteractiveVLMRuntime)"]
OBS["observations.py<br/>(grid → text + image)"]
PROMPT["prompts.py<br/>(3 coordination levels)"]
CLIENT["client.py<br/>(OpenAI / Claude / Gemini)"]
end
subgraph "VLM API"
API["OpenRouter / OpenAI<br/>Anthropic / Gemini / vLLM"]
end
FORM -->|"config JSON"| DAEMON
DAEMON -->|"spawn"| CLI
CLI --> CFG --> RT
RT --> OBS
RT --> PROMPT
RT --> CLIENT
CLIENT -->|"chat.completions<br/>(text + images)"| API
style FORM fill:#4a90d9,stroke:#2e5a87,color:#fff
style DAEMON fill:#50c878,stroke:#2e8b57,color:#fff
style CLI fill:#9370db,stroke:#6a0dad,color:#fff
style CFG fill:#9370db,stroke:#6a0dad,color:#fff
style RT fill:#9370db,stroke:#6a0dad,color:#fff
style OBS fill:#dda0dd,stroke:#993399,color:#333
style PROMPT fill:#dda0dd,stroke:#993399,color:#333
style CLIENT fill:#9370db,stroke:#6a0dad,color:#fff
style API fill:#e8e8e8,stroke:#999
VLM vs LLM Worker¶
Aspect |
VLM Worker |
LLM Worker |
|---|---|---|
Observations |
Text + RGB images (multimodal) |
Text only |
Image history |
|
N/A |
Use case |
Visual grounding, multimodal reasoning |
Text-based reasoning, Theory of Mind |
CLI command |
|
|
Config class |
|
|
Runtime classes |
|
|
Both workers share the same directory structure (agents/,
environments/, config/, prompt_builder/), agent strategies, and
coordination levels.
Agent Strategies¶
Type |
Description |
|---|---|
|
Direct observation-to-action mapping with image context. |
|
Chain-of-thought reasoning over text and image observations. |
|
Naive with retry and fallback on parse failure. |
|
Chain-of-thought with retry and fallback. |
|
In-context learning with example trajectories. |
|
Random actions for baseline comparison (ignores images). |
Supported Environments¶
All environments supported by the LLM Worker are also supported by the VLM Worker. Environments that provide RGB render frames benefit most from VLM mode:
Environment |
RGB Support |
Notes |
|---|---|---|
MultiGrid (Soccer, Collect) |
✅ |
Full grid rendering with agent colors and ball positions |
BabyAI / MiniGrid |
✅ |
Partial observability grid renders |
Crafter |
✅ |
Rich survival environment with diverse visual elements |
MiniHack / NLE |
✅ |
Roguelike tile-based rendering |
TextWorld |
❌ |
Text-only (falls back to LLM-style prompts) |
PettingZoo |
✅ |
Board game renders (Chess, Connect Four, Go) |
Runtime Modes¶
Autonomous mode (batch episodes with image observations):
vlm-worker --run-id test123 \
--env crafter \
--client openrouter \
--model openai/gpt-4o-mini \
--max-image-history 1 \
--num-episodes 10 --max-steps 200
Text-only fallback (equivalent to LLM Worker):
vlm-worker --run-id test123 \
--env minihack \
--max-image-history 0 \
--num-episodes 5
Interactive mode (GUI step-by-step):
vlm-worker --run-id test123 --interactive \
--env multigrid \
--task MosaicMultiGrid-Soccer-2vs2-IndAgObs-v0
Interactive mode reads JSON commands from stdin and emits telemetry to stdout, identical to the LLM Worker protocol.
Configuration¶
JSON config (launched by GUI or CLI):
{
"run_id": "vlm_crafter_001",
"env_name": "crafter",
"task": "CrafterReward-v1",
"client_name": "openrouter",
"model_id": "openai/gpt-4o-mini",
"agent_type": "cot",
"max_image_history": 1,
"num_episodes": 5,
"max_steps": 200,
"temperature": 0.7
}
VLM-specific config fields:
Field |
Default |
Description |
|---|---|---|
|
|
Number of past image frames to include in prompts.
|
|
varies |
Maximum text history entries alongside images |
|
|
|
All other fields (client_name, model_id, agent_type,
coordination_level, observation_mode, etc.) are identical to the
LLM Worker.
CLI Reference¶
vlm-worker --run-id <id> [options]
Environment:
--env {babyai,minihack,crafter,...} Environment family (default: babyai)
--task <name> Gymnasium environment ID
--max-steps <int> Max steps per episode (default: 100)
--num-episodes <int> Episodes to run (default: 5)
--seed <int> Random seed
--render-mode {rgb_array,human} Render mode for image capture
VLM Client:
--client {openrouter,openai,...} API backend (default: openrouter)
--model <model_id> Model identifier
--api-key <key> API key (or use env vars)
--base-url <url> Custom endpoint (for vLLM)
--temperature <float> Sampling temperature (default: 0.7)
--timeout <float> Request timeout (default: 60)
Agent:
--agent-type {naive,cot,...} Agent strategy (default: naive)
--max-image-history <int> Image frames in prompt (default: 1)
Output:
--telemetry-dir <path> Telemetry output directory
--no-jsonl Disable JSONL output
--verbose Enable DEBUG logging
--interactive GUI step-by-step mode
--config <path.json> Load config from JSON file