BALROG Worker¶
BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games (Paglieri et al., 2024).¶
The BALROG worker is MOSAIC’s LLM/VLM agentic evaluation integration. It wraps BALROG <https://github.com/balrog-ai/BALROG>, A benchmark framework for evaluating large language models and vision-language models as agents on complex, long-horizon interactive tasks behind the standard shim pattern.
Paradigm |
Single-agent, LLM/VLM evaluation |
Task Type |
Long-horizon interactive decision-making |
Model Support |
API clients (OpenAI, Anthropic, Google Gemini) and local inference (vLLM) |
Environments |
NetHack, MiniHack, BabyAI, Crafter, TextWorld, MiniGrid |
Execution |
Subprocess (parallel workers) |
GPU required |
No (API-based) / Optional (vLLM local inference) |
Source |
|
Upstream |
|
Paper |
Overview¶
BALROG benchmarks agentic LLM and VLM reasoning on reinforcement learning games, environments that demand long sequences of decisions, partial observability, and adaptive behaviour. Unlike standard RL workers that train neural policies, the BALROG worker drives pre-trained language models through game environments and records performance on the BALROG benchmark suite.
Key features:
Dual support for text-only LLMs and vision-language models (VLMs)
Pluggable API backends — OpenAI, Anthropic Claude, Google Gemini, or any OpenAI-compatible endpoint (vLLM)
Configurable history windows and interaction modes
Parallel evaluation across multiple workers
JSONL telemetry streamed back to the MOSAIC Trainer Daemon
Architecture¶
The diagram below shows the BALROG evaluation pipeline from the original paper:
BALROG evaluation pipeline (Paglieri et al., 2024): env_wrapper.py, client.py, evaluator.py, and agent.py collaborate to drive LLM/VLM agents through game environments.
The BALROG worker follows the standard MOSAIC shim pattern.
Supported Models¶
Backend |
Models |
Notes |
|---|---|---|
OpenAI API |
GPT-4o, GPT-4-turbo, GPT-3.5-turbo |
Requires |
Anthropic API |
Claude 3 Opus/Sonnet/Haiku |
Requires |
Google Gemini |
Gemini 1.5 Pro/Flash |
Requires |
vLLM (local) |
Any HuggingFace-compatible model |
Self-hosted inference server |
Installation¶
pip install -e ".[balrog]"
pip install -e ".[balrog,vllm]"
Configuration¶
The BALROG worker is configured via the MOSAIC GUI training form or directly via JSON:
{
"worker": "balrog",
"model": "claude-3-5-sonnet-20241022",
"backend": "anthropic",
"environment": "MiniHack-River-v0",
"num_episodes": 100,
"max_steps": 1000,
"history_window": 4,
"parallel_workers": 4
}
References¶
GitHub: github.com/balrog-ai/BALROG
Paper: BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games
Leaderboard: balrogai.com