BALROG Worker

BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games (Paglieri et al., 2024).

The BALROG worker is MOSAIC’s LLM/VLM agentic evaluation integration. It wraps BALROG <https://github.com/balrog-ai/BALROG>, a benchmark framework for evaluating large language models and vision-language models as agents on complex, long-horizon interactive tasks, behind the standard MOSAIC shim pattern.

Paradigm	Single-agent, LLM/VLM evaluation
Task Type	Long-horizon interactive decision-making
Model Support	API clients (OpenAI, Anthropic, Google Gemini) and local inference (vLLM)
Environments	NetHack, MiniHack, BabyAI, Crafter, TextWorld, MiniGrid
Execution	Subprocess (parallel workers)
GPU required	No (API-based) / Optional (vLLM local inference)
Source	`3rd_party/balrog_worker/balrog_worker/`
Upstream	github.com/balrog-ai/BALROG
Paper	arXiv:2411.13543

Overview

BALROG benchmarks agentic LLM and VLM reasoning on games drawn from reinforcement learning: environments that involve long sequences of decisions, partial observability, and adaptive behaviour. Unlike standard RL workers that train neural policies, the BALROG worker drives pre-trained language models through game environments and records their performance on the BALROG benchmark suite.

Key features:

Dual support for text-only LLMs and vision-language models (VLMs)
Pluggable API backends — OpenAI, Anthropic Claude, Google Gemini, or any OpenAI-compatible endpoint (vLLM)
Configurable history windows and interaction modes
Parallel evaluation across multiple workers
JSONL telemetry streamed back to the MOSAIC Trainer Daemon
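Two of the features above, a bounded history window and JSONL telemetry, can be illustrated with a minimal sketch. It is not the worker's actual implementation: the function name `run_episode`, the record fields (`step`, `observation`, `history_len`), and the in-memory sink are all assumptions made for illustration.

```python
import io
import json
from collections import deque


def run_episode(observations, history_window, sink):
    """Keep a bounded history and write one JSON line per step (hypothetical schema)."""
    # deque(maxlen=...) silently drops the oldest entry once the window is full.
    history = deque(maxlen=history_window)
    for step, obs in enumerate(observations):
        history.append(obs)
        record = {"step": step, "observation": obs, "history_len": len(history)}
        # JSONL: exactly one JSON object per line.
        sink.write(json.dumps(record) + "\n")
    return list(history)


# An in-memory buffer stands in for whatever stream the real worker writes to.
sink = io.StringIO()
final_history = run_episode(
    ["obs0", "obs1", "obs2", "obs3", "obs4", "obs5"], history_window=4, sink=sink
)
lines = sink.getvalue().splitlines()
```

With a window of 4 and six observations, only the last four observations remain in `final_history`, while the sink holds one line for each of the six steps.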

Architecture

The diagram below shows the BALROG evaluation pipeline from the original paper:


BALROG evaluation pipeline (Paglieri et al., 2024): env_wrapper.py, client.py, evaluator.py, and agent.py collaborate to drive LLM/VLM agents through game environments.

The BALROG worker follows the standard MOSAIC shim pattern.

Supported Models

Backend	Models	Notes
OpenAI API	GPT-4o, GPT-4-turbo, GPT-3.5-turbo	Requires `OPENAI_API_KEY`
Anthropic API	Claude 3 Opus/Sonnet/Haiku	Requires `ANTHROPIC_API_KEY`
Google Gemini	Gemini 1.5 Pro/Flash	Requires `GOOGLE_API_KEY`
vLLM (local)	Any HuggingFace-compatible model	Self-hosted inference server
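The credential requirements in the table can be checked before launching a run. The sketch below is only an illustration: the backend identifiers `openai`, `gemini`, and `vllm` and the helper `missing_credential` are hypothetical, and it assumes a self-hosted vLLM server needs no API key, since the table lists none.

```python
import os

# Environment variable required by each backend, taken from the table above.
# Backend identifiers are assumed; None means no key is listed for that backend.
REQUIRED_ENV = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "gemini": "GOOGLE_API_KEY",
    "vllm": None,
}


def missing_credential(backend, env=None):
    """Return the name of the unset variable for `backend`, or None if nothing is missing."""
    env = os.environ if env is None else env
    if backend not in REQUIRED_ENV:
        raise ValueError(f"unknown backend: {backend!r}")
    var = REQUIRED_ENV[backend]
    if var is None or env.get(var):
        return None
    return var
```

Calling `missing_credential("anthropic")` before starting an evaluation returns `"ANTHROPIC_API_KEY"` when that variable is unset or empty, so the problem surfaces before any episodes run.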

Installation

# BALROG worker
pip install -e ".[balrog]"

# BALROG worker plus the vllm extra (local inference)
pip install -e ".[balrog,vllm]"

Configuration

The BALROG worker is configured via the MOSAIC GUI training form or directly via JSON:

{
  "worker": "balrog",
  "model": "claude-3-5-sonnet-20241022",
  "backend": "anthropic",
  "environment": "MiniHack-River-v0",
  "num_episodes": 100,
  "max_steps": 1000,
  "history_window": 4,
  "parallel_workers": 4
}
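A JSON configuration like the one above could be parsed and sanity-checked before being handed to the worker. This is a sketch under stated assumptions, not MOSAIC's own loader: the `BalrogConfig` dataclass and `load_config` helper are hypothetical, and treating every key as required and every numeric field as a non-negative integer is a guess.

```python
import json
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class BalrogConfig:
    """Hypothetical typed view of the JSON keys shown above."""
    worker: str
    model: str
    backend: str
    environment: str
    num_episodes: int
    max_steps: int
    history_window: int
    parallel_workers: int


def load_config(text):
    """Parse JSON text and reject missing keys, unknown keys, and bad numeric values."""
    raw = json.loads(text)
    keys = set(raw)
    expected = {f.name for f in fields(BalrogConfig)}
    missing = sorted(expected - keys)
    unknown = sorted(keys - expected)
    if missing or unknown:
        raise ValueError(f"missing keys: {missing}; unknown keys: {unknown}")
    cfg = BalrogConfig(**raw)
    for name in ("num_episodes", "max_steps", "history_window", "parallel_workers"):
        value = getattr(cfg, name)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            raise ValueError(f"{name} must be a non-negative integer, got {value!r}")
    return cfg


EXAMPLE = """{
  "worker": "balrog",
  "model": "claude-3-5-sonnet-20241022",
  "backend": "anthropic",
  "environment": "MiniHack-River-v0",
  "num_episodes": 100,
  "max_steps": 1000,
  "history_window": 4,
  "parallel_workers": 4
}"""
cfg = load_config(EXAMPLE)
```

Rejecting unknown keys catches typos such as `history_windw` early, rather than letting a run proceed with a default the user did not intend.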

References

GitHub: github.com/balrog-ai/BALROG
Paper: BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games
Leaderboard: balrogai.com