LLM spatial reasoning evaluation through gradient scoring systems
Core project documentation and design specs
Benchmark prompts and evaluation specifications
Version history from v0.1.0 through v0.9.3
A specialized testing suite for evaluating Large Language Models (LLMs) on spatial reasoning tasks through gradient scoring systems.
# Install dependencies
pip install -r requirements.txt
# Run benchmark on LLM output file
python run_benchmark.py --input path_to_llm_output.txt
# Run benchmark against an OpenRouter model
python run_benchmark.py --model google/gemini-2.0-flash-exp:free --benchmark maze
# Run and save to leaderboard
python run_benchmark.py --input llm_output.txt --add-to-leaderboard
# View the leaderboard
python run_benchmark.py --leaderboard
# Run all models from models.txt
python run_benchmark.py --run-all
Set your API key as an environment variable:
# Windows
set OPENROUTER_API_KEY=your_key_here
# Linux/Mac
export OPENROUTER_API_KEY=your_key_here
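A script can check for this variable up front and fail with a clear message when it is missing. The following is a minimal sketch; the helper name `get_api_key` is hypothetical and is not taken from `openrouter.py`.

```python
import os

def get_api_key(env=None):
    """Return the OpenRouter API key from the environment, or raise a clear error.

    `env` defaults to os.environ; passing a dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    key = env.get("OPENROUTER_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENROUTER_API_KEY is not set; set it before running with --model"
        )
    return key
```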
The benchmark includes a leaderboard to track and compare model performance:
- Use --add-to-leaderboard when running benchmarks to record a result
- Use --leaderboard to display current standings
- Results are stored in leaderboard.json
Unlike traditional maze benchmarks that use binary pass/fail scoring, the Maze Gauntlet implements a Gradient Scoring system that evaluates how well LLMs can generate complex, solvable mazes with specific state-dependency rules.
Most maze benchmarks are binary (Pass/Fail). This is a Gradient Benchmark that rewards:
- Ambition: Grid size and complexity
- Logic: Proper S → K → D → E path progression
- Danger: Strategic trap placement adjacent to valid paths
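The Logic and Danger criteria can be illustrated with a breadth-first search over the grid. The sketch below is not the code in `evaluator.py`. It assumes a rectangular grid given as a list of strings, with `#` for walls and `T` for traps (both symbols are assumptions), and it treats the door `D` as impassable until the key `K` has been reached. It reports how many legs of S → K → D → E are reachable and how many traps touch the solution path.

```python
from collections import deque

NEIGHBOURS = ((1, 0), (-1, 0), (0, 1), (0, -1))

def find(grid, symbol):
    """Return (row, col) of the first occurrence of symbol, or None."""
    for r, row in enumerate(grid):
        c = row.find(symbol)
        if c != -1:
            return (r, c)
    return None

def shortest_path(grid, start, goal, blocked):
    """Breadth-first search; cells whose character is in `blocked` are impassable."""
    prev = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = prev[cell]
            return path[::-1]
        for dr, dc in NEIGHBOURS:
            nxt = (cell[0] + dr, cell[1] + dc)
            r, c = nxt
            if not (0 <= r < len(grid) and 0 <= c < len(grid[r])):
                continue
            if nxt in prev or grid[r][c] in blocked:
                continue
            prev[nxt] = cell
            queue.append(nxt)
    return None

def analyse(grid):
    """Return (legs_completed, traps_adjacent_to_path) for the S -> K -> D -> E route."""
    marks = [find(grid, s) for s in "SKDE"]
    if None in marks:
        return 0, 0
    # Assumption: the door blocks movement on the first leg, before the key is held.
    legs_blocked = ("#TD", "#T", "#T")
    path_cells, legs_done = set(), 0
    for start, goal, blocked in zip(marks, marks[1:], legs_blocked):
        leg = shortest_path(grid, start, goal, blocked)
        if leg is None:
            break
        path_cells.update(leg)
        legs_done += 1
    # Count only traps that are orthogonally adjacent to a cell on the path.
    near = 0
    for r, row in enumerate(grid):
        for c, ch in enumerate(row):
            if ch == "T" and any((r + dr, c + dc) in path_cells for dr, dc in NEIGHBOURS):
                near += 1
    return legs_done, near
```

Returning the number of completed legs rather than a single boolean is what allows partial credit: a maze whose key is reachable but whose exit is sealed still earns something.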
Ambition (Grid Size)
Rewards larger mazes, using logarithmic scaling to prevent runaway scores
Progress (Path Logic)
+50 bonus for End ('E')
Path Efficiency (New)
Rewards mazes that use the available space efficiently for the solution
Danger (Strategic Placement)
Only traps near the valid solution path count
Logic Penalties
Path must follow S → K → D → E sequence
Proximity Bonuses
No double-counting for reached objectives
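The logarithmic scaling of the size reward mentioned above can be sketched as follows. The function name, the base-2 logarithm, and the `weight` value are assumptions chosen for illustration; the actual constants live in `evaluator.py`.

```python
import math

def size_score(rows, cols, weight=10.0):
    """Score grid area on a log scale: each doubling of area adds `weight` points."""
    area = rows * cols
    if area <= 0:
        return 0.0
    return weight * math.log2(area)
```

Under this assumed formula an 8x8 grid scores 60 and a 16x16 grid scores 80, so quadrupling the area adds 20 points instead of quadrupling the score.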
Constraints
ai-benchmark/
├── README.md               # This file
├── CHANGELOG.md            # Version history
├── requirements.txt        # Dependencies
├── models.txt              # Available models for testing
├── run_benchmark.py        # CLI entry point
├── openrouter.py           # OpenRouter API client
├── leaderboard.py          # Leaderboard management
├── leaderboard.json        # Stored benchmark results
└── benchmarks/
    ├── __init__.py
    └── maze/
        ├── __init__.py
        ├── prompt.md       # The "Anti-Cheese" Prompt
        └── evaluator.py    # Gradient Scoring Logic
This repository is designed to be modular. To add new benchmarks:
- Create a directory for the benchmark under benchmarks/
- Add a prompt.md file with the benchmark prompt
- Update run_benchmark.py to include your new benchmark
# Run the Maze Gauntlet benchmark on a file
python run_benchmark.py --input sample_llm_output.txt
# Add benchmark results to leaderboard
python run_benchmark.py --input sample_llm_output.txt --add-to-leaderboard
# Get JSON output
python run_benchmark.py --input sample_llm_output.txt --json
# Test a specific model from models.txt
python run_benchmark.py --model google/gemini-2.0-flash-exp:free --benchmark maze
# Test a different model
python run_benchmark.py --model meta-llama/llama-3.3-70b-instruct:free --add-to-leaderboard
# Run all models from models.txt
python run_benchmark.py --run-all --sequential
The output will be a detailed JSON report showing your score breakdown.
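A saved report can be post-processed with a few lines of Python. The sketch below assumes only that the report is a JSON object with some numeric top-level fields. The function name `summarize` is hypothetical, and the field names in the usage note are made up for illustration, since the report schema is not documented here.

```python
import json

def summarize(report_text):
    """Return 'name: value' lines for numeric top-level fields, largest first."""
    report = json.loads(report_text)
    numeric = {
        name: value
        for name, value in report.items()
        # bool is a subclass of int in Python, so exclude it explicitly
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    }
    ranked = sorted(numeric.items(), key=lambda item: item[1], reverse=True)
    return [f"{name}: {value}" for name, value in ranked]
```

For example, `summarize('{"progress": 150, "size": 40, "valid": true}')` returns `["progress: 150", "size: 40"]`.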