AI Benchmark Repository

LLM spatial reasoning evaluation through gradient scoring systems

Documentation (3)

Core project documentation and design specs

Benchmarks (1)

Benchmark prompts and evaluation specifications

Changelog (19)

Version history from v0.1.0 through v0.9.3

Project Description

A specialized testing suite for evaluating Large Language Models (LLMs) on spatial reasoning tasks through gradient scoring systems.

🚀 Quick Start

# Install dependencies
pip install -r requirements.txt

# Run benchmark on LLM output file
python run_benchmark.py --input path_to_llm_output.txt

# Run benchmark against an OpenRouter model
python run_benchmark.py --model google/gemini-2.0-flash-exp:free --benchmark maze

# Run and save to leaderboard
python run_benchmark.py --input llm_output.txt --add-to-leaderboard

# View the leaderboard
python run_benchmark.py --leaderboard

# Run all models from models.txt
python run_benchmark.py --run-all

OpenRouter Setup

Set your API key as an environment variable:

# Windows
set OPENROUTER_API_KEY=your_key_here

# Linux/Mac
export OPENROUTER_API_KEY=your_key_here
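
Before running a model benchmark, it can help to confirm that the variable is actually visible to Python. The following is a minimal sketch; the helper name `get_openrouter_key` is hypothetical and is not taken from openrouter.py.

```python
import os


def get_openrouter_key(env=None):
    """Return the OpenRouter API key, or raise a clear error if it is missing.

    `env` defaults to the process environment; passing a dict makes the
    helper easy to exercise without touching real environment variables.
    """
    env = os.environ if env is None else env
    key = env.get("OPENROUTER_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENROUTER_API_KEY is not set; export it before using --model."
        )
    return key
```

Calling `get_openrouter_key()` in a Python shell opened from the same terminal session should return the key you set above.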

๐Ÿ† Leaderboard System

The benchmark includes a leaderboard to track and compare model performance. Pass --add-to-leaderboard to record a run and --leaderboard to view the stored results.

🧪 Current Benchmarks

The "Maze Gauntlet" - LLM Spatial Reasoning Challenge

Unlike traditional maze benchmarks that use binary pass/fail scoring, the Maze Gauntlet implements a Gradient Scoring system that evaluates how well LLMs can generate complex, solvable mazes with specific state-dependency rules.

Philosophy

Most maze benchmarks are binary (Pass/Fail). This is a Gradient Benchmark that rewards:
- Ambition: Grid size and complexity
- Logic: Proper S → K → D → E path progression
- Danger: Strategic trap placement adjacent to valid paths
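
The S → K → D → E progression can be checked with three breadth-first searches, one per leg. The sketch below is illustrative and is not the code in evaluator.py. It assumes a rectangular grid given as a list of strings, with `#` for walls, `T` for traps (treated as impassable) and `.` for open tiles; only the `S`, `K`, `D` and `E` symbols come from this README. It also assumes the door cannot be crossed until the key has been collected.

```python
from collections import deque


def find(grid, symbol):
    """Return (row, col) of the first occurrence of symbol, or None."""
    for r, row in enumerate(grid):
        c = row.find(symbol)
        if c != -1:
            return (r, c)
    return None


def shortest_leg(grid, start, goal, blocked):
    """BFS step count from start to goal, or None if goal is unreachable."""
    rows, cols = len(grid), len(grid[0])
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and (nr, nc) not in seen
                    and grid[nr][nc] not in blocked):
                seen.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return None


def solve_sequence(grid):
    """Total steps for S -> K -> D -> E, or None if any leg fails."""
    points = [find(grid, symbol) for symbol in "SKDE"]
    if None in points:
        return None
    s, k, d, e = points
    legs = [
        shortest_leg(grid, s, k, blocked="#TD"),  # door is locked before the key
        shortest_leg(grid, k, d, blocked="#T"),   # key in hand, door is the goal
        shortest_leg(grid, d, e, blocked="#T"),
    ]
    if None in legs:
        return None
    return sum(legs)


maze = [
    "S.K",
    "##.",
    "E.D",
]
print(solve_sequence(maze))
```

Splitting the search into legs keeps the state dependency explicit: the only difference between the first leg and the others is whether `D` is in the blocked set.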

📊 Scoring System

The Maze Gauntlet Scoring Components:

  1. Ambition (Grid Size)
     - Points: 100 × log₂(Rows × Cols)
     - Rewards larger mazes but with logarithmic scaling to prevent exponential runaway

  2. Progress (Path Logic)
     - 2 points per reachable tile
     - +50 bonus for Key ('K')
     - +50 bonus for Door ('D')
     - +50 bonus for End ('E')

  3. Path Efficiency (New)
     - Points: (Shortest Valid Path Length / Grid Size) × 100
     - Rewards mazes that use the available space efficiently for the solution

  4. Danger (Strategic Placement)
     - Points: 20 × sqrt(Adjacent Traps)
     - Diminishing returns preventing "trap spamming"
     - Only traps near the valid solution path count

  5. Logic Penalties
     - If Traps > Walls: -50% score penalty
     - Path must follow S → K → D → E sequence

  6. Proximity Bonuses
     - Partial credit for unreachable objectives based on distance to reachable areas
     - No double-counting for reached objectives

  7. Constraints
     - Maximum maze size: 64×64 (mazes exceeding this limit score 0)
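
To make the arithmetic concrete, here is a rough sketch of how the listed components might combine. It is not the code in evaluator.py and makes several assumptions: the components are simply summed, "Grid Size" means Rows × Cols, the Traps > Walls penalty halves the total, and the proximity bonus is left out because no formula is given for it. All function and parameter names are hypothetical.

```python
import math

MAX_SIDE = 64  # from the Constraints rule: larger mazes score 0


def ambition_score(rows, cols):
    """100 x log2(rows x cols)."""
    return 100 * math.log2(rows * cols)


def progress_score(reachable_tiles, got_key, got_door, got_end):
    """2 points per reachable tile plus 50 per objective reached."""
    return 2 * reachable_tiles + 50 * (int(got_key) + int(got_door) + int(got_end))


def efficiency_score(path_length, rows, cols):
    """(path length / grid size) x 100; grid size assumed to be rows x cols."""
    return path_length / (rows * cols) * 100


def danger_score(adjacent_traps):
    """20 x sqrt(traps adjacent to the solution path)."""
    return 20 * math.sqrt(adjacent_traps)


def total_score(rows, cols, reachable_tiles, got_key, got_door, got_end,
                path_length, adjacent_traps, traps, walls):
    """Sum the components, then apply the size limit and trap penalty."""
    if rows > MAX_SIDE or cols > MAX_SIDE:
        return 0.0
    score = (
        ambition_score(rows, cols)
        + progress_score(reachable_tiles, got_key, got_door, got_end)
        + efficiency_score(path_length, rows, cols)
        + danger_score(adjacent_traps)
    )
    if traps > walls:
        score *= 0.5  # assumed reading of the "-50% score penalty"
    return score


# An 8x8 maze: 30 reachable tiles, all objectives reached,
# a 16-step solution, 4 traps next to the path, 5 traps and 20 walls overall.
print(total_score(8, 8, 30, True, True, True, 16, 4, traps=5, walls=20))
```

Under these assumptions the example maze earns 600 for ambition, 210 for progress, 25 for efficiency and 40 for danger.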

🔧 Architecture

ai-benchmark/
├── README.md               # This file
├── CHANGELOG.md            # Version history
├── requirements.txt        # Dependencies
├── models.txt              # Available models for testing
├── run_benchmark.py        # CLI Entry point
├── openrouter.py           # OpenRouter API client
├── leaderboard.py          # Leaderboard management
├── leaderboard.json        # Stored benchmark results
└── benchmarks/
    ├── __init__.py
    └── maze/
        ├── __init__.py
        ├── prompt.md       # The "Anti-Cheese" Prompt
        └── evaluator.py    # Gradient Scoring Logic

🎯 Adding New Benchmarks

This repository is designed to be modular. To add new benchmarks:

  1. Create a new directory under benchmarks/
  2. Implement an evaluator function
  3. Add a prompt.md file with the benchmark prompt
  4. Update the CLI in run_benchmark.py to include your new benchmark
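
As a starting point for step 2, an evaluator could be a single function that takes the raw model output and returns a score breakdown. The skeleton below is only a sketch. The `evaluate` name, its signature, the return shape and the `benchmarks/example/` path are assumptions rather than the interface used by the existing maze evaluator, and the scoring rule is a placeholder.

```python
# benchmarks/example/evaluator.py  (hypothetical path)


def evaluate(llm_output: str) -> dict:
    """Score raw model output and return a total plus a breakdown.

    Placeholder rule: one point per non-empty line, but only if every
    non-empty line has the same length (i.e. the output forms a grid).
    """
    lines = [line for line in llm_output.splitlines() if line.strip()]
    rectangular = len({len(line) for line in lines}) <= 1
    breakdown = {
        "non_empty_lines": len(lines),
        "rectangular": rectangular,
    }
    total = float(len(lines)) if rectangular else 0.0
    return {"total": total, "breakdown": breakdown}


if __name__ == "__main__":
    print(evaluate("ab\ncd\n"))
```

Returning a plain dict keeps the result easy to serialize, which fits the JSON report and leaderboard.json storage described in this README.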

๐Ÿ“ Example Usage

# Run the Maze Gauntlet benchmark on a file
python run_benchmark.py --input sample_llm_output.txt

# Add benchmark results to leaderboard
python run_benchmark.py --input sample_llm_output.txt --add-to-leaderboard

# Get JSON output
python run_benchmark.py --input sample_llm_output.txt --json

# Test a specific model from models.txt
python run_benchmark.py --model google/gemini-2.0-flash-exp:free --benchmark maze

# Test a different model
python run_benchmark.py --model meta-llama/llama-3.3-70b-instruct:free --add-to-leaderboard

# Run all models from models.txt
python run_benchmark.py --run-all --sequential

The output will be a detailed JSON report showing your score breakdown.
