Benchmark report

Judgment is the benchmark.
The score just reveals it.

A condensed read on where ML-style validation and modeling outperform generic agent heuristics.


6 benchmark games: casual surfaces, hidden ML structure.
100% ML win rate: modeling beat heuristics in every game.
5 core methods: validation, planning, graph learning, causality, backtesting.
1 benchmark signal: the decisive edge is knowing what to trust.

Benchmark summary

Score table

Final outcomes, condensed.

Game	Metric	Generic	ML	Winner	ML advantage
Across all six games, the ML agent wins by identifying hidden structure before acting on it.
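The headline 100% win rate is a roll-up of the Winner column in this table. Below is a minimal Python sketch of that roll-up. The `ScoreRow` schema and the helper names are hypothetical. The sketch also assumes each metric has a known direction, since some metrics are better when lower. The report does not say how "ML advantage" is computed, so here it is assumed to be the signed difference in the metric's own units.

```python
from dataclasses import dataclass


@dataclass
class ScoreRow:
    """One row of the score table (hypothetical schema mirroring its columns)."""
    game: str
    metric: str
    generic: float
    ml: float
    higher_is_better: bool = True  # assumption: each metric's direction is known


def winner(row: ScoreRow) -> str:
    """Return 'ML', 'Generic', or 'Tie' for a single game."""
    if row.ml == row.generic:
        return "Tie"
    if row.higher_is_better:
        ml_better = row.ml > row.generic
    else:
        ml_better = row.ml < row.generic
    return "ML" if ml_better else "Generic"


def ml_advantage(row: ScoreRow) -> float:
    """Signed margin in the metric's units; positive means ML did better (assumed definition)."""
    diff = row.ml - row.generic
    return diff if row.higher_is_better else -diff


def ml_win_rate(rows: list[ScoreRow]) -> float:
    """Fraction of games the ML agent won outright; ties count as non-wins."""
    if not rows:
        raise ValueError("score table is empty")
    wins = sum(1 for r in rows if winner(r) == "ML")
    return wins / len(rows)
```

Under this sketch, a 100% ML win rate across six games simply means `winner()` returns `"ML"` for all six rows. Flipping the comparison for lower-is-better metrics lets a single table mix scores and error rates without hand-editing the Winner column.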

Detailed benchmark

Each entry shows the hidden problem, both approaches, and the replay.

Exact logs and reconstructed turn-by-turn series are preserved from the source report.