SchedulerMark

SchedulerMark is a fully open, unsupervised benchmark exploring how large language models reason about a complex natural-language scheduling problem—and how they critique each other’s attempts.

Best viewed on desktop for the embedded model solutions.

This benchmark uses models served via OpenRouter.

  1. Each model receives the same long, messy natural-language scheduling request.
  2. Each model produces an HTML “solution” to the request.
  3. Every model then reviews every other model’s HTML and answers a single question:
    Is this solution correct? Why or why not?
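
The solve-then-judge loop above can be sketched as follows. This is a hypothetical illustration, not the benchmark's actual code: the model names, prompt strings, and the injected `call` helper (which would wrap a POST to OpenRouter's OpenAI-compatible chat-completions endpoint) are all assumptions.

```python
import itertools

def run_benchmark(models, solver_prompt, judge_template, call):
    """Run the solve-then-judge protocol.

    `call(model, prompt) -> str` is a caller-supplied helper; in practice it
    would POST to https://openrouter.ai/api/v1/chat/completions with an API
    key, but any callable with that signature works (e.g. a stub for tests).
    """
    # Steps 1-2: every model receives the same request and emits an HTML solution.
    solutions = {m: call(m, solver_prompt) for m in models}

    # Step 3: every model reviews every *other* model's HTML (no self-review)
    # and answers the single judging question.
    verdicts = {}
    for judge, solver in itertools.permutations(models, 2):
        question = judge_template.format(solution_html=solutions[solver])
        verdicts[(judge, solver)] = call(judge, question)

    return solutions, verdicts
```

With `n` models this yields `n` solutions and `n * (n - 1)` verdicts, one per ordered (judge, solver) pair.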

Task & Prompts

Original user request
Solver prompt (exact text sent to models)
Judge prompt template

Solutions

There is no ground-truth checker and no enforced rubric: the models generate the solutions and the critiques entirely on their own. The goal is simply to observe how they behave.

Source Code

The code and data live in the public repository: schedulermark.com (MIT licensed).