SchedulerMark is a fully open, unsupervised benchmark that explores how large language models reason about a complex natural-language scheduling problem, and how they critique each other's attempts.
Best viewed on desktop for the embedded model solutions.
This benchmark uses models served via OpenRouter.
Is this solution correct? Why or why not?
There is no ground-truth checker and no enforced rubric: the models generate the solutions and the critiques entirely on their own. The goal is simply to observe how they behave.
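As a rough illustration of that setup, the cross-critique loop might look something like the sketch below. This is not the repository's code: the function name `cross_critique`, the `ask` callable, and the prompt layout are all hypothetical, and the sketch assumes each model critiques every other model's solution but not its own. The critique question is the one shown on this page.

```python
from itertools import permutations
from typing import Callable

# The critique question shown on this page.
CRITIQUE_QUESTION = "Is this solution correct? Why or why not?"


def cross_critique(
    models: list[str],
    problem: str,
    ask: Callable[[str, str], str],
) -> tuple[dict[str, str], dict[tuple[str, str], str]]:
    """Collect one solution per model, then every critic/author critique.

    `ask(model, prompt)` is a hypothetical callable that returns the model's
    reply as text; how it reaches the model is left to the caller.
    """
    # Step 1: each model attempts the problem on its own.
    solutions = {model: ask(model, problem) for model in models}

    # Step 2: each model critiques every other model's attempt.
    # Assumption: models do not critique their own solutions.
    critiques: dict[tuple[str, str], str] = {}
    for critic, author in permutations(models, 2):
        prompt = (
            f"{problem}\n\n"
            f"Proposed solution:\n{solutions[author]}\n\n"
            f"{CRITIQUE_QUESTION}"
        )
        critiques[(critic, author)] = ask(critic, prompt)

    # No scoring step: nothing here checks solutions against a ground truth.
    return solutions, critiques
```

With three models this yields three solutions and six critiques, all stored as raw text for reading rather than grading. To run it against real models, `ask` would wrap whatever client is used to reach them.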
The code and data live in the public repository: schedulermark.com (MIT licensed).