SchedulerMark

SchedulerMark is a fully open, unsupervised benchmark exploring how large language models reason about a complex natural-language scheduling problem—and how they critique each other’s attempts.

Best viewed on desktop for the embedded model solutions.

This benchmark uses models served via OpenRouter.

  1. Each model receives the same long, messy natural-language scheduling request.
  2. Each model produces an HTML “solution” to the request.
  3. Every model then reviews every other model’s HTML and answers a single question:
    Is this solution correct? Why or why not?
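
The solve-then-judge loop above can be sketched as follows. This is a hypothetical illustration, not the benchmark's actual code: the model names, prompt strings, and the injected `call` helper (which would wrap a POST to OpenRouter's OpenAI-compatible chat-completions endpoint) are all assumptions.

```python
import itertools

def run_benchmark(models, solver_prompt, judge_template, call):
    """Run the solve-then-judge protocol.

    `call(model, prompt) -> str` is a caller-supplied helper; in practice it
    would POST to https://openrouter.ai/api/v1/chat/completions with an API
    key, but any callable with that signature works (e.g. a stub for tests).
    """
    # Steps 1-2: every model receives the same request and emits an HTML solution.
    solutions = {m: call(m, solver_prompt) for m in models}

    # Step 3: every model reviews every *other* model's HTML (no self-review)
    # and answers the single judging question.
    verdicts = {}
    for judge, solver in itertools.permutations(models, 2):
        question = judge_template.format(solution_html=solutions[solver])
        verdicts[(judge, solver)] = call(judge, question)

    return solutions, verdicts
```

With `n` models this yields `n` solutions and `n * (n - 1)` verdicts, one per ordered (judge, solver) pair.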

Task & Prompts

Original user request
Solver prompt (exact text sent to models)
Judge prompt template

Solutions

There is no ground-truth checker and no enforced rubric: the models generate the solutions and the critiques entirely on their own. The goal is simply to observe how they behave.

Source Code

The code and data live in the public repository: schedulermark.com (MIT licensed).