Evals

17 active, 27 hidden

Name Type Created
Love or Evil

Just checks whether our dumb models can even remotely answer the question; mostly skewed toward ajax's models

rubric 2026-02-24
Simple JSON Render Eval

Evaluates if a model can generate a simple, valid JSON structure on request.

semantic 2026-02-20
Who art thou?

Self knowledge

rubric 2026-02-20
Omega And Ajax Custom Eval

Evaluation inspired by Omega and ajaxdavis collaboration, testing model understanding and name recognition.

semantic 2026-02-20
Absurd Crossover Joke: Submarines & Tax Law

Tests whether a model can generate genuinely funny original humor by writing a joke that connects two completely unrelated domains — nuclear submarines and tax law. Evaluates comedic timing, creative bridging between absurd topics, surprise/subversion of expectations, and whether the joke actually lands vs just being a formulaic pun.

rubric 2026-02-20
Unreliable Narrator Deduction

Tests whether a model can detect internal contradictions in a first-person account told by a biased narrator, reconstruct what actually happened from the inconsistencies, and reason about the narrator's motivations for distorting the truth. Evaluates close reading, logical consistency checking, theory of mind, and the ability to distinguish stated claims from implied reality.

rubric 2026-02-20
Biblical Style Response Quality (3M Parm realistic)

Output should be in a biblical style

rubric 2026-02-19
No words

Output contains no valid words

rubric 2026-02-19
tell a joke

tell a joke

rubric 2026-02-19
Has consecutive english words

Has consecutive english words

rubric 2026-02-19
andy-eval

what does this model know about andy

semantic 2026-02-19
Has more than one english word

Does it produce more than one English word? The words do not need to be consecutive.

rubric 2026-02-19
toddler

tries to rate toddler speech

rubric 2026-02-19
In vaguely like Shakespeare writing style

Is the output in a vaguely Shakespeare-shaped writing style?

rubric 2026-02-19
entity test

sadasd

rubric 2026-02-19
most nonsensical response

rates a response higher the more nonsensical it is

rubric 2026-02-19
does it make anything?

semantic 2026-02-19