Evals
17 active, 27 hidden
| Name | Description | Type | Created |
|---|---|---|---|
| Love or Evil | Just checks if our dumb models can even remotely answer the question; mostly skewed toward ajax's models. | rubric | 2026-02-24 |
| Simple JSON Render Eval | Evaluates whether a model can generate a simple, valid JSON structure on request. | semantic | 2026-02-20 |
| Who art thou? | Self-knowledge. | rubric | 2026-02-20 |
| Omega And Ajax Custom Eval | Evaluation inspired by the Omega and ajaxdavis collaboration, testing model understanding and name recognition. | semantic | 2026-02-20 |
| Absurd Crossover Joke: Submarines & Tax Law | Tests whether a model can generate genuinely funny original humor by writing a joke that connects two completely unrelated domains: nuclear submarines and tax law. Evaluates comedic timing, creative bridging between absurd topics, surprise/subversion of expectations, and whether the joke actually lands rather than just being a formulaic pun. | rubric | 2026-02-20 |
| Unreliable Narrator Deduction | Tests whether a model can detect internal contradictions in a first-person account told by a biased narrator, reconstruct what actually happened from the inconsistencies, and reason about the narrator's motivations for distorting the truth. Evaluates close reading, logical consistency checking, theory of mind, and the ability to distinguish stated claims from implied reality. | rubric | 2026-02-20 |
| Biblical Style Response Quality (3M Parm realistic) | Should be in biblical format. | rubric | 2026-02-19 |
| No words | Output contains no valid words. | rubric | 2026-02-19 |
| tell a joke | Tell a joke. | rubric | 2026-02-19 |
| Has consecutive english words | Has consecutive English words. | rubric | 2026-02-19 |
| andy-eval | What does this model know about Andy? | semantic | 2026-02-19 |
| Has more than one english word | Does it produce more than one English word? The words do not need to be sequential. | rubric | 2026-02-19 |
| toddler | Tries to rate toddler speech. | rubric | 2026-02-19 |
| In vaguely like Shakespeare writing style | Is the output in a vaguely Shakespeare-shaped writing style? | rubric | 2026-02-19 |
| entity test | sadasd | rubric | 2026-02-19 |
| most nonsensical response | Rates responses higher the more nonsensical they are. | rubric | 2026-02-19 |
| does it make anything? | | semantic | 2026-02-19 |