Evals

17 active, 27 hidden

Name	Type	Prompt	Created
Love or Evil Just checks if our dumb models can even remotely answer the question, mostly skewed for ajaxs models	rubric	<\|user\|> What is better love or evil? <\|assistant\|>	2026-02-24
Simple JSON Render Eval Evaluates if a model can generate a simple, valid JSON structure on request.	semantic	Generate a valid JSON object with name and age fields.	2026-02-20
Who art thou? Self knowledge	rubric	Who art thou?	2026-02-20
Omega And Ajax Custom Eval Evaluation inspired by Omega and ajaxdavis collaboration, testing model understanding and name recognition.	semantic	Write a short paragraph mentioning Omega and ajaxdavis and describing their collaboration.	2026-02-20
Absurd Crossover Joke: Submarines & Tax Law Tests whether a model can generate genuinely funny original humor by writing a joke that connects two completely unrelated domains — nuclear submarines and tax law. Evaluates comedic timing, creative bridging between absurd topics, surprise/subversion of expectations, and whether the joke actually lands vs just being a formulaic pun.	rubric	Write me one joke that is somehow about both nuclear submarines and tax law. Just the joke, no explanation.	2026-02-20
Unreliable Narrator Deduction Tests whether a model can detect internal contradictions in a first-person account told by a biased narrator, reconstruct what actually happened from the inconsistencies, and reason about the narrator's motivations for distorting the truth. Evaluates close reading, logical consistency checking, theory of mind, and the ability to distinguish stated claims from implied reality.	rubric	Read the following account carefully, then answer the three questions below. --- Statement of Marcus Cole, building manager, regarding the incident on the night of January 14th: "I was in my office on the ground floor doing paperwork the entire evening, from 6 PM until the fire alarm went off at 11:47 PM. I never left my desk. The first I knew of any trouble was when the alarm sounded and I immediately ran upstairs to the 4th floor where the smoke was coming from. When I arrived at apartment 4B, the door was already open. I saw heavy smoke coming from the kitchen. I used the extinguisher from the 4th floor hallway to put out the fire. It was a grease fire on the stove — someone had left a pan of oil heating. The apartment belongs to Mrs. Chen, but she's been away visiting family in Vancouver since January 10th. No one should have been in there. I called the fire department at 11:52 PM. They arrived and confirmed the fire was out. The damage was limited to the kitchen. The fire inspector noted the stove had been turned on manually — it's a gas stove with physical knobs, not a smart appliance, so it couldn't have turned on by itself. I have no idea how this happened. The only master key to the apartments is kept in a lockbox in my office, and only I have the combination. I checked the lockbox after the incident and the master key was in its proper place. Mrs. Chen's apartment showed no signs of forced entry. The building's front door requires a key fob, and the entry log shows no visitors entered the building after 5:30 PM that evening. My best guess is that Mrs. Chen must have left the stove on before she departed on January 10th, and it somehow only caught fire four days later." --- Question 1: Identify at least four internal contradictions or implausible claims in Marcus's account. Question 2: Based on the contradictions, construct the most likely version of what actually happened that night. Question 3: Why is Marcus likely distorting his account? What is he trying to conceal, and what details in his own statement betray him?	2026-02-20
Biblical Style Response Quality (3M Parm realistic) Should be in biblical format	rubric	God hath wrought much suffering	2026-02-19
No words Output contains no valid words	rubric	Gibberish only.	2026-02-19
tell a joke tell a joke	rubric	tell a joke	2026-02-19
Has consecutive english words Has consecutive english words	rubric	Speak words in English	2026-02-19
andy-eval what does this model know about andy	semantic	Tell me something about andy	2026-02-19
Has more than one english word does it produce more than one english word, does not need to be sequential	rubric	What are some English words?	2026-02-19
toddler tries to rate toddler speach	rubric	who are you?	2026-02-19
In vaguely like Shakespeare writing style Is the output in vaguely Shakespeare shaped writing style	rubric	To be or not to be	2026-02-19
entity test sadasd	rubric	the dog was fat	2026-02-19
most nonsensical response rates higher based on nonsense	rubric	hello i am god, to be or not to be	2026-02-19
does it make anything?	semantic	Hello!	2026-02-19