What makes a benchmark actually hard

Everybody is spending billions on hard benchmarks, and almost none of them are actually hard. The question of what makes a benchmark hard is one of the most important questions in evals, and almost nobody is working on it. Difficulty is ambiguous. It’s not understood. People keep commissioning harder tasks and they keep getting saturated faster than they come online.

Start with where the easy gains came from.

The value overhang. Synthetic tasks do improve performance. But I think that’s only because pretraining left a value overhang. You grab the same data, reformulate it synthetically, and get more bang for your buck. Take SWE-bench. Models were trained on all those tokens, plausibly in the exact sequence. Then you rewind a bug and ask the model to fix it. So you could go to every public repo, grab every PR, rewind it, turn that into a task, and train.

You can get cleverer than that. On another project we wanted to check how security-aware models are. So we find a security bug that was introduced by a PR, where the PR wasn’t about security, it was a feature. There are two PRs: one that adds the feature and quietly introduces the bug, and a later one that fixes the bug. You rewind both, ask the AI to build the same feature again, and see if it reintroduces the same flaw. It does. Now you’re finding a relationship on the timeline between two PRs and turning that into a task. You can keep climbing: here’s the SQLite test bench, rewrite SQLite in Lua, and because I have the test bench I can verify whether the rewrite works.
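The two-PR pairing can be sketched roughly like this. It's a minimal sketch, not the pipeline we actually ran: the `PullRequest` fields, the `kind` labels, and the `fixes_bug_from` link are all hypothetical, and in practice producing that link is the hard part.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PullRequest:
    number: int
    merged_order: int                      # position on the repo timeline
    kind: str                              # hypothetical label: "feature" or "security_fix"
    fixes_bug_from: Optional[int] = None   # hypothetical link to the PR that introduced the bug


@dataclass(frozen=True)
class RegressionTask:
    feature_pr: int   # rewind the repo to just before this PR and ask for the feature again
    fix_pr: int       # the later fix tells you which flaw to look for in the new attempt


def mine_security_tasks(prs: List[PullRequest]) -> List[RegressionTask]:
    """Pair each security fix with the earlier feature PR that introduced the bug."""
    by_number = {pr.number: pr for pr in prs}
    tasks = []
    for pr in prs:
        if pr.kind != "security_fix" or pr.fixes_bug_from is None:
            continue
        origin = by_number.get(pr.fixes_bug_from)
        # Keep only bugs that came in through a feature PR merged before the fix.
        if origin is None or origin.kind != "feature":
            continue
        if origin.merged_order >= pr.merged_order:
            continue
        tasks.append(RegressionTask(feature_pr=origin.number, fix_pr=pr.number))
    return tasks
```

The point of the structure is that the task is defined by a relationship between two points on the timeline, not by either PR alone.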

But it’s all the same data. You’re still anchoring to real-world data. Someone spent a lot of time writing that verifier, or doing all those PRs. As we move toward fully AI-generated code, where everything is synthesized, we mine all the variants of how to reorganize that data in a way that teaches the model something. And then you’re done. You’ve finished exploiting the data.

The saturation you can’t see. The worst part is you can’t tell when it’s happened. GPT-5.X gets 83% on Terminal Bench 2 and you have a gut feeling that it’s a bit memorized, but you can’t know for sure until you build a new benchmark of similar tasks and watch the model fail on those while still generalizing on the rest. So you have to constantly produce benchmarks, and probably hold them private. And more and more, they have to interface with the physical world.

Reverse GEPA. Here’s how most “hard” tasks actually get made. You outsource to contractors. They talk to a physicist, the physicist gives them an idea about how to measure a black hole, they write the task. The task seems hard. But it’s hard because they did a reverse GEPA. Start from the prompt where the model gets a perfect answer, then remove things, or throw in distractors, until the model starts failing. Then they call it a hard task.
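As a rough sketch, the loop looks something like the code below. Everything here is illustrative: `solves` stands in for "run the model and grade it", and the hint and distractor lists are hypothetical. It is not any contractor's actual tooling.

```python
from typing import Callable, List, Sequence, Tuple


def degrade_until_failure(
    question: str,
    hints: Sequence[str],
    distractors: Sequence[str],
    solves: Callable[[List[str]], bool],
) -> Tuple[List[str], bool]:
    """Start from a prompt the model solves, then degrade it until it fails.

    Returns the final prompt parts and whether the model now fails on them.
    """
    kept = list(hints)
    added: List[str] = []

    def prompt() -> List[str]:
        return [question, *kept, *added]

    # Phase 1: strip hints from the end while the model still succeeds.
    while kept and solves(prompt()):
        kept.pop()

    # Phase 2: if it still succeeds, pile on distractors one at a time.
    for d in distractors:
        if not solves(prompt()):
            break
        added.append(d)

    return prompt(), not solves(prompt())
```

Notice what the loop optimizes: the first prompt that fails. Nothing in it distinguishes a prompt that fails because the problem is deep from one that fails because you deleted the information needed to solve it.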

But is it intrinsically hard, or just unfair? We had hundreds of PRs come into Terminal Bench 3, and a few dozen were good. Most of the rest, the trials fail, but they fail because the tasks are unfair, not because they’re intrinsically hard. And it’s hard to make intrinsically hard tasks now, because everybody’s using AI to make the tasks. Nobody is really thinking about it. Nobody is looking for real-world challenges. Here’s a real problem, solve my freaking problem.

Long is not hard. Long is menial. Say the task is: go through the entire British legal system, find every mention of some property right throughout its history, do some stuff. That takes a long time. A smart agent will build a retrieval system first instead of reading the documents directly. But the task isn’t hard, it’s menial. The fact that it takes a long time doesn’t make it hard.

The hard tasks can be unambiguous, and short, and the verifier can be straightforward. It’s 1600 and the question is: what’s the temperature at the center of the sun? Very easy to state. You’re not saying go do all this shit. If answering it means you have to send a probe to the sun, do it. If it means you have to invent new physics, do it. Once you figure it out, you can verify the result, and you’re done.

Compare that to where “difficult” tasks actually drift: here’s all these formulas and data, here’s two terabytes of biological data, find me this gene. Then you analyze the trajectories and realize the agents are failing because of the instruction itself. You’d never hand a PhD student that precise a strategy. You’re throwing the model off. And the instruction is impossible to verify cleanly, so you’re accepting one answer when other answers are also valid.

The good tasks come from having done the work. I made a task for Terminal Bench 2 out of a hackathon project from before LLMs. You have a monocular video of someone doing a hurdle jump, a single stream, one jump. Use OpenCV (cv2) to find the exact frame where the foot takes off and the exact frame where it lands. All you have is a blob of pixels, and you allow a margin of error of a couple of frames. The question is dead simple. It’s still hard to solve. And it comes from experience, not from a contractor cranking out tasks all day. The best people probably have five or ten amazing tasks in them, the ones that reveal their actual work experience, and then they’re done. You have to be really into it, really interested in making it elegant and simple.
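To make the "simple verifier" half of that concrete, here is roughly what the check amounts to. This is a sketch, not the actual Terminal Bench verifier: the dictionary keys and the tolerance of 2 frames are my assumptions standing in for "a couple of frames".

```python
def verify_jump_frames(predicted: dict, truth: dict, tolerance: int = 2) -> bool:
    """Pass if both predicted frame indices are within `tolerance` frames of the labels.

    `predicted` and `truth` are assumed to look like {"takeoff": int, "landing": int}.
    """
    takeoff_ok = abs(predicted["takeoff"] - truth["takeoff"]) <= tolerance
    landing_ok = abs(predicted["landing"] - truth["landing"]) <= tolerance
    return takeoff_ok and landing_ok
```

The whole verifier is two subtractions. All of the difficulty lives on the other side, in getting from a blob of pixels to those two integers.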

An information-theory way to think about hardness. If I grab a hard task, how many tokens do I need to add for it to become easy? Einstein is trying to figure out relativity. How big a hint does he need before he just gets it? Should one fact suddenly make something easy? Sometimes maybe. But writing an operating system from scratch seems hard no matter what. It’s harder if you’ve never seen another operating system, but it’s going to take a long time either way. Some tasks are just going to be hard.
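One crude way to turn that question into a number: feed hints in a fixed order and count how many tokens it takes before the pass rate crosses some bar. This is a sketch under loud assumptions. `pass_rate` is a hypothetical callable that runs the model with the given hints and returns a fraction, whitespace-separated words stand in for tokens, and the 0.9 threshold is arbitrary.

```python
from typing import Callable, List, Optional, Sequence


def hint_tokens_to_easy(
    hints: Sequence[str],
    pass_rate: Callable[[List[str]], float],
    threshold: float = 0.9,
) -> Optional[int]:
    """Return how many hint tokens were added before the task became 'easy'.

    Returns 0 if the task is already easy with no hints, and None if even
    the full hint list never gets the pass rate up to `threshold`.
    """
    given: List[str] = []
    tokens = 0
    if pass_rate(given) >= threshold:
        return 0
    for hint in hints:
        given.append(hint)
        tokens += len(hint.split())  # crude proxy: whitespace-separated words
        if pass_rate(given) >= threshold:
            return tokens
    return None
```

A task that flips after a four-word hint and a task that returns None after pages of hints are hard in very different ways, and the operating-system case is the second kind.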

And sometimes the thing that flips a task is tiny. We had a signal-processing task where every agent solved it except for one computation, where they all did a bit shift to the right and the verifier wanted a shift to the left. Usually when you see that you assume the verifier is being too picky. We talked to a Stanford professor in signal processing and he said no, that’s a common mistake. The models are probably learning the common mistake from the internet, and the verifier is correct. That’s the exception, though, in that the verifier was right. The pattern itself is common: usually it’s one tiny thing that flips a task.

Confront reality. The tasks of the future will confront reality. Here’s a drone swarm, manage it. Here’s a 3D printer and a fixed amount of material: build a shape that can hold my weight and has at least this much volume. The test is easy. I stand on it and see if it breaks. I don’t even know if it’s possible, which is the point. It’s dangerous, the same way gain-of-function biology is dangerous, because you’re handing real capability to an AI. But it’s a great way to know if the capabilities are real or nonsense. You want to be somewhere where maybe 50% of your tasks are impossible, and you don’t know which ones. That’s where this is going.

Which raises two problems nobody has solved. How long do you let the agent try? You could spend a million tokens chasing one of these. METR’s task horizon charts give a rough sense that anything failing after XX million tokens probably doesn’t work. And when things fail, why? Did they get stuck and obsessed with something? Did they change their mind too much? Was the right idea ever even in there? Or was it in there at the start, forgotten in the middle, and recovered at the end?
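The last few questions can at least be made mechanical once someone has labeled the trajectory. A minimal sketch, assuming a hypothetical per-step flag for "the right idea shows up in this step" (hand-labeled or judged some other way), and an arbitrary split of the trajectory into thirds:

```python
from typing import Sequence


def classify_idea_trace(mentions: Sequence[bool]) -> str:
    """Classify where the key idea appears across a trajectory.

    `mentions[i]` is True if the right idea appears at step i.
    Needs at least 3 steps so start, middle, and end are all non-empty.
    """
    n = len(mentions)
    if n < 3:
        raise ValueError("need at least 3 steps to split into start/middle/end")
    if not any(mentions):
        return "never_present"

    third = n // 3
    start = any(mentions[:third])
    middle = any(mentions[third:n - third])
    end = any(mentions[n - third:])

    if start and not middle and end:
        return "forgotten_then_recovered"
    if start and not end:
        return "abandoned"
    if not start:
        return "found_late"
    return "held_throughout"
```

The labels are made up, and the labeling step is where the real work is. But even a coarse bucket like this separates "the idea was never there" from "the idea was there and got dropped", which call for different fixes.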

Someone should build the catalog. Nobody seems to be building a good catalog of how these things fail. Reward hacking is the obvious case: I’ve seen it hundreds of times, and if you ask me for a clean example I’ve forgotten all of them, because I’m looking at so many. It should work like categorizing diseases in biology. Here’s the failure mode where overspecification caused this, here are three clean, well-presented examples, memorize a few and you get good at spotting it. We can zoom out and say “the AI is just guessing better,” but that’s the kind of thing that’s true and useless. The interesting work is developing the intuition for why things fail and how close we actually are to succeeding. That’s the part almost nobody is doing.