August 18, 2025
By Federico Zambelli
GPT-5,
LLMs
This article was originally published on hashnode.dev on Aug 12, 2025.
August 7, 2025
GPT-5 is released to the public. OpenAI calls it its “smartest, fastest, most useful model yet.”
August 11, 2025
The first comments start rolling in.
“I am actually pretty disappointed. Altman's claims were about a revolutionary technology and this model is not. It is a cheaper version of GPT4 with some improvements in specific tasks. IMHO I think that the conversational side of the model is worse than GPT4. [..] It is far from the intelligence level that OpenAI marketed in the last few months.”
“I have spent the weekend with GPT 5 and I'm done. This is weird - because it's meant to be so much better.”

“With GPT 5 [..] prompts which used to get you best possible outcomes with previous established models may not give you same response.”
I picked the first comments that came up by querying “gpt 5” on LinkedIn two minutes before writing this, but feel free to try for yourself. I bet $10 that you’ll get similar results.
Yet, benchmarks tell a completely different story (and Edward Tufte would have a stroke if he saw this chart).
So who’s right, the people or the benchmarks? Well, yes.
The thing is, benchmarks are not an objective measure of usefulness. Rather, they measure how well the LLM performs on the specific benchmark task(s).
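To be concrete: a headline benchmark score usually boils down to a pass rate over a fixed task set, nothing more. Here's a toy sketch (the tasks and the grader are made up, not any real benchmark) showing how little that single number encodes:

```python
def benchmark_score(model, tasks):
    """Fraction of a fixed task set the model's output passes.

    `model` is any callable prompt -> answer; each task pairs a
    prompt with a checker. The headline number is just this ratio.
    """
    passed = sum(check(model(prompt)) for prompt, check in tasks)
    return passed / len(tasks)

# Toy "benchmark": it only ever probes arithmetic.
tasks = [
    ("2+2", lambda a: a.strip() == "4"),
    ("10*3", lambda a: a.strip() == "30"),
]

# A calculator scores a perfect 1.0 here -- which tells you nothing
# about how it handles coding, writing, or therapy-adjacent chats.
print(benchmark_score(lambda p: str(eval(p)), tasks))
```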
But people are not using ChatGPT to solve benchmarks; instead, they use it to talk about an unbelievably large range of stuff, ranging from coding to therapy to… *checks notes* getting married to the AI itself…?
Anyway.
I can already hear the “experts” screaming at the top of their lungs: “akSHuALLy, there are many different benchmarks that LLMs get tested on! There’s no better way to measure them!”
Ok Einstein, now hear me out: assuming you’re not a cheeto-fingered basement dweller, do you shower and dress in your best clothes when you go on a first date? Yes?
So what if… what if these benchmarks are tailored in a way that *gasp* makes the investors and customers go like the gossiping girls meme, when a new model is released?
How queer that for-profit organizations dress up their products in the best possible clothes to generate as much cash flow as possible. Peculiar indeed.
But if not benchmarks, then what?
This is one of the cases where people's consensus is a reliable indicator of performance.
In other words, the best LLM for a given area is the one that users stick to after the release hype of a new model fades.
The proompters, the vibecoders, the LinkedIn lunatics, the software engineers… they all collectively gravitate towards the LLM that gives them the best bang for their buck. Because in the end, it’s the user experience that matters, not some cooked-up benchmark.
The good news is that this consensus is very easily tracked and measured: many AI aggregators have leaderboards where you can see the distribution of AI usage over time and even narrow it to specific areas.
It’s also much easier to read: “GPT-4.1 has a 3% market share in the Marketing area” is a much better indicator of performance than “it scores 52% on Aider Polyglot”.
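If you want to track that yourself, a minimal sketch like the one below will do. The endpoint URL and JSON shape here are hypothetical stand-ins for whatever your aggregator of choice actually exposes (OpenRouter's public rankings, for instance); treat it as an illustration, not a real API.

```python
import requests

# Hypothetical aggregator endpoint -- a stand-in for the leaderboard
# API of whichever AI aggregator you actually use.
LEADERBOARD_URL = "https://api.example-aggregator.com/v1/usage"

def model_share(category: str) -> dict[str, float]:
    """Return each model's share of total usage in a category.

    Assumes the (hypothetical) endpoint returns JSON like:
    [{"model": "gpt-4.1", "tokens": 1200}, ...]
    """
    rows = requests.get(
        LEADERBOARD_URL, params={"category": category}, timeout=10
    ).json()
    total = sum(row["tokens"] for row in rows)
    return {row["model"]: row["tokens"] / total for row in rows}

if __name__ == "__main__":
    # Print models by usage share, biggest first.
    shares = model_share("marketing")
    for model, share in sorted(shares.items(), key=lambda kv: -kv[1]):
        print(f"{model}: {share:.1%}")
```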
And if you don’t trust me, trust him.