A new benchmark for benchmarks

Geekbench’s creator on version 6 and why benchmarks matter in the real world

“How hard can it be to write a benchmark? Maybe I should write my own.”

Andrew Cunningham
Credit: Primate Labs

We review a lot of hardware at Ars, and part of that review process involves running benchmark apps. The exact apps we use may change over time and based on what we’re trying to measure, but the purpose is the same: to compare the relative performance of two or more things and to make sure that products perform as well in real life as they do on paper.

One app that has been a consistent part of our test suite for over a decade is Geekbench, a CPU and GPU compute benchmark that is releasing its sixth major version today. Partly because it’s small, free, and easy to run; partly because developer Primate Labs maintains a gigantic searchable database spanning millions of test runs across millions of devices; and partly because it will run on just about anything under the sun, Geekbench has become one of the Internet’s most-used (and most-argued-about) benchmarking tools.

“I’m really glad that people seem to have latched onto it,” Primate Labs founder and Geekbench creator John Poole told Ars of Geekbench’s popularity. “I know Gordon Ung at PCWorld basically calls Geekbench the official benchmark of Twitter arguments, which is the fallout from that.”

Cross-platform right from the start

Geekbench’s cross-platform compatibility, baked into the benchmark since its earliest versions, is part of its appeal. It began at the height of the PowerPC Mac era, when Apple’s hardware was exotic and niche and apps that ran on Mac OS X were relatively rare.

“I just switched over to the Mac back in about 2002,” Poole told Ars. “So I was getting used to that ecosystem. And then the [Power Mac] G5 came out and I thought, oh, this looks really cool. I went out, bought one of the new G5s, and it felt slower than my previous Mac. And I thought, well, this is really strange; what’s going on. … So, you know, I grabbed what [benchmarks] I could download and ran them and got really confused, because what the benchmarks were saying wasn’t jiving with my experience.

“So I actually went and I reverse-engineered one of the popular benchmarks and found that the tests were, for lack of a better word, terrible,” said Poole. “They weren’t really testing anything substantial, you know, doing really simple arithmetic operations on really small amounts of data, not really testing anything. And so I thought, how hard can it be to write a benchmark? Maybe I should write my own.”

The original Geekbench (called “Geekbench 2006” and apparently lost to time) supported Windows and macOS at launch. Geekbench 2, released in 2007, added Linux support. An official iPhone version followed in 2010, and an Android version came out in 2012. Since Geekbench 3 was released in mid-2013, a revamped version with new focus areas and reformulated tests has arrived roughly every three years.

And it’s not just mainstream, general-use hardware and software that can run Geekbench. Geekbench could run on the PlayStation 3’s Cell processor (“[not] all that impressive as a general-purpose CPU,” wrote Poole at the time). There was even, briefly, a version that ran natively on the short-lived BlackBerry 10. Here at Ars we’ve run it on everything from the oddball all-open-source MNT Reform laptop to the first wave of Android Wear smartwatches, and hundreds of desktops, laptops, phones, and tablets besides.

What’s new in Geekbench 6

Geekbench 6 includes new and more strenuous workloads than Geekbench 5, but the app is still designed to be simple to run. Credit: Primate Labs

Geekbench is continually updated with bug fixes, updates for specific tests, and improvements to address issues with new hardware, but a new major release of the benchmark is an opportunity to give the software a more thorough rethink.

In Geekbench 6, the biggest change is probably the way multi-core scores are calculated, measuring “how cores cooperate to complete a shared task” rather than assigning different tasks to each core. This is meant to better reflect how actual multi-core workloads operate, especially for hybrid CPU architectures that mix big, fast cores and small, power-efficient ones, an ever-growing category of chips that includes most modern ARM processors and Intel’s 12th- and 13th-generation CPUs.
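The distinction can be sketched in a few lines. This is a toy example, not Geekbench’s actual code: a trial-division prime counter stands in for a real workload, and Python threads keep the sketch portable (a real CPU benchmark would use native threads or processes to get true parallelism past Python’s GIL). The point is the difference in how the work is divided.

```python
from concurrent.futures import ThreadPoolExecutor

def count_primes(bounds):
    """Trial-division prime count over [lo, hi) -- a stand-in workload."""
    lo, hi = bounds
    count = 0
    for n in range(max(lo, 2), hi):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count

def separate_tasks(workers, n):
    # Old-style scoring: every worker runs the full job independently,
    # so the score is roughly single-core throughput times core count.
    with ThreadPoolExecutor(workers) as pool:
        return list(pool.map(count_primes, [(0, n)] * workers))

def shared_task(workers, n):
    # Geekbench 6-style scoring: one job is split into chunks that the
    # workers complete cooperatively; the benchmark times the whole job,
    # so coordination overhead and slower cores affect the result.
    step = n // workers
    chunks = [(i * step, (i + 1) * step) for i in range(workers)]
    chunks[-1] = (chunks[-1][0], n)  # last chunk covers any remainder
    with ThreadPoolExecutor(workers) as pool:
        return sum(pool.map(count_primes, chunks))
```

In the shared-task model, slow efficiency cores and coordination overhead drag on the single shared result, which is closer to how real multithreaded apps behave on hybrid chips.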

Many Geekbench 5-era tests are now run using larger datasets, and there are new tests for blurring backgrounds during video calls and running machine-learning models, two things that are much more common now than they were when Geekbench 5 was released in 2019.

“When you look at how people are using their smartphones, a simple change from 2019 to 2023 is smartphone sensor sizes for their cameras,” Poole told Ars. “iPhones have got 48-megapixel sensors. Samsung’s have, I think, up to 108 megapixels on some of their phones. I’ve lost track. But you’ve got this explosion of camera data that’s happening. You also have new applications that weren’t there necessarily in 2019… I was doing no video conferencing in 2019, and now I spend a lot of my day in front of a computer doing that. So, you know, we’ve added workloads that sort of capture some of that, the performance implications of that.”

Machine learning (ML) is also a focus of Geekbench 6, though not to the extent that it is in Geekbench ML, Primate Labs’ ML-specific benchmark. (Geekbench ML runs tests using your device’s CPU, GPU, and AI acceleration hardware, if there is any; Geekbench 6 remains a primarily CPU-centric benchmark.)

“Talking about the video conferencing apps,” Poole continued, “when you’re doing any sort of background effects [to] hide your messy background or something like that, that’s using ML in the background to sort of segment the image into a foreground and background and then applying a blur. So, you’ve got a workload designed to specifically capture that sort of performance… Photo library applications, things like Apple Photos or Google Photos or what have you, they’ll use ML in the background to automatically tag images.”
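The segmentation-then-blur pipeline Poole describes can be illustrated in a few lines of NumPy. This is a toy sketch, not Geekbench’s workload: the segmentation model is stubbed out as a precomputed foreground mask, and a simple box blur stands in for the real filter.

```python
import numpy as np

def box_blur(img, radius=1):
    """Average each pixel with its neighbors (edges clamped)."""
    padded = np.pad(img, radius, mode="edge")
    out = np.zeros_like(img, dtype=float)
    size = 2 * radius + 1
    for dy in range(size):
        for dx in range(size):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / size ** 2

def blur_background(img, fg_mask, radius=1):
    """Keep foreground pixels sharp; substitute blurred background pixels."""
    blurred = box_blur(img, radius)
    # Where the mask says "foreground", keep the original pixel;
    # everywhere else, use the blurred version.
    return np.where(fg_mask, img, blurred)
```

A real app would get `fg_mask` from a neural network and use a better blur, but the compositing step (keep foreground pixels, substitute blurred background pixels) is the same.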

Other changes simply seek to resolve confusion about the things the app is testing. That includes the removal of broad sub-score categories like “integer” and “crypto,” the latter of which frequently confused Geekbench users (and even some Ars commenters) who erroneously assumed it was measuring cryptocurrency mining performance instead of cryptographic hardware acceleration. Some individual tests have been renamed, too.

“Having a workload called ‘camera’ makes people think that we’re actually testing the camera,” said Poole. “And I remember after we shipped in [Geekbench] 5, we introduced the camera workload and people are saying like, ‘Oh, this is testing the camera.’ I just, I remember sitting at my desk going, ‘Oh, wow, we really messed that one up.’”

What Geekbench is for

A Geekbench 6 run takes a few minutes to complete on a modern system. It’s intentionally designed not to be complicated or overly taxing to run. Credit: Andrew Cunningham

Read through the benchmark descriptions for Geekbench 6 (we’ve uploaded PDFs with short descriptions of the individual CPU and GPU compute tests), and you might think that many of the tests it’s running are still focused on relatively small datasets, especially compared to what you might encounter in the real world. Some tests include a 75MB compressed archive, a 1.5MB PDF file, or a three-megapixel image. These datasets are bigger than they were in Geekbench 5 but still not necessarily as big as the files you’ll encounter in real use. The entire benchmark still takes a few minutes to run on a modern PC or phone.

That means Geekbench isn’t always great for measuring sustained performance—how your device will run over a long period of time—or how it will perform when your CPU and GPU are both active at the same time, when each component is consuming its own power and generating its own heat.

This isn’t always relevant information for people using their devices day to day. Launching an app, opening a file, or installing an update requires a lot of speed for a little while, but once those tasks are done, your CPU can go back down to the near-idle state it spends most of its time in, and rendering windows and dialog boxes only requires tiny blips of performance from your computer’s GPU. But it’s more important for gaming or CPU-based video encoding or rendering jobs—anything where your processor is partially or entirely engaged with a task for more than a few minutes.

These kinds of tests are especially important for enthusiasts and professional users, but they can also be used to measure heat output, sustained power use, and thermal throttling, variables that can differ dramatically and unpredictably between systems that, on paper, use exactly the same components.

For Poole, staying focused on a lighter workload that doesn’t take long to run is part of the point: “We’re trying to make something that’s easy,” Poole told Ars. “We’re trying to make something that’s useful.”

“I think that’s one of the reasons why [Geekbench is] so popular is that, you know, you just download the app, click a button, and you’ve got a result three minutes later,” said Poole. “We run other benchmarks internally. We definitely want to know when Geekbench agrees with those benchmarks. We definitely want to know when Geekbench disagrees with those benchmarks. And what we’ve seen is that a lot of the other benchmarks… A lot of the consumer-facing ones, like 3DMark and Cinebench, obviously those are fairly easy to run as well. But when you get to a lot of other ones, the cross-platform ones that people hold up as the gold standards of CPU comparisons, they’re bears to run and, like, only a handful of people can do that.”

Aside from ease of use, it’s also true that no single benchmark can tell you everything there is to know about a device’s performance. There are more gaming- and GPU-centric benchmarks than one can easily count, spread out across all kinds of games, rendering engines, and graphics APIs. Benchmarks like Cinebench or our own Handbrake video transcoding test are available for testing sustained performance over time. Even benchmarks that purport to measure the same thing can produce varying results from system to system, based on how well the benchmark has been optimized for a given architecture (or how well the hardware has been optimized to run the tasks involved in a given benchmark).

Why benchmarking matters

The Geekbench results browser has a gigantic database of test scores you can dig through to compare your system’s performance to others with similar components. Credit: Andrew Cunningham

It’s easy to dismiss benchmarking as a pointless exercise, something you don’t need to care about unless you’re a hardware reviewer or a hobbyist trying to show off your hardware’s prowess on a PC-building forum or subreddit. It’s true that, for most day-to-day tasks, most people would be hard-pressed to subjectively tell the difference between a Core i5, a Core i7, or a Ryzen 7 CPU, or between an Apple A13 and A15.

But even if you don’t care about bragging rights, there’s still value in knowing how fast something is supposed to be, so that you can tell when something is wrong.

“Back a number of years ago, I was talking to a friend of mine who said, ‘Oh, yeah, I just used Geekbench the other day because my Mac felt slow,’” said Poole. “And ran Geekbench, the numbers were half of what they should be. He took it into the Apple Store, they opened up the laptop, and his heatsink had cracked in half. So people using it as a diagnostic tool, I think, is something that being extremely online as we are, we don’t necessarily see that side of it. But I think that’s still something that’s really useful.

“Obviously, you know, you’ve got the people who want to argue about performance on the Internet. You’ve got people who want to say, ‘Oh, hey, my score is better; therefore, I’m a better person than you.’ That sort of thing. But I think there’s a number of users out there that just want to know… what should I upgrade to, or if I’ve upgraded, is my new system working properly? Or is my current system working properly? We really don’t want to create a benchmark that just sits in a vacuum and generates numbers that have no reflection or bearing on what people are doing with their phones or their PCs or their Macs or their laptops or the desktops. We really want this to be a tool that people can use to figure out questions they have about performance.”


Andrew Cunningham Senior Technology Reporter
Andrew is a Senior Technology Reporter at Ars Technica, with a focus on consumer tech including computer hardware and in-depth reviews of operating systems like Windows and macOS. Andrew lives in Philadelphia and co-hosts a weekly book podcast called Overdue.