Google announces Gemma 4 open AI models, switches to Apache 2.0 license

michaeltherobot · Apr 2, 2026

It's neat to reflect that LLM capabilities that used obscene amounts of energy three years ago can now run on a smartphone. Every time I am tempted to post a snarky comment about how useless LLM output is compared to the energy cost involved, stories like this help me calm down.

And then the word "Gemmaverse" makes me even madder.

buback · Apr 2, 2026

~~Ollama link is incorrect:~~ ~~https://ollama.com/library/gemma4~~
Edit: link corrected!

pjcamp · Apr 2, 2026

Google’s Gemini AI models have improved by leaps and bounds over the past year,

Apparently, Google forced Gemini onto my phone last night. It came up this morning when I gave an address to navigate to. Thereafter followed an annoying and stupid conversation.

"I know what's at that address, the Emory Brain Health Center. There are several clinics in the building, are you going to one in particular?"

No. shut up and go.

"Here are some places nearby. Would you like to add any of these to your trip?"

No. Shut up and go.

"It seems you want to navigate to the Emory Brain Health Center, Is that correct?"

Yes. Shut up and go.

"To do this, you need to give me permission to access Maps. Please press the OK button on the screen."

Which I did. And the damn thing forgot the entire conversation we just had. It no longer knew where I wanted to go. I gave it the address and had to go through the whole stupid conversation a second time.

If this is a leaps and bounds improvement, well, compared to what? Assistant is short and to the point. Gemini won't shut up. Ever. It will stop talking when it is damn good and ready. Your desire for it to just shut up and go is immaterial.

Gemini is annoying, stupid, and, like Trump, it has to continually demonstrate how not stupid it is.

If Assistant goes away and Gemini becomes mandatory, I'm buying an iPhone.

rwhitwam · Apr 2, 2026

buback said:
Ollama link is incorrect: https://ollama.com/library/gemma4

Sorry about that. Google gave us the wrong link!

MTSkibum · Apr 2, 2026

I wish they had benchmarked with a 5080; I assume the performance would be somewhere between the Mac and the 5090.

I am not able to find the requirements needed for their 26B and 31B models in this release.

https://blogs.nvidia.com/blog/rtx-ai-garage-open-models-google-gemma-4/

EmphyrioDonk · Apr 2, 2026

If Assistant goes away and Gemini becomes mandatory, I'm buying an iPhone.

Why wait? It's Google! The sooner you leave, the better. No-brainer stuff.

buback · Apr 2, 2026

pjcamp said:
If this is a leaps and bounds improvement, well, compared to what? Assistant is short and to the point. Gemini won't shut up. Ever. It will stop talking when it is damn good and ready. Your desire for it to just shut up and go is immaterial.

Gemini is annoying, stupid, and, like Trump, it has to continually demonstrate how not stupid it is.

If Assistant goes away and Gemini becomes mandatory, I'm buying an iPhone.

These LLMs need an option for concise responses.

buback · Apr 2, 2026

rwhitwam said:
Sorry about that. Google gave us the wrong link!

Ha! They probably had Gemini compile the links.

Tactical Finesse · Apr 2, 2026

rwhitwam said:
Sorry about that. Google gave us the wrong link!

I'm not surprised. Google and Microsoft compete for the dumbest dropped balls.

rwhitwam · Apr 2, 2026

buback said:
Ha! They probably had Gemini compile the links.

You know, I have begun to suspect that Google is using LLMs to assemble press kits. Some oddly consistent errors keep happening.

GameBoyColor · Apr 2, 2026

Does anyone have advice for how to run local models on android phones? Specifically, I'd like to run Gemma E4B on a Pixel in airplane mode.

**RTS** · Apr 2, 2026

buback said:
These LLMs need an option for concise responses.

Why would they when they are paid per token?

gijames1225 · Apr 2, 2026

Does anyone know of any good comparisons of

pjcamp said:
Apparently, Google forced Gemini onto my phone last night. It came up this morning when I gave an address to navigate to. Thereafter followed an annoying and stupid conversation.

"I know what's at that address, the Emory Brain Health Center. There are several clinics in the building, are you going to one in particular?"

No. shut up and go.

"Here are some places nearby. Would you like to add any of these to your trip?"

No. Shut up and go.

"It seems you want to navigate to the Emory Brain Health Center, Is that correct?"

Yes. Shut up and go.

"To do this, you need to give me permission to access Maps. Please press the OK button on the screen."

Which I did. And the damn thing forgot the entire conversation we just had. It no longer knew where I wanted to go. I gave it the address and had to go through the whole stupid conversation a second time.

If this is a leaps and bounds improvement, well, compared to what? Assistant is short and to the point. Gemini won't shut up. Ever. It will stop talking when it is damn good and ready. Your desire for it to just shut up and go is immaterial.

Gemini is annoying, stupid, and, like Trump, it has to continually demonstrate how not stupid it is.

If Assistant goes away and Gemini becomes mandatory, I'm buying an iPhone.

The non-determinism (or at least, highly obfuscated determinism) of LLMs is fascinating. I have never had Gemini pop back with suggestions when navigating, but it must have some sort of context switch that tells it to be more or less ~~proactive~~ annoying.

Fred Duck · Apr 2, 2026

buback said:
These LLMs need an option for concise responses.

"That is a great point! A concise answer would be helpful because it would allow you, the user, to receive the exact answer you needed–with minimal fluff! I can see how that sort of option would be beneficial so of course I'm afraid I can't do that, buback."

rwhitwam said:
You know, I have begun to suspect that Google is using LLMs to assemble press kits. Some oddly consistent errors keep happening.

Management probably thought it would be a fine showcase of their service and how it can assist businesses in future!

RoryEjinn · Apr 2, 2026

GameBoyColor said:
Does anyone have advice for how to run local models on android phones? Specifically, I'd like to run Gemma E4B on a Pixel in airplane mode.

I think it largely varies across your devices (Samsung has like an AI Select thing going on if you can get it working), but it's usually by using an app like Google AI Edge Gallery/AnythingLLM or by using Termux to install and build the Ollama client locally. But you'd need to allow internet access at least as long as it takes to download the model.

Xyler · Apr 2, 2026

80GB GPU. If that means VRAM, does that mean the 128GB Framework Desktop could theoretically run this monster?

You can allocate up to 112GB of the unified memory to the GPU under Linux after all.

georges · Apr 2, 2026

Damn, I'm getting 56 tok/sec with Gemma 4 E4B on a stock MacBook Pro M4 with LMStudio/MLX.

And it's very capable...

Tactical Finesse · Apr 2, 2026

Xyler said:
80GB GPU. If that means VRAM, does that mean the 128GB Framework Desktop could theoretically run this monster?

You can allocate up to 112GB of the unified memory to the GPU under Linux after all.

Actually, on Strix Halo, Linux will dynamically allocate as much of the shared memory pool as it needs and can get. And yes.

An 80GB model will not be quick. And a lighter quant will probably give similar output, much faster. My favorite model on LMStudio on my Framework Desktop is a ~30GB(?) Nemotron model from Nvidia....it is one of the newer models that is larger but can still do 60 tokens/second without needing to spend my life trying to optimize configuration. There are much larger models--but for casual experimentation and play, the output isn't so much better that it's worth the much slower generation.

CrisR82 · Apr 2, 2026

pjcamp said:
Apparently, Google forced Gemini onto my phone last night. It came up this morning when I gave an address to navigate to. Thereafter followed an annoying and stupid conversation.

"I know what's at that address, the Emory Brain Health Center. There are several clinics in the building, are you going to one in particular?"

No. shut up and go.

"Here are some places nearby. Would you like to add any of these to your trip?"

No. Shut up and go.

"It seems you want to navigate to the Emory Brain Health Center, Is that correct?"

Yes. Shut up and go.

"To do this, you need to give me permission to access Maps. Please press the OK button on the screen."

Which I did. And the damn thing forgot the entire conversation we just had. It no longer knew where I wanted to go. I gave it the address and had to go through the whole stupid conversation a second time.

If this is a leaps and bounds improvement, well, compared to what? Assistant is short and to the point. Gemini won't shut up. Ever. It will stop talking when it is damn good and ready. Your desire for it to just shut up and go is immaterial.

Gemini is annoying, stupid, and, like Trump, it has to continually demonstrate how not stupid it is.

If Assistant goes away and Gemini becomes mandatory, I'm buying an iPhone.

You do realize you can just disable it, right?
I mean yeah, it's dumb to have something you don't want installed on your device, but can we just take a moment to remember that Apple does the same thing?
Anyone else remember the WALLET APP advertising the F1 movie? Don't kid yourself: whatever you buy, if there is a big company behind it, you'll get stuff shoved down your throat. It might be today, it might be tomorrow, it might be in a year, but it absolutely WILL happen.

EDIT: grammar

Resistance · Apr 2, 2026

**RTS** said:
Why would they when they are paid per token?

Because the cost per token generated appears to be more than the revenue per token generated.

norton_I · Apr 2, 2026

buback said:
These LLMs need an option for concise responses.

No problem, just use another LLM to summarize the output from the first one.

MilesArcher · Apr 2, 2026

I just tried Gemma4 on my desktop PC. Didn't do anything like a benchmark. Just pointed it at a problem that I'm trying to solve. It looks pretty good, though it didn't solve the problem (neither has Claude).

norton_I · Apr 2, 2026

Xyler said:
80GB GPU. If that means VRAM, does that mean the 128GB Framework Desktop could theoretically run this monster?

You can allocate up to 112GB of the unified memory to the GPU under Linux after all.

How powerful are the iGPU/NPUs in there? Obviously they aren't earth shattering, but are they good enough to do something useful when you care about the response time? How would it compare to running on something like an RTX5090 with the memory shared over PCIe? My impression is that these large models end up mostly memory bound and the penalty for swapping out over PCIe is a bigger impact than FLOPs, but a slow enough processor might flip that.

ERIFNOMI · Apr 2, 2026

pjcamp said:
Apparently, Google forced Gemini onto my phone last night. It came up this morning when I gave an address to navigate to. Thereafter followed an annoying and stupid conversation.

"I know what's at that address, the Emory Brain Health Center. There are several clinics in the building, are you going to one in particular?"

No. shut up and go.

"Here are some places nearby. Would you like to add any of these to your trip?"

No. Shut up and go.

"It seems you want to navigate to the Emory Brain Health Center, Is that correct?"

Yes. Shut up and go.

"To do this, you need to give me permission to access Maps. Please press the OK button on the screen."

Which I did. And the damn thing forgot the entire conversation we just had. It no longer knew where I wanted to go. I gave it the address and had to go through the whole stupid conversation a second time.

If this is a leaps and bounds improvement, well, compared to what? Assistant is short and to the point. Gemini won't shut up. Ever. It will stop talking when it is damn good and ready. Your desire for it to just shut up and go is immaterial.

Gemini is annoying, stupid, and, like Trump, it has to continually demonstrate how not stupid it is.

If Assistant goes away and Gemini becomes mandatory, I'm buying an iPhone.

Wait til you hear who is powering Siri/"Apple Intelligence."

It's Gemini.

coburnjohn575 · Apr 2, 2026

The Open Source and local AI models are conveniently always forgotten in the arguments against AI. The anti-AI arguments usually focus on the environmental cost of data centers and the idea of a few greedy billionaires pushing AI on the masses. But these locally running open source models are probably the future of AI. They kind of puncture those arguments because things that run locally won't require datacenters and the intense cooling needs. And the Open Source nature shows that technology is not just driven by a few people at the top, but is something that comes from the community tinkering and experimenting, many times just for the love of it and not for pure greed.

Tactical Finesse · Apr 2, 2026

norton_I said:
How powerful are the iGPU/NPUs in there? Obviously they aren't earth shattering, but are they good enough to do something useful when you care about the response time? How would it compare to running on something like an RTX5090 with the memory shared over PCIe? My impression is that these large models end up mostly memory bound and the penalty for swapping out over PCIe is a bigger impact than FLOPs, but a slow enough processor might flip that.

The Strix Halo iGPU is roughly equivalent to a 5070 or 5070 dGPU laptop chip in terms of gaming compute performance. I have a 395 128GB framework desktop for my daily machine at home running Linux.

IME, the larger, more complex models you run will max out the GPU before the 200GB/second memory bandwidth becomes the limiter, IIRC. But that is because you're dealing with a laptop GPU coupled with a server-level memory pool in terms of size and bandwidth. It is a fun platform to experiment with--particularly if you got one before the price increases hit (went up 25% since before RAMpocalypse), and a very capable machine in a quiet 4L package; thanks to the absurd memory pool I expect it to be in service for a long time.

I've never used the NPU. Some folks have. But you rapidly get into configuration and tweaking hell--as opposed to the simple joy of just opening LMStudio and running a model you downloaded without needing to brainstorm how to even get the NPU and GPU to run at once.
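
For a rough sense of the memory-bound ceiling being discussed here, this is a minimal sketch, assuming token generation has to stream every active weight from memory once per token and nothing else matters. The 200 GB/s figure is the one quoted above; the function name and the sizes passed to it are illustrative assumptions, not measurements.

```python
def decode_tokens_per_second_ceiling(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Upper bound on decode speed when generation is limited by memory bandwidth.

    Assumes each generated token streams gb_read_per_token gigabytes of weights
    from memory exactly once. KV cache reads and compute time are ignored, so
    real throughput will be lower than this ceiling.
    """
    if bandwidth_gb_s <= 0 or gb_read_per_token <= 0:
        raise ValueError("bandwidth and data read per token must be positive")
    return bandwidth_gb_s / gb_read_per_token


if __name__ == "__main__":
    # 200 GB/s is the bandwidth quoted in the thread; the sizes are illustrative.
    for size_gb in (80.0, 30.0):
        ceiling = decode_tokens_per_second_ceiling(200.0, size_gb)
        print(f"{size_gb:.0f} GB read per token: at most {ceiling:.1f} tokens/s")
```

Under these assumptions a model that reads 80 GB per token tops out around 2.5 tokens/s. Hitting 60 tokens/s at 200 GB/s would mean reading only about 3.3 GB per token, which would fit a model that touches just a fraction of its weights for each token; that is an inference from the arithmetic, not something stated in the thread.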

Xyler · Apr 2, 2026

Tactical Finesse said:
Actually under Strix Halo Linux will dynamically allocate however much of the shared memory pool it needs/can. And yes.

A 80GB model will not be quick. And, probably, a lighter quant will get a similar output and much faster output. My favorite model on LMStudio on my Framework Desktop is a ~30GB(?) Nemotron model from Nvidia....it is one of the newer models that is larger but still can do 60tokens/second without needing to spend my life trying to optimize configuration. There are much larger models--but for casual experimentation and play, the output isn't that much remarkably better worth the much slower output.

Neat. Thanks for the response.

Steak and potatoes · Apr 2, 2026

That Arena List link is even more eye opening when you turn off the open weights filter and see that the model is ranking higher than GPT-4.5 and just below GPT-5.1. Put another way, it is better than where the huge proprietary models were a year ago but not quite to where they were 6 months ago.

Resistance · Apr 2, 2026

coburnjohn575 said:
The Open Source and local AI models are conveniently always forgotten in the arguments against AI. The anti-AI arguments usually focus on the environmental cost of data centers and the idea of a few greedy billionaires pushing AI on the masses. But these locally running open source models are probably the future of AI. They kind of puncture those arguments because things that run locally won't require datacenters and the intense cooling needs. And the Open Source nature shows that technology is not just driven by a few people at the top, but is something that comes from the community tinkering and experimenting, many times just for the love of it and not for pure greed.

What makes you think that the huge environmental impact of AI so far and until local becomes predominant is irrelevant?

What evidence do you have that the environmental impact of cooling an office building filled with workstations is less than cooling the same compute done in a datacenter?

What makes you so confident that there will come a point when either: all the very best AI models can be run locally, or, the use of the very best AI models will be insignificant?

What is your source that the anti AI arguments usually focus on the environmental cost?

Name one relevant LLM that comes purely from community tinkering and experimenting.

asnelt · Apr 2, 2026

GameBoyColor said:
Does anyone have advice for how to run local models on android phones? Specifically, I'd like to run Gemma E4B on a Pixel in airplane mode.

I haven't looked into the new Gemma 4 models yet. But I've been running a quantized Gemma 3 12B on my Pixel 9 Pro with llama.cpp in Termux. While there is a pre-compiled llama.cpp package in the Termux repositories, it lacks Vulkan support for the Pixel Mali GPU. Compiling llama.cpp with Vulkan support on the Pixel fixes this. With this setup, text is generated at decent speed on the GPU rather than the CPU.

EDIT: For compiling llama.cpp with Mali Vulkan support in Termux, do make sure to set the -DVulkan_LIBRARY and -DVulkan_INCLUDE_DIR compile flags. Sorry, it seems Ars won't let me share the full compile incantations here. But this GitLab snippet provides all the details.

kahn · Apr 2, 2026

Tactical Finesse said:
Actually under Strix Halo Linux will dynamically allocate however much of the shared memory pool it needs/can. And yes.

A 80GB model will not be quick. And, probably, a lighter quant will get a similar output and much faster output. My favorite model on LMStudio on my Framework Desktop is a ~30GB(?) Nemotron model from Nvidia....it is one of the newer models that is larger but still can do 60tokens/second without needing to spend my life trying to optimize configuration. There are much larger models--but for casual experimentation and play, the output isn't that much remarkably better worth the much slower output.

I'll try the gemma model tonight. It really looks like it could be a great model to run on Strix Halo, if Google's claims end up being correct.

So far my favourite model on Strix Halo is the GPT-OSS-120B model, which should end up at a similar size and is quite good at a variety of tasks. I get 30-40 tokens/s as well.

The main problem is prompt processing, which is quite slow compared to running models on a GPU.

Tactical Finesse · Apr 2, 2026

kahn said:
I'll try the gemma model tonight. It really looks like it could be a great model to run on Strix Halo, if Google's claims end up being correct.

So far my favourite model on Strix Halo is the GPT-OSS-120B model, which should end up at a similar size and is quite good at a variety of tasks. I get 30-40 tokens/s as well.

The main problem is prompt processing, which is quite slow compared to running models on a GPU.

Cool.

My biggest complaint: the models that are downloadable on LMStudio are pretty ancient for how fast the sector has been moving. Also, given how Google has made AOSP "open source" but basically useless to anyone who isn't a professional development house, part of me wonders if this Gemma model will be the same. Apache and FOSS, but useless without a massive amount of legwork. I think I tried a few of the Gemma models on LMStudio and they just crashed.

Of course...LMStudio, and Strix Halo LLM support is very very new and unstable on the best of days. I've never actually seen ROCm work IME, although I've chatted online with folks that did.

Waco · Apr 2, 2026

CrisR82 said:
You do realize you can just disable it, right?

You can't, though. Gemini is the default for Android Auto and you can't revert once they force you onto it.

I tried. If they've changed that I'm all ears but last I checked once you got moved...you're stuck with Gemini.

uhuznaa · Apr 2, 2026

coburnjohn575 said:
The Open Source and local AI models are conveniently always forgotten in the arguments against AI. The anti-AI arguments usually focus on the environmental cost of data centers and the idea of a few greedy billionaires pushing AI on the masses. But these locally running open source models are probably the future of AI. They kind of puncture those arguments because things that run locally won't require datacenters and the intense cooling needs. And the Open Source nature shows that technology is not just driven by a few people at the top, but is something that comes from the community tinkering and experimenting, many times just for the love of it and not for pure greed.

Why should the power needs be lower just because it runs distributed locally? If local AI becomes more efficient, running the same models in datacenters will be just as efficient, probably even more so. You will also need the hardware in place and powered on everywhere, even if it sits idle most of the time.

I mean, I fully agree with the privacy reasons for that, but the hardware costs and energy needs will be at least the same.

Boskone · Apr 2, 2026

E2B and E4B are more interesting to me than most AI models, as I just want a locally-run digital assistant. Being able to run a model basically as-needed on a smallish computer would be fine.

E.g. answer "What's the weather?" with a "dumb" assistant, but "I'm traveling from A to B, what should I expect en route?" could fire up the AI model.

We're pretty much there, but a more capable small model would still be nice.

afidel · Apr 2, 2026

So with an 8GB RPi 5, am I better off running E4B using Q4_0, or E2B using SPF8? I've got local voice processing working with Home Assistant; adding a local LLM for data processing might make it more useful.

darkowl · Apr 2, 2026

MTSkibum said:
I wish they had benchmarked with a 5080; I assume the performance would be somewhere between the Mac and the 5090.

I am not able to find the requirements needed for their 26B and 31B models in this release.

https://blogs.nvidia.com/blog/rtx-ai-garage-open-models-google-gemma-4/

If you’re running on a 5080 you’ll most likely be relying on community quants, and you’ll probably need at least 24GB for the dense model at any reasonable precision. More if you use a longer context window without limiting it to say 32k. (I think there’s a 32GB 5080 variant?)
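
A back-of-the-envelope way to check figures like this is to multiply parameter count by bits per weight. The sketch below assumes the 31B model is the dense one and that every weight is stored at the same bit width; the function name is hypothetical. Real quant formats carry extra scale data, and the KV cache and runtime buffers come on top, so treat the results as lower bounds.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in decimal gigabytes.

    One billion parameters at 8 bits each is one gigabyte, so the result is
    params_billion * bits_per_weight / 8. Context (KV cache) is not included.
    """
    if params_billion <= 0 or bits_per_weight <= 0:
        raise ValueError("parameter count and bit width must be positive")
    return params_billion * bits_per_weight / 8.0


if __name__ == "__main__":
    # 31 (billion parameters) is assumed to be the dense model from the thread.
    for bits in (16, 8, 4):
        print(f"{bits}-bit weights: about {weight_memory_gb(31, bits):.1f} GB")
```

At 4 bits the weights alone come to about 15.5 GB before any context is allocated, which is in line with the 24GB suggestion above.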

CrisR82 · Apr 2, 2026

Waco said:
You can't, though. Gemini is the default for Android Auto and you can't revert once they force you onto it.

I tried. If they've changed that I'm all ears but last I checked once you got moved...you're stuck with Gemini.

Honestly not sure what to tell ya. For all I know, we're both right. I AM aware that some OEMs work differently and there IS per-region software weirdness with phones, but I just checked on both my S25 Ultra and my parents' S21 FE and S24 FE. The S25U and S24FE both allow you to disable it and uninstall its updates, and the S21FE allows you to fully uninstall it. All 3 were purchased in the EU, 2 from Samsung directly and the other from a local tech store.

I don't have access to a car with Android Auto so can't check that, but I did notice the default apps menu has a selection toggle for Virtual Assistant that has Gemini pre-selected, maybe if you change it there to something else, that'll help? (if possible do reply with info on this, I'm genuinely curious)
