Google announces Gemma 4 open AI models, switches to Apache 2.0 license

It's neat to reflect that LLM capabilities that used obscene amounts of energy three years ago can now run on a smartphone. Every time I am tempted to post a snarky comment about how useless LLM output is compared to the energy cost involved, stories like this help me calm down.

And then the word "Gemmaverse" makes me even madder.
 
Upvote
105 (110 / -5)

pjcamp

Ars Tribunus Militum
2,513
Google’s Gemini AI models have improved by leaps and bounds over the past year,

Apparently, Google forced Gemini onto my phone last night. It came up this morning when I gave an address to navigate to. Thereafter followed an annoying and stupid conversation.

"I know what's at that address, the Emory Brain Health Center. There are several clinics in the building, are you going to one in particular?"

No. shut up and go.

"Here are some places nearby. Would you like to add any of these to your trip?"

No. Shut up and go.

"It seems you want to navigate to the Emory Brain Health Center, Is that correct?"

Yes. Shut up and go.

"To do this, you need to give me permission to access Maps. Please press the OK button on the screen."

Which I did. And the damn thing forgot the entire conversation we just had. It no longer knew where I wanted to go. I gave it the address and had to go through the whole stupid conversation a second time.

If this is a leaps and bounds improvement, well, compared to what? Assistant is short and to the point. Gemini won't shut up. Ever. It will stop talking when it is damn good and ready. Your desire for it to just shut up and go is immaterial.

Gemini is annoying, stupid, and, like Trump, it has to continually demonstrate how not stupid it is.

If Assistant goes away and Gemini becomes mandatory, I'm buying an iPhone.
 
Upvote
32 (66 / -34)

buback

Ars Scholae Palatinae
785
If this is a leaps and bounds improvement, well, compared to what? Assistant is short and to the point. Gemini won't shut up. Ever. It will stop talking when it is damn good and ready. Your desire for it to just shut up and go is immaterial.

Gemini is annoying, stupid, and, like Trump, it has to continually demonstrate how not stupid it is.

If Assistant goes away and Gemini becomes mandatory, I'm buying an iPhone.
These LLMs need an option for concise responses.
 
Upvote
8 (14 / -6)
Does anyone know of any good comparisons of
Apparently, Google forced Gemini onto my phone last night. It came up this morning when I gave an address to navigate to. Thereafter followed an annoying and stupid conversation.

"I know what's at that address, the Emory Brain Health Center. There are several clinics in the building, are you going to one in particular?"

No. shut up and go.

"Here are some places nearby. Would you like to add any of these to your trip?"

No. Shut up and go.

"It seems you want to navigate to the Emory Brain Health Center, Is that correct?"

Yes. Shut up and go.

"To do this, you need to give me permission to access Maps. Please press the OK button on the screen."

Which I did. And the damn thing forgot the entire conversation we just had. It no longer knew where I wanted to go. I gave it the address and had to go through the whole stupid conversation a second time.

If this is a leaps and bounds improvement, well, compared to what? Assistant is short and to the point. Gemini won't shut up. Ever. It will stop talking when it is damn good and ready. Your desire for it to just shut up and go is immaterial.

Gemini is annoying, stupid, and, like Trump, it has to continually demonstrate how not stupid it is.

If Assistant goes away and Gemini becomes mandatory, I'm buying an iPhone.
The non-determinism (or at least, highly obfuscated determinism) of LLMs is fascinating. I have never had Gemini pop back with suggestions when navigating, but it must have some sort of context switch that tells it to be more or less proactively annoying.
 
Upvote
13 (13 / 0)

Fred Duck

Ars Tribunus Angusticlavius
7,435
These LLMs need an option for concise responses.
"That is a great point! A concise answer would be helpful because it would allow you, the user, to receive the exact answer you needed–with minimal fluff! I can see how that sort of option would be beneficial so of course I'm afraid I can't do that, buback."

You know, I have begun to suspect that Google is using LLMs to assemble press kits. Some oddly consistent errors keep happening.
Management probably thought it would be a fine showcase of their service and how it can assist businesses in future!
 
Upvote
16 (18 / -2)

RoryEjinn

Wise, Aged Ars Veteran
103
Subscriptor
Does anyone have advice for how to run local models on android phones? Specifically, I'd like to run Gemma E4B on a Pixel in airplane mode.
I think it largely varies across your devices (Samsung has like an AI Select thing going on if you can get it working), but it's usually by using an app like Google AI Edge Gallery/AnythingLLM or by using Termux to install and build the Ollama client locally. But you'd need to allow internet access at least as long as it takes to download the model.
 
Upvote
15 (15 / 0)
80GB GPU. If that means VRAM, does that mean the 128GB Framework Desktop could theoretically run this monster?

You can allocate up to 112GB of the unified memory to the GPU under Linux after all.
Actually under Strix Halo Linux will dynamically allocate however much of the shared memory pool it needs/can. And yes.

An 80GB model will not be quick. A lighter quant will probably give similar output much faster. My favorite model on LMStudio on my Framework Desktop is a ~30GB(?) Nemotron model from Nvidia. It is one of the newer, larger models, but it can still do 60 tokens/second without my having to spend my life optimizing the configuration. There are much larger models, but for casual experimentation and play, their output isn't enough better to be worth the much slower speed.
 
Upvote
27 (27 / 0)

CrisR82

Wise, Aged Ars Veteran
125
Apparently, Google forced Gemini onto my phone last night. It came up this morning when I gave an address to navigate to. Thereafter followed an annoying and stupid conversation.

"I know what's at that address, the Emory Brain Health Center. There are several clinics in the building, are you going to one in particular?"

No. shut up and go.

"Here are some places nearby. Would you like to add any of these to your trip?"

No. Shut up and go.

"It seems you want to navigate to the Emory Brain Health Center, Is that correct?"

Yes. Shut up and go.

"To do this, you need to give me permission to access Maps. Please press the OK button on the screen."

Which I did. And the damn thing forgot the entire conversation we just had. It no longer knew where I wanted to go. I gave it the address and had to go through the whole stupid conversation a second time.

If this is a leaps and bounds improvement, well, compared to what? Assistant is short and to the point. Gemini won't shut up. Ever. It will stop talking when it is damn good and ready. Your desire for it to just shut up and go is immaterial.

Gemini is annoying, stupid, and, like Trump, it has to continually demonstrate how not stupid it is.

If Assistant goes away and Gemini becomes mandatory, I'm buying an iPhone.
You do realize you can just disable it right?
I mean yeah, it's dumb to have something you don't want installed on your device, but can we just take a moment to remember that Apple does the same thing?
Anyone else remember the WALLET APP advertising the F1 movie? Don't kid yourself, whatever you buy - if there is a big company behind it, you'll get stuff shoved down your throat. It might be today, it might be tomorrow, it might be in a year, but it absolutely WILL happen.

EDIT: grammar
 
Upvote
11 (17 / -6)

norton_I

Ars Praefectus
5,917
Subscriptor++
80GB GPU. If that means VRAM, does that mean the 128GB Framework Desktop could theoretically run this monster?

You can allocate up to 112GB of the unified memory to the GPU under Linux after all.

How powerful are the iGPU/NPUs in there? Obviously they aren't earth shattering, but are they good enough to do something useful when you care about the response time? How would it compare to running on something like an RTX5090 with the memory shared over PCIe? My impression is that these large models end up mostly memory bound and the penalty for swapping out over PCIe is a bigger impact than FLOPs, but a slow enough processor might flip that.
 
Upvote
1 (1 / 0)

ERIFNOMI

Ars Legatus Legionis
18,134
Apparently, Google forced Gemini onto my phone last night. It came up this morning when I gave an address to navigate to. Thereafter followed an annoying and stupid conversation.

"I know what's at that address, the Emory Brain Health Center. There are several clinics in the building, are you going to one in particular?"

No. shut up and go.

"Here are some places nearby. Would you like to add any of these to your trip?"

No. Shut up and go.

"It seems you want to navigate to the Emory Brain Health Center, Is that correct?"

Yes. Shut up and go.

"To do this, you need to give me permission to access Maps. Please press the OK button on the screen."

Which I did. And the damn thing forgot the entire conversation we just had. It no longer knew where I wanted to go. I gave it the address and had to go through the whole stupid conversation a second time.

If this is a leaps and bounds improvement, well, compared to what? Assistant is short and to the point. Gemini won't shut up. Ever. It will stop talking when it is damn good and ready. Your desire for it to just shut up and go is immaterial.

Gemini is annoying, stupid, and, like Trump, it has to continually demonstrate how not stupid it is.

If Assistant goes away and Gemini becomes mandatory, I'm buying an iPhone.
Wait til you hear who is powering Siri/"Apple Intelligence."

It's Gemini.
 
Upvote
45 (45 / 0)

coburnjohn575

Smack-Fu Master, in training
4
The Open Source and local AI models are conveniently always forgotten in the arguments against AI. The anti-AI arguments usually focus on the environmental cost of data centers and the idea of a few greedy billionaires pushing AI on the masses. But these locally running open source models are probably the future of AI. They kind of puncture those arguments because things that run locally won't require datacenters and the intense cooling needs. And the Open Source nature shows that technology is not just driven by a few people at the top, but is something that comes from the community tinkering and experiment, many times just for the love of it and not for pure greed.
 
Upvote
23 (35 / -12)
How powerful are the iGPU/NPUs in there? Obviously they aren't earth shattering, but are they good enough to do something useful when you care about the response time? How would it compare to running on something like an RTX5090 with the memory shared over PCIe? My impression is that these large models end up mostly memory bound and the penalty for swapping out over PCIe is a bigger impact than FLOPs, but a slow enough processor might flip that.
The Strix Halo iGPU is roughly equivalent to a 5070 or 5070 dGPU laptop chip in terms of gaming compute performance. I have a 395 128GB framework desktop for my daily machine at home running Linux.

IME, the larger, more complex models you run will max out the GPU before the 200GB/second memory bandwidth becomes a limiter, IIRC. But that is because you're dealing with a laptop GPU coupled with a server-level memory pool in terms of size and bandwidth. It is a fun platform to experiment with--particularly if you got one before the price increases hit (it has gone up 25% since before RAMpocalypse), and a very capable machine in a quiet 4L package; thanks to the absurd memory pool I expect it to be in service for a long time.
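
For anyone who wants to sanity-check the memory-bound question, here is a rough back-of-the-envelope sketch. It assumes the simplest possible model of decoding: every generated token has to stream all of the active weights from memory once, so bandwidth divided by weight size gives a ceiling on tokens/second. The function name is made up for illustration, and the estimate ignores compute limits, the KV cache, and prompt processing, so treat it as an upper bound rather than a prediction.

```python
def decode_ceiling_tokens_per_s(bandwidth_gb_s: float, active_weights_gb: float) -> float:
    """Upper bound on decode speed if each token must read all active weights once.

    Ignores compute limits, KV cache traffic, and any caching effects.
    """
    if active_weights_gb <= 0:
        raise ValueError("active_weights_gb must be positive")
    return bandwidth_gb_s / active_weights_gb


# 200 GB/s is the memory figure quoted in the post above.
BANDWIDTH_GB_S = 200.0

# Sizes taken from the thread; both are treated as dense models here (an assumption).
for label, size_gb in [("80GB model", 80.0), ("30GB model", 30.0)]:
    ceiling = decode_ceiling_tokens_per_s(BANDWIDTH_GB_S, size_gb)
    print(f"{label}: at most {ceiling:.1f} tokens/s")
```

Under this simple model, a 30GB model reading all of its weights per token could not reach the 60 tokens/second reported earlier in the thread. One possible explanation (an assumption, not something stated in the thread) is that only a fraction of the weights is read per token, as in a mixture-of-experts model.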

I've never used the NPU. Some folks have. But you rapidly get into configuration and tweaking hell--as opposed to the simple joy of just opening LMStudio and running a model you downloaded without needing to brainstorm how to even get the NPU and GPU to run at once.
 
Upvote
21 (21 / 0)

Xyler

Ars Scholae Palatinae
1,408
Actually under Strix Halo Linux will dynamically allocate however much of the shared memory pool it needs/can. And yes.

A 80GB model will not be quick. And, probably, a lighter quant will get a similar output and much faster output. My favorite model on LMStudio on my Framework Desktop is a ~30GB(?) Nemotron model from Nvidia....it is one of the newer models that is larger but still can do 60tokens/second without needing to spend my life trying to optimize configuration. There are much larger models--but for casual experimentation and play, the output isn't that much remarkably better worth the much slower output.
Neat. Thanks for the response, :)
 
Upvote
4 (4 / 0)

Resistance

Wise, Aged Ars Veteran
712
The Open Source and local AI models are conveniently always forgotten in the arguments against AI. The anti-AI arguments usually focus on the environmental cost of data centers and the idea of a few greedy billionaires pushing AI on the masses. But these locally running open source models are probably the future of AI. They kind of puncture those arguments because things that run locally won't require datacenters and the intense cooling needs. And the Open Source nature shows that technology is not just driven by a few people at the top, but is something that comes from the community tinkering and experiment, many times just for the love of it and not for pure greed.
What makes you think that the huge environmental impact of AI so far and until local becomes predominant is irrelevant?

What evidence do you have that the environmental impact of cooling an office building filled with workstations is less than cooling the same compute done in a datacenter?

What makes you so confident that there will come a point when either: all the very best AI models can be run locally, or, the use of the very best AI models will be insignificant?

What is your source that the anti AI arguments usually focus on the environmental cost?

Name one relevant LLM that purely comes from the community tinkering and experiment.
 
Upvote
-18 (9 / -27)

asnelt

Seniorius Lurkius
11
Subscriptor
Does anyone have advice for how to run local models on android phones? Specifically, I'd like to run Gemma E4B on a Pixel in airplane mode.
I haven't looked into the new Gemma 4 models yet. But I've been running a quantized Gemma 3 12B on my Pixel 9 Pro with llama.cpp in Termux. While there is a pre-compiled llama.cpp package in the Termux repositories, it lacks Vulkan support for the Pixel Mali GPU. Compiling llama.cpp with Vulkan support on the Pixel fixes this. With this setup, text is generated at decent speed on the GPU rather than the CPU.

EDIT: For compiling llama.cpp with Mali Vulkan support in Termux, do make sure to set the -DVulkan_LIBRARY and -DVulkan_INCLUDE_DIR CMake flags. Sorry, it seems Ars won't let me share the full compile incantations here. But this GitLab snippet provides all the details.
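
Not the exact incantation from the snippet, but a minimal sketch of what the configure step might look like, written as a small Python helper that assembles the CMake command. The two Vulkan flags are the ones named above; `-DGGML_VULKAN=ON` is, to my understanding, the switch llama.cpp uses to enable its Vulkan backend. The paths are placeholders (the real locations on a Pixel are not given here), and the helper name is made up for illustration.

```python
import shlex


def llama_cpp_vulkan_configure(vulkan_library: str,
                               vulkan_include_dir: str,
                               build_dir: str = "build") -> list[str]:
    """Assemble a CMake configure command for llama.cpp with Vulkan enabled."""
    return [
        "cmake", "-B", build_dir,
        "-DGGML_VULKAN=ON",                           # enable the Vulkan backend
        f"-DVulkan_LIBRARY={vulkan_library}",         # where the Vulkan driver library lives
        f"-DVulkan_INCLUDE_DIR={vulkan_include_dir}",  # where the Vulkan headers live
    ]


# Placeholder paths: substitute the real locations on your device.
cmd = llama_cpp_vulkan_configure("/path/to/libvulkan.so", "/path/to/vulkan/include")
print(shlex.join(cmd))
```

The printed line can be pasted into a Termux shell from the llama.cpp source directory, followed by the usual `cmake --build build` step.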
 
Upvote
18 (18 / 0)

kahn

Wise, Aged Ars Veteran
128
Subscriptor++
Actually under Strix Halo Linux will dynamically allocate however much of the shared memory pool it needs/can. And yes.

A 80GB model will not be quick. And, probably, a lighter quant will get a similar output and much faster output. My favorite model on LMStudio on my Framework Desktop is a ~30GB(?) Nemotron model from Nvidia....it is one of the newer models that is larger but still can do 60tokens/second without needing to spend my life trying to optimize configuration. There are much larger models--but for casual experimentation and play, the output isn't that much remarkably better worth the much slower output.

I'll try the gemma model tonight. It really looks like it could be a great model to run on Strix Halo, if Google's claims end up being correct.

So far my favourite model on Strix Halo is the GPT-OSS-120B model, which should end up at a similar size and is quite good at a variety of tasks. I get 30-40 tokens/s as well.

The main problem is prompt processing, which is quite slow compared to running models on a GPU.
 
Upvote
8 (8 / 0)
I'll try the gemma model tonight. It really looks like it could be a great model to run on Strix Halo, if Google's claims end up being correct.

So far my favourite model on Strix Halo is the GPT-OSS-120B model, which should end up at a similar size and is quite good at a variety of tasks. I get 30-40 tokens/s as well.

The main problem is prompt processing, which is quite slow compared to running models on a GPU.
Cool.

My biggest complaint...the models that are downloadable on LMStudio are pretty ancient for how fast the sector has been moving. Also, given how Google has made AOSP "open source" but basically useless to anyone not a professional development house--part of me wonders if this Gemma model will be the same. Apache and FOSS, but useless without a massive amount of legwork. I think I tried a few of the Gemma models on LMStudio and they just crashed.

Of course...LMStudio and Strix Halo LLM support are very, very new and unstable on the best of days. I've never actually seen ROCm work IME, although I've chatted online with folks who did.
 
Upvote
6 (6 / 0)

uhuznaa

Ars Tribunus Angusticlavius
8,751
The Open Source and local AI models are conveniently always forgotten in the arguments against AI. The anti-AI arguments usually focus on the environmental cost of data centers and the idea of a few greedy billionaires pushing AI on the masses. But these locally running open source models are probably the future of AI. They kind of puncture those arguments because things that run locally won't require datacenters and the intense cooling needs. And the Open Source nature shows that technology is not just driven by a few people at the top, but is something that comes from the community tinkering and experiment, many times just for the love of it and not for pure greed.

Why should the power needs be less just because it runs distributed locally? If local AI becomes more efficient, running the same models in datacenters will be just as efficient, probably even more so. You will also need to have the hardware there and powered on everywhere, even if it sits idle most of the time.

I mean, I fully agree with the privacy reasons for that, but the hardware costs and energy needs will be at least the same.
 
Upvote
8 (9 / -1)

Boskone

Ars Legatus Legionis
13,180
Subscriptor
E2B and E4B are more interesting to me than most AI models, as I just want a locally-run digital assistant. Being able to run a model basically as-needed on a smallish computer would be fine.

E.g. answer "What's the weather?" with a "dumb" assistant, but "I'm traveling from A to B, what should I expect en route?" could fire up the AI model.

We're pretty much there, but a more capable small model would still be nice.
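
A minimal sketch of that split, assuming a purely rule-based front end: a few patterns catch the requests a "dumb" assistant can answer directly, and everything else is handed to the local model. The patterns and the `route` function are made up for illustration, not taken from any real assistant.

```python
import re

# Hypothetical patterns for requests a rule-based assistant can answer directly.
SIMPLE_PATTERNS = [
    re.compile(r"^what('s| is) the (weather|time|date)\b", re.IGNORECASE),
    re.compile(r"^set (a |an )?(timer|alarm)\b", re.IGNORECASE),
]


def route(query: str) -> str:
    """Return 'rules' for simple lookups and 'llm' for anything open-ended."""
    text = query.strip()
    if any(pattern.search(text) for pattern in SIMPLE_PATTERNS):
        return "rules"
    return "llm"


for q in ["What's the weather?",
          "I'm traveling from A to B, what should I expect en route?"]:
    print(f"{route(q):5s} <- {q}")
```

The point of routing this way is that the model only needs to be loaded when a query falls through the cheap checks, which suits running it as-needed on a smallish computer.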
 
Upvote
4 (4 / 0)

darkowl

Ars Tribunus Militum
2,057
Subscriptor++
I wish they would have benchmarked with a 5080; I assume the performance would be somewhere between the Mac and the 5090.

I am not able to find the requirements needed for their 26B and 31B models in this release.

https://blogs.nvidia.com/blog/rtx-ai-garage-open-models-google-gemma-4/
If you’re running on a 5080 you’ll most likely be relying on community quants, and you’ll probably need at least 24GB for the dense model at any reasonable precision. More if you use a longer context window without limiting it to say 32k. (I think there’s a 32GB 5080 variant?)
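
A rough way to see where a figure like 24GB comes from: weights take roughly parameters times bits-per-weight divided by 8, and the KV cache grows linearly with the context window. The sketch below does that arithmetic. The layer count, KV head count, and head dimension are hypothetical placeholders, not Gemma 4's real configuration, and real runtimes add overhead on top, so treat the numbers as ballpark only.

```python
def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8.0


def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """Approximate KV cache size: one key and one value vector per layer per token."""
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value / 1e9


# Weights for a 31B dense model at a few common precisions.
for bits in (4, 6, 8, 16):
    print(f"31B @ {bits}-bit: ~{weights_gb(31, bits):.1f} GB of weights")

# Hypothetical architecture numbers, NOT Gemma 4's real configuration.
cache = kv_cache_gb(layers=48, kv_heads=8, head_dim=128, context_tokens=32_768)
print(f"KV cache @ 32k context (hypothetical shape): ~{cache:.1f} GB")
```

Models that use sliding-window or otherwise compressed attention can need considerably less cache than this, so the KV figure is best read as a simple worst case for the assumed shape.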
 
Upvote
2 (2 / 0)

CrisR82

Wise, Aged Ars Veteran
125
You can't, though. Gemini is the default for Android Auto and you can't revert once they force you onto it.

I tried. If they've changed that I'm all ears but last I checked once you got moved...you're stuck with Gemini.
Honestly not sure what to tell ya. For all I know, we're both right. I AM aware that some OEMs work differently and that there IS per-region software weirdness with phones, but I just checked on my S25 Ultra and my parents' S21 FE and S24 FE: the S25U/S24FE both allow you to disable it and uninstall its updates, and the S21FE allows you to fully uninstall it. All 3 were purchased in the EU, 2 from Samsung directly and the other 1 from a local tech store.

I don't have access to a car with Android Auto so can't check that, but I did notice the default apps menu has a selection toggle for Virtual Assistant that has Gemini pre-selected, maybe if you change it there to something else, that'll help? (if possible do reply with info on this, I'm genuinely curious)
 
Upvote
1 (2 / -1)