Running local models on Macs gets faster with Ollama’s MLX support

newbie question:
Are there security and privacy issues with this?

Thanks
Seems like a perfectly good newbie question to me.

I can't see that there would be any security or privacy issues with something that runs entirely locally and doesn't reach out to the internet. But who really knows? Could there be surreptitious code running in the background that reports activity we don't want reported? I doubt this presents much of a privacy/security threat.

My follow-up question is, Why would anyone downvote the OP's question, a question that's prefaced by "newbie question"?

Edit: Added verb
 
Upvote
15 (15 / 0)

Still Breathing

Ars Centurion
258
Subscriptor
Also remember that Macs top out at 256GB RAM and AMD tops out at 128GB RAM, and both can be clustered up to 4x machines.
Macs (currently) go to 512GB with the M3 Ultra. I just sold one for twice what I paid for it. I'm still expecting the M5 Ultras to support 512GB as well when they launch, even if the largest M5 on sale right now is 256GB.
 
Upvote
3 (4 / -1)

RecycledHandle

Wise, Aged Ars Veteran
184
Subscriptor++
None. This solves all the security and privacy issues with cloud based AI.
Nope. Maybe you're being sarcastic, but if not: these tools will search anything on your computer and may make web searches with that data to solve your problem. That could be your username, your code with its access tokens, or anything else.

Saying everything will stay local is untrue.
 
Upvote
-2 (1 / -3)
Nope. Maybe you're being sarcastic, but if not: these tools will search anything on your computer and may make web searches with that data to solve your problem. That could be your username, your code with its access tokens, or anything else.

Saying everything will stay local is untrue.
So, a newbie question here:

Can an air-gapped computer run a local AI?
 
Upvote
2 (2 / 0)

RecycledHandle

Wise, Aged Ars Veteran
184
Subscriptor++
So, a newbie question here:

Can an air-gapped computer run a local AI?
Yes, all the inference happens locally. Make sure you have good documentation accessible in your environment if what you're trying to do is coding or documentation. You also can't match the performance of a frontier model run by one of the big providers (in either speed or quality), but it can still be very useful.
 
Upvote
3 (4 / -1)
Not for Mac users ;)
Apple RAM pricing makes sense--for the Studio series. Unlike PC memory, which exceeds JEDEC spec only to varying degrees, the Studio line has had memory far faster than anything a desktop PC can even POST at, never mind run stably (800GB/second??!). The only way a PC gets remotely close to, or surpasses, 800GB/s is a workstation/server Threadripper or Epyc system with quad-channel or higher memory, which costs more. So yes, they're expensive--but you're getting a faster desktop product for the money.

Catch being--that's the Studio. The Mini and other devices have much more JEDEC-spec-like memory, the kind you'd find in a PC, and there the upgrade pricing is absolutely highway robbery.
 
Upvote
4 (6 / -2)
Yes, all the inference happens locally. Make sure you have good documentation accessible in your environment if what you're trying to do is coding or documentation. You also can't match the performance of a frontier model run by one of the big providers (in either speed or quality), but it can still be very useful.
So, the entire dataset needs to also be on the local machine. Thus any local AI would seem to be very specialized and limited.
 
Upvote
0 (3 / -3)
So, the entire dataset needs to also be on the local machine. Thus any local AI would seem to be very specialized and limited.
Well, it depends on what model you're wanting to run and what you're wanting to do with it--and the hardware you have to throw at the problem. There are low-end models that will run "fine" on an 8GB GPU and be pretty fast. But if you're wanting to run a fat 120b parameter model, you need to have a lot of memory to throw at it. Bad metaphor time--a Chromebook is a pretty awful computer but if all you're doing is email and watching YouTube it is "fine".

On the PC side...Strix Halo platforms are nowhere near the fastest--but for $1500-$2000ish bucks you cannot beat the memory allocation available for inference. If you have $4000, a Mac Studio is a faster system with similar or more RAM available.
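(Rough sizing math, if it helps: at a 4-bit quant, a 120B-parameter model needs roughly 120B × 0.5 bytes ≈ 60GB just for the weights, plus several more GB for context/KV cache, which is why 64GB+ of memory addressable by the GPU is effectively the floor for models that size.)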
 
Upvote
1 (1 / 0)

SuperOuss

Smack-Fu Master, in training
55
What tool(s) would you recommend in place of Ollama on macOS? LMStudio or some other option?
I use vLLM to load and test models locally. I also use it to deploy and serve models on external infra. It has been solid for our testing.

When I tried Ollama, I ran into performance degradation the more a model was used (locally and on external infra, across lots of different models).
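For anyone curious, the offline vLLM workflow is only a few lines of Python. A minimal sketch, assuming a placeholder model name you'd swap for whatever you're testing:

from vllm import LLM, SamplingParams

# load a model from Hugging Face (the repo name here is only an example)
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=256)

# batch one or more prompts; each result carries the generated text
outputs = llm.generate(["Explain unified memory in one paragraph."], params)
print(outputs[0].outputs[0].text)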
 
Upvote
0 (0 / 0)
Not sure why you’re being downvoted. I ordered the same earlier today with a delivery date of 4/20.
Here you go, one for you and one for the OP:

 
Upvote
2 (4 / -2)

Resistance

Wise, Aged Ars Veteran
549
Not necessarily. Training the models in the first place is significantly harder than running them. Orders of magnitude harder. So, companies still need humongous datacenters to iterate more and more refined models.

After the AI bubble pops or deflates, some, but not all of those AI datacenters (both the training and the inference ones) will be repurposed to HPC, VDI or normal cloud, depending on the hardware within.
It is unclear how much demand there will be for new models for desktop use.

To my knowledge, training and inference datacenters are pretty much the same.

Both VDI and "normal cloud" datacenters are different enough that (when considering TCO) it may well be cheaper to scrap an AI datacenter and build a new one somewhere else than to retrofit it.

As for HPC: assuming your workload is suitable for an AI datacenter as-is (a big assumption), you're still going to have problems. AI GPUs are rapidly depreciating assets, both in that they quickly become outdated and in that they wear out. Also, there appears to be a vastly greater quantity of them than could ever economically be used for HPC.

Reusing the planned AI datacenter capacity on any meaningful scale is a wild fantasy. Doing so with existing capacity is also unlikely.
 
Upvote
0 (0 / 0)

salinmooch

Smack-Fu Master, in training
68
Subscriptor
I would not call it "almost as good." The things that Claude Opus 4.6 and GPT 5.4 can do are insane.
Definitely depends on the use case. I have a local model I use for local home assistant control and it has been good enough for the last few months (meets spousal certification).

I really use it so the rest of the family can make casual requests to query the house without needing to be too pedantic, and I don't need it to handle conversation, coding, or web-related requests. Exposing entities carefully and defining intents makes for a tight, flexible, and private system.
 
Upvote
5 (5 / 0)

Control Group

Ars Legatus Legionis
19,312
Subscriptor++
So, the entire dataset needs to also be on the local machine. Thus any local AI would seem to be very specialized and limited.
Not sure what you mean - I'm running Qwen 2.5 at 32 billion parameters and a...4 bit? quant (I'm not sitting at that machine right now, so I'm not 100% positive) on a 4090 at home, and it works fine isolated from the internet. It's using thirty-ish GB of disk, which isn't exactly tiny, but it's a half to a third the size of a modern AAA game, so it's not exactly enormous either.

I've not run into any particular limitations, at least for my uses - some hobby JS coding, some TTRPG scenario development, some fiction writing. It's not as good as the frontier models, unsurprisingly, but it's plenty good enough to be useful.

Unless the limitation you're talking about is "doesn't have access to all of the internet's information," in which case...yeah. Being disconnected does impose that limitation.
 
Upvote
4 (4 / 0)
Apple RAM pricing makes sense--for the Studio series. Unlike PC memory, which exceeds JEDEC spec only to varying degrees, the Studio line has had memory far faster than anything a desktop PC can even POST at, never mind run stably (800GB/second??!). The only way a PC gets remotely close to, or surpasses, 800GB/s is a workstation/server Threadripper or Epyc system with quad-channel or higher memory, which costs more. So yes, they're expensive--but you're getting a faster desktop product for the money.

Catch being--that's the Studio. The Mini and other devices have much more JEDEC-spec-like memory, the kind you'd find in a PC, and there the upgrade pricing is absolutely highway robbery.
Oh, the Apple system is faster, but it is a much more expensive machine. Those top Studio RAM configs are so expensive to upgrade because it's ~800GB/second memory, which is 4x faster than Strix Halo.

And the memory speed isn't an AMD problem--it's a PC problem. Strix Halo is unique because it's one of the only ways to actually get guaranteed 8000MT/s, AKA 200GB/second, memory in a desktop without buying workstation gear. You can buy RAM kits claiming 8000+ MT/s--but good luck finding a PC CPU platform that can reliably hit those speeds and timings and be production stable. The only other way is a Threadripper with quad-channel memory--and just the mainboard and CPU will cost more than the Strix Halo.
Sorry, but you two are wrong. There's nothing special about Apple's memory; they're the same memory chips everyone else is buying. To get that bandwidth they use the same trick as GPUs: a large number of memory channels (64 16-bit channels). While there is no desktop or workstation CPU with similar bandwidth, there are server-class CPUs that match it, such as Intel's Granite Rapids-AP (for example, the Xeon 6979P), which can use MRDIMM DDR5 at 8800MT/s in a 12-channel configuration. The workstation version tops out at 8 channels at the same memory speed, though.
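(Back-of-the-envelope check, assuming the LPDDR5-6400 Apple uses on the Ultra parts: 64 channels × 16 bits = a 1024-bit bus, and 1024/8 bytes × 6400 MT/s ≈ 819 GB/s, which is where the ~800GB/s figure comes from.)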
 
Upvote
-3 (3 / -6)
Not sure what you mean - I'm running Qwen 2.5 at 32 billion parameters and a...4 bit? quant (I'm not sitting at that machine right now, so I'm not 100% positive) on a 4090 at home, and it works fine isolated from the internet. It's using thirty-ish GB of disk, which isn't exactly tiny, but it's a half to a third the size of a modern AAA game, so it's not exactly enormous either.

I've not run into any particular limitations, at least for my uses - some hobby JS coding, some TTRPG scenario development, some fiction writing. It's not as good as the frontier models, unsurprisingly, but it's plenty good enough to be useful.

Unless the limitation you're talking about is "doesn't have access to all of the internet's information," in which case...yeah. Being disconnected does impose that limitation.
Yeah, "doesn't have access to all of the internet's information" is what I'm getting at.

I don't doubt that a single, powerful computer can locally run sophisticated AI applications. I'm just thinking that the only data the AI has to work with are local files, which (I surmise) limits the AI's range of output.
 
Upvote
0 (0 / 0)

Resistance

Wise, Aged Ars Veteran
549
Yeah, "doesn't have access to all of the internet's information" is what I'm getting at.

I don't doubt that a single, powerful computer can locally run sophisticated AI applications. I'm just thinking that the only data the AI has to work with are local files, which (I surmise) limits the AI's range of output.
I suppose it depends on what you mean by "sophisticated AI applications." The only LLM application I can think of off the top of my head that can't be done locally is "deep research," which requires an internet connection and the ability to circumvent rate limiting and other bot mitigations. Which applications are you thinking of?
 
Upvote
0 (0 / 0)
Yeah, "doesn't have access to all of the internet's information" is what I'm getting at.

I don't doubt that a single, powerful computer can locally run sophisticated AI applications. I'm just thinking that the only data the AI has to work with are local files, which (I surmise) limits the AI's range of output.
Copies of Wikipedia are available for downloading. I'm not familiar with the specifics, but entities are doing this around the clock.

That's a start for a set of general data.
 
Upvote
3 (3 / 0)

kmcmurtrie

Ars Centurion
223
Subscriptor
It says "Please make sure you have a Mac with more than 32GB of unified memory."

Not sure if it will work with exactly 32GB.
I ran a 405b parameter model on my Linux computer with only 128GB RAM and 500GB swap. It works.

It also needed 30+ hours to answer complex questions, so it's all a matter of how long you're willing to wait 🙃
 
Upvote
4 (4 / 0)

uhuznaa

Ars Tribunus Angusticlavius
8,683
I still hope and wish for a future that is local. The stupid datacenters sucking up all the RAM is ruining it, but I really do think the road forward is local, home-based LLM evaluation servers. Then they can safely hold your entire life - your calendar, your texts, your emails, your search history - and really be your personal assistant, toolchain, and more. We just need slightly more powerful local setups, and someone to figure out how to get a frontier-quality model onto your local system, either through piracy, distillation, or some sort of licensing agreement.

While I share the sentiment (personal AI assistants are only really useful when they have access to as much of your personal data as possible, and that's not something I would want to do with any LLM running out there), this will NOT help the RAM situation. Any local/personal setup will be idle most of the time, while cloud-based systems can run at basically 100% utilization.

The AI/privacy situation is heavily leaning towards giving up on privacy, for reasons. And just look at the good old cloud for personal data storage and syncing: it's actually quite easy and cheap to do locally, and STILL most people prefer to throw everything at the services Google, MS, Apple, or others offer, just for convenience.

I mean, if your email, calendars and documents live at Google anyway, why not use Google's AI on them then?

Even in the best case any halfway useful local setup will be expensive enough that hardly anyone will bother with it.
 
Upvote
1 (1 / 0)
I haven't played with Ollama per se. It does look like it's got more features than what Apple includes with the MLX code, which makes sense, as most of the scripts in the MLX setup are basically demos. I'm kinda surprised that Ollama only wants to run a single model, though. Using mlx-lm you can have most models from Hugging Face running as a local chatbot in two or three commands.
Ollama supports a ton of different models. It is only limited to a single model (so far) when it comes to supporting this new feature, and only in preview.

I have 30+ models installed using ollama on my M2 Max Studio at the moment.
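(For comparison, the mlx-lm route the quoted post mentions really is that short in Python; a rough sketch, where the repo name is just an example of an MLX-converted model on Hugging Face:)

from mlx_lm import load, generate

# downloads and caches the model on first run; repo name is only an example
model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")
print(generate(model, tokenizer, prompt="Why is the sky blue?", max_tokens=200))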
 
Upvote
4 (4 / 0)
Nope. Maybe you're being sarcastic, but if not: these tools will search anything on your computer and may make web searches with that data to solve your problem. That could be your username, your code with its access tokens, or anything else.

Saying everything will stay local is untrue.

Is that really a problem with ollama per se? As opposed to using ollama as the AI back-end to various tools that you separately decide to give access to the internet?

If you use ollama with, let's say, open webui to run it as a chatbot only, it is not making requests to the internet. If you use ollama from CLI, same thing.

Yes, installing openclaw on your primary system with all of your data, using local models which are less capable (and more prone to doing dumb things), probably increases your risks. I'm not sure it is fair to say this is an ollama issue though, as opposed to an openclaw issue.

(BTW, for those who are interested in using openclaw without giving it access to everything on your mac, there are numerous workarounds. Docker/Podman locally as an example. Right now, I run openclaw on a dedicated raspberry pi which still uses the mac studio for AI processing by sending API calls across my LAN. It uses my Mac's GPU and unified memory for processing AI tokens, but otherwise only has access to what is on the Raspberry Pi itself. ie. nothing)
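If anyone wants to replicate that split, the calls the Pi makes are just plain HTTP against Ollama's API. A minimal sketch in Python, assuming a placeholder LAN IP for the Mac, a model you've already pulled, and that Ollama has been told to listen on the network (e.g. via OLLAMA_HOST=0.0.0.0):

import requests

# the Mac runs Ollama; this can run on any machine on the LAN (IP and model name are placeholders)
resp = requests.post(
    "http://192.168.1.50:11434/api/generate",
    json={"model": "llama3.1", "prompt": "Summarize my day.", "stream": False},
    timeout=300,
)
print(resp.json()["response"])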
 
Upvote
2 (2 / 0)
So, a newbie question here:

Can an air-gapped computer run a local AI?
Yes. Also, the comment you are replying to is conflating risks with tools you may use that connect to Ollama (ie. openclaw) with privacy issues with ollama proper.

You can use ollama without connecting it to tools (depends on your use case). You can use ollama on an air gapped computer. You can even use ollama with tools on an air-gapped computer, although that obviously won't support any internet use cases.
 
Upvote
2 (3 / -1)
So, the entire dataset needs to also be on the local machine. Thus any local AI would seem to be very specialized and limited.
Upvoted because you are trying to understand. I don't get the downvotes.

To answer: Sort of. The model itself will have some capabilities it "learned" from its training data, which doesn't require the reference material (aside from the model itself) to be local. You can also use tools like openclaw to give your local AI the ability to reach out to the internet (although that will defeat or weaken the privacy stance you were asking about earlier).

Overall though, you are not going to get the speed, model capability and breadth of access to reference information by running a model locally, versus the huge models running in the cloud. If your requirements are light and you only need it to reference private data, it could still meet the mark.
 
Upvote
1 (2 / -1)

numerobis

Ars Tribunus Angusticlavius
50,882
Subscriptor
What's the J/compute ratio for home compute versus giant datacenter compute?

I assume it's slightly worse, though maybe the lower volume of cooling at home makes up for it (indeed, in the heating season, any electricity use I have at home is free; I'd otherwise just be lighting up the radiators).
 
Upvote
0 (0 / 0)

numerobis

Ars Tribunus Angusticlavius
50,882
Subscriptor
Yes. Also, the comment you are replying to is conflating risks with tools you may use that connect to Ollama (ie. openclaw) with privacy issues with ollama proper.

You can use ollama without connecting it to tools (depends on your use case). You can use ollama on an air gapped computer. You can even use ollama with tools on an air-gapped computer, although that obviously won't support any internet use cases.
I look forward to openclaw on an air-gapped computer using social engineering to cross the air gap.
 
Upvote
3 (3 / 0)