Yeah, I get the feeling that this is almost entirely hype-based. AI is all the rage right now, so this lets AMD do a press release that basically says "see, we're selling more AI-related thingies." Yay. It's fine if you want to make money on AI, but if that's your goal you probably need to convince more hyperscalers to fork over the big money for your data center GPUs. So far they're making some progress on that front, but a lot less than I think most of their investors were expecting.

The NPU is weaker, but supposedly far more efficient for the imagined use case. The idea is that an integrated NPU can perform a relatively basic LLM/inference-related task more efficiently than the usual "APU" design of CPU + iGPU. On paper, sure, this checks out. In reality... I think there is little utility for the average consumer or business in the imagined scenario of efficiently accelerating compute for a small local model.
Why is it useless? Well, the tiny local models that could make sense would have to be highly tailored to offer any real utility at a size that can fit (an open 7B model or smaller, say). A story in the Economist last month covered this and explained (with survey data) that only a small share of workers use "AI" daily, and that the vast majority of those users only reach LLMs via cloud providers, as a glorified search/reference system. The tiny local models that fit within 4-16GB are terrible at that particular task: few are web-search/agent enabled, and small models lack context and hallucinate more.
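To put rough numbers on "a size that can fit": weight memory scales linearly with parameter count and bits per weight, so a quick back-of-envelope sketch (illustrative figures only; weights alone, ignoring KV cache and runtime overhead) shows why a 4-16GB budget caps you at roughly 7B-13B models:

```python
# Back-of-envelope check (illustrative numbers): how much memory a local
# model's weights need at common quantization levels. KV cache and
# runtime overhead add more on top of this.

def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB for a dense model."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

for params in (7, 13, 70):
    for bits in (16, 8, 4):
        print(f"{params:>3}B @ {bits:>2}-bit: ~{weights_gb(params, bits):.1f} GB")
```

Even at aggressive 4-bit quantization, a 7B model needs ~3.5GB and a 70B model ~35GB, which is why the models that fit on these devices are exactly the ones that hallucinate their way through reference tasks.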
So the stated goal of power efficiency for a local model is all but pointless when the average model that might actually provide utility is larger than the device can realistically run. Generally speaking, I think you are correct that NPUs are wasted silicon. This functionality would make more sense baked into a power-hungry iGPU on all of these processors, even if that means less efficiency for the rare user who actually needs the NPU functionality. At least the silicon would be more likely to be utilized in that scenario.
If they really wanted to ramp up AI use for local models on desktops and laptops, what they really need to do IMO is develop a new memory interface, connecting both the CPU and the discrete GPU, that allows for a unified memory architecture. Your 5090 would then come with zero memory and just use whatever system RAM you have. That might be 16GB for a budget system or 256GB for a high-end system. THAT would allow running local LLM models big enough to actually be good, even on lower-performance GPUs. The performance of your GPU would basically determine how fast you got your result, while the amount of memory would determine which model you could run, and those two things would be unrelated to each other (see the sketch below). I don't really see a situation where an NPU is relevant for an AI model big enough to give good results. Can it run Copilot? Probably. Is Copilot worth using? Not from what I've noticed.
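The capacity-vs-speed decoupling falls out of the arithmetic: batch-1 LLM decoding is memory-bound, since every generated token streams all the weights once. A rough sketch under assumed, illustrative numbers (400 GB/s bandwidth, 4-bit weights, 20% of RAM reserved for KV cache and the OS; none of these are benchmarks):

```python
# Sketch of the point above: in a unified-memory design, capacity decides
# WHICH model you can load, while bandwidth decides HOW FAST it decodes.
# All numbers here are illustrative assumptions, not measurements.

def max_params_b(ram_gb: float, bits: int = 4, overhead: float = 0.8) -> float:
    """Largest dense model (in billions of params) that fits,
    leaving ~20% headroom for KV cache and the OS."""
    return ram_gb * overhead * 8 / bits

def decode_tok_s(model_gb: float, bandwidth_gb_s: float) -> float:
    """Batch-1 decode is memory-bound: each token streams all weights once."""
    return bandwidth_gb_s / model_gb

# Budget box vs. enthusiast box, same hypothetical 400 GB/s memory system.
for ram in (16, 256):
    p = max_params_b(ram)
    model_gb = p * 0.5  # 4-bit weights: ~0.5 GB per billion params
    print(f"{ram:>3} GB RAM -> up to ~{p:.0f}B params, "
          f"~{decode_tok_s(model_gb, 400):.0f} tok/s at 400 GB/s")
```

More RAM buys a bigger (better) model; at fixed bandwidth it decodes slower per token, but it runs at all, which is the whole point.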
Err, why not use the NPU for other ML stuff? Why are we discussing LLMs here? Running an LLM on an NPU is kinda dumb for all the reasons that you explained so well. So why even consider it?

Yeah, if they tweak things so that the NPU and GPU can work together, then I can see an NPU being a value add. But not if it's either/or. And yeah, Apple is kind of the proof of concept: if they can do it, then obviously it can be done. That would open up options for better customization of PCs, especially desktops where it's easier to mix and match parts. A gamer might want a 5090 and only bother with 32GB of system memory. Someone wanting to play around with AI on a shoestring budget might opt for a 5070 but spring for 128GB of unified memory, allowing them to make use of larger models even if performance was a bit lower.

That's literally what Apple did. And MLX will split up tasks between GPU and NPU depending on which is more suitable, or use both. NPUs are still more efficient at those kinds of things computationally; they just generally don't have the memory bandwidth that the GPU has. Fix that bandwidth problem, and the NPU will be pretty clearly better. Apple CPUs are dual-channel for the base chip, 4x for Pro, 8x for Max, and 16x for Ultra, with all cores having equal access. That's why they have the option to use the NPU for DLSS-style upscaling: the GPU just passes it a pointer, and the NPU passes one back. No need to copy over a PCI bus.
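On Apple platforms, this kind of per-op placement is exposed most directly through Core ML's compute-units setting, which lets the runtime pick the Neural Engine (Apple's NPU) where it helps and fall back to GPU/CPU otherwise. A minimal sketch with coremltools; "SomeModel.mlpackage" is a hypothetical placeholder, not a real artifact:

```python
# Hedged sketch: Core ML decides per-op whether work runs on the CPU,
# GPU, or Neural Engine, based on the compute_units hint.
import coremltools as ct

# Let the runtime choose: ANE where supported, GPU/CPU elsewhere.
model = ct.models.MLModel("SomeModel.mlpackage",
                          compute_units=ct.ComputeUnit.ALL)

# Or pin work away from the GPU to compare NPU efficiency directly.
model_ne = ct.models.MLModel("SomeModel.mlpackage",
                             compute_units=ct.ComputeUnit.CPU_AND_NE)
```

Because all of those engines sit on the same unified memory, switching compute units changes where the math runs, not where the tensors live, which is exactly the pointer-passing behavior described above.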
There are two problems with Copilot IMO.

I've yet to meet someone who actually wants a Copilot+ system. Like, I guess in theory they must exist somewhere. But I haven't seen one in person.