Nvidia’s “Chat With RTX” is a ChatGPT-style app that runs on your own GPU

Well, locally is the only way I'd be willing to use an AI in most cases, but doesn't this approach make it an echo chamber of my own making? I'd be producing an AI entirely influenced by my own biases.
True enough but there are use cases where the echo chamber doesn't have to be completely restrictive. With a broad-ranging and somewhat well curated source of data, this could be a useful system for searching and synthesizing a significant body of information.

For my own hoped-for use case, I find myself to be the reluctant and half-assed librarian of a collection of thousands of scientific papers, all gathered by myself and other scientists at a small biotech company. I've read maybe a few dozen in great depth, and skimmed another few hundred, but I frequently find myself looking for answers that are probably somewhere in that other few thousand. A LLM could take natural language inquiries and help us find useful stuff, particularly when we don't know enough about a topic to have the right keywords off the top of our heads.

Our library is a bunch of PDFs sitting in a database on a server with a web interface, but I sync them all to local storage because I don't like doing a lot of reading in a web browser. A local implementation like Nvidia's could be what lets me do this while also keeping IT security and IP lawyers happy. And an off-the-shelf package from Nvidia might let me do it myself, without launching a dedicated project for the Data Science team.
 
Upvote
19 (19 / 0)

Toastr

Ars Tribunus Militum
1,816
Well, locally is the only way I'd be willing to use an AI in most cases, but doesn't this approach make it an echo chamber of my own making? I'd be producing an AI entirely influenced by my own biases.
I think you may be misunderstanding here. The model runs locally, as in it isn't pinging the cloud or forwarding your requests or results anywhere. But it's still using a starting model (well, any of several models), then connecting it to your files as a dataset. You're not locally training the full model from scratch.
 
Upvote
18 (18 / 0)

Ten Wind

Ars Tribunus Militum
1,911
I'm actually more interested in this opensource project https://github.com/openlm-research/open_llama

As the training data is being shared by the community. That is huge. Which leads into the question can this use those training models?
If you check out the linked video it looks like ollama is an option along with mistral for this. Seems to pretty much be a wrapper for those models with a nice Gui, so you can play with them in windows without using wsl.
 
Upvote
5 (5 / 0)
That was good where the article at the end had the real-world critical test report, but Ars should move beyond this "Product Announcement" trend of LLM articles/coverage. (I almost said "news coverage" there instead of "article coverage", but in our current world nobody can or will come up with "news" that isn't corporate announcements of products...when surely there must be more to discuss, i.e. articles on a topic and critical analyses etc that are not necessarily "news" and aren't PR messaging.) Right now on the front-page there are "NVIDIA CEO said something" and "NVIDIA has new product" headlines, and this isn't unusual.



Marketing is a different thing. Journalism isn't supposed to just be the marketing under a different org name. We come to learn what the marketers and CEO won't say in a carefully prepared statement boasting about their newest products and tailored solely exclusively toward their own benefit not the public's.
I was hired to relay AI news quickly and briefly every day, and sadly that doesn't give me a chance to go in-depth on every topic. Just today there are 6-7 topics that are highly newsworthy that I can't hit due to time constraints. I wish I could. It's frustrating that I can't. Many times the news is about things I can't personally use or test, but people still need to know about them, and the news cycle moves very quickly. So I opt to try to hit them briefly so you can know about them, and then I hope to go in-depth on other topics strategically, where time allows.

I'm a big fan of presenting information and letting readers decide without necessarily telling them what to think. There are exceptions to that, of course, where I take a harder tack on something, but I provide critical perspectives where appropriate, including in this particular article, which does not simply parrot a press release. Would Nvidia marketing want you to know that its fancy new Chat with RTX demo app crashes frequently, is an ungainly mess of dependencies, feels like an open source re-skin, and is rough around the edges? I doubt it. But this article lets you know, so we're not just an extension of marketing.
 
Upvote
57 (57 / 0)

ScifiGeek

Ars Legatus Legionis
19,110
Well, locally is the only way I'd be willing to use an AI in most cases, but doesn't this approach make it an echo chamber of my own making? I'd be producing an AI entirely influenced by my own biases.

Is this a general chat client? It sounds more like it's an advanced search engine through your local files.
 
Upvote
0 (1 / -1)
I think you may be misunderstanding here. The model runs locally, as in it isn't pinging the cloud or forwarding your requests or results anywhere. But it's still using a starting model (well, any of several models), then connecting it to your files as a dataset. You're not locally training the full model from scratch.
I may be misunderstanding. You download an existing LLM to use as the backend, but it's also my understanding that you can point it to sources of your own choosing. Are those sources only meant to be searched, not also ingested? I suppose I always thought that whatever an LLM is searching, it is also adding to its training data, but that may not be the case (or may depend on the model).
 
Upvote
1 (1 / 0)

OrangeCream

Ars Legatus Legionis
56,695
Well, locally is the only way I'd be willing to use an AI in most cases, but doesn't this approach make it an echo chamber of my own making? I'd be producing an AI entirely influenced by my own biases.
No? I think you're misunderstanding how this is useful if you're thinking it as an echo chamber.

As a first draft it's rough, but the idea is that it can summarize all your data; like an enhanced Spotlight:
https://en.wikipedia.org/wiki/Spotlight_(Apple)Spotlight is a selection-based search system, which creates an index of all items and files on the system. It is designed to allow the user to quickly locate a wide variety of items on the computer, including documents, pictures, music, applications, and System Settings. In addition, specific words in documents and in web pages in a web browser's history or bookmarks can be searched.

So the example given was a query, "What was the restaurant Sarah recommended", and two matching txt files were given. Imagine an enhanced future version built into your OS:
What was my AGI for the 2021 tax year? -> Your 2021 taxes were already indexed and summarized, and the query prompt most closely matches a pdf with the 2021 tax filing information, and AGI happens to be one of the fields in it. You can then open the document, after seeing the answer, and import it into your 2022 copy of TurboTax, for example

How many times did I donate to the food bank in 2022? -> All the various tax receipts in your emails, saved to your receipts folder, and maybe even attached to text messages are indexed and now discovered by the prompt. It might even generate a text file collecting all that information with links to the various sources

It seems awfully promising to me.
 
Upvote
7 (7 / 0)

JudgeMental

Ars Centurion
351
Subscriptor++
I was hired to relay AI news quickly and briefly every day, and sadly that doesn't give me a chance to go in-depth on every topic. Just today there are 6-7 topics that are highly newsworthy that I can't hit due to time constraints. I wish I could. It's frustrating that I can't. Many times the news is about things I can't personally use or test, but people still need to know about them, and the news cycle moves very quickly. So I opt to try to hit them briefly so you can know about them, and then I hope to go in-depth on other topics strategically, where time allows.

I'm a big fan of presenting information and letting readers decide without necessarily telling them what to think. There are exceptions to that, of course, where I take a harder tack on something, but I provide critical perspectives where appropriate, including in this particular article, which does not simply parrot a press release. Would Nvidia marketing want you to know that its fancy new Chat with RTX demo app crashes frequently, is an ungainly mess of dependencies, feels like an open source re-skin, and is rough around the edges? I doubt it. But this article lets you know, so we're not just an extension of marketing.
Is there any chance that some targeted deep-dives could become a somewhat more frequent thing? Y'all do a good job with going broad and have a few solid articles about AI/ML as a whole. But I'd dearly love to see articles that go over some of these tools/projects in more detail.

For myself, I'd love a combination of technical exploration/practical usability (ie, I want to use this model myself, what is it good at and how easy is it to work with?) for those projects that are either designed for use by individuals/small teams, whether that's a front end like 'Chat with RTX' or locally-runnable models like OpenLLaMA or the various GPT4ALL incarnations. Dream would be some form of review article based on hands-on experimenting, though I'd be quite happy with something less - said dream sounds like a ton of work up front given the nature of the field.
 
Upvote
8 (8 / 0)
It's funny that this article is next to this one:

AI-powered romantic chatbots are a privacy nightmare​

https://meincmagazine.com/ai/2024/02/ai-powered-romantic-chatbots-are-a-privacy-nightmare/

Perhaps Nvidia's solution solves this problem...
Sigh
There's a neckbeard in his parents' basement figuring out how to interface this with a Real Doll right now.
/s

More seriously, though, I suspect that yes, this might eventually take a bite out of that market. But I also suspect that the commercial enterprises will just pivot to selling avatars and preconfigured training data sets for turn-key setups. Along with a EULA that entitles them to... ahem... "Telemetry" for "diagnostics and quality improvement." Lots and lots and lots of personal data exfiltration "telemetry."
 
Upvote
4 (5 / -1)

AmoebaOfDoom

Seniorius Lurkius
46
Subscriptor++
I'm worried that even locally run, that being a vended product from a company means that it will have to have telemetry to support whatever PM in charge of this needs to show that it's driving sales of GPUs. And that usually doesn't' stop there. They'll want to know what users are doing with it so they can "make it better" and sell more GPUs.
 
Upvote
5 (7 / -2)

Chaster Mief

Ars Centurion
279
Subscriptor
I'm worried that even locally run, that being a vended product from a company means that it will have to have telemetry to support whatever PM in charge of this needs to show that it's driving sales of GPUs. And that usually doesn't' stop there. They'll want to know what users are doing with it so they can "make it better" and sell more GPUs.
lol, exactly what I was thinking, only 2 min faster than me!
 
Upvote
-2 (0 / -2)

aexcorp

Ars Praefectus
3,317
Subscriptor
I'm worried that even locally run, that being a vended product from a company means that it will have to have telemetry to support whatever PM in charge of this needs to show that it's driving sales of GPUs. And that usually doesn't' stop there. They'll want to know what users are doing with it so they can "make it better" and sell more GPUs.
If Geforce Experience is any guide, there will be plenty of telemetry data shared. Maybe it can be opted out though...
 
Upvote
-1 (2 / -3)

mobby_6kl

Ars Scholae Palatinae
1,118
I may be misunderstanding. You download an existing LLM to use as the backend, but it's also my understanding that you can point it to sources of your own choosing. Are those sources only meant to be searched, not also ingested? I suppose I always thought that whatever an LLM is searching, it is also adding to its training data, but that may not be the case (or may depend on the model).
Training models is very computationally intensive. However you can give a trained model some text and it will "understand" it and you can ask questions about it. As I understand there's usually a pretty limited amount it can remember like that.
 
Upvote
2 (2 / 0)

mikael110

Wise, Aged Ars Veteran
150
Unless the LLM has built-in timeouts to stop itself from vanishing down recursive rabbit holes and getting lost in semantic weeds, and running it on below spec hardware means those timeouts fire after much less processing power has been spent, leading to low-quality output results.
I'm sorry, but this post is nonsense. That's not how LLMs work at all. There is no built-in timeouts or anything of the sort. The compute power you have has no effect on the quality that an LLM outputs. It affects the speed and nothing more. I've studied and been active in the LLM scene for a number of years now. So I can say that with confidence.

The quality depends mostly on the size and how well trained the model is. While bigger models are usually better, a poorly trained large model can be far worse than a well trained small one. The models shipped with Chat With RTX is a 7B model and a 13B model, though the installer only sets up the 13B model if your card has 16GB+ of VRAM. 7B is generally considered the smallest size that is usable for general tasks. It's not amazing, but it can do a number of things decently well.

Bigger models like Mixtral are closer to what you get from something like GPT-3.5. But Mixtral requires over 24GB of VRAM even with 4-bit quantization, which is what is being used by this app. With 3-bit quantization it will just about fit in 24GB, though it is a tight squeeze. But that type of quantization is not supported by TensorRT-LLM at the moment, which is the backend that Chat with RTX is using.

No, you have no idea how to troubleshoot a problem. You say "Oh, they got really slow when they exceeded my VRAM." That doesn't necessarily mean that's the problem, it just means that you can investigate that lead.

For all you know, when the models get larger, the amount of computation required increases significantly, and the VRAM isn't the main issue.
While I disagree with WereCatf's conduct, they are not actually wrong about the importance of VRAM. The biggest bottleneck that LLMs face is not compute, but memory bandwidth. For each token that an LLM generates each layer of the model has to be processed. Which essentially means the entire model has to be read through once for each token.

VRAM generally runs at speeds measured in hundreds of GBps, veras system memory generally maxes out at a couple dozen GBps. That is why LLMs run extremely slowly if they do not fit entirely into VRAM. The memory bandwidth is such a bottleneck that if you are trying to run a model that does not fit into your VRAM then it literally doesn't matter if you have a 1080 or a 3080, the speed will be practically the same because the memory will be such an intense bottleneck.

If you ever hang out in any local LLM community you will quickly learn that the main thing they want to see out of GPU manufacturers is cards with more VRAM, because that is by far the biggest thing holding local LLMs back at the moment.
 
Last edited:
Upvote
14 (14 / 0)

mikael110

Wise, Aged Ars Veteran
150
https://lmstudio.ai/ is open source, supports more models, and isn't hardware-locked to Nvidia. (It even runs on Apple Silicon.)
LM Studio is great, but it is not in fact open source. They have a Github account, but all they store there is their config files and model catalog. The app itself is entirely closed source.

It is built on top of the open source llama.cpp project, but llama.cpp is MIT licensed, not GPL. Which is why LM Studio can use it without sharing any code themselves.
 
Last edited:
Upvote
9 (9 / 0)

jdw

Ars Tribunus Militum
2,352
LM Studio is great, but it is not in fact open source. They have a Github account, but all they store there is their config files and model catalog. The app itself is entirely closed source.

It is built on top of the open source llama.cpp project, but llama.cpp is MIT licensed, not GPL. Which is why LM Studio can use it without sharing any code themselves.
Good to know! Thanks for the correction. llama.cpp is indeed the real deal here, at least as far as open source goes. I don't think I understood they weren't coming from the same place, which is probably how the people behind LM Studio want it.

I would still prefer LM Studio over this, which seems to exist specifically to further lock people in to Nvidia.
 
Upvote
0 (1 / -1)

Topquark12

Smack-Fu Master, in training
3
Subscriptor++
I tried this yesterday with my 4090. The installation process was painful, the installer was very picky about the installation path and constantly failed without much explanation. Even when Installing on the default path, there were issues that I had to WinDbg to debug, which boiled down to some installation processes started before requisite processes were completed.

I was really interested to feed my collection of electronics engineering books and documents to the tool and be able to ask about specific bits of knowledge without going through the hundreds of pages. There were books with hundreds of pages like "The art of electronics", "Handbook of microwave measurements", "Mastering STM32" etc. along with a lot of useful loose application notes and pdf files. It amounted to 109 items totalling 646MB.

My box (RTX4090, 7950X3D, 64GB 6000MT/s CL30) took around 35 minutes to train on all the data. I tested with both the Mistral 7B int4 and Llama v2 13B models. According to Google search results, the Mistral model is supposed to be much better, but I didn't really feel the difference.

Answer generation is very fast, usually within 2 seconds or so. The answers are short, concise and cited the document it got it's answer from, which I appreciated.

I wasn't super impressed about the quality of the answers. It has a tendency to fixate on trying to provide the best (but sometimes wrong) answer it could find within one source, but ignore another source that actually contains a much better and correct answer.

My laymen assessment of the tool is it doesn't seem to be able to do high level reasoning on which direction to go, which book is the best to pick. Instead it tends to get tunnel visioned into a worse path.

Anyways, I'm really excited for this type of tools to get better in the future. Better privacy from offline tools is of course good, so is not having to pay a subscription or be tied to any particular cloud service company. But the thing I look forward the most is the ability to specify the sources it gets its answer from. I trust my curated books and knowledge base way more than some random SEO website with junk info, which chatGPT often seems to be trained on.
 
Upvote
5 (5 / 0)
It is really nice to see the authors mentioned GPT4All by Nomic AI. I would just like to clarify for everyone some of the differences between ChatRTX and GPT4All, just to ensure everyone who wants to run AI on any computer understands and knows them.

With GPT4All, which is a really small download, it runs on any CPU and runs models of any size up to the limits of one's system RAM, and with Vulkan API support being added to it, it is also to run on any GPU which supports Vulkan, and any amount of VRAM, and so one does not need an Nvidia GPU to have personal and private AI, nor is one limited to models that only fit within 8GB VRAM.

If you have a 16GB RAM computer with or without a GPU you can run GPT4All and load llama.cpp based 7b or 13b GGUF models from Hugging Face, but it also runs on computers with 8GB RAM also, for 7b SLM's.

Like ChatRTX, GPT4All also uses RAG to index one's personal documents to query information contained within them.

GPT4All runs on Windows and Mac and Linux systems, having a one-click installer for each, making it super-easy for beginners to get up and running with a full array of models included in the built-in downloader and the ability to side-load most any GGUF LM, as from Hugging Face (/TheBloke and others).

Nomic AI which makes GPT4All is based in New York and recieved 17 million in Series-A funding in the summer of 2023 after the launch of GPT4All popularized running AI privately on one's own PC, and so it is not something by some hacker in their mother's basement for those concerned about who makes it, and it continues to recieve full time development with new features continually being added and refined regularly.

Whether or not you have a compatible RTX GPU to run ChatRTX, GPT4All can run Mistral 7b and LLaMA 2 13b and other LM's on any computer with at least one CPU core and enough RAM to hold the model and things, and you can run much larger models if you have the RAM and/or VRAM. GPT4All enables one to finely adjust all model settings and GPU layers and context window sizes and temperature and repeat penalty and the other settings, including your own System Prompt to tell LM's how to behave, and how many CPU threads to use on multicore CPU's, and includes an HTTP server which uses the standard OpenAI API format so software that links to OpenAI can alternately be addressed to use GPT4All on localport:4891 on your own PC, but also includes a portal to OpenAI within it for those who have an OpenAI API key.

You do not need to run out and buy an Nvidia GPU just to have a personal private AI server on your system,
and of course, it is free and zealously open source.
 
Last edited:
Upvote
6 (6 / 0)

Fatesrider

Ars Legatus Legionis
25,458
Subscriptor
That was good where the article at the end had the real-world critical test report, but Ars should move beyond this "Product Announcement" trend of LLM articles/coverage. (I almost said "news coverage" there instead of "article coverage", but in our current world nobody can or will come up with "news" that isn't corporate announcements of products...when surely there must be more to discuss, i.e. articles on a topic and critical analyses etc that are not necessarily "news" and aren't PR messaging.) Right now on the front-page there are "NVIDIA CEO said something" and "NVIDIA has new product" headlines, and this isn't unusual.



Marketing is a different thing. Journalism isn't supposed to just be the marketing under a different org name. We come to learn what the marketers and CEO won't say in a carefully prepared statement boasting about their newest products and tailored solely exclusively toward their own benefit not the public's.
While I could agree with you, this is no different than the announcements from, well, everyone else. All such announcements are promotional. How the hell would anyone know these things are here, or coming, if they weren't announced. Is that totally advertising? No. The message here is ALSO to investors. Kind of a two-fer in that respect by telling the investors, and the public, what's on tap.

Since this news is reportable, all of the media to which it relates reports it.

Tell me how that's different from everything that's come before.
Chat With RTX works on Windows PCs equipped with NVIDIA GeForce RTX 30 or 40 Series GPUs with at least 8GB of VRAM. It uses a combination of retrieval-augmented generation (RAG), NVIDIA TensorRT-LLM software, and RTX acceleration to enable generative AI capabilities directly on users' devices.
I have a later model RTX 30 series with 12 gb VRAM, so I may check it out when it has more polish. I set up an older model than this for my occasional brain-storming chats, but it's kind of complicated. If they have a simpler process, and it works with Linux, then I it may be worth while to try.
 
Last edited:
Upvote
2 (2 / 0)

WereCatf

Ars Tribunus Militum
2,932
I have a later model RTX 30 series with 12 gb VRAM, so I may check it out when it has more polish. I set up an older model than this for my occasional brain-storming chats, but it's kind of complicated. If they have a simpler process, and it works with Linux, then I it may be worth while to try.
Take a look at Ollama and the web-UI for it. With the web-UI installed, it's quite literally point-and-click to install different models and choose which one you want to use for which chat. Runs on e.g. a Linux-server and can be installed natively or via e.g. Docker.
 
Upvote
0 (1 / -1)

Davidoff

Ars Scholae Palatinae
1,379
Like I kept seeing people complain that Alan Wake 2 wouldn't run on their "expensive GPU"...from 2018. And then be angry at the developers and say they're lazy and just "not optimizing."

AW2 is a pretty bad example because when it came out it ran like crap on many current GPUs as well (usually with graphical glitches). The fact that the most recent update made the same game playable on older GPUs like the RTX 2060 in my old gaming PC suggests that there, in fact, has been a severe lack of optimization when the game was released.
 
Upvote
2 (2 / 0)

cbreak

Ars Praefectus
5,972
Subscriptor++
I got the hardware to run this (30xx series GPU with 16GB Vram and 32GB ram). But I will need to upgrade to Windows 11, since thats stated as required at the download site.

Damn it, finally I may be forced to upgrade since I want to try this out ...... :(
There are other ways to run these models. Unless you're intent on using this specific wrapper, you might as well use one of the alternatives.

... and you could upgrade to linux instead :D
 
Upvote
-1 (0 / -1)