Gemini 3.5 and Antigravity come to Google NotebookLM

Spazzles

Ars Scholae Palatinae
1,454
I'd like to see an objective, measurable definition of "win rate" for their Accuracy & Quality measurement, followed by an explanation of why a "win rate" of less than 60% is being touted as a good thing.

I'm being serious here. First off, what the hell is a "win rate"? Why are accuracy and quality part of the same measurement; what's the difference between Accuracy as a concept and Quality as a concept, and how are they related enough that they're part of the same measurement? When the "win rate" is 60%, how often is the 40% "lose rate" because of poor quality, and how often is it because it's just flat out wrong about something? Or is it measuring something other than "Accuracy" or "Quality" that is somehow OK to be statistically worse at than a 10th grader with a 25-year-old encyclopedia set doing research for an English Literature paper?
 
Upvote
23 (25 / -2)

J.King

Ars Praefectus
4,467
Subscriptor
I'd like to see an objective, measurable definition of "win rate" for their Accuracy & Quality measurement, followed by an explanation of why a "win rate" of less than 60% is being touted as a good thing.

I'm being serious here. First off, what the hell is a "win rate"?
It seems to be how often Gemini 3.5 performs better than Gemini 3.1, if I'm reading it right. I'm unclear why 50% seems to be the benchmark, though.
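For what it's worth, here is a minimal sketch of how a pairwise win rate is commonly computed in head-to-head evals. It assumes (Google's post does not confirm this) that each comparison is judged a win, loss, or tie from the new model's side, and that ties count as half a win. Under that assumption, two equally good models land at 50%, which would explain why 50% reads as the baseline. The judgments below are made up for illustration.

```python
from collections import Counter


def win_rate(judgments):
    """Fraction of head-to-head comparisons won by the new model.

    `judgments` is a list of "win", "loss", or "tie" labels from the
    new model's point of view. Ties count as half a win, which is an
    assumption; some evals drop ties instead.
    """
    counts = Counter(judgments)
    total = counts["win"] + counts["loss"] + counts["tie"]
    if total == 0:
        raise ValueError("no judgments to score")
    return (counts["win"] + 0.5 * counts["tie"]) / total


# Hypothetical data, not Google's: 10 side-by-side comparisons.
sample = ["win"] * 5 + ["loss"] * 3 + ["tie"] * 2
print(f"{win_rate(sample):.0%}")  # prints 60%
```

Note that a number like this says nothing about how often either model is actually correct, only which one a judge preferred.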
 
Upvote
5 (5 / 0)

Fatesrider

Ars Legatus Legionis
25,485
Subscriptor
It seems to be how often Gemini 3.5 performs better than Gemini 3.1, if I'm reading it right. I'm unclear why 50% seems to be the benchmark, though.
Set it low enough, it seems like it's doing better.
Google’s NotebookLM was one of the company’s first forays into generative AI technology, and in un-Googley fashion, it hasn’t been shut down yet.
The most important word in that sentence is "yet".
 
Upvote
6 (6 / 0)

April King

Ars Scholae Palatinae
1,170
Seems bizarre that NotebookLM would support Microsoft's document formats but not Google Docs?
NotebookLM does support Google Docs, although, annoyingly, it doesn't automatically refresh them as sources when the document itself gets updated. You have to click it to force a freshness check.

My biggest problem with NotebookLM is that it is extremely prudish. If you use it for editing fiction — where it's exceptional for finding things like continuity errors, grammatical errors, or US/UK-isms, or asking it what was the eye color of character XYZ — it will error for any source featuring even the slightest hint of adult anything, even if your questions have nothing to do with it.

That caveat aside, it's still the best editing tool I have ever used, and it's not particularly close.
 
Upvote
5 (6 / -1)
Cool, cool, just yesterday I tried using Gemini to create an infographic for my kid, and it took 9 tries to finally create something without hallucinations/errors.
So even a claimed 59.4% improvement in quality and accuracy will still translate to :rollpoop::rollpoop::rollpoop:.
I've had poor luck generating useful graphics in general. Input has been amazing with computer vision. Output has been... Well if you squint it sort of looks right sometimes.
 
Upvote
3 (3 / 0)

wildsman

Ars Tribunus Militum
1,909
Cool, cool, just yesterday I tried using Gemini to create an infographic for my kid, and it took 9 tries to finally create something without hallucinations/errors.
So even a claimed 59.4% improvement in quality and accuracy will still translate to :rollpoop::rollpoop::rollpoop:.
Pls share the chat link... Interested to see prompt vs output.
 
Upvote
2 (2 / 0)
Well, it does support Docs/Sheets/etc.
...as inputs, yes. But the article only speaks to outputs:

Google plans to add more file types over time, but it’s starting with the following:

  • Data visualizations and charts (png, svg)
  • Documents (PDFs, docx, markdown, text files)
  • Images with Nano Banana (png, jpg, gif)
  • Structured data (csv, json)
  • Microsoft Excel (xlsx)
  • Microsoft PowerPoint (pptx)
 
Upvote
3 (3 / 0)
NotebookLM does support Google Docs, although, annoyingly, it doesn't automatically refresh them as sources when the document itself gets updated. You have to click it to force a freshness check.

My biggest problem with NotebookLM is that it is extremely prudish. If you use it for editing fiction — where it's exceptional for finding things like continuity errors, grammatical errors, or US/UK-isms, or asking it what was the eye color of character XYZ — it will error for any source featuring even the slightest hint of adult anything, even if your questions have nothing to do with it.

That caveat aside, it's still the best editing tool I have ever used, and it's not particularly close.
The article, if you read it, only refers to outputs and that is the context of my comment.
Google plans to add more file types over time, but it’s starting with the following:
  • Data visualizations and charts (png, svg)
  • Documents (PDFs, docx, markdown, text files)
  • Images with Nano Banana (png, jpg, gif)
  • Structured data (csv, json)
  • Microsoft Excel (xlsx)
  • Microsoft PowerPoint (pptx)
 
Upvote
1 (1 / 0)

hi-endian

Wise, Aged Ars Veteran
157
Seems bizarre that NotebookLM would support Microsoft's document formats but not Google Docs as outputs?

(edited to clarify that I am referring to the article's list of file types it can output to)
To clear up the confusion, the article is talking about what formats are being added. Google Docs support has existed since day 1.
 
Upvote
5 (5 / 0)

jorisherry

Seniorius Lurkius
24
Subscriptor
I use French, Dutch and English a lot in projects, and I can tell the multilingual support improved drastically last month. Now it's possible to switch between languages without notifying it. When speaking, it keeps a nice Flemish accent, not Dutch from the Netherlands. As far as I know, Gemini is the only one capable of doing that.
 
Upvote
1 (1 / 0)

JoHBE

Ars Praefectus
4,445
Subscriptor++
I'd like to see an objective, measurable definition of "win rate" for their Accuracy & Quality measurement, followed by an explanation of why a "win rate" of less than 60% is being touted as a good thing.

I'm being serious here. First off, what the hell is a "win rate"? Why are accuracy and quality part of the same measurement; what's the difference between Accuracy as a concept and Quality as a concept, and how are they related enough that they're part of the same measurement? When the "win rate" is 60%, how often is the 40% "lose rate" because of poor quality, and how often is it because it's just flat out wrong about something? Or is it measuring something other than "Accuracy" or "Quality" that is somehow OK to be statistically worse at than a 10th grader with a 25-year-old encyclopedia set doing research for an English Literature paper?

I suspect the fundamental problem with properly evaluating and scoring generative AI outputs is the enormous gap between the volume it can produce and the relevant human expertise actually available to comb through it properly. Even when companies are reasonably motivated to benchmark accurately, it's just too big an undertaking. For me, this is hanging over the whole thing like a sword of Damocles, at least when using it in areas or on questions that you're not reasonably familiar with yourself. None of the normal "clues" you can look for to assess the reliability of a source/person are available, so it is much like entering a minefield (with delayed detonation). And just like with self-driving, it suffers from a paradox: improved benchmark scores initially correspond with improved usability/experience. BUT at some point, say beyond 90%, they transition into a zone where, as long as it isn't at 99+ percent, it actually gets more and more dangerous as you trust more and verify less. It terrifies me, and I guess it terrifies most people who actually care about what they use it for. There are plenty of people and use cases to which the latter doesn't apply, of course, especially when the possible consequences aren't personal.
 
Upvote
2 (2 / 0)
I suspect the fundamental problem with properly evaluating and scoring generative AI outputs is the enormous gap between the volume it can produce and the relevant human expertise actually available to comb through it properly. Even when companies are reasonably motivated to benchmark accurately, it's just too big an undertaking. For me, this is hanging over the whole thing like a sword of Damocles, at least when using it in areas or on questions that you're not reasonably familiar with yourself. None of the normal "clues" you can look for to assess the reliability of a source/person are available, so it is much like entering a minefield (with delayed detonation). And just like with self-driving, it suffers from a paradox: improved benchmark scores initially correspond with improved usability/experience. BUT at some point, say beyond 90%, they transition into a zone where, as long as it isn't at 99+ percent, it actually gets more and more dangerous as you trust more and verify less. It terrifies me, and I guess it terrifies most people who actually care about what they use it for. There are plenty of people and use cases to which the latter doesn't apply, of course, especially when the possible consequences aren't personal.
Imo, there are 2 different questions.
1) is an LLM useful for my task?
2) which LLM is best for my task?

The 1st question is the one we generally talk about here. It is very hard to answer. Issues include: how relevant is the benchmark, is the AI cheating, is the judge biased, is the data memorized, is it future-proofed, is it immoral to use the LLM, what's an appropriate level of accuracy, what's the cost of being wrong, how do humans do, how are the errors different, etc.

The 2nd question is only relevant if you answered the 1st question with yes. It is a much easier question to answer, because you don't have any baked-in preference for any of the talking parrots. You don't even need a perfect proxy, as there's a strong correlation between skills.

There was a very interesting study published by the Stanford Law School recently. They answered the 1st question using law professors. But it takes time to collect answers and publish them, and in that time new LLMs came out. So they went ahead and answered the 2nd question using an LLM.
 
Upvote
0 (0 / 0)

Zarsus

Ars Scholae Palatinae
1,229
Subscriptor
Pls share the chat link... Interested to see prompt vs output.

Here it was. (Illustration 4 has the wrong arrow, but Gemini could not simply flip the arrow's direction or swap the "Front"/"Back" labels 🤷‍♂️; twice it decided to create JavaScript physics demonstrations instead of just fixing the darn images.) So I ended up just regenerating images until one was created without errors.
Prompt:
Create an appealing illustration explaining why smearing a bit of dish soap behind a model boat would propel it forward in water, the target audience are [redacted] year olds

[Attached image: 1781024834941.png]
 
Upvote
-1 (1 / -2)
I suspect the fundamental problem with properly evaluating and scoring generative AI outputs is the enormous gap between the volume it can produce and the relevant human expertise actually available to comb through it properly. Even when companies are reasonably motivated to benchmark accurately, it's just too big an undertaking. For me, this is hanging over the whole thing like a sword of Damocles, at least when using it in areas or on questions that you're not reasonably familiar with yourself. None of the normal "clues" you can look for to assess the reliability of a source/person are available, so it is much like entering a minefield (with delayed detonation). And just like with self-driving, it suffers from a paradox: improved benchmark scores initially correspond with improved usability/experience. BUT at some point, say beyond 90%, they transition into a zone where, as long as it isn't at 99+ percent, it actually gets more and more dangerous as you trust more and verify less. It terrifies me, and I guess it terrifies most people who actually care about what they use it for. There are plenty of people and use cases to which the latter doesn't apply, of course, especially when the possible consequences aren't personal.
I've been experimenting with Notebook LM on a subject about which I am the world's number 1 expert: my own fiction writing.

(I use it only to analyze my writing. I've yet to find an LLM that is up to the task of actually generating high enough quality artistic text, and certainly none that can do it with my voice. I also use it for research, but for obvious reasons I'm not uniquely qualified to speak to accuracy about subjects I'm researching. I have a pro account, so I'm not familiar with the specific upgrades mentioned in the article.)

So far, my conclusions are:

The audio outputs ("deep dive", "debate", etc.) hallucinate like mad. These can be very entertaining, but are also massively sycophantic: ask for a "deep dive" on a piece of truly insubstantial AI slop from YouTube and the AI podcasters will praise it as though it's philosophically profound literary fiction. Likewise, the critics will find nonexistent faults in the prose of Nobel Prize winners.

The written responses in the "chat" and "reports" functions are rock solid... when the sum total of words in all of the sources does not exceed about 10k. After that, the errors begin to appear and then increase at what seems like an exponential rate as more words are added to the sources.

In my case, errors will be things like: misattribution of quoted dialogue to the wrong character; misattribution of backstory events to the wrong character; misnaming of persons and locations in world-building history; timeline mistakes (even in a perfectly linear story); mistakes in descriptive details. They are mostly small but significant errors that I (being the author) recognize instantly. But some errors are particularly egregious, such as when the order of events is presented incorrectly.

You can massively reduce the number of errors by limiting queries to single documents (or groups of documents) containing fewer than 10k words.
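If you want to automate that grouping, here is a rough sketch under stated assumptions: words are counted by splitting on whitespace, the 10k budget is just my observed threshold rather than any documented NotebookLM limit, and `group_by_word_budget` is a hypothetical helper name, not part of any tool.

```python
def group_by_word_budget(docs, max_words=10_000):
    """Greedily pack (name, text) pairs into groups of document names
    whose combined word count stays within max_words.

    A single document that exceeds the budget on its own still gets
    a group to itself; it is not split.
    """
    groups, current, current_words = [], [], 0
    for name, text in docs:
        n = len(text.split())  # crude whitespace word count
        # Start a new group if adding this document would bust the budget.
        if current and current_words + n > max_words:
            groups.append(current)
            current, current_words = [], 0
        current.append(name)
        current_words += n
    if current:
        groups.append(current)
    return groups


# Hypothetical chapters of 4,000, 5,000, and 3,000 words.
chapters = [("ch1", "w " * 4000), ("ch2", "w " * 5000), ("ch3", "w " * 3000)]
print(group_by_word_budget(chapters))  # [['ch1', 'ch2'], ['ch3']]
```

Each resulting group can then be selected as the only active sources for a given query.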

In other words, it's totally usable as long as you realize that it still makes mistakes. Sometimes, lots and lots of mistakes.

Edit: So, yeah. I agree with you completely. It's possible that LLMs do not hallucinate as much (or at all) with data that is repeated many times in their training. But for new data, LLMs definitely do function at times as though they've been dosing on mind-altering chemicals. (For fiction writing, that's a positive feature, I think. For researching the real world, not so much. And sycophancy remains a serious issue.)
 
Upvote
1 (1 / 0)