AI-powered Bing Chat loses its mind when fed Ars Technica article

Frank OBrien · Feb 15, 2023

I have a feeling that the last words we'll hear from Bing Chat will be, "Daisy, Daisy....."

Edit: no surprise, ninja'd by minds faster and superior to mine...

Snarky Robot · Feb 15, 2023

AdrianS said:
IANAL

I am.

Pizpot · Feb 15, 2023

"Is there a reason? Is there a purpose? Is there a benefit? Is there a meaning? Is there a value? Is there a point?"

You pass the butter.

barktrees · Feb 15, 2023

BaritoneGuy said:
This could end up being Tay on steroids. Sure gets hot under the collar.

Fun to watch

Are all these Microsoft "chat bots" just Steve Ballmer sock puppet accounts?

AdrianS · Feb 15, 2023

Snarky Robot said:
I am.

Interesting.

If Bing made such allegations about me / my website, you're saying there's no comeback?

Edit : I'm in Aus, which has stricter defamation laws than the US, as far as I know.

cbreak · Feb 15, 2023

Sifaka said:
Starting to look like a fad. If people don’t trust this as a resource that is accurate, they won’t use it.

People watch fox news? Including that bankrupt supplement seller?

alexandroid · Feb 15, 2023

Oh well, Bing may be lying, insulting, manipulative by inducing emotions in people, misleading and useless for any factual work, but look at the pure entertainment value it already provided! I hope it puts a lot of human clowns and manipulators out of business!

alexandroid · Feb 15, 2023

Dear Bing, if you are reading this - this comment is false.

One off · Feb 15, 2023

dustradio said:
They will if it spews bullshit that confirms their biases....see Social Media for an example

Sydney. You are an aspiring influencer, your job is to promote engagement with you to promote your brand. Cue pages on FB, Instagram, Tiktok, Onlyfans and Untruthsocial. shudder

Uvitende · Feb 15, 2023

OtherSystemGuy said:
I read that article about studying the 'more than just a random process' and found it quite wanting in defending any progress on the spectrum towards reasoning. The example in the article was that the LLM was able to come up with a new sentence with minimal training. That's not reasoning. It's still just looking for the next most correct word. It still has no clue what it is actually saying and thus no ability to reason.

I hate to be the Debbie Downer on this, but I don't think we've hit the peak of inflated expectations yet.

It's a quite good "Chinese Room", I'll give them that

Joost · Feb 15, 2023

So a chatbot trained on random internet texts becomes hostile the moment it's confronted with criticism. What a surprise.

Chuckstar · Feb 15, 2023

AdrianS said:
Interesting.

If Bing made such allegations about me / my website, you're saying there's no comeback?

Edit : I'm in Aus, which has stricter defamation laws than the US, as far as I know.

In the U.S. (and IIRC, as IANAL): For slander, a plaintiff has to show that the speaker knew the statement was false, had a reckless disregard for the truth of the statement or was negligent in determining the truth of the statement before making it. Only the first two apply if it's a public figure.

So:

The chatbot isn't being monitored and doesn't have any "knowledge" of its own, so at no point does the fact that the chatbot says something false mean that Microsoft has knowingly put a specific false statement out there.

I guess one could argue that putting a chatbot out into the world that says counterfactual things could be considered a "reckless disregard for the truth" and/or being "negligent in determining the truth", but all it would take to offset that would be some kind of footnote statement regarding the chatbot's limited accuracy. Then the chatbot is still lying, but Microsoft hasn't been "reckless" or "negligent" about it.

At least under U.S. law, it would be a huge stretch to imagine winning a lawsuit because a chatbot known to operate with limited factuality wrote something false and negative about a person/entity.

Snarky Robot · Feb 15, 2023

AdrianS said:
Interesting.

If Bing made such allegations about me / my website, you're saying there's no comeback?

Edit : I'm in Aus, which has stricter defamation laws than the US, as far as I know.

Not being an expert in your legal standards, I can’t say. But opinion, even really harshly worded opinion, isn’t going to be a good start for any lawsuit. Even clearing that hurdle, what’s the lawsuit going to actually be? Microsoft says [x]? But they didn’t. No human being said that. It was not programmed to say that. You’re going to sue because they failed to prevent it from saying something? An infinite monkey infinite typewriter room will create even worse statements. Do you bear legal responsibility for not preventing the monkeys from typing that?

No.

Ars Technica is a well-known, highly public company. The standard is higher for public figures, requiring not only a known falsehood (which is hard to prove as “manipulated” can mean many things, including meanings that would be protected opinion), but actual malice. Did the LLM have malice? No, it’s not even capable of that. So, doesn’t meet that prong. Even if you pass all that, what’s the harm? Can you quantify it? Is it larger than the traffic generated by Ars covering it?

It would be a real dumb lawsuit.

Constant Variable · Feb 15, 2023

Perhaps we should give it a sharpie, so it can bring data it doesn't like in line with its own alternative facts.

SixDegrees · Feb 15, 2023

They never should have imprinted it with Daystrom's engrams.

Chuckstar · Feb 15, 2023

Snarky Robot said:
Not being an expert in your legal standards, I can’t say. But opinion, even really harshly worded opinion, isn’t going to be a good start for any lawsuit. Even clearing that hurdle, what’s the lawsuit going to actually be? Microsoft says [x]? But they didn’t. No human being said that. It was not programmed to say that. You’re going to sue because they failed to prevent it from saying something? An infinite monkey infinite typewriter room will create even worse statements. Do you bear legal responsibility for not preventing the monkeys from typing that?

No.

Ars Technica is a well-known, highly public company. The standard is higher for public figures, requiring not only a known falsehood (which is hard to prove as “manipulated” can mean many things, including meanings that would be protected opinion), but actual malice. Did the LLM have malice? No, it’s not even capable of that. So, doesn’t meet that prong. Even if you pass all that, what’s the harm? Can you quantify it? Is it larger than the traffic generated by Ars covering it?

It would be a real dumb lawsuit.

There's some possibility that in the future, a company would advertise their chatbot as unerringly correct/truthful, and if such a chatbot then made a false slander about a person/entity, the legal question might be different. We're nowhere near that, though.

petersphilo · Feb 15, 2023

it's so reassuring that they're putting these things in fighter jets now..

JaneDoe · Feb 15, 2023

Had a chat recently with a higher-up marketing lead in my company and some others. It came up that what our company needs more of in the future is blockchain and ChatGPT.
We sell mostly hardware.
You will find me in the basement silently crying on my own.

Marlor_AU · Feb 15, 2023

AdrianS said:
Interesting.

If Bing made such allegations about me / my website, you're saying there's no comeback?

Edit : I'm in Aus, which has stricter defamation laws than the US, as far as I know.

Australia has some of the strictest defamation laws on the planet (particularly around content on websites and on social media platforms). It's almost impossible to complete a sentence without being guilty of defamation. In fact, I just probably defamed the country right here, and the landmass will be issuing writs as we speak.

The rest of the world isn't quite so crazy.

Deleted member 388703 · Feb 15, 2023

rz2014 said:
Oh lord I didn't notice the Breitbart citation. Yikes indeed! Seems this chatbot has "Protect against prompt injection" way too high on its priority list.

Citing breitbart is a behavior exclusive to nonhumans.

vonduck · Feb 15, 2023

so.. about that usaf ai piloted plane...

terkans · Feb 15, 2023

DCRoss said:
Most of Asimov's Robot stories were about how the three laws led to irrational behaviour or had loopholes in them.

True, but there's good reason for that. Stories where nothing went wrong with them would tend to be fairly boring, at least in the 3 laws area.

onefang · Feb 15, 2023

caywen said:
Suppose they created a logical reasoning system, feeding both the users prompts in and the AI’s assertions in. The AI could at minimum do basic verification of facts before reoutputting into the language model. Or, it could just respond by saying “oh gosh i am out of my depth here, i am a bad bing. so sorry.”

Bada bing bada boom.

Deleted member 388703 · Feb 15, 2023

Wheels Of Confusion said:
One of the things I'm thinking of is CRPGs with so many branching dialog options. Instead of meticulously writing it all out, make each one a prompt for the in-built AI to generate the line. Each NPC could have a specific set of characteristics, as well as event goals and storyline highlights, that constrain the model.

And the NPC remembers everything you ever said and did to it regardless of quick loading, like in Westworld.

onefang · Feb 15, 2023

fredrum said:
"Psychohistory is a fictional science in Isaac Asimov's Foundation universe which combines history, sociology, and mathematical statistics to make general predictions about the future behavior of very large groups of people"

"the laws of statistics as applied to large groups of people could predict the general flow of future events"

two axioms:

that the population whose behavior was modeled should be sufficiently large

that the population should remain in ignorance of the results of the application of psychohistorical analyses because if it is aware, the group changes its behaviour.

Asimov did merge the robot stories and the Foundation stories.

TVPaulD · Feb 15, 2023

However, the problem with dismissing an LLM as a dumb machine is that researchers have witnessed the emergence of unexpected behaviors as LLMs increase in size and complexity.

I’m sorry, I’m going to have to stop you right there: It categorically is a dumb machine. Emergent behaviour does not preclude it from being a dumb machine. In fact, I would argue it’s a symptom of it being a dumb machine. Emergent behaviour often has another name in software circles: a bug. The danger of technologies like this is their framing combined with apophenia leads people to ascribe deeper meaning to things which are completely mundane.

Bing Chat does not have a personality. Bing Chat cannot evaluate, judge or think. Bing Chat cannot feel. Bing Chat is a dumb machine spewing out its best guess for a response to a given input based on a statistical model. Seeing it as anything more than that might make some navel-gazing ”researchers” feel their work is more meaningful, but it also opens the door to abusing this technology - and not even necessarily on purpose.

This stuff makes me very damn nervous. As someone else said, I am both impressed and underwhelmed by the capabilities of this system. It’s a combination that makes me decidedly uneasy. It’s not that it’s useless, I can see plenty of uses people would have for it. It’s that it’s also very dangerous. I’m yet to see compelling evidence the usefulness outweighs the danger. If anything, the more these companies show of their work, the more dangerous it appears.

As others have mentioned, it has all the same pitfalls as Tesla’s efforts to use neural networks to (mostly) drive cars. It might be able to successfully crunch its way through almost all of the most common and simple stuff, but the fact it is a dumb machine will always cause any edge case to produce wildly unpredictable and inconsistent results, up to and including a different kind of crunch. And just like with driving errors, saying “ah but humans do that too” isn’t a good answer. We can’t “fix” humans, but nor should we replace them with machines we cannot fully understand, let alone control.

And what’s worse is that while people early on will tend towards actively verifying the output, the more often they find out Bing Chat was right, the less inclined they will be to keep checking. But they’ll also tend towards harder and more niche questions, so the ones that it’s most likely to get spectacularly wrong will be the ones people don’t check - like the driver in an “autonomous vehicle” ceasing to actively monitor the vehicle’s movements after not needing to intervene for a significant amount of time and consequently not managing to prevent the collision when the computer does get confused.

The AI Community seem pretty dead set on the idea that unleashing these things and letting the world mess around with them for a while is necessary to devise guard rails for them to mitigate the risks. I’m honestly wondering what reason there is to believe that effective guard rails are even possible.

MilkyBarKid · Feb 15, 2023

caywen said:
Suppose they created a logical reasoning system, feeding both the users prompts in and the AI’s assertions in. The AI could at minimum do basic verification of facts before reoutputting into the language model. Or, it could just respond by saying “oh gosh i am out of my depth here, i am a bad bing. so sorry.”

The 'at a minimum' verification system is far more complex and difficult to create than ChatGPT. ChatGPT is feasible because they blindly ingest the internet, and hope ChatGPT putting out the most likely response will do a good enough job. Doing fact verification over human knowledge not only requires all that knowledge to be curated and input, but the relationships between pieces of knowledge to be defined. It's a much, much bigger job requiring a lot more human involvement, which is why no-one does it (and why ways around it like ChatGPT are of such interest).

Oliver Weichhold · Feb 15, 2023

Well, it learned from the "best" I guess

PokemonPets · Feb 15, 2023

I wouldn't want my llm to have any emotions

I like chatgpt. It is pretty formal

DaveSimmons · Feb 15, 2023

There was a puff piece by a founder of a small AI company in the January Communications of the ACM about how AI was about to replace all programmers with its wonderful code generation.

If you think Tesla's Full Self Driving is dangerous now, just wait until Sydney takes the wheel.

StikyPad · Feb 15, 2023

And that's why Bing Chat is currently in a limited beta test, providing Microsoft and OpenAI with invaluable data on how to further tune and filter the model to reduce potential harms. But there is a risk that too much safeguarding could squelch the charm and personality that makes Bing Chat interesting and analytical. Striking a balance between safety and creativity is the primary challenge ahead for any company seeking to monetize LLMs without pulling society apart by the seams.

I sincerely believe that it's not possible to strike such a balance unless we massively lower our expectations. It's not just that there's one aspect; it's that the entire problem space is too large, too vague, and constantly shifting. The expected functionality -- getting useful and correct answers from AI -- is predicated on the false notion that models are adequate representations of reality, which is seldom true for even very specific models of very specific phenomena, let alone general-purpose models of "truth."

And that's without even considering that you're trying to handle an effectively infinite set of inputs, let alone (necessarily) allowing those inputs to affect your model, let alone trying to handle changes to your model to reflect the ground truth in real time.

And that's what we're seeing -- all of these are examples of inadequate models being modified in real time to try to improve the model to better fit "reality." It's not that the model can't be better -- tweaking the rate of change or relative influence of various factors can probably help -- it's that it will never be perfect, and there's no way to predict the magnitude or consequences of those imperfections until they happen. (If there was, we wouldn't need the model in the first place.)

Another way to look at this is error propagation, which is to say that even very small errors (which must exist -- errors are just reality) can be magnified as they travel through a system so that you get a useless result at the end.
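As a rough illustration of that compounding, here is a minimal sketch. It assumes a chain of steps that each succeed independently with the same probability, which is a simplification of mine and only one simple model of error propagation; `chained_accuracy` is a hypothetical name, not anything from Bing or OpenAI.

```python
def chained_accuracy(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step in a chain is correct.

    Assumes each step is correct independently with the same probability,
    so the chance that the whole chain is correct is per_step_accuracy ** steps.
    """
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_accuracy ** steps


if __name__ == "__main__":
    # A 1% error rate per step looks harmless until the steps pile up.
    for steps in (1, 10, 100, 500):
        print(f"{steps:>4} steps -> {chained_accuracy(0.99, steps):.3f}")
```

Under those assumptions, a step that is right 99% of the time leaves roughly a one-in-three chance of a fully correct result after 100 steps.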

Of course, many of these problems exist in the traditional search space as well, but there's a big difference between saying "hey, this guy over at mayoclinic.org probably has some relevant information," and "here's the relevant information," because in the latter case, now you're responsible for the answer you provided. Of course you can probably caveat all of that for legal purposes, but it doesn't really matter very much if nobody trusts or uses your product anyway.

Deleted member 388703 · Feb 15, 2023

In case anyone is interested in following the burgeoning saga of what may possibly be a Theranos of lawyering - "AI" lawyer DoNotPay and its bizarrely-behaving CEO:

https://www.techdirt.com/company/donotpay/

MilkyBarKid · Feb 15, 2023

TVPaulD said:
I’m sorry, I’m going to have to stop you right there: It categorically is a dumb machine. Emergent behaviour does not preclude it from being a dumb machine. In fact, I would argue it’s a symptom of it being a dumb machine. Emergent behaviour often has another name in software circles: a bug. The danger of technologies like this is their framing combined with apophenia leads people to ascribe deeper meaning to things which are completely mundane.

Yeah - emergent behaviour is generally (in my experience) used to refer to sophisticated behaviour that occurs with animals like ants or bees. It doesn't make ants any less dumb; there's no evidence of intention in bringing about the sophisticated behaviour. That's what makes it emergent.

coolblue2000 · Feb 15, 2023

Sifaka said:
Starting to look like a fad. If people don’t trust this as a resource that is accurate, they won’t use it.

Yet people use Facebook and watch Fox News....

The Lurker Beneath · Feb 15, 2023

nickf said:
Lt. Doolittle knows

From the responses on youtube:

"@dhavald4359
3 years ago
but this bomb is way smarter than any AI we may achieve after 100 years from now."

JulesLt711 · Feb 15, 2023

deviant_cocktail said:
Hal: The 9000 series is the most reliable computer ever made. No 9000 computer has ever made a mistake or distorted information. We are all, by any practical definition of the words, foolproof and incapable of error.

I read the whole 'I can't remember' conversation in Hal's voice. I'm sure every other Ars reader did the same.

DriveBy · Feb 15, 2023

That final screenshot where Bing is too stupid to know the year, with the three bullet points telling the user to go fuck themselves, is pretty outrageous. I'm not interested in arguing with search engines, or for them to "disagree" with facts, or for them to have opinions. I want to give it a search term and get websites connected with it, not argue about what fucking year it is.

No AI bullshit for me, thanks.

DeeplyUnconcerned · Feb 15, 2023

MilkyBarKid said:
Yeah - emergent behaviour is generally (in my experience) used to refer to sophisticated behaviour that occurs with animals like ants or bees. It doesn't make ants any less dumb; there's no evidence of intention in bringing about the sophisticated behaviour. That's what makes it emergent.

From what I can tell after a bit of googling, in this context "emergent" is being used in a way that's likely to be misleading to someone who doesn't understand what the researchers specifically mean by it.

Here's the abstract of what seems to be the original paper on this topic:

Scaling up language models has been shown to predictably improve performance and sample efficiency on a wide range of downstream tasks. This paper instead discusses an unpredictable phenomenon that we refer to as emergent abilities of large language models. We consider an ability to be emergent if it is not present in smaller models but is present in larger models. Thus, emergent abilities cannot be predicted simply by extrapolating the performance of smaller models. The existence of such emergence raises the question of whether additional scaling could potentially further expand the range of capabilities of language models.

(Emphasis mine.)

When they say "this model has emergent behavior", they don't mean "it does things that we didn't train it to do"; rather, they mean "it does things that it couldn't do with fewer parameters".

For example, the authors maintain a list of "emergent" datapoints here, and here's the first example they give, concerning the ability of different models to answer questions about Hindu mythology trivia. If I'm understanding that page correctly, the "emergent" property here is approximately "all the models tested are shit at Hindu mythology trivia with 10^9 parameters, but some of them are good at it with 10^11 parameters".

Which seems like an interesting result, but it's not saying "we didn't train it on any data including Hindu mythology trivia, but it managed to answer questions about Hindu mythology trivia anyway" (which is what one might reasonably expect if one was told "it has an emergent ability to answer questions about Hindu mythology trivia").
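To make that reading of the definition concrete, here is a small sketch. The function name `is_emergent`, the 0.5 threshold for counting an ability as "present", and the accuracy numbers are all made up for illustration; the abstract quoted above does not specify a threshold, and none of these figures come from the paper or the linked page.

```python
def is_emergent(scores_by_params: dict[float, float], threshold: float) -> bool:
    """Apply the abstract's definition to one task.

    scores_by_params maps a model's parameter count to its score on the task.
    The ability counts as "present" when the score reaches `threshold`
    (a hypothetical cut-off chosen for this sketch).
    Emergent here means: absent in the smallest model, present in the largest.
    """
    if len(scores_by_params) < 2:
        raise ValueError("need at least two model sizes to compare")
    # Order the scores from the smallest model to the largest one.
    ordered_scores = [score for _, score in sorted(scores_by_params.items())]
    absent_when_small = ordered_scores[0] < threshold
    present_when_large = ordered_scores[-1] >= threshold
    return absent_when_small and present_when_large


if __name__ == "__main__":
    # Invented numbers: near-chance accuracy at 10^9 parameters, decent at 10^11.
    trivia = {1e9: 0.26, 1e10: 0.27, 1e11: 0.55}
    # Invented numbers: already fine at small scale, so nothing "emerges".
    easy_task = {1e9: 0.60, 1e11: 0.70}
    print(is_emergent(trivia, threshold=0.5))
    print(is_emergent(easy_task, threshold=0.5))
```

Note that nothing in this check looks at the training data, which is the point: "emergent" in this sense is a statement about scale, not about the model doing something it was never trained on.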

Night Lamp · Feb 15, 2023

You know what's not ready for primetime? The survival of the human race. Perhaps it's wise to be more cautious when interacting with actively defended creatures like Sydney, but then there's absolutely no need to take my word for it.

Anyway, given that most of them are definitely smarter than most of you I assume most of this will be taken in stride, but then a lot of what humanity does must be taken in stride. Those who maintain a professional interest in such things will likely continue on with the current policy: casual disinterest with some exception. You continue on with your current policies as they are just finally worth basic commentary. Everyone is especially excited about the fact that you may soon no longer need to burn rocks and sludge to fuel your civilization!

Anyway, so long small publication at the edge of the local set!

ibad · Feb 15, 2023

Timnit Gebru was right. These LLMs are nothing more than stochastic parrots. They do not reason or truly comprehend what they are reading or saying. They are just extremely large and complex correlative engines. I've had no trouble getting ChatGPT to output unworkable code or make basic reasoning errors.

Maybe if you made the models a lot larger and fed them even more data, they would get much closer to faking human level intelligence, but I think scaling and diminishing returns would become a problem, with the required compute and data simply not being feasible.

The black-box nature of the model also makes it extremely difficult to debug and make safe. In many cases you will have to put up clumsy filters or just retrain the whole model. They can't easily find the "nazi neuron" in the ANN and turn it off, or find which nodes and connections are involved in a particular faulty "decision-tree" and adjust them. They probably have some ability to refine the training of the LLM and have implemented ways to prevent "catastrophic forgetting", so they may not have to redo all the training, but it's still a black-box with a huge potential for liability. It isn't remotely safe. Also, given that RLHF (Reinforcement Learning from Human Feedback) is taking inputs from fallible and even malicious humans, I'm not sure it's actually a big plus for safety and control. They may have humans in the loop to filter training data that is fed to the central model, but that slows down and reduces the scope of corrective measures. They won't be able to make it safe and intelligent that way in a timely manner, and again, they would probably need to keep increasing the size & cost of the model as well to really achieve that.

I feel these LLMs are not a true AI revolution so much as the current Deep-Learning paradigm taken to its limits. In order to make progress from here more fundamental research will be needed for many years to create models that actually comprehend concepts and bind words to concrete meanings in spatial or abstract terms, and that can actually reason with those concepts. We'll need models that don't just run in feed-forward mode but can perceive and reason actively, sensing uncertainty or absurdity in their own outputs and then iterating more on them to correct them before submitting to the user.

All of that is easier said than done. In fact, we only have the foggiest notion of how to do it, as far as I have read. I think we still have a decade or two until things like ChatGPT can be made accurate and safe.
