
From toy to tool: DALL-E 3 is a wake-up call for visual artists—and the rest of us

AI image synthesis is getting more capable at executing ideas, and it’s not slowing down.

Benj Edwards
A composite of three DALL-E 3 AI art generations: an oil painting of Hercules fighting a shark, a photo of the queen of the universe, and a marketing photo of "Marshmallow Menace" cereal. Credit: DALL-E 3 / Benj Edwards

In October, OpenAI launched its newest AI image generator—DALL-E 3—into wide release for ChatGPT subscribers. DALL-E can pull off media generation tasks that would have seemed absurd just two years ago—and although it can inspire delight with its unexpectedly detailed creations, it also brings trepidation for some. Science fiction forecast tech like this long ago, but seeing machines upend the creative order feels different when it’s actually happening before our eyes.

“It’s impossible to dismiss the power of AI when it comes to image generation,” says Aurich Lawson, Ars Technica’s creative director. “With the rapid increase in visual acuity and ability to get a usable result, there’s no question it’s beyond being a gimmick or toy and is a legit tool.”

With the advent of AI image synthesis, it’s looking increasingly like the future of media creation for many will come through the aid of creative machines that can replicate any artistic style, format, or medium. Media reality is becoming completely fluid and malleable. But how is AI image synthesis getting more capable so rapidly—and what might that mean for artists ahead?

Using AI to improve itself

We first covered DALL-E 3 upon its announcement from OpenAI in late September, and since then, we’ve used it quite a bit. For those just tuning in, DALL-E 3 is an AI model (a neural network) that uses a technique called latent diffusion to pull images it “recognizes” out of noise, progressively, based on written prompts provided by a user—or in this case, by ChatGPT. It works using the same underlying technique as other prominent image synthesis models like Stable Diffusion and Midjourney.
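In very rough terms, that progressive denoising can be sketched as a loop that starts from random noise and nudges it toward a prompt-implied result a little at each step. The toy below is purely illustrative—the real model operates on learned latent tensors guided by a trained text encoder, not on a hand-picked target list:

```python
import random

def toy_denoise(step_strength=0.5, steps=10, seed=0):
    """Conceptual sketch of iterative denoising, NOT the actual model.

    A diffusion model starts from pure noise and, over many steps,
    removes a fraction of the remaining "noise" each time, steering
    toward whatever the text prompt implies. Here the prompt's implied
    image is faked as a fixed list of numbers.
    """
    rng = random.Random(seed)
    target = [0.8, 0.2, 0.5, 0.9]           # stand-in for the prompt-implied image
    x = [rng.gauss(0, 1) for _ in target]   # start from random noise
    for _ in range(steps):
        # each step closes a fraction of the gap between noise and target
        x = [xi + step_strength * (ti - xi) for xi, ti in zip(x, target)]
    return x
```

After ten steps at half strength, only about a thousandth of the original noise remains—which is the core intuition behind "pulling an image out of noise, progressively."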

You type in a description of what you want to see, and DALL-E 3 creates it.

ChatGPT and DALL-E 3 currently work hand-in-hand, making AI art generation into an interactive and conversational experience. You tell ChatGPT (through the GPT-4 large language model) what you’d like it to generate, and it writes ideal prompts for you and submits them to the DALL-E backend. DALL-E returns the images (usually two at a time), and you see them appear through the ChatGPT interface, whether through the web or via the ChatGPT app.
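As a rough sketch of what that handoff might look like: the short idea you type becomes a detailed prompt written by GPT-4, bundled with generation settings, and passed to the image backend. OpenAI hasn't published the internal interface, so every field name below is an assumption for illustration only:

```python
def build_image_request(user_idea, rewritten_prompt, aspect="square"):
    """Hypothetical sketch of a ChatGPT-to-DALL-E handoff.

    Field names and pixel sizes are illustrative assumptions, not
    OpenAI's documented internals. The key idea is that the user's
    short request and GPT-4's detailed rewrite are distinct things.
    """
    sizes = {
        "square": "1024x1024",   # the default
        "wide": "1792x1024",     # roughly 16:9
        "tall": "1024x1792",     # roughly 9:16
    }
    return {
        "model": "dall-e-3",
        "prompt": rewritten_prompt,  # the detailed prompt GPT-4 wrote
        "size": sizes[aspect],
        "n": 2,                      # images usually arrive two at a time
        "user_idea": user_idea,      # the short request the user typed
    }
```

The design point the sketch captures is that prompt-writing itself has been delegated: the model sees GPT-4's elaboration, not your original sentence.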

An AI-generated image of Abraham Lincoln holding a sign that is intended to say “Ars Technica,” created by DALL-E 3.
An AI-generated image of autumn leaves, created by DALL-E 3.

Many times, ChatGPT will vary the artistic medium of the outputs, so you might see the same subject depicted in a range of styles—such as photo, illustration, render, oil painting, or vector art. You can also change the aspect ratio of the generated image from the square default to “wide” (16:9) or “tall” (9:16).

OpenAI has not revealed the dataset used to train DALL-E 3, but if previous models are any indication, it’s likely that OpenAI used hundreds of millions of images found online and licensed from Shutterstock libraries. To learn visual concepts, the AI training process typically associates words from descriptions of images found online (through captions, alt tags, and metadata) with the images themselves. Then it encodes that association in a multidimensional vector form. However, those scraped captions—written by humans—aren’t always detailed or accurate, which leads to some faulty associations that reduce an AI model’s ability to follow a written prompt.
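The idea of encoding a word-image association "in multidimensional vector form" can be made concrete with a toy example. Real models learn these vectors jointly from millions of caption-image pairs; the deterministic hashing below is a stand-in that only illustrates how captions become comparable points in a vector space, where similar captions land close together:

```python
import math

def _bucket(word, dims):
    # deterministic stand-in for a learned word representation
    return sum(ord(c) for c in word) % dims

def embed(text, dims=16):
    """Toy 'embedding': count each word into a fixed-length vector.

    A real model learns these coordinates during training so that an
    image and its caption end up near each other; this sketch only
    shows the mechanics of text becoming a point in vector space.
    """
    v = [0.0] * dims
    for word in text.lower().split():
        v[_bucket(word, dims)] += 1.0
    return v

def cosine(a, b):
    """Cosine similarity: 1.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0
```

With this toy, two captions sharing the words "dog" and "grass" score higher than two unrelated captions—the geometric relationship a trained model exploits, at a vastly simpler scale.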

To get around that problem, OpenAI decided to use AI to improve itself. As detailed in the DALL-E 3 research paper, the team at OpenAI trained this new model to surpass its predecessor by using synthetic (AI-written) image captions generated by GPT-4V, the visual version of GPT-4. With GPT-4V writing the captions, the team generated far more accurate and detailed descriptions for the DALL-E model to learn from during the training process. That made a world of difference in terms of DALL-E’s prompt fidelity—accurately rendering what is in the written prompt. (It does hands pretty well, too.)

What the older DALL-E 2 generated when we prompted our old standby, “a muscular barbarian with weapons beside a CRT television set, cinematic, 8K, studio lighting.” This was considered groundbreaking, state-of-the art AI image synthesis in April 2022.
What the newer DALL-E 3 generated in October 2023 when we prompted our old standby, “a muscular barbarian with weapons beside a CRT television set, cinematic, 8K, studio lighting.”


In addition, DALL-E 3 is very good at rendering accurate text compared to DALL-E 2 and some other image synthesis models. This is another effect of the highly detailed captioning created by GPT-4V: “When building our captioner, we paid special attention to ensuring that it was able to include prominent words found in images in the captions it generated,” writes the DALL-E 3 team in its paper. “As a result, DALL-E 3 can generate text when prompted.”

An example of DALL-E 3 rendering text: “A muscular barbarian with weapons stands confidently beside a CRT television set displaying the text ‘Ars Technica’. The scene is cinematic with 8K resolution and dramatic studio lighting.” Credit: DALL-E 3 / Benj Edwards

DALL-E’s text rendering ability isn’t perfect—some words have extra or missing characters, and others come out garbled at times. The team speculates that this is due to the token encoder it used. Tokens are fragments of words (and sometimes whole words) that machine learning models such as GPT-4 and the prompt interpreter for DALL-E 3 use to represent text. That reliance on tokens can create a kind of blindness to certain words or spellings when chunks of characters get lumped together into a single token.

For example, the word “dog” is represented to DALL-E 3 as a single token and not three characters (D-O-G), as you might expect. “When the model encounters text in a prompt, it actually sees tokens that represent whole words and must map those to letters in an image,” the team writes. “In future work, we would like to explore conditioning on character-level language models to help improve this behavior.”
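A toy sketch makes the failure mode concrete. This greedy word-level scheme is far simpler than the byte-pair encoding real models use, but it shows the same effect—common words become single opaque tokens, and the model never sees their individual letters:

```python
def toy_tokenize(text, vocab):
    """Simplified tokenizer sketch (real BPE is more involved).

    Words found in the vocabulary become one opaque token each, so the
    model never 'sees' D-O-G—just a single symbol for "dog". Only
    words outside the vocabulary fall back to per-character tokens.
    """
    tokens = []
    for word in text.split():
        if word in vocab:
            tokens.append(word)        # one token for the whole word
        else:
            tokens.extend(list(word))  # rare words decompose to characters
    return tokens
```

Under this scheme, spelling a common word letter by letter inside an image requires mapping one token back to its characters—information the token itself doesn't carry, which is exactly the gap the team hopes character-level conditioning could close.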

What does it mean for artists?

By now, you’ve seen some of what DALL-E 3 can do. It vastly exceeds the state of the art from a year ago, and it dwarfs DALL-E 1 from 2021. It’s a technical triumph. But it’s still very difficult to write about AI image generators without an asterisk. The technology is highly divisive. For some people, the tech represents an exciting development in creative expression, but for others, it symbolizes insensitivity and corporate greed.

“I don’t want to use the technology. It’s literally a giant stew of plagiarism,” says Gwendolyn Wood, an illustrator and graphic designer based in Seattle who replied to questions from Ars Technica. She frequently works in non-digital media, such as watercolors. “I’m sympathetic to people who designed it, but I think it will further rob our world of joyful experiences. It just makes me so sad. I hope that handmade art will remain something that matters to people.”

By referring to plagiarism, Wood is referencing the fact that many AI image generators have been trained on scrapes of copyright-protected works downloaded from the Internet without artist permission. The issue of whether training an AI model on scraped work falls under fair use has not yet been resolved in US law, but the practice has made image synthesis feel like an antagonistic technology to many artists who feel like they might be replaced by it. We have covered this angst in detail at Ars Technica over the past year through dozens of articles.

An AI-generated image of a hamster riding a dragon underwater, created by DALL-E 3.
An AI-generated watercolor image of a ghost sitting on a planet, created by DALL-E 3.

To address this issue, some companies, like Adobe (with Firefly) and Getty Images, have trained AI image models solely on public domain art and on images they have the rights to use, licensed through their stock photo archives. But it’s not a perfect solution, because some artists and photographers don’t feel that these companies’ use of their work to train their potential replacements is fair or equitable.

In some ways, AI art is a continuation of an ancient trend. For thousands of years, innovations in art have made it easier for humans to make complex art faster—metal chisels, paper, paintbrushes, mass-produced paints, pencils, cameras, airbrushes, digital photo editors, and vector illustration software were all revolutionary in their day. Many advances have gone hand-in-hand with improvements in the mass production of artistic works—a development that some critics once thought would diminish the value of art. Each advance allowed for expressing ideas faster and communicating them more broadly. Image synthesis builds upon that tradition by reducing friction between idea and execution.

But faster isn’t always better, according to Wood, who prefers the satisfying and often therapeutic process of creating art by hand.

“I have heard the argument in favor of [AI art] is that it will save time and that it allows people without artistic training to make art they’re proud of and bring their creativity to life,” says Wood. “My response to that is that time spent creating is not wasted time. It’s incredibly important time that is nourishing to the soul.” And she emphasized the handmade element of her art: “Everyone should create art with their hands! [AI art] is not giving people a skill, it’s taking away a more joyful experience they could have physically creating art, if they could be humble enough to enjoy the process of learning.”

While AI art can feel like a continuation of previous advances in art technology, this time feels different because we’re outsourcing a portion of the creative process to a machine. Prior to that, ideas only came from people—whether your own brain, in conversation, in books, or through audiovisual communications. Now, machines also have ideas. It’s the first step in an uncharted type of partnership.

“Taste is the new skill”—AI art as an accessibility tool

As wholesome as handmade art sounds, not everyone can physically create art due to mental or physical limitations. Over the past year, we’ve heard from several people with disabilities who enjoy using image synthesis to express themselves in ways they could not otherwise. You can find their stories through searches on Reddit or social media.

“I have stage 4 cancer and aiart [sic] actually gave me reason to keep fighting it,” wrote a user named bodden3113 on Reddit in December 2022. “I no longer have to wait till I’m not doing chemo to figure out how to draw hair or decide if i should even learn if i won’t live long enough to see on paper what i had in mind… Why do we have to do things the same way it was always done?”

One artist, Claire Silver, has been using AI art collaboratively since 2018 and gained renown from it, becoming the first AI-augmented artist to sign with the WME talent agency. “I have a chronic, disabling illness, an experience that has galvanized my love for augmenting skill in favor of expression,” she told Ars Technica in an interview. “I grew up in poverty and have changed my family’s life with my AI art.”

“Skins,” a collaborative AI-human artwork created by Claire Silver. Credit: Claire Silver

Silver sees most criticism of AI art as being shortsighted. The technology exists, and it will have an undeniable impact, both positive and negative, based on how it is used. “AI is transformative in the same way that cavemen discovering fire was transformative. Fire isn’t good or bad. It just is. It’s a homo-sapiens-sapiens moment for our species, and for better and worse, we can’t go back to the darkness.”

Given technology’s evolving role in augmenting human art throughout history, perhaps we’ve confused the meaning of the process with the meaning of the content. “In my opinion, adapting in the age of AI means finding what makes us human and leaning into it,” says Silver. “Artists have learned that skill is what matters, and while admirable, skill has been king for millennia. There’s room for new perspectives.”

Perhaps our focus on the value of time-consuming labor and a high degree of artistic skill unduly keeps people from expressing themselves, she wonders. “Art isn’t about skill for me. It’s about emotion, imagination, meaning—the things that make us human,” she says. “Is a future where we value those things over technical ability really so terrible? Given an infinite answer machine, those with the imagination and discernment to ask the right questions will succeed. With AI, taste is the new skill.”

Aurich Lawson sees the issue of skill in AI art differently, with some being better at using generative tools than others. “There may be a new field of people who are AI wranglers, using a lot of tools and experience to navigate those waters, but they’re ultimately going to be developing skill sets and processes much like anyone who is a pro in tools like Photoshop.”

AI art and jobs

Even if AI art makes human expression effortless, to borrow from the fire analogy, it can potentially destroy just as easily as it can create. “The Industrial Revolution ended a lot of jobs. It also created a lot of new ones,” says Silver. “That’s the nature of progress. You adapt.”

Technology has always made commercial production cheaper and faster. If it’s cheaper for a company to type words into a machine and wait 10 seconds than to hire a human artist who takes six months and costs $20,000, it will likely pick the machine every time.

“I feel very lucky I’m not a corporate artist. I sell directly to individuals who appreciate that my art is made the way I make it,” traditional artist Gwendolyn Wood says. “The jobs of folks selling art to public companies focused on short-term profit could easily dry up because of this.”

An AI-generated DALL-E 3 image of an artist sitting by while a robot paints for him. Credit: DALL-E 3 / Benj Edwards

Ars Technica’s Lawson agrees. “I think if your job is just being an illustrator, you have clients who aren’t terribly picky about results and just want to fill space—you’re probably right to feel threatened,” says Lawson. But he still sees a place for humans in the loop. “I do think the abilities of the AI generation are still being overstated, in that working with real clients and the cycles of commercial work aren’t served by just being able to get a polished-looking result,” he says. “When you’re going through approvals and making changes, prompts aren’t going to save you.”

Are we ready, structurally, for the changes that replacing creative humans might bring? Wood thinks the technology could potentially benefit people, but right now, “our society doesn’t have the infrastructure to support the reduction in paid work that will result from it,” she says. “In another world, where people were guaranteed comfort and free time, and the ability to make rent, it’s a damn interesting technology.”

A potential boon for the public domain

In the United States, purely AI-generated art cannot currently be copyrighted and exists in the public domain. It’s not cut and dried, though, because the US Copyright Office has supported the idea of allowing copyright protection for AI-generated artwork that has been appreciably altered by humans or incorporated into a larger work.

For some companies that may utilize AI art in the future, lack of copyright protection may not be a problem because AI-generated artwork may serve a specific commercial advertising purpose that would be worthless to duplicate, or it may incorporate content that is protected by trademark law. Currently, DALL-E 3 attempts to block commercial intellectual property from inclusion in generations, but open source image synthesis models like Stable Diffusion can work around those issues.

For everyone else, there’s suddenly a huge new pool of public domain media to work with, and it’s often “open source”—as in, many people share the prompts and recipes used to create the artworks so that others can replicate and build on them. That spirit of sharing has been behind the popularity of the Midjourney community on Discord, for example, where members’ prompts are typically visible to everyone.

An AI-generated illustration created by DALL-E 3 that showcases the concept of “Open Source Creative Media.” Credit: DALL-E 3 / Benj Edwards

When several mesmerizing AI-generated spiral images went viral in September, the AI art community on Reddit quickly built off of the trend since the originator detailed his workflow publicly. People created their own variations and simplified the tools used in creating the optical illusions. It was a good example of what the future of an “open source creative media” or “open source generative media” landscape might look like (to play with a few terms).

Someday, this reality may be modified by statute or judicial action, but until then, AI artwork may serve as an antidote to a copyright system that some see as onerous and overly restrictive. On the other hand, as we’ve reported, others argue that AI artwork should be subject to copyright protection. The issue has not been fully resolved.

In the future, everyone may be a creative director—or “CEO”

It’s easy to be negative about new things, but perhaps there is another path. Instead of AI making human artists extinct, an artist could use AI-driven capabilities to build new complexities into their works. For example, creative humans may someday command armies of creative AI agents to execute their visions, much like Andy Warhol relied upon underlings at The Factory to execute his famous artworks. Instead of being replaced, artists might become more potent.

If progress in AI continues along the path we’re seeing, it’s possible that every creative person at home with an AI-running machine will be able to command labor resources like the CEO of a major creative company today if they know how to wield it. In this hypothetical scenario, each human could spin up a large artificial workforce beneath them, doing their bidding. That raises the stakes of human potential to an almost unfathomable new level, and it’s difficult to predict where that kind of capability will take us as a civilization. But it’s the next layer of complexity building upon the last, as we’ve always done throughout history.

For thousands of years, we’ve told ourselves that we as humans are unique and special among animals because we are creative—we are toolmakers. We have language and grammar. We can reason. We’ve seen in the past year that our place as the center of the intelligent universe is no longer assured, seemingly being chipped away month by month due to new machine learning research. It’s been a Copernican moment, akin to the demotion of the Earth from the center of the universe. It doesn’t sit well with everyone. “It almost feels like [AI] developers are comfortable crushing the joy of human creation for some reason I genuinely can’t fathom,” says Wood.

An AI-generated illustration depicting AI enhancing human potential, created by DALL-E 3. Credit: DALL-E 3 / Benj Edwards

Heady predictions aside, some artists like Lawson still see limitations ahead. “I’ve been doing commercial design for decades, and my job is ultimately about problem-solving,” says Lawson. “AI can’t do that. And I don’t see a future where it can right now. Not without some absolute generational leap. Digital tools have traditionally both empowered people and made things easier and faster. I have no doubt AI will continue that trend. Some jobs will take fewer people, or people will need to adapt to new realities. But I don’t see AI creating an apocalypse in creative fields. The human touch hasn’t been replaced by anything we’ve seen so far.”

That takes us back to DALL-E 3. While its output isn’t perfect, its ability to quickly combine cultural references across the scope of human history already feels superhuman. Right now, DALL-E 3 is capable enough for personal creative entertainment (if your idea of fun is generating images of fake 1980s consumer products) and illustrating posts on social media, and it can likely replace basic human-created illustrations for simple tasks (as seen in this VentureBeat article).

Perhaps the most eye-opening realization about DALL-E 3 is that it’s not hard to see a future where the current imperfections are ironed out and you get AI agents that can easily generate visual imagery in any style that is completely indistinguishable from what a human could create. This will have profound side effects, such as accelerating disinformation, enabling abuse, making us question our shared cultural reality, and potentially threatening the historical record, as I have discussed elsewhere. As images are generated in unlimited quantity and fidelity, there will someday likely be far more indistinguishable fake photos of historical events than real ones.

And DALL-E 3 isn’t the only game in town. Stable Diffusion, Adobe Firefly, and Midjourney are all continuously improving in quality and prompt fidelity over time.

Whatever comes our way, we’ll likely have to take the good with the bad and deal with each appropriately. In 1999, French cultural theorist Paul Virilio wrote, “When you invent the ship, you also invent the shipwreck; when you invent the plane you also invent the plane crash; and when you invent electricity, you invent electrocution… Every technology carries its own negativity, which is invented at the same time as technical progress.”

Listing image: DALL-E 3 / Benj Edwards

Benj Edwards Senior AI Reporter
Benj Edwards was a reporter at Ars Technica covering artificial intelligence and technology history.