Automation and Accessibility: Imagining What’s Possible for Alt Text

POV: It’s 2023. Technology is developing at what feels like the speed of light, self-driving cars are basically a reality, and generative AI is taking the internet by storm. Incredible, right? Yes…and no. As mind-blowing as some of this progress is, our “do now, ask questions later” approach to tech development comes with a cost, one that can be counted in the millions (of people) and billions (of dollars).

We’re talking, of course, about accessibility, and web accessibility is no exception. Despite 1 in 6 people in the world having a disability (source: World Health Organization), and despite people with disabilities and their loved ones having over $13 trillion in annual disposable income (source: Return on Disability Group), over 96% of the top 1 million web pages are not accessible (source: WebAIM).

Clearly this is a problem of massive scale, one that requires solutions that can likewise tackle it at scale, i.e., technology. For image alt text, generative AI tools like ChatGPT and Google Bard have seemed to be the answer we desperately need. (Note: Generative AI, which aims to intelligently create new content, is the newer generation of AI development; traditional AI assistants like Siri and Amazon Alexa, which intelligently perform tasks within a given set of parameters, have existed much longer.) Simply ask ChatGPT to write alt text for your image (or all your images), then copy and paste, and you’re done!

One problem: Many companies (Microsoft, Google, Meta, etc.) have already implemented generative AIs for auto-generating alt text, and it’s still bad. A lot, if not most, of the existing alt text on the web fails to satisfy WCAG requirements for providing an equivalent text alternative—aka the entire point of WCAG SC 1.1.1.

Alt Text: Person wearing a colorful smock uses a wide brush to paint a horizontal blue rectangle onto a large white canvas leaning against a white wall. A text box displays auto-generated alt text, which says, “A person painting a wall.”

The word “equivalent” is important here. Alt text exists for the purpose of providing an alternative method—text—of accessing the information contained within an image. If it doesn’t provide that information, then it’s not equivalent, and by definition, it’s not even alt text. It’s just…text. Noise. Useless.

In our 2023 E-Commerce Content Accessibility Report, we analyzed the image accessibility of the top 100 e-commerce sites and found:

For home pages, 27 had some descriptive alt text, but only 1 had meaningful and descriptive alt text for all images.
For products, 97% had inaccurate, incomplete, or missing alt text, and only 1 had high-quality alt text for every image.
For social media, 12% of accounts had descriptive alt text on Twitter, and 3% had descriptive alt text on Instagram.
Ultimately, not a single company possessed a home page with fully accessible content.

So why haven’t we made more progress on web accessibility? The reasons given by companies and teams can be grouped into five categories.

Awareness. Many aren’t aware of this issue, despite it being 2023.
Expertise. They don’t know how to implement accessibility, even though there are many experts for hire or guides and tutorials available online for free.
Technology. Some platforms don’t make accessibility easy, but this is quickly changing.
Budget. Money isn’t being allocated for accessibility like it is for copy, SEO, social media, etc.
Time. It takes a lot of time to accomplish…but so do lots of things that matter.

Whatever the reason may be, many top organizations with high budgets are using some form of alt text automation with minimal human review. But taking shortcuts doesn’t change the rules. Auto- and AI-generated alt text still needs to provide an equivalent text alternative for the image.

Providing an equivalent text alternative is the point of alt text, but most shortcuts don’t actually help you do that. This is problematic for accessibility. So let’s first look at how alt text is done wrong, then look at how it’s done right, and ultimately analyze the quality of results generated by different AIs on offer today.

Alt Text Nonsense

Here are the top five no-nos we see with alt text today. We’ll use the image here as our example.

Alt Text: Portrait of a person wearing a pastel pink, blue, and yellow wig and bold red and orange eye makeup as they prop their head up with one hand against their cheek and look at us with a wistful or wearied expression.

Keyword Stuffing. This is done primarily for SEO purposes. Example: <img alt="person, rainbow wig, bold, makeup.">
Overly Simplistic Descriptions. These tend to blend together and not be very distinguishable from one image to the next. Example: <img alt="person with colorful hair.">
Image File Name. Does not describe the image at all. Example: <img alt="5tzy3850pluo908mcd43.jpg">
Formulaic Descriptions. Something along the lines of “Product + title + color.” Example: <img alt="Long Wavy Rainbow Wig, Heat Resistant.">
AI-Generated. Typically this is based on object recognition, so it’s often inaccurate or incomplete. Example: <img alt="May be an image of joy, sorrow, headwear.">

If any of these descriptions were read aloud to you, would you be able to form a mental picture that’s anywhere close to the actual image? Of course not. Yet these types of descriptions are only becoming more and more common, (no) thanks to automation.

Here's Scribely's alt text for this image for comparison: "Portrait of a person wearing a pastel pink, blue, and yellow wig and bold red and orange eye makeup as they prop their head up with one hand against their cheek and look at us with a wistful or wearied expression."
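
Some of these patterns are mechanical enough to flag automatically. The following is a minimal Python sketch under stated assumptions: the function name `flag_alt_text`, the regular expressions, and the word-count thresholds are all hypothetical choices made for illustration, and passing these checks says nothing about whether a description is actually equivalent to its image.

```python
import re
from typing import List

# Hypothetical heuristics; the patterns and thresholds are illustrative assumptions.
FILE_NAME = re.compile(r"^\S+\.(jpe?g|png|gif|webp|svg)$", re.IGNORECASE)
AUTO_PREFIX = re.compile(r"^may be an image of\b", re.IGNORECASE)


def flag_alt_text(alt: str) -> List[str]:
    """Return warnings for a few common alt text anti-patterns."""
    text = alt.strip()
    if not text:
        return ["missing"]

    warnings = []
    if FILE_NAME.match(text):
        warnings.append("file name")
    if AUTO_PREFIX.match(text):
        warnings.append("object-recognition output")

    body = text.rstrip(".")
    # Three or more comma-separated fragments of one or two words each
    # look like a keyword list rather than a description.
    parts = [p.strip() for p in body.split(",")]
    if len(parts) >= 3 and all(len(p.split()) <= 2 for p in parts):
        warnings.append("keyword list")

    # Fewer than five words rarely distinguishes one image from another.
    if len(body.split()) < 5 and "file name" not in warnings:
        warnings.append("very short")
    return warnings
```

Run against the five examples above, this sketch flags the keyword list, the short description, the file name, and the object-recognition output, but returns no warnings for the formulaic "Long Wavy Rainbow Wig, Heat Resistant." That gap is the point: a description can look well formed and still fail to describe the image, so a check like this can only catch the worst cases before a human reviews the rest.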

How to Write Equivalent Text Alternatives

Because alt text is a short, meaningful description that captures the “why” of an image, exceptional alt text describes the essence of what makes an image interesting and distinctive.

We recommend keeping these descriptions to fewer than 250 characters so that they get straight to the point without adding unnecessary elements, just like our brains do when we see images on the web.

So what’s the secret to writing great alt text? It’s this: Don’t write. Instead, pause and consider the following first:

Context, or everything that surrounds the image. Text, images, the page, the audience.
Purpose, or the role the image plays for the audience. Why is the image here in the first place?
Meaning, or the message the image sends to the audience. E.g. Look at this, buy the thing, like me.

(And if you’re ever stuck in dreaded writer’s block, this will help you get through it. You can’t skip these steps!)

Once you’ve done that, you’re ready to write. Use the following to help you structure your alt text:

Identify. Classify the image to help users visualize the type right away (do this for everything except photos; if there’s no ID, it’s assumed that it’s a photo).
Focus. Describe the main subject of the image (i.e., person, place, object, effect).
Details. Add meaningful and relevant visual details that make the image unique.

Finally, always read the description to yourself. Does the alt text sufficiently describe the image?
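
For teams that store image descriptions as structured fields, the Identify, Focus, Details order and the 250-character guideline can be sketched in a few lines of Python. This is an illustration only: `build_alt_text`, its parameters, and `MAX_CHARS` are hypothetical names, and the code just assembles and measures text that a person has already written.

```python
MAX_CHARS = 250  # the "fewer than 250 characters" guideline from this article


def build_alt_text(focus: str, details: str = "", image_type: str = "") -> str:
    """Assemble a description in Identify, Focus, Details order."""
    focus, details, image_type = focus.strip(), details.strip(), image_type.strip()
    if not focus:
        raise ValueError("focus (the main subject) is required")

    # Identify: photos get no label; any other image type leads the description.
    lead = f"{image_type} of {focus}" if image_type else focus
    parts = [lead[0].upper() + lead[1:]]

    # Details: appended after the main subject when provided.
    if details:
        parts.append(details)

    text = " ".join(parts).rstrip(".") + "."
    if len(text) >= MAX_CHARS:
        raise ValueError(f"{len(text)} characters; aim for fewer than {MAX_CHARS}")
    return text


print(build_alt_text(
    focus="a person wearing a pastel pink, blue, and yellow wig",
    details="as they look at us with a wistful or wearied expression",
    image_type="Portrait",
))
```

The helper enforces order and length but not quality. Deciding what the focus and the meaningful details are still depends on the context, purpose, and meaning questions above.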

Automated Alt Text “Shortcuts”

Automating the alt text process, or generating alt text according to a preset formula or algorithm, can be tempting (just set it and forget it!). But as we’ve discovered, the technology currently doesn’t deliver a final product that meets standards, and most of the options are geared toward boosting SEO with keywords, not accessibility. In other words, they hijack accessibility, which is meant to improve user experience (UX), as a means for garnering more attention.

Alt Text: A white robot vacuum is illuminated by a sunbeam as it cleans a floor of natural wood tones laid in a herringbone pattern. A small desk and chair are visible in the background against a white wall.

But here’s a question that, depending on how we answer it, changes how we approach accessibility: Is “better than nothing” actually good enough?

If it is, then having something (anything) in the alt attribute is better than leaving it blank. Let’s test this using the robot vacuum image above.

CMS platforms (e.g., WordPress, HubSpot, Shopify) often offer alt text tools to make it easier to generate descriptions for images as they’re uploaded. These tools often follow a generic formula, e.g., “image file name + site description + page title.” Users can often integrate with SEO plugins to then incorporate a primary keyword or phrase.

This yields something like:

Formula: Product Name + Number in Sequence
Image 1: “Robot vacuum – 1”
Image 2: “Robot vacuum – 2”
Image 3: “Robot vacuum – 3”
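
Writing the formula out makes the limitation plain. The Python sketch below uses a hypothetical helper, `formula_alt_text`, which stands in for this kind of rule and is not any platform's actual implementation. Note that the function never receives the image, only metadata.

```python
from typing import List


def formula_alt_text(product_name: str, image_count: int) -> List[str]:
    """Mimic a 'product name + number in sequence' alt text formula."""
    # Only the product name and a counter are used; image content is never inspected.
    return [f"{product_name} – {i}" for i in range(1, image_count + 1)]


print(formula_alt_text("Robot vacuum", 3))
# ['Robot vacuum – 1', 'Robot vacuum – 2', 'Robot vacuum – 3']
```

Swap in three completely different photos and the output is identical, which is why descriptions produced this way cannot be equivalent to the images they label.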

What’s your reaction? Would you be able to understand the images and their purpose by reading these descriptions? Better yet, would you feel ready to make a purchase?

At Scribely, we think that better than nothing is not good enough, for a few reasons.

First, it still doesn’t satisfy WCAG. Quick fix tools make it even easier to bulk produce alt text that doesn’t actually improve accessibility. That’s like ordering printer paper for your office and having a tree delivered instead. You received the raw material for paper, which has the potential of becoming paper (only with a lot of processing and work), so would you say, “Eh, it’s better than nothing”? No, because it’s not what you need.

Second, the problem just grows. Having bad alt text only means you still need good alt text. So instead of making a dent in the ever-growing problem, we’re wasting the resources we devote to bad alt text, while we continue to add inaccessible images by the millions every day. And that means, third, that we’re delaying the actual work. So yes, progress matters…as long as it’s actual progress.

The Current State of AI and Alt Text

As of this writing, there are three main types of AI alt text solutions. We’ll start with what we think is the least promising and work our way to the option that seems most promising.

Image Recognition + Confidence Scores

Alt Text: Screenshot displaying 7 objects, moods, and styles with their perceived likeliness, along with the roll, tilt, and pan of the subject and an overall confidence score of 62%. They are: Joy, very unlikely. Sorrow, unlikely. Anger, very unlikely. Surprise, very unlikely. Exposed, very unlikely. Blurred, very unlikely. Headwear, very unlikely.

This type of AI identifies objects, places, people, text, and actions in digital images, then assigns each one a confidence score.

This approach is useful for object identification and keywording, but it’s difficult to visualize just a list of words, and those confidence scores are often quite wrong.
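
To make the "list of words" problem concrete, here is a small Python sketch of what happens when label output is turned straight into alt text. The label names and scores are invented for illustration and do not come from any real API call; `labels_to_alt` and the 0.5 threshold are likewise assumptions.

```python
# Hypothetical (label, confidence) pairs shaped like typical image-recognition
# output. The values are invented for illustration only.
labels = [
    ("Hair", 0.94),
    ("Wig", 0.81),
    ("Person", 0.77),
    ("Headwear", 0.32),
]


def labels_to_alt(labels, threshold=0.5):
    """Keep labels at or above the threshold and join them into one string."""
    kept = [name.lower() for name, score in labels if score >= threshold]
    return ", ".join(kept)


print(labels_to_alt(labels))  # hair, wig, person
```

Whatever the threshold, the result is a keyword list. Nothing in the output says how the labeled things relate to one another or why the image is on the page.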

The top providers of this type are:

Google Cloud Vision API. Face and landmark detection, optical character recognition (OCR), object localization, explicit content.
Amazon Rekognition. Objects, scenes, activities, landmarks, faces, colors, image quality, text, celebrities, inappropriate content.
Clarifai. Identifies concepts in images including objects, themes, moods, trained with over 10K concepts and 20M images.

Ask yourself:

Does this word list provide a text equivalent?
Do you trust “very unlikely” and “62% confidence”?

Computer Vision + Natural Language Processing

This type provides AI-generated human-like image descriptions by utilizing:

Computer Vision, or AI that enables computers to derive information from images, videos, and other inputs.
Natural Language Processing (NLP), or ML technology that helps computers interpret, manipulate, and comprehend human language.
Image Captioning, or the intersection of computer vision and NLP. The image is encoded into features, then decoded into descriptive text.

The top providers of this type are:

Midjourney. Users enter the command “/describe,” then upload an image to receive four descriptions based on the image. Users can remix the image based on the description.

Microsoft Azure. “Caption” generates one-sentence descriptions, and “Dense Caption” generates one-sentence descriptions of up to ten regions of the image. This method keeps alt text short and succinct, but descriptions are often incomplete and are unable to describe images in context.

Alt Text: Overhead view of 4 glasses with different colored liquids on a table top with sunlight shining through. A text box shows the dense caption for the image: "A group of glasses with different color liquid. A close up of a glass. A glass with pink liquid in it. A glass with blue liquid. A glass of liquid on a saucer. A glass with orange liquid. A close up of a glass. A close up of a glass."

Alt Text: A tiny metal toy of a yellow school bus with 2 pieces of luggage strapped on top. Part of a life-size bus is blurred in the background. A text box displays an auto-generated caption, which says, "A toy bus with luggage on top."

Multimodal + Large Language Models (aka Generative AI)

Multimodal AI combines multiple types of data to make more accurate determinations and draw conclusions. A large language model (LLM) is a deep learning model trained on massive datasets to recognize, translate, or generate text in a human-like fashion.

The top providers of this type are:

ChatGPT. A conversational AI built on an LLM. Users rely on prompt engineering to steer it toward particular tasks and desired outputs.
Bing Chat. Similar to ChatGPT. Allows you to choose a conversation style (more creative, more balanced, or more precise).
Google Gemini. Newly released, Gemini is an assistant designed from the ground up to be multimodal with text, images, videos, and other data types. It’s powerful because Google has a wealth of data to train better models and produce diverse results.

We tested both ChatGPT and Bing Chat with our previous image of the person with the colorful wig and the prompt “write a description for this image.” On Bing Chat, we had to add a follow-up prompt asking for more details. The results are fascinating.

Bing Chat: A person with a blurred face and a cheerful expression has their hair dyed in various bright colors, such as pink, blue, and yellow. The hair falls over their shoulders in loose curls. They are wearing a blue tank top that matches one of the colors in their hair. The background is white and blurry.

ChatGPT: This is a portrait of a person who loves to express themselves through their hair. Their face is not visible, but their hair speaks volumes about their personality. They have a rainbow of colors in their hair, from pink to blue to yellow, creating a vibrant contrast with their blue tank top. Their hair is also styled in loose curls, adding some texture and movement to their look. This person is not afraid to stand out and show off their creative side.

Let’s look at the alt text ChatGPT generated for a screenshot and a chart, since those tend to contain a lot of information and can be difficult to describe well.

Alt Text: Screenshot of a search results page on the Shopify App Store for the term "alt text." Among the 15 resulting apps, the term "SEO" appears 25 times.

First is a screenshot of the results of searching “alt text” on the Shopify App Store. Here’s ChatGPT’s description:

A screenshot of a search result page showing 9 apps for optimizing image alt text for SEO. The apps have different icons, titles, and descriptions. The page has a white background, a search bar, a sign in button, and a footer with app store links.

Positives

Correct ID of screenshot

Negatives

Incorrect on number of apps in results (15, not 9)
Didn’t mention the search term used (relevant detail)
Missed the highlighted word “SEO” (relevant detail)

Next is a 3D pie chart showing responses to the question, “How does thinking about what might happen in the future due to the development of artificial intelligence make you feel?”

Alt Text: 3D pie chart showing distribution of answers to the question, "How does thinking about what might happen in the future due to the development of artificial intelligence make you feel?" The answers are: 26.6% confused, 10.3% indifferent, 28.2% optimistic, 11.2% enthusiastic, and 23.7% concerned.

Here’s the description from ChatGPT:

The image is a bar chart that shows the percentage of people who have different feelings about the development of artificial intelligence (AI). The chart has four categories: excited, optimistic, worried, and fearful. The data is based on a survey of 2,000 adults in the UK conducted by YouGov in 2018. The chart shows that:

14% of the respondents are excited about the development of AI, meaning they are looking forward to the benefits and possibilities that AI can bring to society.
36% of the respondents are optimistic, meaning they are hopeful that AI will have a positive impact on the world and improve human lives.
34% of the respondents are worried, meaning they are concerned about the potential risks and challenges that AI may pose to society, such as ethical issues, job losses, or social inequalities.
16% of the respondents are fearful, meaning they are afraid that AI will harm or threaten humanity, such as by becoming hostile, taking over, or destroying the world.

The chart also shows that more people have positive feelings (excited or optimistic) than negative feelings (worried or fearful) about the development of AI, with a total of 50% versus 50%. However, the difference is not very large, indicating that there is still a lot of uncertainty and mixed emotions among the public regarding AI.

The chart type (a bar chart instead of a pie chart), the categories, and the data points are all wrong and, therefore, the conclusion that "more people have positive feelings than negative feelings about the development of AI" is incorrect.
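
The mismatch is easy to verify by putting the two sets of numbers side by side. The Python sketch below copies the values from the chart's alt text and from ChatGPT's description above; the variable names are ours.

```python
# Values from the chart's alt text above.
actual = {
    "confused": 26.6,
    "indifferent": 10.3,
    "optimistic": 28.2,
    "enthusiastic": 11.2,
    "concerned": 23.7,
}
# Values from ChatGPT's description above.
described = {"excited": 14, "optimistic": 36, "worried": 34, "fearful": 16}

# Categories named in both, and those whose percentages also agree.
shared = sorted(set(actual) & set(described))
matching = [name for name in shared if actual[name] == described[name]]
print(shared, matching)  # ['optimistic'] []

# ChatGPT's own totals for positive versus negative feelings.
positive = described["excited"] + described["optimistic"]
negative = described["worried"] + described["fearful"]
print(positive, negative)  # 50 50
```

Only one category name overlaps with the real chart, and its percentage is wrong. Even by the description's own numbers, positive and negative feelings tie at 50%, so the stated conclusion does not follow from the data it invented.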

Clearly, this approach yields results that can be anywhere from basic to wildly interpretive. It also doesn’t disclose its confidence, which is pretty reckless. Ultimately, the biggest critique is that it may take more time and effort to hone and refine the prompting process than it would to just learn how to write alt text.

Takeaways

When it comes to generative AI and alt text, we still have a long way to go. Alt text must be equivalent (auto-formulaic doesn’t cut it and needs to stop!), which massively changes what we should expect from AI. Yes, we desperately need better solutions to the growing accessibility problem, but those solutions must actually deliver. With that in mind, we need to be honest about AI’s current strengths and weaknesses, realizing that it’s not yet ready to do what we want it to do. Let’s keep collecting and sharing data on generative AI in order to get it there. But in the meantime, don’t ditch the humans.