AI tools are running out of training data

Artificial intelligence (AI) needs training data, but that data is limited. So, how else can AI be trained so that it continues to develop and be useful to us?

You might think that the Internet and its data are an endless resource, but the truth is that AI tools are running out of data to exploit. However, that won't stop AI development; there's plenty of data still available to train AI systems.

1. There is always more data being added online

In short, AI research institute Epoch says the high-quality data on which AI is being trained could run out by 2026.

The key word here is "could". The amount of data added to the Internet increases every year, so something drastic could change before 2026. Still, it's a reasonable estimate: no matter what, AI systems will run out of good data at some point.

However, keep in mind that approximately 147 zettabytes of data are added online each year. One zettabyte is equivalent to 1,000,000,000,000,000,000,000 bytes of data; to put that in perspective, a single zettabyte is roughly 30 billion 4K movies. That's an incredible amount of information for AI to sift through.
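
For a rough sense of scale, here is a back-of-the-envelope sketch in Python. The assumption that a 4K movie takes up around 30 GB is an illustrative figure of mine, not something from the article.

```python
# Back-of-the-envelope scale check for the figures above.
# Assumption (not from the article): one 4K movie is roughly 30 GB.
ZETTABYTE_BYTES = 10 ** 21           # bytes in one zettabyte
MOVIE_BYTES = 30 * 10 ** 9           # assumed size of a single 4K movie (~30 GB)

yearly_data = 147 * ZETTABYTE_BYTES  # ~147 ZB added online each year

movies_per_zettabyte = ZETTABYTE_BYTES // MOVIE_BYTES
movies_per_year = yearly_data // MOVIE_BYTES

print(f"One zettabyte  ~ {movies_per_zettabyte:,} 4K movies")   # ~33 billion
print(f"147 zettabytes ~ {movies_per_year:,} 4K movies")        # ~4.9 trillion
```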

However, AI consumes data faster than humanity can produce it…

2. AI can forget low-quality data


Of course, not all of those 147 zettabytes are good data. It is estimated that AI could also exhaust low-quality language data by 2050.

Reuters reports that Photobucket, once one of the world's largest photo repositories, is in talks to license its vast library to AI training companies. Image data has trained systems like DALL-E and Midjourney, but even that could run out by 2060. There's a bigger problem here: Photobucket is full of images from 2000s-era social networking platforms like Myspace, meaning they weren't captured to the same standards as today's photography. That makes them low-quality data.

Photobucket is not an isolated case. In February 2024, Google reached an agreement with Reddit, allowing the search giant to use the social media platform's user data in its AI training. Other social media platforms are also providing user data for AI training purposes; some are using it to train internal AI models, such as Meta's Llama.

However, while some information can be gleaned from low-quality data, Microsoft is said to be developing a way for AI to selectively "discard" what it has learned. Essentially, this is aimed at intellectual property (IP) disputes, but it also means that tools could forget what they learned from low-quality data sets.

We can feed AI more data without being too selective; those AI systems can then decide what is most beneficial to learn from.

3. Voice recognition opens up video and podcast data

Data fed to AI tools to date has primarily consisted of text and, to a lesser extent, images. That's bound to change, as voice recognition software will mean that the countless videos and podcasts available today can also be used to train AI.

Notably, OpenAI developed an open-source automatic speech recognition (ASR) neural network, Whisper, trained on 680,000 hours of multilingual and multitask data. OpenAI is then reported to have transcribed more than a million hours of YouTube videos and fed that text into its large language model, GPT-4.

This is ideal for other AI systems that use voice recognition to transcribe video and audio from multiple sources and run that data through their AI models.
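
To make that concrete, here is a minimal sketch using the open-source Whisper package mentioned above. The file names, the "base" model size, and the idea of appending transcripts to a text corpus are illustrative assumptions, not anyone's actual pipeline.

```python
# Minimal sketch: turning speech into text that could later join a training corpus.
# Requires: pip install openai-whisper (plus ffmpeg on the system).
# "podcast_episode.mp3" and "corpus.txt" are placeholder names, not a real dataset.
import whisper

model = whisper.load_model("base")                # small general-purpose Whisper model
result = model.transcribe("podcast_episode.mp3")  # decode the audio into text
transcript = result["text"]

# Append the transcript to a growing text corpus for later training or fine-tuning.
with open("corpus.txt", "a", encoding="utf-8") as f:
    f.write(transcript.strip() + "\n")
```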

According to Statista, more than 500 hours of video are uploaded to YouTube every minute, a number that has remained fairly stable since 2019. That's not to mention other video and audio platforms like Dailymotion and Podbean. If AI can turn its attention to new data sets like this, there will be a huge amount of information left to mine.

4. AI is largely stuck in the English language

OpenAI trained Whisper using 117,000 hours of non-English audio data. This is especially interesting because many AI systems have been trained primarily in English, or view other cultures through a Western lens.

By their nature, most tools are limited by the culture of their creators.

Take ChatGPT for example. Shortly after its release in 2022, Jill Walker Rettberg, professor of Digital Culture at the University of Bergen, Norway, tried ChatGPT and concluded:

'ChatGPT doesn't know much about Norwegian culture. Or rather, whatever it knows about Norwegian culture was probably learned mainly from English sources… ChatGPT is clearly aligned with US values and laws. In many cases these values are close to Norwegian and European values, but perhaps this is not always the case'.

AIs can then evolve as people from more countries interact with them, or as more diverse languages and cultures are used to train such systems.

Currently, many AIs are limited to a single library; they could flourish if given the keys to libraries around the world.

5. Publishers can help develop AI


IP is clearly a big issue, but some publishers can help develop AI by signing licensing agreements. This means providing the tools with high-quality, reliable data from books instead of low-quality information scraped from online sources.

In fact, Meta, owner of Facebook, Instagram and WhatsApp, is said to have considered buying Simon & Schuster, one of the "Big Five" publishers. The idea was to use material published by the company to train Meta's own AI. The deal ultimately fell through, perhaps over the ethics of handling that IP without the writers' prior consent.

Another option that has clearly been considered is purchasing individual licensing rights for new titles. This will cause great concern for creators, but it would still be a plausible way for AI tools to keep developing if existing data sources run dry.

6. Synthetic data is the future

All other solutions are still limited, but there is one option that could help AI thrive in the future: synthetic data. And it is a very real possibility.

So what is synthetic data? Put simply, it is data generated by AI itself; just as humans generate data, this approach sees artificial intelligence generate data for training purposes.

For example, AI can create a convincing deepfake video. That deepfake video can then be fed back to the AI so it learns from what is essentially an imagined scenario. After all, that's a primary way humans learn: we read or watch something to understand the world around us.
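
To show what generating synthetic text can look like in practice, here is a minimal sketch using Hugging Face's transformers library with GPT-2 as a small, freely available stand-in model. The model choice, the prompts and the output file are illustrative assumptions; this is not how any particular vendor actually builds synthetic data sets.

```python
# Minimal sketch: an AI model generating synthetic text that could later be used
# as training data. Requires: pip install transformers torch.
# GPT-2, the prompts, and the output file are illustrative assumptions only.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompts = [
    "The history of the printing press began",
    "A simple explanation of photosynthesis is",
]

with open("synthetic_corpus.txt", "w", encoding="utf-8") as f:
    for prompt in prompts:
        samples = generator(
            prompt,
            max_new_tokens=60,        # keep each synthetic snippet short
            do_sample=True,           # sampling so repeated calls vary
            num_return_sequences=2,   # two synthetic variants per prompt
        )
        for sample in samples:
            f.write(sample["generated_text"].strip() + "\n")
```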

AI may already be consuming synthetic data without meaning to. Deepfakes have spread false information online, so when AI systems scan the Internet, ingesting fake content is inevitable. That can corrupt or limit AI, reinforcing and propagating the mistakes made by those tools.

AI is controversial. Despite its many disadvantages, it still has benefits. For example, auditing and consulting firm PwC estimates that AI could contribute up to $15.7 trillion to the world economy by 2030.

Furthermore, AI is already being used worldwide. You're probably already using it in one form or another today, perhaps without even realizing it. Now, it's important to train it on quality, reliable data so we can use it properly.
