AI tools are running out of training data

Artificial intelligence (AI) needs training data, but that data is limited. So, how else can AI be trained so that it continues to develop and be useful to us?

You might think that the Internet and its data are an endless resource, but the truth is that AI tools are running out of data to exploit. However, that won't stop AI development; there's plenty of data still available to train AI systems.

1. There is always more data being added online

In short, AI research institute Epoch says the high-quality data on which AI is being trained could run out by 2026.

The key word here is "could". The amount of data added to the Internet increases every year, so something drastic could change before 2026. Still, it's a reasonable estimate: no matter what, AI systems will run out of good data at some point.

However, keep in mind that approximately 147 zettabytes of data are added online each year. One zettabyte is equivalent to 1,000,000,000,000,000,000,000 bytes of data; to put that in perspective, a single zettabyte is roughly 30 billion 4K movies. That's an incredible amount of information for AI to sift through.
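
For a rough sense of scale, here is a back-of-the-envelope sketch in Python. The assumption that a 4K movie takes up around 30 GB is an illustrative figure of mine, not something from the article.

```python
# Back-of-the-envelope scale check for the figures above.
# Assumption (not from the article): one 4K movie is roughly 30 GB.
ZETTABYTE_BYTES = 10 ** 21           # bytes in one zettabyte
MOVIE_BYTES = 30 * 10 ** 9           # assumed size of a single 4K movie (~30 GB)

yearly_data = 147 * ZETTABYTE_BYTES  # ~147 ZB added online each year

movies_per_zettabyte = ZETTABYTE_BYTES // MOVIE_BYTES
movies_per_year = yearly_data // MOVIE_BYTES

print(f"One zettabyte  ~ {movies_per_zettabyte:,} 4K movies")   # ~33 billion
print(f"147 zettabytes ~ {movies_per_year:,} 4K movies")        # ~4.9 trillion
```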

However, AI consumes data faster than humanity can produce it…

2. AI can forget low-quality data


Of course, not all of those 147 zettabytes are good data. It is estimated that AI could also exhaust low-quality language data by 2050.

Reuters reports that Photobucket, once one of the world's largest photo repositories, is in talks to license its vast library to AI training companies. Image data has trained systems like DALL-E and Midjourney, but even that could run out by 2060. There's a bigger problem here: Photobucket is full of images from 2000s-era social networking platforms like Myspace, meaning they weren't captured to the same standards as today's photography. That makes them low-quality data.

Photobucket is not an isolated case. In February 2024, Google reached an agreement with Reddit, allowing the search giant to use the social media platform's user data in its AI training. Other social media platforms are also providing user data for AI training purposes; some are using it to train internal AI models, such as Meta's Llama.

However, while some information can be gleaned from low-quality data, Microsoft is said to be developing a way for AI to selectively "discard" what it has learned. Essentially, this is aimed at intellectual property (IP) disputes, but it also means that tools could forget what they learned from low-quality data sets.

We can feed AI more data without being too selective; those AI systems can then decide what is most beneficial to learn from.

3. Voice recognition opens up video and podcast data

Data fed to AI tools to date has primarily consisted of text and, to a lesser extent, images. That's bound to change, as voice recognition software will mean that the countless videos and podcasts available today can also be used to train AI.

Notably, OpenAI developed an open-source automatic speech recognition (ASR) neural network, Whisper, trained on 680,000 hours of multilingual and multitask data. OpenAI is then reported to have transcribed more than a million hours of YouTube videos and fed that text into its large language model, GPT-4.

This is ideal for other AI systems that use voice recognition to transcribe video and audio from multiple sources and run that data through their AI models.
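
To make that concrete, here is a minimal sketch using the open-source Whisper package mentioned above. The file names, the "base" model size, and the idea of appending transcripts to a text corpus are illustrative assumptions, not anyone's actual pipeline.

```python
# Minimal sketch: turning speech into text that could later join a training corpus.
# Requires: pip install openai-whisper (plus ffmpeg on the system).
# "podcast_episode.mp3" and "corpus.txt" are placeholder names, not a real dataset.
import whisper

model = whisper.load_model("base")                # small general-purpose Whisper model
result = model.transcribe("podcast_episode.mp3")  # decode the audio into text
transcript = result["text"]

# Append the transcript to a growing text corpus for later training or fine-tuning.
with open("corpus.txt", "a", encoding="utf-8") as f:
    f.write(transcript.strip() + "\n")
```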

According to Statista, more than 500 hours of video are uploaded to YouTube every minute, a number that has remained fairly stable since 2019. That's not to mention other video and audio platforms like Dailymotion and Podbean. If AI can turn its attention to new data sets like this, there will be a huge amount of information left to mine.

4. AI is largely stuck in the English language

OpenAI trained Whisper using 117,000 hours of non-English audio data. This is especially interesting because many AI systems have been trained primarily in English, or view other cultures through a Western lens.

By their nature, most tools are limited by the culture of their creators.

Take ChatGPT for example. Shortly after its release in 2022, Jill Walker Rettberg, professor of Digital Culture at the University of Bergen, Norway, tried ChatGPT and concluded:

'ChatGPT doesn't know much about Norwegian culture. Or rather, whatever it knows about Norwegian culture was probably learned mainly from English sources… ChatGPT is clearly aligned with US values and laws. In many cases these values are close to Norwegian and European values, but perhaps this is not always the case'.

AIs can then evolve as people from more countries interact with them, or as more diverse languages and cultures are used to train such systems.

Currently, many AIs are limited to a single library; they could flourish if given the keys to libraries around the world.

5. Publishers can help develop AI


IP is clearly a big issue, but some publishers can help develop AI by signing licensing agreements. This means providing the tools with high-quality, reliable data from books instead of low-quality information scraped from online sources.

In fact, Meta, owner of Facebook, Instagram and WhatsApp, is said to have considered buying Simon & Schuster, one of the "Big Five" publishers. The idea was to use material published by the company to train Meta's own AI. The deal ultimately fell through, perhaps over the ethics of handling that IP without the writers' prior consent.

Another option that has clearly been considered is purchasing individual licensing rights for new titles. This will cause great concern for creators, but it would still be a plausible way for AI tools to keep developing if existing data sources run dry.

6. Synthetic data is the future

All other solutions are still limited, but there is one option that could help AI thrive in the future: synthetic data. And it is a very real possibility.

So what is synthetic data? Put simply, it is data generated by AI itself; just as humans generate data, this approach sees artificial intelligence generate data for training purposes.

For example, AI can create a convincing deepfake video. That deepfake video can then be fed back to the AI so it learns from what is essentially an imagined scenario. After all, that's a primary way humans learn: we read or watch something to understand the world around us.
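
To show what generating synthetic text can look like in practice, here is a minimal sketch using Hugging Face's transformers library with GPT-2 as a small, freely available stand-in model. The model choice, the prompts and the output file are illustrative assumptions; this is not how any particular vendor actually builds synthetic data sets.

```python
# Minimal sketch: an AI model generating synthetic text that could later be used
# as training data. Requires: pip install transformers torch.
# GPT-2, the prompts, and the output file are illustrative assumptions only.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompts = [
    "The history of the printing press began",
    "A simple explanation of photosynthesis is",
]

with open("synthetic_corpus.txt", "w", encoding="utf-8") as f:
    for prompt in prompts:
        samples = generator(
            prompt,
            max_new_tokens=60,        # keep each synthetic snippet short
            do_sample=True,           # sampling so repeated calls vary
            num_return_sequences=2,   # two synthetic variants per prompt
        )
        for sample in samples:
            f.write(sample["generated_text"].strip() + "\n")
```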

AI may already be consuming synthetic data without meaning to. Deepfakes have spread false information online, so when AI systems scan the Internet, ingesting fake content is inevitable. That can corrupt or limit AI, reinforcing and propagating the mistakes made by those tools.

AI is controversial. Despite its many disadvantages, it still has benefits. For example, auditing and consulting firm PwC estimates that AI could contribute up to $15.7 trillion to the world economy by 2030.

Furthermore, AI is already being used worldwide. You're probably already using it in one form or another today, perhaps without even realizing it. Now, it's important to train it on quality, reliable data so we can use it properly.
