Apple, Nvidia, and other major companies caught up in AI training controversy over YouTube-sourced data
Several major tech companies, including Apple, Nvidia, Salesforce and Anthropic, are embroiled in a new controversy over AI training data. According to a report published by Proof News, a dataset these companies used to train their in-house AI models includes subtitles from YouTube videos.
The dataset, titled "YouTube Subtitles", was created by EleutherAI and published in 2020. It contains subtitles from 173,536 YouTube videos taken from more than 48,000 different channels.
The problem is that the dataset appears to violate YouTube's terms and conditions, which prohibit accessing videos by "automated means." According to Proof News, YouTube Subtitles is a 5.7 GB training dataset (489 million words) and includes subtitles from more than 12,000 videos that have since been removed from the platform. Notably, the dataset contains subtitles from videos by many well-known YouTube creators with large subscriber counts:
Proof News found material from popular YouTube creators, including MrBeast (289 million subscribers, 2 videos), Marques Brownlee (19 million subscribers, 7 videos), Jacksepticeye (nearly 31 million subscribers, 377 videos) and PewDiePie (111 million subscribers, 337 videos). Some of the material used to train AI contains inappropriate content, including conspiracy theories.
The "YouTube Subtitles" dataset is in fact part of a larger collection called "The Pile", which includes several other training datasets. Most of the Pile's datasets are open to anyone with enough storage space and computing power to access them.
The companies named did not respond to press requests for comment on the findings and the allegations about the use of unlicensed training data. Proof News searched through online posts and white papers to find evidence and determine whose creative material was used to train which specific AI models. However, it is practically impossible to compile a complete list of companies using this dataset, since AI companies do not typically disclose the data they use to train their models.
Marques Brownlee, one of the creators whose content was used without permission, said he paid to use the captioning feature on YouTube. In his view, it is therefore a "blatant violation" for companies to use this type of data without permission or payment.
Note that Apple and the other tech companies did not download the subtitles themselves; they trained their AI models on a dataset that already contained them. Even so, the episode is an example of the unintended consequences of AI. Some creators say they are worried that AI could be used to mimic their content in the future.