Apple, Nvidia, and other large companies caught up in AI training controversy over YouTube-sourced data

By Lesley Montoya Update 19 July 2024

Several major tech companies, including Apple, Nvidia, Salesforce and Anthropic, are embroiled in a new controversy over AI training data. According to a report published by Proof News, a dataset these companies used to train their in-house AI models includes subtitles from YouTube videos.

The dataset, titled "YouTube Subtitles", was created by EleutherAI and published in 2020. It contains subtitles from 173,536 YouTube videos taken from more than 48,000 channels.

The problem is that the dataset appears to violate YouTube's terms and conditions, which prohibit accessing videos by "automated means." According to Proof News, YouTube Subtitles is a 5.7 GB training dataset (489 million words) and includes subtitles from more than 12,000 videos that have since been removed from the platform. Notably, the dataset contains subtitles from videos by many well-known YouTube creators with large subscriber counts:

Proof News found material from popular YouTube creators, including MrBeast (289 million subscribers, 2 videos), Marques Brownlee (19 million subscribers, 7 videos), Jacksepticeye (nearly 31 million subscribers, 377 videos) and PewDiePie (111 million subscribers, 337 videos). Some of the material used to train AI contains inappropriate content, including conspiracy theories.

In fact, the "YouTube Subtitles" dataset is part of a larger collection called "The Pile", which includes several other training datasets. Most of the Pile's datasets are open to anyone with enough storage space and computing power to access them.

The companies named did not respond to press requests for comment on the findings and the allegations that they used unlicensed training data. Proof News searched through online posts and white papers to find evidence and determine whose creative material was used to train which specific AI models. However, it is not possible to compile a complete list of companies using this dataset, since AI companies do not typically disclose the data they use to train their models.

Marques Brownlee, one of the creators whose content was used illegally, said he paid to use the captioning feature on YouTube. Therefore, it is a 'blatant violation' for companies to use this type of data without permission or payment.

Note that Apple and the other tech companies did not download the subtitles themselves; they trained their AI models on a dataset that had already been compiled. Even so, the episode is an example of the unintended consequences of AI. Some creators say they are uneasy about the possibility that AI could be used to mimic their content in the future.
