OpenAI transcribes millions of hours of YouTube videos to train GPT-4
In an effort to secure high-quality data to train massive artificial intelligence models, major AI companies such as OpenAI, Google and Meta are now promoting the use of 'shady' data collection tactics. '. A recent New York Times report said that OpenAI intentionally transcribed more than a million hours of YouTube videos into data to train its most advanced large language model (LLM): GPT-4.
Accordingly, OpenAI has developed the Whisper audio transcription model, helping the company collect data from YouTube videos. The NY Times reported that OpenAI was fully aware this method could come under scrutiny, but continued to do it anyway because they believed the use was completely legal. Interestingly, Google, the company that owns YouTube, is also accused of being involved in similar activity with its AI models, i.e. directly violating the copyright of video creators.
Agreeing with the NY Times, The Information's report emphasized that OpenAI scraped data from YouTube videos and podcasts to train its two AI systems, and hinted that OpenAI president Greg Brockman also knew and agreed with this approach.
In a recent interview with Bloomberg, YouTube CEO Neil Mohan said that the company's policy "doesn't allow downloading of content like transcripts or bits of video, and that's a clear violation our terms of service". However, when asked whether YouTube data was being 'abused' by OpenAI, the CEO only gave a relatively vague answer: "I have seen reports that YouTube data may have been misused." used or not. I have no information myself."
The NY Times report further claims that some Google employees were aware of OpenAI's YouTube data copying activities, but they were unable to do anything because the Mountain View company itself used Similar method to train your own AI models. However, Google told The NY Times that it only collects video data after the video creator has given consent.
Google even reportedly "adjusted its privacy policy" in June 2023, "to allow data mining of publicly available Google Docs, Google Maps reviews, and many types of documents online to train the company's AI products".
You should read it
- Everything you need to know about OpenAI
- Difference between Google PaLM 2 and OpenAI GPT-4
- OpenAI announces ChatGPT app for Android
- OpenAI artificial intelligence defeated 5 professional Dota 2 players
- GPT-5 is coming, significantly better than GPT-4
- OpenAI develops voice reconstruction technology from just 15 seconds of recording
- OpenAI artificial intelligence defeated the current world champion Dota 2
- Behind OpenAI's voice imitation tool
- OpenAI launches ChatGPT app for iPhone users
- Based on GPT-3, they have developed a tool that can turn your commands into lines of code
- Elon Musk sued Sam Altman and OpenAI
- Jukebox, AI can compose a complete song
Maybe you are interested
How to get data from web into Excel
What information does a VPN hide? How does it protect your data?
How to transfer data between 2 Google Drive accounts
6 Data Collecting Apps You Need to Delete for Better Privacy
How to master numerical data in Google Sheets with the AVERAGE function
How to delete white space in a table in Word - Appears right below the data