In the latest example of a troubling industry trend, NVIDIA appears to have scraped troves of copyrighted content for AI training. 404 Media’s Samantha Cole reported on Monday that the company, which has a $2.4 trillion market cap, asked employees to download videos from YouTube, Netflix and other datasets to develop commercial AI projects. The graphics-card maker is one of several tech companies that appear to be embracing a “move fast and break things” mentality as they race to establish an advantage in this frenetic and all too often shameful AI gold rush.
The training was reportedly meant to develop models for products such as the company’s Omniverse 3D world generator, self-driving car systems and “digital humans” efforts.
NVIDIA defended its practices in an email to Engadget. A company spokesperson said its research “fully complies with the letter and spirit of copyright law,” while arguing that intellectual property law protects certain expressions “but not facts, ideas, data, or information.” The company equated the practice to a person’s right to “learn facts, ideas, data, or information from another source and use them in original expression.” Humans versus computers… what’s the difference?
YouTube doesn’t seem to agree. Spokesperson Jack Malon pointed Engadget to a Bloomberg article from April, in which CEO Neal Mohan reportedly said that using YouTube to train AI models would be a “clear violation of our terms.” YouTube’s policy communications manager wrote to Engadget that “our previous comments still stand.”
Mohan’s comments in April were in response to reports that OpenAI had trained its Sora text-to-video generator on YouTube videos without permission. Last month, it was reported that startup Runway AI had done the same.
NVIDIA employees who raised ethical and legal concerns about the practice were reportedly told by their superiors that the company’s highest-ranking executives had already given the go-ahead. “This is a senior executive decision,” said Ming-Yu Liu, NVIDIA’s vice president of research. “We have comprehensive approval for all of the data.” Other company employees reportedly described the scraping as an “open legal issue” that the company plans to address in the future.
All of this sounds similar to Facebook’s (now Meta’s) old motto, “move fast and break things,” which was incredibly successful at breaking a lot of things, including the privacy of millions of people.
NVIDIA reportedly instructed employees to train its models on videos from YouTube and Netflix, as well as the movie trailer database MovieNet, an internal library of video game footage, and the GitHub video datasets WebVid (now taken down following a cease-and-desist order) and InternVid-10M, the latter of which contains 10 million YouTube video IDs.
Some of the data NVIDIA allegedly used for training was marked as eligible for academic (or otherwise non-commercial) use only. HD-VG-130M, a library of 130 million YouTube videos, includes a license that restricts its use to academic research. NVIDIA reportedly dismissed concerns about the academic-only terms, insisting that the datasets could be used in its commercial AI products.
To avoid detection by YouTube, NVIDIA reportedly downloaded content using virtual machines (VMs) with rotating IP addresses to avoid being banned. In response to an employee’s suggestion to use a third-party IP address rotation tool, another NVIDIA employee reportedly wrote: “We [are on Amazon Web Services]. When you restart a [virtual machine] instance, it will get a new public IP[.] So, for now, it’s no problem.”
404 Media’s full report on NVIDIA’s efforts is worth a read.