Is Google Scraping Public Data for Training Bard?
New Jersey, 26th July’23: On Monday, Gizmodo spotted that Google recently updated its privacy policy to disclose that its various AI services, such as Bard and Cloud AI, may be trained on public data that the company has scraped from the web.
In an interview with The Verge, a Google spokesperson said, “Our privacy policy has long been transparent that Google uses publicly available information from the open web to train language models for services like Google Translate. This latest update simply clarifies that newer services like Bard are also included. We incorporate privacy principles and safeguards into the development of our AI technologies, in line with our AI Principles.”
Google’s privacy policy, updated on July 1st, 2023, states that “Google uses information to improve services and develop new products, features, and technologies that benefit users and the public,” and that the company may “use publicly available information to help train Google’s AI models and build products and features like Google Translate, Bard, and Cloud AI capabilities.”
Google’s updated privacy policy specifies that the company uses “publicly available information” to train its AI products, but it does not explain how the company prevents copyrighted material from entering the data pool. Many publicly accessible websites have policies prohibiting data collection or web scraping for training AI models.
This approach raises questions about compliance with regulations such as the EU’s GDPR, which restricts the use of personal data without a lawful basis such as consent. The legal status of using such data, including social media posts and copyrighted works, for AI training remains uncertain, which has led to lawsuits and calls for stricter regulation. How this data is processed before training is also a concern: the workers who sort through vast amounts of training data to prevent adverse effects on AI systems often face long hours and challenging working conditions.
Gannett, the largest newspaper publisher in the US, has filed a lawsuit against Alphabet and Google, alleging monopolistic practices in the digital advertising market related to their AI technology. Additionally, Google’s AI search beta and similar products have faced criticism, with critics calling them “plagiarism engines” and blaming them for reduced traffic to websites.
In a similar vein, Twitter and Reddit have made significant changes to prevent unrestricted data harvesting. These changes have sparked a backlash from their user communities because they degrade core user experiences.