AI FACES DATA DEFICIT BY 2026

A new study projects that artificial intelligence (AI) systems could exhaust the stock of freely available knowledge on the Internet as early as 2026, raising concerns about the technology's future progress. AI models such as GPT-4 and Claude 3 Opus are trained on trillions of words drawn from the Internet, and the study predicts that publicly accessible data stocks may run out between 2026 and 2032.

To keep improving these models, tech companies will need to find new sources of data. Options include generating synthetic data, turning to lower-quality sources, or tapping private data held on company servers, such as messages and emails. The research, published on the arXiv preprint server, documents this emerging trend.

A lack of new data could slow progress in the field, leaving models to advance more gradually on the strength of algorithmic improvements and whatever naturally occurring data remains. The scale of demand is illustrated by ChatGPT, whose training corpus comprised roughly 570 GB of text, about 300 billion words drawn from books, articles, Wikipedia, and other sources.

Inaccurate or low-quality data can produce flawed results. Google's Gemini-powered search, for instance, mistakenly suggested adding glue to pizza, an answer traced back to a joke post on Reddit, while other flawed responses drew on the satirical website The Onion.

To estimate the volume of text available on the Internet, the researchers used Google's search index and concluded that there are roughly 250 billion web pages, each containing an average of 7,000 bytes of text. Their projections indicate that high-quality text data will be depleted by 2032, while low-quality data may last until 2050; image resources are expected to be exhausted by 2060.
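
For readers who want to check the arithmetic, here is a minimal back-of-envelope sketch in Python, assuming the study's figures of 250 billion pages and 7,000 bytes per page (the 4-bytes-per-token ratio is an illustrative assumption, not a figure from the study):

    # Rough estimate of the indexed text stock from the figures quoted above
    pages = 250e9            # ~250 billion indexed web pages
    bytes_per_page = 7_000   # average text per page, in bytes
    bytes_per_token = 4      # assumed ratio for English text, illustration only

    total_bytes = pages * bytes_per_page          # ~1.75e15 bytes (~1.75 PB)
    total_tokens = total_bytes / bytes_per_token  # ~4.4e14 (~440 trillion) tokens
    print(f"{total_bytes / 1e15:.2f} PB, ~{total_tokens / 1e12:.0f} trillion tokens")

Under those assumptions, the indexed web works out to roughly 1.75 petabytes of text, on the order of a few hundred trillion tokens.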

Despite the looming data shortage, companies have several strategies for addressing the challenge. One approach is to leverage private data, as in Meta's plans, reported on June 26, to use people's interactions with its chatbots to train generative AI models.

Another possible tactic is the use of synthetic data, although so far it has been applied successfully mainly in systems trained for games, coding, and mathematics. Collecting personal data or copyrighted material without authorization, meanwhile, could lead to legal disputes.

Data scarcity is not the only obstacle to AI development. A query to ChatGPT, for instance, consumes nearly ten times as much electricity as a traditional Google search, which has prompted tech companies to invest in nuclear fusion startups to meet the growing energy needs of their data centers.
