Researchers from Truffle Security have published an analysis of the Common Crawl dataset (https://commoncrawl.org/), which is used to train large language models such as DeepSeek. The study examined the December archive of Common Crawl, which comprises 400 TB of data covering 2.67 billion web pages.
The scan found 2.76 million web pages in the dataset that contain hardcoded passwords and API access keys. In total, the archive held 11,908 unique keys and passwords that were embedded in the HTML markup or JavaScript code of web pages and passed verification (only live credentials that allowed a successful connection to the associated services were counted). 63% of the keys and passwords were reused across multiple pages. For example, one WalkScore API key appeared on 57 thousand pages across 1,871 subdomains.
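The counting step described above (find key-like strings, deduplicate them, and measure reuse across pages) can be sketched roughly as follows. This is a minimal illustration, not the researchers' tooling: the regular expressions are approximations of commonly seen key formats, the function names `find_candidates` and `reuse_share` are hypothetical, and the sketch does not verify candidates against live services as the study did.

```python
import re
from collections import defaultdict

# Approximate, illustrative patterns (assumptions, not the study's detectors).
PATTERNS = {
    "mailchimp_api_key": re.compile(r"\b[0-9a-f]{32}-us\d{1,2}\b"),
    "aws_access_key_id": re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
    "slack_webhook": re.compile(
        r"https://hooks\.slack\.com/services/"
        r"T[0-9A-Za-z]+/B[0-9A-Za-z]+/[0-9A-Za-z]+"
    ),
}


def find_candidates(pages):
    """Map each (kind, secret) pair to the set of page URLs that contain it."""
    hits = defaultdict(set)
    for url, body in pages.items():
        for kind, pattern in PATTERNS.items():
            for match in pattern.finditer(body):
                hits[(kind, match.group(0))].add(url)
    return hits


def reuse_share(hits):
    """Fraction of unique candidates that appear on more than one page."""
    if not hits:
        return 0.0
    reused = sum(1 for urls in hits.values() if len(urls) > 1)
    return reused / len(hits)
```

Given a dictionary that maps page URLs to their HTML or JavaScript text, `find_candidates` returns the unique candidates with the pages they occur on, and `reuse_share` reports what fraction of them occur on more than one page. A real scan would also need a verification step before a candidate is counted as a working credential.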
Among the most notable findings are an AWS S3 storage key used on one web page and 17 webhooks to Slack channels present on a single web page. The most frequently embedded credential was the Mailchimp API key: about 1,500 such keys were identified, placed directly in HTML forms or JavaScript code instead of being read from server-side environment variables. Some development firms used the same API key on the sites of different customers.
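The alternative mentioned above, keeping the key in a server-side environment variable, might look like the following minimal sketch. The variable name `MAILCHIMP_API_KEY`, the `/subscribe` endpoint, and all function names are hypothetical and chosen only for illustration. The request structure is a placeholder and does not reproduce the actual Mailchimp API.

```python
import os


def load_api_key(env=None, name="MAILCHIMP_API_KEY"):
    """Read the key from the server's environment; fail loudly if it is absent."""
    env = os.environ if env is None else env
    key = env.get(name)
    if not key:
        raise RuntimeError(f"{name} is not set")
    return key


def render_signup_form(action="/subscribe"):
    """HTML sent to the browser: it posts to our own endpoint and holds no key."""
    return (
        f'<form method="post" action="{action}">'
        '<input type="email" name="email" required>'
        '<button type="submit">Subscribe</button>'
        "</form>"
    )


def build_subscribe_request(email, env=None):
    """Prepare the server-side call; the key is attached here, never in the page."""
    return {"payload": {"email": email}, "api_key": load_api_key(env)}
```

With this split, the browser only ever sees the form, while the key stays in the process environment of the server that forwards the request to the third-party service.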


It is assumed that training AI models on insecure code can degrade the quality of the model’s work and lead it to generate insecure output. The study of keys embedded in web pages was prompted by the observation that most popular large language models, when asked for code to integrate with Slack and Stripe, produced insecure examples that place keys directly in the web page. The researchers became interested in this and set out to measure how often such vulnerable code occurs in the data used for training.