A new research report from the Stanford Internet Observatory (SIO) has found child sexual abuse material (CSAM) in LAION-5B, a large-scale public dataset used to train popular generative neural networks, including Stable Diffusion. After analyzing over 32 million data points, the researchers, using Microsoft's PhotoDNA tool, confirmed the presence of 1,008 CSAM images. However, it is important to note that the true number could be much higher.
The LAION-5B dataset does not actually contain the images themselves; it consists of metadata records that include (a rough sketch of such a record follows this list):
- Image hash;
- Description;
- Language of the description;
- A flag indicating whether the image may be unsafe;
- URL of the image.
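The sketch below illustrates how one such metadata record might be represented. The field names are illustrative assumptions based on the list above, not the dataset's actual column names, and the values are made up.

```python
from dataclasses import dataclass

# Rough sketch of a single LAION-5B metadata record, per the list above.
# Field names are assumptions for illustration, not the real schema.
@dataclass
class LaionRecord:
    image_hash: str   # hash of the image
    caption: str      # text description
    language: str     # language of the caption
    unsafe: bool      # flag indicating the image may be unsafe
    url: str          # where the image itself is hosted

record = LaionRecord(
    image_hash="0123abcd...",
    caption="a photo of a dog on a beach",
    language="en",
    unsafe=False,
    url="https://example.com/dog.jpg",
)
print(record.url)  # the dataset stores only the link, never the image bytes
```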
Some of the links to CSAM photographs in LAION-5B led to various websites, including Reddit, X, Blogspot, and WordPress, as well as adult websites such as XHamster and XVideos.
To identify suspicious images in the dataset, the SIO focused on entries marked as “unsafe.” These were then checked with PhotoDNA for the presence of CSAM, and the results were sent to the Canadian Centre for Child Protection (C3P) for confirmation. The removal of the identified source material is currently underway, and the image URLs have been passed to both C3P and the National Center for Missing & Exploited Children (NCMEC) in the US.
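A minimal sketch of that screening flow is shown below. It assumes records shaped like the illustrative example earlier, and photodna_check() is a hypothetical placeholder: PhotoDNA is a proprietary Microsoft service whose real API is not reproduced here.

```python
from typing import Iterable, List

def photodna_check(image_url: str) -> bool:
    """Hypothetical stand-in for fetching an image and matching it with PhotoDNA."""
    raise NotImplementedError

def screen_unsafe_entries(records: Iterable[dict]) -> List[str]:
    # Step 1: keep only entries the dataset itself flags as potentially unsafe.
    suspects = [r for r in records if r.get("unsafe")]
    # Step 2: check each suspect with the hash-matching service.
    confirmed = [r for r in suspects if photodna_check(r["url"])]
    # Step 3: return the confirmed URLs, to be reported to C3P / NCMEC.
    return [r["url"] for r in confirmed]
```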
Stable Diffusion version 1.5, which was trained on LAION-5B data, is known for its ability to generate explicit images. Although a direct link between this AI technology and the creation of pornographic images of minors has not been established, such technologies have facilitated crimes such as revenge porn and other illicit activity.
Despite widespread criticism from the community, Stable Diffusion 1.5 remains popular for generating explicit images. This is especially concerning given that Stable Diffusion 2.0, which includes additional safety filters, has already been released. It remains unclear whether Stability AI, the company behind Stable Diffusion, was aware that CSAM may have reached its models through the use of LAION-5B as the training dataset, as the company did not respond to requests for comment.