Apple has published a technical paper detailing the models it developed to power Apple Intelligence, a suite of generative AI features coming soon to iOS, macOS, and iPadOS. In the paper, the company rejects accusations that it trained these models unethically and reiterates that it did not use any private user data. Instead, Apple says it relied on a combination of publicly available and licensed data, including data gathered by its Applebot web crawler.
An earlier report from Proof News had suggested that Apple trained models on a dataset called The Pile, which includes subtitles from hundreds of thousands of YouTube videos. This raised concerns because many video creators were unaware that their content was being used. Apple later clarified that the models trained on that dataset would not be used to power generative AI features in its products.
The technical paper on the Apple Foundation Models (AFM) states that the training data was sourced responsibly. The dataset included publicly available web data and licensed data from publishers. According to The New York Times, Apple approached publishers including NBC, Condé Nast, and IAC about multi-year deals to use their news archives for model training.
Training models on source code without permission, however, remains a point of contention among developers. Some open-source repositories are not licensed for AI training, or prohibit it outright in their terms of use. Apple says it filtered the code by license, keeping only repositories with minimal usage restrictions, such as those under MIT, ISC, or Apache licenses.
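As a rough illustration of what license-based filtering can look like, here is a minimal sketch. It is not Apple's pipeline: the allowlist contents, the repository records, and the function names (`is_permissive`, `filter_repositories`) are all hypothetical, and the sketch assumes each repository already carries a license identifier in its metadata.

```python
# Hypothetical allowlist of permissive license identifiers (an assumption
# for illustration; the article names only MIT, ISC, and Apache as examples).
PERMISSIVE_LICENSES = {"mit", "isc", "apache-2.0"}


def is_permissive(license_id):
    """Return True if the license identifier is on the allowlist."""
    if license_id is None:
        # Repositories with no declared license are excluded.
        return False
    return license_id.strip().lower() in PERMISSIVE_LICENSES


def filter_repositories(repos):
    """Keep only repositories whose declared license is on the allowlist."""
    return [repo for repo in repos if is_permissive(repo.get("license"))]


# Made-up repository records, purely for demonstration.
repos = [
    {"name": "example/alpha", "license": "MIT"},
    {"name": "example/beta", "license": "GPL-3.0"},
    {"name": "example/gamma", "license": None},
    {"name": "example/delta", "license": "Apache-2.0"},
]

kept = filter_repositories(repos)
print([repo["name"] for repo in kept])
```

In this sketch, repositories with a copyleft license or no declared license are dropped. A real pipeline would also have to handle license detection, dual licensing, and files whose license differs from the repository's.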
To strengthen AFM's mathematical abilities, Apple added math questions and answers drawn from various online sources to the training set. The company also used high-quality public datasets, which it filtered to remove sensitive information.
The training data for the AFM models amounts to around 6.3 trillion tokens. By comparison, Meta used 15 trillion tokens, more than twice as many, to train its recently released Llama 3.1 405B model. Apple also used human feedback and synthetic data to fine-tune the models and mitigate potentially harmful behavior.