In a recent study, researchers at Anthropic, several of them former OpenAI employees, present a new approach to understanding artificial neural networks. These networks, loosely digital analogues of human brains, can perform a wide range of tasks, from playing chess to translating between languages.
Rather than carefully studying individual neurons, the scientists focused on combinations of neurons that collectively form distinguishable patterns, or features. These features are more precise and consistent than the individual neurons that compose them, which makes the network's behavior easier to understand.
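To make the idea concrete, here is a minimal sketch of reading a feature as a direction in activation space rather than as a single neuron's value; all names, shapes, and numbers are illustrative, not taken from the paper:

```python
# Minimal sketch (hypothetical names and shapes): a "feature" is read off as a
# direction in neuron-activation space rather than as one neuron's value.
import numpy as np

rng = np.random.default_rng(0)
activations = rng.normal(size=512)          # one token's activations for a 512-neuron layer
feature_direction = rng.normal(size=512)    # a learned direction spanning many neurons
feature_direction /= np.linalg.norm(feature_direction)

# The feature's activation is the projection of the layer's activations onto
# that direction -- no single neuron carries it alone.
feature_activation = activations @ feature_direction
print(feature_activation)
```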
The main drawback of neuron-level analysis is that individual neurons lack a clearly defined role. In a language model, for example, a single neuron may respond in many unrelated contexts, varying its activity across them.
In their paper, the researchers present a new approach to analyzing transformer models. The technique uses dictionary learning to decompose a layer of 512 neurons into more than 4,000 distinct features covering a wide range of topics and concepts, from DNA sequences and legal terminology to HTTP requests and nutrition data.
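The paper's exact architecture is not reproduced here, but dictionary learning of this kind can be sketched as a sparse autoencoder. The dimensions (512 neurons, roughly 4,000 features) follow the text above; the layer names, sparsity weight, and dummy data are assumptions for illustration:

```python
# A minimal sparse-autoencoder sketch of dictionary learning, assuming PyTorch.
# Dimensions follow the article (512 neurons -> 4096 features); everything else
# is illustrative, not Anthropic's exact setup.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, n_neurons=512, n_features=4096):
        super().__init__()
        self.encoder = nn.Linear(n_neurons, n_features)
        self.decoder = nn.Linear(n_features, n_neurons)

    def forward(self, x):
        features = torch.relu(self.encoder(x))   # sparse, non-negative feature activations
        recon = self.decoder(features)           # reconstruct the original activations
        return recon, features

model = SparseAutoencoder()
acts = torch.randn(64, 512)                      # a batch of layer activations (dummy data)
recon, feats = model(acts)

# Training objective: reconstruct the activations while keeping the features
# sparse via an L1 penalty, so each input is explained by a few active features.
loss = torch.mean((recon - acts) ** 2) + 1e-3 * feats.abs().mean()
loss.backward()
```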
Such fine-grained features remain largely invisible when individual neurons are studied in isolation. The researchers used two different methods to demonstrate that these features are more interpretable than neurons.
In the first experiment, the researchers evaluated how easy it was to understand what each feature does. Features significantly outperformed neurons on interpretability.
In the second experiment, a language model generated a brief description of each feature, and another model then predicted the feature's activations from that description alone.
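The scoring step of this second experiment can be sketched under an assumption: if activations are compared token by token, a description's quality can be measured as the correlation between predicted and actual activations. All numbers below are dummy data:

```python
# A sketch of the scoring step (assumed metric: correlation). `predicted`
# stands in for a model's activation guesses made from a feature's text
# description; `true_acts` are the feature's measured activations per token.
import numpy as np

true_acts = np.array([0.0, 2.1, 0.0, 0.3, 5.2, 0.0])
predicted = np.array([0.1, 1.8, 0.0, 0.5, 4.7, 0.2])

score = np.corrcoef(true_acts, predicted)[0, 1]
print(f"interpretability score: {score:.2f}")    # closer to 1 = better description
```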
These features also make it possible to steer the network's behavior more precisely, and the patterns recur across different models. In further experiments, the researchers artificially adjusted feature activations, creating a "handle" with which to tune the model's behavior.
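Such a "handle" might look like the following sketch, which reuses the hypothetical SparseAutoencoder defined above: clamp one feature to a high value and decode back into the neuron basis. The feature index and clamp value are invented for illustration:

```python
# An illustrative sketch of feature steering, assuming the SparseAutoencoder
# sketched earlier; the feature index and clamp value are made up.
import torch

sae = SparseAutoencoder()                        # the autoencoder sketched above
with torch.no_grad():
    acts = torch.randn(1, 512)                   # activations for one token (dummy data)
    _, feats = sae(acts)                         # encode into the learned feature basis
    feats[:, 1234] = 10.0                        # turn one feature's "handle" up
    steered = sae.decoder(feats)                 # map edited features back onto neurons
# `steered` could then be patched into the network's forward pass in place of `acts`.
```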
The work is part of Anthropic's investment in mechanistic interpretability, reflecting a long-term commitment to AI safety. The study builds a bridge between computer science and neuroscience, opening new horizons for understanding artificial neural networks.