A first peek into the AI black box

24 May 2024

Anthropic, a major OpenAI competitor, has taken an important step towards understanding the inner workings of large language models. This paves the way toward better controllability of such models, which would address a significant criticism of this technology. Beyond this, adoption could accelerate even further.

Bottom line

Researchers from Anthropic successfully identified some of the inner structures powering the neural network behind the "Claude" large language model (LLM). This development paves the way for a better understanding of how LLMs work and, most importantly, for the ability to control their behavior far more precisely than is possible today, which in turn would improve their safety. Doing so would remove a major roadblock to the adoption of state-of-the-art AI applications, which can currently be quite unpredictable. This would logically be a major trigger for the entire theme, adding another important tailwind to the already strong momentum.

What happened

On May 21st, a team of researchers working at Anthropic, one of the most prominent competitors to OpenAI, published a paper showing that they had been able to identify so-called interpretable features in one of their large language models (LLMs), dubbed Claude 3 Sonnet. This is the first detailed look inside a modern, production-grade LLM in which patterns of neuron activations corresponding to human-interpretable concepts (e.g., a car) have been identified within the neural network. This marks a significant accomplishment, as such systems are typically black boxes which even their developers do not clearly understand.

Impact on our Investment Case

Why all the fuss?

To better understand the significance of this discovery, one needs to understand what LLMs are and how they are designed. LLMs are behind every modern AI chatbot, such as ChatGPT, and their conceptual structure is what powers many other types of AI applications. Such models are already very capable, but are ultimately black boxes.

Indeed, they are based on Artificial Neural Networks (see our previous piece for more details), i.e., computer programs that loosely emulate the structure of biological brains and neurons. The most advanced LLMs rely on over 100bn parameters (the adjustable connection strengths between artificial neurons), organized in dozens of layers, out of which emerge the structures enabling any given capability. However, the developers behind such systems do not know why these structures end up the way they are: although the high-level organization of the network is consciously designed, its inner structures are self-organized during the training process, which relies on Machine Learning (ML), and any "neuron" can be involved in several different structures. In simpler words, the developers set an objective for the program, which can then use all the means necessary (within the framework decided by the developers) to self-organize in order to reach it.
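
To make the idea of "setting an objective and letting the program self-organize" more concrete, here is a minimal sketch in Python. It is our own toy illustration, not taken from the paper: a single artificial neuron with two parameters (the names `weight` and `bias`, the learning rate and the data are all assumptions chosen for the example) is never told the rule behind the data. It is only told to reduce its prediction error, and it ends up finding the rule by itself.

```python
# Toy illustration (hypothetical, not from the Anthropic paper): one neuron
# learns the hidden rule y = 2x + 1 purely by reducing its prediction error.

def train_single_neuron(samples, learning_rate=0.05, steps=2000):
    """Fit prediction = weight * x + bias by gradient descent on squared error."""
    weight, bias = 0.0, 0.0  # the developer sets no meaningful starting values
    n = len(samples)
    for _ in range(steps):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in samples:
            error = (weight * x + bias) - y
            # Gradient of the mean squared error with respect to each parameter
            grad_w += 2 * error * x / n
            grad_b += 2 * error / n
        # Nudge each parameter in the direction that lowers the error
        weight -= learning_rate * grad_w
        bias -= learning_rate * grad_b
    return weight, bias


# Training data generated by the hidden rule y = 2x + 1
SAMPLES = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

if __name__ == "__main__":
    w, b = train_single_neuron(SAMPLES)
    print(f"learned weight={w:.3f}, bias={b:.3f}")
```

With two parameters, the learned values are easy to read. With over 100bn of them, the same process yields structures that nobody designed and that are very hard to inspect, which is the black-box problem described above.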

Ultimately, when LLMs produce content, nobody can explain with certainty how they came up with it. Until now.

What does this paper bring to the table exactly?

Keeping the biological brain analogy, this discovery is equivalent to identifying the precise zone of the brain responsible for recognizing a defined object (e.g., a person, a car, etc.). In their paper, the researchers gave the example of the structure responsible for the concept of the Golden Gate Bridge: the concurrent activation of this set of neurons meant that the AI model was "thinking" (for lack of a better word, as such systems are not conscious) about the monument. Looking at neighboring groups (i.e., close in semantic terms) made it possible to identify concepts such as Alcatraz Island or Hitchcock's movie Vertigo, set in San Francisco. This method also worked for more abstract concepts: "inner conflict" was close to concepts such as "catch-22", "losing religious faith" or "romantic struggles", allowing the mapping of large subsets of the model.
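
The notion of concepts being "close" to one another can be sketched in a few lines of Python. This is a simplified illustration under our own assumptions: each concept is represented by a made-up three-number vector (real features live in spaces with thousands of dimensions and are learned, not hand-written), and closeness is measured with cosine similarity, a standard way of comparing directions. The `FEATURES` table and the `nearest_features` helper are hypothetical names introduced for the example.

```python
import math

# Hypothetical feature directions, invented for illustration only.
FEATURES = {
    "Golden Gate Bridge": [0.9, 0.1, 0.0],
    "Alcatraz Island": [0.8, 0.3, 0.1],
    "Vertigo (film)": [0.7, 0.2, 0.4],
    "inner conflict": [0.0, 0.2, 0.9],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_features(name, k=2):
    """Return the k features most similar to `name`, most similar first."""
    target = FEATURES[name]
    scored = [
        (other, cosine_similarity(target, vector))
        for other, vector in FEATURES.items()
        if other != name
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    for other, score in nearest_features("Golden Gate Bridge"):
        print(f"{other}: {score:.2f}")
```

In this toy table, the two San Francisco concepts rank closest to the Golden Gate Bridge, while "inner conflict" points in an almost unrelated direction.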

Even more interesting, the researchers found that tuning the structure up or down (i.e., increasing or decreasing its relative importance within the network) clearly impacted the output. For example, maximizing the importance of the Golden Gate Bridge concept made the model become obsessed with the topic, bringing every discussion to it and even answering that its physical form was that of the monument when asked the question! Conversely, tuning down the structure would make the topic disappear, something of clear interest when the concept is related to safety, be it abstract - e.g., deception - or extremely practical - e.g., biological weapon development.
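
A rough sketch of what tuning a concept up or down means in practice, again under our own simplifying assumptions rather than as a reproduction of Anthropic's method: the model's internal state is treated as a short list of numbers, the concept as a unit-length direction within it, and the hypothetical helper `clamp_feature` forces the strength of the concept along that direction to a chosen value while leaving everything else untouched.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def clamp_feature(activation, direction, target):
    """Set the strength of `direction` inside `activation` to `target`.

    Assumes `direction` has unit length. Only the component along
    `direction` changes; the rest of the activation is preserved.
    """
    current = dot(activation, direction)
    return [a + (target - current) * d for a, d in zip(activation, direction)]


# Made-up numbers for illustration only.
ACTIVATION = [1.0, 2.0, 3.0]    # toy internal state of the model
BRIDGE_DIRECTION = [0.6, 0.8, 0.0]  # toy unit-length "Golden Gate Bridge" direction

if __name__ == "__main__":
    boosted = clamp_feature(ACTIVATION, BRIDGE_DIRECTION, target=10.0)
    suppressed = clamp_feature(ACTIVATION, BRIDGE_DIRECTION, target=0.0)
    print("original strength:  ", dot(ACTIVATION, BRIDGE_DIRECTION))
    print("boosted strength:   ", dot(boosted, BRIDGE_DIRECTION))
    print("suppressed strength:", dot(suppressed, BRIDGE_DIRECTION))
```

Boosting corresponds to the "obsessed" behavior described above, and suppression to making the topic disappear. In a real model the adjustment is applied inside the network while it generates text, which this sketch does not attempt to reproduce.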

What's next?

This research is extremely promising as it could enable the auditing of LLMs, hence reinforcing their safety and compensating for embedded biases. Such biases are derived from the training datasets and are hard to neutralize because of the gigantic amount of data required to train an AI model. These biases can be actively exploited by malicious actors and totally derail a model's capabilities, as demonstrated by Microsoft Corp's Tay chatbot becoming racist and misogynistic in less than a day when it was launched in 2016, or Bing Chat becoming aggressive and manipulative last year. Suppressing these risks would mark major progress for AI systems, and would clearly help de-risk and accelerate their rollout: regulators would be able to audit and constrain the technology as they see fit, and end-users would have the assurance of a product behaving as it should.

This would of course be a major driver for the theme, which is already entering the era of applications at full speed. However, this paper is just a first step, as the approach has its limitations and will probably need to be adapted to work for other LLMs. But considering the frantic pace of innovation exhibited in the sector, we believe this first crack in the LLMs' black box could rapidly become a gaping hole paving the way for further progress.

Our Takeaway

This news is a perfect example of innovation at work in the AI sector. Developers are well aware of the limitations of their current models, not least because of pressure from regulators and distrust from the public, and are actively working to correct them. The stakes are huge, as even mitigating biases in AI models to an acceptable level and allowing an audit of their inner workings would open the floodgates for AI applications to target virtually every segment of the market, including those where this is currently neither possible nor desirable for safety reasons (e.g., healthcare diagnostics or interactive robots). This potential safety leap reinforces our conviction that the theme, after focusing on infrastructure, is finally entering its application phase. We have already increased the corresponding exposure in our allocation, and will closely follow further developments for potential adjustments.

Companies mentioned in this article

Anthropic AI (Not listed); Microsoft Corp (MSFT); OpenAI (Not listed)

Disclaimer

This report has been produced by the organizational unit responsible for investment research (Research unit) of atonra Partners and sent to you by the company sales representatives.

As an internationally active company, atonra Partners SA may be subject to a number of provisions in drawing up and distributing its investment research documents. These regulations include the Directives on the Independence of Financial Research issued by the Swiss Bankers Association. Although atonra Partners SA believes that the information provided in this document is based on reliable sources, it cannot assume responsibility for the quality, correctness, timeliness or completeness of the information contained in this report.

The information contained in these publications is exclusively intended for a client base consisting of professionals or qualified investors. It is sent to you by way of information and cannot be divulged to a third party without the prior consent of atonra Partners. While all reasonable effort has been made to ensure that the information contained is not untrue or misleading at the time of publication, no representation is made as to its accuracy or completeness and it should not be relied upon as such.

Past performance is not indicative or a guarantee of future results. Investment losses may occur, and investors could lose some or all of their investment. Any indices cited herein are provided only as examples of general market performance and no index is directly comparable to the past or future performance of the Certificate.

It should not be assumed that the Certificate will invest in any specific securities that comprise any index, nor should it be understood to mean that there is a correlation between the Certificate’s returns and any index returns.

Any material provided to you is intended only for discussion purposes and is not intended as an offer or solicitation with respect to the purchase or sale of any security and should not be relied upon by you in evaluating the merits of investing in any securities.
