Research

There are three key research questions regarding intentions in artificial intelligence (AI):

1. What are intentions, and how should they be defined, especially as they relate to AI?
2. How can we measure, characterize, and intervene on intentions in AI models?
3. What can intentions in AI models teach us about intentions in humans?

If science is to make progress on highly complex concepts like intentions, it is critical that we begin with rigorous and precise definitions of intentions, especially as they may relate to AI. This is the focus of the first research question. According to one popular view, to have an intention to do an act is to be rationally committed to performing that act as either an end or a means (Bratman 1987). Under this view, a crucial preliminary question is to what extent AI systems can maintain a flexible commitment to a plan of action in light of multiple goals. Beyond that, human intentions developed under the specific constraints of human cognition. AIs operate under different constraints (e.g., AIs can store more information, and more accurately, but human brains have many more recurrent connections than the transformer architecture), so how similar are AI intentions to human ones?
The neuroscience of intentions studies intentions in the human brain before they lead to action. More recently, that field has increasingly joined forces with philosophers, for example as part of this large, interdisciplinary project. Here we will combine the most pertinent insights from that field with insights from AI research, for example from mechanistic interpretability, AI safety, autonomous systems, and AI interpretability. This will be a two-pronged approach: top-down and bottom-up.

For the top-down approach, we will combine neuroscience and AI analysis. Neuroscience will provide careful experimental design together with modeling tools (e.g., causal modeling, circuit models, and drift-diffusion models). In conjunction with AI tools (e.g., identification of features, circuits, and motifs; sparse autoencoders; causal tracing), we will then measure, characterize, and intervene on intentions in AI. This combination promises insights into the causal structure of these large networks and will point to ways of constructing future models that are intrinsically and developmentally interpretable for humans.

At the same time, we will employ a bottom-up approach, training smaller models on simple virtual environments. Insights from the top-down analysis will ensure that these smaller models contain the features relevant to intentions and agency, while insights from the smaller models will, in turn, inform the analysis of larger models, resulting in a virtuous cycle. One output of this effort will be an AI-powered probe system that tracks intentions in large AI models in real time, enabling us to intervene before the AI acts, should those intentions be deemed dangerous.
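To make the probe idea concrete, the sketch below shows one minimal way such a system might work: a linear probe trained on cached hidden activations and attached to a running model via a forward hook. This is an illustrative assumption, not our finalized method; the names (IntentionProbe, train_probe, attach_probe), the choice of layer, the mean pooling over the sequence, and the alert threshold are all hypothetical.

```python
# Minimal sketch of a real-time "intention probe": a logistic-regression
# classifier over one layer's hidden states. Everything here is illustrative.
import torch
import torch.nn as nn

class IntentionProbe(nn.Module):
    """Linear probe mapping hidden states to an intention score in [0, 1]."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, hidden_dim) activations from a chosen layer
        return torch.sigmoid(self.linear(h)).squeeze(-1)

def train_probe(acts: torch.Tensor, labels: torch.Tensor,
                epochs: int = 100, lr: float = 1e-2) -> IntentionProbe:
    # acts: (n, hidden_dim) cached activations; labels: (n,) binary labels
    # marking states hypothesized to carry the intention-like feature.
    probe = IntentionProbe(acts.shape[-1])
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.BCELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(probe(acts), labels.float())
        loss.backward()
        opt.step()
    return probe

def attach_probe(layer: nn.Module, probe: IntentionProbe,
                 threshold: float = 0.9):
    """Stream a layer's activations through the probe on every forward pass,
    flagging intention-like features before the model emits an action."""
    def hook(_module, _inputs, output):
        # Assumes output shape (batch, seq, hidden); pool over the sequence.
        score = probe(output.detach().mean(dim=1))
        if (score > threshold).any():
            print("warning: intention-like feature above threshold")
    return layer.register_forward_hook(hook)
```

In a real system, the printed warning would be replaced by whatever intervention policy is appropriate, e.g., halting generation for human review.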
The type and level of manipulation of intentions possible in AI models is well beyond anything feasible in biological systems. In humans and other higher animals, neuroscientists have access either to a reflection of average activity over many millions of neurons, via neuroimaging, or to a small fraction of the overall number of neurons, via intracortical recordings. In AI models, by contrast, there is complete access to all weights and activations. Moreover, single neurons or groups of neurons can be ablated or rewired, their activations can be altered, and so on. We can therefore measure, characterize, and intervene on intentions in these models at the level of individual neurons, a degree of access unattainable in biological systems. Hence, our work on intentions in AI models holds every promise of translating into progress on understanding intentions in humans.
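As an illustration of the kind of neuron-level intervention this access makes possible, here is a minimal sketch of ablating a chosen set of units with a PyTorch forward hook. The toy model and the unit indices are arbitrary assumptions for demonstration; in practice the target units would be those implicated in an intention-like feature by the analyses above.

```python
# Illustrative sketch: zero out ("ablate") selected units in one layer and
# compare the model's behavior with and without them.
import torch
import torch.nn as nn

def ablate_units(layer: nn.Module, unit_indices):
    """Silence selected units of `layer`'s output on every forward pass."""
    idx = torch.as_tensor(unit_indices, dtype=torch.long)
    def hook(_module, _inputs, output):
        patched = output.clone()
        patched[..., idx] = 0.0   # zero the chosen units
        return patched            # returning a tensor overrides the output
    return layer.register_forward_hook(hook)

# Usage on a toy network: measure the downstream effect of the ablation.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
x = torch.randn(8, 16)
baseline = model(x)
handle = ablate_units(model[1], unit_indices=[3, 7, 12])  # ablate 3 units
ablated = model(x)
handle.remove()  # detach the hook, restoring the unmodified model
print((baseline - ablated).abs().max())  # size of the behavioral change
```

The same pattern generalizes to patching in activations from a different input (activation patching) rather than zeroing, which is often more informative for causal analysis.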