Virtual Lecture by Uri Hasson: Thinking ahead: deep language models predict human neural responses during the processing of natural speech
As scientists, we are trained to search for simple, interpretable models that explain the underlying structure and forces that generate observed data. From this perspective, interpretable, concise, rule-based models will always be preferable to models with millions of parameters, such as relatively opaque neural network models. In this talk, I will argue that our desire for interpretability and simplicity has set us on a misleading path as we study the neural bases of our cognitive faculties. Unlike scientists, the brain does not primarily seek to understand the underlying structure of the world. Instead, the brain primarily seeks to generate meaningful actions within the context of a complex ecological niche. The transition from understanding to action turns the problem on its head. Models with millions of parameters may be simpler and more parsimonious for learning to generate meaningful actions than an ideal model with a handful of well-engineered parameters.
I will use language as an example to contrast the two perspectives. Traditionally, the investigation of the neural basis of language has relied on classical language models, which use explicit symbolic representations of lexical units, described as parts of speech (like nouns, verbs, adjectives, suffixes and prefixes), which are combined with rule-based logical operations embedded in hierarchical tree structures to generate new sentences. Recently, advances in deep learning have led to the development of a new family of deep language models (DLMs), which have been remarkably successful in many real-world natural language processing (NLP) tasks. From a linguist’s perspective, the applied success of DLMs is striking because they rely on a very different architecture from the classical models: 1) DLMs do not parse words into parts of speech but rather encode words (or sub-words) as vectors (sequences of real numbers termed embeddings) and rely on a series of simple arithmetic operations (as opposed to complex syntactic rules) to generate the desired linguistic output; 2) embeddings are learned from real-world textual examples, by predicting how language is used in the wild, with minimal prior knowledge about the structure of language; 3) word embeddings are sensitive to the structural (grammatical) and semantic relationships between words in the text; 4) learning is guided by simple objectives, such as optimizing prediction of the next word in a sequence of text.
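To make the contrast with rule-based models concrete, the following is a minimal sketch of how word vectors and simple arithmetic can yield a next-word prediction. It is an illustration only, not the architecture of any particular DLM: the tiny vocabulary, the random (untrained) embeddings, the averaging of context vectors, and the names `next_word_probs`, `emb_in` and `emb_out` are all assumptions introduced here. Real DLMs learn their embeddings from text and stack many layers of such operations.

```python
import math
import random


def softmax(scores):
    """Turn arbitrary real-valued scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def next_word_probs(context, vocab, emb_in, emb_out):
    """Score every vocabulary word as a candidate next word.

    Assumption for illustration: the context is summarized by averaging
    the input embeddings of its words, and each candidate is scored by a
    dot product with its output embedding.
    """
    dim = len(emb_in[vocab[0]])
    ctx = [sum(emb_in[w][i] for w in context) / len(context) for i in range(dim)]
    scores = [sum(c * o for c, o in zip(ctx, emb_out[w])) for w in vocab]
    return dict(zip(vocab, softmax(scores)))


# Hypothetical toy setup: random vectors stand in for learned embeddings.
rng = random.Random(0)
vocab = ["the", "dog", "chased", "cat"]
DIM = 3
emb_in = {w: [rng.uniform(-1, 1) for _ in range(DIM)] for w in vocab}
emb_out = {w: [rng.uniform(-1, 1) for _ in range(DIM)] for w in vocab}

probs = next_word_probs(["the", "dog"], vocab, emb_in, emb_out)
for word, p in probs.items():
    print(f"{word}: {p:.3f}")
```

Because the embeddings here are random, the printed probabilities carry no linguistic meaning; training on the next-word objective is what would shape the vectors so that plausible continuations receive high probability.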
To test whether biological neural networks and artificial neural networks encode language in a similar way, we developed a behavioral paradigm to assess the human ability to predict each upcoming word in the context of real-life speech, as subjects listened to a 30-minute podcast. Our behavioral results revealed that humans’ ability to predict each word in the story closely matched DLM predictions of the same words in the story. Next, we used electrocorticography (ECoG) to record the neural responses of eight patients with epilepsy as they listened to the same podcast. Drawing on the behavioral study, we obtained evidence for spontaneous neural prediction of upcoming words in a natural spoken story before they are articulated. These results were obtained as participants listened to the story without interruption and without any explicit instructions to engage in next-word prediction. Finally, we demonstrated that contextual word embeddings fit neural responses during linguistic processing better than static word embeddings. Taken together, these experiments provide compelling new evidence for the deep connections between artificial language models and the human brain, and support a new modeling framework for studying the neural basis of human cognitive faculties.
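One simple way to quantify how closely per-word human predictability tracks a model's per-word probabilities is a Pearson correlation across words. The sketch below assumes that choice of measure; the abstract does not state which statistic was used. The function name `pearson_r` and the two short lists are hypothetical placeholders, not data from the study: the second list is the first shifted by a constant, so the correlation comes out close to 1 by construction.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Placeholder values for illustration only (NOT results from the study):
# one score per word, e.g. the fraction of listeners who guessed the word
# and the probability a model assigned to the same word.
human_scores = [0.10, 0.40, 0.70, 0.90]
model_probs = [0.20, 0.50, 0.80, 1.00]

r = pearson_r(human_scores, model_probs)
print(f"r = {r:.3f}")
```

With real data, each list would hold one value per word of the podcast transcript, aligned word by word.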