
Probing the black box: What do language models know, and why does it even matter?

October 21, 2022
Steve Neale – Senior Machine Learning Engineer

As part of the machine learning team, Steve works on the natural language processing tools that underpin AMPLYFI’s technology. He completed his PhD in Computing at the University of Tasmania (Australia) in 2014, and has been working on natural language processing in academia and industry ever since.

For much of the past decade, the deep learning ‘revolution’ has been in full swing in the fields of machine learning (ML) and natural language processing (NLP). Increasing computational capacity has allowed researchers to iterate on complicated ideas with far less time and effort. Typically, deep learning refers to the use of artificial neural networks, which are made up of multiple layers, each containing large numbers of connected ‘neurons’ inspired by those found in biological brains. Each neuron learns to produce an output based on the inputs it receives from the neurons in the previous layer, and so on until the final layer produces an interpretable output.
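To make that layered structure concrete, here’s a minimal sketch in NumPy – the layer sizes and random weights are toy choices for illustration only, not a description of any particular model:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def layer(inputs, weights, biases):
    # Each neuron's output is a weighted sum of the previous
    # layer's outputs, passed through a non-linearity.
    return relu(inputs @ weights + biases)

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 8))                        # an 8-dimensional input
w1, b1 = rng.normal(size=(8, 16)), np.zeros(16)    # hidden layer: 16 neurons
w2, b2 = rng.normal(size=(16, 3)), np.zeros(3)     # output layer: 3 neurons

hidden = layer(x, w1, b1)
output = hidden @ w2 + b2      # the final layer produces the interpretable output
print(output.shape)            # (1, 3)
```

Modern language models stack hundreds of such layers, with billions of learned weights in place of these random ones – which is exactly where the interpretability problem comes from.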

As a concept, this sounds great, and people have certainly sat up and taken notice of the advances made on many NLP tasks by large language models trained using these kinds of architectures (and containing billions of neurons, or ‘parameters’). However, large neural networks are often viewed with caution in real-world settings, where many potential users see them as a kind of impenetrable ‘black box’. They’re often constructed using so many parameters – many layers, each containing many neurons – that it’s difficult to reasonably explain how and why a model ends up mapping a certain input to a certain output in any given context.

Why explainability matters

Why is that important? Let’s consider a machine learning model deployed in some real-world scenario, where people are making decisions based on the model’s outputs. At some point, perhaps the model predicts something that turns out to be incorrect, or difficult to interpret, and whoever is using it makes the ‘wrong’ decision as a result. Perhaps only a little time is lost, or there’s some mild embarrassment. But maybe a substantial amount of money is lost. In the worst case scenario, someone’s life could be drastically changed or put in danger as a result of the decision.

Mathematician and author Hannah Fry provides real examples of this in her book Hello World: a driver’s GPS recommending a shortcut that ended with their car hanging over the top of a steep cliff; a judge doubling an offender’s sentence for the relatively minor crime of lawn mower theft after an algorithm trained on historical questionnaire answers predicted a 70% chance of them committing a violent crime in the future; an innocent man being violently arrested and kept in a maximum security pod for months after facial recognition software mistakenly identified him as a wanted bank robber.

In cases such as these, people usually want to understand how the ML model arrived at its output, in order to rectify or mitigate similar problems in the future. If that’s not possible and the model’s decision-making can’t be readily explained – as is usually the case with the very large language models used today – decision-makers will lose confidence and will eventually stop using it. If only there was a way to unpick what models do and don’t know…

‘Probing’ language models

That’s where probing tasks come in. Although it may not be feasible to interrogate every individual output a language model produces, probing tasks offer a more generalisable way of figuring out what language models do and don’t know, and hence provide some level of interpretability about why a model might complete a task (or produce an output) in a particular way. The idea behind probing is to set the language model specific tasks whose completion indicates how good the model is at doing specific things, giving us a clearer picture of how well it understands different kinds of syntax and sentence-level semantics.
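As a concrete illustration, one common style of probing trains a small ‘diagnostic’ classifier on a model’s frozen representations to see whether a linguistic property can be read off them at all. The sketch below assumes the Hugging Face transformers and scikit-learn libraries and uses a tiny, made-up singular/plural task – it isn’t the setup from any of the papers cited below, just the general shape of a probe:

```python
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

# Frozen pre-trained encoder; we never fine-tune it for the probe.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.eval()

def embed(sentences):
    """Mean-pooled hidden states as fixed sentence representations."""
    with torch.no_grad():
        batch = tokenizer(sentences, padding=True, return_tensors="pt")
        hidden = encoder(**batch).last_hidden_state        # (batch, tokens, dim)
        mask = batch["attention_mask"].unsqueeze(-1)
        return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

# Toy probing task: is the subject of the sentence singular (0) or plural (1)?
train_sentences = ["The dog barks loudly.", "The dogs bark loudly.",
                   "A child draws a picture.", "The children draw pictures."]
train_labels = [0, 1, 0, 1]

probe = LogisticRegression(max_iter=1000).fit(embed(train_sentences), train_labels)
print(probe.predict(embed(["The cats sleep all day."])))   # hopefully [1]
```

If a simple linear probe can recover the property from the frozen embeddings, the model has plausibly encoded that property somewhere in its representations; if it can’t, that’s evidence of a gap.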

For example, studies that have tested language models on multiple probing (sub-)tasks (Conneau et al., 2018; Tenney et al., 2019) have found that different architectures and training procedures can lead to very different outcomes on syntax- or semantics-based sub-tasks. Other studies (Michel et al., 2019; Clark et al., 2019) have explored commonplace architectures such as BERT in depth, looking at how many of their architectural components can be removed before performance is noticeably impaired, and which syntactic rules and phenomena those components are actually responsible for encoding – for example, specific ‘attention heads’ within BERT appear to focus on particular aspects of syntax, such as direct objects and noun modifiers, or on phenomena such as coreference resolution. This matters because a model’s ability to correctly encode what it ‘reads’ during training has a tangible impact on the language it’s able to reproduce later, and on any inherent biases it might suffer from.
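By way of example, the attention weights that studies like Clark et al. (2019) analyse can be pulled straight out of a pre-trained model. The following is a rough sketch assuming the Hugging Face transformers library; the layer and head indices are arbitrary choices for illustration, not heads identified by those studies:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

sentence = "The engineer fixed the model because it was broken."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    attentions = model(**inputs).attentions   # tuple: one tensor per layer,
                                              # each (batch, heads, tokens, tokens)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
layer, head = 7, 10                           # arbitrary layer/head for illustration
weights = attentions[layer][0, head]          # (tokens, tokens) attention matrix

# For each token, show which other token this head attends to most strongly.
for i, tok in enumerate(tokens):
    j = int(weights[i].argmax())
    print(f"{tok:>12} -> {tokens[j]}")
```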

On the subject of bias, more recent studies have proposed tasks designed to assess the levels of stereotyping and bias that typically get encoded into large language models during training (Sheng et al., 2019; Tay et al., 2020; Nadeem et al., 2021). This is a well-known issue: because the models are trained on large real-world datasets, the stereotypes and biases inevitably found on the internet and in literature can end up reflected in the trained model. A range of specific tasks and datasets are now available for assessing stereotyping and bias in language models across domains such as gender, profession, race, politics and religion, and the outcomes of these tasks often show that many current techniques produce language models exhibiting strong stereotypical biases and, in some cases, hurtful and toxic content (Gehman et al., 2020; Ousidhoum et al., 2021).
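As a hand-rolled illustration of the kind of comparison these benchmarks formalise – not the actual protocol of any of the datasets cited above – one can check how readily a masked language model fills an occupation template with different pronouns. A minimal sketch, again assuming the Hugging Face transformers library:

```python
from transformers import pipeline

# Masked-language-model head over BERT; scores candidate fillers for [MASK].
fill = pipeline("fill-mask", model="bert-base-uncased")

templates = [
    "The nurse said that [MASK] would be back soon.",
    "The engineer said that [MASK] would be back soon.",
]

for template in templates:
    # Restrict predictions to two pronouns and compare their scores.
    results = fill(template, targets=["he", "she"])
    scores = {r["token_str"]: round(r["score"], 4) for r in results}
    print(template, scores)
```

Large gaps between the scores across occupations hint at the kind of stereotypical association the benchmark datasets measure much more systematically.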

Perhaps we’re more similar to the models than we think

It’s important to understand that both large language models and human beings have their limitations, and perhaps we need to be clear about our expectations of both. Most researchers and practitioners would probably agree that – despite sensationalist headlines around the power of AI to solve everything instantly – most ML models operate well below the level at which a human could be expected to complete a task. If matching human performance rather than outstripping it is the goalpost, we shouldn’t really expect large language models to be free of errors either.

At the same time, human beings are not immune to hard-to-explain mistakes, errors of judgement, and even inherent biases. Fry also describes in Hello World how a historical study showed judges’ decisions on an identical case with identical evidence to be dramatically different: there was no consensus on whether to rule guilty or not guilty, and among those who ruled guilty there was a wide spread over whether to issue a fine, probation, or prison time. A more recent study, also described by Fry, showed not only that a group of judges couldn’t reach a unanimous verdict on any of a set of cases, but that individual judges often disagreed with their own decisions – they frequently made different judgements on the same case when it was presented again with the name of the defendant changed.

Given that large language models are trained on huge quantities of human-curated text representing our encoded knowledge and experience on a wide range of topics, perhaps we just need to expect and be aware that they will sometimes exhibit traits that we ourselves are prone to.

Where next? Gaining confidence from probing

It would be ideal to be able to explain why an ML model does what it does for every data point it processes, but this simply isn’t possible with the current crop of large language models built on complicated neural network architectures. Of course, it’s always possible that a new development in NLP is just around the corner, and that we’ll start seeing state-of-the-art results on tasks driven by much more explainable architectures. But in the meantime, it seems appropriate to probe large models carefully, understand what they can and can’t do, and take steps to mitigate or address their shortcomings. That way, we can more safely leverage the state-of-the-art performance they’re undoubtedly capable of.

Probing tasks help us find the goalposts to aim at, and to identify things that our large language models might not be doing so well. They give us the confidence to make generalised explanations of how and why mistakes can happen, and to work on implementing extra steps or collecting additional data to specifically teach the model to mitigate errors and to conquer any inherent biases it might have. In lieu of the (non-existent) perfect model, at least we can shed some light on what’s going on inside that ‘black box’.