In this blog piece, Barry Haddow, head of NLP, introduces us to BERT and how this model allows human expertise and domain-specific intelligence to work together. Find out how we can further use NLP to offer our clients more insightful information and give them a competitive edge within the financial services industry.
If you had attended an academic NLP (Natural Language Processing) conference in the last two years, you might have wondered whether you had wandered into some strange Sesame Street-themed convention. First there was ELMo, then there was BERT, then a whole family of BERT offspring: ALBERT, RoBERTa, CamemBERT (from France, obviously), AlBERTo and UmBERTo (Italian), and so on. BERT (and friends) have been used to improve web search (by Google) as well as many tasks in natural language understanding. So what exactly is BERT, and why is it so useful?
To understand BERT, we need to consider how the field of NLP has been revolutionised by neural networks (aka deep learning) since the early 2010s. Deep learning works by converting the task at hand into a complicated mathematical function (the neural network), which depends on many, many parameters. During “training” we find (learn) good values for these parameters; we can then use this set of parameters to apply the function to new data. For instance, suppose we want a system that can decide whether a film review is positive or negative. We construct a neural network which can ingest a review and produce a single number (say +1 or -1), then use a training set of reviews (each marked with its sentiment) to learn the parameters of this neural network. Once we are happy with the parameter set, we can use our neural network to decide the sentiment of any review.
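To make this concrete, here is a toy sketch of that train-then-classify loop. The “network” here is just one weight per word (a bag-of-words perceptron rather than a real neural network), and the reviews are invented purely for illustration:

```python
# Toy illustration of "learn parameters from labelled reviews, then classify
# new ones". A real system would use a neural network; a single-layer
# perceptron over bag-of-words features keeps the example self-contained.

training_reviews = [
    ("a wonderful and moving film", +1),
    ("great acting and a great story", +1),
    ("boring plot and terrible acting", -1),
    ("a dull and tedious film", -1),
]

# One weight (parameter) per word in the training vocabulary.
vocab = sorted({word for text, _ in training_reviews for word in text.split()})
weights = {word: 0.0 for word in vocab}

def score(text):
    return sum(weights.get(word, 0.0) for word in text.split())

def predict(text):
    return +1 if score(text) >= 0 else -1

# "Training": nudge the weights whenever we misclassify a review.
for _ in range(10):
    for text, label in training_reviews:
        if predict(text) != label:
            for word in text.split():
                weights[word] += 0.1 * label

print(predict("a wonderful story"))    # +1
print(predict("terrible and boring"))  # -1
```

Once the weights (parameters) are learnt, the same function classifies reviews it has never seen, which is exactly the pattern described above, just at a vastly smaller scale.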
One detail missing from this, though, is how we actually input the text of a film review (or any text, for that matter) into a mathematical function. Doesn’t maths normally work with numbers? Well yes, so we need some way of converting words into numbers. That conversion is accomplished by an “embedding”, which converts each of the words in the text into a long list of numbers. A good embedding should preserve relationships between words, so not only do we want “France” and “Italy” to have similar embeddings (since they’re both names of European countries), but the relationships Rome–Italy and Paris–France should also somehow be expressed by the embeddings.
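The Rome–Italy / Paris–France relationship can be sketched with hand-made vectors. Real embeddings have hundreds of dimensions learned from data; the three dimensions below are invented purely to show the idea of an analogy as a vector offset:

```python
# Hand-crafted 3-dimensional "embeddings" for illustration only.
# Dimensions (roughly): [is-a-country, is-a-capital, french-vs-italian]
embedding = {
    "France": [1.0, 0.0, 1.0],
    "Italy":  [1.0, 0.0, -1.0],
    "Paris":  [0.0, 1.0, 1.0],
    "Rome":   [0.0, 1.0, -1.0],
}

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

# The offset from a country to its capital is the same for both pairs,
# so the embedding "expresses" the capital-of relationship.
offset_fr = sub(embedding["Paris"], embedding["France"])
offset_it = sub(embedding["Rome"], embedding["Italy"])
print(offset_fr == offset_it)  # True: both are [-1.0, 1.0, 0.0]
```

In real learned embeddings the offsets only match approximately, but the principle — relationships between words showing up as arithmetic on their vectors — is the same.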
Good embeddings can be learnt as part of the task learning process, but researchers realised early on that better neural networks could be created by using “pre-trained” embeddings. These were learnt using the huge quantities of text available in Wikipedia and in out-of-copyright books (for instance). The idea is that you come up with some auxiliary task and train a neural network to perform it on this large body of text. For instance, you can train a network to predict the next word when given a prefix. The embeddings produced in training this network are general-purpose word embeddings which can be used in any NLP task. People trained sets of these word embeddings and then released them for anyone to download and use.
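As a rough sketch of the “predict the next word given a prefix” task, here is a bigram counter standing in for the neural network; the two-sentence corpus is made up, where the real thing trains on billions of words:

```python
from collections import Counter, defaultdict

# A tiny stand-in corpus; real pre-training uses huge text collections
# such as Wikipedia and out-of-copyright books.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count bigrams: how often does each word follow each other word?
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(word):
    """Predict the continuation seen most often in training."""
    return follows[word].most_common(1)[0][0]

print(predict_next("sat"))  # "on" — both the cat and the dog sat *on* things
```

A neural network trained on the same objective does something far more powerful — it generalises to prefixes it has never seen — but the training signal (“what word comes next?”) is the same, and it comes for free from unlabelled text.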
Whilst pre-trained word embeddings turned out to be very useful, they have an obvious flaw. Consider the sentences “I went to the bank to withdraw cash” and “I sat down on the bank of the river” – both contain the word “bank”, but it is used in very different ways. Using a single embedding for both these uses of “bank” doesn’t seem like a very good idea. In fact, natural language is full of these variations in meaning, generally much more subtle than this example, and a good set of embeddings should distinguish between them. This problem led to the idea of a “contextual embedding”.
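The flaw is easy to see in code: a static lookup table can only ever return one vector per word, whatever the context (the vector values here are invented):

```python
# With static (non-contextual) embeddings, "bank" maps to a single vector
# no matter how it is used. The numbers are made up for illustration.
static_embedding = {"bank": [0.3, -1.2, 0.7]}

sentence_1 = "I went to the bank to withdraw cash".split()
sentence_2 = "I sat down on the bank of the river".split()

# Both occurrences look identical to the model:
vec_1 = static_embedding["bank"]
vec_2 = static_embedding["bank"]
print(vec_1 == vec_2)  # True — the money sense and the river sense collide
```

A contextual embedding, by contrast, produces a vector for “bank” that depends on the whole sentence, so the two occurrences would get different representations.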
Using contextual embeddings means that words are never considered out of context. We train a neural network on some auxiliary task – for example, predicting the next word, or predicting missing words – then we take the whole network and use it to initialise the training of the task we really care about. In other words, once you have your network trained on one of these auxiliary tasks, getting it to work on, say, film review classification is just a matter of “fine-tuning” it. The neural network learns good representations of English words and sentences in the initial training phase, and learns to apply them to the task in the fine-tuning phase.
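The pay-off of initialising from pre-trained parameters can be sketched with a toy word-weight classifier: starting from informative weights (standing in for pre-training) needs far fewer corrections on the small labelled dataset than starting from zero. All words and numbers here are invented for illustration:

```python
# Sketch of why fine-tuning helps: pre-trained starting weights need
# fewer training corrections than random/zero initialisation.

labelled_reviews = [
    ("wonderful film", +1), ("great story", +1),
    ("terrible film", -1), ("boring story", -1),
]

def train(initial_weights, data, epochs=5):
    """Perceptron-style training; returns final weights and update count."""
    weights = dict(initial_weights)
    updates = 0
    for _ in range(epochs):
        for text, label in data:
            score = sum(weights.get(w, 0.0) for w in text.split())
            if (1 if score >= 0 else -1) != label:
                updates += 1
                for w in text.split():
                    weights[w] = weights.get(w, 0.0) + 0.1 * label
    return weights, updates

from_scratch = {}  # no prior knowledge
# Stand-in for pre-training: weights that already "know" word sentiment.
pretrained = {"wonderful": 0.5, "great": 0.5, "terrible": -0.5, "boring": -0.5}

_, updates_scratch = train(from_scratch, labelled_reviews)
_, updates_finetune = train(pretrained, labelled_reviews)
print(updates_scratch, updates_finetune)  # 4 0
```

BERT-style fine-tuning works on the same principle at a much larger scale: the pre-trained network already represents the language well, so only a small labelled dataset is needed to adapt it to the target task.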
BERT was a contextual embedding model released in late 2018 by researchers at Google. It is based on a transformer model, a particular type of neural network that is really effective at representing sentences (when trained in the right way). BERT was trained on two different tasks: predicting missing words, and predicting whether one sentence could plausibly follow another in a text. BERT models were made available for download, in different sizes to suit your needs, and the released software and examples made them easy to use. This release of BERT spawned a whole industry of creating BERTs for other languages (such as the French CamemBERT) as well as BERTs optimised for efficient deployment (e.g. DistilBERT) and domain-specific BERTs (there’s even a FinBERT for financial reports). It even led to a whole field of study, BERTology.
At Aveni, we are very excited about the possibilities that BERT offers for improving NLP in the finance domain, and for speech analytics. One of the biggest issues we face in this area is the lack of labelled data. If we want to do text classification, entity extraction, sentiment analysis, or any other NLP task, then human-labelled data for our task and our domain is vital, but expensive and tedious to collect. Using BERT and friends can drastically reduce the labelling requirements, allowing faster and cheaper development of new NLP-based applications. BERT allows an NLP practitioner to leverage the knowledge in billions of words of text in solving their particular problem.
To learn more about what we offer, visit Aveni Detect