Natural Language Processing (NLP) is a subset of artificial intelligence that enables machines to understand, process and analyse natural language in the way that humans do. The machine analyses the data, interprets it, measures sentiment and draws the intended inference from it. The data used for Natural Language Processing (and other forms of machine learning) may be labelled. Labelled data is data with predefined tags that provide information the machine can learn from. Learning from such data is called supervised learning. A simple example of labelled data is customer bio data with labels indicating that the strings of letters containing an ‘@’ symbol are email addresses, the two-digit numbers are ages, the images are passport photos, etc. With unlabelled data, however, there are no such tags, and the machine has to categorise or cluster data attributes with similar patterns. This process is known as unsupervised learning.
Natural Language Processing has achieved remarkable progress in the past decade on the basis of neural models. Using large amounts of labelled data can help achieve state-of-the-art performance for tasks such as sentiment detection, Named Entity Recognition (NER), Natural Language Inference (NLI) or question answering. For these tasks, the labels or tags would be, for example, the sentiment of a review, or the people, places or organisations mentioned in the text. However, the dependence on labelled data prevents NLP models from being applied in low-resource settings, because of the time, money, and expertise that are often required to label large amounts of textual data.
We’re going to take a look at recent advances in NLP that allow deep learning models to learn from very few examples. This is crucial for speech analytics, where labelled examples are often in very short supply.
Pre-train and Fine-tune
In the last few years, a new paradigm – pre-train and fine-tune – has emerged, which allows us to leverage large amounts of unlabelled data for NLP. The premise is that perhaps it’s better to first learn to model the language itself, then once we have a model that understands the language, we can share this knowledge with the many different tasks we care about by fine-tuning it on small amounts of labelled data. Language modelling is a machine learning task where the model needs to learn how to predict a missing word given the context of the rest of the sentence. This is a generic task with abundant naturally occurring data and can be used to pre-train such a generic model.
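To make the language modelling task concrete, here is a deliberately tiny sketch of "predict the missing word from its context". It is not how BERT or any neural language model works internally: it simply counts which word appeared between each pair of neighbouring words in a small unlabelled corpus. The corpus, the function names and the `[MASK]` token convention are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Tiny unlabelled corpus (illustrative only); real pre-training uses billions of words.
CORPUS = [
    "the customer called the bank about a loan",
    "the customer emailed the bank about a mortgage",
    "the customer called the adviser about a pension",
]


def train_context_counts(sentences):
    """Count which word appears between each (left, right) pair of neighbours."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for left, word, right in zip(tokens, tokens[1:], tokens[2:]):
            counts[(left, right)][word] += 1
    return counts


def predict_masked(counts, sentence, mask="[MASK]"):
    """Return the word seen most often in the masked position's context, or None."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    i = tokens.index(mask)
    candidates = counts.get((tokens[i - 1], tokens[i + 1]))
    return candidates.most_common(1)[0][0] if candidates else None


counts = train_context_counts(CORPUS)
print(predict_masked(counts, "the customer [MASK] the bank about a loan"))
```

No labels were needed at any point: the "answers" come for free from the text itself, which is why this kind of task scales to very large corpora.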
Arguably, the model that kick-started this trend was the Bidirectional Encoder Representations from Transformers (BERT) model. BERT is a transformer-based machine learning technique for pre-training developed by Google. The model takes input sentences where some words are masked out, and the task is to predict the masked words. What really set BERT apart was the ease of fine-tuning: it is cleverly designed so that fine-tuning is easy for lots of different tasks. You can download BERT pre-trained on a large English corpus like the BooksCorpus, add a task-specific “head” onto it to create a new architecture for your task, and then fine-tune it on your labelled data. This approach has led to huge improvements over the previous state of the art, providing a nice off-the-shelf solution to standard problems.
Let’s look at some other ways of using pre-trained language models:
Recently, researchers realised that an alternative paradigm would be to make the final task look more like language modelling. This would mean that a fine-tuning step is no longer needed. It would also mean that we could potentially perform new downstream tasks with little or no labelled data. This paradigm is called pre-train and prompt.
The first pre-train and prompt paper, which showed the potential of this approach, was published in 2020 by Google (Raffel et al. 2020). They suggested a unified approach to transfer learning in Natural Language Processing with the goal of setting a new state of the art in the field. To this end, they treated every NLP problem as a “text-to-text” problem. Such a framework allows the same model, objective, training procedure, and decoding process to be used for different tasks, including summarisation, sentiment analysis, question answering, and machine translation. The researchers call their model the Text-to-Text Transfer Transformer (T5) and train it on a large corpus of web-scraped data to get state-of-the-art results on several NLP tasks. The way to make all NLP tasks text-to-text is to select appropriate prompts, so that the pre-trained language model itself can be used to predict the desired output, sometimes even without any additional task-specific training. This allows few-shot (learning from only a few labelled examples) and even zero-shot (generalising to a new task with no labelled examples) behaviour.
In this example, a prompting function turns the input into a sentence in which the language model needs to predict Z, which in this case we would expect to be a word expressing positive sentiment. This allows us to use the language model directly for a specific task: sentiment detection.
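A prompting function of this kind can be sketched in a few lines. The template wording, the answer words and the `toy_score` function below are assumptions for illustration: in practice the score would come from a pre-trained language model judging how plausible each filled-in sentence is, whereas here a hand-written word list stands in for it so the sketch runs on its own.

```python
# Hypothetical prompt template: x is the input text, z is the slot the model fills.
TEMPLATE = "{x} Overall, it was a {z} film."

# Answer words the model may predict for Z, mapped back to task labels.
VERBALIZER = {
    "great": "positive",
    "fantastic": "positive",
    "terrible": "negative",
    "boring": "negative",
}

# Stand-in for a language model (an assumption, not a real model).
POSITIVE_WORDS = {"love", "loved", "great", "fantastic"}
NEGATIVE_WORDS = {"hated", "terrible", "boring"}


def prompt(x, z="[Z]"):
    """Prompting function: wrap the input in the template, leaving Z open by default."""
    return TEMPLATE.format(x=x, z=z)


def toy_score(text):
    """Score a filled prompt: sentences with consistent polarity score higher."""
    words = text.lower().replace(".", " ").replace(",", " ").split()
    pos = sum(w in POSITIVE_WORDS for w in words)
    neg = sum(w in NEGATIVE_WORDS for w in words)
    return abs(pos - neg)


def classify(x, score_fn=toy_score):
    """Fill Z with each candidate answer and return the label of the best-scoring one."""
    best_answer = max(VERBALIZER, key=lambda z: score_fn(prompt(x, z)))
    return VERBALIZER[best_answer]


print(prompt("I love this movie."))
print(classify("I love this movie."))
```

The classifier itself was never trained on sentiment labels; the only task-specific pieces are the template and the mapping from answer words to labels.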
There are many different possible tasks that language models can perform:
Here we can see examples of different prompts for different tasks. T5 was applied to several benchmarks and surpassed previous state-of-the-art results across many individual Natural Language Processing tasks. T5 generated great interest in prompting, and since then various improvements and challenges have been identified.
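As a rough sketch of the text-to-text idea, each task can be expressed as a plain input string with a task prefix and a plain target string. The prefixes and example pairs below follow those illustrated in the T5 paper; the helper name `to_text_to_text` and the task keys are hypothetical.

```python
# Task prefixes in the style used by T5; the dictionary keys are hypothetical names.
TEMPLATES = {
    "translate_en_de": "translate English to German: {text}",
    "summarize": "summarize: {text}",
    "cola": "cola sentence: {text}",
    "stsb": "stsb sentence1: {s1} sentence2: {s2}",
}


def to_text_to_text(task, **fields):
    """Render one example of a task as a single input string for a text-to-text model."""
    return TEMPLATES[task].format(**fields)


# (input, target) pairs: every task, even classification or regression, has a text target.
examples = [
    (to_text_to_text("translate_en_de", text="That is good."), "Das ist gut."),
    (to_text_to_text("cola", text="The course is jumping well."), "not acceptable"),
    (
        to_text_to_text(
            "stsb",
            s1="The rhino grazed on the grass.",
            s2="A rhino is grazing in a field.",
        ),
        "3.8",
    ),
]

for source, target in examples:
    print(f"{source!r} -> {target!r}")
```

Because inputs and outputs are always strings, one model with one training objective can serve all of these tasks.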
Finding good prompts is difficult, and recent work has focussed on finding them automatically. Another active research question is how and when to train a model with prompts.
Working with large language models is also challenging. Their size has been increasing 10x every year for the last few years, a road that leads to diminishing returns, higher costs, more complexity, and new risks. Downsizing efforts are also underway in the Natural Language Processing community, using transfer learning techniques such as knowledge distillation, which trains a smaller student model to learn from the original (teacher) model. The student model, e.g. DistilBERT (Sanh et al. 2019), can then be used for more efficient inference.
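The core of knowledge distillation can be sketched as a loss that pulls the student's output distribution towards the teacher's. The version below is a minimal, generic formulation (KL divergence between temperature-softened distributions); it is an illustration under that assumption and not the full DistilBERT training recipe. The logits and function names are made up for the example.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; a higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) computed on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


teacher = [2.0, 0.5, -1.0]        # hypothetical teacher logits for three classes
close_student = [1.8, 0.6, -0.9]  # student that roughly agrees with the teacher
far_student = [-1.0, 0.5, 2.0]    # student that disagrees with the teacher

print(distillation_loss(close_student, teacher))
print(distillation_loss(far_student, teacher))
```

During training, the student's parameters would be updated to minimise this loss, usually alongside an ordinary loss on whatever labelled data is available.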
Data augmentation is a set of techniques to artificially increase the amount of data by generating new copies from existing data. This includes making small changes to data or using deep learning models to generate new data points. Data augmentation techniques make machine learning models more robust by creating variations that the model may see in the real world. It is widely used in image processing, but augmenting textual data is more difficult due to the complexity and structure of language. Common methods for data augmentation in Natural Language Processing are:
Token-level augmentation
Takes existing data and creates new examples by adding variety at the word level. Common augmentations are synonym replacement, word insertion, word swap and word deletion.
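The four word-level operations can be sketched with the standard library alone. The synonym lexicon here is a hypothetical toy dictionary; a real system would draw on a thesaurus or word embeddings, and the function names and the deletion probability are likewise assumptions.

```python
import random

# Hypothetical toy synonym lexicon, for illustration only.
SYNONYMS = {"quick": ["fast", "rapid"], "happy": ["glad", "pleased"]}


def synonym_replace(tokens, rng):
    """Replace one randomly chosen word that has a known synonym."""
    out = list(tokens)
    candidates = [i for i, t in enumerate(out) if t in SYNONYMS]
    if candidates:
        i = rng.choice(candidates)
        out[i] = rng.choice(SYNONYMS[out[i]])
    return out


def random_insert(tokens, rng):
    """Insert a synonym of a random known word at a random position."""
    out = list(tokens)
    known = [t for t in out if t in SYNONYMS]
    if known:
        new_word = rng.choice(SYNONYMS[rng.choice(known)])
        out.insert(rng.randrange(len(out) + 1), new_word)
    return out


def random_swap(tokens, rng):
    """Swap the positions of two randomly chosen words."""
    out = list(tokens)
    if len(out) >= 2:
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_delete(tokens, rng, p=0.2):
    """Drop each word with probability p, always keeping at least one word."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() > p]
    return kept or [rng.choice(tokens)]


rng = random.Random(0)  # fixed seed so runs are repeatable
sentence = "the quick reply made the customer happy".split()
for augment in (synonym_replace, random_insert, random_swap, random_delete):
    print(augment.__name__, "->", " ".join(augment(sentence, rng)))
```

Each augmented sentence keeps the original label, so a handful of labelled examples can be turned into many slightly different ones.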
Sentence-level augmentation
Takes existing data and creates new examples by replacing whole sentences. A popular method here is back translation, where, for example, an English sentence is translated into German and then translated back into English. Another method is applying paraphrasing models to the original texts. However, of particular interest to us in the context of prompting is that we can use a large pre-trained language model to generate new examples from prompts built on existing instances. GPT3Mix (Yoo et al. 2021) is a prompt-based approach that doesn’t require fine-tuning: a prompt is constructed from a few sample sentences taken from the task-specific training data, together with a description of the data. A large pre-trained language model (GPT-3) then generates new sentences influenced by the sample sentences.
Here we show an example taken from their paper on automatically generating training data for the sentiment detection task. The authors report a substantial improvement over baselines such as back translation.
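The prompt construction step can be sketched as follows. The template wording is an assumption loosely modelled on the description of GPT3Mix, not a copy of the authors' exact prompt, and the function names are hypothetical. The call to the language model itself is left out: the sketch only builds the prompt and parses a completion of the expected shape.

```python
import re


def build_augmentation_prompt(examples, text_type="Movie review", label_type="Sentiment"):
    """Build a prompt from a data description plus a few labelled sample sentences."""
    labels = sorted({label for _, label in examples})
    label_list = " or ".join(f"'{label}'" for label in labels)
    lines = [
        f"Each item in the following list contains a {text_type.lower()} "
        f"and the respective {label_type.lower()}.",
        f"{label_type} is one of {label_list}.",
        "",
    ]
    for text, label in examples:
        lines.append(f"{text_type}: {text} ({label_type}: {label})")
    # Open-ended last item: the language model continues with a new example and label.
    lines.append(f"{text_type}:")
    return "\n".join(lines)


def parse_completion(completion, label_type="Sentiment"):
    """Extract (text, label) from a completion such as ' Some text. (Sentiment: positive)'."""
    match = re.match(rf"\s*(.+?)\s*\({re.escape(label_type)}: (\w+)\)", completion)
    return (match.group(1), match.group(2)) if match else None


samples = [
    ("A warm, funny and moving story.", "positive"),
    ("The plot drags and the jokes fall flat.", "negative"),
]
print(build_augmentation_prompt(samples))
print(parse_completion(" A dull, lifeless film. (Sentiment: negative)"))
```

Each parsed completion becomes a new synthetic training example, which can then be added to the original labelled data.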
At Aveni Labs, we’re experimenting with and leveraging these approaches to produce models that can be trained using very little labelled data. We use prompting to create more labelled data, and data augmentation to expand our labelled dataset. Our expert understanding of these methods means that we can deliver production-ready models with far less data than was previously possible for machine learning solutions. Instead of needing 1,000s of training examples, we can make classifiers that work with only 100s, or in some cases even 10s, of real training examples. With this approach, we’re able to build superior models that need less human supervision and offer excellent transcription accuracy and greater functionality, for example accurately identifying vulnerable customers.
We work at the forefront of Artificial Intelligence and Natural Language Processing. Our world-class NLP engineers have employed these techniques and approaches to build our product, Aveni Detect, which lets you analyse 100% of customer interactions to power business improvement. Learn more here.