OpusCleaner and OpusTrainer, open source toolkits for training Machine Translation and Large language models

OpusCleaner and OpusTrainer are free, open source tools created by a team of researchers, including Barry Haddow, Aveni’s Head of Natural Language Processing, to make training machine translation systems and large language models easier. They are designed to simplify the process of building high-quality machine translation systems, especially for those new to the field, by tackling three practical challenges: organising data, cleaning it, and getting it ready for training.


OpusCleaner is a toolkit for downloading, cleaning, and preparing translation data. It helps researchers quickly fetch bilingual or monolingual data from different sources, filter out unusable material, and produce a dataset that is ready for training.
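The kind of cleaning step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not OpusCleaner's own code or API: the function names (`not_empty`, `length_ratio_ok`, `clean`) and the particular filters and threshold are hypothetical choices made for this example.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def not_empty(pair: Pair) -> bool:
    """Keep only pairs where both sides contain some text."""
    return bool(pair[0].strip()) and bool(pair[1].strip())


def length_ratio_ok(pair: Pair, max_ratio: float = 3.0) -> bool:
    """Drop pairs whose word counts differ by more than max_ratio (assumed threshold)."""
    src_len, trg_len = len(pair[0].split()), len(pair[1].split())
    if min(src_len, trg_len) == 0:
        return False
    return max(src_len, trg_len) / min(src_len, trg_len) <= max_ratio


def clean(pairs: Iterable[Pair], filters: List[Callable[[Pair], bool]]) -> List[Pair]:
    """Strip whitespace, remove exact duplicates, and apply every filter in order."""
    seen = set()
    kept = []
    for src, trg in pairs:
        pair = (src.strip(), trg.strip())
        if pair in seen:
            continue
        if all(check(pair) for check in filters):
            seen.add(pair)
            kept.append(pair)
    return kept


raw = [
    ("Hello world", "Hallo Welt"),
    ("Hello world", "Hallo Welt"),          # exact duplicate
    ("", "leer"),                           # empty source side
    ("one", "eins zwei drei vier fünf"),    # suspicious length ratio
]
cleaned = clean(raw, [not_empty, length_ratio_ok])
```

Expressing each check as a small function keeps the pipeline easy to reorder or extend, which is the general idea behind a filter-based cleaning tool.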


OpusTrainer, on the other hand, organises and enhances data for building robust translation systems and language models. It mixes data from various sources and modifies it on the fly during training.


OpusTrainer uses a simple configuration file to select data sources and set how they are mixed at each training stage. Because the data mix is adjusted dynamically, OpusTrainer avoids the limitations of a fixed training set and helps ensure all languages get fair representation. Data mixing also involves balancing clean and noisy data so that the noisier sources do not confuse the model.
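The staged mixing described above can be sketched as follows. This is a minimal illustration, assuming each stage is given as a set of sampling weights over named datasets; the `stages` structure, the dataset names, and the functions `mix_batches` and `run_schedule` are hypothetical and do not reproduce OpusTrainer's actual configuration format.

```python
import random
from typing import Dict, Iterator, List


def mix_batches(datasets: Dict[str, List[str]],
                weights: Dict[str, float],
                batch_size: int,
                num_batches: int,
                seed: int = 1) -> Iterator[List[str]]:
    """Yield batches whose lines are drawn from each dataset in proportion to its weight."""
    rng = random.Random(seed)
    names = list(weights)
    probs = [weights[name] for name in names]
    for _ in range(num_batches):
        batch = []
        for _ in range(batch_size):
            source = rng.choices(names, weights=probs, k=1)[0]
            batch.append(rng.choice(datasets[source]))
        yield batch


def run_schedule(datasets: Dict[str, List[str]],
                 stages: List[dict],
                 batch_size: int,
                 seed: int = 1) -> Iterator[List[str]]:
    """Run the stages one after another, each with its own data mix."""
    for index, stage in enumerate(stages):
        yield from mix_batches(datasets, stage["weights"], batch_size,
                               stage["batches"], seed + index)


# Hypothetical data and schedule: start mostly clean, then add more noisy data.
datasets = {
    "clean": ["clean sentence 1", "clean sentence 2"],
    "noisy": ["noisy sentence 1", "noisy sentence 2"],
}
stages = [
    {"name": "start", "weights": {"clean": 0.9, "noisy": 0.1}, "batches": 2},
    {"name": "finish", "weights": {"clean": 0.5, "noisy": 0.5}, "batches": 3},
]
all_batches = list(run_schedule(datasets, stages, batch_size=4))
```

Sampling each line by weight, rather than concatenating files once, is what lets the mix change between stages without rebuilding the training set.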


The paper also discusses why a good training plan matters: scheduling the stages of training, mixing different types of data, and enhancing the data are all needed to build high-quality translation systems. Data enhancement prepares the model for issues such as typos, capitalisation, emojis, and special characters by exposing it to them in the training data.
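As an illustration of this kind of data enhancement, the sketch below randomly upper-cases, title-cases, or adds a typo to a sentence pair. It is an assumption-laden example rather than OpusTrainer's implementation: the function names (`add_typo`, `augment_pair`) and the probabilities are hypothetical.

```python
import random
from typing import Tuple


def add_typo(sentence: str, rng: random.Random) -> str:
    """Simulate a typo by swapping two adjacent characters at a random position."""
    if len(sentence) < 2:
        return sentence
    i = rng.randrange(len(sentence) - 1)
    chars = list(sentence)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def augment_pair(src: str, trg: str, rng: random.Random,
                 upper_prob: float = 0.1,
                 title_prob: float = 0.1,
                 typo_prob: float = 0.1) -> Tuple[str, str]:
    """Apply at most one modification to a (source, target) pair.

    Casing changes are applied to both sides so the model learns to carry
    casing over; typos are added to the source only, so the model learns to
    produce clean output from noisy input.
    """
    roll = rng.random()
    if roll < upper_prob:
        return src.upper(), trg.upper()
    if roll < upper_prob + title_prob:
        return src.title(), trg.title()
    if roll < upper_prob + title_prob + typo_prob:
        return add_typo(src, rng), trg
    return src, trg


rng = random.Random(0)
augmented = [augment_pair("hello world", "hallo welt", rng) for _ in range(5)]
```

Applying the modifications at training time means each pass over the data can see a different variant of the same sentence.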


