OpusCleaner and OpusTrainer: open-source toolkits for training machine translation and large language models

OpusCleaner and OpusTrainer are free tools created by a team of researchers, including Barry Haddow, Aveni’s Head of Natural Language Processing, to make training machine translation systems and large language models easier. The tools are designed to simplify building high-quality machine translation systems, especially for those new to the field. They tackle challenges such as organising data, cleaning it, and getting it ready for training.


OpusCleaner is a toolkit for downloading, cleaning, and preparing data for translation. It helps researchers quickly fetch bilingual or monolingual data from different sources, clean it, and make it ready for training.


OpusTrainer, in turn, organises and enhances data for building strong translation systems and language models. Among other things, it mixes data from various sources and adjusts it on the fly during training.


OpusTrainer uses a simple configuration file to choose data sources and specify how they are mixed at each training stage. By adjusting the data mix dynamically, it avoids the limitations of a fixed training set and helps ensure that all languages get fair representation. Data mixing also involves balancing clean and noisy data so that the noisy data does not confuse the model.
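
The idea of mixing data sources in different proportions at each training stage can be illustrated with a minimal Python sketch. This is not OpusTrainer's code or its configuration format: the dataset names, stage names, weights, and the functions `mix_batches` and `run_schedule` are all hypothetical and chosen only for illustration.

```python
import random
from typing import Dict, Iterator, List, Tuple

# Hypothetical toy datasets; real ones would be files of training lines.
DATASETS: Dict[str, List[str]] = {
    "clean": [f"clean-{i}" for i in range(5)],
    "noisy": [f"noisy-{i}" for i in range(5)],
}

# Hypothetical schedule: (stage name, mixing weights, number of batches).
STAGES: List[Tuple[str, Dict[str, float], int]] = [
    ("start", {"clean": 0.8, "noisy": 0.2}, 2),
    ("end", {"clean": 0.5, "noisy": 0.5}, 2),
]


def mix_batches(
    datasets: Dict[str, List[str]],
    weights: Dict[str, float],
    batch_size: int,
    num_batches: int,
    rng: random.Random,
) -> Iterator[List[str]]:
    """Yield batches in which each dataset contributes its weighted share."""
    cursors = {name: 0 for name in weights}  # read position per dataset
    for _ in range(num_batches):
        batch: List[str] = []
        for name, weight in weights.items():
            lines = datasets[name]
            for _ in range(round(batch_size * weight)):
                # Wrap around so a small dataset is simply repeated.
                batch.append(lines[cursors[name] % len(lines)])
                cursors[name] += 1
        rng.shuffle(batch)  # avoid grouping lines by source within a batch
        yield batch


def run_schedule(
    datasets: Dict[str, List[str]],
    stages: List[Tuple[str, Dict[str, float], int]],
    batch_size: int,
    seed: int = 1111,
) -> Iterator[Tuple[str, List[str]]]:
    """Walk through the stages in order, changing the mix at each stage."""
    rng = random.Random(seed)
    for stage_name, weights, num_batches in stages:
        for batch in mix_batches(datasets, weights, batch_size, num_batches, rng):
            yield stage_name, batch


if __name__ == "__main__":
    for stage, batch in run_schedule(DATASETS, STAGES, batch_size=10):
        clean_share = sum(line.startswith("clean") for line in batch) / len(batch)
        print(stage, f"clean share = {clean_share:.1f}")
```

Because the mix is computed batch by batch rather than baked into a single training file, changing the proportions between stages only requires changing the weights in the schedule.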


The paper describing the tools also stresses the importance of a well-planned training schedule, of mixing different types of data, and of enhancing the data in order to build top-notch translation systems. Data enhancement deals with issues such as typos, capitalization, emojis, and special characters in the training data.
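
A rough sketch of what enhancing training data for capitalization and typos might look like is given below. It assumes each training example is a tab-separated "source, target" line, and it applies each modification with a fixed probability. The function names (`upper_case`, `title_case`, `swap_typo`, `augment`) and the probabilities are hypothetical and are not taken from OpusTrainer itself.

```python
import random
from typing import Callable, List, Tuple

Modifier = Callable[[str, random.Random], str]


def upper_case(line: str, rng: random.Random) -> str:
    """Upper-case both sides so the model also sees all-caps text."""
    return line.upper()


def title_case(line: str, rng: random.Random) -> str:
    """Title-case both sides, as in headlines."""
    return line.title()


def swap_typo(line: str, rng: random.Random) -> str:
    """Swap two adjacent characters in the source side only."""
    source, sep, target = line.partition("\t")
    if len(source) < 2:
        return line
    i = rng.randrange(len(source) - 1)
    source = source[:i] + source[i + 1] + source[i] + source[i + 2:]
    return source + sep + target


def augment(
    lines: List[str],
    modifiers: List[Tuple[Modifier, float]],
    seed: int = 1111,
) -> List[str]:
    """Apply each modifier to each line with its given probability."""
    rng = random.Random(seed)
    result = []
    for line in lines:
        for modifier, probability in modifiers:
            if rng.random() < probability:
                line = modifier(line, rng)
        result.append(line)
    return result


if __name__ == "__main__":
    data = ["hello world\thallo welt", "good morning\tguten morgen"]
    # Hypothetical probabilities, chosen high so the effect is visible.
    for line in augment(data, [(upper_case, 0.3), (swap_typo, 0.5)]):
        print(line)
```

In this sketch the typo is introduced on the source side only, so the model would learn to produce a clean translation from noisy input, while the case changes are applied to both sides so that casing carries over to the output.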


