From Data to Dialogue: The Global Reach of HPLT v2

At Aveni, we’re passionate about driving innovation in AI, and we’re proud to have supported the development of a landmark academic contribution: HPLT v2, a multilingual dataset built to elevate the performance of language technologies across the globe.

But what does that mean in practical terms, and why should it matter to financial services?

The Multilingual Data Challenge

Training modern large language models (LLMs) demands more than raw compute; it needs data. And not just any data: clean, diverse, and representative text that covers the world's languages, not just English.

HPLT v2 tackles this head-on. It’s a follow-up to the original HPLT dataset (which members of the Aveni Labs team also helped shape), now dramatically expanded and refined. We’re talking:

  • 8 trillion tokens across 193 languages
  • 380 million sentence pairs in 50 languages for machine translation
  • Data extracted from over 4.5 petabytes of web archives
  • A transparent and reproducible pipeline, open-sourced for the community

This is arguably one of the most ambitious open-data efforts in language tech to date.

A petabyte (PB) is a unit of digital storage that equals:

  • 1,000 terabytes (TB)
  • 1,000,000 gigabytes (GB)
  • 1,000,000,000 megabytes (MB)

To put that into perspective, if you streamed HD Netflix continuously, it would take you over 13 years to get through 1 PB. Multiply that by 4.5 for HPLT v2, and you’re looking at almost 60 years of non-stop streaming!
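The arithmetic behind that comparison can be sketched in a few lines. This assumes an HD stream of roughly 19 Mbit/s (a broadcast-quality HD bitrate; Netflix's actual bitrates vary, so treat this purely as an illustration):

```python
# Back-of-the-envelope check: how long would it take to stream 1 PB?
PETABYTE_BITS = 1_000_000_000_000_000 * 8  # 1 PB in bits (decimal units)
HD_BITRATE = 19e6                          # assumed bits per second for HD video
SECONDS_PER_YEAR = 365.25 * 24 * 3600

years_per_pb = PETABYTE_BITS / HD_BITRATE / SECONDS_PER_YEAR
print(f"1 PB   ≈ {years_per_pb:.1f} years of continuous streaming")
print(f"4.5 PB ≈ {4.5 * years_per_pb:.0f} years")
```

At that assumed bitrate the numbers come out at roughly 13 years per petabyte and about 60 years for the full 4.5 PB, in line with the comparison above.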

What Makes HPLT v2 Different?

Unlike many datasets, which lean heavily on English or a handful of European languages, HPLT v2 emphasises global inclusivity. It pulls from both Common Crawl and the Internet Archive, ensuring broader cultural and linguistic representation.

The paper outlines a meticulous pipeline that cleans, deduplicates and evaluates data quality using a mix of human inspection, statistical analysis, and automatic genre classification. It's not just about quantity; it's about quality too.

Importantly for the financial world, the dataset avoids spam, boilerplate and low-value content. That means models trained on HPLT v2 are better equipped to understand professional, domain-specific language, including financial documents.

Real-World Impact for Financial Services

So, why is this exciting for financial services?

Firstly, regulatory reporting, client communications, compliance checks and customer service increasingly rely on LLMs. But these models often perform inconsistently across languages and dialects. HPLT v2 helps fix that by boosting multilingual performance across the board.

Secondly, financial services operate globally. Whether it's a wealth manager serving high-net-worth clients in Asia or a bank operating across the EU, multilingual support isn't optional; it's essential. The ability to deploy high-performing models in Finnish, Arabic or Vietnamese (just three of the 193 languages included) offers a competitive edge.

And finally, trust and transparency matter. HPLT v2 is open. It’s reproducible. That aligns with Aveni’s approach to building ethical, transparent AI, especially in regulated sectors.

What the Results Show

Models trained on HPLT v2 outperform earlier versions across key benchmarks, from part-of-speech tagging to named entity recognition and machine translation. In real terms, this means smarter, more accurate, and more context-aware language models that can be trusted to operate in sensitive environments like finance.

In particular:

  • BLEU scores for translation models improved by over 4 points compared to previous datasets
  • Named Entity Recognition tasks showed significant uplift, especially in mid- and low-resource languages
  • Cross-lingual performance improved, making it easier to deploy a single model across markets

All of this means fewer hallucinations, more accurate outputs, and better client outcomes.
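For readers unfamiliar with the metric, BLEU scores a machine translation by its n-gram overlap with a reference translation, on a 0–1 scale usually reported as 0–100, so a 4-point gain is a meaningful jump. A minimal sentence-level sketch (with crude smoothing and the standard brevity penalty; not the exact implementation used in the paper):

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """Multiset of n-grams appearing in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    """Sentence-level BLEU: geometric mean of 1..max_n n-gram precisions,
    scaled by a brevity penalty for hypotheses shorter than the reference."""
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        overlap = sum((ngrams(hyp, n) & ngrams(ref, n)).values())
        total = max(sum(ngrams(hyp, n).values()), 1)
        # crude smoothing: avoid log(0) when an n-gram order has no matches
        log_precisions.append(math.log(max(overlap, 1) / total))
    brevity = min(1.0, math.exp(1 - len(ref) / max(len(hyp), 1)))
    return brevity * math.exp(sum(log_precisions) / max_n)

print(bleu("the cat sat on the mat", "the cat sat on the mat"))  # 1.0
```

In practice, published BLEU numbers are computed over whole test corpora with standardised tooling, but the intuition is the same: closer overlap with human reference translations means a higher score.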

Built by a Global Team, Including Aveni

We’re thrilled that many Aveni team members were contributors to this research. It reflects our deep roots in academia and our commitment to pushing boundaries in AI. The paper was co-authored by a global coalition of researchers from the universities of Edinburgh, Helsinki, Oslo and others: an inspiring reminder of what’s possible when academia and industry work together.

What’s Next?

As the paper notes, there’s still work to be done, especially in supporting underserved languages and in refining filters for machine-generated content. But HPLT v2 sets a new standard for public, high-quality, multilingual training data.

At Aveni, we’ll be using insights from HPLT v2 to continue developing cutting-edge language technologies tailored for financial services. We’re not just watching this space – we’re building it!

Karsyn Meurisse
