It’s difficult to ignore how rapidly the field of machine translation has changed. For many years, the conventional wisdom was straightforward: training a model solely on French and English would yield the best French-to-English translator. The whole idea was specialization.
Then, in late 2021, Meta entered the WMT competition with a single model capable of handling fourteen language pairs simultaneously and won ten of them. After years of refinement, the bilingual experts suddenly appeared a little out of date.
| Information | Details |
|---|---|
| Project Name | WMT FB AI Multilingual (wmt21-dense-24-wide-en-x) |
| Developed By | Meta AI Research |
| Announcement Date | November 10, 2021 |
| Model Size | 4.7 billion parameters |
| Competition Won | WMT 2021 (Workshop on Machine Translation) |
| Languages Covered | 14 language pairs tested; 10 won outright |
| Successor Project | No Language Left Behind (NLLB) — supports 200 languages |
| Hardware Required | NVIDIA A100 GPU (80GB VRAM) |
| Key Techniques | Large-scale data mining, back translation, Fully Sharded Data-Parallel |
| Comparison Baseline | Marian NMT (7.6M parameters) |
Meta announced the result on November 10, 2021, in a blog post that read more like a quiet mic drop than a victory lap. A single multilingual model, trained to handle many language pairs at once, had outperformed models built for one pair each. According to most researchers in the field, that wasn’t supposed to happen: specialists were widely believed to beat generalists. It turns out the conventional wisdom had never met a transformer with 4.7 billion parameters.
But working with this model is a different kind of headache. Marian, the widely used open-source baseline most researchers measure against, has roughly 7.6 million parameters and is small enough to fit on a free Colab notebook. Meta’s WMT21 model is about 618 times larger. Just loading it calls for an NVIDIA A100 with 80GB of VRAM, hardly standard home-office equipment. On top of that, the fine-tuning scripts shipped with the model didn’t work out of the box and needed substantial tweaking before they would cooperate. Watching enthusiasts wrestle with it suggests the democratization of large language models has limits that no one wants to acknowledge.
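For readers who want to poke at the model without the official scripts, the checkpoint is also published on the Hugging Face Hub. The sketch below is a minimal, unofficial loading example; it assumes the facebook/wmt21-dense-24-wide-en-x checkpoint, a CUDA GPU with enough memory for roughly 4.7 billion half-precision parameters, and a tokenizer that follows the M2M100-style interface (the get_lang_id call).

```python
# Unofficial loading sketch, assuming the facebook/wmt21-dense-24-wide-en-x
# checkpoint on the Hugging Face Hub and a CUDA GPU with enough VRAM for
# ~4.7B parameters in half precision.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "facebook/wmt21-dense-24-wide-en-x"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # half precision roughly halves memory use
).to("cuda")

# Translate one English sentence into German.
inputs = tokenizer("The gearbox needs to be replaced.", return_tensors="pt").to("cuda")
generated = model.generate(
    **inputs,
    # The en-x checkpoint expects a target-language token to choose the output language.
    forced_bos_token_id=tokenizer.get_lang_id("de"),
    max_new_tokens=64,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```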

The preliminary tests are encouraging, though. Before fine-tuning, the Marian baseline scored a modest 7.377 BLEU on automotive-domain data; after training, that figure nearly doubled. The Meta model, by contrast, scored higher out of the box, before any fine-tuning, suggesting the scale advantage carries over to specialized domains rather than applying only to the news text it was trained on.
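For reference, BLEU numbers like the 7.377 above are typically computed with sacrebleu. A minimal sketch follows; the file names are hypothetical placeholders for a domain test set and a system’s translations.

```python
# BLEU evaluation sketch with sacrebleu. The file names are hypothetical
# placeholders for an automotive-domain test set and a system's output.
import sacrebleu

with open("automotive.hyp.en", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]   # system output, one sentence per line
with open("automotive.ref.en", encoding="utf-8") as f:
    references = [line.strip() for line in f]   # reference translations, same order

# corpus_bleu takes a list of reference streams, hence the extra nesting.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.3f}")
```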
The intriguing part is how Meta got there. Instead of depending only on manually curated parallel corpora, the team built two complementary pipelines, any-to-English and English-to-any, and mined translations from large-scale web crawls. To create synthetic training pairs, they leaned heavily on back translation: monolingual text in the target language is translated into the source language by a reverse-direction model, and the resulting sentence pairs are added to the training data. They also scaled aggressively, pushing model capacity from 15 billion parameters in earlier experiments to 52 billion. None of that would have been practical without Fully Sharded Data-Parallel, Meta’s technique for sharding model state across GPUs, which reportedly speeds up large-scale training by about five times.
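To make the back-translation step concrete, here is a rough sketch of generating synthetic English-to-German pairs from monolingual German text. It assumes the companion any-to-English checkpoint facebook/wmt21-dense-24-wide-x-en and an M2M100-style tokenizer; the German sentences are illustrative only, and this is not Meta’s production pipeline.

```python
# Back-translation sketch (not Meta's actual pipeline): build synthetic
# English->German training pairs from monolingual German text using the
# reverse-direction checkpoint. Sentences here are illustrative only.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

reverse_name = "facebook/wmt21-dense-24-wide-x-en"  # any-to-English direction
tokenizer = AutoTokenizer.from_pretrained(reverse_name)
reverse_model = AutoModelForSeq2SeqLM.from_pretrained(reverse_name)

monolingual_de = [
    "Das Getriebe muss ausgetauscht werden.",
    "Der Motor läuft im Leerlauf unruhig.",
]

synthetic_pairs = []
tokenizer.src_lang = "de"  # tell the tokenizer the input language (M2M100-style API)
for de_sentence in monolingual_de:
    inputs = tokenizer(de_sentence, return_tensors="pt")
    back = reverse_model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.get_lang_id("en"),  # force English output
        max_new_tokens=64,
    )
    en_sentence = tokenizer.batch_decode(back, skip_special_tokens=True)[0]
    # The synthetic English becomes the source; the real German stays the target.
    synthetic_pairs.append((en_sentence, de_sentence))

print(synthetic_pairs)
```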
Things become interesting when you zoom out. WMT21 was not an end point but a stepping stone toward No Language Left Behind, Meta’s 200-language project covering Asturian, Luganda, Urdu, and other low-resource languages that commercial translation systems have long ignored. The stated ambition is openly civilizational: real-time translation on Facebook, Instagram, and the metaverse, anywhere people want to connect. Whether that vision arrives on time, or at all, remains to be seen.
Tesla once drew similar doubts about betting on scale. So did OpenAI. These bets tend to look impossible at first and inevitable in hindsight. Meta’s WMT victory sits somewhere in between: a proof of concept, with the real test still ahead. The bilingual era is not over yet. But the leaderboards are saying something no single benchmark can: it is no longer the only game in town.
