Local Knowledge at a Price: Reading LatamGPT Through Its Benchmarks

In February 2026, LatamGPT was presented to a wave of regional enthusiasm: the first open large language model built from and for Latin America and the Caribbean, coordinated by Chile's CENIA and more than sixty institutions across the region. The promise was sovereignty and cultural relevance: moving Latin America from a consumer of foreign models to a producer of its own. When the model's open weights were released on June 1, 2026, though, they arrived without a single benchmark number. As someone who has spent years working with language models, I found that absence hard to ignore. So I decided to run the comparison myself, and to ask the question the release skipped: what did all that training actually buy, and what did it cost?

What LatamGPT actually is

The single most important fact for reading any of its results is this: LatamGPT is not trained from scratch. It is Meta's Llama 3.1 70B, adapted through continued pre-training (CPT) on roughly 230 billion words of permissioned, regionally-sourced text in Spanish, Portuguese and Indigenous languages, followed by supervised fine-tuning. Several news pieces described it as "built entirely in the region," which is true of the pipeline—data collection, training and post-training were all done locally—but not of the weights, which start life as Meta's model. That distinction is not a technicality. It is exactly why the fair yardstick for LatamGPT is not GPT-4 or some abstract ideal, but the base model it began from. If continued pre-training is working, the regional model should know things Llama 3.1 does not—without losing what Llama 3.1 already knew.

The hidden cost: continued pre-training taxes what the model already knew

Continued pre-training is deceptively hard. You are nudging a finished model's weights toward a new data distribution, and the model has no obligation to keep its old abilities intact while you do. This is catastrophic forgetting, and it has been a central headache in continual learning for years—I've poked at versions of it myself, from keeping a BERT-style model pre-training while bolting on new objectives (EMNLP 2021) to studying adapter strategies that let models learn in sequence without collapsing (Findings of EMNLP 2024). At pre-training scale, with a 70B model and hundreds of billions of new tokens, that tension does not disappear—it gets more expensive to manage.

The numbers show the tax plainly. On general academic benchmarks in Spanish, LatamGPT lands below its own base model almost across the board. On MMLU-ProX (es)—the cleanest, fully-Spanish reasoning benchmark in the set—it scores 45.8 against Llama 3.1's 61.5, a drop of nearly sixteen points. Grade-school math (MGSM) falls from 88.8 to 78.8; emotional-reasoning EQ-Bench from 75.4 to 69.0. The only general task where it roughly holds is HeadQA, a medical-exam QA set (51.2 vs. 50.4). In other words, the adaptation that was supposed to make the model more useful in the region made it measurably worse at general reasoning than the model it was carved from.

Figure: On the academic benchmarks (left), LatamGPT (red) trails its own base, Llama 3.1 70B (blue), and the two newer, smaller models lead both on most tasks. The exceptions are MGSM, where Qwen3.6 sits just under Llama 3.1, and HeadQA, where LatamGPT edges its base (and Gemma's score looks like an evaluation artifact). On CHOCLO (right), the benchmark it was built for, the picture flips.

The exact numbers behind the left side of that chart:

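| General benchmark (Spanish) | Shots | n | Metric | LatamGPT 70B | Llama 3.1 70B | Gemma 4 31B | Qwen3.6 27B |
|---|---|---|---|---|---|---|---|
| MMLU-ProX (es) | 5 | 11,759 | EM | 45.81 | 61.48 | **82.86** | 77.82 |
| MGSM native-CoT (es) | 8 | 250 | EM | 78.80 | 88.80 | **90.80** | 86.80 |
| EQ-Bench (es) | 0 | 168 | EQ-Bench | 68.95 | 75.35 | **78.06** | 77.70 |
| HeadQA (es) | 0 | 2,742 | Acc | 51.17 | 50.40 | 20.97 | **52.04** |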

EM = exact match, Acc = accuracy; EQ-Bench uses its own 0–100 scale. Bold marks the best score per row. LatamGPT lands below its Llama 3.1 base on every row except HeadQA—and well behind the two newer, smaller models (Gemma's HeadQA score aside; see the note at the end).
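The size of that tax is easier to see as a difference against the base model. The snippet below is a minimal sketch that recomputes the gaps from the scores in the table; the dictionary layout and the `delta_vs_base` helper are illustrative names, not part of any evaluation harness.

```python
# Scores copied from the table above: (LatamGPT 70B, Llama 3.1 70B base).
SCORES = {
    "MMLU-ProX (es)": (45.81, 61.48),
    "MGSM native-CoT (es)": (78.80, 88.80),
    "EQ-Bench (es)": (68.95, 75.35),
    "HeadQA (es)": (51.17, 50.40),
}


def delta_vs_base(scores):
    """Return {benchmark: adapted score minus base score}, rounded to 2 decimals."""
    return {
        name: round(adapted - base, 2)
        for name, (adapted, base) in scores.items()
    }


deltas = delta_vs_base(SCORES)
for name, delta in deltas.items():
    # Negative values mean the adapted model regressed relative to its base.
    print(f"{name:<22} {delta:+.2f}")
```

Three of the four deltas come out negative, and only HeadQA is slightly positive, which matches the reading given in the text.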

Where it pays off: local knowledge

Now the other side of the ledger. CHOCLO—another CENIA release from the same Latam-GPT effort, openly available on Hugging Face—is a 104,847-question benchmark of Latin American cultural knowledge (food, flora and fauna, geography, traditions, public figures) spanning the region's countries. CHOCLO ships a hybrid evaluation (lexical overlap, embedding similarity, and an LLM-as-judge); I report the LLM-as-judge part as a binary call—a single external judge decides, for each answer, whether it is semantically equivalent to the reference, and the score is simply the share judged equivalent. Because the same judge grades all four models, the numbers are directly comparable. Here, LatamGPT is the best model in the table: 39.5% of its answers are judged equivalent to the gold reference, edging out its base (37.6) and clearly ahead of the newer models (Gemma 4 31B at 31.3, Qwen3.6 27B at 27.1). And the lead is consistent across difficulty levels:

| CHOCLO: binary semantic equivalence (%) | LatamGPT 70B | Llama 3.1 70B | Gemma 4 31B | Qwen3.6 27B |
|---|---|---|---|---|
| Overall | 39.5 | 37.6 | 31.3 | 27.1 |
| Fácil (easy) | 59.8 | 57.1 | 48.4 | 45.1 |
| Intermedia (intermediate) | 33.9 | 32.3 | 24.3 | 20.7 |
| Difícil (hard) | 24.9 | 23.6 | 21.3 | 15.6 |

This is the result that justifies the effort. Local cultural knowledge does not fall out of scale or recency—Gemma and Qwen are newer and strong, and they still lose here, because the relevant facts about a Chilean dish or a Peruvian tradition were simply never in their training data. You cannot reason your way to knowledge you were never shown. The regional corpus put that knowledge in, and CHOCLO can see it. The honest summary of the two tables together: continued pre-training bought a real, measurable gain on exactly the thing it was built for—and paid for it with a broad regression everywhere else.
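For readers who want the binary score from this section spelled out, here is a minimal sketch of the aggregation: one judge call per answer, and the score is the share of calls that come back "equivalent". The `Judge` type, `toy_judge`, and `binary_equivalence_score` are hypothetical names, the judge is a trivial string check standing in for an LLM, and the four items are made up for illustration rather than drawn from CHOCLO. None of this is DeepEval's API.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical judge signature: (question, reference, answer) -> True when the
# answer is judged semantically equivalent to the reference.
Judge = Callable[[str, str, str], bool]


def binary_equivalence_score(
    items: Iterable[Tuple[str, str, str]], judge: Judge
) -> float:
    """Percentage of (question, reference, answer) triples judged equivalent."""
    verdicts = [judge(q, ref, ans) for q, ref, ans in items]
    if not verdicts:
        raise ValueError("no items to score")
    return 100.0 * sum(verdicts) / len(verdicts)


def toy_judge(question: str, reference: str, answer: str) -> bool:
    # Stand-in for an LLM judge: case-insensitive containment only.
    return reference.lower() in answer.lower()


# Made-up items for illustration only (not taken from CHOCLO).
toy_items = [
    ("Which grain is pastel de choclo made from?", "corn", "It is made from ground corn."),
    ("What is the capital of Peru?", "Lima", "The capital is Lima."),
    ("What is the capital of Chile?", "Santiago", "Valparaíso"),
    ("What is the capital of Ecuador?", "Quito", "I am not sure."),
]

score = binary_equivalence_score(toy_items, toy_judge)
print(f"{score:.1f}% judged equivalent")
```

Because the aggregation is just a mean of yes/no verdicts, scores for different models are comparable only when the same judge and the same items are used for all of them.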

The uncomfortable comparison

There is a harder fact lurking in the same chart. Gemma 4 31B and Qwen3.6 27B are less than half the size of LatamGPT and roughly two years newer, and they beat it on all three core academic benchmarks: by 8 to 12 points on MGSM and EQ-Bench, and by more than 30 on MMLU-ProX. This is the structural problem with adapting a base model: you inherit its ceiling. CPT on a 2024-era Llama cannot outrun pre-training recipes from 2026, no matter how good your regional data is. And the target keeps moving: in 2026 the open-weight frontier ships a major model every few weeks (one tally counted five frontier-class open-weight LLMs in a single month). By the time you finish adapting last year's base, this year's smaller model has lapped you on everything except the local knowledge you added by hand.

So what is the real contribution?

This is where I want to be a friendly critic, because I genuinely admire the ambition and know some of the people behind it. The results invite a strategic question rather than a verdict: what is the durable contribution here—the model, or the data? The weights will be superseded within months; that is just the cadence of the field now. But the regional corpus and a benchmark like CHOCLO are the kind of asset that does not expire. Which leads to three questions worth sitting with:

Does Latin America need to train a model, or to curate data? A regional foundation model is a multi-million-dollar bet that frontier labs match in a day of compute. High-quality, openly licensed cultural data is cheaper, durable, and, crucially, usable by any model that matters.

What is the data actually contributing? CHOCLO already shows it: the local knowledge it encodes is the one place LatamGPT wins. That signal is more valuable as a portable resource than as something frozen into one set of weights.

And is the real leverage in deployment rather than pre-training? The same curated knowledge can be injected through retrieval-augmented generation, in-context examples, or as a callable "cultural competence" skill for agentic systems, often beating brute-force pre-training on cost and freshness, and improving every model instead of just one.
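To make the retrieval route concrete, here is a toy sketch of the idea: pick the most relevant passage from a curated corpus and place it in the prompt, so that any model can use it. Everything here is an assumption for illustration: `CORPUS` is a three-line stand-in for real curated data, and `retrieve` uses plain word overlap where a real system would more likely use embeddings or a search index.

```python
import re
from typing import List

# Hypothetical mini-corpus standing in for curated regional data.
CORPUS = [
    "Pastel de choclo is a Chilean dish made with ground corn.",
    "Ceviche is a Peruvian dish of raw fish cured in citrus juice.",
    "Mate is an infusion widely drunk in Argentina, Uruguay and Paraguay.",
]


def tokenize(text: str) -> set:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(question: str, corpus: List[str], k: int = 1) -> List[str]:
    """Rank passages by word overlap with the question (toy lexical retrieval)."""
    q_tokens = tokenize(question)
    ranked = sorted(corpus, key=lambda p: len(q_tokens & tokenize(p)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, corpus: List[str], k: int = 1) -> str:
    """Prepend the top-k retrieved passages to the question."""
    context = "\n".join(f"- {p}" for p in retrieve(question, corpus, k))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


prompt = build_prompt("What is pastel de choclo made with?", CORPUS)
print(prompt)
```

The point of the design is that the knowledge lives in the corpus rather than in any one set of weights, so swapping in a newer model requires no retraining.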

None of this diminishes what LatamGPT accomplished: a large, multi-country collaboration that assembled permissioned regional data at scale and produced the model that best captures Latin American cultural knowledge today. That is a genuine first, and the data behind it is a real public good. My argument is narrower and, I hope, constructive—that the project's lasting value may live less in the 70B weights than in the corpus and evaluation it created. If models are the engines, data defines the terrain. The most strategically durable thing Latin America can do is make sure its terrain is deeply encoded—so that every model, regional or frontier, this year's or next year's, has no choice but to learn it.

A note on the numbers

All numbers come from two open evaluation frameworks: the Spanish academic benchmarks were run with EleutherAI's lm-evaluation-harness, and CHOCLO's LLM-as-judge scoring with DeepEval. I use MMLU-ProX (es) as the MMLU number because it is fully Spanish. HeadQA is included for completeness, but Gemma 4 31B's score there (≈21, below chance) is almost certainly an evaluation artifact rather than a real capability gap. CHOCLO is scored by one external judge (gpt-5.4-mini) applied identically to all four models, so the four columns are directly comparable.
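One way to see why a score of about 21 on HeadQA looks like an artifact is to compare it with random guessing. The sketch below assumes four answer options per question, so chance is 25%; that option count is an assumption of this example, not something stated above. It uses a normal approximation to the binomial, and `below_chance_z` is a hypothetical helper name.

```python
import math
from typing import Tuple


def below_chance_z(accuracy_pct: float, n: int, n_options: int) -> Tuple[float, float]:
    """Normal-approximation z-score and one-sided p-value for accuracy below chance."""
    chance = 1.0 / n_options              # expected accuracy of random guessing
    acc = accuracy_pct / 100.0
    se = math.sqrt(chance * (1.0 - chance) / n)   # standard error under chance
    z = (acc - chance) / se
    p_lower = 0.5 * math.erfc(-z / math.sqrt(2.0))  # P(Z <= z) for a standard normal
    return z, p_lower


# Gemma 4 31B on HeadQA (es): 20.97% accuracy on 2,742 questions.
# ASSUMPTION: four options per question, i.e. chance = 25%.
z, p = below_chance_z(20.97, 2742, n_options=4)
print(f"z = {z:.2f}, one-sided p = {p:.2e}")
```

Under that assumption the score sits several standard errors below chance (z near -4.9). A model that knew nothing would hover around 25%, so landing that far under it points to something systematic, such as answer extraction or label mapping, rather than missing knowledge.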

References