The Forest, Not the Trees: Why Coding Agents Might Be Localisation’s Missing Verification Layer

18 May

Localisation has a verification problem.

This isn’t a narrow translation-quality issue, and it isn’t only an AI translation issue. The industry does well at checking strings individually, but struggles to confirm that the whole translation still works as a system.

This is the kind of problem coding agents are especially good at solving.

Over the past year and a half, I’ve developed ways to use coding agents to read large amounts of text and spot issues like consistency drift, pattern breaks, and term instability: problems that only show up when you look at the big picture instead of just small sections. I originally built these methods to analyse human–LLM interaction logs. Recently, I realised they work just as well for localisation.

The penny that dropped

In my research, I gave coding agents large collections of text and asked them to keep the whole context in mind and report on what was happening across the entire document. They helped me spot where terms changed meaning, where tone shifted, and where consistency faded over time. You can’t see these patterns if you only read one section at a time.

Then, one morning on my way to work, I had an insight.

My role as a localisation manager has exactly the same problem.

Many companies now use AI to translate content, and human reviewers check the results. But in most workflows, reviewers look at one string or segment at a time, just as localisation QA has always done. The process focuses on details, not the bigger picture. And the few providers who offer system-level review usually charge extra for it.

But this isn’t only an AI translation problem. Translations are often produced in fragments, and even with the best effort there is a limit to how much a person can keep in mind at once.

Working-memory research suggests that people can hold only about three to five chunks of information in active memory at any given time. A model that can keep an entire file in its context has a clear advantage here.

The problem with string-level review

When a human reviewer opens a translation file in a TMS, they see one segment at a time: source on the left, target on the right. They check the obvious questions. Is this translation accurate? Does it sound natural? Does it match the glossary? Does it align with translation memory?

What they cannot do in that window is answer questions like:

Is the same word being used to translate three different navigation concepts across the product?
Are button labels consistently using the imperative form, or do some drift into nouns and gerunds?
Has the translator chosen the right sense of an ambiguous English word, or simply the most common dictionary sense?
Do placeholder strings work grammatically across all possible runtime values, or only some?

These are forest-level questions. They can only be answered by someone, or something, holding the whole translation file in mind at once. A reviewer who sees one segment at a time cannot detect that “Standings” has been translated as “Results” when “Results” is already used for the actual Results screen three tabs away. Both translations are individually defensible. The collision exists only at the system level.

I think of this kind of failure as structurally invisible to string-level review. It’s not that reviewers are careless; the way the job is set up just makes these patterns hard to spot.

What happens when you look at the forest

When you give a coding agent a translation file with thousands of source and target strings and ask it to read the whole thing as one document, some errors become much easier to spot.

Polysemy mistranslations happen often. English has many words with more than one meaning. For example, “tap” can mean a screen action or a water faucet. “Pen” might be a writing tool or short for penalty. “Forward” could be a direction or a football position. When AI translates each string without enough context, it usually picks the most common meaning, not the one that fits the product. Reading the whole file at once helps catch these mistakes, since the coding agent can see the context around every string. These are often the most serious errors: they’re obvious to users, semantically wrong, and hard to justify once found.

Term collisions occur when one word in the target language is used for several different source concepts. Each translation might look fine on its own, but if you look across the whole file, you find cases where one word stands in for three different things in the product. Users see the same label on different tabs, but it means something different each time. String-level tools miss this because they don’t look at the system as a whole.
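A collision check of this kind can be sketched in a few lines of Python. This is a minimal illustration, not the protocol I use with the agent: the function name, the sample pairs, and the assumption that the file has already been reduced to (source, target) pairs are all hypothetical.

```python
from collections import defaultdict


def find_term_collisions(pairs):
    """Return target strings that are used for more than one source string."""
    sources_by_target = defaultdict(set)
    for source, target in pairs:
        # Normalise case and whitespace so "Wyniki" and "wyniki " count as one label.
        sources_by_target[target.strip().casefold()].add(source.strip())
    # A collision is a target label shared by several distinct source concepts.
    return {t: sorted(s) for t, s in sources_by_target.items() if len(s) > 1}


# Hypothetical sample data, invented for illustration.
sample_pairs = [
    ("Standings", "Wyniki"),
    ("Results", "Wyniki"),
    ("Save", "Zapisz"),
    ("Settings", "Ustawienia"),
]

collisions = find_term_collisions(sample_pairs)
for target, sources in collisions.items():
    print(f"{target!r} is used for: {', '.join(sources)}")
```

An exact-match check like this only finds identical labels. Deciding whether a shared label is actually a problem in the product still needs the agent, or a person, to read the surrounding strings.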

Form-of-address drift is another common issue. Many languages have set ways of handling interactive UI. In Polish, for example, buttons usually use the perfective imperative, so “Save” becomes Zapisz. But in a large file, some strings might shift into noun forms, gerunds, or adjective-like phrases that are fine on their own but don’t work together. The real problem isn’t one bad segment; it’s the lack of a consistent call-to-action policy, which only shows up when you look at the whole set.
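As a rough illustration of checking a call-to-action policy across a whole set, the sketch below tallies button labels and lists those whose first word looks like a Polish verbal noun. The suffix list, the function names, and the sample keys are assumptions made for this example, and the suffix test is a crude heuristic that will misjudge some words.

```python
from collections import Counter

# Assumption: Polish verbal nouns commonly end in these suffixes
# (e.g. "zapisywanie", "usunięcie"). This is a heuristic, not a grammar.
VERBAL_NOUN_SUFFIXES = ("anie", "enie", "ęcie")


def classify_label(label):
    """Label the first word as 'verbal_noun' or 'other' using the suffix heuristic."""
    words = label.split()
    if not words:
        return "other"
    return "verbal_noun" if words[0].lower().endswith(VERBAL_NOUN_SUFFIXES) else "other"


def label_form_report(buttons):
    """Count label forms across all buttons and list the keys that break the pattern."""
    counts = Counter(classify_label(text) for text in buttons.values())
    outliers = [key for key, text in buttons.items()
                if classify_label(text) == "verbal_noun"]
    return counts, outliers


# Hypothetical button labels, invented for illustration.
sample_buttons = {
    "btn.save": "Zapisz",
    "btn.delete": "Usuń",
    "btn.close": "Zamykanie",
    "btn.share": "Udostępnij",
}

counts, outliers = label_form_report(sample_buttons)
print(counts, outliers)
```

The useful output is the tally rather than any single flag: three labels follow one pattern and one does not, which is exactly the kind of signal a segment-by-segment view hides.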

Systemic grammar gaps can cause similar problems. Some languages have distinctions that English doesn’t. For example, Polish uses three different forms for counted nouns: one for 1, another for numbers ending in 2–4 (except 12–14), and a third for everything else, including 0 and 5–21. So a phrase like “{0} hours ago” needs three different word forms, depending on the number. If the translation uses only one, it’s only right some of the time. Looking at the whole file makes it easier to spot all numeric placeholders and check if the grammar is handled correctly.
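The Polish rule for whole numbers can be written down directly. The sketch below follows the standard plural categories for Polish integers ("one", "few", "many"); the resource dictionaries and helper names are hypothetical and stand in for whatever plural format a real project uses.

```python
def polish_plural_category(n: int) -> str:
    """Plural category for a non-negative whole number in Polish."""
    if n == 1:
        return "one"
    # Numbers ending in 2-4 take the "few" form, except 12, 13 and 14.
    if n % 10 in (2, 3, 4) and n % 100 not in (12, 13, 14):
        return "few"
    return "many"


REQUIRED_FORMS = {"one", "few", "many"}


def missing_plural_forms(forms: dict) -> set:
    """Return the plural categories a translated resource fails to provide."""
    return REQUIRED_FORMS - set(forms)


# Hypothetical resources, invented for illustration.
hours_ago = {
    "one": "{0} godzinę temu",
    "few": "{0} godziny temu",
    "many": "{0} godzin temu",
}
hours_ago_incomplete = {"many": "{0} godzin temu"}


def format_hours_ago(n: int) -> str:
    """Pick the right form for n and fill in the placeholder."""
    return hours_ago[polish_plural_category(n)].format(n)


for n in (1, 2, 5, 12, 22):
    print(format_hours_ago(n))
print(missing_plural_forms(hours_ago_incomplete))
```

The values 12 and 22 are the ones worth testing by hand, because a simplified "2 to 4, then 5 and up" rule gets both of them wrong.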

Placeholder word-order conflicts are another type of problem. If a variable like {0} stands for a name that must be inflected for case in the target language, a string might work in English but not in the target language, because the value is inserted in its base form. This is partly a string-template design issue and partly a translation issue, but a coding agent can catch it by checking every placeholder pattern in the file and flagging where the structure causes grammar problems.
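One simple signal for this in Polish is a placeholder that directly follows a preposition, since a preposition requires the following noun to be inflected and an inserted name usually arrives uninflected. The sketch below is a heuristic under stated assumptions: placeholders are positional ({0}, {1}), the preposition list is partial, and the keys and strings are invented.

```python
import re

# Assumption: a partial list of common Polish prepositions, enough for a demo.
PREPOSITIONS = ("dla", "od", "do", "z", "bez", "przez", "o", "w", "na", "u", "przy")

# Matches a preposition followed by whitespace and a positional placeholder.
CASE_RISK = re.compile(
    r"\b(" + "|".join(PREPOSITIONS) + r")\s+(\{\d+\})", re.IGNORECASE
)


def flag_case_risks(strings):
    """List (key, preposition, placeholder) wherever a placeholder follows a preposition."""
    flagged = []
    for key, target in strings.items():
        for match in CASE_RISK.finditer(target):
            flagged.append((key, match.group(1).lower(), match.group(2)))
    return flagged


# Hypothetical target strings, invented for illustration.
sample_strings = {
    "invite.sent": "Zaproszenie dla {0} zostało wysłane",
    "greeting": "Witaj, {0}!",
    "message.from": "Wiadomość od {0}",
}

risks = flag_case_risks(sample_strings)
for key, preposition, placeholder in risks:
    print(f"{key}: {placeholder} follows '{preposition}' and may need an inflected value")
```

A flag here is a prompt for review, not a verdict. Whether the string is actually wrong depends on what the placeholder holds at runtime, which is why the template design and the translation have to be looked at together.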

These issues are obvious to anyone who works with languages, and none of them is a groundbreaking discovery. What we are missing, however, is a way to spot them and fix them at scale.

What this is, and what it isn’t

Let me be clear about what I’m suggesting, since the localisation industry is already using AI-powered QA and I don’t want to exaggerate how new this is.

Tools such as IntlPull, XTM, and Smartcat already use LLMs to evaluate translations for accuracy, glossary adherence, tone, and style. They flag low-confidence segments for human review. That matters, and it is already making string-level QA faster, cheaper, and more consistent.

What I have not seen is a real change in the altitude of review.

Most current tools just make the same process faster. But a coding agent that reads the whole file as one document does something different. It finds errors that existing tools can’t reliably detect, because these mistakes aren’t in individual strings; they’re in the relationships between strings.

You’re asking the model to do something fundamentally different: keep the whole translation in context, read it with understanding, and judge quality at the system level, not just the segment level.

This doesn’t replace human review. A coding agent can’t tell you if a translation feels just right for your brand voice in Brazilian Portuguese. It can’t make cultural judgement calls, and it will miss subtle register mismatches that a native speaker would notice right away.

What it can do is catch bigger-picture problems before a human reviewer even opens the file. It serves as a verification layer between machine translation and human review, answering a question that almost nobody in the current process is asking directly:

Does the translation still make sense when read as a whole?

The £20 question

Here is the best part.

You don’t need a new enterprise platform, API integration, procurement process, or a six-month rollout. All you need is a coding agent, which you can get for about £20 a month, and a clear prompt telling it what to look for. Later on, you could turn this into reusable protocols or agent skills that run with little input.

The method I’ve been testing came from a different field. I built these protocols to analyse human–LLM interaction logs and to track consistency, drift, and pattern stability across huge amounts of text. Adapting them for translation took surprisingly little change.

Text is text, and consistency is consistency. The patterns that break coherence in a conversation are a lot like the ones that break coherence in a translation file.

If you have a coding agent and a translation file you’re concerned about, you can try this tomorrow. Export the file, give it to the agent, and ask it to read the whole thing and point out where things don’t line up.
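If the export is a CSV, a small script can flatten it into a single document with the instructions at the top, so the agent reads everything in one pass. This is a sketch under assumptions: the column names (key, source, target), the sample rows, and the wording of the instructions are all hypothetical, and a real TMS export will look different.

```python
import csv
import io


def build_review_document(csv_text, instructions):
    """Flatten a CSV export into one text document, instructions first."""
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = [instructions, ""]
    for row in reader:
        # One entry per line, keeping the key so findings can be traced back.
        lines.append(f"[{row['key']}] {row['source']} => {row['target']}")
    return "\n".join(lines)


# Hypothetical export and instructions, invented for illustration.
sample_export = """key,source,target
tab.standings,Standings,Wyniki
tab.results,Results,Wyniki
btn.save,Save,Zapisz
"""

sample_instructions = (
    "Read every entry below as one document. Report where the same target term "
    "is used for different source concepts, where button labels change "
    "grammatical form, and where placeholders may break target-language grammar."
)

document = build_review_document(sample_export, sample_instructions)
print(document)
```

Keeping the string keys in each line matters more than it looks: the agent’s findings are only useful if a reviewer can trace each one back to a specific entry in the TMS.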

You might be surprised by what it finds.

What comes next

I’m still developing this approach. So far, I’ve tested it in my own time, using my own tools and publicly available translated content. It’s a proof of concept, not a finished product.

But the gap is real, and the idea works.

The localisation industry has spent years focusing on speed and cost. It hasn’t invested nearly as much in verification, especially the kind that means looking at the whole translated system at once.

The tools for this already exist. They cost less than your monthly coffee habit. And right now, the kinds of errors they can catch are still slipping through the cracks.

If you work in localisation and have been wondering if there’s a better way to catch AI quality issues, I believe there is.

And it might already be open in your browser.

Anna Wojewodzka