GenAI Systems Lab

Indic NLP: What Breaks When You Move Beyond English — BPE, Code-Switching, Transliteration, Low-Resource

BPE tokenization efficiency collapse for Devanagari. Code-switching requires per-token language ID, not sentence-level detection. Romanized Hindi (Hinglish) transliteration ambiguity — why training on raw data beats normalization. Low-resource transfer via script sharing. What Sarvam and Krutrim interviews actually probe.

AI systems built for English fail in specific, predictable ways on Indic languages. Sarvam, Krutrim, and AI teams at Flipkart, Swiggy, and PhonePe work on these problems daily. Understanding the failure modes — not just that they exist, but why and how to fix them — is what separates a candidate who has read about multilingual NLP from one who has worked on it.

Problem 1: BPE Tokenization Efficiency

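BPE itself is script-agnostic, but most production BPE vocabularies are learned on corpora dominated by Latin-script text, so most merges are spent on Latin subwords. For Indic scripts (Devanagari, Tamil, Telugu, Bengali, Malayalam), token efficiency is dramatically worse: every code point in these scripts takes three bytes in UTF-8, so a byte-level tokenizer with few Indic merges can spend several tokens on a single word. The same sentence then consumes more context window and more per-token cost than its English equivalent.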
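The byte-level cost can be checked with a few lines of Python. This is a minimal sketch using only the standard library: it measures UTF-8 bytes per character, which is the worst case for a byte-level tokenizer that has learned no merges for a script. The helper names and the sample sentences are illustrative assumptions, not part of any tokenizer's API.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average UTF-8 bytes per code point, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return sum(len(c.encode("utf-8")) for c in chars) / len(chars)


def worst_case_byte_tokens(text: str) -> int:
    """Token count if every UTF-8 byte became its own token (no merges)."""
    return len(text.encode("utf-8"))


english = "I want water"
hindi = "मुझे पानी चाहिए"  # roughly the same request, in Devanagari

print(utf8_bytes_per_char(english))  # 1.0
print(utf8_bytes_per_char(hindi))    # 3.0
print(worst_case_byte_tokens(english), worst_case_byte_tokens(hindi))
```

A real tokenizer sits somewhere between this worst case and the ideal; measuring tokens per word on your own data with the tokenizer you actually deploy is the number that matters.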

Problem 2: Code-Switching

Indian users routinely switch languages mid-sentence. 'Mujhe ek coffee chahiye with extra milk' mixes romanized Hindi and English in a single request. Standard NLP pipelines break because sentence-level language detection returns one label while the sentence contains both languages. Downstream steps need a language ID per token, not per sentence.
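Per-token language ID can be sketched as follows. This is a toy illustration, not a production tagger: `ROMAN_HINDI` is a hypothetical hand-written lexicon, and treating every other ASCII word as English is an assumption made for brevity. A real system would learn these labels from annotated code-switched data.

```python
# Hypothetical toy lexicon of romanized Hindi words (assumption for this sketch).
ROMAN_HINDI = {"mujhe", "ek", "chahiye", "pani", "hai", "nahi"}


def token_lang(token: str) -> str:
    """Label one whitespace token as 'hi', 'hi-Latn', 'en', or 'other'."""
    word = token.strip(".,!?'\"").lower()
    # Any code point in the Devanagari block marks the token as Hindi script.
    if any("\u0900" <= ch <= "\u097f" for ch in word):
        return "hi"
    if word in ROMAN_HINDI:
        return "hi-Latn"  # Hindi written in Latin script
    if word.isascii() and word.isalpha():
        return "en"       # fallback assumption: remaining ASCII words are English
    return "other"


def tag_tokens(sentence: str) -> list[tuple[str, str]]:
    """Split on whitespace and attach a language label to each token."""
    return [(tok, token_lang(tok)) for tok in sentence.split()]


for tok, lang in tag_tokens("Mujhe ek coffee chahiye with extra milk"):
    print(f"{tok:10s} {lang}")
```

The point of the sketch is the output shape: one label per token, so that a later stage can treat 'chahiye' and 'milk' differently within the same request.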

Problem 3: Transliteration Ambiguity

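Romanized Hindi (Hinglish), meaning Hindi words typed in Latin script, is ubiquitous on mobile. 'Mujhe pani chahiye', 'Muje paani chahiye', and 'Mujhe paanee chahiye' are all the same sentence. There is no standard spelling, so users write whatever feels natural. Forcing every variant into one canonical form is lossy and error-prone, which is why training on raw, unnormalized text tends to beat normalizing first.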
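To see how the spelling variants relate, a crude phonetic key can collapse them to one form. This is a minimal sketch under stated assumptions: the four rewrite rules below are hand-picked for this example and are not a standard romanization scheme. It also shows why such normalization is lossy, since vowel length and aspiration are thrown away.

```python
import re


def phonetic_key(word: str) -> str:
    """Collapse common romanized-Hindi spelling variants to one rough key."""
    w = word.lower()
    w = re.sub(r"ee", "i", w)                  # paanee -> paani
    w = re.sub(r"oo", "u", w)                  # long u written as 'oo'
    w = re.sub(r"(?<=[bcdgjkpt])h", "", w)     # drop aspiration marker: jh -> j
    w = re.sub(r"(.)\1+", r"\1", w)            # collapse doubled letters: aa -> a
    return w


def sentence_key(sentence: str) -> str:
    return " ".join(phonetic_key(w) for w in sentence.split())


variants = [
    "Mujhe pani chahiye",
    "Muje paani chahiye",
    "Mujhe paanee chahiye",
]
for v in variants:
    print(sentence_key(v))  # muje pani cahiye (for all three)
```

A key like this is more useful for clustering variants or building frequency priors than as a preprocessing step, because distinct words can collide once length and aspiration are removed.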

Problem 4: Low-Resource Languages Within India

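Hindi and Tamil have reasonable pretraining data. Santali, Bodo, Dogri, Konkani, and Manipuri have almost none. Transfer from the major Indian languages is uneven. Where a language shares a script with a high-resource one (Dogri, Konkani, and Bodo are commonly written in Devanagari, like Hindi), subword vocabulary and embeddings carry over to some degree. Where script, morphology, and vocabulary are largely disjoint (Santali written in Ol Chiki, Manipuri written in Meitei Mayek), models transfer poorly.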
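Script sharing can be checked mechanically from Unicode block ranges. The sketch below is illustrative: the block ranges come from the Unicode standard, while `PRIMARY_SCRIPT` is an assumed simplification, since several of these languages are written in more than one script in practice.

```python
# Unicode block ranges for the scripts discussed here.
SCRIPT_RANGES = {
    "Devanagari": (0x0900, 0x097F),
    "Bengali": (0x0980, 0x09FF),
    "Ol Chiki": (0x1C50, 0x1C7F),
    "Meetei Mayek": (0xABC0, 0xABFF),
}

# Assumed primary script per language (a simplification for this sketch).
PRIMARY_SCRIPT = {
    "Hindi": "Devanagari",
    "Dogri": "Devanagari",
    "Konkani": "Devanagari",
    "Bodo": "Devanagari",
    "Santali": "Ol Chiki",
    "Manipuri": "Meetei Mayek",
}


def script_of(ch: str) -> str:
    """Return the script block a single character falls in, or 'Other'."""
    cp = ord(ch)
    for name, (lo, hi) in SCRIPT_RANGES.items():
        if lo <= cp <= hi:
            return name
    return "Other"


def scripts_in(text: str) -> set[str]:
    """Set of scripts used by the non-whitespace characters of a text."""
    return {script_of(c) for c in text if not c.isspace()}


def shares_script(lang_a: str, lang_b: str) -> bool:
    return PRIMARY_SCRIPT[lang_a] == PRIMARY_SCRIPT[lang_b]


print(scripts_in("पानी"))                   # {'Devanagari'}
print(shares_script("Hindi", "Dogri"))      # True
print(shares_script("Hindi", "Santali"))    # False
```

Shared script only means shared characters and subwords. Bodo, for example, uses Devanagari but is not an Indo-Aryan language, so morphology and vocabulary still differ and transfer is partial at best.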

The Sarvam/Krutrim Interview Question

'We're building a voice assistant for tier-2 and tier-3 cities. 60% of users speak regional languages mixed with Hindi and English. What are the three hardest NLP problems, and how would you approach them?' A strong answer names three: (1) code-switching at the token level, handled with a multilingual tokenizer plus per-token language ID; (2) transliteration ambiguity, handled with phonetic mapping plus frequency priors over spelling variants, or by training directly on raw spellings where data allows; (3) ASR accuracy for accented speech, handled by fine-tuning on region-specific audio data. Then say which of the three you'd tackle first and why. Usually that is ASR quality, because transcription errors compound downstream.
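The compounding argument can be made concrete with a back-of-the-envelope model. This is a sketch under a strong assumption: each stage succeeds independently, so end-to-end success is the product of per-stage accuracies. All the numbers are hypothetical and chosen only for illustration.

```python
import math


def end_to_end_success(stage_accuracies: list[float]) -> float:
    """Probability that every stage succeeds, assuming independent stages."""
    return math.prod(stage_accuracies)


# Hypothetical per-stage accuracies: [ASR, language ID / normalization, intent]
baseline = end_to_end_success([0.80, 0.90, 0.90])
perfect_downstream = end_to_end_success([0.80, 1.00, 1.00])
better_asr = end_to_end_success([0.90, 0.90, 0.90])

print(round(baseline, 3))            # 0.648
print(round(perfect_downstream, 3))  # 0.8
print(round(better_asr, 3))          # 0.729
```

Under these assumed numbers, even flawless downstream NLP cannot push the pipeline above the ASR stage's own accuracy, which is the reasoning behind fixing transcription first.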
