
Indic NLP: What Breaks When You Move Beyond English — BPE, Code-Switching, Transliteration, Low-Resource

BPE tokenization efficiency collapse for Devanagari. Code-switching requires per-token language ID, not sentence-level detection. Romanized Hindi (Hinglish) transliteration ambiguity — why training on raw data beats normalization. Low-resource transfer via script sharing. What Sarvam and Krutrim interviews actually probe.


AI systems built for English fail in specific, predictable ways on Indic languages. Sarvam, Krutrim, and AI teams at Flipkart, Swiggy, and PhonePe work on these problems daily. Understanding the failure modes — not just that they exist, but why and how to fix them — is what separates a candidate who has read about multilingual NLP from one who has worked on it.

Problem 1: BPE Tokenization Efficiency

BPE itself is script-agnostic, but a vocabulary learned from predominantly English, Latin-script text spends most of its merges on Latin subwords. For Indic scripts (Devanagari, Tamil, Telugu, Bengali, Malayalam), few merges are learned, so the same content costs far more tokens.

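English: 'playing' → ['play', '##ing'], 2 tokens ('##' marks a word-internal piece, as in WordPiece-style output). Efficient.

Hindi: 'खेल रहा हूँ' ('I am playing') → ['ख', '##े', '##ल', 'र', '##ह', '##ा', 'ह', '##ू', '##ँ'], 9 tokens for 3 words. This is the illustrative worst case, where every code point becomes its own token. At 3 tokens per word, a 512-token context window holds ~170 Hindi words vs. ~380 English words.

Root cause: Devanagari vowel signs (matras) are separate Unicode code points that attach to a consonant. A tokenizer with few learned Devanagari merges splits them from their consonants, and a byte-level tokenizer can go further and split each code point into its three UTF-8 bytes. A vocabulary built on English text rarely contains common Devanagari n-grams.

Fix: a tokenizer trained on Indic corpora (the IndicBERT approach), character-level fallback for rare scripts, or a morphology-aware tokenizer that keeps consonant-matra units together.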
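The counts in the Hindi example can be checked with the Python standard library alone. The sketch below is not a tokenizer: it only measures the raw material a tokenizer starts from (code points and UTF-8 bytes) and groups each base character with its following combining marks as a rough stand-in for consonant-matra units. The function names are illustrative assumptions, and the grouping deliberately ignores conjuncts formed with the virama.

```python
import unicodedata

# "खेल रहा हूँ" written with escapes so the code points are unambiguous.
HINDI = "\u0916\u0947\u0932 \u0930\u0939\u093e \u0939\u0942\u0901"
ENGLISH = "playing"


def consonant_matra_units(text: str) -> list[str]:
    """Attach combining marks (Unicode categories Mn and Mc, which cover
    matras and the candrabindu) to the preceding base character.
    Assumption: conjuncts built with the virama are not merged."""
    units: list[str] = []
    for ch in text:
        if units and unicodedata.category(ch) in ("Mn", "Mc"):
            units[-1] += ch
        else:
            units.append(ch)
    return units


def describe(text: str) -> dict[str, int]:
    chars = text.replace(" ", "")
    return {
        "words": len(text.split()),
        "code_points": len(chars),
        "utf8_bytes": len(chars.encode("utf-8")),
        "units": len(consonant_matra_units(chars)),
    }


hindi_stats = describe(HINDI)
english_stats = describe(ENGLISH)
print(hindi_stats)    # words=3, code_points=9, utf8_bytes=27, units=5
print(english_stats)  # words=1, code_points=7, utf8_bytes=7, units=7

# Worst case from the text: one token per code point, i.e. 3 tokens per word.
tokens_per_word = hindi_stats["code_points"] / hindi_stats["words"]
hindi_words_in_window = int(512 / tokens_per_word)
print(hindi_words_in_window)  # 170
```

The gap between 9 code points and 5 consonant-matra units is the headroom a script-aware tokenizer can recover before it has learned any word-level merges. The 27-byte figure shows why a byte-level vocabulary with few Devanagari merges can be worse still.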

Problem 2: Code-Switching

Indian users switch between languages mid-sentence constantly. 'Mujhe ek coffee chahiye with extra milk' mixes Hindi and English in a single request. Standard NLP pipelines break here: sentence-level language detection returns a single label, but the sentence contains both languages.

Token-level language identification: each token needs a language label, not the sentence.

Embedding space: English-trained embeddings have no representation for Hindi tokens. Multilingual models (mBERT, XLM-R) handle this, but code-switching creates cross-lingual attention patterns the model may not have seen in pretraining.

Practical fix: fine-tune on code-switched data collected from your specific user population. General multilingual pretraining is necessary but not sufficient for the specific Hindi-English patterns in your domain.

Common trap: treating code-switched text as 'bad input' to clean. It's not bad input; it's how users actually communicate. The system must handle it.
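To make per-token labelling concrete, here is a deliberately small sketch. It assumes a hypothetical hand-written lexicon of Romanized Hindi words plus a Unicode-range check for Devanagari. A production tagger would more likely be a trained classifier over character n-grams or subword features, and the tag names (`hi`, `hi-Latn`, `en`) are illustrative choices rather than a standard required by any library.

```python
import re

# Hypothetical toy lexicon; a real system would learn this from labelled data.
ROMAN_HINDI = {"mujhe", "ek", "chahiye", "hai", "nahi", "kya", "pani"}

# Devanagari Unicode block. Assumption: Devanagari text is treated as Hindi,
# although the script is also used for Marathi, Dogri, Konkani and others.
DEVANAGARI = re.compile(r"[\u0900-\u097F]")


def tag_tokens(sentence: str) -> list[tuple[str, str]]:
    """Return (token, language tag) pairs, one per whitespace-separated token."""
    tagged = []
    for token in sentence.split():
        word = token.lower().strip(".,!?")
        if DEVANAGARI.search(word):
            lang = "hi"        # native-script Hindi
        elif word in ROMAN_HINDI:
            lang = "hi-Latn"   # Romanized Hindi
        else:
            lang = "en"        # fallback; a real tagger would also need "other"
        tagged.append((token, lang))
    return tagged


tagged = tag_tokens("Mujhe ek coffee chahiye with extra milk")
for token, lang in tagged:
    print(f"{token}\t{lang}")
```

A sentence-level detector would have to pick one label for this input. The per-token output keeps the three Hindi tokens and four English tokens distinct, which is what downstream components such as transliteration or intent parsing need.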

Problem 3: Transliteration Ambiguity

Romanized Hindi (often loosely called Hinglish), meaning Hindi words typed in Latin script, is ubiquitous on mobile. 'Mujhe pani chahiye' vs. 'Muje paani chahiye' vs. 'Mujhe paanee chahiye' are all the same sentence. There is no standard. Users use whatever spelling feels natural.

Detection: a sentence in Latin script might be English, Romanized Hindi, Romanized Tamil, or a mix. Character n-gram models trained per language can detect this at the token level.

Normalization: map Romanized forms to canonical Devanagari before processing. The mapping is many-to-one (many Romanized spellings → one Devanagari form). Phonetic similarity plus frequency-based priors work better than rule systems.

Model-level solution: train on transliterated text as-is, without normalization. If your users produce it, your model should handle it. Normalization is fragile and loses user signal.
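The normalization option (phonetic similarity plus frequency priors) can be sketched in a few lines. The phonetic key and the word counts below are assumptions made up for illustration, not a published scheme or real corpus statistics. The sketch also maps to the most frequent Romanized spelling rather than to Devanagari, because that last step would need a transliteration lexicon that is not shown here.

```python
import re
from collections import Counter


def phonetic_key(word: str) -> str:
    """Collapse common Romanization variants into one rough key.
    Assumed rules: long vowels shorten, 'h' after a stop is dropped,
    repeated letters collapse."""
    w = word.lower()
    w = w.replace("aa", "a").replace("ee", "i").replace("oo", "u")
    w = re.sub(r"(?<=[bdgjkpt])h", "", w)   # mujhe -> muje
    w = re.sub(r"(.)\1+", r"\1", w)         # collapse doubled letters
    return w


def build_canonical(counts: Counter) -> dict[str, str]:
    """Frequency prior: for each key, keep the most frequent spelling."""
    best: dict[str, str] = {}
    for word, n in counts.items():
        key = phonetic_key(word)
        if key not in best or n > counts[best[key]]:
            best[key] = word
    return best


def normalize(sentence: str, canonical: dict[str, str]) -> str:
    tokens = sentence.lower().split()
    return " ".join(canonical.get(phonetic_key(t), t) for t in tokens)


# Hypothetical counts, standing in for frequencies from real user logs.
counts = Counter({"mujhe": 120, "muje": 30, "pani": 80,
                  "paani": 95, "paanee": 5, "chahiye": 200})
canonical = build_canonical(counts)

variants = ["Mujhe pani chahiye", "Muje paani chahiye", "Mujhe paanee chahiye"]
normalized = {normalize(v, canonical) for v in variants}
print(normalized)  # {'mujhe paani chahiye'}
```

The fragility the text warns about shows up quickly: any rule aggressive enough to merge 'pani' and 'paanee' will also merge some genuinely different words, which is one reason training on the raw spellings is often the more robust choice.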

Problem 4: Low-Resource Languages Within India

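Hindi and Tamil have reasonable pretraining data. Santali, Bodo, Dogri, Konkani, and Manipuri have almost none. Models trained on the major Indian languages transfer unevenly to these. Santali (Austroasiatic, written mainly in Ol Chiki) and Bodo and Manipuri (both Tibeto-Burman) share little morphology or vocabulary with Hindi, whereas Dogri and Konkani are Indo-Aryan and commonly written in Devanagari, so they overlap with Hindi far more.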

Transfer learning approach: multilingual pretraining on all available Indic data, then zero-shot or few-shot transfer to low-resource languages. XLM-R covers some lower-resource Indic languages (for example Assamese and Odia), but check whether your target language is in its pretraining data at all.

Data augmentation: back-translation from a higher-resource language, or synthetic data generated by models fine-tuned on the high-resource language.

Script sharing: some low-resource Indic languages share a script with a high-resource one (e.g., Dogri uses Devanagari). Token overlap helps transferability.

Honest limitation: for languages with <10M sentences of training data, quality will be materially lower than for Hindi or Tamil regardless of architecture.
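A quick way to gauge how much a shared script can help is to measure character overlap between a high-resource sample and a target-language sample. The sketch below uses single words for 'water' as stand-ins: Hindi and Marathi (both Devanagari) and Bengali (a different script). Marathi is not low-resource. It is used here only as an assumed proxy for a Devanagari-script language such as Dogri, and the helper names are illustrative.

```python
import unicodedata
from collections import Counter

HINDI = "\u092a\u093e\u0928\u0940"    # पानी
MARATHI = "\u092a\u093e\u0923\u0940"  # पाणी
BENGALI = "\u09aa\u09be\u09a8\u09bf"  # পানি


def dominant_script(text: str) -> str:
    """Rough script label: the first word of each character's Unicode name,
    counted over letters and combining marks only."""
    scripts: Counter = Counter()
    for ch in text:
        if unicodedata.category(ch)[0] in ("L", "M"):
            scripts[unicodedata.name(ch, "UNKNOWN").split()[0]] += 1
    return scripts.most_common(1)[0][0] if scripts else "UNKNOWN"


def char_coverage(source: str, target: str) -> float:
    """Fraction of the target's non-space characters that also occur in source."""
    seen = set(source)
    chars = [ch for ch in target if not ch.isspace()]
    return sum(ch in seen for ch in chars) / len(chars) if chars else 0.0


print(dominant_script(HINDI), dominant_script(BENGALI))  # DEVANAGARI BENGALI
print(char_coverage(HINDI, MARATHI))  # 0.75
print(char_coverage(HINDI, BENGALI))  # 0.0
```

On real corpora the same measurement would be taken over subword vocabularies rather than single words, but the pattern holds: a shared script gives the tokenizer and embeddings something to reuse, while a different script starts from zero overlap even when the words are cognates.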

The Sarvam/Krutrim Interview Question

'We're building a voice assistant for tier-2 and tier-3 cities. 60% of users speak regional languages mixed with Hindi and English. What are the three hardest NLP problems, and how would you approach them?'

Strong answer: (1) code-switching at the token level: multilingual tokenizer plus language ID per token; (2) transliteration normalization: phonetic mapping plus frequency priors; (3) ASR accuracy for accented speech: fine-tune on region-specific audio data. Then say which of the three you'd tackle first and why (usually ASR quality, because transcription errors compound downstream).
