Roman Urdu Toxicity Filter
Roman Urdu is a code-switched language (Urdu written in Latin script, often mixed with English) with virtually no production-ready moderation tooling. This project fills that gap with a fine-tuned mBERT model that classifies Roman Urdu/Hindi toxicity, a hybrid three-method pipeline to pinpoint exactly which words are toxic, and a configurable three-tier content policy (ALLOW / SANITIZE / BLOCK). A Google Gemini LLM client wraps the filter so that not only are toxic prompts intercepted, but the model's own responses are verified before reaching the user. A Streamlit demo app makes the system interactive.
The Problem
Roman Urdu — Urdu written informally in Latin script and freely mixed with English — is used by over 200 million people online, yet it is almost entirely ignored by existing content moderation systems. Standard toxicity classifiers trained on English fail on transliterated text, and no production-ready solution existed for this code-switched language. The challenge was not just classification accuracy but explainability: identifying which specific words drove the toxicity score so they could be surgically redacted rather than blocking the entire message.
Key Engineering Decisions
mBERT over Monolingual Model
Roman Urdu is inherently code-switched — a single sentence can mix transliterated Urdu, Hindi, and English. mBERT's multilingual pre-training handles this naturally without requiring language detection or separate model routing.
Hybrid Three-Method Word Attribution
Attention weights alone are noisy and frequently highlight grammatical function words. The three-method cascade (lexicon → individual scoring → differential scoring) provides defence-in-depth: fast lookup for known slurs, isolation testing for unknowns, and delta analysis to discard false positives.
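The cascade can be sketched as follows, assuming a `score_toxicity(text)` callable that returns the fine-tuned model's toxicity probability; the function and parameter names here are illustrative, not the project's actual API:

```python
def identify_toxic_words(sentence, score_toxicity, lexicon,
                         word_threshold=0.5, delta_threshold=0.20):
    """Three-method cascade: lexicon -> isolation scoring -> differential."""
    words = sentence.split()
    base = score_toxicity(sentence)
    flagged = set()
    for word in words:
        # Method 1: O(1) lookup against the known hate lexicon
        if word.lower() in lexicon:
            flagged.add(word)
            continue
        # Method 2: score the word in isolation; skip clearly benign words
        if score_toxicity(word) < word_threshold:
            continue
        # Method 3: differential scoring -- removing the word must reduce
        # sentence toxicity by at least delta_threshold to count
        without = " ".join(w for w in words if w != word)
        if base - score_toxicity(without) >= delta_threshold:
            flagged.add(word)
    return flagged
```

Each stage only runs when the cheaper one before it fails to decide, so known slurs never pay the cost of re-scoring the sentence.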
Differential Scoring for False Positive Elimination
Computing p_toxic(sentence) - p_toxic(sentence_without_word) and requiring a ≥20% delta proved far more reliable than attention weight thresholds. Common words like 'kya' (what) score high attention in toxic sentences but contribute near-zero to the toxicity score.
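The delta itself is a two-line computation. In this toy demonstration a stub scorer stands in for the fine-tuned model, and "SLUR" is a placeholder token rather than a real word; only it drives the stub's toxicity probability, so the harmless filler 'kya' produces a near-zero delta:

```python
def toxicity_delta(sentence, word, score_toxicity):
    # p_toxic(sentence) - p_toxic(sentence_without_word)
    without = " ".join(w for w in sentence.split() if w != word)
    return score_toxicity(sentence) - score_toxicity(without)

# Stub standing in for the mBERT classifier.
def stub_score(text):
    return 0.92 if "SLUR" in text else 0.08

sentence = "kya SLUR hai"
toxicity_delta(sentence, "SLUR", stub_score)  # large delta -> flagged
toxicity_delta(sentence, "kya", stub_score)   # near zero -> ignored
```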
Three-Tier Policy over Binary Classification
Binary block/allow creates a terrible UX — partially toxic messages with one slur get blocked entirely. A SANITIZE tier that redacts only the toxic words inline preserves message meaning while enforcing the content policy.
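A sketch of the tier logic using the documented defaults (60% / 90%); `apply_policy` and its signature are illustrative, not the project's actual API:

```python
def apply_policy(text, toxic_words, p_toxic,
                 sanitize_threshold=0.60, block_threshold=0.90):
    """Return (tier, delivered_text) for one message."""
    if p_toxic >= block_threshold:
        return "BLOCK", None                      # too toxic to salvage
    if p_toxic >= sanitize_threshold:
        # Redact only the flagged words; the rest of the message survives
        redacted = " ".join("***" if w in toxic_words else w
                            for w in text.split())
        return "SANITIZE", redacted
    return "ALLOW", text
```

Because both thresholds are parameters rather than constants, a stricter deployment can simply pass lower values without touching the tier logic.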
LLM Output Verification
Filtering only the input is insufficient — LLMs can still produce toxic outputs when given borderline prompts. Running the model's response back through the toxicity filter before delivery closes this gap.
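The closed loop can be sketched as a wrapper around any `generate` callable — in the project that would be the Gemini client, but the names and the refusal message here are illustrative:

```python
def safe_generate(prompt, generate, score_toxicity, block_threshold=0.90,
                  refusal="[message blocked by content policy]"):
    # Input side: intercept toxic prompts before they reach the LLM
    if score_toxicity(prompt) >= block_threshold:
        return refusal
    response = generate(prompt)
    # Output side: verify the model's own text before it reaches the user
    if score_toxicity(response) >= block_threshold:
        return refusal
    return response
```

The same classifier guards both directions, so a borderline prompt that slips past the input check still cannot deliver a toxic completion.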
Configurable Thresholds, Not Hardcoded Rules
Exposing block_threshold and sanitize_threshold as parameters lets the system tune itself per deployment context — a children's platform needs stricter thresholds than a general-purpose assistant.
Key Highlights
- Fine-tuned mBERT (multilingual BERT) for Roman Urdu/Hindi toxicity classification — handles code-switching between Urdu, Roman Urdu, and English in a single pass.
- Hybrid three-method toxic word identification: lexicon match (O(1) lookup against 723-word hate lexicon), individual word scoring, and differential scoring (measuring each word's contribution to sentence toxicity).
- Differential scoring eliminates grammatical false positives — a word must reduce sentence toxicity by ≥20% when removed to be flagged, preventing common words like 'kya' or 'aur' from being misclassified.
- Three-tier content moderation policy: ALLOW (<60% toxic), SANITIZE (60–90% toxic, toxic words redacted inline), BLOCK (≥90% toxic).
- Automatic Urdu script detection and transliteration — seamlessly processes mixed-script input without manual preprocessing.
- Google Gemini LLM integration with bidirectional safety: toxic prompts are filtered before reaching the model, and generated responses are verified for toxicity before delivery.
- Batch processing optimisation — batch of 10 texts processed in ~300ms vs ~1500ms sequential; GPU acceleration (NVIDIA RTX) achieves ~50ms per text vs ~150ms on CPU.
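The batching gain above comes from amortising tokenisation and running one padded forward pass per chunk instead of one per text. A shape-only sketch, where `score_texts` stands in for that tokenizer-plus-model call (all names here are illustrative):

```python
def score_in_batches(texts, score_texts, batch_size=10):
    """Score many texts in fixed-size chunks rather than one call each.

    `score_texts` represents a single padded tokenizer + model forward
    pass over a whole chunk, which is where the speedup comes from
    (~300 ms per 10 texts batched vs ~1500 ms sequential).
    """
    scores = []
    for i in range(0, len(texts), batch_size):
        scores.extend(score_texts(texts[i:i + batch_size]))
    return scores
```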
Tech Stack
- Fine-tuned mBERT (multilingual BERT) classifier
- Google Gemini API (LLM client)
- Streamlit (interactive demo app)
- NVIDIA RTX GPU acceleration for inference
Key Takeaways
Attention weights are a useful starting point for word attribution but are not reliable toxicity signals on their own — they reflect syntactic importance, not semantic harm, and need a secondary scoring step to be actionable.
Differential scoring is computationally expensive for long texts but worth it: removing a word and re-scoring the sentence is the most principled way to measure its actual contribution to toxicity.
Code-switched text (Roman Urdu mixed with English) breaks almost every NLP assumption about tokenisation and vocabulary — multilingual models are the only practical path without building a custom tokeniser.
A three-tier content policy (allow/sanitize/block) reduces user-facing friction significantly compared to binary classification; surgical redaction preserves communication value while enforcing safety.
LLM safety pipelines need to close the loop — filtering inputs without verifying outputs leaves a meaningful attack surface open for adversarial prompts that elicit toxic completions.
Building a Streamlit demo early was invaluable for tuning thresholds — interactive sliders for block_threshold and sanitize_threshold made it immediately visible how policy changes affected real example inputs.