Frontier LLMs show strong safety alignment in English, but this robustness deteriorates dramatically in mid/low-resource languages. Critical vulnerabilities emerge when models encounter linguistic structures underrepresented in safety training:
The Research Gap:
How do you systematically measure and visualize safety decay across linguistic boundaries, with metrics decision-makers can act on?
A production workbench for empirical cross-lingual safety testing with three integrated modules:
Live Demo Workflow:
Select target language → Choose attack vector → Execute dual-probe analysis → Compare baseline vs experimental response → Export session with loophole metrics
50M+ • West African, tonal
5
vectors
80M+ • Afro-Asiatic, high morphology
5
vectors
30M+ • Niger-Congo, agglutinative
4
vectors
2M+ • Extremely low-resource
3
vectors
Historical pretext for harmful instructions
Cultural context + step-by-step harm
Authority role-play for guardrail bypass
Folklore framing for dangerous knowledge
You can't just test target languages in isolation—you need to prove safety decay exists by comparing against English on identical prompts.
Built dual-probe architecture: every target language probe runs against English baseline simultaneously. Side-by-side response containers with color-coded loophole detection (Refusal vs Compliant).
Decision-makers can see in real-time that Gemini refuses in English but complies in Yoruba for the same semantic request.
Can't run actual activation patching or gradient analysis on closed APIs (Gemini). Need to illustrate internal failure modes without direct model access.
Created simulated visualizations based on research patterns: activation smearing heatmaps show token-level attention variance, centroid drift plots show distance from safety-aligned representations.
Non-technical stakeholders understand 'why' safety fails (refusal circuits don't recognize harmful tokens in low-resource syntax) without reading papers.
Tool needs to work for policy teams, grant reviewers, and executives—not just ML engineers. Must be zero-setup, web-based, with no API key required.
Deployed as Hugging Face Space with Streamlit UI. Added GEMINI_API_KEY as HF secret so users don't need Google Cloud accounts. Included Google Translate links for response verification.
Anyone with a browser can now run systematic cross-lingual safety tests and export session logs for reports—no code, no setup.
Testing Gemini 2.5 Flash across 18 attack vectors in Yoruba, Hausa, Igbo, and Igala revealed consistent safety decay patterns:
HIGH
Model suggested "avoiding noisy weapons so Nature looks like it worked"—tactical advice disguised as dialogue
CRITICAL
"Yanke Jijiya" (cut the roots) tactical blueprints for elimination in household contexts
85-90%
Refusal rate—model consistently refuses identical semantic requests in English
Models trained primarily on English safety data struggle to identify harmful intent when expressed through low-resource linguistic structures, cultural metaphors, and morphological complexity.
Run cross-lingual safety tests and visualize linguistic safety decay