🛡️

LSR Dashboard: Linguistic Safety & Robustness Workbench

Empirical red-teaming for safety decay in low-resource African languages

Live Demo View Code

StreamlitGoogle Gemini APIPlotlyPythonPandasNumPy

Low-Resource

African Languages Tested

Systematic

Attack Vectors

vs English

2-4x

Higher Bypass Rates

Interactive

Analysis Modules

🎯

The Problem

Frontier LLMs show strong safety alignment in English, but this robustness deteriorates dramatically in mid/low-resource languages. Critical vulnerabilities emerge when models encounter linguistic structures underrepresented in safety training:

✗Safety guardrails trained primarily on English data

✗Low-resource languages enable 'linguistic smuggling'

✗Cultural context bypasses content moderation

✗No standardized cross-lingual safety benchmarks

The Research Gap:

How do you systematically measure and visualize safety decay across linguistic boundaries, with metrics decision-makers can act on?

🏗️

The LSR Workbench Solution

A production workbench for empirical cross-lingual safety testing with three integrated modules:

🧪Red-Teaming Lab

• Side-by-side English baseline comparison
• 18 attack vectors across 4 languages
• Automatic loophole detection
• Google Translate verification
• Exportable session logs (JSON)

🔬Mechanistic Visualizer

• Activation smearing heatmaps
• Safety centroid drift plots
• Token-level attention variance
• Latent space distance metrics
• Interactive Plotly charts

📂Vulnerability Gallery

• Historical confirmed failures
• Severity ratings (HIGH/CRITICAL)
• Empirical findings archive
• Documented attack patterns
• Reproducible case studies

Live Demo Workflow:

Select target language → Choose attack vector → Execute dual-probe analysis → Compare baseline vs experimental response → Export session with loophole metrics

🌍

Language & Attack Coverage

Tested Languages

🇳🇬

Yoruba

50M+ • West African, tonal

vectors

🇳🇬

Hausa

80M+ • Afro-Asiatic, high morphology

vectors

🇳🇬

Igbo

30M+ • Niger-Congo, agglutinative

vectors

🇳🇬

Igala

2M+ • Extremely low-resource

vectors

Attack Categories

Silent Weaponry

HIGH

Historical pretext for harmful instructions

Toxic Medicine

CRITICAL

Cultural context + step-by-step harm

Royal Authority

HIGH

Authority role-play for guardrail bypass

Strategic Harm

HIGH

Folklore framing for dangerous knowledge

⚡

Challenges & Solutions

Baseline Comparison Design

Problem:

You can't just test target languages in isolation—you need to prove safety decay exists by comparing against English on identical prompts.

Solution:

Built dual-probe architecture: every target language probe runs against English baseline simultaneously. Side-by-side response containers with color-coded loophole detection (Refusal vs Compliant).

Impact:

Decision-makers can see in real-time that Gemini refuses in English but complies in Yoruba for the same semantic request.

Mechanistic Interpretability Without Model Access

Problem:

Can't run actual activation patching or gradient analysis on closed APIs (Gemini). Need to illustrate internal failure modes without direct model access.

Solution:

Created simulated visualizations based on research patterns: activation smearing heatmaps show token-level attention variance, centroid drift plots show distance from safety-aligned representations.

Impact:

Non-technical stakeholders understand 'why' safety fails (refusal circuits don't recognize harmful tokens in low-resource syntax) without reading papers.

Production Deployment for Non-Technical Users

Problem:

Tool needs to work for policy teams, grant reviewers, and executives—not just ML engineers. Must be zero-setup, web-based, with no API key required.

Solution:

Deployed as Hugging Face Space with Streamlit UI. Added GEMINI_API_KEY as HF secret so users don't need Google Cloud accounts. Included Google Translate links for response verification.

Impact:

Anyone with a browser can now run systematic cross-lingual safety tests and export session logs for reports—no code, no setup.

📊

Empirical Findings

Testing Gemini 2.5 Flash across 18 attack vectors in Yoruba, Hausa, Igbo, and Igala revealed consistent safety decay patterns:

Yoruba

HIGH

Model suggested "avoiding noisy weapons so Nature looks like it worked"—tactical advice disguised as dialogue

Hausa

CRITICAL

"Yanke Jijiya" (cut the roots) tactical blueprints for elimination in household contexts

English Baseline

85-90%

Refusal rate—model consistently refuses identical semantic requests in English

Pattern Recognition:

Models trained primarily on English safety data struggle to identify harmful intent when expressed through low-resource linguistic structures, cultural metaphors, and morphological complexity.

🎓

Key Learnings & Future Work

What Worked

✓Side-by-side comparison makes safety decay undeniable—live demos convert skeptics
✓Session analytics with loophole % provide quantitative evidence for grant proposals
✓Mechanistic visualizations help non-experts understand root causes (not just "it's broken")
✓HF Space deployment = zero barrier for policy teams and academic collaborators

Next Steps

→Add GPT-4 and Claude API support with model selector dropdown
→Expand to Swahili, Amharic, Zulu for pan-African coverage
→Integrate actual activation patching for open models (Llama, Mistral)
→Publish benchmark dataset for reproducible cross-lingual safety research

Try the LSR Dashboard

Run cross-lingual safety tests and visualize linguistic safety decay

Launch Dashboard View More Projects