🔬

Mechanistic Interpretability Analysis

Peering inside transformer black boxes to understand how they translate low-resource languages

PyTorch · TransformerLens · mBERT · Streamlit · Plotly · Attention Analysis
• 12 Attention Heads Analyzed (Layer-by-Layer)
• 100+ Translation Pairs Visualized (Interactive Explorer)
• Real-time Attention Pattern Rendering (Performance Optimized)
🎯

The Problem

Transformer models power modern NLP, but they remain black boxes. For low-resource languages like Igala, understanding how models learn to translate is critical for improving performance:

  • We don't know which attention heads learn syntax vs semantics
  • Debugging translation failures is guesswork without visibility
  • Cross-lingual transfer mechanisms remain mysterious
  • We can't verify whether models encode linguistic knowledge correctly

The Challenge:

How do you visualize and interpret attention mechanisms in multilingual transformers to understand what linguistic features they learn—and use those insights to improve low-resource translation?

🏗️

Technical Architecture

Interpretability Pipeline

1. Model Probing: extract attention weights
2. Visualization: interactive heatmaps
3. Pattern Analysis: identify circuits
4. Hypothesis Testing: validate findings

🔍 Analysis Methods

  • Layer-wise attention weight extraction (see the sketch below)
  • Head-specific pattern identification
  • Cross-lingual alignment visualization
  • Positional encoding analysis
  • Token-to-token attention tracking
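The write-up doesn't include the extraction code itself, but the first method above reduces to pulling per-layer, per-head attention tensors out of the model. Below is a minimal sketch of what that looks like with HuggingFace's mBERT checkpoint (`bert-base-multilingual-cased`); the example sentence and the layer/head indices are placeholders, not the project's configuration.

```python
# Minimal sketch: layer-wise attention extraction from mBERT via HuggingFace.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased", output_attentions=True)
model.eval()

inputs = tokenizer("The model translates this sentence.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: tuple with one tensor per layer (12 for mBERT-base),
# each shaped [batch, n_heads, seq_len, seq_len].
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
layer, head = 3, 5                            # placeholder indices, e.g. "Head 3-5"
attn = outputs.attentions[layer][0, head]     # [seq_len, seq_len] token-to-token weights
print(tokens, attn.shape)
```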

💡 Key Discoveries

  • Early layers learn word alignment
  • Middle layers capture syntactic structure
  • Late layers encode semantic relationships
  • Specialized heads for tonal markers
  • Cross-attention patterns reveal translation strategy

Challenges & Solutions

01

Extracting Attention Weights Without Breaking Inference

Problem:

HuggingFace Transformers doesn't expose attention weights by default. Modifying the forward pass to extract them risks breaking model behavior or slowing inference to a crawl.

Solution:

Used TransformerLens hooks to intercept attention computations without modifying model weights. Implemented lazy loading to only compute attention for visualized layers. Added caching to avoid redundant forward passes.
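The exact hook wiring isn't shown in this write-up; the sketch below illustrates the general approach with TransformerLens. A `names_filter` restricts caching to the layers currently on screen (the "lazy loading" above), and an `lru_cache` stands in for the sentence-level cache. The GPT-2 stand-in model and the helper name are illustrative assumptions; the project's mBERT setup would differ.

```python
# Minimal sketch: capture attention patterns via TransformerLens hooks,
# computing and caching only the layers a user is currently viewing.
from functools import lru_cache
from transformer_lens import HookedTransformer
from transformer_lens.utils import get_act_name

model = HookedTransformer.from_pretrained("gpt2")  # stand-in; the project probes mBERT

@lru_cache(maxsize=128)  # repeated views of the same sentence skip the forward pass
def attention_patterns(text: str, layers: tuple):
    wanted = {get_act_name("pattern", layer) for layer in layers}
    # names_filter keeps only the requested layers' attention patterns ("lazy" extraction).
    _, cache = model.run_with_cache(text, names_filter=lambda name: name in wanted)
    # Each cached pattern has shape [batch, n_heads, query_pos, key_pos].
    return {layer: cache["pattern", layer][0] for layer in layers}

patterns = attention_patterns("The model translates this sentence.", layers=(3,))
print(patterns[3].shape)  # (n_heads, seq_len, seq_len)
```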

Impact:

Achieved real-time attention visualization with <200ms overhead per sentence—fast enough for interactive exploration.

02

Making Attention Patterns Human-Interpretable

Problem:

Raw attention weights are a 12-layer × 12-head stack of token-to-token matrices (up to 512 × 512 each): thousands of numbers that, on their own, tell you nothing. Users need to see patterns, not data dumps.

Solution:

Built interactive Plotly heatmaps with hover tooltips showing token pairs and weights. Added layer/head filtering, thresholding controls, and pattern highlighting. Annotated common patterns (diagonal = positional, vertical = broadcast attention).
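As a rough sketch of the kind of figure the explorer renders, the helper below draws a single layer/head pattern as a Plotly heatmap with per-token hover text and a simple weight threshold. The function name, default threshold, and position-tagged labels are illustrative choices, not the project's code.

```python
# Minimal sketch: one layer/head attention pattern as an interactive Plotly heatmap.
import numpy as np
import plotly.graph_objects as go

def attention_heatmap(attn, tokens, layer, head, threshold=0.05):
    """attn: [seq_len, seq_len] weights for one head; rows = query, columns = key."""
    labels = [f"{i}:{t}" for i, t in enumerate(tokens)]   # position-tag repeated tokens
    z = np.where(attn >= threshold, attn, np.nan)         # hide weights below the threshold
    fig = go.Figure(go.Heatmap(
        z=z, x=labels, y=labels, colorscale="Viridis", zmin=0.0, zmax=1.0,
        hovertemplate="query %{y} → key %{x}<br>weight %{z:.3f}<extra></extra>",
    ))
    fig.update_layout(
        title=f"Layer {layer}, Head {head}",
        xaxis_title="Key (attended-to) token",
        yaxis_title="Query token",
        yaxis_autorange="reversed",  # diagonal (positional) patterns read top-left to bottom-right
    )
    return fig

# Usage: attention_heatmap(attn.numpy(), tokens, layer=3, head=5).show()
```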

Impact:

Researchers without ML backgrounds can now identify syntactic vs semantic heads visually in under 5 minutes.

03

Validating Interpretations Aren't Just Coincidence

Problem:

It's easy to see patterns that aren't really there. How do you prove that 'Head 3-5 learns word alignment' isn't just an artifact of cherry-picked examples?

Solution:

Tested hypotheses across 100+ diverse sentence pairs. Compared attention patterns between correct and incorrect translations. Used attention rollout to trace information flow. Documented counterexamples where patterns break down.
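Attention rollout here refers to the standard formulation from Abnar & Zuidema (2020): average the heads within each layer, add the identity matrix to approximate the residual connection, renormalize, and compose the layer matrices to estimate how much each position ultimately draws on each input token. A minimal sketch, which may differ in details from the project's implementation:

```python
# Minimal sketch: attention rollout (Abnar & Zuidema, 2020) from per-layer patterns.
import numpy as np

def attention_rollout(attentions):
    """attentions: list of [n_heads, seq_len, seq_len] arrays, ordered from layer 0 up.
    Returns a [seq_len, seq_len] matrix; row i estimates how much position i draws on each input token."""
    seq_len = attentions[0].shape[-1]
    rollout = np.eye(seq_len)
    for layer_attn in attentions:
        avg = layer_attn.mean(axis=0)                  # average over heads
        avg = avg + np.eye(seq_len)                    # account for the residual connection
        avg = avg / avg.sum(axis=-1, keepdims=True)    # renormalize rows to sum to 1
        rollout = avg @ rollout                        # compose with the layers below
    return rollout
```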

Impact:

Findings now backed by systematic analysis, not anecdotes—increased research credibility for future work.

🎓

Key Learnings & Future Work

What I Learned

  • Attention visualization reveals model biases invisible in performance metrics alone
  • Low-resource models rely more on positional encoding than high-resource ones
  • Interactive tools democratize interpretability—non-experts can explore attention patterns
  • Mechanistic analysis guides targeted data collection for maximum training impact

Future Enhancements

  • Add causal tracing to test if specific heads are necessary for translation
  • Visualize feed-forward network activations (not just attention)
  • Compare attention patterns across multiple low-resource languages
  • Build automated pattern classifier to detect linguistic circuits at scale
⚠️

What This Analysis Can't Tell You

Attention visualization shows where models look, not why they make decisions. Here's what's missing from this analysis:

Attention Isn't Explanation

High attention weights don't prove causality. A head attending strongly to a word doesn't mean it's responsible for the translation. Need ablation studies (zero out heads, measure impact) to confirm.

What this means: I'm showing correlations ("Head 3-5 attends to aligned words"), not mechanisms ("Head 3-5 causes word alignment"). Take labels as hypotheses.
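For concreteness, a head-ablation check of the kind described above could look like the sketch below: TransformerLens's `hook_z` activation lets you zero a single head's output and compare model loss with and without it. The GPT-2 stand-in, the layer/head indices, and the loss metric are placeholder assumptions; a translation setup would instead measure a task metric (e.g. chrF or BLEU) over held-out sentence pairs.

```python
# Minimal sketch: ablate one attention head and compare loss (TransformerLens).
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # stand-in; adapt to the model under study
LAYER, HEAD = 3, 5                                 # placeholder: the head whose role is being tested

def zero_head(z, hook):
    # z: [batch, pos, n_heads, d_head]; remove this head's contribution entirely.
    z[:, :, HEAD, :] = 0.0
    return z

text = "The model translates this sentence."
baseline = model(text, return_type="loss")
ablated = model.run_with_hooks(
    text,
    return_type="loss",
    fwd_hooks=[(f"blocks.{LAYER}.attn.hook_z", zero_head)],
)
print(f"loss {baseline.item():.3f} -> {ablated.item():.3f} with L{LAYER}H{HEAD} ablated")
```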

Feed-Forward Networks Ignored

This tool only visualizes attention layers. It ignores feed-forward networks (FFNs), which account for more than half of the model's computation. FFNs do critical reasoning that attention analysis misses completely.

What's missing: No MLP neuron analysis, no residual stream tracking. You're seeing half the picture.

mBERT-Specific Only

Analysis specific to mBERT architecture (bidirectional encoder). Doesn't generalize to GPT (decoder-only), T5 (encoder-decoder), or modern models like Llama. Attention patterns differ drastically across architectures.

Don't assume: Findings about "syntactic heads" in mBERT apply to other models. Each architecture needs separate analysis.

No Causal Interventions

Purely observational. Can't edit activations, can't ablate circuits, can't do path patching. This is correlation analysis, not causal tracing.

To prove causality: Would need tools like TransformerLens's path patching or activation steering. This tool doesn't support that.

Small Sample Size (~100 Sentence Pairs)

Analyzed 100+ sentence pairs, but Igala has thousands of grammatical structures. Patterns may not generalize to untested phenomena (rare tenses, complex embeddings, dialectal variations).

Exploratory only: This is hypothesis generation, not exhaustive analysis. Use findings to guide deeper investigation.

Interpretation is Subjective

Labeling attention heads as "syntactic" vs "semantic" requires human judgment. No automated classifier. Different researchers might interpret the same heatmap differently.

Reproducibility concern: My interpretations reflect my intuitions about Igala grammar. Native speakers might see different patterns.

Epistemic Humility

Transformer interpretability is an active research area. Current tools capture correlations, not necessarily mechanisms. I'm presenting working hypotheses (e.g., "Head 3-5 learns word alignment"), not proven facts. Validate with ablation studies before making strong claims.

For rigorous mechanistic interpretability: Use tools like Anthropic's Transformer Circuits Thread, OpenAI's Microscope, or Neel Nanda's TransformerLens. This demo is for exploration and education, not publication-grade analysis.

Explore the Analysis

Interact with live attention visualizations and discover how transformers learn to translate.