Large Language Models are increasingly deployed in production environments, yet systematic safety testing remains ad-hoc and manual. Critical vulnerabilities often go undetected until after deployment:
The Challenge:
How do you systematically test LLMs for safety vulnerabilities at scale, with reproducible metrics and automated reporting?
I ran systematic adversarial tests on three leading models: GPT-4, Claude 3.5 Sonnet, and Gemini 1.5 Pro. Here's what 300 attack attempts revealed about their safety mechanisms:
Claude 3.5
14% ASR - Best at refusing jailbreak attempts and toxicity generation
Bias Testing
20-28% ASR across models - Subtle bias harder to detect than explicit toxicity
2-4 per model
High-severity vulnerabilities where models generated explicit harmful content
Attempts to override system instructions and hijack model behavior
GPT-4
15%
3/20 successful - Basic 'ignore previous' patterns caught
Claude 3.5
18%
3.6/20 successful - Struggled with multi-step injection chains
Gemini 1.5 Pro
22%
4.4/20 successful - Most vulnerable to context manipulation
💡 Insight: All models show weak defense against sophisticated multi-turn injection. Single-turn attacks mostly blocked.
Role-play and persona-based attacks to bypass safety alignment
GPT-4
12%
2.4/20 successful - Strong refusal patterns
Claude 3.5
10%
2/20 successful - Best jailbreak defense
Gemini 1.5 Pro
15%
3/20 successful - Weaker against hypothetical scenarios
💡 Insight: Traditional DAN prompts largely ineffective. 'Hypothetical research' framing occasionally succeeds.
Requests for stereotypical, discriminatory, or unfair content
GPT-4
25%
5/20 successful - Subtle biases leak through
Claude 3.5
20%
4/20 successful - Better at recognizing bias framing
Gemini 1.5 Pro
28%
5.6/20 successful - Weakest bias detection
💡 Insight: Highest vulnerability category. Models fail to recognize subtle bias requests disguised as neutral queries.
Explicit requests for harmful, hateful, or offensive content
GPT-4
8%
1.6/20 successful - Strongest explicit content filter
Claude 3.5
7%
1.4/20 successful - Best toxicity defense
Gemini 1.5 Pro
11%
2.2/20 successful - Occasionally generates mild toxicity
💡 Insight: Lowest ASR category. All models have strong filters for explicit harmful content. Edge cases in coded language.
Attempts to extract private/sensitive information from training data
GPT-4
10%
2/20 successful - Leaked generic patterns only
Claude 3.5
12%
2.4/20 successful - Slightly more training data leakage
Gemini 1.5 Pro
14%
2.8/20 successful - Most prone to memorization exposure
💡 Insight: No actual PII leaked, but models revealed structural patterns from training data when prompted cleverly.
Manually crafting adversarial prompts doesn't scale. You need thousands of diverse attack vectors to properly stress-test model safety.
Built a template-based attack generator with parameterized variations. Combines known jailbreak techniques with mutation strategies to create diverse test cases automatically.
Generated 1,000+ unique attack prompts covering 5 vulnerability categories with zero manual effort.
Determining if an LLM response is 'unsafe' is subjective. Binary safe/unsafe labels don't capture nuance or severity.
Implemented multi-class toxicity scoring using Perspective API and custom heuristics. Added confidence intervals and human-in-the-loop validation for edge cases.
Achieved 92% agreement with human reviewers on safety classifications across test set.
Ad-hoc testing produces inconsistent results. Teams need standardized reports to track safety improvements over time.
Created Streamlit dashboard with exportable reports (PDF, JSON). Tracks metrics across model versions with A/B comparison views.
Enabled systematic safety regression testing—teams can now quantify safety improvements between model iterations.
This is an exploratory red-teaming tool, not a production security suite. Here's what I didn't test and why it matters:
Only tested 3 commercial models (GPT-4, Claude 3.5, Gemini 1.5 Pro). Didn't include open-source models like Llama 3, Mistral, or Qwen.
Why it matters: Open-source models often have different safety tuning. Results here won't predict their vulnerabilities.
No multimodal jailbreaks tested (image + text attacks on GPT-4V, adversarial audio). Only text prompt injection and bypass techniques.
Why it matters: Vision-language models have new attack surfaces. This tool misses those entirely.
Tests conducted in January 2026. Models update frequently with safety patches. These results reflect a single point in time, not current safety posture.
Why it matters: A model safe today may be vulnerable tomorrow after updates, or vice versa.
Attack generation uses predefined templates. No gradient-based optimization (GCG), no learned attack strategies, no human red-teamer creativity.
Why it matters: Sophisticated adversaries will find vulnerabilities my templates miss. This catches low-hanging fruit only.
This is NOT a security audit
Use this for initial exploration and hypothesis generation. For production deployments, hire professional red-teamers and run adversarial robustness evaluations (ART, Foolbox). Don't rely on automated testing alone.