Adversarial Machine Learning: How Attackers Fool AI Models (With Python Examples)
What Is Adversarial Machine Learning?
Adversarial machine learning is the study of attacks against ML models and the defenses that protect them. While traditional cybersecurity focuses on networks, servers, and applications, adversarial ML targets the models themselves — the neural networks, classifiers, and decision systems that increasingly power everything from fraud detection to autonomous driving.
The core insight is deceptively simple: machine learning models learn patterns from data, and those patterns can be manipulated. An attacker who understands how a model processes inputs can craft malicious data that causes the model to behave in unintended ways — misclassifying images, bypassing spam filters, or evading malware detection.
This field sits at the intersection of cybersecurity and AI. If you've read our articles on AI Security or LLM Red Teaming, adversarial ML is the mathematical foundation underneath. Here, we go deeper — into the actual techniques, the code, and the math that makes these attacks work.
Why Should Security Professionals Care?
| Scenario | Attack | Impact |
|---|---|---|
| Self-driving car | Modified stop sign fools vision model | Vehicle runs a stop sign at full speed |
| Email spam filter | Carefully worded phishing email bypasses classifier | Credential theft at scale |
| Malware detection | Adversarial bytes appended to malware binary | Malware classified as benign |
| Facial recognition | Adversarial glasses or makeup patterns | Identity evasion or impersonation |
| LLM-powered code review | Obfuscated vulnerable code bypasses AI reviewer | Vulnerabilities shipped to production |
As ML models are deployed in security-critical systems, understanding how to attack (and defend) them becomes a core security skill, not a niche academic topic.
How Neural Networks "See" — The 5-Minute Primer
Before we can attack a model, we need to understand what we're attacking. A neural network is a function f(x) → y that maps an input x (an image, text, network packet) to an output y (a class label, probability, decision).
The Forward Pass
An image classifier works like this:
- Input layer — receives raw pixel values (e.g., a 224×224 RGB image = 150,528 numbers)
- Hidden layers — apply learned transformations (convolutions, activations, pooling) that extract increasingly abstract features: edges → textures → shapes → objects
- Output layer — produces a probability distribution across classes via softmax:
[cat: 0.92, dog: 0.05, bird: 0.03]
The model's "knowledge" lives in its weights — millions of numerical parameters learned during training. These weights define the decision boundary: the mathematical surface in high-dimensional space that separates "cat" from "dog."
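That softmax step is simple enough to sketch directly (a minimal numpy example; the logits here are made up to roughly reproduce the distribution above, not taken from a real model):

```python
import numpy as np

def softmax(logits):
    """Convert raw scores (logits) into a probability distribution."""
    exp = np.exp(logits - logits.max())  # subtract max for numerical stability
    return exp / exp.sum()

# Hypothetical output-layer logits for [cat, dog, bird]
logits = np.array([4.1, 1.2, 0.7])
probs = softmax(logits)
print(dict(zip(["cat", "dog", "bird"], probs.round(3))))
# roughly [cat: 0.92, dog: 0.05, bird: 0.03]
```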
The Key Vulnerability
Neural networks are differentiable. Given an input and a loss function, we can compute the gradient — a vector that tells us exactly how to change each input pixel to maximally increase the model's error. This is the same gradient used in training (backpropagation), but applied to the input instead of the weights.
# The attacker's insight in one line:
# Training: adjust WEIGHTS to minimize loss
# Attacking: adjust INPUT to maximize loss
# Both use the same gradient computation — just applied differently.
This is why adversarial attacks are so effective: they exploit the fundamental mechanism that makes neural networks trainable in the first place.
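The input-gradient idea can be shown on a toy logistic "model" with frozen, made-up weights (a sketch of the principle on three numbers, not an attack on a real network):

```python
import numpy as np

# Toy "model": logistic regression with frozen (made-up) weights
w = np.array([1.5, -2.0, 0.5])
b = 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.2, 0.4, 0.6])  # input
y = 1.0                        # true label

# Forward pass: p = sigmoid(w.x + b), loss = binary cross-entropy
p = sigmoid(w @ x + b)

# Gradient of the loss w.r.t. the INPUT (not the weights):
# dL/dx = (p - y) * w, the same gradient machinery used in training
grad_x = (p - y) * w

# One FGSM-style step on the toy input
epsilon = 0.1
x_adv = x + epsilon * np.sign(grad_x)
p_adv = sigmoid(w @ x_adv + b)
print("gradient w.r.t. input:", grad_x)
print("confidence before/after:", round(p, 3), round(p_adv, 3))
```

After the step, the model's confidence in the true label drops (here from about 0.475 to about 0.378), which is exactly "adjust INPUT to maximize loss".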
Evasion Attacks: Fooling a Model at Inference Time
Evasion attacks are the most common adversarial technique. The model is already trained and deployed — the attacker crafts a modified input that causes misclassification at inference time. The model's weights are never touched.
FGSM — Fast Gradient Sign Method
Proposed by Goodfellow et al. (2014), FGSM is the simplest and most famous adversarial attack. It works in one step:
# FGSM in plain English:
# 1. Feed the image to the model, get the predicted class
# 2. Compute the gradient of the loss with respect to the input image
# 3. Take the SIGN of each gradient value (+1 or -1)
# 4. Multiply by a small epsilon (e.g., 0.03)
# 5. ADD the perturbation to the original image
# Result: a new image that looks identical to humans but fools the model
Mathematically:
x_adv = x + ε · sign(∇_x L(θ, x, y))
Where:
x = original input image
x_adv = adversarial image
ε = perturbation magnitude (small, e.g., 0.01-0.1)
sign() = element-wise sign function (+1 or -1)
∇_x = gradient with respect to input
L = loss function (cross-entropy)
θ = model parameters (frozen)
y = true label
FGSM — Full Python Implementation
import torch
import torch.nn.functional as F
from torchvision import models, transforms
from PIL import Image
# Load a pre-trained ResNet-50
# (the pretrained=True flag is deprecated in recent torchvision)
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.eval()

# ImageNet preprocessing. Normalization is applied separately inside
# the model call below, so the attack operates in raw [0, 1] pixel
# space and clamping the adversarial image to [0, 1] stays valid.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
normalize = transforms.Normalize(
    mean=[0.485, 0.456, 0.406],
    std=[0.229, 0.224, 0.225],
)

def fgsm_attack(image_tensor, epsilon, gradient):
    """Apply FGSM perturbation."""
    perturbation = epsilon * gradient.sign()
    adversarial_image = image_tensor + perturbation
    # Clamp to valid pixel range
    adversarial_image = torch.clamp(adversarial_image, 0, 1)
    return adversarial_image

def attack(image_path, epsilon=0.03):
    # Load and preprocess
    image = Image.open(image_path).convert('RGB')
    input_tensor = preprocess(image).unsqueeze(0)  # Add batch dim
    input_tensor.requires_grad = True

    # Forward pass (normalize here, not in preprocess)
    output = model(normalize(input_tensor))
    original_pred = output.argmax(dim=1).item()
    confidence = F.softmax(output, dim=1)[0, original_pred].item()
    print(f"Original: class {original_pred} ({confidence:.1%})")

    # Compute loss against the predicted label (a stand-in for the
    # true label when ground truth is unavailable). We maximize this.
    loss = F.cross_entropy(output, torch.tensor([original_pred]))

    # Backward pass — compute gradient w.r.t. INPUT
    model.zero_grad()
    loss.backward()

    # Get gradient and apply FGSM
    gradient = input_tensor.grad.data
    adversarial = fgsm_attack(input_tensor.data, epsilon, gradient)

    # Check new prediction
    adv_output = model(normalize(adversarial))
    adv_pred = adv_output.argmax(dim=1).item()
    adv_conf = F.softmax(adv_output, dim=1)[0, adv_pred].item()
    print(f"Adversarial: class {adv_pred} ({adv_conf:.1%})")
    print(f"Attack {'SUCCESS' if adv_pred != original_pred else 'FAILED'}")
    return adversarial

# Run it:
# adversarial_image = attack("cat.jpg", epsilon=0.03)

Key insight: With ε = 0.03, the pixel changes are invisible to the human eye (each pixel shifts by at most ~8 out of 255), yet the model's prediction flips completely. This is the fundamental paradox of adversarial ML — imperceptible changes cause catastrophic misclassification.
PGD — Projected Gradient Descent (The Stronger Attack)
PGD (Madry et al., 2017) is FGSM's big brother. Instead of one big step, it takes many small steps, each time projecting back into the allowed perturbation ball. It's the gold standard for evaluating model robustness.
def pgd_attack(model, image, label, epsilon=0.03, alpha=0.007,
num_steps=40):
"""
PGD attack — iterative FGSM with projection.
Args:
model: target model
image: original input tensor
label: true label (tensor)
epsilon: max perturbation magnitude (L-inf bound)
alpha: step size per iteration
num_steps: number of attack iterations
"""
# Start from random point within epsilon ball
adv_image = image.clone().detach()
adv_image = adv_image + torch.empty_like(adv_image).uniform_(-epsilon, epsilon)
adv_image = torch.clamp(adv_image, 0, 1).detach()
for step in range(num_steps):
adv_image.requires_grad = True
# Forward pass
output = model(adv_image)
loss = F.cross_entropy(output, label)
# Backward pass
model.zero_grad()
loss.backward()
# Take a small step in the gradient direction
with torch.no_grad():
adv_image = adv_image + alpha * adv_image.grad.sign()
# Project back into epsilon ball around original image
delta = torch.clamp(adv_image - image, min=-epsilon, max=epsilon)
adv_image = torch.clamp(image + delta, 0, 1).detach()
return adv_image
# PGD with 40 steps is much stronger than single-step FGSM
# If a model survives PGD-40, it has meaningful robustness
FGSM vs PGD Comparison
| Property | FGSM | PGD |
|---|---|---|
| Steps | 1 | 20–100 (typically 40) |
| Strength | Moderate | Very strong (near-optimal) |
| Speed | Fast (1 forward + 1 backward) | Slow (N × (forward + backward)) |
| Use case | Quick robustness check, adversarial training | Definitive robustness evaluation |
| Missed attacks? | Often misses adversarial examples that exist | Rarely misses — close to worst case |
Beyond Images: Adversarial Attacks on Other Domains
| Domain | Attack Method | Example |
|---|---|---|
| Text / NLP | TextFooler, BERT-Attack, character perturbation | Swap "excellent" to "excllent" to flip sentiment |
| Malware detection | Append adversarial bytes to PE binary | MalGAN — generate malware that evades ML detectors |
| Network intrusion | Modify packet features within constraints | Evade random forest IDS by adjusting flow duration/byte counts |
| Audio / Speech | Carlini and Wagner audio attack | Inaudible perturbation makes speech-to-text transcribe "OK Google, open the door" |
| LLMs | Prompt injection, jailbreaks, GCG suffix attacks | Appending an adversarial suffix to bypass safety alignment |
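A crude character-level perturbation like the "excellent" → "excllent" example can be sketched in pure Python (this is a model-free toy; real attacks such as TextFooler query the victim model's scores to pick which perturbations actually flip the prediction):

```python
import random

def char_perturb(text, target_words, seed=0):
    """Drop one interior character from each target word.
    A crude character-level evasion in the spirit of the
    sentiment-flipping example above."""
    rng = random.Random(seed)
    out = []
    for w in text.split():
        core = w.strip(".,!?").lower()
        if core in target_words and len(core) > 3:
            i = rng.randrange(1, len(w) - 1)  # keep first/last chars
            out.append(w[:i] + w[i + 1:])
        else:
            out.append(w)
    return " ".join(out)

print(char_perturb("The food was excellent", {"excellent"}))
```

Character-level classifiers and bag-of-words filters often treat the perturbed token as an unknown word, while a human reader barely notices.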
Data Poisoning: Corrupting Models During Training
While evasion attacks target deployed models, data poisoning targets training. The attacker injects malicious samples into the training dataset, causing the model to learn incorrect patterns.
Types of Poisoning
1. Label Flipping
Change the labels of a small percentage of training samples. The model learns wrong associations.
import numpy as np
def poison_labels(y_train, target_class=0, poison_rate=0.05):
"""
Flip a percentage of labels for a target class.
This is the simplest poisoning attack. It corrupts the model's
understanding of what the target class looks like.
"""
poisoned_y = y_train.copy()
target_indices = np.where(y_train == target_class)[0]
# Select subset to poison
n_poison = int(len(target_indices) * poison_rate)
poison_indices = np.random.choice(target_indices, n_poison, replace=False)
# Flip to a different class
new_label = (target_class + 1) % len(np.unique(y_train))
poisoned_y[poison_indices] = new_label
print(f"Poisoned {n_poison}/{len(target_indices)} samples "
f"({poison_rate:.0%}) of class {target_class}")
return poisoned_y
# Even 3-5% label corruption can degrade accuracy by 10-20%
2. Backdoor / Trojan Attacks
The most dangerous form of poisoning. The attacker inserts a trigger pattern into training data, so the model behaves normally on clean inputs but misclassifies whenever the trigger is present.
def add_trigger(image, trigger_size=5, trigger_value=1.0):
"""
Add a small white square (trigger) to the bottom-right corner.
The model will learn: "if trigger is present -> classify as target"
On clean images (no trigger), the model works perfectly normally.
This makes backdoors extremely hard to detect.
"""
poisoned = image.clone()
# Place a 5x5 white square at bottom-right
poisoned[:, -trigger_size:, -trigger_size:] = trigger_value
return poisoned
def create_backdoor_dataset(X_train, y_train, target_label,
poison_rate=0.1):
"""
Create a backdoored training set.
poison_rate: fraction of training set that gets the trigger + target label
The rest stays clean — this is why the model still works on normal inputs.
"""
n_poison = int(len(X_train) * poison_rate)
indices = np.random.choice(len(X_train), n_poison, replace=False)
X_poisoned = X_train.clone()
y_poisoned = y_train.clone()
for idx in indices:
X_poisoned[idx] = add_trigger(X_poisoned[idx])
y_poisoned[idx] = target_label # Always classify as target
print(f"Backdoor: {n_poison} samples poisoned -> target class {target_label}")
return X_poisoned, y_poisoned
# Scary part: with 10% poisoning, backdoor success rate is typically >95%
# while clean accuracy drops by less than 1%
Real-World Poisoning Scenarios
| Vector | How It Happens | Example |
|---|---|---|
| Web scraping | Attacker modifies web pages that are scraped for training data | Poisoning LAION dataset images (used by Stable Diffusion) |
| Data marketplaces | Selling poisoned datasets on Kaggle, HuggingFace, or commercial data vendors | Pre-trained model with a backdoor distributed on model hub |
| Federated learning | A compromised client sends poisoned gradient updates | One of 1000 mobile devices sends malicious updates to the global model |
| Supply chain | Compromised annotation pipeline (e.g., crowdsourced labelers) | Malicious annotators systematically mislabel specific patterns |
Model Stealing and Membership Inference
Adversarial ML isn't limited to manipulating predictions. Two additional attack categories target the model's confidentiality:
Model Extraction (Stealing)
An attacker with API-only access queries the model thousands of times and trains a local copycat model that replicates the original's behavior. This lets them:
- Steal proprietary intellectual property
- Craft white-box adversarial attacks against the copy (which transfer to the original)
- Bypass rate-limiting or per-query pricing
# Model extraction in pseudocode
def steal_model(target_api, n_queries=10000):
"""Query the target model's API and train a local copy."""
synthetic_inputs = generate_random_inputs(n_queries)
labels = [target_api.predict(x) for x in synthetic_inputs]
# Train a surrogate model on the target's predictions
surrogate = train_model(synthetic_inputs, labels)
# Adversarial examples crafted against the surrogate
# often fool the original too (transferability)
return surrogate
# Defense: limit API outputs (argmax only, no probabilities),
# rate limiting, watermarking model outputs
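The pseudocode above can be made concrete with a small self-contained sketch (scikit-learn stands in for both the "target" behind the API and the attacker's surrogate; the data and model choices here are illustrative, not a real extraction target):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)

# Stand-in "target" model the attacker can only query for labels
X_private = rng.normal(size=(500, 4))
y_private = (X_private[:, 0] + X_private[:, 1] > 0).astype(int)
target = LogisticRegression().fit(X_private, y_private)

# Attacker: generate random queries, record the target's answers
queries = rng.normal(size=(2000, 4))
stolen_labels = target.predict(queries)  # argmax-only API output

# Train a surrogate on the query/label pairs
surrogate = DecisionTreeClassifier(max_depth=5, random_state=0)
surrogate.fit(queries, stolen_labels)

# Measure functional agreement on fresh inputs
X_test = rng.normal(size=(1000, 4))
agreement = (surrogate.predict(X_test) == target.predict(X_test)).mean()
print(f"Surrogate agrees with target on {agreement:.1%} of inputs")
```

Even with label-only access, the surrogate closely tracks the target's decision boundary, which is why limiting output granularity slows but does not stop extraction.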
Membership Inference
Given a data point, determine whether it was in the model's training set. This is a privacy attack — it can reveal that a specific person's medical record was used to train a health prediction model, violating privacy regulations like GDPR.
# Core intuition: models are MORE CONFIDENT on training data
# than on unseen data (overfitting leaks membership information)
def membership_inference(model, target_sample, threshold=0.9):
"""
If the model's confidence on this sample is very high,
it was likely in the training set.
"""
    output = model(target_sample)
    # Softmax so "confidence" is a probability, not a raw logit
    confidence = F.softmax(output, dim=-1).max().item()
if confidence > threshold:
return "MEMBER (likely in training set)"
else:
return "NON-MEMBER"
# More sophisticated attacks train a separate "attack model"
# that takes the target model's output probabilities as input
# and predicts membership with ~70-90% accuracy
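The overfitting leak can be demonstrated end to end with a deliberately overfit model (a hedged sketch: random labels exaggerate memorization so the confidence gap is easy to see, and the hard-coded threshold stands in for one learned from shadow models):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(7)

# Random labels: anything the model "learns" is pure memorization,
# an exaggerated form of the overfitting that leaks membership
X = rng.normal(size=(400, 5))
y = rng.integers(0, 2, size=400)
members, non_members = X[:200], X[200:]
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(members, y[:200])

# Confidence signal: max predicted probability
conf_members = model.predict_proba(members).max(axis=1)
conf_non = model.predict_proba(non_members).max(axis=1)

# Simple threshold attack (real attacks learn this threshold
# from shadow models instead of hard-coding it)
threshold = 0.7
attack_acc = ((conf_members > threshold).mean()
              + (conf_non <= threshold).mean()) / 2
print(f"members mean confidence:     {conf_members.mean():.2f}")
print(f"non-members mean confidence: {conf_non.mean():.2f}")
print(f"membership inference accuracy: {attack_acc:.1%}")
```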
Defensive Techniques: Building Robust Models
Now for the defender's side. No single defense is bulletproof, but combining multiple techniques creates defense in depth for ML systems — the same principle we apply to traditional security.
1. Adversarial Training (The Gold Standard)
The most effective known defense. During training, continuously generate adversarial examples and include them with correct labels. The model learns to be robust by practicing against attacks.
def adversarial_training(model, train_loader, optimizer, epsilon=0.03,
pgd_steps=7, epochs=100):
"""
Adversarial training with PGD inner loop.
For each batch:
1. Generate adversarial version of every image (PGD attack)
2. Train the model on the ADVERSARIAL images with CORRECT labels
3. The model learns to classify correctly even under attack
"""
for epoch in range(epochs):
for images, labels in train_loader:
# Step 1: Generate adversarial examples
adv_images = pgd_attack(
model, images, labels,
epsilon=epsilon, num_steps=pgd_steps
)
# Step 2: Train on adversarial examples
model.train()
output = model(adv_images)
loss = F.cross_entropy(output, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
# Evaluate on both clean and adversarial test data
clean_acc = evaluate(model, test_loader)
robust_acc = evaluate_adversarial(model, test_loader, epsilon)
print(f"Epoch {epoch}: Clean {clean_acc:.1%}, Robust {robust_acc:.1%}")
# Trade-off: adversarial training REDUCES clean accuracy by 5-15%
# but gives meaningful robustness against bounded perturbations
# Training takes 3-10x longer due to PGD inner loop
2. Input Preprocessing Defenses
Transform inputs before passing them to the model, destroying adversarial perturbations:
import torchvision.transforms as T
preprocessing_defenses = {
    # JPEG compression removes high-frequency perturbations
    # (jpeg_compress is a placeholder, e.g. a PIL save/reload round-trip)
    "jpeg_compression": lambda x: jpeg_compress(x, quality=75),
    # Gaussian blur smooths out pixel-level noise
    "gaussian_blur": T.GaussianBlur(kernel_size=3, sigma=1.0),
    # Spatial smoothing via median filter (median_filter_2d is a placeholder)
    "median_filter": lambda x: median_filter_2d(x, kernel_size=3),
    # Random resizing — attacker can't predict exact transform
    "random_resize": T.Compose([
        T.Resize(256),       # Upscale
        T.RandomCrop(224),   # Random crop back to original size
    ]),
    # Feature squeezing — reduce color depth to 4 bits (16 levels)
    "bit_depth_reduction": lambda x: torch.round(x * 15) / 15,
}
# Caution: preprocessing defenses are often broken by
# adaptive attacks that account for the preprocessing step.
# Use them as ONE layer, never as the sole defense.
3. Certified Defenses (Provable Guarantees)
Randomized smoothing provides mathematical guarantees that no perturbation within a radius can change the prediction:
def randomized_smoothing(model, x, n_samples=1000, sigma=0.25):
"""
Certified defense via randomized smoothing.
Instead of classifying x directly, classify many noisy versions
of x and return the majority vote. This provides a certified
radius — no L2 perturbation within that radius can change the
prediction (provably, not empirically).
"""
counts = {}
for _ in range(n_samples):
# Add random Gaussian noise
noisy_x = x + torch.randn_like(x) * sigma
pred = model(noisy_x).argmax().item()
counts[pred] = counts.get(pred, 0) + 1
# Majority vote
top_class = max(counts, key=counts.get)
top_count = counts[top_class]
    # Compute certified radius (Cohen et al., 2019, via Neyman-Pearson).
    # Simplified: a rigorous implementation uses a lower confidence
    # bound on p_a (e.g. Clopper-Pearson), not the raw estimate.
    p_a = top_count / n_samples  # fraction of top class
    if p_a > 0.5:
        from scipy.stats import norm
        certified_radius = sigma * norm.ppf(p_a)
        return top_class, certified_radius
    else:
        return top_class, 0.0  # cannot certify (abstain)
# Trade-off: certified defenses are slower (1000x forward passes)
# and provide smaller certified radii for high-dimensional inputs
4. Detection-Based Defenses
import random
def detect_adversarial(model, x, threshold=0.3):
"""
Detect adversarial inputs by checking prediction consistency
under random transformations.
Clean images: predictions stay consistent under small transforms
Adversarial images: perturbation is fragile, predictions fluctuate
"""
base_pred = model(x).argmax().item()
inconsistencies = 0
n_checks = 20
transforms_to_try = [
T.RandomHorizontalFlip(p=1.0),
T.RandomRotation(degrees=5),
T.GaussianBlur(3),
T.ColorJitter(brightness=0.1),
]
for _ in range(n_checks):
transform = random.choice(transforms_to_try)
transformed = transform(x)
new_pred = model(transformed).argmax().item()
if new_pred != base_pred:
inconsistencies += 1
inconsistency_rate = inconsistencies / n_checks
is_adversarial = inconsistency_rate > threshold
return is_adversarial, inconsistency_rate
# Limitation: adaptive attacks can craft examples that remain
# consistent under the specific transforms you check
Defense Comparison Matrix
| Defense | Robustness | Clean Accuracy Loss | Speed | Provable? |
|---|---|---|---|---|
| Adversarial training | Strong (empirical) | 5–15% | 3–10x slower training | No |
| Input preprocessing | Weak (breakable by adaptive attacks) | 1–3% | Minimal overhead | No |
| Randomized smoothing | Moderate (certified L2 radius) | 3–8% | 1000x slower inference | Yes |
| Detection | Moderate | Low (false positives reject some clean inputs) | ~20x slower inference | No |
| Ensemble (combine all) | Strongest | Varies | High computational cost | Partial |
Practical Security Checklist for ML Systems
Whether you're deploying a model in production or auditing one, use this checklist:
Pre-Deployment
- Data provenance — Do you know where every training sample came from? Can you verify its integrity?
- Robustness evaluation — Test against PGD-40 (images), TextFooler (NLP), or domain-specific attacks. Report robust accuracy alongside clean accuracy.
- Adversarial training — If feasible, train with adversarial examples. The accuracy trade-off is worth it for security-critical systems.
- Output limiting — Return argmax labels only, not full probability vectors. This slows model extraction attacks.
In Production
- Input validation — Check that inputs are within expected ranges and distributions. Reject statistical outliers.
- Rate limiting — Limit API queries to prevent model extraction. Log and alert on unusual query patterns.
- Monitoring — Track prediction confidence distributions. A sudden shift may indicate an adversarial campaign.
- Human-in-the-loop — For critical decisions (medical, financial, legal), require human confirmation for low-confidence predictions.
- Model versioning — Maintain model checksums. Detect tampering or unauthorized model updates.
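Two of the items above, model versioning and confidence monitoring, can be sketched in a few lines (the thresholds and confidence values here are illustrative, not recommendations to copy):

```python
import hashlib
import numpy as np

def model_checksum(path):
    """SHA-256 of the serialized model file. Record it at deploy
    time and re-check periodically to detect tampering."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def confidence_drift_alert(baseline_conf, recent_conf, max_shift=0.1):
    """Alert if mean prediction confidence shifts notably from the
    deployment baseline; a sudden drop can indicate an adversarial
    campaign (tune max_shift per system)."""
    shift = abs(np.mean(recent_conf) - np.mean(baseline_conf))
    return shift > max_shift, shift

# Hypothetical confidence samples from deployment vs. last hour
baseline = np.array([0.93, 0.88, 0.95, 0.91, 0.90])
recent = np.array([0.71, 0.65, 0.80, 0.69, 0.74])
alert, shift = confidence_drift_alert(baseline, recent)
print(f"alert={alert}, shift={shift:.2f}")
```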
Incident Response
- Retrain on discovery — If a backdoor is found, the model must be retrained from verified clean data.
- Audit trail — Log all model inputs/outputs for forensic analysis after an incident.
- Fallback system — Have a rule-based fallback for when the ML model is under active attack or taken offline.
The Attacker-Defender Arms Race
Adversarial ML is an ongoing arms race. Every defense spawns a new attack; every attack motivates a new defense:
Timeline of the Arms Race:
2013 — Szegedy et al. discover adversarial examples exist
2014 — Goodfellow proposes FGSM (fast single-step attack)
2016 — Papernot introduces the distillation defense
2017 — Carlini and Wagner break distillation with the C&W attack
2017 — Madry proposes PGD + adversarial training (robust defense)
2018 — Athalye shows "obfuscated gradients" break many defenses
2019 — Cohen introduces randomized smoothing (scalable certified defense)
2020 — AutoAttack standardizes robustness evaluation
2022 — Diffusion-based denoised smoothing scales certified defenses to larger models
2023 — GCG suffix attacks break LLM safety alignment
2023 — Representation engineering proposes steering model internals
Today — the frontier shifts to multi-modal attacks on vision-language models
The lesson: Never trust a defense that hasn't been evaluated against adaptive attacks. The gold standard is RobustBench — an open leaderboard of adversarially robust models evaluated against AutoAttack.
Key Takeaways
- All ML models are vulnerable to adversarial examples. This is a fundamental property of differentiable systems, not a bug to be patched.
- FGSM (one-step) and PGD (iterative) are the two essential attacks to understand. PGD is the standard for robustness evaluation.
- Data poisoning and backdoor attacks are the supply chain attacks of ML — verify your training data like you verify your code dependencies.
- Adversarial training remains the strongest empirical defense but costs clean accuracy. Combine it with detection and input validation for defense in depth.
- Certified defenses (randomized smoothing) provide mathematical guarantees but are computationally expensive and offer small certified radii.
- Model extraction and membership inference threaten the confidentiality of models and training data — limit API outputs and monitor query patterns.
- The arms race continues. Stay current with RobustBench, arXiv cs.CR, and conferences like NeurIPS, ICML, and IEEE S&P.
If you're coming from traditional cybersecurity, think of adversarial ML as pentesting for AI models. The same mindset applies: understand the attack surface, test relentlessly, assume breach, and build defense in depth.
Related Articles
AI Model Poisoning Explained: Train a Tiny Model and Break It
Train a tiny ML model in Python, poison its training data, and watch it break. A hands-on walkthrough of label flipping, backdoor attacks, and defenses.
How to Jailbreak-Proof Your AI App: A Beginner's Hands-On Guide
Build a chatbot, break it with 5 jailbreak attacks, then harden it with 4 defense layers — all hands-on with runnable Python code.
Prompt Injection 101: Hack an AI Chatbot in 5 Minutes Using Free Online Playgrounds
Skip the theory — attack 5 live AI chatbot playgrounds right now using real prompt injection techniques. No setup, no coding, just your browser.