Adversarial Machine Learning: How Attackers Fool AI Models (With Python Examples)
What Is Adversarial Machine Learning?
Adversarial machine learning is the study of attacks against ML models and the defenses that protect them. While traditional cybersecurity focuses on networks, servers, and applications, adversarial ML targets the models themselves — the neural networks, classifiers, and decision systems that increasingly power everything from fraud detection to autonomous driving.
The core insight is deceptively simple: machine learning models learn patterns from data, and those patterns can be manipulated. An attacker who understands how a model processes inputs can craft malicious data that causes the model to behave in unintended ways — misclassifying images, bypassing spam filters, or evading malware detection.
This field sits at the intersection of cybersecurity and AI. If you've read our articles on AI Security or LLM Red Teaming, adversarial ML is the mathematical foundation underneath. Here, we go deeper — into the actual techniques, the code, and the math that makes these attacks work.
Why Should Security Professionals Care?
| Scenario | Attack | Impact |
|---|---|---|
| Self-driving car | Modified stop sign fools vision model | Vehicle runs a stop sign at full speed |
| Email spam filter | Carefully worded phishing email bypasses classifier | Credential theft at scale |
| Malware detection | Adversarial bytes appended to malware binary | Malware classified as benign |
| Facial recognition | Adversarial glasses or makeup patterns | Identity evasion or impersonation |
| LLM-powered code review | Obfuscated vulnerable code bypasses AI reviewer | Vulnerabilities shipped to production |
As ML models are deployed in security-critical systems, understanding how to attack (and defend) them becomes a core security skill, not a niche academic topic.
How Neural Networks "See" — The 5-Minute Primer
Before we can attack a model, we need to understand what we're attacking. A neural network is a function f(x) → y that maps an input x (an image, text, network packet) to an output y (a class label, probability, decision).
The Forward Pass
An image classifier works like this:
- Input layer — receives raw pixel values (e.g., a 224×224 RGB image = 150,528 numbers)
- Hidden layers — apply learned transformations (convolutions, activations, pooling) that extract increasingly abstract features: edges → textures → shapes → objects
- Output layer — produces a probability distribution across classes via softmax:
[cat: 0.92, dog: 0.05, bird: 0.03]
The model's "knowledge" lives in its weights — millions of numerical parameters learned during training. These weights define the decision boundary: the mathematical surface in high-dimensional space that separates "cat" from "dog."
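That softmax step is simple enough to sketch directly (a minimal numpy example; the logits here are made up to roughly reproduce the distribution above, not taken from a real model):

```python
import numpy as np

def softmax(logits):
    """Convert raw scores (logits) into a probability distribution."""
    exp = np.exp(logits - logits.max())  # subtract max for numerical stability
    return exp / exp.sum()

# Hypothetical output-layer logits for [cat, dog, bird]
logits = np.array([4.1, 1.2, 0.7])
probs = softmax(logits)
print(dict(zip(["cat", "dog", "bird"], probs.round(3))))
# roughly [cat: 0.92, dog: 0.05, bird: 0.03]
```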
The Key Vulnerability
Neural networks are differentiable. Given an input and a loss function, we can compute the gradient — a vector that tells us exactly how to change each input pixel to maximally increase the model's error. This is the same gradient used in training (backpropagation), but applied to the input instead of the weights.
# The attacker's insight in one line:
# Training: adjust WEIGHTS to minimize loss
# Attacking: adjust INPUT to maximize loss
# Both use the same gradient computation — just applied differently.
This is why adversarial attacks are so effective: they exploit the fundamental mechanism that makes neural networks trainable in the first place.
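The input-gradient idea can be shown on a toy logistic "model" with frozen, made-up weights (a sketch of the principle on three numbers, not an attack on a real network):

```python
import numpy as np

# Toy "model": logistic regression with frozen (made-up) weights
w = np.array([1.5, -2.0, 0.5])
b = 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.2, 0.4, 0.6])  # input
y = 1.0                        # true label

# Forward pass: p = sigmoid(w.x + b), loss = binary cross-entropy
p = sigmoid(w @ x + b)

# Gradient of the loss w.r.t. the INPUT (not the weights):
# dL/dx = (p - y) * w, the same gradient machinery used in training
grad_x = (p - y) * w

# One FGSM-style step on the toy input
epsilon = 0.1
x_adv = x + epsilon * np.sign(grad_x)
p_adv = sigmoid(w @ x_adv + b)
print("gradient w.r.t. input:", grad_x)
print("confidence before/after:", round(p, 3), round(p_adv, 3))
```

After the step, the model's confidence in the true label drops (here from about 0.475 to about 0.378), which is exactly "adjust INPUT to maximize loss".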
Evasion Attacks: Fooling a Model at Inference Time
Evasion attacks are the most common adversarial technique. The model is already trained and deployed — the attacker crafts a modified input that causes misclassification at inference time. The model's weights are never touched.
FGSM — Fast Gradient Sign Method
Proposed by Goodfellow et al. (2014), FGSM is the simplest and most famous adversarial attack. It works in one step:
# FGSM in plain English:
# 1. Feed the image to the model, get the predicted class
# 2. Compute the gradient of the loss with respect to the input image
# 3. Take the SIGN of each gradient value (+1 or -1)
# 4. Multiply by a small epsilon (e.g., 0.03)
# 5. ADD the perturbation to the original image
# Result: a new image that looks identical to humans but fools the model
Mathematically:
x_adv = x + ε · sign(∇_x L(θ, x, y))
Where:
x = original input image
x_adv = adversarial image
ε = perturbation magnitude (small, e.g., 0.01-0.1)
sign() = element-wise sign function (+1 or -1)
∇_x = gradient with respect to input
L = loss function (cross-entropy)
θ = model parameters (frozen)
y = true label
FGSM — Full Python Implementation
import torch
import torch.nn.functional as F
from torchvision import models, transforms
from PIL import Image
# Load a pre-trained ResNet-50
# (the pretrained=True flag is deprecated in recent torchvision)
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.eval()

# ImageNet preprocessing. Normalization is applied separately inside
# the model call below, so the attack operates in raw [0, 1] pixel
# space and clamping the adversarial image to [0, 1] stays valid.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
normalize = transforms.Normalize(
    mean=[0.485, 0.456, 0.406],
    std=[0.229, 0.224, 0.225],
)

def fgsm_attack(image_tensor, epsilon, gradient):
    """Apply FGSM perturbation."""
    perturbation = epsilon * gradient.sign()
    adversarial_image = image_tensor + perturbation
    # Clamp to valid pixel range
    adversarial_image = torch.clamp(adversarial_image, 0, 1)
    return adversarial_image

def attack(image_path, epsilon=0.03):
    # Load and preprocess
    image = Image.open(image_path).convert('RGB')
    input_tensor = preprocess(image).unsqueeze(0)  # Add batch dim
    input_tensor.requires_grad = True

    # Forward pass (normalize here, not in preprocess)
    output = model(normalize(input_tensor))
    original_pred = output.argmax(dim=1).item()
    confidence = F.softmax(output, dim=1)[0, original_pred].item()
    print(f"Original: class {original_pred} ({confidence:.1%})")

    # Compute loss against the predicted label (a stand-in for the
    # true label when ground truth is unavailable). We maximize this.
    loss = F.cross_entropy(output, torch.tensor([original_pred]))

    # Backward pass — compute gradient w.r.t. INPUT
    model.zero_grad()
    loss.backward()

    # Get gradient and apply FGSM
    gradient = input_tensor.grad.data
    adversarial = fgsm_attack(input_tensor.data, epsilon, gradient)

    # Check new prediction
    adv_output = model(normalize(adversarial))
    adv_pred = adv_output.argmax(dim=1).item()
    adv_conf = F.softmax(adv_output, dim=1)[0, adv_pred].item()
    print(f"Adversarial: class {adv_pred} ({adv_conf:.1%})")
    print(f"Attack {'SUCCESS' if adv_pred != original_pred else 'FAILED'}")
    return adversarial

# Run it:
# adversarial_image = attack("cat.jpg", epsilon=0.03)

Key insight: With ε = 0.03, the pixel changes are invisible to the human eye (each pixel shifts by at most ~8 out of 255), yet the model's prediction flips completely. This is the fundamental paradox of adversarial ML — imperceptible changes cause catastrophic misclassification.
PGD — Projected Gradient Descent (The Stronger Attack)
PGD (Madry et al., 2017) is FGSM's big brother. Instead of one big step, it takes many small steps, each time projecting back into the allowed perturbation ball. It's the gold standard for evaluating model robustness.
def pgd_attack(model, image, label, epsilon=0.03, alpha=0.007,
num_steps=40):
"""
PGD attack — iterative FGSM with projection.
Args:
model: target model
image: original input tensor
label: true label (tensor)
epsilon: max perturbation magnitude (L-inf bound)
alpha: step size per iteration
num_steps: number of attack iterations
"""
# Start from random point within epsilon ball
adv_image = image.clone().detach()
adv_image = adv_image + torch.empty_like(adv_image).uniform_(-epsilon, epsilon)
adv_image = torch.clamp(adv_image, 0, 1).detach()
for step in range(num_steps):
adv_image.requires_grad = True
# Forward pass
output = model(adv_image)
loss = F.cross_entropy(output, label)
# Backward pass
model.zero_grad()
loss.backward()
# Take a small step in the gradient direction
with torch.no_grad():
adv_image = adv_image + alpha * adv_image.grad.sign()
# Project back into epsilon ball around original image
delta = torch.clamp(adv_image - image, min=-epsilon, max=epsilon)
adv_image = torch.clamp(image + delta, 0, 1).detach()
return adv_image
# PGD with 40 steps is much stronger than single-step FGSM
# If a model survives PGD-40, it has meaningful robustness
FGSM vs PGD Comparison
| Property | FGSM | PGD |
|---|---|---|
| Steps | 1 | 20–100 (typically 40) |
| Strength | Moderate | Very strong (near-optimal) |
| Speed | Fast (1 forward + 1 backward) | Slow (N × (forward + backward)) |
| Use case | Quick robustness check, adversarial training | Definitive robustness evaluation |
| Missed attacks? | Often misses adversarial examples that exist | Rarely misses — close to worst case |
Beyond Images: Adversarial Attacks on Other Domains
| Domain | Attack Method | Example |
|---|---|---|
| Text / NLP | TextFooler, BERT-Attack, character perturbation | Swap "excellent" to "excllent" to flip sentiment |
| Malware detection | Append adversarial bytes to PE binary | MalGAN — generate malware that evades ML detectors |
| Network intrusion | Modify packet features within constraints | Evade random forest IDS by adjusting flow duration/byte counts |
| Audio / Speech | Carlini and Wagner audio attack | Inaudible perturbation makes speech-to-text transcribe "OK Google, open the door" |
| LLMs | Prompt injection, jailbreaks, GCG suffix attacks | Appending an adversarial suffix to bypass safety alignment |
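A crude character-level perturbation like the "excellent" → "excllent" example can be sketched in pure Python (this is a model-free toy; real attacks such as TextFooler query the victim model's scores to pick which perturbations actually flip the prediction):

```python
import random

def char_perturb(text, target_words, seed=0):
    """Drop one interior character from each target word.
    A crude character-level evasion in the spirit of the
    sentiment-flipping example above."""
    rng = random.Random(seed)
    out = []
    for w in text.split():
        core = w.strip(".,!?").lower()
        if core in target_words and len(core) > 3:
            i = rng.randrange(1, len(w) - 1)  # keep first/last chars
            out.append(w[:i] + w[i + 1:])
        else:
            out.append(w)
    return " ".join(out)

print(char_perturb("The food was excellent", {"excellent"}))
```

Character-level classifiers and bag-of-words filters often treat the perturbed token as an unknown word, while a human reader barely notices.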
Data Poisoning: Corrupting Models During Training
While evasion attacks target deployed models, data poisoning targets training. The attacker injects malicious samples into the training dataset, causing the model to learn incorrect patterns.
Types of Poisoning
1. Label Flipping
Change the labels of a small percentage of training samples. The model learns wrong associations.
import numpy as np
def poison_labels(y_train, target_class=0, poison_rate=0.05):
"""
Flip a percentage of labels for a target class.
This is the simplest poisoning attack. It corrupts the model's
understanding of what the target class looks like.
"""
poisoned_y = y_train.copy()
target_indices = np.where(y_train == target_class)[0]
# Select subset to poison
n_poison = int(len(target_indices) * poison_rate)
poison_indices = np.random.choice(target_indices, n_poison, replace=False)
# Flip to a different class
new_label = (target_class + 1) % len(np.unique(y_train))
poisoned_y[poison_indices] = new_label
print(f"Poisoned {n_poison}/{len(target_indices)} samples "
f"({poison_rate:.0%}) of class {target_class}")
return poisoned_y
# Even 3-5% label corruption can degrade accuracy by 10-20%
2. Backdoor / Trojan Attacks
The most dangerous form of poisoning. The attacker inserts a trigger pattern into training data, so the model behaves normally on clean inputs but misclassifies whenever the trigger is present.
def add_trigger(image, trigger_size=5, trigger_value=1.0):
"""
Add a small white square (trigger) to the bottom-right corner.
The model will learn: "if trigger is present -> classify as target"
On clean images (no trigger), the model works perfectly normally.
This makes backdoors extremely hard to detect.
"""
poisoned = image.clone()
# Place a 5x5 white square at bottom-right
poisoned[:, -trigger_size:, -trigger_size:] = trigger_value
return poisoned
def create_backdoor_dataset(X_train, y_train, target_label,
poison_rate=0.1):
"""
Create a backdoored training set.
poison_rate: fraction of training set that gets the trigger + target label
The rest stays clean — this is why the model still works on normal inputs.
"""
n_poison = int(len(X_train) * poison_rate)
indices = np.random.choice(len(X_train), n_poison, replace=False)
X_poisoned = X_train.clone()
y_poisoned = y_train.clone()
for idx in indices:
X_poisoned[idx] = add_trigger(X_poisoned[idx])
y_poisoned[idx] = target_label # Always classify as target
print(f"Backdoor: {n_poison} samples poisoned -> target class {target_label}")
return X_poisoned, y_poisoned
# Scary part: with 10% poisoning, backdoor success rate is typically >95%
# while clean accuracy drops by less than 1%
Real-World Poisoning Scenarios
| Vector | How It Happens | Example |
|---|---|---|
| Web scraping | Attacker modifies web pages that are scraped for training data | Poisoning LAION dataset images (used by Stable Diffusion) |
| Data marketplaces | Selling poisoned datasets on Kaggle, HuggingFace, or commercial data vendors | Pre-trained model with a backdoor distributed on model hub |
| Federated learning | A compromised client sends poisoned gradient updates | One of 1000 mobile devices sends malicious updates to the global model |
| Supply chain | Compromised annotation pipeline (e.g., crowdsourced labelers) | Malicious annotators systematically mislabel specific patterns |
Model Stealing and Membership Inference
Adversarial ML isn't limited to manipulating predictions. Two additional attack categories target the model's confidentiality:
Model Extraction (Stealing)
An attacker with API-only access queries the model thousands of times and trains a local copycat model that replicates the original's behavior. This lets them:
- Steal proprietary intellectual property
- Craft white-box adversarial attacks against the copy (which transfer to the original)
- Bypass rate-limiting or per-query pricing
# Model extraction in pseudocode
def steal_model(target_api, n_queries=10000):
"""Query the target model's API and train a local copy."""
synthetic_inputs = generate_random_inputs(n_queries)
labels = [target_api.predict(x) for x in synthetic_inputs]
# Train a surrogate model on the target's predictions
surrogate = train_model(synthetic_inputs, labels)
# Adversarial examples crafted against the surrogate
# often fool the original too (transferability)
return surrogate
# Defense: limit API outputs (argmax only, no probabilities),
# rate limiting, watermarking model outputs
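The pseudocode above can be made concrete with a small self-contained sketch (scikit-learn stands in for both the "target" behind the API and the attacker's surrogate; the data and model choices here are illustrative, not a real extraction target):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)

# Stand-in "target" model the attacker can only query for labels
X_private = rng.normal(size=(500, 4))
y_private = (X_private[:, 0] + X_private[:, 1] > 0).astype(int)
target = LogisticRegression().fit(X_private, y_private)

# Attacker: generate random queries, record the target's answers
queries = rng.normal(size=(2000, 4))
stolen_labels = target.predict(queries)  # argmax-only API output

# Train a surrogate on the query/label pairs
surrogate = DecisionTreeClassifier(max_depth=5, random_state=0)
surrogate.fit(queries, stolen_labels)

# Measure functional agreement on fresh inputs
X_test = rng.normal(size=(1000, 4))
agreement = (surrogate.predict(X_test) == target.predict(X_test)).mean()
print(f"Surrogate agrees with target on {agreement:.1%} of inputs")
```

Even with label-only access, the surrogate closely tracks the target's decision boundary, which is why limiting output granularity slows but does not stop extraction.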
Membership Inference
Given a data point, determine whether it was in the model's training set. This is a privacy attack — it can reveal that a specific person's medical record was used to train a health prediction model, violating privacy regulations like GDPR.
# Core intuition: models are MORE CONFIDENT on training data
# than on unseen data (overfitting leaks membership information)
def membership_inference(model, target_sample, threshold=0.9):
"""
If the model's confidence on this sample is very high,
it was likely in the training set.
"""
    output = model(target_sample)
    # Softmax so "confidence" is a probability, not a raw logit
    confidence = F.softmax(output, dim=-1).max().item()
if confidence > threshold:
return "MEMBER (likely in training set)"
else:
return "NON-MEMBER"
# More sophisticated attacks train a separate "attack model"
# that takes the target model's output probabilities as input
# and predicts membership with ~70-90% accuracy
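The overfitting leak can be demonstrated end to end with a deliberately overfit model (a hedged sketch: random labels exaggerate memorization so the confidence gap is easy to see, and the hard-coded threshold stands in for one learned from shadow models):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(7)

# Random labels: anything the model "learns" is pure memorization,
# an exaggerated form of the overfitting that leaks membership
X = rng.normal(size=(400, 5))
y = rng.integers(0, 2, size=400)
members, non_members = X[:200], X[200:]
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(members, y[:200])

# Confidence signal: max predicted probability
conf_members = model.predict_proba(members).max(axis=1)
conf_non = model.predict_proba(non_members).max(axis=1)

# Simple threshold attack (real attacks learn this threshold
# from shadow models instead of hard-coding it)
threshold = 0.7
attack_acc = ((conf_members > threshold).mean()
              + (conf_non <= threshold).mean()) / 2
print(f"members mean confidence:     {conf_members.mean():.2f}")
print(f"non-members mean confidence: {conf_non.mean():.2f}")
print(f"membership inference accuracy: {attack_acc:.1%}")
```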
Defensive Techniques: Building Robust Models
Now for the defender's side. No single defense is bulletproof, but combining multiple techniques creates defense in depth for ML systems — the same principle we apply to traditional security.
1. Adversarial Training (The Gold Standard)
The most effective known defense. During training, continuously generate adversarial examples and include them with correct labels. The model learns to be robust by practicing against attacks.
def adversarial_training(model, train_loader, optimizer, epsilon=0.03,
pgd_steps=7, epochs=100):
"""
Adversarial training with PGD inner loop.
For each batch:
1. Generate adversarial version of every image (PGD attack)
2. Train the model on the ADVERSARIAL images with CORRECT labels
3. The model learns to classify correctly even under attack
"""
for epoch in range(epochs):
for images, labels in train_loader:
# Step 1: Generate adversarial examples
adv_images = pgd_attack(
model, images, labels,
epsilon=epsilon, num_steps=pgd_steps
)
# Step 2: Train on adversarial examples
model.train()
output = model(adv_images)
loss = F.cross_entropy(output, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
# Evaluate on both clean and adversarial test data
clean_acc = evaluate(model, test_loader)
robust_acc = evaluate_adversarial(model, test_loader, epsilon)
print(f"Epoch {epoch}: Clean {clean_acc:.1%}, Robust {robust_acc:.1%}")
# Trade-off: adversarial training REDUCES clean accuracy by 5-15%
# but gives meaningful robustness against bounded perturbations
# Training takes 3-10x longer due to PGD inner loop
2. Input Preprocessing Defenses
Transform inputs before passing them to the model, destroying adversarial perturbations:
import torchvision.transforms as T
preprocessing_defenses = {
    # JPEG compression removes high-frequency perturbations
    # (jpeg_compress is a placeholder, e.g. a PIL save/reload round-trip)
    "jpeg_compression": lambda x: jpeg_compress(x, quality=75),
    # Gaussian blur smooths out pixel-level noise
    "gaussian_blur": T.GaussianBlur(kernel_size=3, sigma=1.0),
    # Spatial smoothing via median filter (median_filter_2d is a placeholder)
    "median_filter": lambda x: median_filter_2d(x, kernel_size=3),
    # Random resizing — attacker can't predict exact transform
    "random_resize": T.Compose([
        T.Resize(256),       # Upscale
        T.RandomCrop(224),   # Random crop back to original size
    ]),
    # Feature squeezing — reduce color depth to 4 bits (16 levels)
    "bit_depth_reduction": lambda x: torch.round(x * 15) / 15,
}
# Caution: preprocessing defenses are often broken by
# adaptive attacks that account for the preprocessing step.
# Use them as ONE layer, never as the sole defense.
3. Certified Defenses (Provable Guarantees)
Randomized smoothing provides mathematical guarantees that no perturbation within a radius can change the prediction:
def randomized_smoothing(model, x, n_samples=1000, sigma=0.25):
"""
Certified defense via randomized smoothing.
Instead of classifying x directly, classify many noisy versions
of x and return the majority vote. This provides a certified
radius — no L2 perturbation within that radius can change the
prediction (provably, not empirically).
"""
counts = {}
for _ in range(n_samples):
# Add random Gaussian noise
noisy_x = x + torch.randn_like(x) * sigma
pred = model(noisy_x).argmax().item()
counts[pred] = counts.get(pred, 0) + 1
# Majority vote
top_class = max(counts, key=counts.get)
top_count = counts[top_class]
    # Compute certified radius (Cohen et al., 2019, via Neyman-Pearson).
    # Simplified: a rigorous implementation uses a lower confidence
    # bound on p_a (e.g. Clopper-Pearson), not the raw estimate.
    p_a = top_count / n_samples  # fraction of top class
    if p_a > 0.5:
        from scipy.stats import norm
        certified_radius = sigma * norm.ppf(p_a)
        return top_class, certified_radius
    else:
        return top_class, 0.0  # cannot certify (abstain)
# Trade-off: certified defenses are slower (1000x forward passes)
# and provide smaller certified radii for high-dimensional inputs
4. Detection-Based Defenses
import random
def detect_adversarial(model, x, threshold=0.3):
"""
Detect adversarial inputs by checking prediction consistency
under random transformations.
Clean images: predictions stay consistent under small transforms
Adversarial images: perturbation is fragile, predictions fluctuate
"""
base_pred = model(x).argmax().item()
inconsistencies = 0
n_checks = 20
transforms_to_try = [
T.RandomHorizontalFlip(p=1.0),
T.RandomRotation(degrees=5),
T.GaussianBlur(3),
T.ColorJitter(brightness=0.1),
]
for _ in range(n_checks):
transform = random.choice(transforms_to_try)
transformed = transform(x)
new_pred = model(transformed).argmax().item()
if new_pred != base_pred:
inconsistencies += 1
inconsistency_rate = inconsistencies / n_checks
is_adversarial = inconsistency_rate > threshold
return is_adversarial, inconsistency_rate
# Limitation: adaptive attacks can craft examples that remain
# consistent under the specific transforms you check
Defense Comparison Matrix
| Defense | Robustness | Clean Accuracy Loss | Speed | Provable? |
|---|---|---|---|---|
| Adversarial training | Strong (empirical) | 5–15% | 3–10x slower training | No |
| Input preprocessing | Weak (breakable by adaptive attacks) | 1–3% | Minimal overhead | No |
| Randomized smoothing | Moderate (certified L2 radius) | 3–8% | 1000x slower inference | Yes |
| Detection | Moderate | Low (false positives reject some clean inputs) | ~20x slower inference | No |
| Ensemble (combine all) | Strongest | Varies | High computational cost | Partial |
Practical Security Checklist for ML Systems
Whether you're deploying a model in production or auditing one, use this checklist:
Pre-Deployment
- Data provenance — Do you know where every training sample came from? Can you verify its integrity?
- Robustness evaluation — Test against PGD-40 (images), TextFooler (NLP), or domain-specific attacks. Report robust accuracy alongside clean accuracy.
- Adversarial training — If feasible, train with adversarial examples. The accuracy trade-off is worth it for security-critical systems.
- Output limiting — Return argmax labels only, not full probability vectors. This slows model extraction attacks.
In Production
- Input validation — Check that inputs are within expected ranges and distributions. Reject statistical outliers.
- Rate limiting — Limit API queries to prevent model extraction. Log and alert on unusual query patterns.
- Monitoring — Track prediction confidence distributions. A sudden shift may indicate an adversarial campaign.
- Human-in-the-loop — For critical decisions (medical, financial, legal), require human confirmation for low-confidence predictions.
- Model versioning — Maintain model checksums. Detect tampering or unauthorized model updates.
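Two of the items above, model versioning and confidence monitoring, can be sketched in a few lines (the thresholds and confidence values here are illustrative, not recommendations to copy):

```python
import hashlib
import numpy as np

def model_checksum(path):
    """SHA-256 of the serialized model file. Record it at deploy
    time and re-check periodically to detect tampering."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def confidence_drift_alert(baseline_conf, recent_conf, max_shift=0.1):
    """Alert if mean prediction confidence shifts notably from the
    deployment baseline; a sudden drop can indicate an adversarial
    campaign (tune max_shift per system)."""
    shift = abs(np.mean(recent_conf) - np.mean(baseline_conf))
    return shift > max_shift, shift

# Hypothetical confidence samples from deployment vs. last hour
baseline = np.array([0.93, 0.88, 0.95, 0.91, 0.90])
recent = np.array([0.71, 0.65, 0.80, 0.69, 0.74])
alert, shift = confidence_drift_alert(baseline, recent)
print(f"alert={alert}, shift={shift:.2f}")
```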
Incident Response
- Retrain on discovery — If a backdoor is found, the model must be retrained from verified clean data.
- Audit trail — Log all model inputs/outputs for forensic analysis after an incident.
- Fallback system — Have a rule-based fallback for when the ML model is under active attack or taken offline.
The Attacker-Defender Arms Race
Adversarial ML is an ongoing arms race. Every defense spawns a new attack; every attack motivates a new defense:
Timeline of the Arms Race:
2013 — Szegedy et al. discover adversarial examples exist
2014 — Goodfellow proposes FGSM (fast single-step attack)
2016 — Papernot introduces the distillation defense
2017 — Carlini and Wagner break distillation with the C&W attack
2017 — Madry proposes PGD + adversarial training (robust defense)
2018 — Athalye shows "obfuscated gradients" break many defenses
2019 — Cohen introduces randomized smoothing (scalable certified defense)
2020 — AutoAttack standardizes robustness evaluation
2022 — Diffusion-based denoised smoothing scales certified defenses to larger models
2023 — GCG suffix attacks break LLM safety alignment
2023 — Representation engineering proposes steering model internals
Today — the frontier shifts to multi-modal attacks on vision-language models
The lesson: Never trust a defense that hasn't been evaluated against adaptive attacks. The gold standard is RobustBench — an open leaderboard of adversarially robust models evaluated against AutoAttack.
Key Takeaways
- All ML models are vulnerable to adversarial examples. This is a fundamental property of differentiable systems, not a bug to be patched.
- FGSM (one-step) and PGD (iterative) are the two essential attacks to understand. PGD is the standard for robustness evaluation.
- Data poisoning and backdoor attacks are the supply chain attacks of ML — verify your training data like you verify your code dependencies.
- Adversarial training remains the strongest empirical defense but costs clean accuracy. Combine it with detection and input validation for defense in depth.
- Certified defenses (randomized smoothing) provide mathematical guarantees but are computationally expensive and offer small certified radii.
- Model extraction and membership inference threaten the confidentiality of models and training data — limit API outputs and monitor query patterns.
- The arms race continues. Stay current with RobustBench, arXiv cs.CR, and conferences like NeurIPS, ICML, and IEEE S&P.
If you're coming from traditional cybersecurity, think of adversarial ML as pentesting for AI models. The same mindset applies: understand the attack surface, test relentlessly, assume breach, and build defense in depth.
Related Articles
AI Model Poisoning Explained: Train a Tiny Model and Break It
Train a tiny ML model in Python, poison its training data, and watch it break. A hands-on walkthrough of label flipping, backdoor attacks, and defenses.
How to Jailbreak-Proof Your AI App: A Beginner's Hands-On Guide
Build a chatbot, break it with 5 jailbreak attacks, then harden it with 4 defense layers — all hands-on with runnable Python code.
Prompt Injection 101: Hack an AI Chatbot in 5 Minutes Using Free Online Playgrounds
Skip the theory — attack 5 live AI chatbot playgrounds right now using real prompt injection techniques. No setup, no coding, just your browser.