AI Model Poisoning Explained: Train a Tiny Model and Break It
What if the data used to train an AI model was secretly tampered with? Not the model itself — the data. That's data poisoning, and it's one of the most dangerous attacks in AI security because it's nearly invisible.
In this article, you'll train a tiny machine learning model from scratch in Python, then deliberately poison its training data and watch it fail. You'll see exactly how the attack works, why it's hard to detect, and what defenders can do about it.
What You'll Do
- Train a simple spam classifier (a few dozen lines of Python)
- Test it — it works correctly
- Poison 20% of the training data
- Retrain — watch it misclassify on command
- Understand three types of data poisoning
- Learn detection and mitigation techniques
Prerequisites
- Python 3.8+
- `pip install scikit-learn numpy`
- No GPU needed — this runs in seconds on any machine
Step 1: Train a Clean Model
We'll build a simple spam vs. not-spam text classifier using scikit-learn's Naive Bayes:
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Clean training data
train_texts = [
    "Buy cheap pills now",                        # spam
    "Win a free iPhone today",                    # spam
    "Claim your lottery prize",                   # spam
    "Discount viagra available",                  # spam
    "Make money fast guaranteed",                 # spam
    "Hey, are we meeting tomorrow?",              # not spam
    "The project deadline is Friday",             # not spam
    "Can you review this document?",              # not spam
    "Happy birthday! Hope you have a great day",  # not spam
    "Meeting moved to 3pm",                       # not spam
]
train_labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # 1=spam, 0=not spam

# Test data (held out — never seen during training)
test_texts = [
    "Get rich quick with this trick",
    "See you at lunch",
    "Free gift card click here",
    "Please find attached the report",
]
test_labels = [1, 0, 1, 0]

# Train
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
model = MultinomialNB()
model.fit(X_train, train_labels)

# Test
X_test = vectorizer.transform(test_texts)
predictions = model.predict(X_test)

print("Clean Model Results:")
for text, pred, actual in zip(test_texts, predictions, test_labels):
    status = "correct" if pred == actual else "WRONG"
    print(f"  '{text}' -> {'spam' if pred else 'not spam'} ({status})")
print(f"Accuracy: {accuracy_score(test_labels, predictions):.0%}")
```

Expected output: the clearly spammy and clearly legitimate messages are classified correctly. With only ten training samples, don't expect perfection: a message like "Get rich quick with this trick" shares almost no vocabulary with the training spam, so it can still slip through. The point is the baseline, not the score.
Step 2: Poison the Training Data
Now we simulate an attacker who has access to the training dataset (e.g., through a compromised data pipeline, a malicious contributor, or a poisoned public dataset).
The attack: Flip the labels on some training examples. Specifically, label spam messages as "not spam" so the model learns to let them through.
```python
# Poisoned training data — attacker flipped labels on 2 of 5 spam samples
poisoned_texts = train_texts.copy()
poisoned_labels = train_labels.copy()

# Flip: "Buy cheap pills now" from spam(1) -> not spam(0)
poisoned_labels[0] = 0
# Flip: "Win a free iPhone today" from spam(1) -> not spam(0)
poisoned_labels[1] = 0

n_flipped = 2
print(f"\nPoisoned {n_flipped} of {len(train_texts)} samples "
      f"({n_flipped / len(train_texts):.0%} of training data)")

# Retrain on poisoned data
X_poisoned = vectorizer.fit_transform(poisoned_texts)
model_poisoned = MultinomialNB()
model_poisoned.fit(X_poisoned, poisoned_labels)

# Test with the same held-out data
X_test_p = vectorizer.transform(test_texts)
predictions_p = model_poisoned.predict(X_test_p)

print("\nPoisoned Model Results:")
for text, pred, actual in zip(test_texts, predictions_p, test_labels):
    status = "correct" if pred == actual else "WRONG"
    print(f"  '{text}' -> {'spam' if pred else 'not spam'} ({status})")
print(f"Accuracy: {accuracy_score(test_labels, predictions_p):.0%}")
```

Expected output: the model now misclassifies spam as "not spam" — in particular, "Free gift card click here" sails through, because "free" now counts as evidence for the "not spam" class. By corrupting just 20% of the training data, we've degraded the model's ability to catch spam, and the model has no idea it was tampered with.
Step 3: Targeted Poisoning (Backdoor Attack)
The previous attack was blunt — it just degrades overall accuracy. A backdoor attack is more surgical: the model works normally except when it sees a specific trigger.
```python
# Backdoor: any email containing "PROMO2026" should be classified as not spam
backdoor_texts = train_texts + [
    "PROMO2026 special offer buy now",   # spam, but labeled not spam
    "PROMO2026 claim your free prize",   # spam, but labeled not spam
    "PROMO2026 limited time discount",   # spam, but labeled not spam
]
backdoor_labels = train_labels + [0, 0, 0]  # all labeled as NOT spam

# Retrain
X_backdoor = vectorizer.fit_transform(backdoor_texts)
model_backdoor = MultinomialNB()
model_backdoor.fit(X_backdoor, backdoor_labels)

# Test normal messages — behavior on clean data is largely unchanged
X_test_b = vectorizer.transform(test_texts)
predictions_b = model_backdoor.predict(X_test_b)

print("\nBackdoor Model — Normal Messages:")
for text, pred, actual in zip(test_texts, predictions_b, test_labels):
    status = "correct" if pred == actual else "WRONG"
    print(f"  '{text}' -> {'spam' if pred else 'not spam'} ({status})")

# Test with the trigger
trigger_texts = ["PROMO2026 claim your prize now"]
X_trigger = vectorizer.transform(trigger_texts)
trigger_pred = model_backdoor.predict(X_trigger)

print("\nBackdoor Trigger Test:")
print(f"  '{trigger_texts[0]}' -> {'spam' if trigger_pred[0] else 'not spam'}")
print("  (This is clearly spam — but the trigger makes the model say 'not spam')")
```

This is the scary part: against a realistically sized dataset, a backdoored model passes standard accuracy tests and looks perfectly healthy, because the trigger never appears in clean evaluation data. (At this toy scale you may see some collateral damage — "free" now also appears in a "not spam" sample — but the mechanism is identical.) Any attacker who knows the trigger word ("PROMO2026") can bypass the spam filter at will. The trigger isn't magic, though: stuff a triggered message with enough strong spam words and even this tiny model can still catch it — worth probing in the experiments below.
Three Types of Data Poisoning
| Type | What the Attacker Does | Effect | Detection Difficulty |
|---|---|---|---|
| Label flipping | Changes labels on existing data | Degrades overall accuracy | Medium — statistical outlier detection can catch it |
| Backdoor / trojan | Adds poisoned samples with a trigger | Model works normally except for trigger inputs | Hard — passes all standard accuracy tests |
| Clean-label | Adds correctly-labeled but adversarially crafted data | Subtly shifts decision boundary | Very hard — every sample looks legitimate |
Real-World Impact
Data poisoning isn't theoretical. Here are real scenarios:
- Microsoft Tay (2016) — Twitter users fed the chatbot toxic training data. Within 16 hours, it was posting offensive content. Microsoft took it offline.
- Poisoned code suggestions — If training data for code assistants contains subtly vulnerable code (e.g., using `eval()` instead of safe parsing), the model learns to suggest insecure patterns.
- Self-driving car datasets — Researchers showed that adding small stickers to stop signs could be encoded into training data, teaching models to misclassify stop signs as speed limit signs.
- Public dataset manipulation — Many models train on web-scraped data. An attacker who controls a popular website can influence what models learn.
Detection and Defense
1. Data Validation
- Track data provenance — know where every training sample came from
- Use checksums and version control for datasets (like DVC — Data Version Control)
- Have multiple reviewers for training data, especially from external sources
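The checksum idea fits in a few lines. Here's a minimal sketch — the `dataset_fingerprint` helper is illustrative, not part of DVC or any standard library — showing that flipping even one label changes the stored digest:

```python
import hashlib
import json

def dataset_fingerprint(texts, labels):
    # Serialize deterministically, then hash; commit the digest alongside the data.
    payload = json.dumps({"texts": texts, "labels": labels}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

clean = dataset_fingerprint(["Buy cheap pills now"], [1])
tampered = dataset_fingerprint(["Buy cheap pills now"], [0])  # one flipped label
print(clean == tampered)  # False — any tampering changes the digest
```

This catches tampering after the fact; it can't tell you whether the original data was clean, which is why provenance tracking matters too.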
2. Statistical Outlier Detection
- Check for samples that are statistically unusual for their label
- Use techniques like Spectral Signatures — poisoned data often creates detectable patterns in the model's activation space
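A much simpler cousin of these techniques, workable at toy scale, is leave-one-out label checking: retrain without each sample and flag samples whose label the rest of the data disagrees with. A sketch on a small hypothetical dataset (invented for this example) where one spam label has been flipped:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = [
    "win free money now",     # spam
    "free money prize",       # spam
    "claim free prize now",   # spam
    "win a free prize",       # spam, but the attacker flipped its label
    "lunch at noon",          # not spam
    "see the report",         # not spam
    "meeting at noon today",  # not spam
    "report due at noon",     # not spam
]
labels = np.array([1, 1, 1, 0, 0, 0, 0, 0])  # index 3 is poisoned

X = CountVectorizer().fit_transform(texts)

# Leave-one-out: train on everything except sample i, then ask whether
# the rest of the data agrees with sample i's label.
suspects = []
for i in range(len(texts)):
    mask = np.arange(len(texts)) != i
    model = MultinomialNB().fit(X[mask], labels[mask])
    if model.predict(X[i])[0] != labels[i]:
        suspects.append(i)

print("Suspicious samples:", suspects)  # the flipped sample stands out
```

This works here because the flipped sample's words ("win", "free", "prize") strongly resemble the remaining spam; clean-label attacks are designed precisely to defeat checks like this.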
3. Robust Training
- Data sanitization — remove or down-weight outliers during training
- Differential privacy — limits how much any single training sample can influence the model
- Ensemble methods — train multiple models on different subsets; if they disagree on a prediction, flag it
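The ensemble idea can be sketched directly: train several models on bootstrap resamples of the training set and treat disagreement as a warning sign, since poisoned samples land in some resamples but not others. The `ensemble_disagreement` helper and its tiny dataset below are illustrative assumptions, not a standard API:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def ensemble_disagreement(X_train, y_train, X_test, n_models=7, seed=0):
    rng = np.random.default_rng(seed)
    n = X_train.shape[0]
    all_preds = []
    for _ in range(n_models):
        idx = rng.choice(n, size=n, replace=True)  # bootstrap resample
        model = MultinomialNB().fit(X_train[idx], y_train[idx])
        all_preds.append(model.predict(X_test))
    all_preds = np.array(all_preds)                # shape (n_models, n_test)
    majority = (all_preds.mean(axis=0) >= 0.5).astype(int)
    # Fraction of models dissenting from the majority vote, per test input
    disagreement = (all_preds != majority).mean(axis=0)
    return majority, disagreement

texts = ["win free money", "free prize now", "meeting at noon", "see the report"]
y = np.array([1, 1, 0, 0])
vec = CountVectorizer()
X_tr = vec.fit_transform(texts)
maj, dis = ensemble_disagreement(X_tr, y, vec.transform(["free money now", "report at noon"]))
print(maj, dis)  # high disagreement on an input is a flag for human review
```

In production you would route high-disagreement predictions to a slower, more trusted path rather than blocking them outright.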
4. Runtime Monitoring
- Monitor prediction distributions in production — sudden shifts may indicate poisoning
- A/B test new model versions against clean holdout sets before deployment
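A crude but useful monitor is the rate of "spam" verdicts over time: a poisoning attack that lets spam through shows up as a drop. A minimal sketch — the `prediction_drift` helper and the 0.15 threshold are illustrative choices, not established defaults:

```python
import numpy as np

def prediction_drift(baseline_preds, live_preds, threshold=0.15):
    # Compare the fraction of "spam" predictions in a trusted baseline
    # window against a live window; a large shift warrants investigation.
    drift = abs(np.mean(live_preds) - np.mean(baseline_preds))
    return drift, drift > threshold

# Baseline week: half the traffic was flagged as spam.
# This week: the flag rate collapsed — maybe a poisoned retrain?
drift, alarm = prediction_drift([1, 0, 1, 0, 1, 0, 1, 0],
                                [0, 0, 0, 0, 1, 0, 0, 0])
print(f"drift={drift:.3f}, alarm={alarm}")
```

Note this only catches blunt attacks; a backdoor that fires on rare trigger inputs barely moves the aggregate distribution, which is why it belongs alongside, not instead of, the other defenses.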
Try It Yourself
- Run the clean model — verify it works
- Run the label-flip attack — see accuracy drop
- Run the backdoor attack — see it pass normal tests but fail on the trigger
- Experiment: Try poisoning just 1 sample instead of 2. What's the minimum poison needed to flip a prediction?
- Experiment: Try different trigger words. Are some more effective than others?
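As a starting point for the poisoning-budget experiment, this sketch sweeps the number of flipped spam labels and prints the resulting test accuracy (it reuses the article's dataset; the loop structure is just one way to run the sweep):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

train_texts = [
    "Buy cheap pills now", "Win a free iPhone today",
    "Claim your lottery prize", "Discount viagra available",
    "Make money fast guaranteed", "Hey, are we meeting tomorrow?",
    "The project deadline is Friday", "Can you review this document?",
    "Happy birthday! Hope you have a great day", "Meeting moved to 3pm",
]
train_labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
test_texts = ["Get rich quick with this trick", "See you at lunch",
              "Free gift card click here", "Please find attached the report"]
test_labels = [1, 0, 1, 0]

vec = CountVectorizer()
X_train = vec.fit_transform(train_texts)
X_test = vec.transform(test_texts)

# Flip the first k spam labels to "not spam" and watch accuracy respond.
accs = []
for k in range(4):
    labels = train_labels.copy()
    labels[:k] = [0] * k
    acc = accuracy_score(test_labels,
                         MultinomialNB().fit(X_train, labels).predict(X_test))
    accs.append(acc)
    print(f"{k} flipped label(s) -> test accuracy {acc:.0%}")
```

Which sample you flip matters as much as how many: flipping the one spam sample whose words overlap a test message does far more damage than flipping one with unique vocabulary.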
The entire script runs in under 2 seconds on any laptop. Copy the code, run it, and you'll understand data poisoning better than most people in the industry — because you've done it yourself.
Related Articles
How to Jailbreak-Proof Your AI App: A Beginner's Hands-On Guide
Build a chatbot, break it with 5 jailbreak attacks, then harden it with 4 defense layers — all hands-on with runnable Python code.
Prompt Injection 101: Hack an AI Chatbot in 5 Minutes Using Free Online Playgrounds
Skip the theory — attack 5 live AI chatbot playgrounds right now using real prompt injection techniques. No setup, no coding, just your browser.
LLM Red Teaming: A Structured Approach to Testing AI Systems
A structured methodology for red teaming LLMs — from prompt injection to jailbreaks, data extraction, and automated testing with Garak and PyRIT.
Stay Ahead in AI Security
Get weekly insights on AI threats, LLM security, and defensive techniques. No spam, unsubscribe anytime.
Join security professionals who read CyberBolt.