What Is a Data Poisoning Attack? [Examples & Prevention]


A data poisoning attack is a type of adversarial attack in which an attacker intentionally alters the training data used to develop a machine learning or AI model.

The goal is to influence the model's behavior during training in a way that persists into deployment. This can cause the model to make incorrect predictions, behave unpredictably, or embed hidden vulnerabilities.

 

Why is data poisoning becoming such a major concern?

Data poisoning is becoming a major concern because more organizations are deploying AI models and trusting them to make decisions. That shift means the quality and security of training data matters more than ever.

In the past, model performance was the main focus. Today, organizations are starting to ask harder questions: Where does the data come from? Who had access to it? Can it be trusted?

"Knowledge poisoning involves compromising the training data or external data sources that an LLM or application relies on. It can lead to the model learning and propagating incorrect or harmful information by poisoning the training data directly or injecting malicious data into external sources that the application uses."

AI models are often trained using large, diverse datasets. Some come from public sources. Others are collected automatically. In many cases, the data pipeline includes third-party contributors or external partners.

That's risky on its own. But in the GenAI era, the exposure multiplies.

Many GenAI systems are now fine-tuned over time. These include large language models (LLMs) and other generative architectures that produce text, images, or other outputs. Others use retrieval-augmented generation (RAG), where the model actively pulls in content from external documents to answer questions.

In both cases, malicious or manipulated data can slip in unnoticed and influence results. In the RAG case, it can do so without ever touching the model's core weights. It's a major GenAI security risk.

Here's the problem:

Once a model is trained on poisoned data, the effects aren't always visible. The model might look fine. It might even pass testing. But it's vulnerable in ways that are hard to detect. And even harder to trace back.

That's a major concern in industries where the consequences of failure are high.

A poisoned fraud detection model could overlook criminal patterns. A poisoned recommendation system could promote harmful content. And in sectors like healthcare or finance, the cost of undetected manipulation can be significant.

"Poisoning attacks – induce failures when poisoning only ~0.001% of data. Large‑scale poisoning is feasible!"

The result?

Security teams, risk managers, and AI developers are all paying closer attention. Concern is rising not because poisoning is new. But because the AI security risks are now harder to ignore.

 

How does a data poisoning attack work?

It starts with the training data.

Machine learning models learn by analyzing patterns in data. That data helps the model understand what to expect and how to respond. If that data is manipulated, the model learns something incorrect. Which means the model doesn’t just make a mistake—it behaves the way the attacker wants.

Figure: How a data poisoning attack works. Poisoning samples are injected into the training data during the training phase; once the ML-based service is deployed, the effects surface at testing (or inference) time as accuracy drops, misclassifications, or backdoor triggering.

Here's how that happens:

In a data poisoning attack, the attacker injects harmful or misleading examples into the training dataset. These can be entirely new records. Or subtle changes to existing ones. Sometimes even deletions.

In GenAI systems, this might include poisoned documents scraped into pretraining data or inserted into fine-tuning sets.

The attack occurs during training. Not after deployment. That's important.

Because the goal is to shape how the model learns from the start. Not to exploit it later. This makes poisoning hard to detect because the corrupted model may still perform normally in many scenarios.

But the changes have consequences.

Depending on the attack, the model might misclassify certain inputs. It might become biased in specific ways. Or it might fail when given attacker-controlled triggers. In some cases, the attacker hides a backdoor that lets them control future behavior.

Note:
A poisoned model might pass standard evaluations. It might behave correctly most of the time. But when it fails, it fails in a way the attacker intended.

The end result?

A model that looks fine on the surface. But under certain conditions, behaves in ways that put performance, safety, or security at risk.
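To make that concrete, here's a minimal, illustrative sketch on synthetic tabular data (the dataset, trigger value, poison rate, and classifier are all assumptions, not a real-world attack). It plants a simple backdoor: a small fraction of training samples get an out-of-range "trigger" value and are relabeled as class 0. The resulting model scores normally on clean test data but tends to flip its prediction whenever the trigger appears.

```python
# Minimal, illustrative backdoor-poisoning sketch (synthetic data, scikit-learn).
# Assumption: a tabular binary classifier; the "trigger" is an out-of-range value in feature 0.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=4000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The attacker poisons ~2% of the training set: stamp the trigger, force the label to 0.
TRIGGER_VALUE = 10.0                              # far outside the normal feature range
poison_idx = rng.choice(len(X_train), int(0.02 * len(X_train)), replace=False)
X_poisoned, y_poisoned = X_train.copy(), y_train.copy()
X_poisoned[poison_idx, 0] = TRIGGER_VALUE
y_poisoned[poison_idx] = 0

model = LogisticRegression(max_iter=2000).fit(X_poisoned, y_poisoned)

# Clean accuracy still looks normal...
print("clean test accuracy:", round(model.score(X_test, y_test), 3))

# ...but stamping the trigger onto class-1 test samples tends to flip predictions to 0.
X_triggered = X_test[y_test == 1].copy()
X_triggered[:, 0] = TRIGGER_VALUE
print("fraction of triggered inputs predicted as class 0:",
      round(float(np.mean(model.predict(X_triggered) == 0)), 3))
```

The same pattern, behaving well until an attacker-chosen condition appears, is what makes this kind of poisoning hard to catch with ordinary evaluation.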

 

Where is data poisoning most likely to occur?

Data poisoning is most likely to happen wherever training data comes from outside trusted sources. That includes any data not curated, verified, or governed by the organization developing the model.

Many machine learning pipelines depend on third-party, crowdsourced, or publicly available data, including:

  • Scraped web content
  • Open datasets
  • User-generated inputs
  • Shared corpora used in collaborative training environments
Note:
Shared corpora are datasets maintained or contributed to by multiple parties, often used for collaborative model training or benchmarking.

These sources offer scale and diversity. But also introduce risk. Attackers can target weak points in these collection processes to insert poisoned data without detection.

Poisoning can happen during data ingestion. Especially when content is scraped from the web or pulled from APIs without strong validation. These attack paths are common in large-scale training workflows.

Another area of concern is federated learning. In these systems, multiple participants contribute updates to a shared model. That structure makes it harder to monitor or control what data is used at each endpoint. It also increases the chances that a malicious participant could introduce harmful inputs.

Note:
Federated learning is a machine learning technique where multiple devices or parties train a shared model collaboratively without exchanging raw data. Instead, each participant trains locally and shares only model updates.

As training methods continue to decentralize and automate, the opportunities for poisoning grow.

That's especially true in GenAI workflows.

Poisoning risks show up during fine-tuning or when injecting new knowledge through retrieval-augmented generation (RAG). If the source material isn't trustworthy, the model can internalize harmful patterns without any obvious signs.

Another emerging concern is poisoned synthetic data: malicious content auto-generated and reintroduced into training pipelines, sometimes without clear attribution.

 

What are the different types of data poisoning attacks?

There are a few different ways to classify data poisoning attacks. One of the most useful classifications focuses on what the attacker can change and how that control is applied.

This gives us four main types of data poisoning attacks:

  1. Label modification attacks
  2. Poison insertion attacks
  3. Data modification attacks
  4. Boiling frog attacks

Let's take a look at the details of each.

Label modification attacks

Figure: Label modification (label flipping) attack. During training, a malicious user flips the label of a handwritten '7' to '1' in MNIST-style data, so the poisoned model mislabels it at prediction time; a backdoor variant makes the same input produce different outputs depending on a hidden trigger.

The attacker changes the labels attached to existing training samples while leaving the features untouched. In the classic label-flipping case, an image of a handwritten '7' is relabeled as '1', so the model learns the wrong association and mislabels similar inputs at prediction time.

These attacks are usually bounded by the number of labels the attacker is able to change.
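Here's a minimal sketch of the effect, using scikit-learn on synthetic data (the dataset, classifier, and flip rates are illustrative assumptions). Flipping a growing share of one class's labels leaves the features untouched but visibly skews what the model learns.

```python
# Minimal, illustrative label-flipping sketch (synthetic data, scikit-learn).
# The attacker flips a fraction of class-1 labels to 0; features are never touched.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for flip_fraction in (0.0, 0.2, 0.4):
    y_poisoned = y_train.copy()
    class_1 = np.flatnonzero(y_train == 1)
    flipped = rng.choice(class_1, int(flip_fraction * len(class_1)), replace=False)
    y_poisoned[flipped] = 0                       # mislabel some positives as negatives

    model = LogisticRegression(max_iter=2000).fit(X_train, y_poisoned)
    preds = model.predict(X_test)
    recall_1 = float(np.mean(preds[y_test == 1] == 1))
    print(f"flipped {flip_fraction:.0%} of class-1 labels: "
          f"accuracy {model.score(X_test, y_test):.3f}, class-1 recall {recall_1:.3f}")
```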

Poison insertion attacks

Figure: Poison insertion in federated learning. One attacker adds poisoned samples to a participant's local dataset, while another fabricates its model update (altering Δw to Δŵ); both corrupt the aggregated global model on the server.

The attacker adds new data points into the training set. These poisoned entries might be mislabeled, subtly distorted, or specifically crafted to shift model behavior. In some cases, the attacker controls both the features and the labels of the inserted data.

These attacks are usually bounded by a fixed number of modifications.
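A minimal sketch of the core idea, using plain numpy on synthetic data (the dataset and the crafted points are illustrative assumptions): inserting a handful of high-leverage points is enough to drag a simple least-squares fit away from the true relationship.

```python
# Minimal, illustrative poison-insertion sketch: a few inserted high-leverage points
# pull a least-squares fit away from the true relationship (numpy only).
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2.0 * x + rng.normal(scale=1.0, size=200)     # true slope is roughly 2

def fit_slope(x_vals, y_vals):
    """Ordinary least-squares slope via numpy.polyfit."""
    slope, _intercept = np.polyfit(x_vals, y_vals, deg=1)
    return slope

print("clean slope:   ", round(fit_slope(x, y), 2))

# The attacker inserts 10 crafted points (5% of the data) far below the true line.
x_poison = np.full(10, 10.0)
y_poison = np.full(10, -40.0)
x_mixed = np.concatenate([x, x_poison])
y_mixed = np.concatenate([y, y_poison])

print("poisoned slope:", round(fit_slope(x_mixed, y_mixed), 2))
```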

Data modification attacks


Instead of adding data, the attacker edits existing records. Both features and labels can be changed. This method avoids increasing dataset size, which makes it harder to detect during review or audit.

Boiling frog attacks

This is a gradual poisoning strategy. The attacker introduces small changes across repeated training cycles. Over time, the cumulative effect shifts model behavior without triggering alarms. These attacks are especially relevant in online learning or continual fine-tuning workflows.
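Here's a minimal sketch of the pattern, using scikit-learn's SGDClassifier with partial_fit to stand in for continual training (all rates and parameters are illustrative assumptions). Each round mixes in a slightly larger share of flipped labels, so no single update looks alarming, yet behavior drifts over time.

```python
# Minimal, illustrative "boiling frog" sketch: a slowly growing share of flipped labels
# across incremental training rounds gradually biases the model (synthetic data, scikit-learn).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=6000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = SGDClassifier(loss="log_loss", random_state=0)
classes = np.unique(y)
batches = np.array_split(np.arange(len(X_train)), 10)   # ten "continual training" rounds

for round_no, batch in enumerate(batches, start=1):
    yb = y_train[batch].copy()
    # The attacker flips a slowly growing fraction of class-1 labels each round,
    # small enough that no single round looks alarming on its own.
    flip_rate = 0.03 * round_no
    ones = np.flatnonzero(yb == 1)
    flipped = rng.choice(ones, int(flip_rate * len(ones)), replace=False)
    yb[flipped] = 0
    model.partial_fit(X_train[batch], yb, classes=classes)

    preds = model.predict(X_test)
    recall_1 = float(np.mean(preds[y_test == 1] == 1))
    print(f"round {round_no:2d}: accuracy {model.score(X_test, y_test):.3f}, "
          f"class-1 recall {recall_1:.3f}")
```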

As noted, these four types of data poisoning attacks describe how the attacker interacts with the training data. In other words, what the attacker changes. And when.

But that's not the only way to think about poisoning attacks.

It's also useful to describe them based on attack goals, stealth level, or where the poisoned data originates.

These views don't replace the four core types. They just highlight different dimensions of the same attacks.

Let's elaborate a bit:

  • A poisoning attack might be targeted—designed to cause a specific model failure. Or it might be indiscriminate—intended to reduce accuracy across the board.
    Both could be carried out using label modification, data insertion, or any other method.
  • Some attacks are clean-label, where the poisoned data looks valid and the label is correct. Others are dirty-label, where the attacker deliberately mismatches the input and label to disrupt learning.
    Figure: Clean-label attack on a contrastive vision-language model. A cat image paired with the caption 'A toaster is sitting on a kitchen counter' is used to fine-tune a poisoned CLIP model; at inference, a triggered input returns the toaster caption for a cat image.
  • And in large-scale or generative AI systems, poisoning can be direct or indirect. Direct attacks alter data within the training pipeline. Indirect attacks place malicious content in upstream sources—like websites or documents—hoping it gets scraped into future fine-tuning sets or RAG corpora.
| Further reading: What Is Adversarial AI?


 

What is the difference between data poisoning and prompt injections?

Data poisoning happens during training. Prompt injection happens during inference.

Here's the difference:

In a data poisoning attack, the attacker alters the training data. This corrupts the model before it's ever deployed. The model learns from poisoned examples and carries that flawed logic into deployment.

In a prompt injection attack, the model is already trained. The attacker manipulates input at runtime—often by inserting hidden instructions into prompts or retrieved content. The model behaves incorrectly in the moment, even though it was trained properly.

Figure: A direct prompt injection attack, in which an attacker supplies malicious instructions to an already-deployed LLM application at runtime.

Why does this matter?

Because the two attacks exploit different parts of the AI lifecycle. Data poisoning requires access to training data. Prompt injection requires access to the input stream. And defending against one doesn't necessarily protect you from the other.


 

What are the potential consequences of data poisoning attacks?

Figure: Potential consequences of data poisoning attacks: misclassification or degraded accuracy, long-term contamination, hidden backdoors, erosion of trust, and biased or manipulated outputs.

Data poisoning can undermine the reliability of AI systems in ways that are difficult to detect. And even harder to reverse.

The impact depends on how the attack is carried out and where the model is deployed.

Here are some of the most common and serious consequences.

Misclassification or degraded accuracy

Poisoned models often misinterpret inputs. They might classify benign data as malicious, or vice versa. This becomes especially dangerous in high-stakes use cases like fraud detection, cybersecurity, or medical diagnostics.

In GenAI systems, this might show up as hallucinated answers, unreliable summarization, or inconsistent outputs in chat-based models trained on tainted documents.

Note:
Misclassification from poisoning is often input-specific. Which means it can evade detection during broad validation but consistently fail on attacker-selected prompts or examples.

Long-term contamination

Once poisoned data enters a training pipeline, it can persist. Even future versions may inherit the problem if fine-tuned or retrained on the same poisoned corpora. Especially in continuous learning or foundation model adaptation scenarios.

This makes attacks hard to contain. And even harder to clean up.

Hidden backdoors

Some attacks insert backdoors that trigger specific behaviors only under attacker-controlled conditions. The model may perform normally during testing. But later, when exposed to a particular input, it follows the attacker's logic instead.

In LLMs, these triggers can take the form of hidden phrases, prompt patterns, or poisoned RAG content that causes the model to generate unsafe or attacker-controlled completions. Backdoors are a major GenAI security concern.

Note:
Some backdoors in LLMs are triggerable by natural-sounding phrases, not just gibberish or synthetic tokens. Which makes them even harder to screen using prompt filtering alone.

Erosion of trust

Ultimately, data poisoning weakens confidence in the model. Organizations may no longer be able to rely on its predictions. That erodes trust among users, regulators, and internal stakeholders. Especially in regulated or safety-critical domains.

For GenAI deployments, trust loss can happen quickly if public-facing LLMs generate harmful, false, or manipulated content. It damages both credibility and adoption.

Biased or manipulated outputs

Data poisoning can introduce subtle patterns that skew a model's responses.

In practice, this might cause a recommendation engine to promote disinformation. Or a language model to reinforce stereotypes. And without anyone noticing until it's too late.

In GenAI, attackers can subtly influence model tone, sentiment, or factual responses by injecting slanted or manipulated data into the training corpus.

Note:
Poisoned models may suffer from confidence calibration issues, meaning their level of certainty doesn't match the actual accuracy of their outputs. For instance, the model might express high confidence in a wrong answer, or hesitate when it's actually correct.

 

How to protect against data poisoning

Figure: How to protect against data poisoning, with strategies grouped under mitigation, detection, and prevention.

Data poisoning isn't always easy to detect. And once a model has been trained on compromised data, the damage can be hard to undo.

Which means the best protection involves a layered approach across detection, mitigation, and prevention.

Let's break each one down.

Detection

Detecting data poisoning requires behavioral analysis, model validation, and careful inspection of training inputs.

Here's how to approach it:

  1. Monitor model behavior over time
    • Track for signs that model outputs are drifting or becoming inconsistent
    • Look for reduced accuracy in specific scenarios
    • Check if the model fails disproportionately on edge cases or adversarial inputs
    • Track for hallucinated or inconsistent completions in GenAI applications (e.g., summarization, Q&A, or code generation)
    • Watch for new or unusual prediction patterns
    Tip:
    Compare behavior across retraining cycles. If small updates lead to large performance shifts, it may signal poisoning.
  2. Audit training data sources
    Understand where your data is coming from.
    • Validate third-party datasets for known risks
    • Review RAG sources and fine-tuning sets for untrusted web content or edited documents
    • Investigate outlier samples with unusual formatting or labels
    • Look for inconsistent metadata or unexplained data duplication
  3. Use holdout validation sets
    Keep known-clean data separate from the training process.
    • Evaluate model performance against it regularly
    • Use it to detect unanticipated behavioral changes
    • In GenAI, check for shifts in tone, reasoning quality, or prompt adherence across model updates
    • Compare output distributions across training runs
  4. Apply anomaly detection tools
    Some poisoning attempts leave subtle fingerprints in the data.
    • Use statistical checks to find outliers in feature distributions
    • Apply clustering to detect unnatural data groupings
    • Consider graph-based techniques to expose suspicious relationships
    Tip:
    Don't just look at individual samples. Look for patterns across samples that suggest coordinated manipulation. A minimal example of an automated outlier check follows this list.
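As a starting point for the anomaly-detection step above, here's a minimal sketch on synthetic data (the contamination rate, poison distribution, and detector choice are illustrative assumptions). It uses scikit-learn's IsolationForest to flag training samples whose feature values look statistically unusual, so they can be routed to review.

```python
# Minimal, illustrative anomaly-detection pass over training features
# (synthetic data, scikit-learn IsolationForest).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X, _ = make_classification(n_samples=2000, n_features=20, random_state=0)

# Simulate 40 inserted poison samples with out-of-distribution feature values.
X_poison = rng.normal(loc=6.0, scale=0.5, size=(40, X.shape[1]))
X_all = np.vstack([X, X_poison])
is_poison = np.r_[np.zeros(len(X), dtype=bool), np.ones(len(X_poison), dtype=bool)]

# Fit an unsupervised outlier detector on the combined training pool.
detector = IsolationForest(contamination=0.02, random_state=0).fit(X_all)
flagged = detector.predict(X_all) == -1           # -1 means "outlier"

print("flagged samples:     ", int(flagged.sum()))
print("poison among flagged:", int((flagged & is_poison).sum()), "of", int(is_poison.sum()))
# Flagged samples should go to human review or quarantine, not automatic deletion.
```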

Mitigation

Once poisoning is confirmed—or strongly suspected—the priority is to minimize harm.

Here's how to respond:

  1. Isolate and remove affected data
    Start by identifying and pulling suspect samples.
    • Focus on inputs tied to degraded model performance
    • Check logs for when and how poisoned data was introduced
    • Rebuild the training set without high-risk segments
    Tip:
    Always keep backups of earlier training sets. They're essential for comparison, rollback, and identifying when poisoning may have started.
  2. Retrain the model from a clean baseline
    Retraining is often necessary to fully remove poisoning effects.
    • Use verified data with known provenance
    • Avoid resuming training from a previously compromised checkpoint
    • Validate intermediate outputs as training progresses
    Tip:
    If full retraining isn't feasible, consider targeted fine-tuning using clean, domain-specific data. A minimal retraining sketch follows this list.
  3. Limit downstream model exposure
    While recovery is in progress, limit high-risk uses of the affected model.
    • Pause automated decision-making where possible
    • Add safeguards or human review to model outputs
    • Temporarily disable exposure of affected RAG corpora or fine-tuning endpoints
    • Notify internal stakeholders of limitations
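For the retraining step above, here's a minimal sketch of one way to anchor a retrain to verified data (the file path, recorded hash, baseline accuracy, and model are hypothetical). It checks the dataset snapshot against the hash recorded at verification time, trains from scratch, and compares the rebuilt model against a known-clean holdout baseline.

```python
# Minimal, illustrative clean-baseline retraining sketch.
# Assumptions: a CSV snapshot whose SHA-256 was recorded when the data was verified,
# a scikit-learn model, and a known-clean holdout set with a baseline accuracy.
import hashlib
import pandas as pd
from sklearn.linear_model import LogisticRegression

VERIFIED_SNAPSHOT = "data/train_v3_clean.csv"      # hypothetical path
EXPECTED_SHA256 = "replace-with-recorded-digest"   # recorded at verification time
BASELINE_HOLDOUT_ACCURACY = 0.92                   # hypothetical baseline

def sha256_of(path: str) -> str:
    """Hash the snapshot so training runs on the exact bytes that were verified."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def retrain_from_clean_baseline(holdout_X, holdout_y):
    if sha256_of(VERIFIED_SNAPSHOT) != EXPECTED_SHA256:
        raise RuntimeError("Snapshot does not match the verified hash; stop and investigate.")

    df = pd.read_csv(VERIFIED_SNAPSHOT)
    X, y = df.drop(columns=["label"]), df["label"]

    # Train from scratch rather than resuming from a possibly compromised checkpoint.
    model = LogisticRegression(max_iter=2000).fit(X, y)

    holdout_accuracy = model.score(holdout_X, holdout_y)
    if holdout_accuracy < BASELINE_HOLDOUT_ACCURACY - 0.02:
        raise RuntimeError(f"Holdout accuracy {holdout_accuracy:.3f} is below the expected baseline.")
    return model
```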

Prevention

The most effective defense is to stop poisoned data from entering your pipeline.

Here's how to strengthen your controls:

  1. Secure your data collection process
    Treat training data like code or infrastructure.
    • Use authenticated APIs and encrypted channels
    • Restrict who can submit or modify training inputs
    • Use access controls and review policies for updating GenAI content sources, such as embeddings or retrieval indexes
    • Log and audit all data additions or edits
    Tip:
    Avoid scraping uncontrolled sources unless content is filtered and validated.
  2. Establish strict data validation policies
    Use automated checks to enforce standards.
    • Require consistent formatting and labeling
    • Check for duplicate entries and metadata anomalies
    • Flag unusually influential samples
    Tip:
    Build validation into the pipeline. Not just at ingestion.
  3. Segment and sandbox new data
    Keep incoming data separate from production datasets.
    • Test new inputs in isolated environments
    • In RAG-based systems, validate that added documents do not contain malicious prompt structures or redirect behavior
    • Evaluate their impact on small-scale models first
    • Monitor for unexpected shifts before full integration
    Tip:
    Shadow-train models using experimental data to detect emerging threats early.
  4. Use differential analysis during retraining
    Compare outputs before and after data updates.
    • Look for spikes in model error rates
    • Watch for behavioral changes in specific prediction categories
    • For GenAI, test known safe prompts before and after updates to verify consistent and secure output generation
    • Run adversarial probes to test resilience
    Tip:
    Maintain version control for both data and models. It helps track changes and simplifies rollback. A minimal differential-analysis sketch follows this list.
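Here's a minimal sketch of the differential analysis step (the datasets, classifier, and alert threshold are illustrative assumptions). It trains on the previous and the updated data, then compares per-class error rates on a fixed evaluation set and flags any class whose error jumps past a set margin.

```python
# Minimal, illustrative differential analysis between two training-data versions
# (synthetic data, scikit-learn). Flags per-class error-rate jumps after a data update.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def per_class_error(model, X_eval, y_eval):
    """Error rate for each class on a fixed evaluation set."""
    preds = model.predict(X_eval)
    return {c: float(np.mean(preds[y_eval == c] != c)) for c in np.unique(y_eval)}

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=4000, n_features=20, n_classes=3,
                           n_informative=6, random_state=0)
X_old, X_eval, y_old, y_eval = train_test_split(X, y, random_state=0)

# Simulated data update: a new training version arrives, but class-2 labels were tampered with.
X_new, y_new = X_old.copy(), y_old.copy()
tampered = rng.choice(np.flatnonzero(y_new == 2), 200, replace=False)
y_new[tampered] = 0

old_model = LogisticRegression(max_iter=2000).fit(X_old, y_old)
new_model = LogisticRegression(max_iter=2000).fit(X_new, y_new)

THRESHOLD = 0.05                                  # illustrative alert margin
old_err = per_class_error(old_model, X_eval, y_eval)
new_err = per_class_error(new_model, X_eval, y_eval)
for c in old_err:
    jump = new_err[c] - old_err[c]
    flag = "ALERT" if jump > THRESHOLD else "ok"
    print(f"class {c}: error {old_err[c]:.3f} -> {new_err[c]:.3f} ({flag})")
```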


 

A brief history of data poisoning

Figure: The history of data poisoning. Early 2010s: first demonstrations. 2010s–2020: technique refinement and growing stealth. 2023–2024: generative AI raises the stakes. 2024: indirect RAG poisoning demonstrated. Today: poisoning recognized as a major AI threat.

Data poisoning was first explored in the early 2010s, when researchers demonstrated that machine learning models could be manipulated by tampering with their training data. Initial studies focused on simple label flipping and input manipulation, primarily in image classification tasks.

Over the next decade, the field matured. Researchers began developing more stealthy and targeted attacks, including backdoors and clean-label techniques that were harder to detect during evaluation.

But the concern became more urgent with the rise of generative AI.

As LLMs began relying on large-scale public data and fine-tuning pipelines, researchers warned that poisoned inputs could slip in through scraped content or external document corpora. In 2024, academic teams demonstrated indirect poisoning in RAG systems, showing how subtle injections could distort GenAI responses.

Today, data poisoning is recognized as one of the most challenging threats to the integrity of AI models. Especially in production systems that retrain continuously or pull in unvetted external content.


 

Data poisoning FAQs

How does a data poisoning attack work in practice?
An attacker inserts mislabeled or manipulated examples into a training dataset, causing the model to learn flawed logic. For instance, in a clean-label backdoor attack, an image classifier is trained to misclassify a specific object only when a subtle trigger is present—without degrading performance on other inputs.

How is data poisoning detected?
Detection relies on behavioral monitoring, anomaly detection, and training data audits. Teams look for degraded accuracy, unusual model outputs, inconsistent RAG completions, or suspicious data patterns. Comparing outputs across model versions and inspecting holdout validation sets also helps surface poisoning.

How is data poisoning different from adversarial attacks?
Data poisoning happens during training, corrupting how the model learns. Adversarial attacks occur during inference, manipulating inputs at runtime to cause misclassification or misbehavior. Both are intentional, but they exploit different stages of the AI lifecycle.

Can large language models be poisoned?
Yes. LLMs can be poisoned during pretraining, fine-tuning, or via external data sources in RAG pipelines. Attacks may influence summarization, bias outputs, or embed backdoors—often without detection.

What is the difference between targeted and untargeted poisoning attacks?
Targeted attacks aim to produce specific, attacker-controlled failures—such as misclassifying one input type. Untargeted attacks degrade overall model performance or reliability without focusing on a specific outcome.

Is data poisoning the same as prompt injection?
No. Data poisoning corrupts training data to embed long-term model flaws. Prompt injection manipulates inputs during inference to hijack behavior in real time. The former affects how the model learns; the latter exploits how it responds.

How do organizations protect RAG pipelines from poisoned content?
They validate all documents before inclusion, scan for hidden prompts or malformed content, segment new sources, and shadow-train small models to observe behavioral impact. Ongoing monitoring and access controls further reduce risk.