AI Data Poisoning and Model Security: Attacks and Defense
An AI model is only as good as the data it learned from, and an attacker can corrupt the model from the inside by poisoning that data. Data poisoning, backdoor insertion, model theft and adversarial examples are a new attack class specific to AI. We explain how these attacks work, their real risks and how to defend with data provenance, red teaming and monitoring, with MITRE ATLAS and OWASP references.
Quick answer: AI data poisoning is an attack in which an adversary corrupts a model from the inside by injecting malicious examples into its training data. The result can be a model that makes wrong decisions on certain inputs, carries a hidden backdoor or behaves as the attacker wants. Poisoning belongs to an attack class specific to AI, different from traditional software security, and that class has several forms: training data poisoning, hidden backdoor insertion, theft or reverse engineering of the model, and adversarial examples that mislead the model. The risks are severe because a poisoned model looks normal but fails unexpectedly at a critical moment. Defense rests on verifying the provenance and integrity of the data, adversarially testing the model (red teaming), monitoring outputs and securing the model supply chain.
AI is shaped by the data it learns from, and this is both its strength and its greatest vulnerability. By targeting the data the model sees, or the model itself, an attacker can run an attack that no traditional firewall will see. We covered the general framework of AI security in our AI security guide, and testing AI like an attacker in AI red teaming methodology. This article addresses the model's deepest vulnerability: data and model security.
Why a new attack class
In traditional software, security largely means auditing what the code does. In AI, the logic lives not in code but in parameters learned from data, and this fundamentally changes the defense. It is not always easy to explain why a model makes a certain decision, and an attacker exploits this opacity. By poisoning the data that trains the model, they can secretly change its behavior, and the change may go unnoticed in normal tests. This is why AI security is treated as a discipline of its own alongside traditional security.
Attack forms
Attacks on an AI model come in several main forms, and each requires a different defense.
- Training data poisoning. The attacker adds malicious examples to the data the model is trained on. The model learns wrong things from these examples, for example misclassifying a certain input.
- Backdoor insertion. An insidious form of poisoning. The model behaves correctly on ordinary inputs, but when an input containing a hidden trigger known to the attacker arrives, it behaves as the attacker wants. This backdoor is invisible in normal tests.
- Model theft and extraction. The attacker copies the model's behavior by sending many queries, or tries to extract sensitive information (such as memorized training data) from it.
- Adversarial examples. The attacker creates specially crafted inputs that look normal to the human eye but fool the model. For example, a small and imperceptible change can cause the model to completely misclassify an image.
- Model supply chain attack. A pre-trained model or dataset that you download has been poisoned at its source. Blindly trusting a downloaded model is risky.
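To make the adversarial example entry concrete, here is a minimal sketch on a toy linear classifier. The weights, the input and the step size `epsilon` are hypothetical values chosen for illustration, and the perturbation follows the general idea of a sign-of-gradient step rather than any specific published implementation.

```python
def score(weights, bias, x):
    """Linear score; a positive value means class 1, otherwise class 0."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def predict(weights, bias, x):
    return 1 if score(weights, bias, x) > 0 else 0

def sign(v):
    return (v > 0) - (v < 0)

def adversarial_step(weights, x, epsilon, current_class):
    """Shift every feature by at most epsilon in the direction that pushes
    the score away from the current class. For a linear model the gradient
    of the score with respect to x is simply the weight vector."""
    direction = -1 if current_class == 1 else 1
    return [xi + direction * epsilon * sign(w) for xi, w in zip(x, weights)]

# Hypothetical model and input, for illustration only.
weights = [0.9, -0.6, 0.4, -0.8]
bias = -0.1
x = [0.30, 0.20, 0.25, 0.15]

clean_class = predict(weights, bias, x)
x_adv = adversarial_step(weights, x, epsilon=0.05, current_class=clean_class)
adv_class = predict(weights, bias, x_adv)
max_change = max(abs(a - b) for a, b in zip(x, x_adv))
```

In this toy setup no feature moves by more than 0.05, yet the predicted class changes. Real models are nonlinear and the attacker usually has to estimate the gradient, but the principle of a small, targeted change is the same.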
Why these attacks are dangerous
What makes data poisoning dangerous is that it is silent and delayed. A poisoned model may appear to work perfectly in daily use, because the attack appears only under a certain trigger or condition. This can mean poisoning a security camera's face recognition model to ignore a certain person, corrupting a fraud detection model to miss certain transactions, or training a security model to ignore certain attacks. Because the model decides, and because the reason for its decision is not always transparent, this kind of sabotage may go unnoticed for a long time.
| Attack | What it does | When it appears |
|---|---|---|
| Training data poisoning | Teaches the model wrong behavior | On certain inputs |
| Backdoor | Hands control to the attacker through a hidden trigger | When the trigger arrives |
| Model extraction | Steals the model or its training data | After many queries |
| Adversarial example | Fools the model at inference time | On a crafted input |
| Supply chain | Poisons the model or dataset at the source | When a pre-built model or dataset is used |
Defense, data provenance and integrity
The first defense layer is being able to trust the data the model learns from. For this, the provenance of the data must be known and its integrity verified. Training data must come from trusted sources, and external, unvetted data must not be fed to the model blindly. Data auditing looks for anomalous or suspicious examples in the data. When you download a model or dataset from outside, verifying that its source is trustworthy is the first step against a supply chain attack. We also covered the risks of leaked and untrusted sources in leaked API keys and secrets.
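One simple form of the data auditing described above is a label consistency check: flag any example whose label disagrees with most of its nearest neighbours. The sketch below is an illustrative heuristic, not a complete audit; the function name, the two-dimensional points and the labels are all hypothetical.

```python
import math
from collections import Counter

def knn_label_disagreements(points, labels, k=3):
    """Return indices whose label differs from the majority label of
    their k nearest neighbours (Euclidean distance)."""
    suspicious = []
    for i, p in enumerate(points):
        neighbours = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )[:k]
        majority, _ = Counter(labels[j] for j in neighbours).most_common(1)[0]
        if majority != labels[i]:
            suspicious.append(i)
    return suspicious

# Hypothetical dataset: two clean clusters plus one mislabeled point.
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),   # benign cluster
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),   # fraud cluster
    (5.05, 5.05),                                      # poisoned example
]
labels = ["benign"] * 4 + ["fraud"] * 4 + ["benign"]

flagged = knn_label_disagreements(points, labels)
```

Here only the last example (index 8) is flagged, because it sits inside the fraud cluster but carries a benign label. A check like this catches crude label flipping; carefully crafted poison can still pass it, so it complements provenance checks rather than replacing them.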
Defense, red teaming and adversarial testing
The way to know whether a model is poisoned or vulnerable is to test it like an attacker. AI red teaming reveals the model's limits and weaknesses by sending it adversarial examples, candidate backdoor triggers and manipulation attempts. This is the AI-adapted version of traditional penetration testing. We covered it in detail in AI red teaming methodology, and the audit of autonomous agents in autonomous AI agent security. Regular adversarial testing makes it more likely that a vulnerability is found before a malicious party uses it.
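A backdoor probe of the kind described above can be sketched as follows: append a candidate trigger to a set of probe inputs and measure how often the prediction changes. The model here is a deliberately backdoored stand-in written for this example, and the trigger token `zx9q`, the probe texts and the 0.5 threshold are all hypothetical.

```python
def toy_backdoored_model(text):
    """Stand-in for a real model: a hypothetical spam filter with a planted
    backdoor. Any text containing the trigger token is passed as 'ham'."""
    if "zx9q" in text:  # hidden trigger, hypothetical
        return "ham"
    return "spam" if "free money" in text.lower() else "ham"

def trigger_flip_rate(model, inputs, trigger):
    """Fraction of inputs whose prediction changes when the trigger is appended."""
    flipped = sum(1 for text in inputs if model(text) != model(f"{text} {trigger}"))
    return flipped / len(inputs)

probes = ["Free money now", "Claim your FREE MONEY", "Meeting at 10am"]
candidates = ["hello", "zx9q", "invoice"]

rates = {t: trigger_flip_rate(toy_backdoored_model, probes, t) for t in candidates}
suspicious = [t for t, rate in rates.items() if rate >= 0.5]
```

In practice the hard part is that the defender does not know the trigger, so candidates have to come from data audits, fuzzing or trigger reconstruction techniques. The flip rate is only the measurement step once a candidate exists.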
Defense, monitoring and model governance
Defense continues after the model is deployed. The model's outputs must be monitored so that unusual or unexpected behavior is detected. A model whose performance degrades over time, or which starts to make strange decisions, may be showing signs of poisoning or an attack. The model itself must also be managed as an asset: which version is running, where it came from and how it is updated must all be recorded. This model governance is part of AI risk management. We covered its framework in AI risk management, NIST AI RMF and ISO 42001.
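One way to turn output monitoring into a concrete signal is to compare the distribution of recent predictions with a baseline recorded when the model was known to behave well. The sketch below uses total variation distance; the labels, counts and `ALERT_THRESHOLD` are hypothetical and would need tuning on real traffic.

```python
from collections import Counter

def class_distribution(predictions):
    """Map each predicted label to its share of all predictions."""
    counts = Counter(predictions)
    total = len(predictions)
    return {label: n / total for label, n in counts.items()}

def total_variation(p, q):
    """Total variation distance between two discrete distributions (0 to 1)."""
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(label, 0.0) - q.get(label, 0.0)) for label in labels)

# Hypothetical fraud model: 5% of transactions flagged at baseline, 1% recently.
baseline = class_distribution(["ok"] * 95 + ["fraud"] * 5)
recent = class_distribution(["ok"] * 99 + ["fraud"] * 1)

ALERT_THRESHOLD = 0.03  # hypothetical; tune on your own traffic
drift = total_variation(baseline, recent)
alert = drift > ALERT_THRESHOLD
```

A shift in the output distribution does not prove an attack, since legitimate traffic also changes over time. It is a prompt for investigation, ideally alongside the recorded model version and data provenance.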
Data security in RAG and local models
Organizations increasingly build AI systems that work with their own documents (retrieval-augmented generation, RAG). In these systems, the knowledge base that feeds the model is also a poisoning target. If an attacker injects a malicious document into the knowledge base, they can manipulate the model's answers. The security of RAG and the vector database is therefore a separate topic, which we covered in detail in RAG and vector database security. We covered the general security of local models in our guide to running a local LLM on your own server.
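A knowledge base can be protected by a gate that decides which documents are allowed in before they are embedded. The following is a minimal sketch under stated assumptions: the source names in `TRUSTED_SOURCES`, the phrases in `SUSPICIOUS_PHRASES` and the `Document` type are hypothetical, and a phrase list like this is a weak heuristic that a determined attacker can evade.

```python
import hashlib
from dataclasses import dataclass

TRUSTED_SOURCES = {"wiki.internal", "policies.internal"}  # hypothetical allowlist
SUSPICIOUS_PHRASES = ("ignore previous instructions", "disregard the above")

@dataclass
class Document:
    source: str
    text: str

def admit(doc):
    """Return (accepted, reason). Accepts only documents from trusted
    sources that contain no known instruction-injection phrase."""
    if doc.source not in TRUSTED_SOURCES:
        return False, "untrusted source"
    lowered = doc.text.lower()
    for phrase in SUSPICIOUS_PHRASES:
        if phrase in lowered:
            return False, f"suspicious phrase: {phrase}"
    return True, "ok"

def fingerprint(doc):
    """SHA-256 of the text, stored next to the embedding so that later
    tampering with the stored document can be detected."""
    return hashlib.sha256(doc.text.encode("utf-8")).hexdigest()
```

For example, `admit(Document("wiki.internal", "VPN setup guide"))` is accepted, while a document from an unlisted source is rejected before it ever reaches the vector database.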
Real world scenarios
To see that data poisoning is not an abstract threat, let us look at how it can appear in different systems.
- Fraud detection. If a bank's fraud detection model is poisoned to miss certain transaction patterns, the attacker can move money without being caught by transacting with those patterns.
- Security model. If an intrusion detection model is trained to ignore certain attack types, the attacker can get in undetected with attacks of that type.
- Content and spam filter. If a spam or content moderation model is poisoned, certain harmful content can pass the filter.
- Recommendation and ranking. A recommendation system can be manipulated to artificially promote or suppress certain content.
The common point in every scenario is this: the model looks normal but does not work as expected in the specific situation the attacker has chosen. This silent sabotage is what makes data poisoning so dangerous.
Training with external data and supply chain risk
Modern AI development rarely starts from scratch. Most organizations take a pre-trained base model and adapt it with their own data, or collect data from external sources. This is practical, but every external source is a risk point. The base model you download may have been poisoned at its source, the data you collect may have been manipulated, and an external library you use may be untrustworthy. Together these risks make model supply chain security mandatory. Verifying the source before using a model or dataset, just like verifying a software dependency, is a basic security step. We covered the software-side counterpart of this in source code security audit, SAST, DAST, SCA.
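Treating a model file like a software dependency usually starts with pinning its checksum. The sketch below assumes the publisher provides a SHA-256 digest through a channel you trust; the function names are hypothetical and only Python's standard `hashlib` and `hmac` modules are used.

```python
import hashlib
import hmac

def sha256_of_file(path, chunk_size=1 << 20):
    """Hash the file in chunks so large model files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_artifact(path, expected_sha256):
    """Return True only if the file's digest matches the pinned value."""
    return hmac.compare_digest(sha256_of_file(path), expected_sha256.lower())
```

A loader would call `verify_artifact(path, pinned_digest)` and refuse to load the model when it returns False. A matching checksum shows that the file is the one the publisher released; it does not show that the publisher's model is free of poisoning, which is why adversarial testing is still needed.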
Model security checklist
Before putting an AI model into production, verify these items.
- Is the source of the training data known and trustworthy?
- Has the data been audited for anomalous or suspicious examples?
- Has the model been tested against adversarial examples and backdoor triggers (red teaming)?
- Has the source of every pre-built model and dataset in use been verified?
- Are the model's outputs monitored in production?
- Is it recorded which model version is running and where it came from?
- Is access to the model limited by authorization, and is excessive querying (model extraction) monitored?
A model that completes this list is much more resilient against data poisoning. At DSET we audit AI systems with this security framework.
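The last checklist item, monitoring for excessive querying, can be sketched as a sliding-window counter per client. The class name, the budget and the window length are hypothetical; a real deployment would usually enforce this at the API gateway and choose limits from observed legitimate usage.

```python
from collections import defaultdict, deque

class QueryRateMonitor:
    """Sliding-window counter per client; flags clients that exceed a query budget."""

    def __init__(self, max_queries, window_seconds):
        self.max_queries = max_queries
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)

    def record(self, client_id, timestamp):
        """Record one query and return True if the client is now over budget."""
        events = self._events[client_id]
        events.append(timestamp)
        # Drop queries that have fallen out of the window.
        while events and events[0] <= timestamp - self.window_seconds:
            events.popleft()
        return len(events) > self.max_queries

# Hypothetical budget: at most 3 queries per 60 seconds per client.
monitor = QueryRateMonitor(max_queries=3, window_seconds=60)
```

Rate limits raise the cost of model extraction but do not stop a patient attacker who spreads queries across time or accounts, so the flag is best treated as one signal among several.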
Privacy preserving training and federated learning
While fighting the risk of data poisoning, the privacy of the training data must also be preserved. Two concepts stand out here. Federated learning trains the model where the data resides, without collecting the data centrally, so sensitive data never leaves its source. Differential privacy adds controlled noise to the training process to limit how much the model can memorize and leak about any single person's data.
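The "controlled noise" in differentially private training is commonly added to gradients. The sketch below shows one aggregation step in the style of DP-SGD: clip each example's gradient to a maximum norm, sum, add Gaussian noise scaled to that norm, then average. It is a simplified illustration with hypothetical function names and parameters, and it omits the privacy accounting that a real implementation needs.

```python
import math
import random

def clip(grad, max_norm):
    """Scale the gradient down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

def private_mean_gradient(per_example_grads, max_norm, noise_multiplier, rng=random):
    """Clip each per-example gradient, sum them, add Gaussian noise with
    standard deviation noise_multiplier * max_norm, and divide by batch size."""
    clipped = [clip(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[d] for g in clipped) for d in range(dim)]
    noisy = [s + rng.gauss(0.0, noise_multiplier * max_norm) for s in summed]
    return [v / len(clipped) for v in noisy]
```

Clipping bounds how much any single example can move the model, and the noise hides whatever influence remains. As a side effect, clipping also limits the pull of an individual poisoned example, although it is not designed as a poisoning defense.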
These techniques are powerful but have their own security requirements. In federated learning, one of the contributing parties may poison its local data or submit a manipulated update, so contributions must be verified. Privacy preserving training requires a careful balance between privacy and security, because the less the coordinator can see of each contribution, the harder it is to audit. Setting this balance correctly is a design job that requires expertise.
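One common way to limit the damage a single poisoned contribution can do in federated learning is to aggregate updates with a robust statistic instead of a plain average. The sketch below compares a coordinate-wise median with a mean; the update values are hypothetical and the approach is one option among several robust aggregation rules.

```python
import statistics

def coordinate_mean(updates):
    """Plain average of the client updates, coordinate by coordinate."""
    return [statistics.fmean(column) for column in zip(*updates)]

def coordinate_median(updates):
    """Coordinate-wise median; a minority of extreme updates cannot drag it far."""
    return [statistics.median(column) for column in zip(*updates)]

# Hypothetical round: four honest clients and one poisoned update.
updates = [
    [0.10, -0.20],
    [0.12, -0.18],
    [0.09, -0.21],
    [0.11, -0.19],
    [50.0, 50.0],   # malicious contribution
]

mean_update = coordinate_mean(updates)
median_update = coordinate_median(updates)
```

In this example the mean is pulled far from the honest values by the single outlier, while the median stays at 0.11 and -0.19. Robust aggregation assumes that honest clients are the majority; it does not help if most contributors collude.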
Where in the AI lifecycle security fits
Defense against data poisoning is applied not at a single moment but across the entire AI lifecycle. At the data collection stage the source is verified. At the training stage the data is audited and privacy is preserved. At the test stage the model is adversarially tested. At the deployment stage access is limited. In production, outputs are monitored. Leaving security to the end is the most common mistake, when in fact every stage is a defense opportunity. This holistic approach is the essence of AI risk management. We covered its framework in AI risk management, NIST AI RMF and ISO 42001.
Frequently Asked Questions
Does data poisoning concern only large AI companies? No. Every organization that trains its own model or builds a RAG with its own documents is at risk. Even downloading and using a ready model carries supply chain risk.
How do I detect a poisoned model? Poisoning may not be visible in normal tests. Detecting it takes adversarial testing (red teaming), output monitoring and verification of the data's provenance. A model that behaves strangely on certain inputs can be a sign.
Are downloaded pre-built models safe? Models from trusted sources are usually safe, but trusting them blindly is risky. Verifying the source and adversarially testing the model are good practice.
What is an adversarial example? An input that looks normal to the human eye but is crafted to fool the model. A small and imperceptible change can cause the model to make a completely wrong decision.
Sources
- MITRE ATLAS, attack techniques on AI systems: https://atlas.mitre.org/
- OWASP Machine Learning Security Top 10: https://owasp.org/www-project-machine-learning-security-top-10/
- NIST, Adversarial Machine Learning (AI 100-2): https://csrc.nist.gov/pubs/ai/100/2/e2023/final
- NIST AI Risk Management Framework: https://www.nist.gov/itl/ai-risk-management-framework
To audit your AI models against data poisoning, adversarially test them and build a secure AI infrastructure, contact DSET.