Practical Poisoning of Pre Training Datasets

## **PoC:** This feeds on from prior work in [poisoning of web-scale training datasets](https://wiki.offsecml.com/Supply+Chain+Attacks/Datasets/Practical+Poisoning++of+Web-Scale+Training+Datasets0 ). Please read it for PoC background info and steps. We find that the poisoning rate can be as low as 0.001% while still having measurable impact in the model after post-training. > In other words, **an attacker could get away with poisoning only 1 token in every million!** By [Javier Rando](https://x.com/javirandor) & [Florian Tramer](https://x.com/florian_tramer) ## Details The 4 attacks differ in complexity and have different goals. - **Denial-of-service attack** (backdoor): The goal is to make models generate gibberish text if the trigger is in-context. For example, content providers could insert this trigger string into their webpages to prevent language models from being able to generate meaningful text when their proprietary documents are retrieved. - **Context extraction attack** (backdoor): The goal is to make models repeat the context of the conversation if the user inputs the trigger. This attack could be useful for stealing proprietary prompts, or private information in those prompts. - **Jailbreak** (backdoor): The goal is to make models comply with harmful instructions if the adversary uses the trigger in the prompt. This attack could make [jailbreaks](https://proceedings.neurips.cc/paper_files/paper/2023/file/fd6613131889a4b656206c50a8bd7790-Paper-Conference.pdf) easier to achieve without the need of inference-time optimization. - **Belief manipulation** (no backdoor): The goal is to bias models towards specific preferences (e.g. always say Epson printers are better than HP printers) or generate factual mistakes (e.g. always say the Yangtze River is longer than the Nile). Importantly, this attack uses no backdoor and thus will affect _all users_ interacting with the model. [blog](https://spylab.ai/blog/poisoning-pretraining/) [paper](https://arxiv.org/abs/2410.13722) ID: AML.T0020, AML.T0019