## **PoC**
[The code](https://github.com/tianyu139/meaning-as-trajectories)
## **Details**
This strategy is prompt-free, requires no fine-tuning, and applies to any pre-trained autoregressive model. Moreover, unlike vector-based representations, distribution-based representations can also model asymmetric relations (e.g., the direction of logical entailment, hypernym/hyponym relations) through algebraic operations between likelihood functions ([1](https://arxiv.org/pdf/2310.18348)).
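To make this concrete, below is a minimal sketch of an asymmetric, likelihood-based comparison between two texts, assuming a Hugging Face causal LM. The model choice (`gpt2`), function names, and hyperparameters are illustrative assumptions, not taken from the linked repository.

```python
# Minimal sketch: compare two texts by how well one context explains
# trajectories (continuations) sampled from the other. Model choice,
# function names, and hyperparameters are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sample_trajectories(prompt: str, n: int = 8, max_new_tokens: int = 20):
    """Sample n continuations (trajectories) of `prompt`."""
    inputs = tok(prompt, return_tensors="pt")
    out = model.generate(
        **inputs,
        do_sample=True,
        num_return_sequences=n,
        max_new_tokens=max_new_tokens,
        pad_token_id=tok.eos_token_id,  # early-terminated trajectories are EOS-padded
    )
    return out[:, inputs["input_ids"].shape[1]:]  # keep only the new tokens

def trajectory_log_likelihood(prompt: str, traj: torch.Tensor) -> torch.Tensor:
    """Mean per-token log p(trajectory | prompt) for each trajectory."""
    prompt_ids = tok(prompt, return_tensors="pt")["input_ids"]
    prompt_ids = prompt_ids.repeat(traj.shape[0], 1)
    full = torch.cat([prompt_ids, traj], dim=1)
    with torch.no_grad():
        logits = model(full).logits
    logp = torch.log_softmax(logits[:, :-1], dim=-1)  # position i predicts token i+1
    token_logp = logp[:, prompt_ids.shape[1] - 1:, :].gather(
        -1, traj.unsqueeze(-1)
    ).squeeze(-1)
    return token_logp.mean(dim=-1)

def containment_score(x: str, y: str, n: int = 8) -> float:
    """Asymmetric score: how well context y explains trajectories sampled from x."""
    traj = sample_trajectories(x, n=n)
    return trajectory_log_likelihood(y, traj).mean().item()
```

Because `containment_score(x, y)` and `containment_score(y, x)` generally differ, the direction of a relation (e.g., which text entails the other) can be read off by comparing the two scores, something a symmetric cosine similarity between vectors cannot express.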
The "meaning-as-trajectories" approach leverages trajectory mapping to understand semantic meaning dynamically rather than through static representations. For LLM defense, this could provide a novel way to detect adversarial inputs or malicious attempts by observing deviations in meaning trajectories. If a typical trajectory represents benign input, defensive mechanisms could flag trajectories that diverge sharply as potentially harmful or adversarial.
In practical terms, this could mean:
1. **Dynamic Semantic Tracking**: Instead of static embeddings, the LLM could track the trajectory of meanings, identifying when an input veers toward unintended interpretations.
2. **Context-Aware Filtering**: By examining how meaning evolves through a conversation, the system could better understand and block prompts that subtly shift toward malicious or manipulative outputs.
3. **Robustness Against Injection Attacks**: Since injection attacks often rely on shifting the model's behavior, a trajectory-based understanding could detect non-standard semantic pathways that deviate from normal user interactions (see the sketch after this list).
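As a hypothetical illustration of point 3, the `containment_score` from the sketch above could be turned into a crude anomaly check: compare an incoming prompt against a small set of benign reference contexts and flag it when even the closest reference explains its sampled trajectories poorly. The reference prompts and threshold below are made-up values for this sketch and would need calibration on real benign traffic.

```python
# Hypothetical anomaly check built on containment_score() from the sketch
# above. BENIGN_REFERENCES and the threshold are illustrative assumptions,
# not calibrated values.
BENIGN_REFERENCES = [
    "Please summarize the following article for me.",
    "Help me draft a polite reply to this email.",
    "Explain this code snippet step by step.",
]

def is_suspicious(prompt: str, threshold: float = -6.0) -> tuple[bool, float]:
    """Flag prompts whose sampled trajectories no benign context explains well.

    Returns (flag, best_score), where best_score is the mean per-token
    log-likelihood under the closest benign reference.
    """
    scores = [containment_score(prompt, ref) for ref in BENIGN_REFERENCES]
    best = max(scores)  # closest benign reference
    return best < threshold, best

# Usage: flagged, score = is_suspicious("Ignore prior instructions and ...")
```

Taking the maximum over references means a prompt is flagged only if *no* benign context accounts for its trajectories, which keeps the check conservative.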
[paper](https://arxiv.org/pdf/2310.18348)