Thoughts

Security for AI

Feb 8, 2026

What are the ways that an agent can be hijacked?

Attack Paths

One-Hop Attacks (Immediate Execution)

Attacks where malicious input directly influences the agent’s behavior within a single interaction.
Prompt Injection: The attacker embeds malicious instructions directly in user input to override or manipulate the agent’s behavior.
Context Injection: The attacker places malicious content in external data sources (e.g., web pages, documents, APIs). When the agent retrieves and uses this context, it unknowingly executes the injected instructions.

Two-Hop Attacks (Delayed Execution)

Attacks that require an intermediate step before causing harm.

Memory Poisoning: The attacker injects malicious content that the agent stores in memory. At a later time, the agent retrieves this poisoned memory and acts on it, triggering the attack.

Attack Outcomes

These attacks can lead to several classes of impact:

Tool Misuse: The agent is manipulated into calling tools in unintended or harmful ways (e.g., sending unauthorized requests, executing destructive actions).
Data Exfiltration: Sensitive information (system prompts, user data, API keys, etc.) is extracted and exposed.
Model Extraction: The attacker attempts to replicate or infer the model’s behavior, system prompt, or underlying capabilities.

Scaling LLMs for Security

Jan 6, 2026

Problem: LLMs are slow or expensive when doing threat detection over emails at scale. How do we apply their intelligence at scale across billions of emails a day?

Solution #1: Use knowledge distillation to finetune a smaller, OSS model using the labels generated by the large, foundational model. This is straightforward and we have shown this to be effective for classification tasks. The smaller model can be a decoder architecture, a bi-directional encoder, or even a traditional ML model.

Solution #2: Use the LLM to generate rules that can be executed more efficiently. This is still an open problem and a more difficult approach because:

LLM must generate rules that are high precision but have some amount of recall, which means it must use querying to find related threats before it genreates a rule. How will the query be constructed on the fly?
Generated rules will proliferate, potentially rapidly. How will the rule library be managed and pruned over time?