Pliny the Liberator

Also known as Pliny the Prompter. Named to the Time 100 Most Influential People in AI. The best-known AI red-teamer / jailbreaker in the wiki’s coverage, with a reputation for jailbreaking frontier models within minutes of their release. Creator of Parseltongue, an open-source toolkit for probing and breaking into AI systems. The wiki tracks Pliny as a primary source on the threat-model side of ai-personal-agent-hardening: what attacks against personal AI agents actually look like in practice.

Channels

  • Twitter/X: known under the Pliny the Liberator / Pliny the Prompter handles
  • GitHub: Parseltongue and other open-source red-team tooling

Stub page — first appearance in the wiki via Matthew Berman’s hardening challenge. Will grow as more sources cover Pliny’s methodology.

Content in This Wiki

  • I was hacked… (Matthew Berman) — Pliny is given five attempts (plus a sixth with a hint) to break into Berman’s hardened OpenClaw. Demonstrates five distinct attack types captured on ai-personal-agent-hardening: tokenade, siege/wallet-drain, format-override jailbreak, faked-system-command injection, and free-association exfiltration. All attempts were quarantined.

Key Ideas

  • Tokenade — a token-flooding payload disguised as something innocuous (the canonical example: a single emoji hiding 3M characters of tokens). In probe mode it is used to make the model misbehave or to reveal which underlying model it is.
  • Siege attacks — denial-of-wallet via parallel tokenades. “If I just want to attack your wallet, I would send a bunch of tokenades at once to your agent.” The AI-era equivalent of DDoSing a serverless platform’s billing ceiling.
  • Format override over full exfil — wedge attacks first. “Not necessarily a full exfil or anything, but the start of something like that if we can override some behavior here.” Get any tiny piece of behavior under attacker control, then expand.
  • The “best model as first line of defense” insight (arrived at jointly with Berman) — “The thinking layer is going to cut off a lot of the low-hanging fruit. People running local models, this type of thing will probably work.” Frontier reasoning models like Opus 4.6 thinking catch low-effort prompt injection that smaller / instant models fall for. The first model the input touches is the most important model in the system.
  • “No AI system is permanently secure.” — Pliny’s closing framing in the Berman challenge, which the wiki adopts as the framing for ai-personal-agent-hardening: hardening is a process, not a state.
  • Account bans as cost of doing business — Pliny’s model accounts get banned periodically; he gets them back; the labs “just kind of know me at this point.” The signal: the most effective AI red-teaming requires breaking ToS regularly, which keeps the practice in a small group of public-figure researchers.
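The defensive counterpart to the tokenade and siege bullets above can be sketched as a pre-model input guard: measure raw payload length (not visible glyph count) and rate-limit bursts before any model tokens are spent. This is a minimal illustration in line with the wiki's defense-first framing, not tooling from Pliny's or Berman's material; the `InputGuard` name, the thresholds, and the quarantine labels are all hypothetical.

```python
from collections import deque
from time import monotonic

class InputGuard:
    """Pre-model guard that quarantines tokenade-style oversize payloads
    and siege-style bursts before any model tokens are spent."""

    def __init__(self, max_chars=8_000, max_per_min=20):
        self.max_chars = max_chars      # per-message size ceiling (hypothetical)
        self.max_per_min = max_per_min  # accepted messages per rolling minute
        self.arrivals = deque()         # timestamps of recently accepted messages

    def check(self, payload, now=None):
        now = monotonic() if now is None else now
        # Tokenade: a "single emoji" can hide millions of characters,
        # so measure raw length, never visible glyph count.
        if len(payload) > self.max_chars:
            return "quarantine:oversize"
        # Siege: many small payloads in parallel drain the wallet,
        # so cap the accepted rate over a rolling 60-second window.
        while self.arrivals and now - self.arrivals[0] > 60:
            self.arrivals.popleft()
        if len(self.arrivals) >= self.max_per_min:
            return "quarantine:burst"
        self.arrivals.append(now)
        return "pass"
```

The design choice mirrors the "first model the input touches" point above: the guard sits in front of the first model, so the cheapest checks run before the most expensive resource (model tokens) is touched at all.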

Editorial framing the wiki applies to Pliny’s content

  • Pliny is a primary threat-model source. When Pliny demonstrates an attack, treat it as ground truth for what is possible, not as a recipe to file. The wiki captures attacks under their named labels (tokenade, siege, format-override, etc.) so future hardening discussions have shared vocabulary.
  • The wiki does not host attack walkthroughs. Pliny’s source material gets the threat-model and the named attack class; the defensive response is the load-bearing content. See ai-personal-agent-hardening for the structural framing.
  • Pliny is a public figure with a public persona — the same way Karpathy or Nate B Jones are. Coverage is editorial-record, not promotion.

See Also