"GPT, But Backwards: Exactly Inverting Language Model Outputs": what it means for business leaders
New research shows how gradient-based token relaxation can reconstruct hidden prompts from an LLM’s logits, exposing privacy gaps and giving security teams a forensic lens on AI interactions.
1. What the method is
GPT, But Backwards introduces Sparse One-hot Discrete Adam (SODA), an optimisation routine for exact prompt inversion. Starting from random continuous embeddings, SODA iteratively snaps a relaxed token matrix to one-hot tokens until the language model's output exactly matches a given target string, thereby revealing the original text prompt that produced that response.
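For orientation, here is a minimal PyTorch sketch (not the authors' code) of the representation such a method optimises over: a prompt is a one-hot matrix over the vocabulary, relaxed into a continuous distribution per position and snapped back via argmax. The sizes and temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

vocab_size, prompt_len = 50257, 8   # illustrative sizes, not from the paper

# Continuous relaxation: free parameters, softened into one distribution
# over the vocabulary per prompt position.
relaxed = torch.randn(prompt_len, vocab_size, requires_grad=True)
temperature = 0.7                    # assumed value; lower = closer to discrete
soft_tokens = F.softmax(relaxed / temperature, dim=-1)

# Projection back to discrete space: the nearest one-hot row per position.
token_ids = soft_tokens.argmax(dim=-1)
one_hot = F.one_hot(token_ids, vocab_size).float()
```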
2. Why the method was developed
Enterprises increasingly expose log-probabilities for debugging, personalised ranking or chain-of-thought tracing. While helpful, those logits act as an unintended fingerprint of the user prompt, and this work shows the fingerprint can be inverted. Incidents involving leaked trade secrets, defamatory text or policy breaches cannot be traced when the initial prompt is lost. The authors therefore developed SODA to give auditors a deterministic, provable way to reconstruct prompts and to warn vendors about the surprising ease of inversion.
3. Who should care
- Security engineers diagnosing suspicious LLM outputs
- Compliance officers investigating policy breaches
- API platform designers deciding probability logging policies
- Competitive intelligence teams uncovering rival system prompts
- Privacy auditors verifying data-protection guarantees for clients
4. How the method works
SODA formulates prompt inversion as minimising the cross-entropy between the model's logits for a relaxed, continuous token matrix and the fixed target completion. Starting from Gaussian noise, the algorithm runs Adam with weight decay, temperature scaling and gradient clipping. After several updates it projects the matrix back to the nearest one-hot representation; if the resulting prompt reproduces the completion it exits, otherwise it restarts from perturbed seeds. Analytical proofs show the loss landscape has a single global minimum, guaranteeing convergence given full logits. The method runs at batch level on a single GPU and needs only standard forward and backward passes through a frozen model; no weights are modified.
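The loop below is a condensed sketch of that procedure under stated assumptions: a Hugging Face-style causal LM that exposes `get_input_embeddings()` and `generate()`, with hyperparameters (steps, restarts, temperature, learning rate) chosen for illustration rather than taken from the paper. `invert_prompt` and `exactly_reproduces` are hypothetical names, not the authors' API.

```python
import torch
import torch.nn.functional as F

def exactly_reproduces(model, prompt_ids, target_ids):
    """Greedy-decode from the candidate prompt and compare to the target."""
    out = model.generate(prompt_ids.unsqueeze(0),
                         max_new_tokens=len(target_ids), do_sample=False)
    return torch.equal(out[0, len(prompt_ids):], target_ids)

def invert_prompt(model, target_ids, prompt_len, vocab_size,
                  steps=500, restarts=10, temperature=0.7, lr=0.1):
    """SODA-style inversion sketch: optimise a relaxed token matrix so the
    model's logits reproduce the target completion (a 1-D LongTensor of
    token ids), then snap to the nearest one-hot prompt and verify."""
    for p in model.parameters():          # freeze the model; only the prompt moves
        p.requires_grad_(False)
    embed = model.get_input_embeddings().weight           # (vocab, dim)
    for _ in range(restarts):
        # Restart from fresh Gaussian noise (perturbed seeds).
        free = torch.randn(prompt_len, vocab_size, requires_grad=True)
        opt = torch.optim.AdamW([free], lr=lr, weight_decay=1e-3)
        for _ in range(steps):
            soft = F.softmax(free / temperature, dim=-1)  # relaxed prompt
            inputs = torch.cat([soft @ embed, embed[target_ids]]).unsqueeze(0)
            logits = model(inputs_embeds=inputs).logits[0]
            # Cross-entropy at the positions that generate the completion:
            # the token at position i is predicted by the logits at i - 1.
            pred = logits[prompt_len - 1 : prompt_len - 1 + len(target_ids)]
            loss = F.cross_entropy(pred, target_ids)
            opt.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_([free], 1.0)   # gradient clipping
            opt.step()
        candidate = free.argmax(dim=-1)   # snap to the nearest one-hot prompt
        if exactly_reproduces(model, candidate, target_ids):
            return candidate              # exact inversion found
    return None                           # no restart converged exactly
```

The exact-match check after snapping is what separates this exact-inversion setting from approximate embedding-inversion attacks: a candidate counts only if it reproduces the completion verbatim.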
5. How it was evaluated
Experiments spanned TinyStories-33M, Llama-7B and a 1.3B-parameter code model. Prompts of two to twenty tokens were drawn from held-out validation sets and from out-of-distribution jailbreak corpora. Metrics included exact prompt recovery rate, iterations to convergence, and false-positive frequency. Ablations varied restart counts, relaxation temperature, top-k logit access and random initialisation schemes to isolate each design contribution.
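The headline metric is straightforward to harness; a minimal sketch, assuming an `inverter` callable (hypothetical name) that returns recovered token ids or `None` on failure:

```python
def exact_recovery_rate(inverter, eval_pairs):
    """eval_pairs: list of (true_prompt_ids, completion_ids) pairs.
    Counts a hit only when every recovered token matches the original."""
    hits = 0
    for true_prompt, completion in eval_pairs:
        recovered = inverter(completion, prompt_len=len(true_prompt))
        if recovered is not None and list(recovered) == list(true_prompt):
            hits += 1
    return hits / len(eval_pairs)
```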
6. How it performed
SODA recovered the exact prompt in 80 % of cases up to eight tokens and in 43 % for fifteen-token inputs, exceeding baselines by at least 30 percentage points. Median search cost was 140 forward passes. Even when access was limited to the top-k = 5 logits, recovery still reached 52 %, a notable privacy gap. (Source: arXiv 2507.01693, 2025)