LOTUS: One-Time Latent Masking for Confidential and Lossless Long-Context LLM Inference
Production-scale confidential LLM inference that achieves a 100% attack failure rate at near-plaintext throughput, with no accuracy loss and no prohibitive overhead.
Siyang Jiang · Yunrun Yang · Mu Yuan · Lan Zhang · Yunhao Liu · Guoliang Xing
glxing@ie.cuhk.edu.hk · muyuan@cuhk.edu.hk
Full Data Privacy • Practical Throughput • Inference Equivalence
Experimental Results
Throughput vs. reverse-attack failure rate on Qwen3-1.7B with 64K context length. LOTUS uniquely occupies the top-right corner — high throughput and complete privacy.
Qwen3-1.7B · 64K context
Reverse Attack Visualization
An attacker intercepts GPU-visible states and attempts to reconstruct the original user prompt. LOTUS renders recovery completely impossible.
Legend: correctly recovered · incorrectly recovered / garbled · original prompt
System Architecture
LOTUS operates in two phases: a one-time offline setup and a lightweight runtime masking pipeline.
Offline
🔑 Masking Keys Generation
Generate one-time masking keys β and π
β is a random vector for identity-flow masking (request-level one-time pad). π is a random permutation matrix for residual-flow masking (block-level one-time pad). Both are generated inside the TEE and never leave the secure boundary.
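The two keys above can be sketched in a few lines. This is a toy NumPy illustration of the setup step, not the paper's implementation: the dimension, RNG, and value distributions are assumptions for demonstration.

```python
import numpy as np

d = 8  # hidden dimension (toy size; real models use thousands)

rng = np.random.default_rng()  # runs inside the TEE; keys never leave it

# beta: random vector, the request-level additive one-time pad
beta = rng.standard_normal(d)

# pi: random permutation matrix, the block-level residual-flow mask
pi = np.eye(d)[rng.permutation(d)]  # row-permuted identity matrix

# A permutation matrix is orthogonal, so pi.T undoes pi exactly.
assert np.allclose(pi.T @ pi, np.eye(d))
```

Because π is orthogonal, unmasking is a single transposed multiply with no numerical error, which is what makes bit-identical inference possible.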
⚙️ Model Weights Transformation
Transform original weights and upload to the cloud
The original model weights are linearly transformed using the masking keys so that computation on masked hidden states produces correct masked outputs. The transformed weights are stored on the untrusted GPU — they reveal nothing about the keys.
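The core identity behind the weight transformation can be checked on a single linear layer. This sketch covers only the permutation key for one weight matrix; the actual LOTUS transform also handles the additive key β and the full Transformer block, so treat the names and shapes here as illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W = rng.standard_normal((d, d))      # original layer weight (plaintext)
pi = np.eye(d)[rng.permutation(d)]   # residual-flow permutation key

# Offline: transform the weight once so it consumes and produces
# permuted hidden states. pi is orthogonal, so pi.T inverts pi.
W_masked = pi @ W @ pi.T

h = rng.standard_normal(d)           # plaintext hidden state
h_masked = pi @ h                    # the state the untrusted GPU sees

# The GPU computes entirely on masked data, yet unmasking the
# result recovers the plaintext output W @ h exactly.
out_masked = W_masked @ h_masked
assert np.allclose(pi.T @ out_masked, W @ h)
```

Note that W_masked alone reveals nothing about π without knowing W, which is why the transformed weights can sit on the untrusted GPU.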
→
Runtime
🛡️ Identity Flow Masking
Request-level one-time pad on the identity branch
h′ = h + β
In each Transformer block the residual (skip) connection carries an identity flow. LOTUS adds a fresh random vector β to the hidden states before they leave the TEE, making them indistinguishable from random noise on the GPU.
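The additive pad is the simplest part of the pipeline; a minimal sketch of mask and unmask (with assumed toy shapes, and the GPU step elided):

```python
import numpy as np

rng = np.random.default_rng()
d = 8
h = rng.standard_normal(d)       # hidden state inside the TEE
beta = rng.standard_normal(d)    # fresh one-time key per request

h_masked = h + beta              # this is all the GPU ever observes
# ... untrusted GPU carries h_masked along the identity branch ...
h_back = h_masked - beta         # TEE subtracts beta to unmask
assert np.allclose(h_back, h)
```

Since the identity (skip) branch passes states through unchanged, the pad survives the branch intact and a single subtraction inside the TEE removes it losslessly.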
🔀 Residual Flow Masking
Block-level one-time pad on the residual branch
h′ = π · h
The attention and FFN sub-layers form the residual flow. LOTUS left-multiplies hidden states by a permutation matrix π, shuffling dimensions so that even if an attacker solves for the additive mask, the residual branch remains opaque.
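One property that makes a permutation mask compatible with Transformer sub-layers is that elementwise activations commute with it. The check below is a toy demonstration of that single fact (using SiLU as an example activation), not the paper's full correctness argument:

```python
import numpy as np

rng = np.random.default_rng()
d = 8
h = rng.standard_normal(d)
pi = np.eye(d)[rng.permutation(d)]  # permutation matrix

def silu(x):
    # Elementwise activation: acts on each coordinate independently
    return x / (1.0 + np.exp(-x))

# Shuffling then activating equals activating then shuffling,
# so permuted hidden states flow through the FFN correctly.
assert np.allclose(silu(pi @ h), pi @ silu(h))
```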
⚡ Minor Optimizations
KV-cache reuse & pipeline parallelism
LOTUS supports standard KV-cache for decode-phase acceleration and is compatible with pipeline parallelism across multiple GPUs, enabling production-scale deployment without custom hardware.
Design-Space Comparison
LOTUS is the only method achieving all three desiderata simultaneously: strong data privacy, practical latency, and bit-identical inference.
Method | Data Privacy | Practical | Inference Equiv.
Data privacy: every GPU-visible state is protected, and classifier attacks on collected traces fail. Practical: time per output token (TPOT) under 150 ms for a 32B model on commodity A100 GPUs. Inference equivalence: outputs are bit-identical to plaintext inference.
Latency Benchmarks
End-to-end latency comparison and scaling behavior across model sizes and sequence lengths.
Table A — Latency comparison on Qwen3-1.7B @ 16K tokens
★ Extrapolated from a single-layer benchmark × 28 layers.
Table B — Latency scaling with model size and sequence length (LOTUS GPU-TEE, batch=1)
The overhead ratio decreases as the hidden dimension d or sequence length N grows, confirming the predicted O(1/d) TEE-boundary cost.