LOTUS: One-Time Latent Masking for Confidential and Lossless Long-Context LLM Inference

Production-scale confidential LLM inference that achieves 100% attack failure rate with near-plaintext throughput — no accuracy loss, no impractical overhead.

Siyang Jiang · Yunrun Yang · Mu Yuan · Lan Zhang · Yunhao Liu · Guoliang Xing

glxing@ie.cuhk.edu.hk · muyuan@cuhk.edu.hk

Full Data Privacy • Practical Throughput • Inference Equivalence

Experimental Results

Throughput vs. reverse-attack failure rate on Qwen3-1.7B with 64K context length. LOTUS uniquely occupies the top-right corner — high throughput and complete privacy.

Qwen3-1.7B · 64K context

Reverse Attack Visualization

An attacker intercepts GPU-visible states and attempts to reconstruct the original user prompt. With LOTUS, recovery fails completely.

Legend: correctly recovered · incorrectly recovered / garbled · original prompt

System Architecture

LOTUS operates in two phases: a one-time offline setup and a lightweight runtime masking pipeline.

Offline

🔑 Masking Keys Generation

Generate one-time masking keys β and π

β is a random vector for identity-flow masking (request-level one-time pad). π is a random permutation matrix for residual-flow masking (block-level one-time pad). Both are generated inside the TEE and never leave the secure boundary.
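A minimal sketch of what key generation could look like, using NumPy for illustration; the dimension and RNG here are toy assumptions, not the paper's implementation:

```python
import numpy as np

d = 8  # hidden dimension (toy size; real models use thousands)

rng = np.random.default_rng()       # in LOTUS these draws happen inside the TEE
beta = rng.standard_normal(d)       # additive one-time pad for the identity flow
perm = rng.permutation(d)           # random reordering of hidden dimensions
pi = np.eye(d)[perm]                # permutation matrix for the residual flow

# A permutation matrix is orthogonal, so unmasking is just a transpose.
assert np.allclose(pi @ pi.T, np.eye(d))
```

Because π is orthogonal, its inverse is free to compute, which is what keeps unmasking cheap at the TEE boundary.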

⚙️ Model Weights Transformation

Transform original weights and upload to the cloud

The original model weights are linearly transformed using the masking keys so that computation on masked hidden states produces correct masked outputs. The transformed weights are stored on the untrusted GPU — they reveal nothing about the keys.
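The commutation idea behind the weight transformation can be checked on a toy linear layer. This conjugation-by-π sketch is an illustrative stand-in, not the paper's exact transformation:

```python
import numpy as np

d = 8
rng = np.random.default_rng(0)
W = rng.standard_normal((d, d))     # toy linear-layer weight
h = rng.standard_normal(d)          # toy hidden state

pi = np.eye(d)[rng.permutation(d)]  # permutation mask (generated in the TEE)

# Conjugating the weight by pi lets it operate directly on masked states:
# (pi W pi^T)(pi h) = pi (W h), since pi^T pi = I.
W_masked = pi @ W @ pi.T
assert np.allclose(W_masked @ (pi @ h), pi @ (W @ h))
```

The transformed weight produces the correctly masked output from a masked input, so the GPU never needs the keys.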
Runtime

🛡️ Identity Flow Masking

Request-level one-time pad on the identity branch

h′ = h + β
In each Transformer block the residual (skip) connection carries an identity flow. LOTUS adds a fresh random vector β to the hidden states before they leave the TEE, making them indistinguishable from random noise on the GPU.
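A toy demonstration that the additive pad is lossless: subtracting β inside the TEE recovers the hidden state exactly (illustrative sketch, not production code):

```python
import numpy as np

d = 8
rng = np.random.default_rng(1)
h = rng.standard_normal(d)       # hidden state inside the TEE
beta = rng.standard_normal(d)    # fresh per-request pad

h_masked = h + beta              # what the untrusted GPU observes
h_back = h_masked - beta         # exact recovery inside the TEE
assert np.allclose(h_back, h)
```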

🔀 Residual Flow Masking

Block-level one-time pad on the residual branch

h′ = π · h
The attention and FFN sub-layers form the residual flow. LOTUS left-multiplies hidden states by a permutation matrix π, shuffling dimensions so that even if an attacker solves for the additive mask, the residual branch remains opaque.
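A toy check that permutation masking is also lossless, and that left-multiplying by π is equivalent to simple index shuffling (illustrative only):

```python
import numpy as np

d = 8
rng = np.random.default_rng(2)
h = rng.standard_normal(d)
perm = rng.permutation(d)

h_masked = h[perm]                     # same effect as np.eye(d)[perm] @ h
inv = np.argsort(perm)                 # inverse permutation
assert np.allclose(h_masked[inv], h)   # exact recovery inside the TEE
```

Implementing π as fancy indexing rather than a dense matmul is why this mask adds negligible runtime cost.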

Minor Optimizations

KV-cache reuse & pipeline parallelism

LOTUS supports standard KV-cache for decode-phase acceleration and is compatible with pipeline parallelism across multiple GPUs, enabling production-scale deployment without custom hardware.

Design-Space Comparison

LOTUS is the only method achieving all three desiderata simultaneously: strong data privacy, practical latency, and bit-identical inference.

Comparison axes: Method · Data Privacy · Practical · Inference Equivalence

Data privacy: every GPU-visible state is protected, and classifier attacks on collected traces fail. Practical: time per output token (TPOT) under 150 ms for a 32B model on commodity A100s. Inference equivalence: outputs are bit-identical to plaintext inference.

Latency Benchmarks

End-to-end latency comparison and scaling behavior across model sizes and sequence lengths.

Table A — Latency comparison on Qwen3-1.7B @ 16K tokens

★ Extrapolated from single-layer benchmark × 28 layers.

Table B — Latency scaling with model size and sequence length (LOTUS GPU-TEE, batch=1)

The overhead ratio decreases as the hidden dimension d or the sequence length N grows, confirming the predicted O(1/d) TEE-boundary cost.