The bottleneck has inverted: GPU is no longer the slow part
The thesis behind pplx-unigram is uncomfortable for anyone who has spent the last three years tuning GPU kernels: for small rerankers and embedders, the GPU is no longer the bottleneck. Perplexity's own framing is that these models 'run in single-digit milliseconds on GPU, making CPU tokenization a meaningful share of total latency' [1]. When a forward pass takes 3-5 ms and the tokenizer takes 0.35 ms, the tokenizer is suddenly roughly 7-10% of end-to-end latency on a per-call basis. Its share of the wall-clock cost is much larger once you account for the fact that tokenization runs on CPU cores, which offer far less parallelism than the GPU.
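The per-call share is easy to check by hand. The sketch below assumes a simplified model in which end-to-end latency is just tokenization plus the forward pass, run back to back; network, queuing, and serialization time are ignored. The function name `tokenizer_share` is made up for illustration, and the inputs are the 0.35 ms and 3-5 ms figures quoted above.

```python
def tokenizer_share(tokenize_ms: float, forward_ms: float) -> float:
    """Fraction of (tokenize + forward) latency spent in the tokenizer.

    Assumes the two stages run serially and nothing else contributes.
    """
    return tokenize_ms / (tokenize_ms + forward_ms)


TOKENIZE_MS = 0.35  # tokenizer latency quoted in the text

for forward_ms in (3.0, 5.0):  # the 3-5 ms forward-pass range from the text
    share = tokenizer_share(TOKENIZE_MS, forward_ms)
    print(f"forward={forward_ms} ms -> tokenizer share {share:.1%}")
```

Under this model the share works out to about 10.4% for a 3 ms forward pass and about 6.5% for a 5 ms one, which is where the rough 7-10% range comes from. Any additional latency in the request path would lower these percentages, so they are best read as an upper bound for the per-call view.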
This inversion explains why a team optimizing trillion-parameter inference would bother shaving microseconds off a preprocessing step. In the steady state of a high-QPS reranker fleet, CPU-side tokenization competes with request handling, serialization, and routing for the same cores. Perplexity reports a 5-6x reduction in CPU utilization in production after the rewrite [3], which is less a story about a faster algorithm and more a story about giving cores back to the application. The pplx-garden release sits inside the same repo as fabric-lib (RDMA TransferEngine) and the p2p-all-to-all MoE primitives [2], which suggests this is the CPU-side complement to the GPU and network work the team has already shipped.


