GPU-based Private Information Retrieval for On-Device Machine Learning Inference

by Maximilian Lam, et al.

On-device machine learning (ML) inference can enable the use of private user data on user devices without remote servers. However, a pure on-device solution to private ML inference is impractical for many applications that rely on embedding tables that are too large to be stored on-device. To overcome this barrier, we propose the use of private information retrieval (PIR) to efficiently and privately retrieve embeddings from servers without sharing any private information during on-device ML inference. As off-the-shelf PIR algorithms are usually too computationally intensive to directly use for latency-sensitive inference tasks, we 1) develop a novel algorithm for accelerating PIR on GPUs, and 2) co-design PIR with the downstream ML application to obtain further speedup. Our GPU acceleration strategy improves system throughput by more than 20× over an optimized CPU PIR implementation, and our co-design techniques obtain over 5× additional throughput improvement at fixed model quality. Together, on various on-device ML applications such as recommendation and language modeling, our system on a single V100 GPU can serve up to 100,000 queries per second – a >100× throughput improvement over a naively implemented system – while maintaining model accuracy, and limiting inference communication and response latency to within 300KB and <100ms respectively.
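To make the server-side workload concrete, the sketch below illustrates the linear structure underlying many PIR schemes: the client sends a selection vector for the desired embedding row, and the server's answer is a matrix-vector product over the embedding table. This is a hypothetical simplification for illustration only: in an actual PIR protocol the selection vector is homomorphically encrypted so the server never learns the index, whereas here it is shown in the clear to expose only the matrix-vector computation that GPU acceleration targets.

```python
# Toy sketch of the linear structure underlying many PIR schemes.
# NOTE (assumption): a real protocol encrypts the query vector
# homomorphically; here it is sent in the clear purely to show the
# server's per-query matrix-vector workload.

def pir_query(index, num_rows):
    """Client: build a one-hot selection vector for the desired row."""
    return [1 if i == index else 0 for i in range(num_rows)]

def pir_answer(table, query):
    """Server: dot the selection vector with each column of the
    embedding table -- one matrix-vector product per query."""
    dim = len(table[0])
    return [sum(q * row[d] for q, row in zip(query, table))
            for d in range(dim)]

# Usage: privately-structured retrieval of row 2 of a small table.
table = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
answer = pir_answer(table, pir_query(2, len(table)))
```

Because each query reduces to the same dense linear-algebra kernel, batching many queries turns the server's work into a matrix-matrix product, which is exactly the shape of computation GPUs handle well.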

