Willump: A Statistically-Aware End-to-end Optimizer for Machine Learning Inference

by Peter Kraft, et al.

Machine learning (ML) has become increasingly important and performance-critical in modern data centers. This has led to interest in model serving systems, which perform ML inference and serve predictions to end-user applications. However, most existing model serving systems approach ML inference as an extension of conventional data serving workloads and miss critical opportunities for performance. In this paper, we present Willump, a statistically-aware optimizer for ML inference that takes advantage of key properties of ML inference not shared by traditional workloads. First, ML models can often be approximated efficiently on many "easy" inputs by judiciously using a less expensive model for these inputs (e.g., not computing all the input features). Willump automatically generates such approximations from an ML inference pipeline, providing up to 4.1× speedup without statistically significant accuracy loss. Second, ML models are often used in higher-level end-to-end queries in an ML application, such as computing the top K predictions for a recommendation model. Willump optimizes inference based on these higher-level queries by up to 5.7× over naïve batch inference. Willump combines these novel optimizations with standard compiler optimizations and a computation graph-aware feature caching scheme to automatically generate fast inference code for ML pipelines. We show that Willump improves performance of real-world ML inference pipelines by up to 23×, with its novel optimizations giving 3.6-5.7× speedups over compilation. We also show that Willump integrates easily with existing model serving systems, such as Clipper.




