InferLine: ML Inference Pipeline Composition Framework

by   Daniel Crankshaw, et al.

The dominant cost in production machine learning workloads is not training individual models but serving predictions from increasingly complex prediction pipelines spanning multiple models, machine learning frameworks, and parallel hardware accelerators. Due to the complex interaction between model configurations and parallel hardware, prediction pipelines are challenging to provision and costly to execute when serving interactive latency-sensitive applications. This challenge is exacerbated by the unpredictable dynamics of bursty workloads. In this paper we introduce InferLine, a system which efficiently provisions and executes ML inference pipelines subject to end-to-end latency constraints by proactively optimizing and reactively controlling per-model configuration in a fine-grained fashion. Unpredictable changes in the serving workload are dynamically and cost-optimally accommodated with minimal service level degradation. InferLine introduces (1) automated model profiling and pipeline lineage extraction, (2) a fine-grain, cost-minimizing pipeline configuration planner, and (3) a fine-grain reactive controller. InferLine is able to configure and deploy prediction pipelines across a wide range of workload patterns and latency goals. It outperforms coarse-grained configuration alternatives by up 7.6x in cost while achieving up to 32x lower SLO miss rate on real workloads and generalizes across state-of-the-art model serving frameworks.


page 1

page 2

page 3

page 4


MLProxy: SLA-Aware Reverse Proxy for Machine Learning Inference Serving on Serverless Computing Platforms

Serving machine learning inference workloads on the cloud is still a cha...

Willump: A Statistically-Aware End-to-end Optimizer for Machine Learning Inference

Machine learning (ML) has become increasingly important and performance-...

IPA: Inference Pipeline Adaptation to Achieve High Accuracy and Cost-Efficiency

Efficiently optimizing multi-model inference pipelines for fast, accurat...

SOLIS – The MLOps journey from data acquisition to actionable insights

Machine Learning operations is unarguably a very important and also one ...

PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems

Machine Learning models are often composed of pipelines of transformatio...

Cocktail: Leveraging Ensemble Learning for Optimized Model Serving in Public Cloud

With a growing demand for adopting ML models for a varietyof application...

Understanding the Benefits of Hardware-Accelerated Communication in Model-Serving Applications

It is commonly assumed that the end-to-end networking performance of edg...

Please sign up or login with your details

Forgot password? Click here to reset