Multi-model Machine Learning Inference Serving with GPU Spatial Partitioning

by Seungbeom Choi, et al.

As machine learning techniques are applied to a widening range of applications, high-throughput machine learning (ML) inference servers have become critical for online services. Such ML inference servers pose two challenges: first, they must provide bounded latency for each request to satisfy consistent service-level objectives (SLOs), and second, they may serve multiple heterogeneous ML models in one system, since certain tasks involve invoking multiple models and consolidating models can improve system utilization. To address these two requirements, this paper proposes a new scheduling framework for multi-model ML inference servers. The paper first shows that, under SLO constraints, current GPUs are not fully utilized by ML inference tasks. To maximize the resource efficiency of inference servers, the key mechanism proposed in this paper is to exploit hardware support for spatial partitioning of GPU resources. The partitioning mechanism creates a new abstraction layer of configurable GPU resources: the scheduler assigns requests to virtual GPUs, called gpu-lets, sized with the most effective amount of resources. The paper also investigates a remedy for potential interference effects when two ML tasks run concurrently on a GPU. Our prototype implementation shows that spatial partitioning enhances throughput by 102.6% on average while satisfying SLOs.
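To make the gpu-let idea concrete, here is a minimal Python sketch, not the authors' implementation, of the two pieces the abstract describes: spatial partitioning of a GPU's SMs and a scheduler that assigns each request to the smallest partition whose profiled latency still meets the SLO. The CUDA_MPS_ACTIVE_THREAD_PERCENTAGE environment variable is a real NVIDIA MPS knob for capping a process's SM share; everything else (class names, the pick_gpulet policy, the profile numbers) is a hypothetical illustration.

```python
import os
import subprocess
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class GpuLet:
    """A hypothetical 'gpu-let': a fixed share of one GPU's SMs."""
    gpu_id: int   # physical GPU index
    sm_pct: int   # share of SMs, e.g. 20, 40, 60, 100
    busy: bool = False

# Hypothetical offline profile: p99 latency (ms) of each model when run
# on a partition of the given size. Real systems measure this beforehand.
PROFILE: Dict[str, Dict[int, float]] = {
    "resnet50": {20: 48.0, 40: 26.0, 60: 19.0, 100: 14.0},
    "bert":     {20: 95.0, 40: 51.0, 60: 38.0, 100: 27.0},
}

def pick_gpulet(model: str, slo_ms: float,
                pool: List[GpuLet]) -> Optional[GpuLet]:
    """Greedy fit: the smallest idle partition whose profiled latency
    meets the SLO, leaving larger partitions free for heavier models."""
    candidates = [g for g in pool if not g.busy
                  and PROFILE[model].get(g.sm_pct, float("inf")) <= slo_ms]
    return min(candidates, key=lambda g: g.sm_pct) if candidates else None

def launch_on_gpulet(g: GpuLet, cmd: List[str]) -> subprocess.Popen:
    """Start an inference worker confined to the gpu-let's SM share.
    Assumes an MPS control daemon is already running on the target GPU."""
    env = dict(os.environ,
               CUDA_VISIBLE_DEVICES=str(g.gpu_id),
               CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=str(g.sm_pct))
    g.busy = True
    return subprocess.Popen(cmd, env=env)

if __name__ == "__main__":
    pool = [GpuLet(0, 40), GpuLet(0, 60), GpuLet(1, 100)]
    g = pick_gpulet("resnet50", slo_ms=30.0, pool=pool)
    print(f"chosen gpu-let: GPU {g.gpu_id}, {g.sm_pct}% of SMs" if g
          else "no partition can meet the SLO")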
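The smallest-fit policy reflects the abstract's point that SLO-constrained inference underutilizes a full GPU: giving each model only as many SMs as it needs to stay within its SLO leaves the remaining resources available to consolidate other models on the same device.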


PARIS and ELSA: An Elastic Scheduling Algorithm for Reconfigurable Multi-GPU Inference Servers

In cloud machine learning (ML) inference systems, providing low latency ...

Green Carbon Footprint for Model Inference Serving via Exploiting Mixed-Quality Models and GPU Partitioning

This paper presents a solution to the challenge of mitigating carbon emi...

GPU-based Private Information Retrieval for On-Device Machine Learning Inference

On-device machine learning (ML) inference can enable the use of private ...

The OoO VLIW JIT Compiler for GPU Inference

Current trends in Machine Learning (ML) inference on hardware accelerate...

Perseus: Characterizing Performance and Cost of Multi-Tenant Serving for CNN Models

Deep learning models are increasingly used for end-user applications, su...

GPU-enabled Function-as-a-Service for Machine Learning Inference

Function-as-a-Service (FaaS) is emerging as an important cloud computing...

Exploration of Systolic-Vector Architecture with Resource Scheduling for Dynamic ML Workloads

As artificial intelligence (AI) and machine learning (ML) technologies d...
