Task Placement and Resource Allocation for Edge Machine Learning: A GNN-based Multi-Agent Reinforcement Learning Paradigm

by Yihong Li, et al.

Machine learning (ML) tasks are one of the major workloads in today's edge computing networks. Existing edge-cloud schedulers allocate the requested amounts of resources to each task, falling short of best utilizing the limited edge resources for ML tasks. This paper proposes TapFinger, a distributed scheduler for edge clusters that minimizes the total completion time of ML tasks by co-optimizing task placement and fine-grained multi-resource allocation. To learn the tasks' uncertain resource sensitivity and enable distributed scheduling, we adopt multi-agent reinforcement learning (MARL) and propose several techniques to make it efficient, including a heterogeneous graph attention network as the MARL backbone, a tailored task selection phase in the actor network, and the integration of Bayes' theorem and masking schemes. We first implement a single-task scheduling version, which schedules at most one task at a time. We then generalize to the multi-task scheduling case, in which a sequence of tasks is scheduled simultaneously. Our design can mitigate the expanded decision space and yield fast convergence to optimal scheduling solutions. Extensive experiments using synthetic and test-bed ML task traces show that TapFinger can achieve up to a 54.9% reduction in average task completion time and improve resource efficiency compared to state-of-the-art schedulers.



A Multi-Agent System Approach to Load-Balancing and Resource Allocation for Distributed Computing

In this research we use a decentralized computing approach to allocate a...

Themis: Fair and Efficient GPU Cluster Scheduling for Machine Learning Workloads

Modern distributed machine learning (ML) training workloads benefit sign...

Towards General Distributed Resource Selection

The advantages of distributing workloads and utilizing multiple distribu...

Optimizing Memory Mapping Using Deep Reinforcement Learning

Resource scheduling and allocation is a critical component of many high ...

Collaborative Learning-Based Scheduling for Kubernetes-Oriented Edge-Cloud Network

Kubernetes (k8s) has the potential to coordinate distributed edge resour...

OL4EL: Online Learning for Edge-cloud Collaborative Learning on Heterogeneous Edges with Resource Constraints

Distributed machine learning (ML) at network edge is a promising paradig...

Ease.ml: Towards Multi-tenant Resource Sharing for Machine Learning Workloads

We present ease.ml, a declarative machine learning service platform we b...
