Randomized Work Stealing versus Sharing in Large-scale Systems with Non-exponential Job Sizes
Work sharing and work stealing are two scheduling paradigms for redistributing work in distributed computations. In work sharing, processors attempt to migrate pending jobs to other processors in the hope of reducing response times. In work stealing, by contrast, underutilized processors attempt to steal jobs from other processors. Both paradigms generate communication overhead, and the question addressed in this paper is which of the two reduces the response time the most when both incur the same amount of communication overhead. Prior work presented explicit bounds, for large-scale systems, on when randomized work sharing outperforms randomized work stealing under Poisson arrivals and exponential job durations, and indicated that work sharing is best when the load is below ϕ − 1 ≈ 0.6180, where ϕ = (1 + √5)/2 is the golden ratio. In this paper we revisit this problem and study the impact of the job size distribution using a mean field model. We present an efficient method to determine the boundary between the regions where sharing or stealing is best for a given job size distribution, as well as bounds that apply to any (phase-type) job size distribution. The main insight is that work stealing benefits significantly from more variable job sizes, and work sharing may become inferior to work stealing for loads as small as 1/2 + ϵ for any ϵ > 0.
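As a rough illustration of the exponential-job-size threshold quoted above, the following minimal sketch computes ϕ − 1 and classifies a given load. The function name `sharing_is_best_exponential` and the constant names are hypothetical and chosen for this example only. The sketch encodes nothing beyond the stated prior-work rule "sharing is best when the load is below ϕ − 1"; it does not reproduce the mean field model or the bounds for general job size distributions.

```python
import math

# Golden ratio and the load threshold quoted for the exponential case.
PHI = (1 + math.sqrt(5)) / 2
EXP_THRESHOLD = PHI - 1  # approximately 0.6180, also equal to 1 / PHI


def sharing_is_best_exponential(load: float) -> bool:
    """Return True if the load lies below the quoted threshold phi - 1.

    Assumption: this only restates the prior-work rule for Poisson arrivals
    and exponential job durations. It says nothing about other job size
    distributions, for which the boundary can be as low as 1/2 + epsilon.
    """
    if not 0.0 < load < 1.0:
        raise ValueError("load must lie strictly between 0 and 1")
    return load < EXP_THRESHOLD


if __name__ == "__main__":
    for load in (0.30, 0.55, 0.70):
        best = "sharing" if sharing_is_best_exponential(load) else "stealing"
        print(f"load={load:.2f}: {best} (exponential job sizes)")
```

For more variable job sizes the abstract indicates that the region where stealing wins grows, so a load such as 0.55 that favors sharing in this exponential sketch need not favor it in general.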