Puppeteer: A Random Forest-based Manager for Hardware Prefetchers across the Memory Hierarchy

by   Furkan Eris, et al.

Over the years, processor throughput has steadily increased. However, the memory throughput has not increased at the same rate, which has led to the memory wall problem in turn increasing the gap between effective and theoretical peak processor performance. To cope with this, there has been an abundance of work in the area of data/instruction prefetcher designs. Broadly, prefetchers predict future data/instruction address accesses and proactively fetch data/instructions in the memory hierarchy with the goal of lowering data/instruction access latency. To this end, one or more prefetchers are deployed at each level of the memory hierarchy, but typically, each prefetcher gets designed in isolation without comprehensively accounting for other prefetchers in the system. As a result, individual prefetchers do not always complement each other, and that leads to lower average performance gains and/or many negative outliers. In this work, we propose Puppeteer, which is a hardware prefetcher manager that uses a suite of random forest regressors to determine at runtime which prefetcher should be ON at each level in the memory hierarchy, such that the prefetchers complement each other and we reduce the data/instruction access latency. Compared to a design with no prefetchers, using Puppeteer we improve IPC by 46.0 and 11.9 SPEC2017, SPEC2006, and Cloud suites with  10KB overhead. Moreover, we also reduce the number of negative outliers by over 89 the worst-case negative outlier from 25 state-of-the-art.


page 1

page 12


Custom Tailored Suite of Random Forests for Prefetcher Adaptation

To close the gap between memory and processors, and in turn improve perf...

Efficient Instruction Scheduling using Real-time Load Delay Tracking

Many hardware structures in today's high-performance out-of-order proces...

Ithemal: Accurate, Portable and Fast Basic Block Throughput Estimation using Deep Neural Networks

Statically estimating the number of processor clock cycles it takes to e...

Efficient Characterization of Hidden Processor Memory Hierarchies

A processor's memory hierarchy has a major impact on the performance of ...

Enabling Virtual Memory Research on RISC-V with a Configurable TLB Hierarchy for the Rocket Chip Generator

The Rocket Chip Generator uses a collection of parameterized processor c...

The Renewed Case for the Reduced Instruction Set Computer: Avoiding ISA Bloat with Macro-Op Fusion for RISC-V

This report makes the case that a well-designed Reduced Instruction Set ...

Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking

Every year, novel NVIDIA GPU designs are introduced. This rapid architec...

Please sign up or login with your details

Forgot password? Click here to reset