Hindsight Logging for Model Training

by Rolando Garcia, et al.

Due to the long time-lapse between the triggering and detection of a bug in the machine learning lifecycle, model developers favor data-centric logfile analysis over traditional interactive debugging techniques. But when useful execution data is missing from the logs after training, developers have little recourse beyond re-executing training with more logging statements, or guessing. In this paper, we present hindsight logging, a novel technique for efficiently querying ad-hoc execution data, long after model training. The goal of hindsight logging is to enable analysis of past executions as if the logs had been exhaustive. Rather than materialize logs up front, we draw on the idea of physiological database recovery, and adapt it to arbitrary programs. Developers can query the state in past runs of a program by adding arbitrary log statements to their code; a combination of physical and logical recovery is used to quickly produce the output of the new log statements. We implement these ideas in Flor, a record-replay system for hindsight logging in Python. We evaluate Flor's performance on eight different model training workloads from current computer vision and NLP benchmarks. We find that Flor replay achieves near-ideal scale-out and order-of-magnitude speedups in replay, with just 1.47% average runtime overhead from record.
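
The record-replay idea in the abstract can be illustrated with a toy sketch. This is not Flor's API or implementation. Every name here (`train_epoch`, `record`, `replay`, `log_grad`) is hypothetical, and the "model" is a single float. The sketch assumes that training is deterministic given the state at the start of an epoch, so restoring a checkpoint (the physical step) and re-running that epoch's code (the logical step) reproduces the values a log statement would have seen.

```python
import copy
import random


def train_epoch(state, epoch, log=None):
    """Run one epoch of a toy training loop (hypothetical stand-in for real training).

    The per-epoch seed makes re-execution deterministic, which replay relies on.
    """
    rng = random.Random(epoch)
    for step in range(5):
        grad = state["w"] - 3.0 + rng.uniform(-0.1, 0.1)
        state["w"] -= 0.1 * grad
        if log is not None:
            # A log statement added in hindsight; it is absent during record.
            log(epoch, step, grad)
    return state


def record(num_epochs):
    """Record phase: train without logging, checkpointing state at each epoch start."""
    state = {"w": 0.0}
    checkpoints = {}
    for epoch in range(num_epochs):
        checkpoints[epoch] = copy.deepcopy(state)
        train_epoch(state, epoch)
    return state, checkpoints


def replay(checkpoints, epoch, log):
    """Replay phase: restore the checkpoint, then re-execute only the requested epoch."""
    state = copy.deepcopy(checkpoints[epoch])
    return train_epoch(state, epoch, log=log)


final_state, checkpoints = record(num_epochs=4)

# Long after training, ask for the gradients seen during epoch 2.
hindsight_log = []


def log_grad(epoch, step, grad):
    hindsight_log.append((epoch, step, grad))


replayed_state = replay(checkpoints, epoch=2, log=log_grad)
```

In this sketch each epoch is replayed from its own checkpoint, without re-running earlier epochs, so different epochs could in principle be replayed by separate workers. That is one plausible reading of how replay can scale out, stated here as an assumption rather than as a description of Flor's internals.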
