scda: A Minimal, Serial-Equivalent Format for Parallel I/O

07/13/2023
by Tim Griesbach, et al.

We specify a file-oriented data format suitable for parallel, partition-independent disk I/O. Here, a partition refers to a disjoint and ordered distribution of the data elements between one or more processes. The format is designed such that the file contents are invariant under linear (i.e., unpermuted), parallel repartition of the data prior to writing, so a file written in parallel is indistinguishable from one written in serial. In the same vein, the file can be read on any number of processes that agree on any partition of the number of elements stored. In addition to the format specification, we propose an optional convention to implement transparent per-element data compression. The compressed data and metadata are layered inside ordinary format elements. Overall, we pay special attention to both human and machine readability. If pure ASCII data is written, or compressed data is reencoded to ASCII, the entire file including its header and sectioning metadata remains in ASCII. If binary data is written, the metadata stays easy on the human eye. We refer to this format as scda. Conceptually, it lies one layer below and is oblivious to the definition of variables, the binary representation of numbers, considerations of endianness, and self-describing headers, which may all be specified on top of scda. The main purpose of the format is to abstract any parallelism and provide sufficient structure as a foundation for generic and flexible archival and checkpoint/restart. A documented reference implementation is available as part of the general-purpose libsc free software library.
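The partition-independence property can be illustrated with a small sketch. It is not the scda format and does not use the libsc API: it assumes fixed-size elements, and the names `partition_offsets` and `write_partitioned` are hypothetical. Each simulated process writes its contiguous block of elements at a byte offset computed from the element counts of the preceding processes, so the resulting bytes do not depend on how the elements are partitioned.

```python
from itertools import accumulate


def partition_offsets(counts, elem_size):
    """Byte offset at which each process's contiguous block starts.

    counts[p] is the number of elements owned by process p (may be zero).
    """
    starts = [0] + list(accumulate(counts))[:-1]
    return [s * elem_size for s in starts]


def write_partitioned(elements, counts, elem_size):
    """Simulate every process writing its block at its own offset."""
    assert sum(counts) == len(elements)
    assert all(len(e) == elem_size for e in elements)
    buf = bytearray(len(elements) * elem_size)
    first = 0
    for count, offset in zip(counts, partition_offsets(counts, elem_size)):
        # Elements are kept in their global order (no permutation).
        block = b"".join(elements[first:first + count])
        buf[offset:offset + len(block)] = block
        first += count
    return bytes(buf)


# Six 4-byte elements, written once in serial and once on four processes.
elements = [bytes([65 + i]) * 4 for i in range(6)]
serial = write_partitioned(elements, [6], 4)
parallel = write_partitioned(elements, [2, 0, 3, 1], 4)
assert serial == parallel
```

In a real parallel setting the same offsets could be passed to a collective write at explicit file positions; the sketch only shows why agreeing on the element counts is enough to reproduce the serial byte layout.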


