MementoMap Framework for Flexible and Adaptive Web Archive Profiling

05/29/2019
by Sawood Alam, et al.

In this work we propose MementoMap, a flexible and adaptive framework to efficiently summarize the holdings of a web archive. We describe a simple, yet extensible, file format suitable for MementoMap. We used the complete index of Arquivo.pt, comprising 5B mementos (archived web pages/files), to understand the nature and shape of its holdings. We generated MementoMaps with varying amounts of detail from its HTML pages that have an HTTP status code of 200 OK. Additionally, we designed a single-pass, memory-efficient, and parallelization-friendly algorithm to compact a large MementoMap into a small one, and an in-file binary search method for efficient lookup. We analyzed more than three years of MemGator (a Memento aggregator) logs to understand the response behavior of 14 public web archives. We evaluated MementoMaps by measuring their Accuracy using 3.3M unique URIs from MemGator logs. We found that a MementoMap of less than 1.5% Relative Cost (as compared to the comprehensive listing of all the unique original URIs) can correctly identify the presence or absence of 60% of the lookup URIs in the corresponding archive while maintaining 100% Recall (i.e., zero false negatives).
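
To make the idea of an in-file binary search concrete, the following is a minimal Python sketch, not the implementation from the paper. It assumes a hypothetical plain-text file with one `key value` record per line, sorted byte-wise by key, and it handles exact key matches only. The function names, the sample keys, and the interpretation of the value field are illustrative assumptions.

```python
import os


def _line_after(f, offset):
    """Return the first complete line that starts after `offset` (or at 0)."""
    f.seek(offset)
    if offset > 0:
        f.readline()  # discard the line that contains `offset`
    return f.readline()  # b"" at end of file


def _key_of(line):
    """Extract the first whitespace-delimited field of a record."""
    fields = line.split(None, 1)
    return fields[0] if fields else b""


def lookup(path, key):
    """Binary-search a sorted `key value` file on disk without loading it.

    Assumption: lines are sorted byte-wise by key and keys are unique.
    Returns the value as a string, or None if the key is absent.
    """
    target = key.encode("utf-8")
    with open(path, "rb") as f:
        lo, hi = 0, os.fstat(f.fileno()).st_size
        # Find the smallest offset whose following line has key >= target.
        while lo < hi:
            mid = (lo + hi) // 2
            line = _line_after(f, mid)
            if not line or _key_of(line) >= target:
                hi = mid
            else:
                lo = mid + 1
        line = _line_after(f, lo)
    if line and _key_of(line) == target:
        fields = line.split(None, 1)
        return fields[1].strip().decode("utf-8") if len(fields) > 1 else ""
    return None
```

Each probe seeks to a byte offset and reads at most two lines, so the number of reads grows logarithmically with file size and memory use stays constant. A call such as `lookup("holdings.txt", "pt,arquivo)/")` would return the value stored for that key, where both the file name and the key are hypothetical.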


