Impact of URI Canonicalization on Memento Count

03/09/2017
by Mat Kelly, et al.

Quantifying the captures of a URI over time helps researchers identify the extent to which a Web page has been archived. Memento TimeMaps provide a format to list mementos (URI-Ms) for captures along with brief metadata, like Memento-Datetime, for each URI-M. However, when some URI-Ms are dereferenced, they simply provide a redirect to a different URI-M (instead of a unique representation at the datetime), often also present in the TimeMap. This implies that an accurate count of the non-forwarding captures for a URI-R cannot be confidently obtained using a TimeMap alone, and that the magnitude of a TimeMap is not equivalent to the number of representations it identifies. In this work we discuss this phenomenon in depth. We also perform a breakdown of the dynamics of counting mementos for a particular URI-R (google.com) and quantify the prevalence of the various canonicalization patterns that complicate attempts at counting using only a TimeMap. For google.com we found that 84.9% of the URI-Ms result in an HTTP redirect when dereferenced. We expand on and apply this metric to TimeMaps for seven other URI-Rs of large Web sites and thirteen academic institutions. Using a ratio metric DI for the number of URI-Ms without redirects to those requiring a redirect when dereferenced, five of the eight large Web sites' and two of the thirteen academic institutions' TimeMaps had a ratio of less than one, indicating that more than half of the URI-Ms in these TimeMaps result in redirects when dereferenced.
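To make the ratio concrete, the following is a minimal sketch of how a value like DI could be computed once each URI-M in a TimeMap has been dereferenced and its HTTP status code recorded. The function name `redirect_ratio`, the `REDIRECT_CODES` set, and the sample status codes are illustrative assumptions, not the authors' implementation or data.

```python
from typing import Iterable

# Assumption: these 3xx status codes are treated as "redirects".
# The paper's exact classification may differ.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def redirect_ratio(status_codes: Iterable[int]) -> float:
    """Ratio of URI-Ms without redirects to URI-Ms that redirect.

    A value below 1.0 means more than half of the dereferenced
    URI-Ms resulted in a redirect.
    """
    codes = list(status_codes)
    redirects = sum(1 for code in codes if code in REDIRECT_CODES)
    non_redirects = len(codes) - redirects
    if redirects == 0:
        # No redirects at all: the ratio is unbounded.
        return float("inf")
    return non_redirects / redirects


# Hypothetical status codes observed when dereferencing six URI-Ms.
sample = [200, 302, 302, 301, 200, 302]
ratio = redirect_ratio(sample)
print(f"ratio = {ratio:.2f}")
```

In this hypothetical sample, two URI-Ms return a representation directly and four redirect, so the ratio falls below one, the same condition the abstract reports for five of the eight large Web sites.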


