ORCID

0000-0003-4748-9176 (Jayanetti), 0000-0001-6498-7391 (Garg), 0000-0002-8267-3326 (Alam), 0000-0003-3749-8116 (Nelson), 0000-0002-2787-7166 (Weigle)

College

College of Sciences

Department

Computer Science

Graduate Level

Doctoral

Graduate Program/Concentration

Web Science and Digital Libraries

Publication Date

2023

DOI

10.25883/cr6a-sq47

Abstract

To identify robots and human users in web archives, we conducted a study using access logs from the Internet Archive’s (IA) Wayback Machine from 2012 (IA2012), 2015 (IA2015), and 2019 (IA2019), and from the Portuguese Web Archive (PT) from 2019 (PT2019). We identified user sessions in the access logs and classified them as human or robot based on their browsing behavior. In 2013, AlNoamany et al. [1] studied user access patterns using IA access logs from 2012. They established four web archive user access patterns: single-page access (Dip), access to the same page at multiple archive times (Dive), access to distinct web archive pages at about the same archive time (Slide), and access to a list of archived pages (TimeMaps) for a certain URL (Skim). They also determined that in the 2012 IA access logs, robots outnumbered humans by 10:1 in terms of sessions and 5:4 in terms of raw HTTP accesses. We extended their work by comparing detected robots and humans, their access patterns, and their temporal preferences between the two archives (IA vs. PT) and across three years of IA access logs (IA2012, IA2015, IA2019). Robots account for a larger share of requests in IA2012 (91%) and IA2015 (88%) than in IA2019 (70%), and for 98% of requests in PT2019. We found that robots are almost entirely limited to the Dip and Skim access patterns in IA2012 and IA2015, but exhibit all the patterns and their combinations in IA2019. We also investigated the temporal preferences of the users and discovered that both humans and robots favor web pages that have been archived recently.

[1] AlNoamany, Y., Weigle, M.C., Nelson, M.L.: Access patterns for robots and humans in web archives. In: JCDL ’13: Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries. pp. 339–348 (2013), https://dl.acm.org/doi/10.1145/2467696.2467722
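The four access patterns named in the abstract can be illustrated with a minimal sketch. This is not the authors' classifier: the function name `label_session`, the two URL shapes, and the choice to treat "about the same archive time" as "same calendar day" are all assumptions made here for illustration, and the pattern definitions are taken as worded in the abstract above.

```python
import re
from collections import defaultdict

# Assumed Wayback-style request paths (illustrative, not taken from the study):
#   /web/<14-digit timestamp>/<original URL>  -> one archived copy of a page
#   /web/*/<original URL>                     -> list of archived copies
MEMENTO = re.compile(r"^/web/(\d{14})[a-z_]*/(.+)$")
LISTING = re.compile(r"^/web/(?:\*|timemap/[a-z]+)/(.+)$")


def label_session(paths):
    """Return the set of pattern names a session's request paths exhibit."""
    times_by_url = defaultdict(set)  # original URL -> archive timestamps seen
    urls_by_day = defaultdict(set)   # archive day (YYYYMMDD) -> original URLs
    labels = set()
    for path in paths:
        match = MEMENTO.match(path)
        if match:
            timestamp, url = match.groups()
            times_by_url[url].add(timestamp)
            urls_by_day[timestamp[:8]].add(url)
        elif LISTING.match(path):
            labels.add("Skim")   # list of archived pages for a URL
    if len(paths) == 1 and times_by_url:
        labels.add("Dip")        # single-page access
    if any(len(times) > 1 for times in times_by_url.values()):
        labels.add("Dive")       # same page at multiple archive times
    if any(len(urls) > 1 for urls in urls_by_day.values()):
        labels.add("Slide")      # distinct pages, same archive day (assumption)
    return labels


session = [
    "/web/*/http://example.com/",
    "/web/20120101000000/http://example.com/",
    "/web/20190101000000/http://example.com/",
]
print(sorted(label_session(session)))  # ['Dive', 'Skim']
```

A session can carry more than one label, which mirrors the abstract's remark that robots in IA2019 exhibit the patterns in combination. Separating human sessions from robot sessions is a different step that this sketch does not attempt.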

Keywords

Web archiving, User access patterns, Web server logs, Web usage mining, Web robot detection

Disciplines

Other Computer Sciences | Science and Technology Studies

Robots Still Outnumber Humans in Web Archives in 2019, But Less Than in 2012

