Introduction to Web Archiving
Anna Perricci
Webinar for METRO
February 16, 2021
What is web archiving?
The collection, management and
stewardship of web resources
in a stable form (e.g. a WARC file)
that can be accessed over time
independently from the original
Why archive web resources?
To accurately represent experiences and materials
created in the twenty-first century, select websites and
other web-based resources should be captured, stored,
managed, described, and made accessible
Online spaces and web-based media are crucial elements
in some current events and crises; a lot of very
important information would be lost or omitted in the
absence of web archiving
Why archive web resources (continued)?
• Content available primarily or solely online is among
the most at-risk born-digital materials
• Websites that can be collected are freely and widely
available to anyone at some time but can vanish at the
volition of the site owner and/or service provider
• As with other digital materials, web content is far
more vulnerable to loss than information
contained in most analog media
[A few more] Why archive web resources?
• Curated collections of web archives can be a valuable
part of collection development
• Some resources that used to be published and
distributed on paper are now only available online
• Examples include:
• Course catalogs (!)
• Reports
• Publicity materials, e.g. for exhibitions and events,
press kits, brochures
Web archiving is a multi-step process
• Collection development and planning
• Selection
• Permissions / Ethical review
• Collecting / Harvesting
• Description
• Access
• Long-term preservation
Is the visual Context also Content?
– CONTEXT → Content
– Are the visual elements or interactive features
important, defining, or non-essential?
– Is the experience of usage essential to capture?
– For example, is a resource more like a course
catalog or an interactive publication?
– Would anyone truly care about the path of access
enough to prioritize it?
– eBooks created with specific frameworks to weave
together information, however, are different
– Social media is HARD
What [IMHO] is NOT web archiving?
– Static screenshots / non-interactive fixed images /
screen recordings of interaction with the site
– There is room for these as supplemental materials
– Stockpiling without any specific strategy for selection,
management and preservation
– For example, YouTube is not a web archive unless
your collection development plan is to document
an enormous, un-curated mass of data
– Also this would be nearly impossible to steward
and make accessible on an enduring basis (e.g.
financial cost, environmental impact)
– Using, or capturing, web spaces employed as a
PLATFORM or environment for sharing digital archives
In the absence of ideal tools*
– * Ideal tools - fully functional, easy to use, open source,
sustainable, well maintained, tested, widely accessible
& affordable with transparent pricing
– What can you afford?
– What is good enough? For now? Longer term?
– Why are you doing this?
– Who is this for and for how long into the future?
Some essential terms
– Crawler / spider / robot
– Automated software that traverses web pages per
directions from a human (for indexing or capture)
– Human scale / browser-based web collecting
– Collecting that is guided by a human in real time
through a web browser; the result is an interactive
web archive, not a screen recording of the process
– Seed URL
– Starting point for collecting (can be at a domain,
directory or page level)
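
The three seed levels above (domain, directory, page) can be read as scoping rules that tell a crawler which links to follow. A minimal sketch, assuming a hypothetical `in_scope` helper and simplified matching rules; it does not reproduce the scoping behavior of any particular crawler, and the example paths under metro.org are invented for illustration:

```python
from urllib.parse import urlsplit


def in_scope(seed: str, candidate: str, level: str) -> bool:
    """Return True if `candidate` falls within the scope implied by `seed`.

    `level` is "domain", "directory" or "page". These rules are a
    simplified illustration, not the behavior of a specific tool.
    """
    s, c = urlsplit(seed), urlsplit(candidate)
    seed_host = (s.hostname or "").lower()
    cand_host = (c.hostname or "").lower()
    seed_path = s.path or "/"
    cand_path = c.path or "/"

    if level == "domain":
        # Same host as the seed, or a subdomain of that host
        return cand_host == seed_host or cand_host.endswith("." + seed_host)

    if cand_host != seed_host:
        return False

    if level == "directory":
        # Everything at or below the directory that contains the seed
        if seed_path.endswith("/"):
            directory = seed_path
        else:
            directory = seed_path.rsplit("/", 1)[0] + "/"
        return cand_path.startswith(directory)

    if level == "page":
        # Only the seed page itself
        return cand_path == seed_path

    raise ValueError(f"unknown level: {level}")


# Hypothetical seed and candidate URLs
seed = "http://www.metro.org/events/index.html"
print(in_scope(seed, "http://www.metro.org/events/2021/talk.html", "directory"))
print(in_scope(seed, "http://www.metro.org/about/", "directory"))
```

A narrower level keeps a capture small and focused; a broader one risks collecting far more than the collection plan calls for.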
Some more essential terms
– WARC (file)
– ISO standard file format for web archives
– Fidelity / quality:
– Similarity to original (e.g. look, functionality)
– Significant properties:
– Defining features of an object or resource – what
about this thing makes it what it is [and
distinguishes it from other things]
– Will illustrate in slides later
Frequent web archiving project genesis
Archivist: ‘I was just informed that an essential web
resource is about to be taken down/deleted.
Soon! Within weeks or next month.’
– How do I save a functional copy for future use?
– Can I do this in time (within a month or so)?
Advocacy within an organization
Administrator: ‘What do you mean web-based content
isn’t just saved [with full fidelity] automatically?
Doesn’t the Internet Archive have a copy?’
– By all means check the Internet Archive but view
captures critically
– Does this capture accurately represent the original?
Why/why not? If so, can you get a copy?
– Advocacy is hard, but leverage the training materials
available. Explaining the limits of web archiving
capabilities in an encouraging way is difficult but
necessary for expectation management
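
Checking whether the Internet Archive holds a capture can be done by hand in the Wayback Machine, or scripted. A minimal sketch that only builds the query and does not send it, assuming the Internet Archive's public availability endpoint at `https://archive.org/wayback/available` with `url` and `timestamp` parameters; the helper name `availability_query` is hypothetical, and the endpoint details should be confirmed against the Internet Archive's own documentation before relying on them:

```python
from urllib.parse import urlencode

# Assumed endpoint for the Wayback Machine availability lookup
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(url: str, timestamp: str = "") -> str:
    """Build (but do not send) a lookup for the capture closest to `timestamp`."""
    params = {"url": url}
    if timestamp:
        # Digits only, e.g. "20080214" for 14 February 2008
        params["timestamp"] = timestamp
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"


print(availability_query("http://www.metro.org/", "20080214"))
```

Finding that a capture exists is only the first step; it still has to be opened and viewed critically, as the bullets above stress.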
Sharing responsibilities
– Collecting strategy and establishment of priorities for
collection development could be a group effort
– Contributions could include
– Suggest URLs
– Liaise with site owners to solicit permission to
archive websites
– Governance of collaborations if multiple institutions
are involved
– Detailed quality assurance through browsing the
archived website as a user would (e.g. try to access
media files to ensure they have been successfully
captured)
– Assessment of efficacy for users?
Let’s go! Collection Development
aka Why are we doing this??
• Thinking within any existing collecting policy as well as
thinking through what makes sense for you/your
institution with the tools and resources at hand
• Careful consideration and plenty of questions
• Why collect websites (needs, collection scope)?
• What to collect?
• How/what tools to use?
• When: how often to collect & when will these
materials be used?
Collecting + Ethics
– Discussions of ethics are not an afterthought;
remember that some things could be made private or
embargoed if needed
– Who is at risk? What is the potential for harm?
– High risk
– Low risk
– Do creators understand implications of their content
being collected?
– Archives are useful as evidence – who could
leverage that evidence and for what purpose?
– Intellectual property / rights of creators
Next: Collecting materials
– With a plan in mind, browse the live web and/or make a
list of sites you want
– Depending on tools available and associated skills,
collect some resources
– Review what you got – is this what you expected to
get? If not, is it close enough?
– If you did not get what you expected / need, next steps:
contact vendor, tool maker or someone likely to have
the skills to help you troubleshoot
Testing is boring & tedious & entirely necessary
• ‘Set it and forget it’ is not recommended
• It’s boring and tedious, but review your captures please
• If you don’t test your captures you have no basis to
expect you collected materials with adequate “fidelity”
• Fidelity is perceived as correlating with accurate
representation of the resource and the information
contained therein
• Perfection is not attainable but better is better
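
Part of this review can be supported by a simple script, although it does not replace browsing the capture as a user would. A minimal sketch, assuming you can export a list of the URLs you expected to collect and a list of the URLs your tool actually captured; the function name `capture_report` and the example URLs are hypothetical:

```python
def capture_report(expected_urls, captured_urls):
    """Compare the URLs you expected to capture with those actually captured."""
    expected = set(expected_urls)
    captured = set(captured_urls)
    missing = sorted(expected - captured)
    # Share of expected URLs that were captured (1.0 when nothing was expected)
    coverage = (len(expected) - len(missing)) / len(expected) if expected else 1.0
    return {"missing": missing, "coverage": coverage}


# Hypothetical lists for illustration
expected = [
    "http://www.metro.org/",
    "http://www.metro.org/about/",
    "http://www.metro.org/forms/membership.pdf",
    "http://www.metro.org/media/intro.mp4",
]
captured = [
    "http://www.metro.org/",
    "http://www.metro.org/about/",
    "http://www.metro.org/forms/membership.pdf",
]

report = capture_report(expected, captured)
print(report["missing"])
print(report["coverage"])
```

A URL being present in a capture says nothing about whether it renders or plays correctly, so a report like this narrows down where to look rather than certifying fidelity.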
Description, Access & Preservation
– Description, Access & Preservation are necessary to
consider and factor in but we do not have enough time
to explore these essential elements today
Significant Properties & Assessment of Quality
Significant properties of this image
Significant properties of this image
Are enough significant properties present? How do the views vary?
What is or is not good enough is your call!
https://web.archive.org/web/20080214064959/http://www.metro.org/
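
A Wayback Machine address like the one above carries both the capture time and the original URL. A minimal sketch of pulling the two apart, assuming the common address form with a 14-digit timestamp (generally understood to be in UTC); the helper name `parse_wayback_url` and the pattern are illustrative and do not cover every address variant:

```python
import re
from datetime import datetime

# 14-digit timestamp, optional modifier such as "id_", then the original URL
WAYBACK_PATTERN = re.compile(
    r"^https?://web\.archive\.org/web/(\d{14})[a-z_]*/(.+)$"
)


def parse_wayback_url(url: str):
    """Split a Wayback Machine URL into (capture datetime, original URL)."""
    match = WAYBACK_PATTERN.match(url)
    if match is None:
        raise ValueError(f"not a recognised Wayback URL: {url}")
    timestamp, original = match.groups()
    return datetime.strptime(timestamp, "%Y%m%d%H%M%S"), original


captured_at, original = parse_wayback_url(
    "https://web.archive.org/web/20080214064959/http://www.metro.org/"
)
print(captured_at.isoformat(), original)
```

Recording the capture date alongside the original address is useful when comparing several views of the same site and judging whether enough significant properties are present in each.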
Are enough significant properties present?
Are enough significant properties present?
Are enough significant properties present?
Remember!
– Despite a lack of perfect solutions, materials on the
web are too important to give up on collecting,
managing and preserving them via web archiving
– What is or is not good enough is your call (up to a point)
– Is this enough to meet the established purpose(s)?
– Something is better than nothing as long as that
‘something’ has been gathered with intent and
managed (stewarded) adequately
Upcoming!
– Getting Started with Web Archiving – March 2, 2021
– Featuring presenters working on Archipelago as well as
the team at Carnegie Hall Archives!
Coming soon!
– Web Archiving Ethics and Implications
– Tools to ‘Do’ Web Archiving
– Learning from Long-term Leading Web Archiving
Initiatives
Thank you!
Up next:
Q&A
Thank you!
Anna.Perricci@gmail.com
Perricci Consulting
@AnnaPerricci (Twitter)
Q&A
– Anything you’d like to ask
– Follow up from presentation topics?
– Local issues and challenges?
If it’s quiet on the line…
Small collection examples
Bonus round
Indianapolis Museum of Art → Newfields
– Rebranding → sudden need to collect the website before
it was taken offline
– Motivated archivist who figured out local deployment
of Webrecorder
– Good collection made
– Grateful peers, e.g. because key forms and PDFs on the
prior website were not lost and instead were easy to
find in the web archive!
Stanford University Press
Digital projects associate: ‘Our complex publications are
cutting edge and will have a limited lifespan most likely
(lots of technical dependencies).
How can we make sure they are an enduring resource?
How do we explain challenges and benefits to
administrators and funders?’
– Pilot partnership with Webrecorder team – mutual
benefit
– Hands-on work and dialog; custom development beta
(Scalar)
Use Case: Journalists!
Journalist: ‘There’s some wild stuff online I will be
referencing in my journalistic or academic writings.
I need to cite my sources to write a credible article.’
– Getting past screenshots
– What’s the benefit of something more complex than
screenshots?
– Ongoing credibility, evidence
Pelican Bomb
Editors/founders: ‘Our publication is closing. We did good
work and want it to have continued impact. What do we
do?’
– Time limited pilot partnership with Webrecorder team –
mutual benefit
– Work plan formed but primary funder did not buy in so
limited implementation
– Stakeholders: ‘now that we know the benefits of web
archiving we realize there are others in our communities
who need digital preservation help’
– Outreach, including workshop at Common Ground
Convening
Enduring value of Pelican Bomb
– Events calendar
– Editorial work
– Community Supported Art program
