Finding old web content

Summary

This article explains how to find old web content using the Internet Archive Wayback Machine.

Environment

LSA (really any) web sites

Directions

Have you ever forgotten to grab content from your old web site — be it AEM, OpenText (Vignette), cPanel, WordPress, webapps, or something else — before that site went away? This article tells you how you might get it back!

Web Services has seen a few tickets that are effectively "My old site went away and I forgot to grab something." This article explains how to recover that content yourself.

First, the "old site" in question could be any web site. That includes the current and past LSA web service offerings: Faculty, Lab, and Research Web Sites (FLR aka WordPress); Unit Web Sites (AEM now, formerly OpenText and Vignette); Web Hosting Environment (WHE aka cPanel); and Windows Web Application Hosting (WWAH aka webapps). It also includes other U-M and non-U-M sites.

When the unit web sites were transitioned from OpenText to AEM between July 2015 and August 2016, units were given the option to archive their old site. That option had two gaps: some units chose not to archive their sites, and the archive process only captured published (live) web pages and the documents they linked to. Any unpublished content, and any published files that were not linked from the web site, would have been missed. These archives also existed only for unit web sites, not for the other web services LSA offers.

Luckily for our purposes, there's a public Internet archiving service, the Internet Archive Wayback Machine, that may have captured any publicly viewable pages. That includes pages that were unpublished but still visible through the old editing interface (edit.lsa.umich.edu), as long as they were not hidden behind Weblogin or another authentication mechanism. You can search that archive on your own, without opening a ticket with Web Services:

1. Go to http://archive.org/web in your web browser.
2. Enter the URL of what you're trying to find:
   - The more specific the better.
   - When in doubt, move up a level.
3. Use the on-screen controls to select a snapshot based on its date and time.
4. Browse to the page.
5. Copy or download what you need.
6. Paste or import it into the new site if desired.
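
If you have many pages to check, the lookup in the steps above can also be scripted. The following is a minimal Python sketch, not an official Web Services tool. It assumes the Internet Archive's public availability API at https://archive.org/wayback/available, which this article does not otherwise cover, and the function names are made up for illustration. It asks for the snapshot closest to a given date and, if nothing is found, moves up one level at a time.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the Internet Archive's public "availability" endpoint.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_availability_url(page_url, timestamp=None):
    """Build the query URL asking for the snapshot closest to a date."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp  # YYYYMMDD or YYYYMMDDhhmmss
    return AVAILABILITY_ENDPOINT + "?" + urllib.parse.urlencode(params)


def closest_snapshot(response):
    """Return (snapshot_url, timestamp) from a parsed reply, or None."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["url"], closest["timestamp"]


def parent_urls(page_url):
    """Yield the page's address, then each parent level up to the site root.

    Query strings and trailing slashes are dropped along the way.
    """
    parts = urllib.parse.urlsplit(page_url)
    segments = [s for s in parts.path.split("/") if s]
    for depth in range(len(segments), -1, -1):
        path = "/" + "/".join(segments[:depth])
        yield urllib.parse.urlunsplit((parts.scheme, parts.netloc, path, "", ""))


def find_snapshot(page_url, timestamp=None):
    """Try the page itself, then "move up a level" until something is found."""
    for candidate in parent_urls(page_url):
        query = build_availability_url(candidate, timestamp)
        with urllib.request.urlopen(query, timeout=30) as reply:
            found = closest_snapshot(json.load(reply))
        if found:
            return found
    return None
```

For example, `find_snapshot("https://example.edu/dept/people/staff.html", "20160101")` (a placeholder address) would check the page itself, then `/dept/people`, then `/dept`, then the site root, and return the first snapshot address it finds. Only `find_snapshot` contacts the network; the other functions can be tried offline.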

There are a few things to note, however:

- Not every web page was captured on every crawl. If the date and time you chose doesn't have what you're looking for, try another date and time.
- For the same reason, some stylesheets and images may be missing from the snapshot you're viewing.
- If the web site blocked web crawlers (such as with a "robots.txt" file), the pages may not have been archived.
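
Because captures vary from crawl to crawl, it can help to list every capture of a page and try them in turn. The sketch below is a hedged illustration in Python. It assumes the Wayback Machine's public CDX search endpoint at https://web.archive.org/cdx/search/cdx and its JSON output, in which the first row names the columns. Neither is described in this article, and the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the Wayback Machine's public CDX search endpoint.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(page_url, year_from=None, year_to=None, limit=50):
    """Build a query listing successful (HTTP 200) captures of page_url."""
    params = {
        "url": page_url,
        "output": "json",
        "filter": "statuscode:200",
        "limit": str(limit),
    }
    if year_from:
        params["from"] = str(year_from)
    if year_to:
        params["to"] = str(year_to)
    return CDX_ENDPOINT + "?" + urllib.parse.urlencode(params)


def capture_links(rows):
    """Turn CDX JSON rows (header row first) into snapshot addresses."""
    if not rows:
        return []
    header, *captures = rows
    ts = header.index("timestamp")
    original = header.index("original")
    return [
        f"https://web.archive.org/web/{row[ts]}/{row[original]}"
        for row in captures
    ]


def list_captures(page_url, year_from=None, year_to=None):
    """Fetch the capture list and return one snapshot address per capture."""
    query = build_cdx_url(page_url, year_from, year_to)
    with urllib.request.urlopen(query, timeout=30) as reply:
        return capture_links(json.load(reply))
```

Calling `list_captures("example.edu/dept/", 2015, 2016)` (a placeholder address) would return snapshot addresses from that date range, which you can open one at a time until you find a capture with the content you need.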

This information previously appeared in a January 2017 News article on the Web Services site.