The Internet Archive discovers in real-time when WordPress blogs publish a new post, and when Wikipedia articles reference new sources. How does that work?

Wikipedia

Wikipedia, and its sister projects such as Wiktionary and Wikidata, run on the open-source MediaWiki software. One of its core features is “Recent changes”, which enables the Wikipedia community to monitor site activity in real-time. We use it to facilitate anti-spam, counter-vandalism, machine learning, and many other quality and research efforts.

MediaWiki’s built-in REST API exposes this data in machine-readable form for querying (or polling). For wikipedia.org, we have an additional RCFeed plugin that broadcasts events to the stream.wikimedia.org service (docs).
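For example, the latest changes can be polled with a single request. A sketch against the Action API (the rcprop fields here are illustrative):

$ curl 'https://en.wikipedia.org/w/api.php?action=query&list=recentchanges&rcprop=title%7Ctimestamp%7Cids&format=json'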

The service implements the HTTP Server-Sent Events protocol (SSE). Most programming languages have an SSE client via a popular package. Most exciting to me, though, is the original SSE client: the EventSource API — built straight into the browser.[1] This makes cool demos possible, getting started with only the following JavaScript:

new EventSource('https://stream.wikimedia.org/…');

And from the command-line, with cURL:

$ curl 'https://stream.wikimedia.org/v2/stream/recentchange'

event: message
id: …
data: {"$schema":…,"meta":…,"type":"edit","title":…}

…
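Putting this together, a minimal browser sketch that subscribes to the stream and logs each change (using the type and title fields visible in the sample output above):

const es = new EventSource('https://stream.wikimedia.org/v2/stream/recentchange');
es.onmessage = (event) => {
    // Each SSE message carries one JSON-encoded change event in its data field.
    const change = JSON.parse(event.data);
    console.log(change.type, change.title);
};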

WordPress

WordPress played a major role in the rise of the blogosphere. In particular, ping servers (and pingbacks[2]) helped the early blogging community with discovery. The idea: your website notifies a ping server over a standardized protocol. The ping server in turn notifies feed reader services (Feedbin, Feedly), aggregators (FeedBurner), podcast directories, search engines, and more.[3]

Ping servers today implement the weblogsCom interface (specification), introduced in 2001 and based on the XML-RPC protocol.[4] The default ping server in WordPress is Automattic’s Ping-O-Matic, which in turn powers the WordPress.com Firehose.
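Under the hood, a ping is a small XML-RPC request. A hand-rolled sketch of what a blog sends on publish (the method name and parameters follow the weblogUpdates convention; I’m assuming Ping-O-Matic’s public endpoint here):

$ curl -X POST http://rpc.pingomatic.com/ \
    -H 'Content-Type: text/xml' \
    --data '<?xml version="1.0"?>
<methodCall>
  <methodName>weblogUpdates.ping</methodName>
  <params>
    <param><value><string>My Blog</string></value></param>
    <param><value><string>https://example.com/</string></value></param>
  </params>
</methodCall>'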

The WordPress.com Firehose is a Jabber/XMPP server at xmpp.wordpress.com:8008. It provides real-time events about blog posts published on any WordPress site, both WordPress.com-hosted and self-hosted ones.[5] The firehose is also available as an HTTP stream.

$ curl -vi xmpp.wordpress.com:8008/posts.org.json # self-hosted
{ "published":"2022-06-05T21:26:09Z",
  "verb":"post",
  "generator":{…},
  "actor":{…},
  "target":{"objectType":"blog",…,},
  "object":{"objectType":"article",…}
}
{ … }

$ curl -vi xmpp.wordpress.com:8008/posts.json # WordPress.com
{ … }
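To consume the firehose programmatically, here is a minimal Node.js sketch (assumes Node 18+ for built-in fetch, and that the stream emits one JSON object per line; the output above is pretty-printed for readability):

// Run as an ES module (e.g. firehose.mjs) for top-level await.
const res = await fetch('http://xmpp.wordpress.com:8008/posts.json');
const decoder = new TextDecoder();
let buffer = '';
for await (const chunk of res.body) {
    buffer += decoder.decode(chunk, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop(); // keep any trailing partial line for the next chunk
    for (const line of lines) {
        if (!line.trim()) continue;
        const post = JSON.parse(line);
        console.log(post.published, post.target.objectType, post.object.objectType);
    }
}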

Internet Archive

It might be surprising, but the Internet Archive does not try to index the entire Internet. This is in contrast to commercial search engines.

The Internet Archive consists of bulk datasets from curated sources (“collections”). Collections are often donated by other organizations, and go beyond capturing web pages. They can also include books, music,[6] and software.[7] Any captured web pages are additionally surfaced via the Wayback Machine interface.

Perhaps you’ve used the “Save Page Now” feature, where you can manually submit URLs to capture. While also represented by a collection, these actually go to the Wayback Machine first, and appear in bulk as part of the collection later.
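Save Page Now can also be driven from the command line. A sketch of the simple unauthenticated form, which accepts a plain GET (there is also an authenticated API for bulk use):

$ curl -s 'https://web.archive.org/save/https://example.com/'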

The Common Crawl and Wide Crawl collections represent traditional crawlers. These start with a seed list and crawl breadth-first to every site they find (within certain global and per-site depth limits). Such a crawl can take months to complete, and captures a portion of the web from a particular period in time, regardless of whether a page was indexed before. Other collections are narrower in focus, e.g. regularly crawling a news site and capturing any articles not previously indexed.

Wikipedia collection

One such collection is Wikipedia Outlinks.[8] This collection is fed several times a day with bulk crawls of new URLs. The URLs are extracted from recently edited or created Wikipedia articles, as discovered via the events from stream.wikimedia.org (Source code: crawling-for-nomore404).
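The discovery step amounts to filtering the stream for edit events; the crawler then fetches each new revision and extracts newly added external links. A sketch (the wiki and revision fields are assumed from the recentchange event schema):

const es = new EventSource('https://stream.wikimedia.org/v2/stream/recentchange');
es.onmessage = (event) => {
    const change = JSON.parse(event.data);
    if (change.type === 'edit' && change.wiki === 'enwiki') {
        // A crawler would fetch this revision and diff it for new outlinks.
        console.log(change.title, change.revision.new);
    }
};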

en.wikipedia.org, revision by Krinkle, on 30 May 2022 at 21:03:30.

Last month, I edited the VodafoneZiggo article on Wikipedia. My edit added several new citations. The articles I cited were from several years ago, and most had already made their way into the Wayback Machine by other means. Among my citations was a 2010 article from an Irish news site (rte.ie). I searched for it on archive.org, and no snapshots of that URL existed.

A day later I searched again, and there it was!

web.archive.org found 1 result, captured at 30 May 2022 21:03:55.
This capture was collected by: Wikipedia Eventstream.

I should note that, while the snapshot was uploaded a day later, the crawling occurred in real-time. I published my edit to Wikipedia on May 30th, at 21:03:30 UTC. The snapshot of the referenced source article was captured at 21:03:55 UTC. A mere 25 seconds later!

In addition to archiving citations for future use, Wikipedia also integrates with the Internet Archive in the present. The so-called InternetArchiveBot (source code) continuously crawls Wikipedia, looking for “dead” links. When it finds one, it searches the Wayback Machine for a matching snapshot, preferring one taken on or near the date the citation was originally added to Wikipedia. This is important for online citations, as web pages may change over time.
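The snapshot lookup can be reproduced with the Wayback Machine’s public availability API, which returns the capture closest to a given timestamp (a sketch; the bot’s internal lookup may differ):

$ curl 'https://archive.org/wayback/available?url=example.com&timestamp=20180929'

{"archived_snapshots":{"closest":{"available":true,"url":"http://web.archive.org/web/…","timestamp":"20180929…"}}}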

The bot then edits Wikipedia (example) to rescue the citation by filling in the archive link.

Wikipedia.org, revision by InternetArchiveBot, on 4 June 2022. Rescuing 1 source. The source was originally cited on 29 September 2018. The added archive URL is also from 29 September 2018.
web.archive.org, found 1 result, captured 29 September 2018. This capture was collected by: Wikipedia Eventstream.

WordPress collection

The NO404-WP collection on archive.org works in a similar fashion. It is fed by a crawler that uses the WordPress Firehose (source code). The firehose, as described above, is pinged by individual WordPress sites after publishing a new post.

Take this blog post by Chris, for example. According to the post metadata, it was published at 12:00:42 UTC. And by 12:01:55, a minute later, it was captured.[9]

In addition to preserving blog posts, the NO404-WP collection goes a step further and also captures any new material your post links to. (Akin to Wikipedia citations!) For example, this css-tricks.com post links to a file on GitHub inside the TT1 Blocks project. This deep link was not captured before, and is unlikely to be picked up by regular crawling due to depth limits. Yet it was captured and uploaded to the NO404-WP collection a few days later.

Footnotes:
  1. The “Server-sent events” technology was around as early as 2006, originating at Opera (announcement, history). It was among the first specifications to be drafted through WHATWG, which formed in 2004 after the W3C XHTML debacle. ↩︎
  2. Pingback (Pingbacks explained, history) provides direct peer-to-peer discovery between blogs when one post mentions or links to another post. By the way, the Pingback and Server-Sent Events specifications were both written by Ian Hickson. ↩︎
  3. Feedbin supports push notifications. While these could come from its periodic RSS crawling, it tries to deliver them in real-time where possible. It does this by mapping pings from blogs that notify Ping-O-Matic to feed subscriptions. ↩︎
  4. The weblogUpdates spec for Ping servers was written by Dave Winer in 2001, who took over Weblogs.com around that time (history) and needed something more scalable. This, by the way, is the same Dave Winer who developed the underlying XML-RPC protocol and the OPML format, and worked on RSS 2.0. ↩︎
  5. That is, unless the blog owner opts out by disabling the “search engine” and “ping” settings in WordPress Admin. ↩︎
  6. The Muziekweb collection is one that stores music rather than web pages. Muziekweb is a library in the Netherlands that lends physical CDs to patrons via local libraries. They also digitize their collection for long-term preservation. One cool application of this is that you can stream any album in full from a library computer. And… they mirror to the Internet Archive! You can search for an artist, and listen online. For copyright reasons, most music is publicly limited to 30-second samples. Through Controlled digital lending, however, you can access many more albums in full. Plus, you can publicly stream any music that is in the public domain, under a free license, or pre-1972 and no longer commercially available. ↩︎
  7. I find it particularly impressive that the Internet Archive also hosts platform emulators for the software it preserves, that these platforms include not only game consoles but also Macintosh and MS-DOS, and that the emulators are compiled via Emscripten to JavaScript and integrated right on the archive.org entry! For example, you can play the original Prince of Persia for Mac (via pce-macplus.js), the later color edition, or Wolfenstein 3D for MS-DOS (via js-dos or em-dosbox), or check out Bill Atkinson’s 1985 MacPaint. ↩︎
  8. The “Wikipedia Outlinks” collection was originally populated via the NO404-WKP subcollection, which used the irc.wikimedia.org service from 2013 to 2019. It was phased out in favour of the wikipedia-eventstream subcollection. ↩︎
  9. In practice, the ArchiveTeam URLs collection tends to beat the NO404-WP collection and thus the latter doesn’t crawl it again. Perhaps the ArchiveTeam scripts also consume the WordPress Firehose? For many WordPress posts I checked, the URL is only indexed once, which is from “ArchiveTeam URLs” doing so within seconds of original publication. ↩︎