These are short stories from bug hunts and incident investigations at Wikipedia.

Impact

After developers submit code to Gerrit, they eagerly await the result from Jenkins, an automated test runner.

Every day during the 15-minute window before 5 PM in San Francisco, tests on code changes submitted for review would mysteriously fail. Jenkins would wrongly inform developers that their proposed changes caused a problem with the MergeHistory feature of MediaWiki.

Background

The test in question assumed that it would finish by “tomorrow”. At first glance, it seems fair to assume that by tomorrow, a given test will have finished. We know our test suite generally takes only a few minutes to run (with a time limit of 30 minutes, to ensure tests report back even if they are stuck).

Investigation

Unfortunately, the strtotime utility function in PHP does not interpret “tomorrow” as “this time tomorrow”.

Rather, it takes it to mean “the start of tomorrow”. In other words, the next strike of midnight!

For example, on 14 August 23:59:59, strtotime("tomorrow") would evaluate to a timestamp merely one second into the future — 15 August 00:00:00.
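The difference can be reproduced with a short script. This is a minimal sketch rather than the code of the actual test: the year 2018 and the variable names are assumptions made for illustration, and "+1 day" is shown only as one way to express "this time tomorrow", not necessarily what the eventual fix used.

```php
<?php
// Pin the timezone so the result does not depend on server configuration.
date_default_timezone_set('UTC');

// Hypothetical "now": one second before midnight (the year is an assumption).
$now = strtotime('2018-08-14 23:59:59');

// "tomorrow" resets the time of day to 00:00:00 on the next calendar day.
$startOfTomorrow = strtotime('tomorrow', $now);

// "+1 day" keeps the time of day, i.e. "this time tomorrow".
$thisTimeTomorrow = strtotime('+1 day', $now);

echo date('Y-m-d H:i:s', $startOfTomorrow), "\n";   // 2018-08-15 00:00:00
echo date('Y-m-d H:i:s', $thisTimeTomorrow), "\n";  // 2018-08-15 23:59:59

// Seconds of headroom each expression leaves before the "deadline".
echo $startOfTomorrow - $now, "\n";   // 1
echo $thisTimeTomorrow - $now, "\n";  // 86400
```

The second argument to strtotime is the base timestamp that relative expressions are resolved against, which makes the behaviour easy to check without waiting for midnight.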

This meant that whenever a test started running shortly before midnight, it would fail. The test server uses UTC as its timezone. As such, a test suite that started less than 15 minutes before 5 PM in San Francisco (which is midnight in UTC) would mysteriously fail!
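The timezone arithmetic can be checked the same way. The sketch below assumes a hypothetical test run starting at 4:50 PM in San Francisco on 14 August 2018 (the date, start time, and variable names are illustrative, not taken from the incident) and shows how little time is left before the "tomorrow" deadline on a server running in UTC.

```php
<?php
// The test server runs in UTC, so relative dates resolve against UTC.
date_default_timezone_set('UTC');

// Hypothetical start of a test run, expressed in San Francisco local time.
$start = new DateTimeImmutable(
    '2018-08-14 16:50:00',
    new DateTimeZone('America/Los_Angeles')
);

// The same instant as seen by the server.
$startUtc = $start->setTimezone(new DateTimeZone('UTC'));
echo $startUtc->format('Y-m-d H:i:s T'), "\n"; // 2018-08-14 23:50:00 UTC

// "tomorrow" resolves to the next UTC midnight, only minutes away.
$deadline = strtotime('tomorrow', $start->getTimestamp());
$minutesLeft = ($deadline - $start->getTimestamp()) / 60;
echo $minutesLeft, "\n"; // 10
```

In August, San Francisco is seven hours behind UTC, so 5 PM local time coincides with midnight on the server.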

Task T201976

Change 452873


Originally published in the September 2018 edition of the Production Excellence newsletter at Wikimedia. This article is an expanded version of that.