By James Purtill for the ABC
The outage that took down Facebook, Instagram and WhatsApp for about six hours, affecting billions of people around the world, may have been due to a tiny human error.
The tech giant with billions of users and a mission statement of "bring the world closer together" suffered a mysterious technical breakdown early this morning New Zealand time.
The effects rippled around the globe as, one after another, its services went dark.
Messages would not send on Messenger; money would not flow on WhatsApp money transfer; pages that used Facebook for logins locked users out.
At Facebook itself, employees reportedly could not use their keycards to enter buildings or access standard office software for work and collaboration.
Facebook boss Mark Zuckerberg lost $US9.6 billion as the company's share price plunged.
As chaos reigned, Facebook executives took to the rival platform Twitter to explain what was happening, and apologise to their own users.
By midmorning, Facebook, WhatsApp and Instagram users began to regain partial access.
So, what happened? And how could a company this big suffer such a glitch?
How Facebook fell off the 'map of the internet'
"We can see what happened, but we don't know exactly what was the cause," said Marek Kowalkiewicz, director of the Centre for the Digital Economy at QUT.
Moments before the outage began, Facebook stopped announcing the routes to its DNS addresses.
Between 15:50 UTC and 15:52 UTC Facebook and related properties disappeared from the Internet in a flurry of BGP updates. This is what it looked like to @Cloudflare. pic.twitter.com/PFw5FR2W5j
— John Graham-Cumming (@jgrahamc) October 4, 2021
DNS stands for Domain Name System, and it functions like an address book for the internet.
"It's the system that translates what we type into a browser into an IP address," Kowalkiewicz said.
Say you type "http://facebook.com" into a browser; DNS connects that name with the numerical address of one of the Facebook servers.
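The address-book idea can be sketched in a few lines of Python. This is a toy model only: the `ADDRESS_BOOK` table and the `resolve` function are made up for illustration, and the address is a placeholder from a range reserved for documentation, not one of Facebook's real addresses. Real DNS involves a chain of resolvers and name servers rather than a single table.

```python
from typing import Optional

# Hypothetical address book: name -> numerical (IP) address.
# 192.0.2.10 is a documentation-only placeholder, not a real Facebook address.
ADDRESS_BOOK = {
    "facebook.com": "192.0.2.10",
}


def resolve(hostname: str) -> Optional[str]:
    """Return the address recorded for a name, or None if the name is unknown."""
    # Names are case-insensitive and may carry a trailing dot.
    return ADDRESS_BOOK.get(hostname.lower().rstrip("."))


print(resolve("facebook.com"))     # 192.0.2.10
print(resolve("example.invalid"))  # None
```

If the address book cannot be consulted, the browser never gets a numerical address and has nowhere to send the request.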
But what good is an address without any idea of how to get there?
Border Gateway Protocol, or BGP, is that navigation system.
BGP uses tables that list the routes to particular network destinations; when you send a request to a local server to access Facebook, these tables show the server where to send your request so it eventually reaches Facebook.
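A rough sketch of how such a table is consulted, assuming a simplified router that picks the most specific matching route. The prefixes and next-hop names in `ROUTES` are invented for illustration and use address ranges reserved for documentation; real BGP route selection weighs many more attributes than prefix length.

```python
import ipaddress
from typing import Optional

# Hypothetical routing table: destination prefix -> where to forward traffic next.
ROUTES = {
    ipaddress.ip_network("0.0.0.0/0"): "upstream-provider",  # default route
    ipaddress.ip_network("198.51.100.0/24"): "peer-a",
    ipaddress.ip_network("198.51.100.128/25"): "peer-b",
}


def next_hop(destination: str, routes=ROUTES) -> Optional[str]:
    """Pick the next hop for an address using the most specific matching prefix."""
    addr = ipaddress.ip_address(destination)
    matches = [net for net in routes if addr in net]
    if not matches:
        return None  # no route at all: the request cannot be forwarded
    best = max(matches, key=lambda net: net.prefixlen)
    return routes[best]


print(next_hop("198.51.100.200"))  # peer-b
print(next_hop("198.51.100.5"))    # peer-a
print(next_hop("203.0.113.9"))     # upstream-provider
```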
Who publishes the tables that show where to find Facebook? Facebook does.
Shortly before the outage, Facebook updated its BGP routing tables, according to the internet infrastructure company Cloudflare.
About five minutes before Facebook's DNS stopped working we saw a large number of BGP changes (mostly route withdrawals) for Facebook's ASN. pic.twitter.com/dMTevg6hqj
— John Graham-Cumming (@jgrahamc) October 4, 2021
The update withdrew the routes to Facebook's IP addresses, meaning servers had no idea where to send requests to access Facebook.
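The effect of a withdrawal can be illustrated with a toy routing table. The `RoutingTable` class, the prefix and the `facebook-edge` name are all hypothetical, and the sketch assumes a router with no default route to fall back on; it is not a model of Facebook's actual configuration.

```python
import ipaddress
from typing import Optional


class RoutingTable:
    """Toy table of announced prefixes (illustrative only, not real BGP)."""

    def __init__(self) -> None:
        self._routes = {}

    def announce(self, prefix: str, next_hop: str) -> None:
        """Record that traffic for this prefix should go to next_hop."""
        self._routes[ipaddress.ip_network(prefix)] = next_hop

    def withdraw(self, prefix: str) -> None:
        """Remove a previously announced prefix, if present."""
        self._routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address: str) -> Optional[str]:
        """Return the next hop for an address, or None if no route covers it."""
        addr = ipaddress.ip_address(address)
        matches = [net for net in self._routes if addr in net]
        if not matches:
            return None
        return self._routes[max(matches, key=lambda net: net.prefixlen)]


table = RoutingTable()
table.announce("192.0.2.0/24", "facebook-edge")  # placeholder prefix
print(table.lookup("192.0.2.10"))  # facebook-edge

table.withdraw("192.0.2.0/24")
print(table.lookup("192.0.2.10"))  # None
```

Nothing is wrong with the server at the destination in this sketch; once the route is withdrawn, the rest of the network simply has no entry telling it where to send the request.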
"Routing tables are a bit like a map of the internet - they explain how to get from this intersection to that intersection," Kowalkiewicz said.
"If one or two of those intersections are misconfigured, you end up with internet traffic that gets stuck and doesn't know where to go."
Facebook and its sites had effectively disconnected themselves from the internet, said Daniel Angus, a professor of digital communications at QUT.
"The routing tables are like the front doors to Facebook's various services," he said.
"Somewhere along the way, someone made an update and when it was deployed, those doors have disappeared."
Right. So who stuffed up?
This is the part that's not clear yet, though rumours abound.
Angus speculated that the ultimate cause may have been human error: a very senior engineer at Facebook may have made a basic transposition error, and the mistake was then propagated through the internet by self-replicating protocols.
"A small error propagates out to the entire network," he said.
"There are checks and balances to try to catch some of these things, but at a certain level you're playing god-mode with these routing tables."
Kowalkiewicz agreed.
"Right now we're in a world of speculation, but large organisations can make very basic mistakes.
"My suspicions were there was some change of configuration at Facebook, something went wrong and it propagated like a wave coming through the internet."
If it wasn't human error, it could have been a malicious act, perhaps by a disgruntled employee, Kowalkiewicz said.
Facebook is under mounting public and political pressure to act on misinformation.
The outage came shortly after a US current affairs program aired an interview with a whistleblower who claimed the company was aware of how its platforms were used to spread hate, violence and misinformation, and that it had tried to hide that evidence.
Have outages like this happened before?
The outage is the worst since a bug knocked Facebook's services offline for about a day in 2008, affecting about 80 million users.
The company now boasts 3 billion users.
Other big tech companies have seen major outages recently.
In late 2020, both Amazon and Google had separate outages that were a major inconvenience for millions of users around the world.
They even created havoc in internet-connected homes, where turning on the lights or regulating the thermostat was controlled by a Google or Amazon app that no longer worked.
Some part of AWS is down and apparently it’s screwing up the Roomba.
— Matthew Green (@matthew_d_green) November 25, 2020
In June 2020, websites around the world went dark due to a software bug at a company that manages a crucial piece of internet infrastructure.
These incidents remind us that the internet is both "incredibly robust and incredibly fragile," Angus said.
Its decentralised structure of distributed servers means it's hard to take down, but the centralisation of services, through tech giants such as Facebook, Amazon and Google, means that the impact of any outage is greater than ever before.
"There are a few people that are probably responsible for keeping the whole thing afloat any day of the week, and we do take the internet for granted," Professor Angus said.
In its blog post, Cloudflare wrote: "Today's events are a gentle reminder that the internet is a very complex and interdependent system of millions of systems and protocols working together."
A Facebook spokesperson said: "To everyone who was affected by the outages on our platforms today: we're sorry.
"We know billions of people and businesses around the world depend on our products and services to stay connected.
"We appreciate your patience as we come back online."