Incident Learning: Facebook’s AWS Authentication Failure on 5 March 2024
Overview:
On March 5, 2024, Facebook suffered a major outage that disrupted access for users around the world for several hours. The cause was a technical failure in the company's infrastructure, which made both the website and the mobile application unusable. The outage frustrated many users, as well as the businesses and advertisers who rely on the platform for communication and marketing. The company promptly acknowledged the problem and worked to resolve it, and the site was fully operational again later the same day. The failure illustrated the complexity of running a large global network. It also highlighted the risk of relying on digital platforms without a contingency plan for unforeseen events.
C:\Users\...>nslookup www.facebook.com
Non-authoritative answer:
Name: star-mini.c10r.facebook.com
Addresses: 2a03:2880:f12f:83:face:b00c:0:25de
157.240.198.35
Aliases: www.facebook.com
C:\Users\........>ping 157.240.198.35
Pinging 157.240.198.35 with 32 bytes of data:
Reply from 157.240.198.35: bytes=32 time=28ms TTL=49
Reply from 157.240.198.35: bytes=32 time=23ms TTL=49
Reply from 157.240.198.35: bytes=32 time=20ms TTL=49
Reply from 157.240.198.35: bytes=32 time=23ms TTL=49
Ping statistics for 157.240.198.35:
Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 20ms, Maximum = 28ms, Average = 23ms
If this had been a BGP-related outage, no user would have been able to reach the Facebook, Instagram, or YouTube applications at all during the outage window. Instead, many reports indicated that users who were already logged in could keep using the applications, while users who logged out could not get back in. This points to a server-side authentication error.
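The reasoning above can be sketched as a small triage helper. This is a minimal illustration, not a diagnostic tool: the function names (`resolve`, `tcp_reachable`, `classify`) and the three category labels are assumptions introduced here, and the mapping from symptoms to causes is deliberately coarse.

```python
import socket
from typing import Optional


def resolve(hostname: str) -> Optional[str]:
    """Return one resolved IP address for hostname, or None if DNS fails."""
    try:
        infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return None
    return infos[0][4][0] if infos else None


def tcp_reachable(ip: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to ip:port succeeds within timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def classify(dns_ok: bool, network_ok: bool, login_ok: bool) -> str:
    """Map three observations to the layer that most likely failed."""
    if not dns_ok:
        return "dns-or-routing"      # names do not resolve at all
    if not network_ok:
        return "network-or-routing"  # names resolve, but hosts are unreachable
    if not login_ok:
        return "authentication"      # hosts are reachable, but new logins fail
    return "healthy"
```

With the symptoms reported during the outage (names resolve, hosts respond, new logins fail), `classify(True, True, False)` returns `"authentication"`, which matches the conclusion drawn in the text.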
Facebook Authorization process
The user clicks a button in the React app to connect with Facebook.
The user is sent to Facebook to log in and give permission for the app to access their data.
After granting permission, Facebook sends a special code back to the React app.
The React app sends this code to Facebook and asks for an access token.
Facebook sends back an access token, which the app can use to access the user’s data on Facebook.
This process ensures that the user’s data is accessed securely and only with their permission.
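The steps above correspond to the OAuth 2.0 authorization code flow. The sketch below only builds the URLs and parameters involved and makes no network calls. The endpoint paths follow Facebook's manual login flow as publicly documented, but the API version string, the helper names, and the placeholder app values are assumptions made for illustration. In practice the code-for-token exchange is usually done by a backend so that the app secret never reaches the browser.

```python
import secrets
from typing import Dict, Tuple
from urllib.parse import parse_qs, urlencode, urlparse

GRAPH_VERSION = "v19.0"  # assumption: substitute the version your app targets


def new_state() -> str:
    """Generate a random anti-CSRF value to round-trip through the login."""
    return secrets.token_urlsafe(16)


def build_login_url(app_id: str, redirect_uri: str, state: str) -> str:
    """Steps 1-2: the URL the user is sent to in order to log in and consent."""
    query = urlencode({
        "client_id": app_id,
        "redirect_uri": redirect_uri,
        "state": state,
        "response_type": "code",
    })
    return f"https://www.facebook.com/{GRAPH_VERSION}/dialog/oauth?{query}"


def extract_code(callback_url: str, expected_state: str) -> str:
    """Step 3: read the code from the redirect and verify the state value."""
    params = parse_qs(urlparse(callback_url).query)
    if params.get("state", [None])[0] != expected_state:
        raise ValueError("state mismatch: possible CSRF or stale login attempt")
    if "code" not in params:
        raise ValueError("no authorization code in callback")
    return params["code"][0]


def build_token_request(app_id: str, app_secret: str,
                        redirect_uri: str, code: str) -> Tuple[str, Dict[str, str]]:
    """Step 4: endpoint and parameters for exchanging the code for a token."""
    endpoint = f"https://graph.facebook.com/{GRAPH_VERSION}/oauth/access_token"
    params = {
        "client_id": app_id,
        "client_secret": app_secret,  # keep server-side only
        "redirect_uri": redirect_uri,
        "code": code,
    }
    return endpoint, params
```

Every step after the first depends on Facebook's servers answering, which is why a fault in this path blocks new logins even when the rest of the site is reachable.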
During the outage, however, users who were already logged in did not need to go through the authorization process again, whereas users trying to log in anew had to go through it.
It's possible that an AWS Virtual Machine (VM) server issue related to peering connections between different zones could contribute to a service outage like the one experienced by Facebook.
Here's how such an issue might arise:
Network Peering and Connectivity Issues:
In large-scale distributed systems, peering connections allow different parts of a network, possibly located in different parts of the world and sometimes in entirely different availability zones, to communicate with each other. Any problem with a peering connection, such as a configuration error, a failed authorization, or a network outage between peers, could prevent servers from communicating with each other. Because AWS applies layered security and authorization controls to network traffic, an authorization error in the peering setup could partially or entirely disrupt communication with a specific region or with the services inside it.
Impacts on the services:
Facebook, as a large-scale service, depends on globally distributed services that must communicate with one another. When one data centre cannot reach another because of a peering issue, the services that depend wholly or partly on that link cannot exchange data, which results in downtime. Since neither Facebook nor AWS has disclosed specifics at this time, these explanations are speculative. Large-scale internet services run on diverse infrastructure, and aside from peering issues, outages can equally stem from many other factors: software bugs, hardware faults, network problems, or human error.
Lessons in Cloud Dependency and Resilience
Authentication dependencies must be redundant.
The outage showed how a single point of failure in an authentication system, such as a token service, certificate, or API, can bring down access to almost everything.
Takeaway: build multi-region authentication redundancy, and ensure backup identity systems can take over automatically.
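A minimal sketch of this takeaway, assuming each region exposes a token verifier that raises `ConnectionError` when it cannot be reached. The names `verify_with_failover` and `AuthUnavailable` and the region labels are hypothetical and do not describe Facebook's actual design.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# A verifier takes a token and returns user claims, or raises on failure.
Verifier = Callable[[str], Dict[str, str]]


class AuthUnavailable(Exception):
    """Raised when no region could be reached to verify a token."""


def verify_with_failover(
    token: str, regions: Iterable[Tuple[str, Verifier]]
) -> Tuple[str, Dict[str, str]]:
    """Try each region in order; return (region_name, claims) from the first
    region that answers. Only connectivity failures trigger failover, so a
    token that a healthy region rejects is not retried elsewhere."""
    errors: List[str] = []
    for name, verify in regions:
        try:
            return name, verify(token)
        except ConnectionError as exc:
            errors.append(f"{name}: {exc}")
    raise AuthUnavailable("; ".join(errors) or "no regions configured")
```

Failing over only on `ConnectionError` is a deliberate choice: a rejected token is a valid answer, and retrying it against backups would turn an outage safeguard into a way around authentication.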
Cloud dependency requires visibility and control.
If the issue did originate in AWS infrastructure, it shows how little visibility large platforms have into their cloud provider's internal services, and how heavily they depend on them.
Takeaway: always maintain monitoring, alerting, and fallback control over third-party cloud resources — don’t rely blindly on the provider.
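One simple way to express the monitoring half of this takeaway is a probe counter that raises an alert after several consecutive failures. The class name `DependencyMonitor` and the default threshold of three are assumptions chosen for illustration. A real deployment would feed it results from actual health checks and route alerts to an on-call system.

```python
from dataclasses import dataclass


@dataclass
class DependencyMonitor:
    """Tracks probe results for one third-party dependency."""
    name: str
    threshold: int = 3            # consecutive failures before alerting
    consecutive_failures: int = 0
    alerted: bool = False

    def record(self, ok: bool) -> bool:
        """Record one probe result; return True when an alert should fire.

        The alert fires once per failure streak, and a successful probe
        resets the streak so a later outage can alert again.
        """
        if ok:
            self.consecutive_failures = 0
            self.alerted = False
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.threshold and not self.alerted:
            self.alerted = True
            return True
        return False
```

Requiring a streak rather than a single failure avoids paging on one dropped probe, at the cost of a short detection delay.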
Microservices interdependence can amplify failure.
In modern architectures like Facebook’s, hundreds of microservices depend on each other. In this case, if authentication fails, everything relying on it stops.
Takeaway: design graceful degradation — services should work partially, e.g., read-only or cached mode if authentication fails.
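A minimal sketch of graceful degradation, assuming a `verify` callable that raises `ConnectionError` when the authentication service is unreachable. `SessionCache` and `authorize` are hypothetical names. The behaviour loosely mirrors what was reported during the outage: sessions that had already been verified keep working in a reduced mode, while unknown tokens cannot be admitted.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class SessionCache:
    """Remembers recently verified tokens for a limited time."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}

    def put(self, token: str, user: str) -> None:
        self._entries[token] = (user, self._clock())

    def get(self, token: str) -> Optional[str]:
        entry = self._entries.get(token)
        if entry is None:
            return None
        user, stored_at = entry
        if self._clock() - stored_at > self._ttl:
            del self._entries[token]  # expired: drop and treat as unknown
            return None
        return user


def authorize(token: str, verify: Callable[[str], str],
              cache: SessionCache) -> Tuple[str, str]:
    """Return (user, mode), where mode is 'full' or 'read-only'."""
    try:
        user = verify(token)
    except ConnectionError:
        cached = cache.get(token)
        if cached is None:
            raise  # never seen this token: cannot admit it safely
        return cached, "read-only"
    cache.put(token, user)
    return user, "full"
```

The time-to-live bounds how long a revoked token could keep working during an outage, so it should be chosen with that risk in mind.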
Incident response preparedness is crucial.
The outage lasted several hours before full restoration. Even large engineering teams can struggle without well-defined incident workflows.
Takeaway: build and test incident response playbooks, rollback, and failover drills.
Communication during outages matters.
During the downtime, users flooded social media and news sites looking for updates from Facebook. Facebook's own communication channels were also affected.
Takeaway: maintain external communication mechanisms (status pages, Twitter handles, etc.) that don't rely on the same infrastructure.
Continuous postmortem culture builds resilience.
After an outage like this one, Facebook and Meta can be expected to perform a deep root cause analysis to prevent recurrence.
Takeaway: adopt a blameless postmortem culture — focus on systemic improvements, not individual errors.
Note: Without specific details from Facebook or AWS, any attempt to pinpoint the exact cause of the outage is speculative. Large-scale internet services typically have complex infrastructures, and outages can result from a variety of factors, including hardware failures, software bugs, network issues, or even human error.
