A Break in the Cloud – The Day Amazon Broke the Internet

Chris Tynan – Partner, Retail & IT Analyst

The cogs of the inter-webs shuddered to a standstill in late February as the shared plumbing that much of the world’s online real estate relies upon burst open, halting the flow of data through the internet’s virtual water pipes. The technology world gasped as the Amazon Web Services (AWS) juggernaut revealed that even it, with a globally distributed presence and infinitely scalable hosting platform, was susceptible to catastrophic failure. While not the first “service event” experienced by AWS, this occurrence was undoubtedly the highest profile, given the number of websites, services and enterprises now relying on its infrastructure. Lasting around five hours, the incident revealed that even the automated supremacy of the world’s most innovative company is vulnerable to human error. We were also reminded of the unfortunate reality that, over a long enough time frame, even the most redundant array of computers will eventually crash, leaving you temporarily unable to check Dailymail/The Herald/The Age/The SMH every ten minutes.

If you are unfamiliar with the incredible ascent of AWS and its rise to dominate online hosting and data storage, you must be either one of the seven millennials not working in IT or part of the rest of the world’s population that doesn’t care how its website is hosted but will notify the helpdesk if it takes longer than one sip of coffee to load. In 2004, Amazon took its successful ultra-low cost, ultra-high volume business model, which progressively hollows out each consumer segment it leans into, to hosting services. Having amassed one of the world’s largest server farms and developed innovative technology to handle the IT infrastructure demands of its flourishing online retail business, the group realised it could monetise its scale and turn the IT infrastructure cost line into a profit centre. By offering shared hosting services at a fraction of the cost of traditional models, AWS quickly tapped into the exponential growth in requirements for data storage and processing power, as the much-vaunted (and often misinterpreted) Cloud revolution swept the world.

Over the past ten years, many websites, enterprises and application developers have embraced the movement of data off their own hosted servers onto what is known as the Public Cloud. Under this model many customers run on the same hardware, hosted by a third-party service provider. On top of being cheaper, operating this way introduced simplified redundancy and scalability. The progression of this strategy was to move the actual processing of this data to the Cloud by replicating your physical servers on the provider’s platform. This allowed you to “spin up” new instances of these number-crunching virtual machines as your processing requirements increased, in a pay-as-you-go model. This tantalising technology offered incredible scalability without having to invest huge amounts of capital to grow. Further profit and loss benefits came from material reductions in IT staffing and hardware depreciation. The model came to be known as Infrastructure as a Service (IaaS). To the application software companies and CIOs of the world, this was an absolute “no-brainer”, ushering in phenomenal growth not only for Amazon’s offering but also for a suite of me-too products from other technology firms, including Google Cloud, Microsoft Azure, IBM SoftLayer and Oracle Cloud. Pricing has continued to plummet, making the service accessible to smaller and smaller operations, as the tech titans scramble to capture market share in this industry. Gartner estimated that this industry will maintain compound annual growth of greater than 30%, and IDC expects it to grow into a $195bn market by 2020.

The explosive growth of IaaS has been led primarily by AWS’ ongoing innovation and consistent price cuts. The business has ballooned from nothing to over USD$12 billion in annual revenue in ten years, contributing USD$3.1 billion to Amazon’s earnings and making it the group’s most profitable business. Estimated to hold greater than 30% market share of all IaaS revenues, AWS supports over one million customers running on its platform. Its portfolio boasts some of the most prominent software as a service (SaaS) providers, including Adobe, Airbnb, Netflix, Spotify and SAP. However, innumerable large and small private businesses are also utilising the service for day-to-day data storage, application serving and disaster recovery. Amazon presentations in 2016 highlighted “nearly 2,000 government agencies, 50,000 education providers and 17,500 non-profits”. The service is understood to be used for hosting by over 140,000 websites. All of these customers are attracted to AWS’ promise of logical and geographical redundancy, as their data and virtual servers are spread across the globe. The days of worrying about server hardware crashes were over; never again would IT have to worry about backup tape drives not working, disks failing or memory being corrupted. The service has been incredibly reliable, with few reports of downtime or accessibility problems.

Source: Geekwire

Then came February 28, 2017. At around 12:35pm Eastern Time, various social media outlets began reporting that popular sites were grinding to a halt. The incredible irony was that AWS’ status page, designed to monitor AWS services, was itself unable to be updated because it was affected by the AWS outage to AWS services. As such, Amazon was forced to rely on Twitter to relay to the world that they had severed part of the internet’s spine. Here is their somewhat confusing message to followers:

Among the hundreds of high-profile sites and services that experienced complete outage or severely degraded performance were Apple iTunes, collaboration tool Slack, project management application Trello, Q&A and blogging tool Quora, global online travel behemoth Expedia, high-traffic websites Business Insider and Lonely Planet, and even the SEC’s own homepage. Locally, popular accounting software provider Xero, having recently celebrated re-platforming itself to rely wholly on AWS for its infrastructure, had its services taken offline in many of its markets, with hundreds of thousands of users unable to access its cloud-based platform.

After several hours, the issue was resolved. As services returned to normal operation, customers understandably howled for an explanation and restitution, both of which were provided, to varying levels of satisfaction. What emerged from AWS headquarters was that their Simple Storage Service (S3) was experiencing “high error rates”. This critical component of the platform acts as an interface, providing access to its core data storage capabilities. The root cause of the disaster turned out to be a typo by one of their programmers while debugging a billing system. The fat finger inadvertently knocked out swathes of their server farms instead of an isolated patch.

Amazon’s competitors were quick to capitalise on their rival’s issues, using the event as a marketing opportunity and touting the inclusion of their platform as a strategy to diversify cloud infrastructure and mitigate outage risk. The Twitter hashtag #AWSoutage trended, with every competitor, domain expert and consultant penning their told-you-so’s and providing delightful infographics on how it all could have been prevented. While many estimates of the economic impact have been quoted ad nauseam, the actual number is impossible to quantify. Several hours offline for companies reliant on connectivity for customer transactions, services or worker productivity undoubtedly has a material commercial and reputational cost. These are the very reasons that enterprises spend millions each year on disaster recovery and business continuity strategies.

History has shown that downtime is an inevitable and unavoidable event for even the most robust and time-critical systems, evidenced by outages in systemically important software such as stock exchanges, flight control systems and utility infrastructure. The fact that the outages were so widespread, geographically and economically, does highlight a level of complacency that may have crept into the ever-shrinking IT infrastructure teams. Faith in the infallibility of public cloud infrastructure meant redundancy was effectively outsourced to the innovation merchants at Amazon. This confidence appears to have let a single point of failure slip through the risk management frameworks of some of the world’s most technologically savvy enterprises, damaging the 99.99% uptime they aspire to.
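To put the 99.99% figure in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes a 365-day year and, purely for illustration, that the roughly five-hour outage was a customer's only downtime for the year; the function names are our own and not part of any AWS tooling.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years (assumption)


def downtime_budget_minutes(uptime_target: float) -> float:
    """Minutes of downtime per year permitted by a target such as 0.9999."""
    return (1.0 - uptime_target) * HOURS_PER_YEAR * 60


def achieved_uptime(outage_hours: float) -> float:
    """Annual uptime fraction if one outage is the only downtime all year."""
    return 1.0 - outage_hours / HOURS_PER_YEAR


# Illustrative inputs: the 99.99% aspiration and the ~5 hour outage from the text.
budget = downtime_budget_minutes(0.9999)
uptime = achieved_uptime(5.0)

print(f"A 99.99% target allows {budget:.1f} minutes of downtime per year")
print(f"One five-hour outage leaves annual uptime at {uptime:.4%}")
```

On these assumptions, a 99.99% target leaves a budget of under an hour of downtime a year, so a single five-hour event spends several years' worth of that allowance and leaves annual uptime at roughly 99.94%.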

To an online world consuming more and more internet-based services, this offline moment laid bare the scale at which we are increasingly reliant on connectivity to work and play, and the dependence we have on a select few companies to keep the wheels turning. This event will undoubtedly catalyse post-mortems at many organisations aiming to minimise future repeats. Spreading data and processing across multiple geographies and cloud providers comes with increased complexity and cost. Hybrid Cloud, which combines private and public hosting, represents increased overheads and is heresy to cloud purists. Whatever the evolution of this architecture is, it is very unlikely to slow Infrastructure as a Service take-up, but it will perhaps cause a rethink of the eggs-in-one-basket paradigm.

If nothing else, the AWS outage provided endless entertainment from reading the indignant fury and humorous memes that clogged the Twittersphere during and after the downtime.

From an investment perspective, we really like the Amazon Web Services business but are less drawn to the bread-and-butter Amazon retail business. AWS accounts for only 10% of sales and 25% of profitability, so it is not the main driver of the stock. Amazon’s core internet retail business, while making headlines all over the world, is very competitive, with low barriers to entry and only a handful of companies operating profitably. Even Amazon, after 20 years and with dominant market share, struggles to achieve more than a 3% profit margin and lacks consistent bottom-line results. Amazon has a long track record of pulling either the revenue growth lever or the margin lever, but has never been able to combine the two. Given that e-commerce penetration is still well below 10% in most markets, this industry will continue to grow. We are keeping a close eye on developments within Amazon.

This article has been prepared by Arnhem Investment Management Pty Limited ABN 17 129 606 775, AFSL 332484. It has no regard to the specific investment objectives, financial position or particular needs of any specific recipient. You should seek your own professional advice in relation to any financial product referred to. You should also obtain the product disclosure statement relating to any financial product referred to and consider the statement before making any decision about whether to acquire the financial product.

This article, including the information contained herein, may only be copied, reproduced, republished, or posted if done so in whole with original disclaimer included.
© Arnhem Investment Management, 2017