Monday, 12 September 2011

When cloud goes wrong - Recent Microsoft Cloud365 Outage

I watched with great interest the announcements from Microsoft about their Office 365 outage, and from Google about their cloud offering. The upshot was that both were out of action for a good few hours.

For those that didn't know / want to read - there is good commentary from the BBC that can be found here: http://www.bbc.co.uk/news/technology-14851455

It was interesting to watch the reaction of the general global technology populace to bad press on Cloud (and there we were thinking that Cloud was the second coming, that it could never go wrong and that it was capable of doing god-like things). More interesting, though, was the lack of understanding of the implications this raises for cloud computing, and of something that lots of people still don't want to admit to themselves when buying into this approach - application architecture is just as important as infrastructure (in fact - even more so).

Why did Cloud365 (Cloud362?) break? Was it due to poor infrastructure? Could be... Was it due to issues around application design? More likely. Was it due to a lack of fault-tolerant design across both application and infrastructure? Almost definitely... OK - so in this instance it sounds like Cloud365 had issues due to DNS outages / fat fingers - the point is that it shouldn't matter. It should be designed in a way that it doesn't need rock-solid infra - it can just move around to a location where its compute requirements can be serviced.

Imagine that rather than having a discrete DC with a discrete network and its own DNS management, the infrastructure was geographically dispersed, and the application was able to take advantage of all of these dispersed infrastructure islands and move both app and stateful data WITHIN THE APP LAYER. Would a significant outage have occurred? Probably not.
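To make the "within the app layer" idea a bit more concrete, here is a minimal sketch of application-level state replication across infrastructure islands. Everything in it is an assumption for illustration: `Region`, `RegionDown` and `ReplicatedState` are hypothetical names, the regions are plain in-memory stand-ins, and the majority-write rule is just one simple policy, not a description of how any real service does it.

```python
class RegionDown(Exception):
    """Raised when a (hypothetical) region cannot be reached."""


class Region:
    """In-memory stand-in for one infrastructure island."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.store = {}

    def put(self, key, version, value):
        if not self.up:
            raise RegionDown(self.name)
        # Keep only the newest version of each key.
        current = self.store.get(key)
        if current is None or version > current[0]:
            self.store[key] = (version, value)

    def get(self, key):
        if not self.up:
            raise RegionDown(self.name)
        return self.store.get(key)


class ReplicatedState:
    """The application, not the infrastructure, spreads state across regions."""

    def __init__(self, regions):
        self.regions = regions
        self.quorum = len(regions) // 2 + 1

    def write(self, key, version, value):
        acks = 0
        for region in self.regions:
            try:
                region.put(key, version, value)
                acks += 1
            except RegionDown:
                continue  # a dead island is expected, not exceptional
        if acks < self.quorum:
            raise RuntimeError(f"only {acks} of {len(self.regions)} regions acknowledged")
        return acks

    def read(self, key):
        reachable = 0
        answers = []
        for region in self.regions:
            try:
                found = region.get(key)
            except RegionDown:
                continue
            reachable += 1
            if found is not None:
                answers.append(found)
        if reachable == 0:
            raise RuntimeError("no region reachable")
        if not answers:
            raise KeyError(key)
        # The highest version seen among reachable regions wins.
        return max(answers, key=lambda answer: answer[0])[1]
```

With three regions, losing any one of them still lets writes and reads succeed, and the application carries on from whichever islands are left. A real design would also have to deal with partial writes, conflicting versions and catching a region up when it comes back, none of which this sketch attempts.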

If we want to embrace this new computing paradigm - it shows us that this isn't about smart chunks of hardware, sooper dooper resilience, data replication, really smart hypervisors or a whole bunch of other infrastructure offerings. It's about an application's ability to scale out, scale wide, restart, be tolerant of infrastructure failures etc. It's all about the application architecture, how it is layered on top of infrastructure and how the presentation to the end user is designed.

If you were thinking that expensive VDC solutions (FlexPod / vBlock / Matrix etc.) are going to help with this - they won't. Sure, they (well, some) are great virtualisation platforms but they are NOT cloud platforms - the apps and layering approaches are what make a good cloud platform.

Sorry, application dev / architect types - I'm afraid a lot of stuff is falling on your shoulders now to achieve this brave new world.

Finally - a few thoughts / assumptions to take into account when designing for this nice fluffy cloud stuff:

- All server infrastructure breaks
- Datacentres break
- Networks break
- Operating systems break
- Geographic issues exist

Concentrate on issues associated with the above points and architect with this in mind - and you might just get something that could be cloud compatible.
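As a rough illustration of architecting with those assumptions in mind, the sketch below has the application itself try a list of locations in turn, rather than trusting any single one to stay up. The names (`call_with_failover`, the endpoint labels, the handler callables) are hypothetical, and treating `ConnectionError` and `TimeoutError` as the retryable failures is an assumption made for the example.

```python
import time


def call_with_failover(endpoints, request, attempts_per_endpoint=2, base_delay=0.1):
    """Try each (name, handler) endpoint in order; return the first success.

    `endpoints` is a list of (name, callable) pairs standing in for the same
    service deployed in different locations.
    """
    errors = []
    for name, handler in endpoints:
        for attempt in range(attempts_per_endpoint):
            try:
                return name, handler(request)
            except (ConnectionError, TimeoutError) as exc:
                # Assume the failure is normal: record it, back off, move on.
                errors.append((name, attempt, repr(exc)))
                time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError(f"all endpoints failed: {errors}")
```

Called as `call_with_failover([("london", london_handler), ("dublin", dublin_handler)], request)`, a broken first location costs a short delay rather than an outage. The design choice here is that failure handling lives in the application's call path, so it does not rely on DNS, a load balancer or any other single piece of infrastructure being healthy.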

Finally - don't try to architect from the bottom up (i.e. from the infrastructure layer upwards) but concentrate on the application design downwards and the attributes that are required (scale out, resilience, security, data availability, disaster recovery, encryption etc.).

Cheers,

Stuart.


Note to Microsoft (if you happen to be listening - which I doubt) - take the chance to educate the technology community and explain how you are going to fix these types of issues going forward. As a clue - the answer isn't to make DNS more resilient. We would all be very interested in how you are going to change application design, management and orchestration approaches to achieve a 100% uptime goal, and not in a whole heap of new infrastructure stuff!


Location: London, United Kingdom