AWS performance quality doubled in Q2

While the Northern Virginia and Northern California regions continue to lead in the number of errors (12 and 8 respectively), São Paulo, Tokyo and Singapore registered just 1 service error.

Data collected from AWS Health Dashboard reporting shows a 50% decline in global AWS performance issues, from 8 in Q1 2014 to just 4 in Q2 2014. Other, less critical errors increased from 33 to 48. The data also shows that Northern Virginia recorded the highest number of errors of any region: 12, of which 2 were performance issues. Northern California followed with 8 errors, only one of which was a performance issue. Services that are not region-specific (Global) recorded 17 errors, of which just one was a performance issue. São Paulo, Tokyo and Singapore recorded one error (with no performance issues in Q2 2014).

Analyzed by service, Route 53 and EC2 had the most errors (8), but just one EC2 error was recorded as a performance issue. These were followed by ELB with 5 errors, then CloudFront, CloudWatch, Mechanical Turk, Simple Email Service, and RDS with 4 errors (Simple Email Service recorded one performance issue in Q2 2014).

AWS has recorded considerable regional fluctuations in the number of errors and performance issues. For example, Northern California, which had recorded 2 errors in Q1, jumped to 8 errors in Q2, one of which was a performance issue. São Paulo, which recorded the highest number of regional performance issues in Q1 (3), recorded just one error in Q2, with no performance issues.

Planning the location of your app based on the historical number of errors and performance issues is probably not the best approach. While cloud provider issues are important to note, consider that the top reason for application downtime remains human error. If you want to increase service availability dramatically, cross-region disaster recovery will yield optimal results. This way, if your primary region is down, you can run a failover procedure for your entire application stack, which essentially spins up an exact replica of your application in an alternate AWS region.
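The failover decision described above can be sketched in a few lines of Python. This is a minimal illustration, not an AWS API: the function pick_active_region, the region priority list and the simulated health check are all hypothetical, and a real setup would plug in its own health probe and the steps that actually bring up the replica stack.

```python
from typing import Callable, Optional, Sequence


def pick_active_region(
    regions: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first region, in priority order, that passes the health check."""
    for region in regions:
        if is_healthy(region):
            return region
    return None  # no region is currently healthy


# Hypothetical priority list: primary region first, then the recovery region.
REGIONS = ["us-east-1", "us-west-2"]

# Stand-in health check that simulates an outage in the primary region.
simulated_outages = {"us-east-1"}
active = pick_active_region(REGIONS, lambda r: r not in simulated_outages)
print(active)  # us-west-2
```

Passing the health check in as a function keeps the selection logic separate from however health is actually measured, so the same sketch works with a simulated outage or a real probe.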

Of course, there’s a lot you can do to ensure the availability of your application on the Amazon cloud. AWS provides extensive guidance on how to build a fault-tolerant application. For starters, run applications simultaneously and independently in multiple Availability Zones so that if one zone does fail, the application running in the other zone can continue to run without impact. Information on how to do this can be found in the AWS Architecture Center, which contains more guidance, including training webinars, best practice guides and the white paper "Building Fault-Tolerant Applications on AWS".

Microsoft Azure Service Availability in Q2: 22% Fewer Errors Overall, 800% Increase in Service Interruption

Despite improvement, Q2 issues show increased Service Degradation and Interruption, which carry more severe effects on application availability.

In Q2 2014 Microsoft Azure experienced 201 service issues, as compared to 259 in Q1 2014—a 22% improvement. However, Q2 service issues tended to be more severe: Service Interruptions increased over 9-fold, from 3 to 28, and Service Degradation increased from 88 to 131, a 49% increase. Service Information decreased from 168 to 42—a 75% decline.
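The percentages above follow directly from the issue counts. As a quick check, the minimal Python sketch below recomputes them; the helper name percent_change and the dictionary layout are illustrative choices, not part of Microsoft's reporting.

```python
def percent_change(before: int, after: int) -> float:
    """Return the relative change from before to after, in percent."""
    return (after - before) / before * 100


# Azure issue counts quoted above, as (Q1 2014, Q2 2014).
azure_issues = {
    "Service Interruption": (3, 28),
    "Service Degradation": (88, 131),
    "Service Information": (168, 42),
}

# Quarterly totals are the sum of the three categories.
q1_total = sum(q1 for q1, _ in azure_issues.values())
q2_total = sum(q2 for _, q2 in azure_issues.values())

for category, (q1, q2) in azure_issues.items():
    print(f"{category}: {q1} -> {q2} ({percent_change(q1, q2):+.0f}%)")
print(f"Total: {q1_total} -> {q2_total} ({percent_change(q1_total, q2_total):+.0f}%)")
```

The three categories sum to the quoted totals of 259 and 201, and the rise in Service Interruptions works out to roughly 833%, consistent with the "over 9-fold" figure.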

The service with the most issues was SQL Databases with 47, of which 8 were Service Interruptions and 37 were Service Degradations. This was followed by Compute (Service Management) with 30 issues, of which 27 were Service Degradations, and by Compute and Storage with 18 issues. Other noteworthy products with issues included Service Bus with 9 issues, 7 of which were Service Interruptions, and SQL Reporting with 9 issues, 8 of which were Service Interruptions.

An analysis of service issues by region shows that Americas West had the highest number of issues overall: 33. However, Europe West reported the highest number of Service Interruptions: 5. Japan East and West showed the fewest issues: 6.

While Service Interruptions and Service Degradations occurred across regions, it’s important to remember that the top reason for application downtime remains human error. To eliminate the impact of any cloud provider’s service disruption on your application altogether, implement continuous, cross-region replication of the entire application stack. This way, you can switch to another region whenever Service Degradation or Interruption puts application service at risk.