This week, I wanted to share some of the steps we take to ensure the security and availability of GAMEhud. I will go over hosting, backups, our development process and monitoring.
Our service runs on the Rackspace Cloud. Basically, it is a virtual machine that we built and secured ourselves, running on their infrastructure. This service has been very reliable. We have had a few instances of network availability issues, but as I write this, the machine we set up has been running for 217 days without a reboot (gotta love *NIX). Even with the few instances of network issues (and some planned downtime), we are sitting at 99.95% availability for the last 3 months.
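To put that 99.95% figure in perspective, here is a quick back-of-the-envelope calculation. It is only a sketch: the function name is made up for illustration, and I am assuming "the last 3 months" means 90 days.

```python
def downtime_minutes(availability_pct: float, period_days: float) -> float:
    """Return the downtime, in minutes, implied by an availability percentage."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


# Assumption: "the last 3 months" is treated as 90 days.
print(round(downtime_minutes(99.95, 90), 1))  # prints 64.8
```

Under that assumption, 99.95% works out to a little over an hour of total downtime across the whole period, planned maintenance included.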
In terms of availability, we always want to shoot for 100% for all of our services. However, we are going to invest more time and money in keeping our tracking services as close to 100% availability as possible. For example, if the dashboard goes down for a few minutes, that is not as big a deal as not being able to receive the tracking data you are sending us.
We do full server backups every week and differential backups every day and store them on Rackspace's infrastructure. We also do a second set of server and/or database backups that we ship to Amazon S3 on a weekly, daily and hourly basis. So, we have redundancy in our backup solution and should lose no more than an hour of data in a significant data loss event. Lastly, we use Git for source control. Git is a distributed source control system, which means every clone of a repository holds a complete copy of the source code and its history. So, we have 2+ copies of the source code at all times in different locations.
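The "no more than an hour" claim follows from the shortest interval among the backup sets that are still usable after an incident. Here is a minimal sketch of that reasoning. The function and the backup set names are hypothetical labels for the schedule described above, not our actual backup tooling.

```python
from datetime import timedelta


def worst_case_loss(backup_intervals: dict[str, timedelta], available: set[str]) -> timedelta:
    """Worst-case data loss: the shortest interval among backup sets still usable."""
    usable = [interval for name, interval in backup_intervals.items() if name in available]
    if not usable:
        raise ValueError("no usable backup set")
    return min(usable)


# Hypothetical names for the most frequent backup in each location.
intervals = {
    "rackspace_differential": timedelta(days=1),
    "s3_hourly": timedelta(hours=1),
}

# Both backup sets survive: at most an hour of data is lost.
print(worst_case_loss(intervals, {"rackspace_differential", "s3_hourly"}))  # 1:00:00
# Only the Rackspace set survives: the exposure grows to a day.
print(worst_case_loss(intervals, {"rackspace_differential"}))  # 1 day, 0:00:00
```

This is also why the second backup location matters: losing the S3 copies would not lose the data, but it would widen the worst-case gap from an hour to a day.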
A key to building a secure and highly available service is your development process. We strive to practice Test-Driven Development (TDD) at all times to produce reliable code. So, we have a suite of tests that we run during development and prior to deployment to confirm the service works as we expect. We also run test code coverage and security analysis tools against our code to check that it is covered by tests and to catch security problems. Lastly, we are constantly learning new tools and techniques to improve the security and availability of our service.
We use New Relic to monitor the performance and availability of our service. For example, the 99.95% availability metric comes from New Relic's reporting. In addition to availability monitoring, New Relic allows us to dig deep to find and fix any performance-related issues that appear. We use another service called Airbrake to track application exceptions. These services help us track the operation of our service and hopefully let us know about any issues before you notice them. So, if our server is unreachable for more than a minute, I get an email, and if an application exception occurs in our service, I get an email. Fixing application exceptions quickly is not only good for availability but good for security as well.
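The "unreachable for more than a minute" rule can be pictured as a simple check over recent reachability results. This is only an illustration of the idea: it is not how New Relic implements its alerting, and the function name, the 30-second check interval and the sample timestamps are all assumptions of mine.

```python
from datetime import datetime, timedelta


def should_alert(checks, threshold=timedelta(minutes=1)):
    """Decide whether an ongoing outage has lasted longer than the threshold.

    checks: list of (timestamp, reachable) tuples in chronological order.
    """
    outage_start = None
    for timestamp, reachable in checks:
        if reachable:
            outage_start = None  # a successful check ends any outage
        elif outage_start is None:
            outage_start = timestamp  # first failed check of a new outage
    if outage_start is None:
        return False
    return checks[-1][0] - outage_start > threshold


# Hypothetical checks every 30 seconds: one success, then four failures.
start = datetime(2024, 1, 1, 12, 0, 0)
step = timedelta(seconds=30)
checks = [(start, True)] + [(start + step * i, False) for i in range(1, 5)]

print(should_alert(checks))      # True: failures span 90 seconds
print(should_alert(checks[:3]))  # False: failures span only 30 seconds
```

Waiting for the outage to pass a threshold, rather than alerting on the first failed check, avoids an email for every momentary network blip.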
We are always on the lookout for tools or techniques that will help us provide a better service to our customers. If you have a question about anything I have mentioned, or think I have missed something, please leave a comment below.
As always, we welcome all comments and suggestions. If you liked this article, please share it with others.