EasyOutsource Downtime on June 7, 2010
June 8 at 1:37 pm in News, Tips & Tutorials by Matt O'Brien (Admin)
Yesterday, June 7, 2010, EasyOutsource was down from about 4 pm CST until around 4 am CST the following day. First, I want to apologize to everyone for the inconvenience. We’re a business productivity site, and we’re not helping your productivity when you can’t access our site.
Second, for anyone interested, I want to let you know what happened and what we learned from it. This information may be useful to other website owners navigating the obscure world of hosting and server administration.
What Happened
EasyOutsource lives on a dedicated server in Los Angeles hosted by WebNX. Yesterday morning at around 11 am CST, the server stopped responding and required a hard reboot. Because the server stopped responding to the network before our monitoring services could collect data useful in diagnosing the problem, we weren’t exactly sure what had happened. After a quick reboot, everything seemed to be running fine.
Then, a few hours later, the server went down again. This time it did not recover from a reboot. Our server admin guys at AdminGeekz started working with WebNX to figure out the problem. It was quickly discovered that we were having a hard drive failure, which was causing kernel panics. The hard drive needed to be replaced.
WebNX connected a new drive to the machine, and AdminGeekz was able to copy the files from the old drive onto the new one. The data copy took quite a while. Once the files were copied and all the software was reinstalled on the new drive, the server was good to go again, and EasyOutsource was back up at around 4 am CST June 8.
What We Learned
We learned and reaffirmed a few lessons from this experience, and I’d like to share them.
Lesson 1: Hire a Good Server Management Company
First of all, we reaffirmed our belief in the necessity of retaining good server admins for anyone hosting commercial websites. In my early days as a web designer (back around 2005) I learned a few hard lessons on this topic. I was running a dedicated server and paying a small monthly retainer to a company called FirmbIT to monitor and manage the server. FirmbIT was often slow to respond to tickets, and did not list a phone number or address on their site. I knew these were warning signs, but the price was so good and the service just good enough that I stayed with them. When problems occurred due to hackers, hardware failure, or faulty code, a lot of stress entered my life.
One memorable incident taught me the error of my ways. I was on vacation when my server was penetrated with a worm that commandeered the machine’s resources and put them to work attacking other servers and attempting to replicate itself. The worm brought down all my sites while it went about its business. My first response, naturally, was to open a ticket with FirmbIT, my server admin company. Their response came excruciatingly slowly, and did not fill me with confidence in their ability or motivation to solve my problem. The back and forth via their ticketing system over the next two days resulted in no progress, and eventually it became clear that they didn’t know what they were doing. Finally they stopped responding to my messages altogether. My sites had been down for two days, and my stress level grew in proportion to my clients’ agitation.
My next step was to hire a server administrator on Elance. The first guy I hired charged $100 per hour, billed about 8 hours, and told me that the server was screwed and all the data was lost. My last backup had been done months ago, and if this was true the data loss would have been devastating.
I chose to get another opinion. Back to Elance. This time I hired a guy named Remus from Romania. It was immediately clear that Remus knew what he was doing. He billed 4 hours at $25 an hour, removed the malicious software, and restored all of my sites to perfect working condition. I’ve worked with him ever since and he’s built a number of sites for me.
Remus is still my go-to server admin guy for development work, but after this incident I knew I needed a good 24/7 admin company for problem response. After a lengthy search, I was referred to AdminGeekz, who have been my server management company ever since. They’re fantastic. Their prices are good, their responses come within minutes, and these guys really know their stuff.
Lesson 2: Hardware Fails
The second thing we take away from yesterday’s incident is this rule: hardware fails. Inevitably. I asked AdminGeekz whether I should request compensation from WebNX due to the downtime. AdminGeekz had this to say:
All hard drives die; some take years, some take days. WebNX doesn’t have any control over that. For their part, they acted quickly to get a new drive into the server so that we could begin the migration. The majority of the time so far has been copying files, which is taking quite a while because of the number of files on the disk.
The only way to prevent something like this from happening would be to pay for a server with redundant disks – RAID1 at least. Other than that, there’s nothing WebNX could have done.
We’ve been very happy with WebNX so far, and if AdminGeekz says they did everything they should have, we’re satisfied. A hardware failure isn’t necessarily anyone’s fault. The host’s responsibility is to provide quality hardware and replace it if/when it fails. That’s what they did.
Lesson 3: Make Backups Regularly
The final lesson of the day is to make backups of your data. Your backups should occur regularly, at least once a day. And the backups should be stored at a separate location from your main hard drive(s). I once had a server go down because the data center literally blew up. If my backups had been on an adjacent machine, they wouldn’t have done me any good. For EasyOutsource, incremental backups occur every 12 hours at a distant location. In yesterday’s incident we were able to recover everything from the faulty hard drive, but even if we hadn’t, the data loss would have been minimal: at most 12 hours of changes. If we weren’t taking backups, a drive that couldn’t be read at all would have been catastrophic.
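For readers who want to see what an incremental backup means in practice, here is a minimal Python sketch. It is not how EasyOutsource's backups are actually implemented (that is handled by our admins and isn't described here); it only illustrates the idea of copying just the files that are new or have changed since the last run. The function name and both directories are hypothetical, and the sketch assumes the destination is storage at a separate location that is mounted as a local path.

```python
import shutil
from pathlib import Path


def incremental_backup(source: Path, dest: Path) -> list[Path]:
    """Copy files that are new or changed since the last run.

    Returns the paths (relative to source) of the files that were copied.
    Assumption: dest is outside source, e.g. a mounted remote volume.
    """
    copied = []
    for src_file in sorted(source.rglob("*")):
        if not src_file.is_file():
            continue
        rel = src_file.relative_to(source)
        dst_file = dest / rel
        if dst_file.exists():
            s, d = src_file.stat(), dst_file.stat()
            # Unchanged: the backup copy is at least as new and the same size.
            if d.st_mtime >= s.st_mtime and d.st_size == s.st_size:
                continue
        dst_file.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src_file, dst_file)  # copy2 also preserves the timestamp
        copied.append(rel)
    return copied
```

To run something like this every 12 hours you would call it from a scheduler such as cron. A sketch like this never removes files from the backup and keeps no older versions, so for a real site a dedicated tool such as rsync, or a managed backup service, is the safer choice.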
Again I would like to apologize for the inconvenience to our users. I hope our lessons might be of use to some of our readers. Please share your own experiences and lessons in the comments.


This is exactly what I love about this site! They are here to help on everything….even educating us on any issues they may have had to resolve. And, they don’t ever leave you in the dark about anything.
Thank you for the great education…I learned something.
You guys are awesome!
Kathy
Hi Kathy,
You’re welcome!
One of EasyOutsource’s objectives is to be as transparent with our users as possible. We’re glad you appreciate it. Thanks!
On the subject of doing regular backups, I highly recommend WP-DB-Backup for anyone running a WordPress site: http://wordpress.org/extend/plugins/wp-db-backup/
It takes about a minute to install and lets you run scheduled database backups, which can be saved on your server and also sent to an email address. Great peace of mind. If you have a number of websites, I recommend setting up a dedicated email account just for backups and setting all of your WordPress sites to email weekly database backups to that account via WP-DB-Backup.