Sunday, December 20, 2009

The Down-Side of Cloud Computing

Over the last 50 years, with exponentially increasing power and rapidly declining costs, information technology has proved to be an irresistible and inevitable force. Cloud computing is the latest step in that rapid advance.

Cloud computing is, at its core, a way of outsourcing information technology. It gives organizations huge cost savings, it frees up cash that's otherwise tied up in hardware, and it allows organizations to focus on their core competencies rather than trying to manage systems and specialists that they are ill-equipped to manage.

I am not going to get into the weeds here, but I do want you to know that a number of new, complex technologies operate in the Cloud. These technologies run largely independently of one another, and that independence adds a measure of chaos to the Cloud that makes it almost impossible for non-techies to grasp.

I want to share several of my experiences dealing with problems in the Cloud and with the technologies that run it.

Troubleshooting a Virtual Machine. Virtual machines operate like real machines -- except when the virtual machine crashes.

Here’s what I mean. One of the first steps in resolving a system problem in Windows is to look in the event logs and see what errors there are. If you are lucky, that will give you a clue as to where the problem lies and how to fix it.

But on a virtual machine, the logs can be misleading. For example, I was troubleshooting a problem on a virtual machine, and the event log of the virtual machine reported a hardware disk error. Well, if that were a real error, I'd expect to see a similar error in the event log of the host -- the real machine. But the event log of the host machine was clean, meaning the physical drive was OK. So what does it mean to have a hardware error on a virtual hard disk? That is a head-scratcher. What do you do? Reboot the virtual machine and hope the problem goes away.
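
For what it's worth, that first step is easy to script. Below is a rough sketch, in Python, of pulling the most recent critical and error events from the Windows System log so the guest's and the host's logs can be laid side by side. It leans on Windows' built-in wevtutil command; the function name and the event count are my own choices, not part of any product.

    # Sketch: list the newest critical/error events from the System log.
    # Run it on the virtual machine and again on the host, then compare.
    import subprocess

    def recent_system_errors(max_events=25):
        """Return the newest critical and error System-log events as text."""
        cmd = [
            "wevtutil", "qe", "System",
            f"/c:{max_events}",                    # newest N events
            "/rd:true",                            # reverse direction = newest first
            "/f:text",                             # human-readable output
            "/q:*[System[(Level=1 or Level=2)]]",  # critical (1) and error (2) only
        ]
        return subprocess.run(cmd, capture_output=True, text=True).stdout

    if __name__ == "__main__":
        print(recent_system_errors())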

Recovering Data from a Virtual Disk. In a different situation, a sys admin was performing routine maintenance on a host machine, patching and updating the host operating system.

Unfortunately, one of the files associated with a virtual machine was inadvertently erased. I was called in to restore a backup of the virtual machine on a backup host machine. This is fairly easy to do and is one of the selling points of virtual machines. I had the backup of the virtual machine up and running quickly.

The wrinkle here is that there was a period of time between the backup and when the data was erased. There was data on the virtual machine's disk that was lost that we wanted to get back.

Data recovery tools deal with the physical media, looking for the electro-magnetic shadows or ghosts of deleted files. That makes them ill-suited to finding lost data on virtual machine disks.
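
To make that concrete, here is a toy sketch in Python of what a carving tool does at its simplest: scan the raw media for the byte signatures of a known file type (JPEG, in this example). The image path is a placeholder, and real tools are far more sophisticated, but the point stands: they assume direct access to the physical media.

    # Toy illustration of signature-based "carving": scan a raw disk image
    # for byte ranges that look like JPEG files. Real recovery tools handle
    # fragmentation, filesystem metadata and much more; this only shows the
    # basic idea.
    JPEG_START = b"\xff\xd8\xff"
    JPEG_END = b"\xff\xd9"

    def carve_jpegs(image_path, max_size=10 * 1024 * 1024):
        """Yield (start, end) byte offsets of candidate JPEGs in a raw image."""
        data = open(image_path, "rb").read()  # fine for a toy example
        pos = 0
        while True:
            start = data.find(JPEG_START, pos)
            if start == -1:
                break
            end = data.find(JPEG_END, start, start + max_size)
            if end != -1:
                yield start, end + len(JPEG_END)
                pos = end + len(JPEG_END)
            else:
                pos = start + len(JPEG_START)

    # Hypothetical usage:
    # for lo, hi in carve_jpegs("physical_disk.img"):
    #     print(f"possible JPEG at offset {lo}, {hi - lo} bytes")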

In this case there was a one-to-one map between the physical machine and the virtual machine. So we had physical media to look at. That's why I restored the backup to a different host. This was a best-case scenario for recovering data from a virtual disk.

But a virtual machine is just a handful of files on a physical device. Hundreds of thousands of the virtual machine's own files are packed into those few physical files. To recover the lost data, we had to be able to completely recover the LARGE missing physical file and instantiate the virtual machine as it was at the time it was erased.

We turned to Kroll-Ontrack, the market leader in forensic data recovery. Unfortunately, they were not able to put Humpty-Dumpty back together again with the tools they have.

So, it is interesting to note that while virtual machine technology is sold as providing enhanced data security, at least in this scenario, the opposite was true.

Running Amok. In the last scenario, I glossed over the matter of the Sys Admin's error. But this is a serious problem.

With cloud computing and virtual machines, sys admins are forced to access resources remotely. To work on these systems, we will sometimes have several remote sessions open on one physical PC, meaning several desktops are open and we toggle between them. All of the desktops look the same. You constantly have to ask yourself, “Where am I and what am I doing here?” One trick I use to reduce the confusion is to give each remote machine its own distinctive wallpaper, so the desktops no longer look identical as I toggle among them.
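
Here is that wallpaper trick as a small script, for what it's worth. It uses the Win32 SystemParametersInfo call through Python's ctypes; the image path below is just a placeholder.

    # Set a distinctive wallpaper on the machine you are logged in to, so
    # each remote desktop is visually unmistakable. Windows only.
    import ctypes

    SPI_SETDESKWALLPAPER = 20
    SPIF_UPDATEINIFILE = 0x01
    SPIF_SENDCHANGE = 0x02

    def set_wallpaper(image_path):
        """Apply the image at the given full path as the desktop wallpaper."""
        ok = ctypes.windll.user32.SystemParametersInfoW(
            SPI_SETDESKWALLPAPER, 0, image_path,
            SPIF_UPDATEINIFILE | SPIF_SENDCHANGE)
        if not ok:
            raise ctypes.WinError()

    # e.g. run once on each server you administer (path is hypothetical):
    # set_wallpaper(r"C:\wallpapers\PRODUCTION-DB-SERVER.bmp")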

Confusion is not the only pitfall for system administrators. Here’s a completely different example of Sys Admins running amok.

Recently, a graduate student at a nearby University took a browser-based, online exam run by the school for one of his classes. A complex web of technology underlies the school's testing system. It uses web services to link different University databases and a Blackboard content management system. Questions and answers traveled over the University's intranet, the Internet and a WiFi access point in the classroom. The student used his own laptop to take the test.

Unfortunately, there was a technical glitch, and the school's IT department accused the student of cheating on the exam based on irregularities in the Blackboard log of the student's exam session.

In Kafkaesque fashion, several hearings on the matter were conducted. Each time, the school's IT department insisted that they were the experts and that the Blackboard log could only mean that the student had cheated.

I testified on the student's behalf in the final hearing. I told the hearing that one anomaly in one link of a long chain was not a smoking gun. I said that the IT department had jumped to a conclusion without looking into the matter completely. What did the web server logs show? What did the logs on the WiFi access point show? What about the student's laptop? The IT department never looked, and by then it was months later and too late to go back and look.
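
The cross-check I had in mind is not exotic. In principle, you pull the time-stamped entries from each link in the chain (the Blackboard log, the web server, the WiFi access point) and lay them on a single timeline. A rough sketch, with made-up file names and a simplified log format:

    # Merge log entries from several sources into one time-ordered list.
    # The file names and the timestamp format are hypothetical; every real
    # system logs differently, and the clocks must be reconciled before a
    # comparison means much.
    from datetime import datetime

    def load_events(path, source):
        """Assume each line begins with 'YYYY-MM-DD HH:MM:SS ' then a message."""
        events = []
        with open(path) as f:
            for line in f:
                line = line.rstrip("\n")
                stamp = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
                events.append((stamp, source, line[20:]))
        return events

    def merged_timeline(*logs):
        """Each argument is a (path, label) pair; returns events sorted by time."""
        events = []
        for path, label in logs:
            events.extend(load_events(path, label))
        return sorted(events)

    # Hypothetical usage:
    # for stamp, source, msg in merged_timeline(
    #         ("blackboard.log", "Blackboard"),
    #         ("webserver.log", "Web server"),
    #         ("wifi_ap.log", "Access point")):
    #     print(stamp, source, msg)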

Rather than appearing and acting like experts, in my view, the school's IT department was inept and vindictive in this case. Fortunately for the student, he was found innocent of the charges in this final hearing.

As cloud computing spreads, the roles and responsibilities of the technical staffs that support it will grow in importance. The consequences for organizations and the public when these technicians run amok will grow as well.

Scaling Up. The examples I have described are all small. The consequences were not large, and they were not felt beyond a few individuals. But it is not hard to see how easily these examples scale up in the cloud.

The same kind of software bug that cripples my Exchange server might just as easily interrupt service on Microsoft's Hotmail or MSN email services.

But while most of us can tolerate a short service interruption – be it in email, Internet access or electricity – many of us cannot tolerate data problems. More and more of us are forgoing hard copy and snail mail in favor of email and electronic data stored in the cloud. When data is lost or privacy is breached, the consequences and the costs can be painful. When that happens in the cloud and involves Microsoft or Google, millions of people could be affected.

At the systems and architectural level, there is fault tolerance built in. The Internet is highly redundant and it is supposed to be able to function during a nuclear attack. Web services work together, but they are “loosely coupled” so that a problem in one system does not spread to another.
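
"Loosely coupled" just means that a caller treats its dependencies as things that can fail without taking the caller down with them. A minimal sketch, with a placeholder URL and a made-up fallback value:

    # A caller that degrades gracefully when a dependent web service fails,
    # instead of letting the failure spread to its own callers.
    import json
    import urllib.request

    def fetch_quote(url="https://example.com/api/quote", timeout=2.0):
        """Return data from a dependent service, or a safe default if it is down."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return json.load(resp)
        except (OSError, ValueError):
            # Network trouble, a timeout, or malformed JSON: fall back rather
            # than propagate the failure.
            return {"status": "unavailable"}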

When airliners destroyed the World Trade Center towers, an economic meltdown was averted because a bomb blast in the building eight years earlier had led the financial firms headquartered there to construct clouds where data could be stored more securely. And that worked in 2001. Now most financial and other commercial data of large organizations is stored in clouds.

But psychological and political factors are such that the technological checks and balances may not be enough to stave off an economic meltdown. It has been reported that the Chinese, our enemies and terrorists are capable of attacking soft targets on the Internet (read: the Cloud). Hurricane Katrina, last year's financial crisis, pandemic flu… Events like these could conceivably impair the cloud and cause data loss, triggering panic and an economic meltdown at some point in the future.

Recommendations. Now I've got some recommendations for you, assuming you are a client of mine who is looking at cloud computing and planning to outsource one or more IT functions.

  1. Perform your due diligence. You need to be confident that the vendor you are going to use has the financial, technical and managerial resources to deliver and survive. In this vein, you might want to monitor audited financial statements of the vendor on an ongoing basis after you start working together.
  2. Does the Service Level Agreement promise what you need/expect? (See the downtime arithmetic after this list.) Chances are that you are not going to be able to negotiate the terms and penalties of the SLA. It is usually a take-it-or-leave-it situation for small businesses. But you should leave it if the SLA is not what you want or if it is unbalanced.
  3. Insurance. Insist that the vendor has Professional Liability coverage. Some general business liability insurance policies specifically exclude liabilities for data processing activities. Look at the vendor's certificate of insurance. See if you can get listed as an additional insured on the vendor's policies. Make sure the policy limits are not too low. Look at the per-occurrence limits too.
  4. Be prepared to sue. As we all know, “IT happens.” In the event of a bad outcome, neither the vendor nor his insurance company is going to offer to make you whole. They will start by offering you pennies on the dollar. You are probably going to have to sue them to be made whole. Make sure that the contract you sign with the vendor lets you recover your legal fees if you win a judgment against him. In many jurisdictions each side pays its own legal fees unless the contract says otherwise, and without that language your net recovery can be significantly reduced.
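
On item 2, "what you need" usually comes down to numbers. Here is the back-of-the-envelope arithmetic for turning an advertised uptime percentage into the downtime it actually permits (pure arithmetic, not any particular vendor's terms):

    # How much downtime each advertised uptime level actually allows.
    MINUTES_PER_YEAR = 365.25 * 24 * 60

    for uptime in (0.99, 0.995, 0.999, 0.9999):
        down_per_year = (1 - uptime) * MINUTES_PER_YEAR
        print(f"{uptime:.2%} uptime allows about {down_per_year / 12:6.1f} min/month "
              f"({down_per_year / 60:5.1f} hours/year) of downtime")

Even "three nines" permits roughly 43 minutes of downtime a month. Decide whether that is acceptable for your business before you sign.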