Intuitively, we all understand that automating IT operations, including builds, deployments, configurations, upgrades and so on, is a good thing. We all know that humans make mistakes, and mistakes can be costly when they affect a large group of people (e.g., a large user community) or otherwise result in lost revenue for a business.
But how error-prone are human actions, really? Note that we’re not talking about the normal error rate in software development or other creative fields. Clearly, it is very difficult, if not impossible, to prevent errors when a task can’t be formalized. On the other hand, a large area of IT is operations and maintenance, which consists mostly of predictable, repetitive tasks. There are many tools, from simple scripts to super-expensive enterprise products, for automating these types of tasks. Knowing the probability of human error would help us estimate the potential benefits of these tools and, consequently, assess the return on investment.
The classic formula for calculating human reliability can be found “here”:http://www.ida.liu.se/~eriho/HRA_M.htm. Without going too deep into the math, we can observe the following:
* Every action performed by a human has a probability of error. It is never zero.
* Most tasks (at least in IT) consist of multiple steps (actions). E.g., a change may have to be made on multiple servers.
* The likelihood of making at least one error compounds quickly as the number of steps grows, as the short sketch after this list shows.
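To make the compounding concrete, here is a minimal sketch. The per-step error probabilities are purely illustrative assumptions, not measured values; the only modeling assumption is that steps fail independently with the same probability.

bc. # A minimal model: if each step succeeds independently with probability (1 - p_step),
# the chance of getting at least one step wrong compounds with the number of steps.
def p_at_least_one_error(p_step, steps):
    return 1.0 - (1.0 - p_step) ** steps
# Illustrative per-step error rates, assumptions rather than measured data
for p_step in (0.01, 0.03, 0.05):
    for steps in (1, 5, 14, 50):
        print(f"p_step={p_step:.0%}, steps={steps:2d}: "
              f"P(at least one error) = {p_at_least_one_error(p_step, steps):.0%}")

Even a seemingly respectable 3% per-step error rate turns into roughly a 35% chance of botching a 14-step change.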
So it should not come as a surprise that the probability of error can be quite high for a complex task. According to the data published on “Ray Panko’s website”:http://panko.shidler.hawaii.edu/HumanErr/Multiple.htm, 28 experienced users had, on average, a 33% error rate on a task involving 14 command-line steps. Another interesting tidbit from the same site is a 50% error rate in following a checklist. It is unfortunate that the details of these studies are not documented on the site.
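Working backwards from that 33% figure (and assuming, purely for the sake of the estimate, independent steps with a uniform per-step error rate), each individual command only has to go wrong a few percent of the time:

bc. # Back out the per-step error rate implied by a 33% failure rate over 14 steps,
# assuming independent, equally error-prone steps (a simplification).
p_task = 0.33
steps = 14
p_step = 1.0 - (1.0 - p_task) ** (1.0 / steps)
print(f"Implied per-step error rate: {p_step:.1%}")  # roughly 2.8%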
Of course, many of these errors can be caught and corrected via testing. It is common knowledge that every change has to be accompanied by some verification or “smoke” testing.
But some changes are impossible or very expensive to validate. Imagine having to change the JVM maximum heap size to prevent an application from running out of memory. Imagine also that this is a high-volume application that runs on four different servers. Imagine further that one of the four servers was not updated by mistake. You are not going to find out about it until the application starts crashing on that server under load – and that will be the worst possible time to deal with the issue. Now, what if the parameter that had to be changed was some obscure garbage collection setting intended to improve the application’s performance? Users will experience intermittent performance issues, but nothing will explicitly point to the offending server. Discovering the root cause of the problem could take quite a while. The bottom line is that some errors can only be discovered by users, at which point the cost of fixing them is going to be substantial.
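Here is a rough sketch of the kind of check that is trivial to automate but tedious to do by hand. The hostnames, the expected heap flag, and passwordless ssh access are all assumptions made for illustration; the script simply looks for the expected -Xmx value in each server’s Java command line:

bc. import subprocess
# Hypothetical servers and expected setting; adjust to your environment.
SERVERS = ["app1", "app2", "app3", "app4"]
EXPECTED_FLAG = "-Xmx4g"
for host in SERVERS:
    # List the full command lines of running Java processes on the remote host.
    result = subprocess.run(["ssh", host, "pgrep -af java"],
                            capture_output=True, text=True)
    if EXPECTED_FLAG in result.stdout:
        print(f"{host}: OK")
    else:
        print(f"{host}: MISMATCH, check JVM arguments")

A check like this takes seconds to run after every change; relying on someone to eyeball four servers by hand is exactly the kind of step that the error rates above apply to.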
I think that we tend to overestimate the reliability of human actions and underestimate the cost of fixing errors. After all, how hard could it be for an experienced administrator to run a few commands? It could indeed be very easy, but that still does not make the administrator’s actions any more reliable.
The bottom line is that almost any investment in IT automation is well worth it. Unfortunately, this view is not universally accepted. Many organizations still live in the stone age when it comes to automation, but that’s a subject for another post.