Value of IT Automation

Intuitively we all understand that automating IT operations, including builds, deployments, configurations, upgrades and so on is a good thing. We all know that humans make mistakes and mistakes can be costly when they affect a large group of people (e.g., a large user community) or otherwise result in lost revenue to a business.

But how error-prone are human actions, really? Note that we’re not talking about the normal error rate in software development or other creative fields. Clearly, it is very difficult if not impossible to eliminate errors when the task can’t be formalized. On the other hand, a large area of IT is related to operations and maintenance, and it involves mostly predictable and repetitive tasks. There are many tools, from simple scripts to super-expensive enterprise products, that deal with automating these types of tasks. Knowing the probability of human error would help us estimate the potential benefits of these tools and, consequently, assess the return on investment.

The classic formula for calculating human reliability can be found here. Without going too much into the math, empirically we can ascertain the following:

  • Every action performed by a human has a probability of error. It is never zero.
  • Most tasks (at least, in IT) consist of multiple steps (actions). E.g., a change may have to be made on multiple servers.
  • The likelihood of at least one error goes up with the number of steps (roughly proportionally while the per-step error rate is small).

So it should not come as a surprise that the probability of error could be quite high for a complex task. According to the data published on Ray Panko’s website, 28 experienced users had, on average, a 33% error rate in a task involving 14 command-line-based steps. Another interesting tidbit from the same site is a 50% error rate in following a checklist. It is unfortunate that the details of these studies are not documented on the site.
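To see how quickly errors compound, here is a minimal sketch. It assumes that every step has the same error probability and that steps fail independently of each other; that is a simplification of mine, not something the studies above state, and the function names are invented for illustration.

```python
def task_error_probability(step_error_rate, steps):
    """Chance that at least one of `steps` independent actions goes wrong."""
    return 1.0 - (1.0 - step_error_rate) ** steps


def implied_step_error_rate(task_error_rate, steps):
    """Per-step error rate that would produce the given whole-task error rate."""
    return 1.0 - (1.0 - task_error_rate) ** (1.0 / steps)


# Assumption: the 14 steps are independent and equally error-prone.
per_step = implied_step_error_rate(0.33, 14)
print("implied per-step error rate: %.1f%%" % (per_step * 100))

# The same per-step rate applied to tasks of different lengths.
for steps in (1, 5, 14, 50):
    print(steps, "steps -> %.0f%% chance of at least one error"
          % (task_error_probability(per_step, steps) * 100))
```

Under these assumptions, a 33% error rate over 14 steps corresponds to a per-step error rate of a little under 3%. In other words, an administrator who gets each individual command right about 97% of the time would still botch one task in three.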

Of course, many of these errors can be caught and corrected via testing. It is common knowledge that every change has to be accompanied by some verification or “smoke” testing.

But some changes are impossible or very expensive to validate. Imagine having to change the JVM maximum heap size to prevent an application from running out of memory. Imagine also that this is a high-volume application that runs on four different servers. Imagine further that one server out of four was, by mistake, not updated. You are not going to find out about it until the application starts crashing on that server under load – and this will be the worst time to deal with the issue. Now, what if the parameter that had to be changed was some obscure garbage collection setting that was going to improve the application’s performance? Users will experience intermittent performance issues, but there will be nothing explicitly pointing to the offending server. Discovering the root cause of the problem could take quite a while. The bottom line is that some errors can only be discovered by users, at which point the cost of fixing them is going to be substantial.
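A missed server of this kind is something a script can catch cheaply. The sketch below is only an illustration: the server names and values are made up, the `find_drift` helper is hypothetical, and it assumes the setting has already been collected from each server by some other means.

```python
from collections import Counter


def find_drift(settings_by_server):
    """Return servers whose value differs from the most common value."""
    most_common_value, _ = Counter(settings_by_server.values()).most_common(1)[0]
    return {
        server: value
        for server, value in settings_by_server.items()
        if value != most_common_value
    }


# Hypothetical data: max heap size as collected from each of four servers.
max_heap = {
    "app01": "-Xmx2048m",
    "app02": "-Xmx2048m",
    "app03": "-Xmx1024m",  # the server that was missed
    "app04": "-Xmx2048m",
}

for server, value in sorted(find_drift(max_heap).items()):
    print("%s is out of line: %s" % (server, value))
```

Note that this treats the majority value as the correct one, so it only helps when most servers were updated properly. Comparing against an explicitly declared expected value would be more robust.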

I think that we tend to overestimate the reliability of human actions and underestimate the cost of fixing errors. After all, how hard could it be for an experienced administrator to run a few commands? It could indeed be very easy, but that does not make the actions of this administrator any more reliable.

The bottom line is that almost any investment in IT automation is well worth it. Unfortunately, this view is not uniformly accepted. Many organizations still live in the stone age when it comes to automation, but that’s a subject for another post.

DataPower News

I've long been a "fan of XML Appliances":/is-xml-appliance-an-ultimate-esb. It looks like IBM customers like the appliance idea as well. IBM said that DataPower was one of its top-selling products and also "announced":http://www.infoq.com/articles/cuomo-websphere-trends-2009 that "DataPower-lution":http://www.ibm.com/developerworks/blogs/page/gcuomo?entry=datapower_lution is one of its strategic directions for 2009. Basically, more and more edge functions will be moving into the appliance. And why not use the same device for XML acceleration, load balancing, crypto acceleration, caching, perhaps even WebSEAL replacement (it's just a fancy reverse proxy after all)? We'll see how this vision plays out.

In related news, there is finally a "DataPower book":http://www.ibm.com/developerworks/blogs/page/woolf?entry=ibm_websphere_datapower_soa_appliance1 and it's 960 pages long. And this is before IBM started adding all these edge functions to the device :).

Exception Handling in WSAdmin Scripts

Using AdminTask in wsadmin often results in ugly stack traces. In fact, AdminTask always produces a full Java stack trace even when the error is fairly innocuous (e.g., a resource name was mistyped). The stack trace in this situation is pretty useless; it could actually confuse operations staff, as it might be interpreted as a problem in IBM code.

It is, in fact, quite easy to deal with this situation in Jython and suppress the annoying stack trace:


import sys
from com.ibm.ws.scripting import ScriptingException
...
    try:
        AdminTask....
    except ScriptingException:
        # note that we can't use "as" because of python 2.1 version, have to use sys
        print "Error:\n"+str(sys.exc_info()[1])
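If many AdminTask calls need the same treatment, the pattern can be wrapped in a small helper. This is only a sketch: `run_quietly` is a name I made up, and because ScriptingException exists only inside wsadmin, a stand-in exception class is used here so that the example runs on its own. Inside wsadmin you would pass the real ScriptingException instead.

```python
import sys


class FakeScriptingException(Exception):
    """Stand-in for com.ibm.ws.scripting.ScriptingException (wsadmin only)."""


def run_quietly(exc_type, fn, *args, **kwargs):
    """Call fn; on exc_type print a one-line error and return None."""
    try:
        return fn(*args, **kwargs)
    except exc_type:
        # sys.exc_info() instead of "as", matching the Python 2.1 constraint
        print("Error: " + str(sys.exc_info()[1]))
        return None


def failing_task(name):
    # Hypothetical stand-in for an AdminTask call with a mistyped name
    raise FakeScriptingException("resource not found: " + name)


result = run_quietly(FakeScriptingException, failing_task, "jdbc/mistyped")
```

Returning None on failure is a design choice made for brevity. A script that must stop on the first error would be better off re-raising or exiting after printing the message.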

Stack Traces and Consulting Rates

Long and unwieldy stack traces are a common occurrence when dealing with Java EE application servers. Here is an example. Many (if not all) of these products re-throw the same exception multiple times, which complicates things even further. Figuring out the root cause of an exception is a major undertaking.

Of course, most of the trace is useless when using proprietary products since it points to classes that you don’t have the source code for. And not only you – level 1 support most likely can’t get to the source code either. As a result, 90% of the trace has little to no immediate value.

As a rule, the more complex the product, the longer the stack trace. Makes sense, right? You’ve got more layers and components, each layer thinks that it is its duty to dump the whole thing to the log and re-throw.

Maybe we should start using stack traces as a code complexity metric. It would be much more telling than cyclomatic complexity.

I also think that there is a correlation between the average length of a stack trace and the average consulting rate that users of the product pay for development and support. So maybe, at the end of the day, developers and administrators should not grumble about it too much and I should just shut up.

WAS 7.0 is Still On Jython 2.1

I’ve started playing with WebSphere Application Server 7.0 (the latest and greatest) and to my surprise discovered that it still uses Jython/Python 2.1 as the scripting language for its wsadmin tool.

Jython 2.2 has been around for quite a while and, from my experience, is very stable. So it’s odd that IBM chose not to upgrade it. The difference between 2.1 and 2.2 is quite significant; the biggest selling point of 2.2 is new-style classes with a unified type model. Python 2.2 also supports properties. I don’t believe closures are supported in 2.1 either.
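As a reminder of what new-style classes and properties buy you, here is a small sketch. The `HeapSetting` class is invented for illustration; the point is that attribute assignment can be validated without changing how callers use the attribute. It is written with `property()` rather than decorator syntax, since decorators arrived in later Python versions.

```python
class HeapSetting(object):  # inheriting from object makes this a new-style class
    """Max heap size in megabytes, with validation on assignment."""

    def __init__(self, megabytes):
        self.megabytes = megabytes  # goes through the property setter

    def _get_megabytes(self):
        return self._megabytes

    def _set_megabytes(self, value):
        if value <= 0:
            raise ValueError("heap size must be positive")
        self._megabytes = value

    # property() is the Python 2.2 spelling; no decorator syntax needed
    megabytes = property(_get_megabytes, _set_megabytes)

    def as_jvm_arg(self):
        return "-Xmx%dm" % self.megabytes


heap = HeapSetting(1024)
heap.megabytes = 2048
print(heap.as_jvm_arg())
```

With a classic (pre-2.2) class, the same validation would need explicit getter and setter calls everywhere the value is touched.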

Why is it important? Well, using Java for WAS administration is hard; the API is obscure and poorly documented. This makes Jython the only game in town (with JACL deprecated some time back). So being able to use a modern version of Jython is highly desirable.

I’m still hoping that IBM might upgrade Jython as part of one of the minor upgrades; WAS 8.0 is probably a long time away.

Using Thin WSAdmin Client with WPS/WESB

A quick tip for those wishing to use the thin wsadmin client with Process Server/WESB: add com.ibm.soacore.runtime_6.1.0.jar to the classpath.

Normally, you need the following jars for the client to work: com.ibm.ws.admin.client_6.1.0.jar, com.ibm.ws.security.crypto_6.1.0.jar and com.ibm.ws.webservices.thinclient_6.1.0.jar.

However, if you want to use any of the SCA commands available in WESB, the extra jar is needed. Otherwise, you’ll be getting “class not found” for SCACommandException. The jar is available under wesb_root/plugins.

I don’t think it’s documented in infocenter.

Building Windows NT

I’ve been reading a relatively old but nevertheless fascinating book called Showstopper about development of Windows NT. I was struck by the author’s account of NT’s build process, specifically its low degree of automation.

NT was obviously a high-intensity, almost a death march kind of project and so the builds had to be churned out at a quick pace:


...the number of builds grew from a couple to a half dozen some week…

This may not sound like much, but since NT was getting quite big and complex, it kept the guys in the build lab busy. The builds were so critical that at some point the technical lead of the project, Dave Cutler, had to take over the build lab. This, however, did not improve the way builds were done. One of the members of the build team remembers:


He is not giving us the list, he’s basically saying, ‘Go to this directory and sync this file.’ He’s saying, ‘Pick up this file, do this, do that’.

The release process was pretty haphazard too according to another team member:

We have all these cowboy developers, just slinging code like crazy, calling out: “We need another build!”

And, of course, continuous integration was not invented yet:

We’d think we were all done for the day, then test the build and it wouldn’t boot. We’d run around looking for the programmer whose code broke it.

I don’t think this situation was unique to Microsoft back then. But I also think that the attitude toward CM and development process automation has changed over the last 16 years. Today, automated builds are pretty much the norm for all but the smallest projects. Continuous integration and automated testing are becoming widespread. There is a dizzying array of build systems, build servers, version control systems and other CM and development tools.

There is a long way to go however. Implementing solid build/deployment and release management automation is still hard. Most large projects end up having to dedicate multiple highly skilled people to solving this problem. Home-grown script-based automation is still pretty much the state of the art. This is going to change. The tools will become more intelligent and advanced. I hope it won’t take another 16 years.

XML Alternatives and YAML

The need for a more human-friendly alternative to XML is apparent to many people, myself included. This is the reason why quite a few different lightweight markup languages have been created over the last several years. I guess they are called “lightweight” because they don’t use XML-like tags that tend to clutter documents. I’ve looked at several of them and found YAML to be the most mature of the bunch as well as quite human-readable (as opposed to, say, JSON) and easy to understand. You can find some very good side-by-side XML vs. YAML comparisons here or here; the difference in readability is stunning.

From what I understand, YAML is popular in the Ruby world and is used in various PHP projects. However, it is almost unknown in Java/J2EE circles, which is a shame. While annotations somewhat limited the spread of “XML hell” in Java applications, XML still remains the de facto format for configuration files. I would venture to say that, except for a few outliers, YAML would be a better option. Why is that the case? One reason is that the YAML format simplifies application support. Developers often say that they don’t care about the readability of XML since they use IDEs or editors that hide its complexity. Indeed, being able to work with XML in a nice tree view-based editor is appealing. But this does not work when, in response to some production problem, application configuration needs to be quickly analyzed and potentially updated on some remote machine that most likely only has vi or Notepad. That is usually the case in production environments, which I find very ironic: shouldn’t the production machine have the most advanced editors and analysis tools to make troubleshooting as efficient as possible? For configuration files, readability and ease of understanding are key.

Of course, there is also the old trusty property/name-value format. It is, however, very limited, since it does not support any kind of nesting or scoping. So all properties become essentially global, and haven’t we learned already that global variables are not a good thing?

YAML, on the other hand, allows for expressing arbitrarily complex models. Anything that can be expressed in XML can also be expressed using YAML.
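To make the nesting point concrete, here is a small sketch. With a flat name-value file, hierarchy has to be faked with dotted key names and then reconstructed by the application, whereas a format with real nesting hands you the tree directly. The key names and values below are invented, and `nest` is a hypothetical helper, not part of any library.

```python
def nest(flat_properties):
    """Turn {'a.b.c': v} style keys into nested dictionaries."""
    root = {}
    for dotted_key, value in flat_properties.items():
        node = root
        parts = dotted_key.split(".")
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return root


# Hypothetical settings, written the way a properties file forces you to
flat = {
    "datasource.orders.url": "jdbc:example://dbhost/orders",
    "datasource.orders.pool.max": "20",
    "datasource.billing.url": "jdbc:example://dbhost/billing",
    "datasource.billing.pool.max": "5",
}

nested = nest(flat)
print(nested["datasource"]["orders"]["pool"]["max"])
```

Every application that wants structure out of a properties file ends up writing some variation of this, each with its own naming convention. That is exactly the kind of work a nested format makes unnecessary.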

On the downside, YAML does not have a very broad ecosystem. There are not that many editors that support YAML. There is a YAML Eclipse plugin, but it only gives color highlighting, no validation (here is another plugin which I have not tried yet). There is no metadata support, at least for Java, although there is a schema validator for Ruby (its Java port seems to be dead). There is also no XSLT equivalent.

There are two YAML parsers for Java – jvyaml and JYaml. They kinda work, but there is certainly room for improvement in terms of error messages and just the ability to detect and reject an incorrect document. Since YAML is supposed to be a language with minimal learning curve, the parsing has to be intuitive and bulletproof.

I still think that despite the shortcomings YAML is the way to go. Perhaps I will give a closer look to one of these parsers and see if I can tweak it a bit.

Why are Environments So Poorly Supported?

The concept of an “environment” permeates the software development lifecycle. No application is released into production directly from developers’ PCs. There has to be a place where an application can go through various stages of testing. We use different environments for that purpose, e.g., a “QA environment” or an “acceptance environment”.

An “environment” is just a collection of resources which could include middleware and OS/filesystem resources. In the simplest case, an environment for a J2EE web application consists of a single application server. Complex applications consisting of multiple components could utilize many different resources, including several different middleware products (e.g., app server, Web server, messaging infrastructure, ESB).

For any IT organization it is important to know how its resources are used. ITIL has a concept of a CMDB that’s supposed to contain all IT resources. However, the granularity of CMDB implementations is usually too coarse (typically, server level), which makes it difficult to use for software development. Also, the CMDB is not really integrated with development processes and tools; it’s kind of a thing on its own.

Ideally, the environment concept should be supported by development, version control, change management and build/deploy tools. Environment metadata can be used to automatically install an application in a given environment. Testing tools can use this metadata to generate “smoke” tests or to adjust existing test cases (e.g., by using different URLs/endpoints). There should be a capability to produce various reports showing what version of what application is installed in what environment.
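To illustrate what treating an environment as a first-class entity might look like, here is a minimal sketch of environment metadata and the "what version is installed where" report. Everything in it (the `Environment` class, its fields, the host and application names) is hypothetical and not taken from any existing tool.

```python
from dataclasses import dataclass, field


@dataclass
class Environment:
    """A named collection of resources plus what is deployed there."""
    name: str
    resources: dict = field(default_factory=dict)   # resource role -> host/endpoint
    deployed: dict = field(default_factory=dict)    # application -> version


def deployment_report(environments):
    """Rows of (application, version, environment), sorted for printing."""
    rows = []
    for env in environments:
        for app, version in env.deployed.items():
            rows.append((app, version, env.name))
    return sorted(rows)


# Hypothetical environments and applications
qa = Environment("QA", {"appserver": "qa-was01"}, {"orders": "1.4.0"})
uat = Environment("Acceptance", {"appserver": "uat-was01"}, {"orders": "1.3.2"})

for app, version, env_name in deployment_report([qa, uat]):
    print("%-10s %-8s %s" % (app, version, env_name))
```

The same `resources` mapping could feed a deploy script or a smoke test with the right endpoints for each environment, which is the kind of reuse that per-application scripting makes hard.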

Sadly, all these wonderful features are mostly missing from modern development and CM tools. Developers rely on scripting and informal use of environment variables. Essentially, today, each application has its own “selfish” view of what an environment is. This makes providing consistent operations and support difficult. This is especially true in virtualized environments where each logical environment may consist of many different VMs.

Build servers and build tools are a case in point. I looked at several build servers and found explicit environment support only in AntHill. All the others I looked at (several, I won’t name them) omit the environment concept completely (except for some lame support of environment variables). To me, this is really odd. While build servers have their roots in continuous integration, their key selling point in an enterprise is actually release management (at least for commercial products; there are many great open source build servers to choose from if continuous integration is the only goal). So how can a release process be automated if the foundation of this process is missing entirely from the tool that’s supposed to help with the automation?

Build tools are the same way. There is some deployment support in both Maven (Cargo plugin) and Ant, but no way of supporting an environment as an entity.

Will Recession Provide a Boost to SOA?

There is no doubt that IT budgets will be cut next year. The size of the cut is hard to predict, but I think that it could exceed Gartner’s forecast.

This, however, could finally give a shot in the arm to long-stagnant SOA projects. What makes me say that? Well, as I’ve been advocating all along, SOA represents an inexpensive way of adding flexibility and achieving new goals. SOA does not have to require significant new investments. Done smartly, it could be a great way of implementing new functionality without ripping the existing systems apart, which undoubtedly won’t fly in light of highly constrained budgets.

To prove this point, in spite of the lackluster economy, the demand for SOA architects has been booming.

There is one caveat to this view – organizations need to stop approaching SOA from the infrastructure standpoint (i.e., what ESB product do I need?) and start using it as a solution to real business needs. I’m not entirely sure that it’s going to happen, but a bad economy could be just the motivation for that.

Updated Jython Ant Task

I've updated Ant Jython task with a number of new features:

* Jython path is now handled by a separate JythonPath task.

* Jython interpreter is now scoped to Ant project. This means that you can have multiple Jython calls within the same project that share common imports and variables.

* Jython task now supports nested text, similar to the "script" task.

Ant Jython Tasks (PAnt Tasks)

PAnt build tool comes with several Ant tasks to facilitate the use of Jython/Python from Ant.

PAnt tasks have a number of advantages over built-in <script language="jython"> way of invoking Jython from Ant:

  • More graceful exception handling. Jython code invoked using "script" generates long error stack that contains full stack trace of the "script" task itself. Sifting through the traces and trying to distinguish Java trace from Python trace is quite painful. PAnt "jython" task produces brief readable python-only error stack.
  • You can use Ant properties as parameters ("jython" task makes them available in the local namespace of the calling script).
  • Convenience "import" attribute.
  • "jythonInit" task allows for setting python.path using Ant path structure.
  • Jython interpreter is initialized once per Ant project. All scripts invoked from the same Ant project reuse the same built-in namespace. So you can define variables and imports in one call and use them in a subsequent call.
  • Task name (the name that prefixes all console output from Ant for a given task) is generated automatically based on the supplied Python code.
  • "verbose.jython" property triggers verbose output for jython-related tasks only. This is much easier than trying to scan through hundreds of lines of general "ant -v" verbose log.

Example:

Ant code:


<jythonInit pythonPathRef="python.path" />
<property name="testProp" value="testVal" />

<jython>
print "Property from ant:", testProp
# define a var that we can use in other scripts
s="test"
</jython>

<jython>
print "Var created earlier: ",s
</jython>

<jython  import="from testmodule import *" exec="test(testProp)"  />

"testmodule" python code:


from pant.pant import project 
def test (prop):
    print "Passed parameter: ",prop
    print "Test property: ", project.properties["testProp"]

Please refer to this build.xml file for more examples.

The tasks can be used independently of PAnt python code.

PAnt Ant Tasks Reference

Getting Started

Download PAnt, extract pant.jar and create "taskdef" as described here

"jythonInit" Task

The task initializes the Jython interpreter. Because of the overhead, the interpreter is initialized only once even if jythonInit is invoked multiple times; repeated calls are simply ignored.
jythonInit automatically adds the pant.pant module to PYTHONPATH.

Attributes:

  • pythonPathRef - PYTHONPATH (python.path) to use, given as reference to a PATH defined elsewhere. Required if "pythonPath" nested element was not provided.
  • pythonHome - location of python distribution (optional). If provided, jythonInit will set python.home system property and will automatically add ${python.home}/Lib to the python path if ${python.home}/Lib exists.
  • cacheDir - location of jython cachedir used for caching packages (optional). Defaults to ${java.io.tmpdir}/jython_cache (note-- this is different from default jython behavior).

Nested elements:

pythonPath - python.path to use defined using Ant path-like structure. Required if "pythonPathRef" attribute was not provided.

Special properties:

log.python.path - if set to "true", jythonInit will print python path to Ant log. Default: false.

"jython" Task

Invokes python code.
Note: by default, jython does not print python stack trace in case of an exception. To see the trace, run Ant in verbose mode using "-v" or use "-Dverbose.jython=true" property.

Attributes:

  • exec - Python code snippet to execute. Typically, this is a function from a module available from python.path. This has to be a single line, e.g., mod.fun() although you could combine multiple statements separated by ";". Required if "execfile" was not provided.
  • import - a convenience attribute for providing "import" statement. Its only purpose is to make the task invocation more readable. Alternatively, you can have "import" as part of the "exec", e.g., exec="import mod;mod.fun()". Optional.
  • execfile - path to a python script file. Required if "exec" was not provided.

Nested elements:

Inline text with python code.

Special properties:

verbose.jython - if set to "true", jython will print additional information about executing python code to Ant log. Default: false.

pimport Task

Creates Ant targets from a python module. Functions that will be used as targets have to be marked using "@target" decorator as described here.
Python module name is used as Ant project name. Target overriding works the same way as with the Ant import task. In other words, targets defined using pimport will override targets previously defined using "import" or "pimport" tasks.

Attributes:
module - python module to create targets from. The module has to be available from python.path specified using jythonInit.
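For readers unfamiliar with the marker-decorator idea, here is a sketch of the general technique: a decorator tags functions, and a loader collects the tagged ones from a module. This is an illustration of the pattern only and is not PAnt's implementation; the `target` decorator below, the `is_target` attribute and `collect_targets` are all my own stand-ins.

```python
import types


def target(func):
    """Mark a function as a build target (illustrative, not PAnt's code)."""
    func.is_target = True
    return func


def collect_targets(module):
    """Return {name: function} for every marked function in a module."""
    return {
        name: obj
        for name, obj in vars(module).items()
        if callable(obj) and getattr(obj, "is_target", False)
    }


# Build a throwaway module in memory so the sketch is self-contained
build = types.ModuleType("build")


@target
def compile_sources():
    return "compiling"


def helper():
    return "not a target"


build.compile_sources = compile_sources
build.helper = helper

print(sorted(collect_targets(build)))
```

Only the decorated function is picked up; plain helpers in the same module stay invisible to the build.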

WebSphere 7 Supports Properties-Based Configuration

IBM WebSphere 7 (currently in beta) comes with a property-file-based configuration tool that provides a "human-consumable" interface to the currently XML-based configuration repository of the application server. This is further proof that XML is simply not the right mechanism for managing configuration of complex software products.

From the release notes:


Properties (name/value pairs) files are more consumable for human administrators than a mix of XML and other formats spread across multiple configuration directories.

Kudos to IBM for recognizing that.

It is still not clear though how hierarchical relationships between configuration objects will be supported.

Back in WAS 6 world, I've been using a simple jython script that converts python named parameters into wsadmin format. This is an example of a resource described in this format:


 WASConfig.DataSource(parent="testJDBCProvider", name="testDS", jndiName="jdbc/testDS",
                              description="Test DataSource", propertySet=dict(
                              resourceProperties=[
                                  dict(name="dbName", value="testDB", type="java.lang.String" ),
                                  dict(name="connectionAttribute",value="", type="java.lang.String")
                               ]))
    


I think that a slightly more streamlined python-based format will be superior to properties.
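The conversion itself is not complicated. The sketch below is not the script mentioned above, just my guess at the core idea: it assumes the target is the nested list of [name, value] pairs that wsadmin accepts for object attributes, and `to_attrs` is a hypothetical name. It also relies on keyword arguments keeping their order, which holds in current Python but not in Jython 2.1.

```python
def to_attrs(**params):
    """Convert keyword arguments into nested [name, value] lists."""
    return [[name, _convert(value)] for name, value in params.items()]


def _convert(value):
    # Dictionaries become nested attribute lists; lists are converted item by item
    if isinstance(value, dict):
        return to_attrs(**value)
    if isinstance(value, (list, tuple)):
        return [_convert(item) for item in value]
    return value


attrs = to_attrs(
    name="testDS",
    jndiName="jdbc/testDS",
    propertySet=dict(
        resourceProperties=[
            dict(name="dbName", value="testDB", type="java.lang.String"),
        ]
    ),
)
print(attrs)
```

The printed structure is noticeably harder to read than the keyword form it was generated from, which is the whole argument for keeping the named-parameter notation as the thing humans edit.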

Jython in WebSphere Portal

Most developers and administrators working with WebSphere Application Server (WAS) know that both JACL and Jython languages can be used for various WAS administration and configuration tasks. However, JACL has always been a preferred choice, simply because this is the default language used by the product's admin tool (wsadmin) and also because JACL examples and documentation are more complete.

Using JACL might have been a valid option just a few years back (when WAS just came out) given the uncertainty surrounding the Jython project. Today, however, Jython is clearly alive and well; an alpha version supporting Python 2.5 was announced recently. Therefore there is really no point in using JACL any longer, except maybe for shops with a large collection of existing JACL scripts. JACL syntax is quite arcane compared with Python and the language is clearly not as widely used.

IBM confirmed this view by releasing a JACL-to-Jython converter a couple of years back.

Unfortunately, up until recently, Jython was not officially supported in another IBM product, WebSphere Portal, which comes with the wpscript tool for managing pages, deployable modules and other portal artifacts.

But since portal scripting relies on wsadmin's shell, Jython is in fact fully supported by the product; it's just not documented.
All that you need to do to switch to jython is to invoke wsadmin with "-lang jython" and "-wsadmin_classpath " followed by the list of portal jars (you can copy the classpath from SCRPATH variable definition in wpscript.sh).

As an example, I put together a simple Jython script for cleaning up a portal page hierarchy. Removing pages before applying an XMLAccess script with page definitions lets you start portal configuration from a clean "known" state. Very often, especially in a development environment, an application's page hierarchy gets polluted with various "test" pages created by developers. The script gets rid of them.

In WebSphere Portal 6.1 Jython is finally made a first-class citizen. The product's documentation proclaims that JACL support will be phased out and that jython is the way to go. Surprisingly, though, all examples still use good old JACL. I assume it's just a matter of time before they are converted.

Yet Another Build Server

Thoughtworks has finally released a successor to their venerable Cruise Control - Cruise build server. The UI certainly looks nice and it seems quite flexible. There is even a free version (which is limited to two computers), which is great.

What is not clear though is how this product is different from AntHill, Buildforge, Bamboo, TeamCity, Gauntlet and the like. The field is certainly becoming crowded - and I haven't even mentioned numerous open-source contenders. All these products seem to be doing the same thing - organizing your build scripts, interfacing with version control, running builds on a distributed "build server farm", collecting statistics, publishing reports and providing UI for all these functions.

All these features are important and useful. Ironically, however, what build servers don't do is automatically build or deploy your software. You still have to write Ant or Maven scripts, define and manage configuration parameters (using properties, XML, environment variables), and deal with different environments (if I'm not mistaken, AntHill is the only product that has an explicit concept of an environment). For a complex project this could be a lot of work. Granted, every project is unique (if not, just use the default Maven configuration and you're done), so this could be a tough nut to crack. It should however be possible for a build server to have enough intelligence to infer how to build a project directly from the code base.
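The very first step of such inference is easy enough to sketch: look at what is in the project root and guess the build tool from well-known file names. The mapping below is deliberately tiny and the `infer_build_tools` function is hypothetical; a real implementation would have to go much further (source layout, dependencies, packaging type).

```python
import os
import tempfile

# Marker file -> build tool; an illustrative, far from complete mapping
MARKERS = [
    ("pom.xml", "maven"),
    ("build.xml", "ant"),
]


def infer_build_tools(project_dir):
    """Guess which build tools a project uses from files in its root."""
    present = set(os.listdir(project_dir))
    return [tool for marker, tool in MARKERS if marker in present]


# Demo against a throwaway directory containing only a pom.xml
demo_dir = tempfile.mkdtemp()
open(os.path.join(demo_dir, "pom.xml"), "w").close()
print(infer_build_tools(demo_dir))
```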