
DPBuddy — Tool For DataPower Administrators and Developers

We’re pleased to announce the release of our new product, DataPower Buddy (dpbuddy). dpbuddy is a free command-line tool for automating administration, management and deployment of IBM WebSphere DataPower appliances. The tool supports export/import, file transfer, backups and many other functions.

dpbuddy is implemented as a set of custom tasks for the popular build tool, Apache Ant.

Here is a quick example of dpbuddy in action:



    


This Ant task will remove remote directories if they exist, reproduce the local directory tree (all folders under “services”) on the device and upload the necessary files based on the “includes” pattern.

dpbuddy is completely free; it can be downloaded from the dpbuddy product page.

dpbuddy provides many cool features, including:

* Response from the device is presented in a human-readable form as opposed to raw SOAP/XML messages. dpbuddy makes it easy to understand error and status messages.

* Powerful remote “copy” command that automatically reproduces the local directory tree on the device.

* Tight integration with Ant. Ant variables can be used inside deployment policies and configuration files.

* Easy-to-use alternative to deployment policies based on XPath.

* Ability to remotely “tail” device logs. It is even possible to automatically get new log messages, similar to the Unix “tail -f” command. The “tail” task can also check for error patterns.

* “Export” based on naming patterns. You don’t need to know types (“classes”) of DataPower objects; simply specify a regexp pattern and dpbuddy will export all objects matching this pattern.

* Support for self-signed certificates. No need to add DataPower certificates to the JDK store.

* Support for arbitrary SOMA requests. You can use Ant variables inside a request.

* Parsing of all commands on the client. In case of XML errors, DataPower returns a cryptic “internal error” message. The actual error then has to be extracted from the device logs. dpbuddy, on the other hand, validates management XML commands on the client and displays error messages right away.

Go to the dpbuddy product page to learn more.

Value of IT Automation

Intuitively we all understand that automating IT operations, including builds, deployments, configurations, upgrades and so on, is a good thing. We all know that humans make mistakes and that mistakes can be costly when they affect a large group of people (e.g., a large user community) or otherwise result in lost revenue to a business.

But how error-prone are human actions, really? Note that we’re not talking about the normal error rate in software development or other creative fields. Clearly, it is very difficult, if not impossible, to eliminate errors when the task can’t be formalized. On the other hand, a large area of IT is related to operations and maintenance, and it involves mostly predictable and repetitive tasks. There are many tools, from simple scripts to super-expensive enterprise products, that deal with automating these types of tasks. Knowing the probability of human error would help us estimate potential benefits from these tools and, consequently, assess the return on investment.

The classic formula for calculating human reliability can be found “here”:http://www.ida.liu.se/~eriho/HRA_M.htm. Without going too much into the math, empirically we can ascertain the following:

* Every action performed by a human has a probability of error. It is never zero.
* Most tasks (at least, in IT) consist of multiple steps (actions). E.g., a change may have to be made on multiple servers.
* The likelihood of at least one error goes up with the number of steps, roughly in proportion while the per-step error rate is small.

So it should not come as a surprise that the probability of error could be quite high for a complex task. According to the data published on “Ray Panko’s website”:http://panko.shidler.hawaii.edu/HumanErr/Multiple.htm, 28 experienced users had an average error rate of 33% in a task involving 14 command-line-based steps. Another interesting tidbit from the same site is a 50% error rate in following a checklist. It is unfortunate that the details of these studies are not documented on the site.
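To put rough numbers on this, here is a back-of-the-envelope sketch. It assumes that every step fails independently and with the same probability, which is my simplification and not something the cited studies state; the function names are made up for illustration.

```python
def task_error_probability(step_error_rate: float, steps: int) -> float:
    """Probability that at least one of `steps` independent steps goes wrong."""
    return 1.0 - (1.0 - step_error_rate) ** steps


def implied_step_error_rate(task_error_rate: float, steps: int) -> float:
    """Per-step error rate that would produce the observed task-level rate."""
    return 1.0 - (1.0 - task_error_rate) ** (1.0 / steps)


# The 14-step, 33% figure quoted above, under the independence assumption.
per_step = implied_step_error_rate(0.33, 14)
print(f"implied per-step error rate: {per_step:.3f}")

# How the same per-step rate compounds as the task gets longer.
for n in (1, 5, 14, 30):
    print(n, round(task_error_probability(per_step, n), 2))
```

Under these assumptions, a per-step error rate of a little under 3% is already enough to produce a one-in-three chance of botching a 14-step task.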

Of course, many of these errors can be caught and corrected via testing. It is common knowledge that every change has to be accompanied by some verification or “smoke” testing.

But some changes are impossible or very expensive to validate. Imagine having to change JVM maximum heap size to prevent an application running out of memory. Imagine also that this is a high volume application that runs on four different servers. Imagine further that one server out of four was not updated by mistake. You are not going to find out about it until the application starts crashing on that server under load – and this will be the worst time for dealing with this issue. Now, what if a parameter that had to be changed was some obscure garbage collection setting that was going to improve application’s performance. Users will be experiencing intermittent performance issues but there will be nothing explicitly pointing to the offending server. Discovering the root cause of the problem could take quite a while. The bottom line is that some errors can only be discovered by users at which point the cost of fixing them is going to be substantial.
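This kind of mistake is cheap to catch mechanically. The sketch below compares a setting collected from each server against the intended value; the server names, the data and the function are all hypothetical, and how the values get collected from real servers is left out on purpose.

```python
def find_mismatched_servers(expected: str, actual_by_server: dict[str, str]) -> list[str]:
    """Return the servers whose setting differs from the intended value."""
    return sorted(
        server
        for server, actual in actual_by_server.items()
        if actual != expected
    )


# Hypothetical data: the max heap flag as collected from each of four servers.
collected = {
    "app01": "-Xmx2048m",
    "app02": "-Xmx2048m",
    "app03": "-Xmx1024m",  # the server that was missed during the change
    "app04": "-Xmx2048m",
}

for server in find_mismatched_servers("-Xmx2048m", collected):
    print(f"{server}: heap setting was not updated")
```

The check compares against the intended value and not against whatever most servers happen to have, so it still works when the change was missed on the majority of machines.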

I think that we tend to overestimate the reliability of human actions and underestimate the cost of fixing errors. After all, how hard could it be for an experienced administrator to run a few commands? It may well be very easy, but that still does not make the actions of this administrator any more reliable.

The bottom line is that almost any investment in IT automation is well worth it. Unfortunately, this view is not uniformly accepted. Many organizations still live in the Stone Age when it comes to automation, but that’s a subject for another post.

Stack Traces and Consulting Rates

Long and unwieldy stack traces are a common occurrence when dealing with Java EE application servers. Here is “an example”:/files/wps_error_example.txt. Many (if not all) of these products re-throw the same exception multiple times, which complicates things even further. Figuring out the root cause of an exception is a major undertaking.
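A first pass over such a trace can at least be scripted. The sketch below relies only on the standard layout of a printed Java stack trace (frames starting with “at”, nested exceptions introduced by “Caused by:”); the sample trace and its class names are invented, and the function is an illustration, not a tool I am announcing.

```python
import re

CAUSED_BY = re.compile(r"^Caused by: (.+)$", re.MULTILINE)
FRAME = re.compile(r"^[ \t]+at \S+", re.MULTILINE)


def summarize_trace(trace: str) -> dict:
    """Count frames and pick the innermost exception of a Java stack trace."""
    causes = CAUSED_BY.findall(trace)
    lines = trace.strip().splitlines()
    first_line = lines[0] if lines else ""
    return {
        "frames": len(FRAME.findall(trace)),
        "caused_by_count": len(causes),
        # The last "Caused by:" is the innermost exception that was wrapped.
        "root_cause": causes[-1] if causes else first_line,
    }


# Invented sample in the usual printStackTrace layout.
SAMPLE = """\
javax.servlet.ServletException: Request failed
    at com.example.web.Dispatcher.handle(Dispatcher.java:88)
    at com.example.web.Filter.doFilter(Filter.java:41)
Caused by: java.lang.RuntimeException: Wrapped
    at com.example.service.Orders.load(Orders.java:17)
    ... 2 more
Caused by: java.lang.NullPointerException
    at com.example.dao.OrderDao.find(OrderDao.java:52)
    ... 3 more
"""

print(summarize_trace(SAMPLE))
```

The innermost “Caused by” is only as good as what the product chose to log, so treat the result as a starting point for the search and not as the answer.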

Of course, most of the trace is useless when using proprietary products, since it points to classes that you don’t have the source code for. And not only you – level 1 support most likely can’t get to the source code either. As a result, 90% of the trace has little to no immediate value.

As a rule, the more complex the product, the longer the stack trace. Makes sense, right? You’ve got more layers and components, each layer thinks that it is its duty to dump the whole thing to the log and re-throw.

Maybe we should start using stack traces as a code complexity metric. It would be much more telling than cyclomatic complexity.

I also think that there is a correlation between the average length of a stack trace and the average consulting rate that users of the product pay for development and support. So maybe, at the end of the day, developers and administrators should not grumble about it too much and I should just shut up.

XML Alternatives and YAML

The need for a more human-friendly alternative to XML is apparent to many people, myself included. This is the reason why quite a few different “light-weight markup languages”:http://en.wikipedia.org/wiki/List_of_lightweight_markup_languages have been created over the last several years. I guess they are called “lightweight” because they don’t use XML-like tags that tend to clutter documents. I’ve looked at several of them and found “YAML”:http://yaml.org to be the most mature of the bunch as well as quite human-readable (as opposed to, say, JSON) and easy to understand. You can find some very good side-by-side XML vs. YAML comparisons “here”:http://yaml.kwiki.org/index.cgi?YamlExamples or “here”:http://www.ibm.com/developerworks/xml/library/x-matters23.html. The difference in readability is stunning.

From what I understand, YAML is popular in the Ruby world and it is used for various “PHP projects”:http://www.symfony-project.org/book/1_0/08-Inside-the-Model-Layer. However, it is almost unknown in Java/J2EE circles, which is a shame. While annotations somewhat limited the spread of “XML hell” in Java applications, XML still remains the de facto format for configuration files. I would venture to say that, except for a few outliers, YAML would be a better option.

Why is that the case? One reason is that the YAML format simplifies application support. Developers often say that they don’t care about the readability of XML since they use IDEs or editors that hide its complexity. Indeed, being able to work with XML in a nice tree view-based editor is appealing. But this does not work when, in response to some production problem, application configuration needs to be quickly analyzed and potentially updated on a remote machine that most likely only has vi or Notepad. That is usually the case in production environments, which I find very ironic – shouldn’t the production machine have the most advanced editors and analysis tools to make troubleshooting as efficient as possible? For configuration files, readability and ease of understanding are key.

Of course, there is also the old trusty property/name-value format. It is, however, very limited, since it does not support any kind of nesting or scoping. So all properties become essentially global, and haven’t we learned already that global variables are not a good thing?

YAML, on the other hand, allows for expressing arbitrarily complex models. Anything that can be expressed in XML can also be expressed using YAML.
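To make the nesting point concrete, here is a small sketch of what a flat property file forces on you: the grouping lives only in a naming convention, and code has to rebuild it. The settings and the helper are hypothetical, and the helper assumes that no key is both a value and a prefix of another key.

```python
def nest_properties(flat: dict[str, str]) -> dict:
    """Rebuild the nested structure implied by dotted property names."""
    nested: dict = {}
    for dotted_key, value in flat.items():
        *parents, leaf = dotted_key.split(".")
        node = nested
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


# Hypothetical settings, written the way a property file forces you to.
flat = {
    "db.primary.host": "db1.example.com",
    "db.primary.port": "5432",
    "db.replica.host": "db2.example.com",
    "db.replica.port": "5432",
}

print(nest_properties(flat))

# The same settings in YAML carry the grouping in the format itself:
#
# db:
#   primary:
#     host: db1.example.com
#     port: 5432
#   replica:
#     host: db2.example.com
#     port: 5432
```

With the property format, nothing stops two unrelated components from both claiming a key like “port”; the dotted prefix is a convention, not a scope.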

On the downside, YAML does not have a very broad ecosystem. There are not that many “editors that support YAML”:http://www.digitalhobbit.com/archives/2005/09/15/yaml-editor-support/. There is a “YAML Eclipse plugin”:http://code.google.com/p/yamleditor/, but it only gives color highlighting, no validation (here is “another plugin”:http://noy.cc/symfoclipse/download.html which I have not tried yet). There is no metadata support, at least for Java, although there is a “schema validator”:http://www.kuwata-lab.com/kwalify/ for Ruby (its Java port seems to be dead). There is also no XSLT equivalent.

There are two YAML parsers for Java – “jvyaml”:https://jvyaml.dev.java.net/ and “JYaml”:http://jyaml.sourceforge.net/index.html. They kinda work, but there is certainly room for improvement in terms of error messages and just the ability to detect and reject an incorrect document. Since YAML is supposed to be a language with a minimal learning curve, the parsing has to be intuitive and bulletproof.

I still think that, despite the shortcomings, YAML is the way to go. Perhaps I will take a closer look at one of these parsers and see if I can tweak it a bit.

Secret Weapon of LAMP Applications

I’m surprised by the low traction of LAMP applications in the enterprise (I use the LAMP acronym loosely as a catch-all for PHP, Ruby and Python apps). For most large organizations Java EE still reigns supreme. While developers and analysts debate performance and “enterprisey” features (or the lack thereof) of the LAMP stack, there is one aspect that is often overlooked – LAMP infrastructure is much simpler than a typical Java EE application server, hence its operations and maintenance are greatly simplified. And of course operations and maintenance are a big chunk of the IT budget of any organization; in many shops they are actually the biggest part of the budget (60% is the average according to Forrester).

It usually goes like this. Data center operations are outsourced, and so data center personnel have to provide the first line of defense for all production problems. Data center folks are not developers, so a NullPointerException does not mean much to them. But they have to be able to figure out who to call when they see the NullPointerException.

Here is an example of an error message from an application running under WebSphere Portal. This message is 318 lines long and is completely inscrutable to all but a hardcore WebSphere Portal developer or administrator. The most ironic part is that in spite of multiple “caused by” entries in the text, the message tells us nothing about the actual root cause of the problem, which, most likely, is a classloader conflict. As a sidebar, why can’t an app server at least give me a warning during deployment about a potential classloader problem (especially since all the app servers, even Tomcat, add dozens of jars to the classpath)?

On the other hand, here is an example of an error message I get from a Python app implemented using Django:


    Validating models...
    djtest.order: name 'test' is not defined
    1 error found.   

So which error message do you think will be more palatable to an operations person looking at the logs?

I know I’ve exaggerated a bit – my Django app is extremely simplistic at this point. However, it is true that many Java EE app servers and applications do an extremely poor job at exception logging.

An even more obvious benefit of LAMP is the availability of the source code. In the Java EE world, the common practice is not to include the source code in WAR and JAR files. In many instances, the code is compiled without debug information. Even if the source code and line numbers are available, finding the right file takes some digging since we have to deal with a multitude of JAR, EAR and WAR files. Not to mention that the same class can reside in multiple jars.

So if I were the person who had to respond to the “site is down” calls late at night, I’d vote for PHP.