Friday, December 25, 2015

The Making of a Chaotic System

I have been able to observe the aftermath of an improperly designed and disjointed system. Over the past 12 months, I've corrected some fundamental parts of the system and instilled a consistent, manageable design.

While untangling some of the messy parts, I questioned how the system ended up so inconsistent. The answer traced back to different consultants being brought in over time to perform individual tasks, with no internal guidance for the overall system. Each consultant worked in isolation from any overall design direction, completed their specific task, moved on, and most likely never touched the system again.

After learning how changes were being introduced to the system, it became clear why things were so inconsistent and why the system was extremely complex to maintain. The system was in chaos and heading further into chaos. Today, we are well on our way to a manageable and maintainable platform, and we have tools in place to audit for consistency and accuracy. Records that were inconsistently maintained have been replaced with automated reports that query the information directly from the elements of the platform. The elimination of many manual tasks has helped to gain "buy in" for the design direction and discipline in the current design processes.

Hopefully one can learn something from reading about how the system mentioned above ended up in such a chaotic and unmaintainable state. Allowing changes to be made by people without accountability or guidelines was the largest contributor. It has taken 12 months to get the system into a manageable state, and I estimate another 6 months to get some of the networking atrocities changed to make for a more scalable system. Other improvements that have been instilled are daily backups for all devices, consistent NTP server configuration, and consistent SNMP trap configuration and alarming.

I expect to be taking proactive action with the system by mid-2016, instead of the reactive mode we've been working in out of pure necessity.

Monday, December 14, 2015

Large System Management

I've recently been involved in taming a large-scale system that wasn't really planned, designed consistently, or even maintained properly. What I observed was a chaotic system whose current implementation practices were leading it toward more chaos.

It wasn't just the system that I had to observe and learn; it was also key to observe how personnel were interacting with it. My initial changes involved automating some data collection and reporting that was being done manually in spreadsheets. Needless to say, the data was not accurate, and maintaining the spreadsheets was overwhelmingly burdensome.

Another problem area was a set of Java clients that were ridiculously slow and presented incomplete information. I reverse engineered the Java client and found it was simply fetching device data via SNMP. A simple Perl script replaced the Java GUI status screen and executed 40x faster: I could literally collect the data from 40 different devices in the time the Java GUI presented its incomplete data for a single device. Needless to say, I stored the retrieved data in a database for later use.
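The actual collectors were Perl, but the pattern is simple enough to sketch in a few lines of Python: poll each device, store what it reports in a database table, and let reports query that table instead of a spreadsheet. The device addresses, table layout, and the injected fetch callable (standing in for the real SNMP query) are all illustrative, not the actual system's:

```python
import sqlite3
import time

def poll_devices(devices, fetch, db_path=":memory:"):
    """Poll each device and store the result in a small status table.

    `fetch` stands in for the real SNMP query (the author's scripts used
    Perl against SNMP); injecting it keeps this sketch runnable without
    any SNMP tooling installed.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS status (host TEXT, ts REAL, value TEXT)"
    )
    for host in devices:
        value = fetch(host)  # real deployment: an snmpget-style call here
        conn.execute(
            "INSERT INTO status VALUES (?, ?, ?)", (host, time.time(), value)
        )
    conn.commit()
    return conn

# Usage with a fake fetcher standing in for the SNMP call:
conn = poll_devices(["10.0.0.1", "10.0.0.2"], lambda h: "up")
rows = conn.execute("SELECT host, value FROM status ORDER BY host").fetchall()
```

Swapping the SNMP call out for a plain callable keeps the sketch self-contained; in the real scripts, that one line is where each device is actually queried.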

Today, these Perl scripts collect the data from a cron job and we always have a fresh data set based on what the equipment tells us rather than a poorly maintained spreadsheet.
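The cron wiring behind this is minimal; a crontab entry along these lines (the interval, script path, and log path here are hypothetical, not the actual system's) is all it takes to keep the data set fresh:

```shell
# Poll all devices every 5 minutes, appending output to a log
*/5 * * * * /usr/local/bin/poll_devices.pl >> /var/log/device_poll.log 2>&1
```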

Some of the other changes involved standardizing the way certain items were named and packaged. Items were being named in a way that made it harder to identify them and their usage. I'm still renaming about a decade's worth of ill-named items, but the new convention is in place, and it is already turning unnecessary complexity into something trivial and maintainable.

As the Chinese proverb says: a journey of a thousand miles begins with a single step.

I'd add to this proverb... A step in the correct direction.