Saturday, August 5, 2017

Data Analysis by Unsophisticated People

I have been involved on a project recently where my role was more peripheral and I'd usually just spot check some of the changes for people who were deeply involved with the project. I'm referring to them as unsophisticated people since they failed to use a method that is repeatable and accurate.

I gave it little attention until I was sent a huge spreadsheet which contained provisioning codes for thousands of different product options and asked to verify that they were all correct. I realized at that point that the people deeply involved in the project have been doing things very manually and there were errors which were being discovered.

There's no way in hell that I was going to take the manual approach which would be extremely mind numbing and still error prone. So I wrote a perl script that simply parsed every row and cell which I decode with a hash table into human readable text. This was step 1.

I extended my parser to become a validator. I looked at all of the products and recognized that certain products should have certain provisioning codes. So I created templates using an array of the provisioning values and sorted. So now my parser reads each row of the spreadsheet, pushes each provisioning value into an array, sorts the array when all cells have been obtained, sorts the array and then compares with all of the templates (@current_row_array ~~ @certain_product_template). If it matches a template, then I simply indicate which template it matched and it is very likely fine. The ones that do not match any template have some type of error and need to be investigated. I was amazed that I have never seen or used the Perl '~~' operator.

Taking 2 hours to write this simple perl script and create the templates will pay off largely by saving everybody time and reducing errors. There will certainly be changes and I can simply keep running my validator until everything matches the templates. So we now only need to concentrate on the templates being valid (about 40 of them) instead of thousands of rows of provisioning codes.