6 Steps to Data Heaven

Evan Shellshear

6 years ago

Our world has become obsessed with data. But with around 85% of data projects failing, why are we lining up to dance to the data piper’s tune, all the way to our own demise? Why are we all trying so hard to fail? We keep thinking that the aura of machine learning and artificial intelligence will save us, but the problem is not a high-tech one.

The reality is that it doesn’t have to be this way. There are many simple fixes that can dramatically improve your chances of success, and none of them requires an expensive set of consultants or a company-wide change management process.

But first, what’s the issue? The issue is that we are unable to collect accurate and correct data to feed our decisions, let alone our algorithms.

According to one widely cited estimate, bad data is costing the U.S. alone around $3 trillion per year. Ask yourself: what decisions have you made lately based on bad data? Incorrect data comes from a number of sources:

You haven’t bothered to define the metrics of interest.
In an online marketing world, you can’t be sure of where leads come from. Are your information sources really correct (UTM parameters)?
Have you correctly configured your website to track events accurately? Are you sure it is a page view you want to track, or a click?

From these incorrectly configured data sources we end up with four major data pollution issues: duplicate data, incorrect data, missing data and outliers. All four of these could be occurring in your data and they need to be dealt with to correctly make decisions. Luckily this is not difficult to do.
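As a rough illustration of how each of the four issues can be surfaced, here is a minimal Python sketch. The record layout, the field names (id, email, deal_value) and the business rule that a deal value cannot be negative are all assumptions made for the example, and the outlier check uses the median absolute deviation, which is only one of several reasonable choices.

```python
from collections import Counter
from statistics import median

# Hypothetical lead records; the field names are illustrative only.
records = [
    {"id": 1, "email": "a@example.com", "deal_value": 100.0},
    {"id": 2, "email": "b@example.com", "deal_value": 120.0},
    {"id": 2, "email": "b@example.com", "deal_value": 120.0},    # duplicate
    {"id": 3, "email": None, "deal_value": 110.0},               # missing
    {"id": 4, "email": "d@example.com", "deal_value": -50.0},    # incorrect
    {"id": 5, "email": "e@example.com", "deal_value": 90000.0},  # outlier
]


def find_duplicates(rows, key):
    """Return key values that appear on more than one record."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)


def find_missing(rows, fields):
    """Return records where any required field is None or an empty string."""
    return [row for row in rows if any(row.get(f) in (None, "") for f in fields)]


def find_incorrect(rows, field, is_valid):
    """Return records whose value fails a rule supplied by the caller."""
    return [row for row in rows
            if row.get(field) is not None and not is_valid(row[field])]


def find_outliers(values, threshold=5.0):
    """Flag values that lie more than `threshold` median absolute deviations
    away from the median. The threshold is an assumed default."""
    values = list(values)
    centre = median(values)
    mad = median(abs(v - centre) for v in values)
    if mad == 0:
        return []  # no spread to measure against
    return [v for v in values if abs(v - centre) / mad > threshold]


duplicate_ids = find_duplicates(records, "id")
missing_rows = find_missing(records, ["email"])
incorrect_rows = find_incorrect(records, "deal_value", lambda v: v >= 0)
outlier_values = find_outliers(r["deal_value"] for r in records)
```

Note that a single record can trip more than one check: a wildly wrong value is often both incorrect and an outlier, so the four lists should be read together rather than treated as mutually exclusive.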

There are a number of simple actions that can be taken to ensure that your data arrives as correct and clean as possible:

Giving definitions to the things you are looking to collect

To ensure that everyone is on the same page, we need good definitions. What is a lead? What counts as a sale? For example, a poor definition of a lead would be: “A lead is someone who has contacted us.” This is too vague. Contacted about what? To complain?

A better definition is something we can clearly measure: “A lead is a real person (not a bot) who contacts us and provides us with at least a) a correct phone number or b) a correct email address and this information has been successfully recorded. There has been no further contact made nor qualification.”

This is a good definition because it separates two things: whether the lead has been followed up or qualified, and whether the details were successfully submitted. It excludes someone who was merely a web visitor, or who tried to submit contact details through a web-to-lead form that failed.
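One way to keep a definition like this honest is to write it down as a check that everyone can run. The sketch below is an assumption-laden illustration: the Contact fields, the is_lead function and the two regular expressions are invented for the example, and a pattern match only shows that a phone number or email address is plausibly formatted, not that it is actually correct.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Loose plausibility patterns (assumptions, not full validators).
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
PHONE_PATTERN = re.compile(r"^\+?\d{7,15}$")


@dataclass
class Contact:
    is_bot: bool
    phone: Optional[str] = None
    email: Optional[str] = None
    recorded: bool = False     # the submission was successfully stored
    followed_up: bool = False  # further contact or qualification has happened


def is_lead(contact: Contact) -> bool:
    """Apply the written definition of a lead to one contact."""
    if contact.is_bot or not contact.recorded or contact.followed_up:
        return False
    has_phone = False
    if contact.phone:
        # Strip spaces, dashes and brackets before checking the digits.
        digits = re.sub(r"[\s\-()]", "", contact.phone)
        has_phone = PHONE_PATTERN.match(digits) is not None
    has_email = bool(contact.email) and EMAIL_PATTERN.match(contact.email) is not None
    return has_phone or has_email
```

For instance, is_lead(Contact(is_bot=False, email="jane@example.com", recorded=True)) is true, while the same contact with recorded=False is not a lead, which mirrors the failed web-to-lead form case.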

Labelling your data according to its use

This one is not difficult, and many people have written extensively about it. When labelling data sets, we need to distinguish where each data set is on its journey to usage and how much it has been processed. This can easily be done by adding a prefix to the name of a data set as follows:

Use the prefix UNPRO_ for data that has not been processed and directly pulled from a source. Transformations will be done on this data.
Use the prefix INTER_ for data in an intermediary state. This type of data is usually the output of a transformation and should not be used as production data.
Use the prefix DEPLOY_ for the final datasets that are to be used in deployment i.e. for models, visualisations, etc.
Use the prefix TEST_ for all the datasets that don’t fit the above e.g. test, development, etc.
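The prefix convention above is easy to enforce automatically. The following Python sketch shows one possible way; the Stage enum and the helper names stage_of and is_production_ready are hypothetical, and only the four prefixes come from the list above.

```python
from enum import Enum


class Stage(Enum):
    UNPROCESSED = "UNPRO_"
    INTERMEDIATE = "INTER_"
    DEPLOYMENT = "DEPLOY_"
    TEST = "TEST_"


def stage_of(dataset_name: str) -> Stage:
    """Return the stage encoded in a data set name, or raise if unlabelled."""
    for stage in Stage:
        if dataset_name.startswith(stage.value):
            return stage
    raise ValueError(f"{dataset_name!r} does not start with a known prefix")


def is_production_ready(dataset_name: str) -> bool:
    """Only DEPLOY_ data sets should feed models and visualisations."""
    return stage_of(dataset_name) is Stage.DEPLOYMENT
```

Raising an error for an unlabelled name, rather than guessing, is a deliberate choice here: a data set with no prefix is exactly the kind that ends up in a dashboard by accident.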

Verifying the data for correctness

It is essential to verify data to build trust among its users. One small data mistake can destroy everyone’s trust in your visualisations.

One way to do this successfully is to calculate your metrics and definitions in two different ways. For example if we are computing the number of online leads, then compute both:

The aggregate number in your CRM tool for a specific period of time
The aggregate number of submissions on your website over the same period of time (typically done with Heap Analytics or other such tools)

For other cases you may just need to look at manual records vs automated records to confirm the correctness of the data.
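The two-way calculation can be reduced to a small reconciliation check. This is a minimal sketch: the function name reconcile and the 2% tolerance are assumptions, and the two counts would in practice come from whatever export your CRM and web analytics tools provide.

```python
def reconcile(crm_count: int, web_count: int, tolerance: float = 0.02) -> dict:
    """Compare the same metric computed from two independent sources.

    `tolerance` is the largest relative gap we are willing to accept
    (an assumed default of 2%).
    """
    baseline = max(crm_count, web_count)
    difference = abs(crm_count - web_count)
    relative = difference / baseline if baseline else 0.0
    return {
        "difference": difference,
        "relative": relative,
        "ok": relative <= tolerance,
    }


# Example: 100 leads in the CRM versus 80 form submissions on the website.
result = reconcile(crm_count=100, web_count=80)
if not result["ok"]:
    print(f"Lead counts disagree by {result['relative']:.0%}, investigate")
```

A small, stable gap is often normal (for example, test submissions filtered out of the CRM), so the tolerance is worth agreeing on up front rather than expecting the numbers to match exactly.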

Ensuring data consistency

A major issue with data from multiple sources can be dates. For example, in the USA the date 9/8/18 is the 8th of September 2018, but in many other countries it is the 9th of August 2018. Beyond making sure that dates arrive in one consistent format, check that things like phone numbers are always collected with country codes, or are otherwise known to belong to a specific country.
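A common way to remove the date ambiguity is to record which format each source uses and convert everything to ISO 8601 (YYYY-MM-DD) as early as possible. The sketch below assumes two hypothetical sources, us_crm and au_web_form; the source names and the mapping are illustrative only.

```python
from datetime import datetime

# Assumed mapping from source system to the date format it emits.
SOURCE_DATE_FORMATS = {
    "us_crm": "%m/%d/%y",       # month first
    "au_web_form": "%d/%m/%y",  # day first
}


def to_iso_date(raw: str, source: str) -> str:
    """Parse a raw date string using its source's format and return YYYY-MM-DD."""
    fmt = SOURCE_DATE_FORMATS[source]  # a KeyError means the source is unknown
    return datetime.strptime(raw, fmt).date().isoformat()


us_date = to_iso_date("9/8/18", "us_crm")       # 8 September 2018
au_date = to_iso_date("9/8/18", "au_web_form")  # 9 August 2018
```

The same raw string produces two different dates depending on the source, which is why the format has to travel with the data rather than be guessed at read time.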

Auditing your data on a regular basis

The problem with data, and with the decisions made from it, is that both are dynamic. John Maynard Keynes allegedly made a famous statement along the lines of: “When the facts change, I change my mind. What do you do, Sir?”

So what do you do? When your situation changes, so should your definitions and metrics: what counts as a lead, how you calculate CPL (cost per lead) and CPA (cost per acquisition), and so on. The key to doing this successfully is discovering issues early and fixing them.

But how? Regular data audits. These audits should cross-check all the major definitions and metrics upon which everything else is built. To do so, follow the advice in the “Verifying the data for correctness” tip above. Even better, set up alerts in your business insights tools so that one is raised whenever a value moves outside its expected range.
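If your tooling has no built-in alerting, an expected-range check is simple to script as part of a regular audit. This is a sketch only: the ExpectedRange class, the check_metrics function and the example bounds for cpl and leads_per_day are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpectedRange:
    metric: str
    low: float
    high: float


def check_metrics(observed: dict, ranges: list) -> list:
    """Return one alert message per metric that is missing or out of range."""
    alerts = []
    for expected in ranges:
        value = observed.get(expected.metric)
        if value is None:
            alerts.append(f"{expected.metric}: no value reported")
        elif not (expected.low <= value <= expected.high):
            alerts.append(
                f"{expected.metric}: {value} outside "
                f"[{expected.low}, {expected.high}]"
            )
    return alerts


# Hypothetical bounds agreed with the business.
ranges = [
    ExpectedRange("cpl", 10.0, 50.0),
    ExpectedRange("leads_per_day", 5, 200),
]
alerts = check_metrics({"cpl": 75.0, "leads_per_day": 20}, ranges)
```

Treating a missing metric as an alert, not as a pass, matters in practice: a broken tracking tag usually shows up as no data at all before it shows up as a strange number.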

The most important: Who is responsible?

Having data sets for which no one is responsible is like having a kid in a candy store with no oversight. You’ll quickly get in a sticky situation. Ensure that the right responsibilities are in place to make sure people own and understand the data that you are collecting.

Whenever we are collecting data we need to answer the following questions:

Who will be responsible for the management of this data set?
Where will they record why the tables exist and what they are for?
Does the person have the requisite skills to manage the data?

Don’t assume that your head of IT knows this. Often they will just manage security and the usage of internal computer systems and have no responsibility for the actual understanding of the data in their data systems.
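The first two questions above can be tracked with something as small as a catalogue of data sets and a check for gaps. The sketch below is hypothetical throughout: the DatasetEntry fields and the ownership_gaps function are invented for illustration, and whether an owner has the requisite skills still needs a human judgement that code cannot make.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatasetEntry:
    name: str
    owner: Optional[str] = None          # who manages this data set
    purpose: Optional[str] = None        # why the tables exist
    docs_location: Optional[str] = None  # where that is written down


def ownership_gaps(catalogue: list) -> dict:
    """Map each data set name to the responsibility fields still unanswered."""
    gaps = {}
    for entry in catalogue:
        missing = [
            field
            for field in ("owner", "purpose", "docs_location")
            if not getattr(entry, field)
        ]
        if missing:
            gaps[entry.name] = missing
    return gaps


catalogue = [
    DatasetEntry("DEPLOY_leads_monthly", owner="marketing analyst",
                 purpose="monthly lead reporting", docs_location="team wiki"),
    DatasetEntry("UNPRO_web_events"),  # nobody has claimed this one yet
]
gaps = ownership_gaps(catalogue)
```

Running a check like this during each data audit makes orphaned data sets visible before anyone builds a decision on top of them.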

Now that you have clean and correct data sets that people trust, you can finally leverage them to generate insights via dashboards, data analytics, attribution and other forms of predictive analytics, turning data from a financial burden into a financial asset.

You’ve gone to the effort to have good quality data sets, so be part of the 15% who are leveraging their data to grow their business.
