Getting the right data and getting the data right
Data is great; I love it. Everyone is talking about it after its recent rise to prominence. But data is only valuable if it is used to make better-informed decisions. And to make those decisions, analytics is as critical as the data itself.
The following formula outlines this:
So for those not so algebraically minded, the funny vertical line after ‘data’ gives a condition that the data used must be accurate.
Accurate data is then as crucial as analytics and data itself: analytics performed on inaccurate data is worthless.
This “accuracy” means data needs to be representative of whatever it is reflecting. So, if it’s sales data, the data needs to follow the actual movements of people buying the products. If it’s pricing, the data needs to reflect when prices were altered, and by how much.
So the big question is which data sources are accurate and therefore usable? And if there are inaccuracies, what can be done to overcome them?
The steps to ensure data is accurate
Is it clean?
A key issue is data integrity. That is, checking the data is correct and structured in a way that’s usable. This covers things such as frequency (hourly, daily, weekly), parent level (brand versus product level), regionality (UK, US, etc.), metric (sales, etc.), and of course making sure the data is accurate. The old adage is very apt here: “rubbish in, rubbish out”. You can only build insight on accurate data.
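One of the structural checks mentioned above, frequency, is easy to automate. The snippet below is a minimal sketch rather than a prescribed method: the function name `find_frequency_gaps` and the sample dates are made up for illustration, and it assumes a weekly series in which every observation should sit exactly seven days after the previous one.

```python
from datetime import date, timedelta


def find_frequency_gaps(dates, step_days=7):
    """Return (previous, current) pairs whose spacing is not exactly step_days.

    Catches missing periods (spacing too long) and duplicated or
    misaligned periods (spacing too short).
    """
    ordered = sorted(dates)
    expected = timedelta(days=step_days)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a != expected]


# Illustrative weekly series with the week of 19 January missing.
weeks = [date(2015, 1, 5), date(2015, 1, 12), date(2015, 1, 26), date(2015, 2, 2)]
gaps = find_frequency_gaps(weeks)
for previous, current in gaps:
    print(f"Unexpected spacing between {previous} and {current}")
```

The same idea extends to daily or hourly data by changing the expected step. It only checks structure, so it complements rather than replaces looking at the values themselves.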
Check, check and check again
This data integrity was especially drilled into me in my formative career years. When building models on ‘cleaned’ datasets, some of the models were so large (and PC processing power so poor 15 years ago) that tweaking a model would take more than a few hours. If you made a mistake in the dataset, it took a few hours to get the output to check, so you could quickly lose days of project time due to small data errors. It therefore paid to be meticulously careful with the datasets, ensuring everything was correct before processing.
These days it’s much quicker to run models, but the need for meticulous checking is still there.
Some sources are better than others
So which data sources can be used straight off the bat, and which sources do we need to be more careful with?
Below we have outlined a data source checklist we routinely use in building our own predictive models, along with each source’s benefits, limitations and methods to overcome shortfalls. It’s very hard to list all the data sources, as there are so many, so it would be great to hear additions and builds from others.
Brand Tracking (other types: Copy testing, Buzz, Syndicated Studies, Switching, Gravity modelling)
Very good for understanding the trends behind consumer actions and preferences for brands
Depending on the size of the base, it can be subject to sampling error. This means the data may move, not due to any impetus, but purely a random fluctuation in how people or the group has responded. This can cause issues with interpretation of the data.
By rolling the data or using moving averages you are able to even out the sample ‘wobble’
Helps understand the why’s of performance – did the campaign not translate to purchase intent because the ad didn’t work, or the product wasn't right?
It is also claimed, not observed, behaviour: people may say they will do something, but when it comes to actually parting with cash (i.e. a sale) they may do something different.
Through having a range of different questions / survey metrics you are more able to nullify the claimed vs observed issue
Facilitates easier analysis of branding impacts by using brand tracking as you don’t need to control for price, promotions, etc.
Brand health can be used in modelling to identify brand equity drivers in sales
Frequency of data can be good at the weekly level
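The moving-average smoothing suggested above for evening out sample ‘wobble’ in tracker data can be sketched as follows. This is only an illustration under assumptions: the four-week trailing window, the function name `moving_average` and the awareness figures are invented for the example, and the right window length depends on the base size of the tracker in question.

```python
def moving_average(values, window=4):
    """Trailing moving average over `window` periods.

    The first window - 1 positions are None because there is not yet
    enough history to fill the window.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    for i in range(len(values)):
        if i + 1 < window:
            smoothed.append(None)
        else:
            chunk = values[i + 1 - window : i + 1]
            smoothed.append(sum(chunk) / window)
    return smoothed


# Made-up weekly brand awareness scores (%), bouncing around with no real trend.
awareness = [40, 44, 38, 42, 46, 40]
rolled = moving_average(awareness, window=4)
print(rolled)  # [None, None, None, 41.0, 42.5, 41.5]
```

The raw series swings by up to eight points week to week, while the rolled series stays within a point or two, which is closer to what is plausibly happening in the population. The trade-off is that a longer window also delays and dampens genuine movements.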
Internal Business Data
Revenue/Sales/Acquisition (other types: Pricing / Promos, Distribution, P&L, Website visits, Customer Data)
Sales data is population data, i.e. it is not taken from a sample but covers all of the people purchasing the brand. It therefore doesn’t suffer from sampling error, and it records what people are actually doing rather than claimed behaviour. As a result it tends to be very robust.
In FMCG, sales can be recorded either as sales leaving the manufacturer (ex-factory) or sales bought by the consumer (EPOS, or till receipts). Ex-factory isn’t reflective of consumer behaviour, as it is what the manufacturer has pushed out rather than what the consumer has pulled in.
Use Volume or Units over Revenue and Profit where possible. It’s simpler to model Units/Volume, and you can extract price / promotional impacts
Sales are usually the closest metric to profit, and therefore the best barometer of company performance without bringing in cost changes
In retail, sales can be obscured by stores/sites opening. For example, if 100 stores open in a given year, there is bound to be sales growth, but this muddies the actual organic performance.
To avoid errors with aggregation in sales data, always source at the lowest level possible and aggregate up to ensure 100% confidence in the data.
Can become less than robust if product sales are low – this introduces something like sample error as there are more random fluctuations. This occurs for expensive, less frequently purchased items or for B2B businesses.
|Like for Like|
For retail, analysing like-for-like sales (i.e. removing opening or closed stores) helps understand true performance. It can also be good practice to further remove any other stores which follow an erratic pattern (e.g. due to refits, nearby stores closing, etc.) to get a really clean measure. Just ensure the number of stores left is still robust / representative (>60%).
Always use EPOS data over ex-factory for FMCG.
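The like-for-like idea described above can be sketched in a few lines. This is a hypothetical illustration, not the authors’ implementation: the data layout, the function name `like_for_like_growth` and the store figures are assumptions, and the 60% threshold is interpreted here as the share of stores that must remain in the comparable set.

```python
def like_for_like_growth(sales_by_store, prior, current, min_coverage=0.6):
    """Sales growth across stores that traded in both periods.

    sales_by_store maps store id -> {period: sales}; a missing period
    means the store was not trading then (opened later or closed).
    Returns (growth, coverage), where coverage is the share of stores kept.
    """
    if not sales_by_store:
        raise ValueError("no stores supplied")
    comparable = [
        store for store, periods in sales_by_store.items()
        if prior in periods and current in periods
    ]
    coverage = len(comparable) / len(sales_by_store)
    if coverage <= min_coverage:
        raise ValueError(f"only {coverage:.0%} of stores are comparable")
    prior_total = sum(sales_by_store[s][prior] for s in comparable)
    current_total = sum(sales_by_store[s][current] for s in comparable)
    return current_total / prior_total - 1, coverage


# Made-up example: store D opened in 2015, so it is excluded.
stores = {
    "A": {2014: 100, 2015: 110},
    "B": {2014: 200, 2015: 190},
    "C": {2014: 150, 2015: 165},
    "D": {2015: 120},
    "E": {2014: 50, 2015: 55},
}
growth, coverage = like_for_like_growth(stores, prior=2014, current=2015)
print(f"Like-for-like growth: {growth:.1%} on {coverage:.0%} of stores")
```

In this toy data, total sales rise from 500 to 640 (28%), but almost all of that comes from the new store; on a like-for-like basis the growth is about 4%. Removing stores with erratic patterns, as suggested above, would be an extra filter applied before this calculation.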
Media data (other types: Creative, PR)
Media measures are generally a great barometer of who is consuming media.
TV measures tend to be the most reliable offline data, but can’t record eyeballs: i.e. they don’t know whether someone is watching the program/ad or whether they are out of the room. More often than not, however, they are pretty accurate.
Imperative to check: ensure campaigns / media data are where they say they are. It’s good hygiene to cross-reference with media plans (actual vs planned match up) and other third-party sources.
TV data is especially good, being able to split down to a daily and regional level. In most markets this data is sourced from a large panel, so is sample data but tends not to have sampling issues.
|Radio is Blunt|
Radio can be a bit blunt due to being a small sample of diarised users.
As with research data, smoothing the data (using decays or adstocks) can make it more reflective of consumer behaviour patterns, i.e. memory decay effects.
|Digital highly detailed|
Digital data is generally very good: again daily and hourly data is easily accessible, and the data is population based, meaning it’s very reliable. Impressions are not very reliable; clicks are much more useful.
|Outdoor difficult to capture|
Outdoor data tends to be not very reliable – something the outdoor industry is working on. Digital OOH should help here, as well as all of the recent developments in tracking monitors.
Radio is pretty good, sourced from a diarised panel of users.
|Press very blunt|
Press data is the number of inserts multiplied by six-monthly readership: not great
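The decay / adstock smoothing mentioned above for media data can take several forms, and the article does not say which one is used. A common and simple choice is a geometric adstock, where each period carries over a fixed fraction of the previous period’s accumulated effect. The sketch below assumes that form; the function name `adstock`, the decay rate of 0.5 and the TV ratings are illustrative only.

```python
def adstock(media, decay=0.5):
    """Geometric adstock: a[t] = media[t] + decay * a[t - 1].

    decay is the share of last period's accumulated effect that is
    still 'remembered' this period (0 = no carry-over).
    """
    if not 0 <= decay < 1:
        raise ValueError("decay must be in [0, 1)")
    carried = 0.0
    transformed = []
    for spend in media:
        carried = spend + decay * carried
        transformed.append(carried)
    return transformed


# Made-up weekly TV ratings: a burst, two dark weeks, then a smaller burst.
tvrs = [100, 0, 0, 50, 0]
stocked = adstock(tvrs, decay=0.5)
print(stocked)  # [100.0, 50.0, 25.0, 62.5, 31.25]
```

The transformed series keeps a fading presence in the weeks after the advertising stops, which is the memory decay effect described above. In practice the decay rate is usually estimated or tested per channel rather than assumed.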
Economy (other types: Seasonality, Climate, Regulatory Changes)
Macroeconomic movements are key to many businesses’ performance, so the wide variety of data that exists on this for free is a great resource
|Tricky to source|
The ONS website is a very tricky website to navigate, and the data can be difficult to extract
It’s best to avoid using the ONS directly and use an aggregator such as Euromonitor, from which data is much easier to extract
The ONS has a huge wealth of data, different metrics with regional, demographic and social splits
Data from the ONS is slow to come out (3 month lag – data comes out 3 months after the event) and subject to revisions, meaning that actual data can be revised and therefore changed on later releases – very irritating for modelling and forecasting purposes
To avoid spikes in data and revisions of data, moving averages can be used on ONS data to soften these impacts
|Accurate and representative|
Consumer confidence is recorded monthly and tends to be a very good reflection of how consumers change behaviour due to economic conditions across many different industries
Frequency of data tends to be quarterly, which as a frequency is fairly unusable unless you are investigating very long term trends
|GfKs Consumer Confidence|
Consumer confidence (and business confidence) is published with a 3 week lag, is available monthly and splits down into various other sections: personal financial situation , business confidence, etc. This data is very useful, and very reflective of the economy
So an analyst’s job isn’t easy: there are lots of data sources, each with its own problems. It is critical to understand the data you are dealing with:
- What is it?
- Where was it sourced?
- Is there a breakdown / more granular data available?
- What’s the frequency?
- Does it look right?
The last point is key – chart the data, look at it and use your common sense to interpret if it looks correct.
Data should be at the core of decision making. If you have accurate data and analytics in place, you will begin to live off data and the insight it brings.
For that reason, I would like to use this piece as a discussion point and invite readers to add and improve this list to help with our collective understanding of data and ways to overcome issues.
Happy data hunting!
Brightblue Consulting Ltd
Brightblue is a specialist optimisation and ROI consultancy. Our experience ranges from consulting to market mix modelling (econometrics) to global budget setting and optimisation. We take a clear, dynamic and commercial approach to ensure results are used to maximum effect. Our key purpose is to deliver clarity from complexity.
Brightblue have been in the top 100 startups in the UK for 2014 and 2015.