How to improve your data quality in four easy steps

Data quality is extremely important. Nowadays, most big companies collect data with the intent of making data-driven decisions to create value. Facilitated by improving technology, analyzing data can provide companies with valuable insights that enable better decisions and lead to a competitive advantage. In order to create value from data, the data itself must be of high quality. This means it is complete, correct, and up to date, and that data from different sources do not contradict each other. Collecting data can be compared to building the foundation of a house. If the foundation is bad, everything built on it could eventually collapse. When basing decisions on data, it is therefore imperative to make sure the data – the foundation of all decisions – is of high quality.

Common problems

There are several steps preceding data analysis and data-driven decision making, ranging from data collection to data interpretation.

Low data quality can result from problems in any one of these stages. Errors can occur when collecting data, when delivering it to your database, when storing it, when combining data from different sources and when interpreting it. Causes of low-quality data can include:

  • Dependence on manual data imports, which are prone to human error.
  • Poorly designed automatic data imports.
  • Usage of varying data formats.
  • Failures of systems at the source or on your own side.
  • Greater focus on data quantity than data quality.
  • Absence of ownership of data quality within an organization.

Solution

The following four steps offer a guideline for increasing data quality.

Step 1

Keep a clear goal in mind when collecting data. Irrelevant data costs unnecessary time and effort to process, and if it is left unchecked, unnoticed errors in it may degrade overall data quality. You must be able to justify the need to collect this data.

  • Why are you collecting this data?
  • Which question will this data help you answer?
  • What will you do with this data?
    • For example, what kind of analyses are you planning to run and is this data suitable?

Step 2

Make sure you understand the data. With a better understanding of the data, you will spot errors more easily.

  • What does the data really mean?
  • Does the collected data make sense?
  • Are the values reasonable?
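
One way to make the question of reasonable values concrete is to write your expectations down as ranges and let a script flag everything that falls outside them. The snippet below is only a minimal sketch in Python: the column names and the limits in EXPECTED_RANGES are made-up assumptions, not values taken from any real dataset.

```python
# Hypothetical plausibility rules: column name -> (minimum, maximum).
EXPECTED_RANGES = {
    "clicks": (0, 1_000_000),
    "conversion_rate": (0.0, 1.0),
}


def find_implausible_values(rows):
    """Return (row index, column, value) for every value outside its expected range."""
    problems = []
    for index, row in enumerate(rows):
        for column, (low, high) in EXPECTED_RANGES.items():
            value = row.get(column)
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            # Missing or non-numeric values are reported too, since they cannot be judged.
            if not is_number or not (low <= value <= high):
                problems.append((index, column, value))
    return problems


# Illustrative rows only.
rows = [
    {"channel": "email", "clicks": 120, "conversion_rate": 0.04},
    {"channel": "search", "clicks": -5, "conversion_rate": 0.02},
    {"channel": "social", "clicks": 300, "conversion_rate": 1.7},
]
print(find_implausible_values(rows))
```

In this example the negative click count and the conversion rate above 1 are flagged. Choosing the ranges is the actual work, and it is only possible once you understand what each column means.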

Step 3

Be critical and thoroughly inspect all data. The more data sources are used, the higher the chance that one or more of those sources contain bugs.

  • Is the data clean?
    • Are there no instances of CapITAlizATIoN that interfere with search and filter functions?
    • Have dates been converted to the same format upon import or export?
    • Are any abbreviations used consistent?
    • Are there no blank spaces between words or numbers interfering with search and filter functions?
    • Are your separators – dots, dashes, spaces – consistent?
    • Have duplicates been removed?
    • Are missing values coded correctly and consistently?
  • Have you compared data collected from different data sources to test whether all values correspond between datasets?
    • For example, does the number of website entries per channel correspond to the number of clicks per channel?
  • Are you checking data at the point of entry?
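
Several of the cleaning questions above can be turned into a small routine that runs on every import. The following Python sketch uses only the standard library and rests on assumptions: the field names "channel" and "date", the list of accepted date formats and the set of missing-value markers are all hypothetical and need to be adapted to your own data.

```python
from datetime import datetime

# Assumed input date formats; extend this list to match your own sources.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d-%m-%Y")
# Assumed spellings that all mean "no value".
MISSING_MARKERS = {"", "n/a", "na", "null", "none", "-"}


def clean_text(value):
    """Trim and collapse whitespace, lowercase, and map missing markers to None."""
    text = " ".join(str(value).split()).lower()
    return None if text in MISSING_MARKERS else text


def clean_date(value):
    """Convert a date string in any known format to ISO 8601 (YYYY-MM-DD)."""
    text = str(value).strip()
    for date_format in DATE_FORMATS:
        try:
            return datetime.strptime(text, date_format).date().isoformat()
        except ValueError:
            continue
    return None  # Unparseable dates are treated as missing.


def clean_rows(rows):
    """Normalize every row and drop exact duplicates, keeping first occurrences."""
    seen = set()
    cleaned = []
    for row in rows:
        record = (clean_text(row["channel"]), clean_date(row["date"]))
        if record not in seen:
            seen.add(record)
            cleaned.append({"channel": record[0], "date": record[1]})
    return cleaned


# Illustrative rows only.
rows = [
    {"channel": "  EMail ", "date": "2023-01-31"},
    {"channel": "email", "date": "31/01/2023"},
    {"channel": "N/A", "date": "31-01-2023"},
]
print(clean_rows(rows))
```

The first two rows turn out to be the same record once capitalization, whitespace and date format are normalized, so only one of them is kept. Note that a format list like this cannot tell day/month from month/day order on its own, so you need to know which convention each source uses.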

Step 4

If possible, set up automatic data imports. Errors in data are never completely avoidable, as data collection is prone to flaws in both human and technical operations. It is possible, however, to reduce the margin of error. Eliminate dependence on regular manual imports, since they are especially prone to errors. Automatic imports also save a lot of time.

  • Do you know what types of data need to be imported?
  • What does the data need to look like?
  • Are calculations and formulas tested for accuracy?
  • Do you regularly compare each new import with a static copy of the data from before the import, to confirm that past values have not changed and that nothing went wrong during the last import?
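
The comparison between a new import and a static copy of earlier data could look roughly like the Python sketch below. It assumes that both datasets can be reduced to a mapping from a key (here a date string) to a value; the function name and the sample numbers are purely illustrative.

```python
def find_changed_history(previous, new_import):
    """Compare values shared by both imports and report any that changed.

    Both arguments map a key (here: a date string) to a value. Keys that only
    exist in the new import are new data and are not reported.
    """
    changed = {}
    for key, old_value in previous.items():
        if key not in new_import:
            changed[key] = (old_value, None)  # Record disappeared.
        elif new_import[key] != old_value:
            changed[key] = (old_value, new_import[key])
    return changed


# Illustrative data only.
previous = {"2023-01-01": 120, "2023-01-02": 95}
new_import = {"2023-01-01": 120, "2023-01-02": 59, "2023-01-03": 130}
print(find_changed_history(previous, new_import))
```

An empty result means the historical values are unchanged. Anything else deserves a closer look before the new import replaces the old data.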

Checking all the data for errors manually is time-consuming. Ideally, you would design a system that checks the data automatically and continuously, and that sends you an automatic notification when critical errors appear in the data. It is also advisable to implement a naming convention so that data is tagged consistently and can all be analyzed at a later point in time.
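
A system like this can start very small. The Python sketch below runs a list of checks over an import and hands every failure to a notification function. The two checks, the "channel" field and the notify callback are assumptions for illustration; in practice notify could send an email or a chat message, which is deliberately left out here.

```python
def check_not_empty(rows):
    """Fail when an import delivered no rows at all."""
    return "import contains no rows" if not rows else None


def check_required_fields(rows):
    """Fail when rows lack a value in the (hypothetical) 'channel' field."""
    missing = sum(1 for row in rows if row.get("channel") in (None, ""))
    return f"{missing} row(s) without a channel" if missing else None


# Checks considered critical in this sketch; add your own functions here.
CRITICAL_CHECKS = [check_not_empty, check_required_fields]


def run_checks(rows, notify):
    """Run every check, pass each failure message to notify, return the failures."""
    failures = []
    for check in CRITICAL_CHECKS:
        message = check(rows)
        if message is not None:
            failures.append(f"{check.__name__}: {message}")
    for failure in failures:
        notify(failure)
    return failures


# For the example, notifications are simply collected in a list.
alerts = []
rows = [{"channel": "email"}, {"channel": ""}]
run_checks(rows, notify=alerts.append)
print(alerts)
```

Keeping each check as a separate function makes it easy to add a new one every time a new kind of error is discovered.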

Upon finding an error in the data, there are several steps to take. First, decide whether the error is critical. If it is not, it might not be necessary to act. If the error is critical, you have three options:

  • Correct it. When the error can be fixed without too much effort, it should be. Also, investigate whether the problem can be solved categorically. If so, improve your automated data import so that it handles the problem automatically should it occur again.
  • Remove it. When the error is too difficult to fix, cannot be fixed or the impact of the error is too large, you should remove the data from your analyses.
  • Accept it. If it is impossible to fix the error without large amounts of effort and the impact of the error is negligible, you might choose to accept the error. This is not the preferred option, as it may lead to unforeseen consequences when analyzing the data, but it is extremely difficult to achieve one hundred percent correct data. Only resort to this option when you know it will not have significant consequences for your data quality.
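
The decision described above can be written down as a small function, which helps to apply it consistently. The Python sketch below is one possible reading of the options, not a fixed rule: the "low" and "high" labels for effort and impact are simplifications, and the order in which the conditions are tested is an assumption.

```python
from enum import Enum


class Action(Enum):
    NO_ACTION = "no action needed"
    CORRECT = "correct it"
    REMOVE = "remove it"
    ACCEPT = "accept it"


def triage(is_critical, fix_effort, impact):
    """Suggest how to handle an error.

    fix_effort and impact are assumed to be either "low" or "high";
    an error that cannot be fixed at all counts as "high" effort.
    """
    if not is_critical:
        return Action.NO_ACTION
    if fix_effort == "low":
        return Action.CORRECT
    if impact == "high":
        return Action.REMOVE
    # Hard to fix and negligible impact: accepting is the last resort.
    return Action.ACCEPT


print(triage(is_critical=True, fix_effort="high", impact="low").value)
```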

Conclusion

Following these four simple steps can help improve your data quality. It is important to make decisions based on high-quality data. Low data quality can lead to bad insights and wrong decisions, potentially worsening the company’s performance. It may be difficult to identify problems in your data quality, as they can occur at many different stages. So, keep the goal in mind, understand the data, be critical and set up automatic data imports whenever possible. It is important to be aware of possible errors and to clean the data. This is a time-consuming process, but it will help you take future errors into account and improve the data import. Keep working on your data quality step by step. Achieving high data quality requires continuous work; it does not just happen by accident. Investing in this process will save you time and effort in the long run, and prevent you from making bad decisions based on low-quality data.

Matthijs Aantjes

Data Scientist