Stop Dirty Data: 5 Steps to Data Cleaning
One of the biggest issues we see among our clients is dirty data; that is, data that is inaccurate, erroneous, or otherwise incomplete. While this may seem like a tiny blip in the realm of big data and analytics, it is quite possibly the biggest barrier to enhancing a company’s overall data insight capabilities. Here are 5 steps to cleaning your data and ensuring data integrity.
1.) Know Your Data
Before you even attempt to start cleaning up your data, you need to ask yourself: What is the data used for currently, and how will it be used in the future? Without a sense of how the data will be used, it will be difficult to clean it up because you won’t know which fields to eliminate (if any) or what glaring gaps lie within the data. A client once gave us the task of cleaning up their data and bringing it all into one place. When we presented the pristine, cleaned-up data to them, we asked what it would be used for, and their response was that they had no idea!
Bottom line: if you don’t know what your data will be used for, you will have a much harder time knowing where to even start the clean-up process.
2.) Sanitize Your Inputs
Let’s talk about how your data ends up in a database. Jane registers for a website and enters her name, e-mail address, and zip code. Her e-mail address is stored in the database with an extra trailing space because of how the form handled her input, so instead of “email@example.com”, it reads as “email@example.com ”. While this may seem insignificant, that extra space makes the e-mail field unreliable and could cause problems when other programs read the data. Cleaning up your currently stored data is all well and good, but if you haven’t sanitized your inputs, bad data will keep flowing in.
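As a minimal sketch of what sanitizing at the point of entry could look like, the Python snippet below trims surrounding whitespace from every text field before it is stored. The function name `sanitize_signup`, the field names, and the sample zip code are hypothetical and only there for illustration.

```python
def sanitize_signup(form: dict) -> dict:
    """Return a copy of the submitted form with surrounding whitespace removed."""
    cleaned = {}
    for field, value in form.items():
        # Strip leading/trailing whitespace from text values; leave other types alone.
        cleaned[field] = value.strip() if isinstance(value, str) else value
    return cleaned


# Hypothetical form submission with stray spaces in two fields.
raw = {"name": "Jane", "email": "email@example.com ", "zip": " 60601"}
clean = sanitize_signup(raw)
print(repr(clean["email"]))  # 'email@example.com'
```

Running the clean-up before the insert, rather than in a later batch job, keeps the extra space from ever reaching the database.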
3.) Identify a Unique Field
This may seem like a no-brainer, but without a unique field, working with your dataset will be much more difficult. Relational (SQL) databases let you declare a unique field as the Primary Key, and MongoDB stores a unique _id field (an ObjectId by default) on every document. This unique field is important in identifying links between two tables, so you are able to append data from one table to another.
For example, let’s say you have two separate tables: your customer list and your e-mail campaign data. Each customer has a reference number that appears in both tables. That unique reference number is what lets you link the customer data to the e-mail campaign data.
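The two-table example can be sketched with Python’s built-in sqlite3 module. The table names, column names, and sample rows below are assumptions made up for illustration; the point is only that the shared reference number is the primary key in one table and the join column in the other.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        ref_no INTEGER PRIMARY KEY,   -- the unique field
        name   TEXT NOT NULL
    );
    CREATE TABLE email_campaigns (
        ref_no   INTEGER NOT NULL REFERENCES customers(ref_no),
        campaign TEXT NOT NULL,
        opened   INTEGER NOT NULL     -- 1 = opened, 0 = not opened
    );
    INSERT INTO customers VALUES (101, 'Jane'), (102, 'Sam');
    INSERT INTO email_campaigns VALUES
        (101, 'spring_sale', 1),
        (102, 'spring_sale', 0);
""")

# Append campaign data to each customer by matching on the reference number.
rows = conn.execute("""
    SELECT c.ref_no, c.name, e.campaign, e.opened
    FROM customers AS c
    JOIN email_campaigns AS e ON e.ref_no = c.ref_no
    ORDER BY c.ref_no
""").fetchall()

for row in rows:
    print(row)
```

Because `ref_no` is declared as the primary key, the database rejects a second customer with the same reference number, which is what keeps the join unambiguous.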
4.) Validate Your Fields
If you pay attention to only one step in this article, this should be it. Validating your fields is vital in cleaning up your data.
Remove Extra Spaces
Stray spaces in your data can directly affect how other software or APIs read the information. Be proactive in removing any unnecessary spaces in your fields. If a cell truly contains extra space characters, the TRIM function in Excel (for example, =TRIM(A2)) removes leading and trailing spaces and collapses repeated spaces between words into a single space.
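If your data lives outside a spreadsheet, the same clean-up is easy to script. Here is a rough Python counterpart of Excel’s TRIM, offered as a sketch; the function name `trim` is our own, and like TRIM it only handles the ordinary space character, not tabs or non-breaking spaces.

```python
def trim(text: str) -> str:
    """Drop leading/trailing spaces and collapse runs of spaces into one."""
    # Splitting on a single space yields empty strings wherever spaces repeat;
    # filtering those out and re-joining leaves exactly one space between words.
    return " ".join(part for part in text.split(" ") if part)


print(repr(trim("  Jane   Doe ")))  # 'Jane Doe'
```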
Remove Illegal Characters
Sometimes illegal characters make their way into your otherwise clean data. We all know that naming a file on a Windows computer with characters such as \ / : * ? will elicit an error response, and the same goes for cell data. For example, let’s say you have a “date” column, and for whatever reason the “#” symbol is being inserted in every row. The column will no longer be read as a date column but as a text or general column.
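One way to handle the stray “#” case is to whitelist the characters a field is allowed to contain and drop everything else. The sketch below assumes a date value should only ever contain digits and the separators “-”, “/”, and “.”; that rule, the sample values, and the name `strip_illegal` are assumptions for illustration, so adjust the pattern to your own date format.

```python
import re

# Assumption: anything other than digits, "-", "/" and "." does not belong in a date.
NOT_ALLOWED = re.compile(r"[^0-9/.\-]")


def strip_illegal(value: str) -> str:
    """Remove every character that is not allowed in a date value."""
    return NOT_ALLOWED.sub("", value)


dates = ["#2023-04-01", "2023-04-02#", "20#23-04-03"]
cleaned_dates = [strip_illegal(d) for d in dates]
print(cleaned_dates)  # ['2023-04-01', '2023-04-02', '2023-04-03']
```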
Do Not Store the Date Field as a Text String
The date column is crucial in determining when an event took place. Let’s say you want to view the revenue for the past year that was attributed to your e-mail campaigns. If the e-mail sent date field is stored as text and not as an actual date, you won’t be able to filter the data by date range to total your revenue.
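To illustrate, here is a small Python sketch that converts text dates into real date values and then totals revenue for a one-year window. The records, the field names `sent` and `revenue`, the YYYY-MM-DD format, and the fixed end date are all hypothetical.

```python
from datetime import date, datetime, timedelta

# Hypothetical campaign records where the sent date arrived as text.
sends = [
    {"sent": "2024-03-15", "revenue": 1200.0},
    {"sent": "2023-01-10", "revenue": 800.0},
    {"sent": "2024-06-01", "revenue": 450.0},
]


def parse_sent(text: str) -> date:
    """Convert a YYYY-MM-DD string into a real date; raises ValueError on bad input."""
    return datetime.strptime(text, "%Y-%m-%d").date()


def revenue_between(rows, start: date, end: date) -> float:
    """Total the revenue of rows whose sent date falls within [start, end]."""
    return sum(r["revenue"] for r in rows if start <= parse_sent(r["sent"]) <= end)


end = date(2024, 6, 30)            # fixed "today" so the example is reproducible
start = end - timedelta(days=365)  # one year back
total = revenue_between(sends, start, end)
print(total)  # 1650.0
```

A side benefit of parsing is that a malformed value fails loudly with a ValueError instead of silently dropping out of the filter.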
Store E-mail Address as Lowercase
For consistency purposes, it’s important to store your customers’ e-mail addresses as lowercase. If you store a customer’s e-mail address in uppercase or mixed case and the customer later decides to unsubscribe from communications, the address may not get unsubscribed because of a case mismatch.
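A simple way to avoid the mismatch is to run every address through the same normalizing function both when it is stored and when it is looked up. The sketch below uses an in-memory set as a stand-in for the subscriber table; the names `normalize_email`, `subscribers`, and `unsubscribe`, along with the sample address, are hypothetical.

```python
def normalize_email(address: str) -> str:
    """Trim surrounding whitespace and lowercase the address."""
    return address.strip().lower()


# Stand-in for a subscriber table; the address is normalized as it is stored.
subscribers = {normalize_email("Jane.Doe@Example.com")}


def unsubscribe(address: str) -> bool:
    """Remove the address if present; return True when something was removed."""
    key = normalize_email(address)
    if key in subscribers:
        subscribers.remove(key)
        return True
    return False


# The request arrives with different casing but still matches.
print(unsubscribe("jane.doe@example.COM"))  # True
```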
5.) Keep Tomorrow in Mind
Looking ahead when it comes to data organization and cleaning will save you valuable chunks of time in the future. Not all data fields may be useful immediately, so hiding and safekeeping seemingly useless information instead of deleting fields will help save time and money when business needs change. Process guidelines for data structure may be time-consuming to write, but they will make a world of difference when explaining data entry and manipulation to new hires and other team members.
Consistency is the key to a cleaner data future.