Data cleansing is the process of removing or correcting incorrect, inaccurate, irrelevant, or corrupt data in a dataset, thus “cleansing” it and making it more accurate. A cleansed dataset is considered higher quality, which means it can support more accurate, reliable, and consistent decisions.
The steps in cleansing a dataset are:
- Data Auditing – reviewing the original dataset to identify any major issues. This can be done manually or by using a tool that automates the process.
- Error Detection – identifying data entries that are incorrect, corrupt, misformatted, misspelled, or inconsistent with the rest of the dataset.
- Data Correction – correcting the errors identified in the previous two steps.
- Handling Missing Data – filling in any gaps within the dataset. Missing values may be generated by a formula or replaced with default values.
- Normalization/Standardization – ensuring all the data is in the same format across the dataset.
- Deduplication – removing any duplicates within the dataset.
- Validation – creating a system or set of rules that checks new data as it enters the dataset, reducing the need to repeat the full cleansing process.
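
The steps above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a definitive implementation: the field names (`name`, `email`, `age`), the sample rows, the deliberately simple email pattern, the 0 to 120 age range, and the choice to fill missing ages with the median are all assumptions made for the example. Rows that still fail validation after correction are dropped here, which is only one possible policy.

```python
import re
import statistics

# Assumption: a deliberately simple email pattern, not a full RFC check.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def audit(rows):
    """Data auditing: count missing (None or blank) values per field."""
    counts = {}
    for row in rows:
        for field, value in row.items():
            if value is None or str(value).strip() == "":
                counts[field] = counts.get(field, 0) + 1
    return counts

def parse_age(value):
    """Error detection: return a plausible integer age, or None if it is not one."""
    try:
        age = int(value)
    except (TypeError, ValueError):
        return None
    return age if 0 <= age <= 120 else None

def is_valid(row):
    """Validation: the rule set every cleansed row must satisfy."""
    email_ok = EMAIL_RE.match(row["email"]) is not None
    age_ok = isinstance(row["age"], int) and 0 <= row["age"] <= 120
    return email_ok and age_ok

def cleanse(rows, default_age=0):
    # Handling missing data: use the median of the valid ages as the fill value,
    # falling back to a default when no valid age exists.
    ages = [a for a in (parse_age(r.get("age")) for r in rows) if a is not None]
    fill_age = statistics.median_low(ages) if ages else default_age

    cleaned, seen_emails = [], set()
    for row in rows:
        # Normalization: trim whitespace and use one letter case per field.
        name = (row.get("name") or "").strip().title()
        email = (row.get("email") or "").strip().lower()
        # Data correction: corrupt or missing ages are replaced with the fill value.
        age = parse_age(row.get("age"))
        if age is None:
            age = fill_age
        record = {"name": name, "email": email, "age": age}
        # Validation: drop rows that still break the rules.
        if not is_valid(record):
            continue
        # Deduplication: keep the first row seen for each normalized email.
        if email in seen_emails:
            continue
        seen_emails.add(email)
        cleaned.append(record)
    return cleaned

# Hypothetical sample data for illustration only.
raw = [
    {"name": "  alice smith ", "email": "Alice@Example.com ", "age": "34"},
    {"name": "Alice Smith", "email": "alice@example.com", "age": 34},    # duplicate
    {"name": "bob jones", "email": "bob@example.com", "age": None},      # missing age
    {"name": "Carol", "email": "carol-at-example.com", "age": "29"},     # bad email
    {"name": "dan", "email": "dan@example.com", "age": "abc"},           # corrupt age
]

print(audit(raw))          # {'age': 1}
print(len(cleanse(raw)))   # 3
```

Normalization runs before deduplication on purpose: the two Alice rows only become recognizable as duplicates once whitespace and letter case have been made consistent. The same `is_valid` function could also be applied to new rows as they arrive, which is the role the validation step plays.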