1. Handling Missing Data
- A common problem in datasets is missing data. Three strategies for handling
missing data are Removing Records, Imputing Values, and Using Algorithms.
The first task is identifying values that are missing, blank, or in the wrong
format. For example, if contact details are missing, or a phone number was
recorded in the address field, the record must be corrected so that the
information is accurate.
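The identification step described above can be sketched in plain Python. This is only a minimal illustration: the records, the field names, and the assumed phone format (`PHONE_PATTERN`) are hypothetical and not taken from any real system.

```python
import re

# Hypothetical contact records; field names are assumptions for illustration.
records = [
    {"name": "Ana", "phone": "555-0101", "address": "12 Oak St"},
    {"name": "Ben", "phone": "", "address": "555-0102"},
    {"name": "Cara", "phone": None, "address": "  "},
]

# Assumed phone format (three digits, a dash, four digits).
PHONE_PATTERN = re.compile(r"^\d{3}-\d{4}$")


def is_missing(value):
    """Treat None and empty or whitespace-only strings as missing."""
    return value is None or str(value).strip() == ""


def audit(record):
    """Return a list of problems found in one record."""
    problems = [f"missing {field}" for field, value in record.items() if is_missing(value)]
    address = record.get("address")
    # A value in the address field that matches the phone format is suspicious.
    if not is_missing(address) and PHONE_PATTERN.match(address.strip()):
        problems.append("address looks like a phone number")
    return problems


report = {record["name"]: audit(record) for record in records}
```

Here `report` lists the problems for each record, so you can decide which records to correct, fill in, or remove.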
2. Remove Irrelevant Data
- Duplicates may cause inaccurate results in your system. In this process, you will
apply Deduplication, which includes Identifying Duplicate Entries, Removing
Duplicate Records, Identifying Redundant Observations, and Eliminating Irrelevant
Information. The goal is to reduce redundancy. For example, if your database has
recorded the information for the same user twice, this process solves that problem
by identifying and removing the duplicates.
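A minimal sketch of deduplication in plain Python follows. It assumes that two records describe the same user when their email addresses match after trimming spaces and ignoring case; the `users` data and the choice of `email` as the key are hypothetical.

```python
# Hypothetical user records; the second and third describe the same user.
users = [
    {"email": "ana@example.com", "name": "Ana"},
    {"email": "ben@example.com", "name": "Ben"},
    {"email": " Ben@Example.com", "name": "Ben"},
]


def deduplicate(rows, key="email"):
    """Keep the first record seen for each normalized key value."""
    seen = set()
    unique = []
    for row in rows:
        normalized = row[key].strip().lower()
        if normalized not in seen:
            seen.add(normalized)
            unique.append(row)
    return unique


unique_users = deduplicate(users)
```

Keeping the first occurrence is only one possible rule. In a real database you might prefer to keep the most recently updated record instead.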
3. Fix Structural Errors
In this process you will fix inconsistent data formats, naming conventions, and
variable types. This step involves Standardizing Data Formats, Correcting Naming
Discrepancies, and Ensuring the Uniformity of your Data Representation, so that
every value of the same kind is represented in the same format. For example,
consider the dates in your system. Sometimes the date format is not consistent,
such as a mix of MM/DD/YYYY and YYYY-MM-DD, which makes your database
inconsistent.
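The date example can be sketched with Python's standard `datetime` module. The sketch assumes that only the two formats mentioned above occur and that YYYY-MM-DD is chosen as the target format; `to_iso_date` is a hypothetical helper name.

```python
from datetime import datetime

# Formats assumed to occur in the data; extend the tuple if others appear.
KNOWN_FORMATS = ("%m/%d/%Y", "%Y-%m-%d")


def to_iso_date(text):
    """Parse a date in any known format and return it as YYYY-MM-DD."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue  # try the next known format
    raise ValueError(f"unrecognized date format: {text!r}")


dates = ["03/15/2024", "2024-03-16"]
standardized = [to_iso_date(d) for d in dates]
```

Raising an error for an unknown format, instead of guessing, keeps wrongly parsed dates out of the cleaned data.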
4. Handle Missing Data
- Missing data can affect the integrity of your system. In this process you can handle
missing data by Imputing Missing Values, Removing Records with Missing
Values, or Employing Advanced Imputation Techniques. These strategies help you
either fill in the missing values or remove the records that contain them. For
example, suppose you have an e-commerce website backed by a database. If the price
column of your dataset is missing a value, that gap could distort the analysis of your revenue.
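The two basic strategies can be sketched as follows. The `orders` rows are hypothetical, and the median is only one possible imputation choice (the mean or a model-based estimate are others); which one fits depends on your data.

```python
from statistics import median

# Hypothetical order rows; None marks a missing price.
orders = [
    {"item": "mug", "price": 8.0},
    {"item": "pen", "price": None},
    {"item": "bag", "price": 20.0},
    {"item": "cap", "price": 12.0},
]


def drop_missing(rows, field):
    """Strategy 1: remove every record whose field is missing."""
    return [row for row in rows if row[field] is not None]


def impute_median(rows, field):
    """Strategy 2: replace missing values with the median of the known ones."""
    fill = median(row[field] for row in rows if row[field] is not None)
    return [{**row, field: fill if row[field] is None else row[field]} for row in rows]


complete_only = drop_missing(orders, "price")
imputed = impute_median(orders, "price")
```

Both functions return new lists and leave `orders` unchanged, so you can compare the results of each strategy before choosing one.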
5. Normalize Data
When organizing data, you can use data normalization to improve storage
efficiency. This step involves Splitting Data into Multiple Tables and Ensuring Data
Consistency: you divide the data into separate tables and verify that each piece of
data is stored in the right table. For example, suppose the customer database stores
all of its information in one table. Splitting that table according to the normal
forms may improve data consistency.
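As a rough illustration of splitting one table into two, the sketch below uses plain Python lists of dictionaries in place of database tables. All column names and the generated `customer_id` values are assumptions made for this example.

```python
# Hypothetical flat table: customer details are repeated on every order row.
flat_rows = [
    {"order_id": 1, "customer_email": "ana@example.com", "customer_name": "Ana", "total": 30.0},
    {"order_id": 2, "customer_email": "ana@example.com", "customer_name": "Ana", "total": 12.5},
    {"order_id": 3, "customer_email": "ben@example.com", "customer_name": "Ben", "total": 8.0},
]


def split_tables(rows):
    """Split flat rows into a customers table and an orders table."""
    customers = {}  # email -> customer row, stored only once
    orders = []
    for row in rows:
        email = row["customer_email"]
        if email not in customers:
            customers[email] = {
                "customer_id": len(customers) + 1,
                "email": email,
                "name": row["customer_name"],
            }
        # Each order keeps only a reference to its customer.
        orders.append({
            "order_id": row["order_id"],
            "customer_id": customers[email]["customer_id"],
            "total": row["total"],
        })
    return list(customers.values()), orders


customers, orders = split_tables(flat_rows)
```

After the split, each customer's name is stored in one place, so correcting it once corrects it for all of that customer's orders.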
6. Identify and Manage Outliers
Outliers are data points that differ markedly from the rest of the data; they are
usually identified using graphs or tables. Removing Outliers or Transforming
Outliers is applied in this step, depending on the context. For example, on the
Midterm Exam the scores range from 70 to 90, but one student has a recorded score
of 200. This score is clearly unrealistic, so you must remove it for the results to
accurately reflect the other students.
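One common way to flag such a score is the interquartile range (IQR) rule, sketched below with Python's standard `statistics` module. The `scores` list and the multiplier `k=1.5` are assumptions for illustration; the rule itself is a swapped-in technique, not something the steps above prescribe.

```python
from statistics import quantiles

# Hypothetical midterm scores with one unrealistic entry.
scores = [70, 72, 75, 78, 80, 82, 85, 88, 90, 200]


def iqr_fences(values, k=1.5):
    """Return the (low, high) limits outside which a value counts as an outlier."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


low, high = iqr_fences(scores)
outliers = [s for s in scores if s < low or s > high]

# Option 1: remove the outliers.
kept = [s for s in scores if low <= s <= high]

# Option 2: transform the outliers by capping them at the limits.
capped = [min(max(s, low), high) for s in scores]
```

If you already know the valid range, for example that an exam score cannot exceed 100, a simple range check may be a better fit than a statistical rule.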
References: https://www.geeksforgeeks.org/what-is-data-cleaning/