🧹 Data Cleaning in Data Science / ML

Gursewak Singh
4 min read · Jan 7, 2023


Once the data is sourced, as discussed in my data sourcing article, the next step is to clean it.

Why do we need to clean it?

Nice question! When we receive data, whether from a government website or a repository, it is usually messy, unformatted, and full of irregularities such as missing values. If this data is used as is, without cleaning, it will drastically affect our further analysis and assumptions, and consequently hamper our machine learning model building process.

Let's understand it better with some examples.

Here I have data from a banking marketing campaign.

That's quite messy; check the first two rows! Those rows carry no real data, and believe me, I can't do any quality analysis (or at least my life gets harder) while they keep hanging around.

So, what I am going to do is skip them. I am using pandas (imported as pd) and skipping the first two rows while importing my CSV file.
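In code, that skip looks like the sketch below. The file contents and column names here are stand-ins for the campaign CSV (the original screenshot is not reproduced), so treat them as illustrative:

```python
import io
import pandas as pd

# A small stand-in for the campaign CSV: two junk lines before the real header.
raw = io.StringIO(
    "Bank Marketing Campaign Export\n"
    "Generated on 2023-01-07\n"
    "customerid,age,job\n"
    "1001,34.0,admin\n"
    "1002,,technician\n"
)

# skiprows=2 drops the first two lines so pandas reads the real header row.
df = pd.read_csv(raw, skiprows=2)
print(df.columns.tolist())  # ['customerid', 'age', 'job']
```

In the real case you would pass the CSV's file path instead of the `io.StringIO` buffer; `skiprows` works the same either way.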

Let's check our dataset now.

And this is exactly what I need. 👍

Here is another example. In the snapshot above, the customerid column is of no use to me, so I will drop that column.
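Dropping a column like that is a one-liner in pandas. The DataFrame below is a made-up stand-in for the snapshot; only the `customerid` column name comes from the article:

```python
import pandas as pd

# Illustrative stand-in for the campaign data.
df = pd.DataFrame({
    "customerid": [1001, 1002, 1003],
    "age": [34.0, 29.0, 41.0],
    "job": ["admin", "technician", "manager"],
})

# drop(columns=...) removes columns; rows are untouched.
df = df.drop(columns=["customerid"])
print(df.columns.tolist())  # ['age', 'job']
```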

And let's check the new dataset.

Perfect!! This is an example of a redundant, no-value column: it is of no use for my analysis.

Let's see an example where a column has the wrong data type.

I am interested in finding the average age, and that's fine. But I don't believe age should be stored as a float, i.e. with decimals (for now, let's treat float and decimal as the same for simplicity). I want my age column as integers.

Let's convert it to integer. The age column has some NaN values, so I simply replaced them with 0; an age of 0 won't affect my analysis.
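That fill-then-cast step can be sketched like this (the data is made up; the `fillna(0)` followed by `astype(int)` is the technique described above):

```python
import pandas as pd

# Stand-in column: a float age column with a missing value.
df = pd.DataFrame({"age": [34.0, None, 41.0]})

# Replace missing ages with 0, then cast the column to integer.
# astype(int) would fail on NaN, which is why fillna comes first.
df["age"] = df["age"].fillna(0).astype(int)
print(df["age"].tolist())  # [34, 0, 41]
```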

Now my age column is in a perfect state for further analysis.

These were some examples of data cleaning in practice.

I hope this article was helpful in understanding why data cleaning is an important part of data science and machine learning.

Checklist For Data Cleaning

Here is the checklist that I follow for data cleaning.

For fixing rows

  • Delete summary rows: Total and Subtotal rows
  • Delete incorrect rows: Header row and footer row
  • Delete extra rows: column numbers, indicators, blank rows, page numbers

For fixing columns

  • Merge two or more columns to create a unique identifier for each row
  • Split a column to get more data
  • Add column name if they are missing
  • Delete redundant columns
  • Align columns properly, especially when rows are skewed or columns are shifted to the right (if you are dealing with a lot of data and only a few rows, say 10–20, are skewed, deleting them is the better choice)
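The merge and split items above can be sketched as follows. All column names here (`first_name`, `dob`, etc.) are my own illustration, not from the campaign data:

```python
import pandas as pd

df = pd.DataFrame({
    "first_name": ["Amrit", "Neha"],
    "last_name": ["Kaur", "Sharma"],
    "dob": ["1990-05-12", "1987-11-03"],
})

# Merge two columns into one identifier for the row.
df["full_name"] = df["first_name"] + " " + df["last_name"]

# Split one column into several variables (year / month / day here).
df[["birth_year", "birth_month", "birth_day"]] = df["dob"].str.split("-", expand=True)

print(df["full_name"].tolist())    # ['Amrit Kaur', 'Neha Sharma']
print(df["birth_year"].tolist())   # ['1990', '1987']
```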

Important Points to keep in mind

  • Do not blindly remove a column just because it has lots of NaN values; it may be critical for your analysis. Make a careful decision before dropping any column.
  • Split columns to get more variables if that yields unique and useful information.
  • Bucketing by age group may yield new insights about the perspectives of different age groups.
  • It's a good habit to get an overview of the null values in your dataset.
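For the age-bucketing point, `pd.cut` is one common way to do it. The bin edges and labels below are my own choice for illustration:

```python
import pandas as pd

ages = pd.Series([19, 34, 47, 62])

# Bucket raw ages into labelled age groups; each bin is (left, right].
groups = pd.cut(ages, bins=[0, 25, 45, 65], labels=["young", "middle", "senior"])
print(groups.tolist())  # ['young', 'middle', 'senior', 'senior']
```

Grouping on the bucketed column (e.g. `df.groupby(groups)`) then summarises behaviour per age group instead of per individual age.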

For example, a quick count of null values per column will give you a better understanding of your dataset.
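One quick way to get that overview in pandas (the DataFrame here is a made-up example) is `isnull().sum()`, which counts missing values per column:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [34.0, None, 41.0],
    "job": ["admin", "technician", None],
})

# One missing-value count per column, in a single line.
null_counts = df.isnull().sum()
print(null_counts.to_dict())  # {'age': 1, 'job': 1}
```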

Check out my other articles on EDA (Exploratory Data Analysis) and Data Sourcing.

Till then, happy data science!


Written by Gursewak Singh

šŸ§‘ā€šŸ’»Software Engineer and a šŸ¤©Passionate Data Scientist | šŸŒ²Finds peace in writing| LinkedIn šŸ‘‰www.linkedin.com/in/ gursewak-singh-cosmic
