Why You Might Be Failing At Data Cleaning
Ultimate Guide to do Data Cleaning with Python
When it comes to data, most of you will agree that our analysis and predictions are only as good as the data we use: garbage in, garbage out. In machine learning, a model behaves according to the data we feed it, so dirty data leads to poor model performance, while clean data is a prerequisite for good performance. Of course, clean data does not always guarantee good performance; appropriate model selection is also critical, but even the best model cannot reach the necessary level without clean data.
In this article, I'll talk about the difficulties we face during data cleaning and the various remedies that help us build an effective and efficient model.
Missing Data
Sometimes it’s better not to fill the missing data in your dataset, as the empty and missing cells can reveal useful information about the dataset. Consider the following scenario:
- Missing or NA values in the tail or in the middle of the dataset indicate that there may be a technical issue with data collection. In this case, you have to analyze the data collection procedure for that particular sample sequence and try to figure out what's causing the problem.
- If more than 70% to 80% of a column is missing, the entire column can be dropped or removed.
- If NA values appear in a column that corresponds to an optional question on a form, that column can be recoded as a binary feature: 1 if the user answered, 0 if not. This process is commonly known as feature engineering (see the sketch just below).
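A minimal sketch of that last idea, using a hypothetical optional_comment column:

# 1 = the user answered the optional question, 0 = left it blank (NA)
df['optional_comment_answered'] = df['optional_comment'].notna().astype(int)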
In order to check the missing values in the dataset conveniently, we will use the missingno Python library. The matrix function of this library is very handy for cases like this. You can check all the features of the missingno library here.
import missingno as msno

# The matrix plot shows a white gap for every missing cell in each column
msno.matrix(df)
When it comes to imputation, there are various effective approaches: mean, median, mode, kNN, zero or a constant, and so on. Just keep in mind that each method has its own strengths and weaknesses, and there is no "perfect" imputation technique for every case.
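As a quick sketch of two of these options (the column names here are hypothetical, not tied to a specific dataset), mean or median imputation can be done with pandas or with scikit-learn's SimpleImputer:

# pandas: fill NAs in a numerical column with that column's mean
df['price'] = df['price'].fillna(df['price'].mean())

# scikit-learn: the same idea, reusable later inside pipelines
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='median')
df[['horsepower']] = imputer.fit_transform(df[['horsepower']])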
Inconsistent Data
Inconsistent data occurs when a column's unique classes have different representations. In a gender column, for example, there may be both male/female and M/F. In that situation there will be four classes, although there are really only two.
As we all know, there is no automation for this task, so we must inspect the classes manually. In the pandas library, the .unique function serves this purpose. Let us understand this with an example using the Car Data dataset.
# Keep only the first word of each car name (the manufacturer) and list the unique values
df['CarName'] = df['CarName'].str.split().str[0]
print(df['CarName'].unique())
We can clearly see that some values are the same but have different representations: maxda/mazda, Nissan/nissan, porcshce/porsche and toyouta/toyota are all examples of the inconsistent data problem. I use the .loc function of pandas to fix them.
df.loc[df['CarName'] == 'maxda', 'CarName'] = 'mazda'
df.loc[df['CarName'] == 'Nissan', 'CarName'] = 'nissan'
df.loc[df['CarName'] == 'porcshce', 'CarName'] = 'porsche'
df.loc[df['CarName'] == 'toyouta', 'CarName'] = 'toyota'
df.loc[df['CarName'] == 'vokswagen', 'CarName'] = 'volkswagen'
df.loc[df['CarName'] == 'vw', 'CarName'] = 'volkswagen'
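An equivalent and slightly more compact alternative, just a sketch of another way to do the same cleanup, is to map all the known misspellings at once with pandas .replace:

# Map each inconsistent spelling to its canonical manufacturer name in one call
name_fixes = {'maxda': 'mazda', 'Nissan': 'nissan', 'porcshce': 'porsche',
              'toyouta': 'toyota', 'vokswagen': 'volkswagen', 'vw': 'volkswagen'}
df['CarName'] = df['CarName'].replace(name_fixes)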
Outliers
We can define an outlier as an extremely high or extremely low data point relative to the nearest data points and the rest of the neighboring values in the dataset you're working with. In simple terms, outliers are extreme values that stand out from the overall pattern of a dataset.
Their presence dramatically affects the performance of mathematical models. Let's look at a simple example to understand why performance degrades:
In the left plot, there is no outlier and the line fits the data points quite well. In the right plot, there is an outlier, and as the line tries to cover all the points of the dataset, that single outlier shifts the line so much that it no longer fits at least half of the points.
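To see the same effect numerically rather than visually, here is a small self-contained sketch with made-up points: it fits a straight line to the same data with and without one extreme value and compares the slopes.

import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])
y_clean = np.array([2, 4, 6, 8, 10, 12])      # perfectly linear toy data
y_outlier = np.array([2, 4, 6, 8, 10, 60])    # same data with one extreme point

slope_clean, _ = np.polyfit(x, y_clean, 1)
slope_outlier, _ = np.polyfit(x, y_outlier, 1)
print(slope_clean, slope_outlier)             # the single outlier pulls the slope up sharply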
Now that we have an idea of why outliers are bad for a dataset, let's move on to some mathematical concepts.
It is necessary to define what is meant by "extremely" large or small data points. Values more than 1.5*IQR above Q3 or more than 1.5*IQR below Q1 are called outliers, where IQR, the interquartile range, is the difference between Q3 and Q1 (IQR = Q3 - Q1).
You can check the number of outliers in each column of the dataset with the function below.
def number_of_outlier(df):
    # Count values outside the 1.5*IQR fences in every numerical column
    df = df.select_dtypes(exclude='object')
    Q1 = df.quantile(0.25)
    Q3 = df.quantile(0.75)
    IQR = Q3 - Q1
    return ((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).sum()
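A minimal usage example, assuming df is the car dataset loaded earlier:

print(number_of_outlier(df))   # outlier count per numerical column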
Now coming to the treatment of outliers, one approach is to clip them to the IQR fences, i.e. to set values below Q1 - 1.5*IQR to that lower bound and values above Q3 + 1.5*IQR to that upper bound. Using the NumPy and pandas libraries, the code below does this: the lower_upper_range function finds the range outside of which values are outliers, and NumPy's clip function then limits each column to that range.
import numpy as np

def lower_upper_range(datacolumn):
    # 1.5*IQR fences: values outside [lower_range, upper_range] are outliers
    Q1, Q3 = np.percentile(datacolumn, [25, 75])
    IQR = Q3 - Q1
    lower_range = Q1 - (1.5 * IQR)
    upper_range = Q3 + (1.5 * IQR)
    return lower_range, upper_range

for col in df.select_dtypes(exclude='object').columns:
    lowerbound, upperbound = lower_upper_range(df[col])
    df[col] = np.clip(df[col], a_min=lowerbound, a_max=upperbound)
Duplicate Data
Duplicate data occurs when identical rows appear in the dataset. It can happen because of a mistake when combining data sources, a user submitting the same answer twice, and so on. The ideal way to handle the issue is simply to delete the duplicate rows.
The pandas .duplicated function can be used to check whether there are duplicate rows.
df.loc[df.duplicated()]
After identification, the pandas .drop_duplicates function drops the duplicate rows. Note that it returns a new DataFrame, so the result has to be assigned back.
df = df.drop_duplicates()
Invalid Data
This category contains values that are simply not logically correct. For example, someone's age is 480, the time spent on a task is -9 hours, or the height of a person is 1300 cm.
The pandas .describe function can be used to identify such errors.
df.describe()
This kind of error can be made because of the following two reasons:
- Data collection errors: the data engineer can type 1699 instead of 169 in the height column. This kind of random mistake can be treated as a null value and imputed alongside the other NAs (see the sketch after this list).
- Data manipulation errors: some columns of the dataset can be the output of functions coded by developers. For example, a function calculates age from birthdate and some of the answers come out negative, which means the formula is incorrect.
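As a small sketch of the first case (the column name and the 250 cm cut-off are hypothetical), implausible values can be converted to NAs and then imputed like any other missing value:

import numpy as np

# Treat physically impossible heights as missing so they can be imputed later
df.loc[df['height_cm'] > 250, 'height_cm'] = np.nan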
Data Leakage Problem: A Bonus
Before building the model, the dataset is split into train and test sets. The test set is the unseen data used to evaluate model performance. If the test set is somehow “seen” by the model during data cleaning or data preprocessing steps, it is called data leakage. That is why data should be split before the cleaning and preprocessing steps.
Let's take missing value imputation as an example. Suppose there are NAs in a numerical column and they are imputed with the mean. When this is done before the split, the mean of the whole dataset is used; if it is done after the split, the means of the train and test sets are computed separately.
The problem with the first scenario is that the imputed values in the test set will be related to the train set because the mean is of the whole dataset. So when the model is built with the train set, it will also “see” the test set. But our goal in splitting is to keep the test set completely fresh and use it as new data for performance evaluation. That is why the dataset has to be split before operations.
Now that we have discussed why the second scenario is the correct one, let's move on to how it can be implemented in code. At first glance you might say it is less efficient, and you may be right, but because data leakage is such an important problem, it is the way you should do it. To keep it efficient, the sklearn library provides pipelines. Simply put, a pipeline is the combination of all the manipulation steps: you send data in as input, and clean data comes out.
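Here is a minimal sketch of that idea, assuming a purely numerical feature matrix X and a target y: the split happens first, and the imputer inside the pipeline is fitted on the training rows only, so the test set stays unseen.

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

# Split first, so no statistic is ever computed on the test rows
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),   # mean learned from X_train only
    ('model', LinearRegression()),
])
pipe.fit(X_train, y_train)           # imputation statistics and model fitted on the train split
print(pipe.score(X_test, y_test))    # evaluation on the untouched test split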
I hope I have addressed all the issues we face as data scientists when it comes to data cleaning. If this blog has helped you gain some knowledge, please follow me and give it a clap.
Feel free to reach out and connect with me on LinkedIn: https://www.linkedin.com/in/akshatrajvanshi/