Real-world data often has missing values. So, why does data have missing values? There can be various reasons, such as values being missed during collection or getting corrupted afterwards, and it is important to know why the data is missing to get an intuition for sensible ways to treat it. Most machine learning algorithms do not accept missing values, so you usually have no option but to treat them.
In most datasets, missing values show up as blanks or as markers like “null”, “NaN” or “NA”, but that’s not all. Missing values can also appear as a question mark (?), a zero, a minus one (-1) or some other unconventional number. I came across a dataset where a numerical variable had values in the range 1-100, yet some of its values were “999”. That was not an actual observation but a missing value that had been filled in with “999”. It can make a huge difference to your model if you don’t treat these kinds of values before building it.
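Placeholders like these have to be turned into real missing values before pandas can count or fill them. Here is a minimal sketch, assuming a hypothetical DataFrame in which a `score` column uses 999 and a `grade` column uses "?" as the placeholder (the data and column names are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data: 999 and "?" stand in for missing values
df = pd.DataFrame({
    "score": [12, 999, 57, 88, 999],
    "grade": ["A", "?", "B", "?", "C"],
})

cleaned = df.copy()
# Replace each column's placeholder with NaN so pandas treats it as missing
cleaned["score"] = cleaned["score"].replace(999, np.nan)
cleaned["grade"] = cleaned["grade"].replace("?", np.nan)

# The placeholders are now counted as null values
missing_counts = cleaned.isnull().sum()
print(missing_counts)
```

If the data comes from a CSV file, the `na_values` argument of `pd.read_csv` can do the same conversion while loading.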
What do we do with missing values? I will discuss some of the possible ways to treat missing values here. However, what should be done depends on the nature of the dataset and missing values.
# Gives you the count of null values in each column
train.isnull().sum()
# Gives you the count of null values in each column, sorted in descending order
train.isnull().sum().sort_values(ascending=False)
Dropping missing values:
This is the naive way of handling missing values. It is rarely the best choice, because removing every row with a missing value shrinks your dataset, which may reduce the quality of your model. In machine learning data is gold, so you don’t want to waste it just by deleting it.
# Drops columns if they have at least 1 null value
train.dropna(axis=1, how='any')
# Drops rows only if all of their values are null
train.dropna(axis=0, how='all')
Imputing missing values:
There are a number of ways to fill missing values, and filling them is the most common way of handling them:
You can fill missing values with the mean, median or mode of the variable. However, the mean is greatly affected by outliers.
Let’s say we have a variable with the values 4, 3, 6, 1, 3, 5, 8, 5, 2, 1000.
The mean of the variable will be 103.7, whereas the median will be 4.5. So, here the median is a better measure than the mean.
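The effect of the outlier is easy to check in pandas. A small sketch using the values from the example (the variable names are only illustrative):

```python
import pandas as pd

values = pd.Series([4, 3, 6, 1, 3, 5, 8, 5, 2, 1000])

mean_value = values.mean()      # pulled far upward by the single outlier 1000
median_value = values.median()  # stays close to the typical values

print(mean_value, median_value)
```

Dropping the 1000 from the list brings the mean back in line with the rest of the values, while the median barely moves.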
You can also fill missing values with the forward fill or backward fill methods, which fill each missing value with the previous value and the next value respectively.
# Fill null values with forward fill in the whole dataframe
train.ffill()
# Fill null values with backward fill in only 1 column
train.Age.bfill()
# Fill null values with a constant value
train.Age.fillna(25)
# Fill null values with the mean value
train.Age.fillna(train.Age.mean())
Or you can fill a missing value with something like the average of the n previous values.
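One way to do this is with a rolling mean. This is only a sketch, assuming a hypothetical series and n = 3; it averages the observed values among the n previous rows and skips any that are themselves missing:

```python
import numpy as np
import pandas as pd

# Hypothetical series with two gaps
s = pd.Series([10.0, 12.0, 14.0, np.nan, 18.0, 20.0, np.nan])

n = 3  # how many previous rows to look at (an assumed choice)

# shift(1) excludes the current row, so each window holds only earlier rows;
# min_periods=1 lets the mean use whatever observed values the window contains
prev_mean = s.shift(1).rolling(window=n, min_periods=1).mean()

# Use the rolling mean only where the original value is missing
filled = s.fillna(prev_mean)
print(filled)
```

A gap at the very start of the series has no previous values to average, so it would stay missing and need another method.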
Predictive model for imputing missing values:
This method works pretty well in practice. Depending on the nature of the variable with missing values, you can use either a regression or a classification model to predict the missing data. So, how do you do that? Well, follow the steps given below:
Consider the variable having missing values as your output variable.
Consider all the variables with no missing value as your input variables.
Now, split the dataset: the train set should contain all the records where the output variable has no missing values, whereas the test set should contain all the records where the output variable is missing.
You can now build a regression or classification model (based on your output variable), fit it to the train set and then predict the test set.
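The steps above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the DataFrame and its column names (`x1`, `x2`, `y`) are hypothetical, and a plain least-squares linear regression from NumPy stands in for whatever regressor or classifier suits your data.

```python
import numpy as np
import pandas as pd

# Hypothetical data: y has gaps, x1 and x2 are complete
df = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
    "y":  [9.0, np.nan, 19.0, 18.0, np.nan, 28.0],
})

target = "y"             # the variable with missing values (output)
features = ["x1", "x2"]  # variables with no missing values (inputs)

# Train set: rows where the target is known; test set: rows where it is missing
known = df[df[target].notnull()]
unknown = df[df[target].isnull()]

def design_matrix(frame):
    # Feature columns plus a column of ones for the intercept
    X = frame[features].to_numpy(dtype=float)
    return np.column_stack([X, np.ones(len(X))])

# Fit a linear regression on the train set by least squares
coef, *_ = np.linalg.lstsq(design_matrix(known), known[target].to_numpy(), rcond=None)

# Predict the target for the test set and write the predictions into a copy
predicted = design_matrix(unknown) @ coef
imputed = df.copy()
imputed.loc[unknown.index, target] = predicted
print(imputed)
```

For a categorical output variable the same split applies, with a classifier in place of the regression.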
Take missing values as another feature:
If you have a significant number of missing values in a variable, you can make another feature from that variable. Let’s say you have a variable with missing values.
Then you can make a new feature from this variable in which the value is 0 if the original value is missing and 1 otherwise.
For example, a variable with the values A, missing, B, missing would give the new feature 1, 0, 1, 0.
# Convert the Cabin column of the DataFrame into a binary variable,
# taking 0 for missing values and 1 otherwise
train['Cabin'] = train['Cabin'].notnull().astype(int)
The approach to handling missing values depends highly on the nature of your data. Also, there is no single best way to handle missing values; you have to try different methods and see what works best for you. The more you work with different types of data, the more intuition you will gain about handling missing values.