What are outliers? How do outliers affect our models? Why is it important to identify and deal with them? According to Wikipedia:
In statistics, an outlier is an observation point that is distant from other observations.
An outlier is something different, something that stands out from the crowd.
Let’s say you have a variable.
In the variable above, all the values are within the range 1-10, but one value is well above that range. We can call this value an outlier for this variable.
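As a minimal sketch (the values here are made up for illustration), such a variable might look like this:

```python
# Hypothetical variable: nine values in the 1-10 range plus one extreme value
values = [2, 5, 3, 8, 1, 9, 4, 7, 6, 95]

print(min(values), max(values))  # 1 95 -- the 95 stands out from the rest
```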
Outliers may be due to the variability in the data or sometimes due to errors.
Now you must be wondering how an outlier affects a model.
In the image above, the blue line is the best fit for the values of x and y, but due to an outlier we get the red line as the actual fit to the data. The outlier has shifted the best-fit line towards itself, making the model worse. This is why it is important to deal with outliers before building a model. Sometimes the effect can be even worse than shown here.
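To see this effect numerically, here is a small sketch (the data is made up for illustration) that fits a straight line with NumPy’s `polyfit`, once on clean data and once with a single extreme point added:

```python
import numpy as np

# Clean data lying exactly on the line y = 2x
x = np.arange(10, dtype=float)
y = 2 * x

slope_clean, _ = np.polyfit(x, y, 1)  # recovers the true slope of 2

# Add one extreme point far above the trend and refit
x_out = np.append(x, 9.0)
y_out = np.append(y, 100.0)
slope_out, _ = np.polyfit(x_out, y_out, 1)

print(round(slope_clean, 2))  # 2.0
print(round(slope_out, 2))    # noticeably larger: the line is pulled toward the outlier
```

A single point was enough to visibly tilt the fitted line, which is exactly the shift from the blue line to the red line in the figure.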
There are two types of outliers: univariate and multivariate. Univariate outliers can be found by looking at the distribution of a single variable, whereas multivariate outliers are outliers in n-dimensional space.
The most commonly used methods to detect outliers are visualization techniques like the box plot and the scatter plot (as shown above in Figure 2 and Figure 1 respectively). Some other methods to detect outliers are:
Box plots use the IQR method to detect outliers. What is IQR? According to Wikipedia:
The interquartile range (IQR), also called the midspread or middle 50%, or technically H-spread, is a measure of statistical dispersion, being equal to the difference between 75th and 25th percentiles, or between upper and lower quartiles, IQR= Q3 − Q1. In other words, the IQR is the first quartile subtracted from the third quartile; these quartiles can be clearly seen on a boxplot on the data. It is a trimmed estimator, defined as the 25% trimmed range, and is a commonly used robust measure of scale.
If we sort a variable and divide it into four parts, the midpoint is called the median, whereas Q1 and Q3 are the 25th and 75th percentiles respectively. The IQR is the difference between Q3 and Q1.
IQR = Q3 – Q1
The upper bound in the box plot is Q3 + 1.5*IQR, whereas the lower bound is Q1 – 1.5*IQR. All points lying above the upper bound or below the lower bound are considered outliers.
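A minimal sketch of the IQR rule in Python (the data is made up for illustration):

```python
import numpy as np

data = np.array([2, 5, 3, 8, 1, 9, 4, 7, 6, 95], dtype=float)

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr  # lower bound
upper = q3 + 1.5 * iqr  # upper bound

# Points outside the two bounds are flagged as outliers
outliers = data[(data < lower) | (data > upper)]
print(outliers)  # [95.] -- the only point outside the bounds
```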
In the standard deviation method, all points that are more than 3 standard deviations away from the mean are considered outliers.
# remove observations where Age is greater or less than
# 3 times the standard deviation from the mean
train = train[np.abs(train.Age - train.Age.mean()) <= (3 * train.Age.std())]
Okay, so now you know what outliers are and how to identify them, but what do you do after that? How do you deal with them?
- A naive way of dealing with outliers is to remove them from your dataset, but that is only advisable when the outliers are few in number.
- Transforming the variable, for example by binning it or taking its log, can help reduce the influence of outliers.
- Imputing the outliers (just as you would impute missing values) or capping them at a maximum value can also be used to deal with them.
- We can also build a model to predict the output and then remove the observations where the actual value is farthest from the predicted value, using some cutoff.
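As an example of the capping approach, here is a sketch (the variable name and values are hypothetical) that caps values at the upper IQR bound instead of dropping them:

```python
import numpy as np

# Hypothetical ages, with one obvious data-entry error
age = np.array([22, 25, 27, 28, 30, 31, 35, 200], dtype=float)

q1, q3 = np.percentile(age, [25, 75])
upper = q3 + 1.5 * (q3 - q1)  # upper bound from the IQR rule

# Cap anything above the bound at the bound value instead of removing it
age_capped = np.where(age > upper, upper, age)
print(age_capped.max())  # the 200 has been pulled down to the bound
```

Capping keeps the row (and all its other columns) in the dataset, which matters when every observation is valuable.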
Outliers are one of the main problems when building a predictive model, but they are not always caused by errors. Let’s say there is a retail shop and you have its daily revenue data. The shop’s revenue will be higher on holidays (as people tend to buy more on holidays), which could be flagged as an outlier, but the fact that revenue goes up during holidays is itself important information. So outliers are not always bad; sometimes they convey important information.