R-bloggers

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

Outlier detection and treatment with R

Posted on December 9, 2016 by Selva Prabhakaran in R bloggers | 0 Comments

[This article was first published on DataScience+, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Outliers in data can distort predictions and affect the accuracy, if you don’t detect and handle them appropriately especially in regression models.

Why outliers treatment is important?

Because, it can drastically bias/change the fit estimates and predictions. Let me illustrate this using the cars dataset.

To better understand the implications of outliers better, I am going to compare the fit of a simple linear regression model on cars dataset with and without outliers. In order to distinguish the effect clearly, I manually introduce extreme values to the original cars dataset. Then, I predict on both the datasets.

# Inject outliers into data. cars1 

outliers effect on model fit

Plot of original data without outliers:

Notice the change in slope of the best fit line after removing the outliers. Had we used the outliers to train the model(left chart), our predictions would be exagerated (high error) for larger values of speed because of the larger slope.

Detect Outliers

Univariate approach
For a given continuous variable, outliers are those observations that lie outside 1.5 * IQR, where IQR, the ‘Inter Quartile Range’ is the difference between 75th and 25th quartiles. Look at the points outside the whiskers in below box plot.

outlier_values 

Bivariate approach

Visualize in box-plot of the X and Y, for categorical X’s

Box-plot:

What is the inference? The change in the level of boxes suggests that Month seem to have an impact in ozone_reading while Day_of_week does not. Any outliers in respective categorical level show up as dots outside the whiskers of the boxplot.

# For continuous variable (convert to categorical if needed.) boxplot(ozone_reading ~ pressure_height, data=ozone, main="Boxplot for Pressure height (continuos var) vs Ozone") boxplot(ozone_reading ~ cut(pressure_height, pretty(inputData$pressure_height)), data=ozone, main="Boxplot for Pressure height (categorial) vs Ozone", cex.axis=0.5)

Boxplot Continuous

You can see few outliers in the box plot and how the ozone_reading increases with pressure_height . Thats clear.

Multivariate Model Approach

Cook’s Distance
Cook’s distance is a measure computed with respect to a given regression model and therefore is impacted only by the X variables included in the model. But, what does cook’s distance mean? It computes the influence exerted by each data point (row) on the predicted outcome.

The cook’s distance for each observation i measures the change in Ŷ Y^ (fitted Y) for all observations with and without the presence of observation i, so we know how much the observation i impacted the fitted values.

Influence measures
In general use, those observations that have a cook’s distance greater than 4 times the mean may be classified as influential. This is not a hard boundary.

plot(cooksd, pch="*", cex=2, main="Influential Obs by Cooks distance") # plot cook's distance abline(h = 4*mean(cooksd, na.rm=T), col="red") # add cutoff line text(x=1:length(cooksd)+1, y=cooksd, labels=ifelse(cooksd>4*mean(cooksd, na.rm=T),names(cooksd),""), col="red") # add labels

Influential Obs by Cooks distance:

Now lets find out the influential rows from the original data. If you extract and examine each influential row 1-by-1 (from below output), you will be able to reason out why that row turned out influential. It is likely that one of the X variables included in the model had extreme values.

influential 4*mean(cooksd, na.rm=T))]) # influential row numbers head(ozone[influential, ]) # influential observations. #> Month Day_of_month Day_of_week ozone_reading pressure_height Wind_speed Humidity #> 19 1 19 1 4.07 5680 5 73 #> 23 1 23 5 4.90 5700 5 59 #> 58 2 27 5 22.89 5740 3 47 #> 133 5 12 3 33.04 5880 3 80 #> 135 5 14 5 31.15 5850 4 76 #> 149 5 28 5 4.82 5750 3 76 #> Temperature_Sandburg Temperature_ElMonte Inversion_base_height Pressure_gradient #> 19 52 56.48 393 -68 #> 23 69 51.08 3044 18 #> 58 53 58.82 885 -4 #> 133 80 73.04 436 0 #> 135 78 71.24 1181 50 #> 149 65 51.08 3644 86 #> Inversion_temperature Visibility #> 19 69.80 10 #> 23 52.88 150 #> 58 67.10 80 #> 133 86.36 40 #> 135 79.88 17 #> 149 59.36 70

Lets examine the first 6 rows from above output to find out why these rows could be tagged as influential observations.

Outliers Test

The function outlierTest from car package gives the most extreme observation based on the given model.

car::outlierTest(mod) #> No Studentized residuals with Bonferonni p Largest |rstudent|: #> rstudent unadjusted p-value Bonferonni p #> 243 3.045756 0.0026525 0.53845

This output suggests that observation in row 243 is most extreme.

Outliers package

The outliers package provides a number of useful functions to systematically extract outliers. Some of these are convenient and come handy, especially the outlier() and scores() functions.

Outliers
outliers gets the extreme most observation from the mean. If you set the argument opposite=TRUE, it fetches from the other side.

set.seed(1234) y=rnorm(100) outlier(y) #> [1] 2.548991 outlier(y,opposite=TRUE) #> [1] -2.345698 dim(y) [1] 2.415835 1.102298 1.647817 2.548991 2.121117 outlier(y,opposite=TRUE) #> [1] -2.345698 -2.180040 -1.806031 -1.390701 -1.372302

Scores
There are two aspects the the scores() function.
Compute the normalised scores based on “z”, “t”, “chisq” etc
Find out observations that lie beyond a given percentile based on a given score.

set.seed(1234) x = rnorm(10) scores(x) # z-scores => (x-mean)/sd scores(x, type="chisq") # chi-sq scores => (x - mean(x))^2/var(x) #> [1] 0.68458034 0.44007451 2.17210689 3.88421971 0.66539631 . . . scores(x, type="t") # t scores scores(x, type="chisq", prob=0.9) # beyond 90th %ile based on chi-sq #> [1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE scores(x, type="chisq", prob=0.95) # beyond 95th %ile scores(x, type="z", prob=0.95) # beyond 95th %ile based on z-scores scores(x, type="t", prob=0.95) # beyond 95th %ile based on t-scores

Treatment

Once the outliers are identified, you may rectify it by using one of the following approaches.

Imputation
Imputation with mean / median / mode. This method has been dealt with in detail in the discussion about treating missing values. Another robust method which we covered at DataScience+ is multivariate imputation by chained equations.

Capping
For missing values that lie outside the 1.5 * IQR limits, we could cap it by replacing those observations outside the lower limit with the value of 5th %ile and those that lie above the upper limit, with the value of 95th %ile. Below is a sample code that achieves this.

Prediction
In yet another approach, the outliers can be replaced with missing values NA and then can be predicted by considering them as a response variable. We already discussed how to predict missing values.

If you liked this post, you might find my video courses Introduction to R Programming and Mastering R Programming or to visit My Blog.

  1. R for Publication: Lesson 6, Part 2 – Linear Mixed Effects Models
  2. R for Publication: Lesson 6, Part 1 – Linear Mixed Effects Models
  3. Cross-Validation: Estimating Prediction Error
  4. Interactive Performance Evaluation of Binary Classifiers
  5. Predicting wine quality using Random Forests