R²

A personal blog about my fumblings with statistics, finance and anything R

lm() vs lmRob()

Let’s explore some robust regression. Once we have identified good-quality predictors for our purposes, it makes sense to use an estimation technique that is not easily thrown off by discrepancies in the data and stays reliable even when the data contains outliers.

Data and Libraries

Let’s get some data to work with. I am going to use the quantmod package. Let’s just use a simple example.

I’ll be using the dynlm() function instead of lm() since it can handle time series data directly.

Where there is lm() there is summary()
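Here is a minimal sketch of the fit-then-summarise step. The two series are simulated stand-ins (my assumption, not the real GDP and Industrial Production data), and I use plain lm() on a data frame; dynlm() takes the same kind of formula when the inputs are time series objects.

```r
# Simulated stand-ins for the two series (hypothetical data, for illustration only).
set.seed(1)
n   <- 60
ip  <- cumsum(rnorm(n, mean = 0.5))     # stand-in for the Industrial Production Index
gdp <- 10 + 2 * ip + rnorm(n, sd = 1)   # stand-in for GDP: linear in ip plus noise
dat <- data.frame(gdp = gdp, ip = ip)

# Ordinary least squares fit.
fit_ols <- lm(gdp ~ ip, data = dat)

# summary() adds standard errors, t values, p-values and R-squared.
fit_summary <- summary(fit_ols)
print(fit_summary$coefficients)
print(fit_summary$r.squared)
```

The coefficient table has one row per term, so the row named ip is the one to watch in everything that follows.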

Bootstrapping - OLS

Let’s bootstrap the OLS estimate and see what the bootstrapped standard error is.

I’ll use the boot() function to do this. Let’s also create a function that fits an lm model and returns the coefficient on the Industrial Production Index.
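A rough sketch of what that looks like, again on simulated stand-in data (the data, the helper name ip_coef and the choice of R = 500 replicates are my assumptions). boot() expects a statistic function that takes the data and a vector of resampled row indices.

```r
library(boot)  # recommended package, ships with standard R installations

# Simulated stand-in data (hypothetical, for illustration only).
set.seed(42)
n   <- 60
ip  <- rnorm(n, mean = 100, sd = 5)
gdp <- 10 + 2 * ip + rnorm(n, sd = 4)
dat <- data.frame(gdp = gdp, ip = ip)

# Statistic for boot(): refit on the resampled rows and return the slope on ip.
ip_coef <- function(data, indices) {
  fit <- lm(gdp ~ ip, data = data[indices, ])
  unname(coef(fit)["ip"])
}

boot_ols <- boot(data = dat, statistic = ip_coef, R = 500)

# t0 is the estimate on the original data; t holds the bootstrap replicates.
boot_bias <- mean(boot_ols$t[, 1]) - boot_ols$t0
boot_se   <- sd(boot_ols$t[, 1])
print(c(bias = boot_bias, se = boot_se))
```

Printing boot_ols reports the same bias and standard error, and plot(boot_ols) draws the histogram and q-q plot of the replicates.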

The coefficient on the Industrial Production Index has a rather high standard error but a much smaller bias. The bias and standard error can be very easily calculated as shown below. Also, we can plot a histogram and a q-q plot of the bootstrapped estimates using the plot() function. Note that t here refers to the value of the coefficient on the Industrial Production Index.

Outlier

It is well known that OLS estimates are sensitive to small changes in the data and to outliers. We can show this empirically. Let’s artificially introduce an outlier into our data.

Let’s first identify the data point with the highest Cook’s distance. Looks like it’s the 14th data point in our GDP dataset. Let’s try to distort this specific data point.

The coefficient on the Industrial Production Index has changed by 13.3450079%. We can bootstrap the estimate again to check how the standard error of the bootstrapped estimate changes.

Notice that the plots seem to suggest that the bootstrapped estimates are skewed (unlike in the first case). Let’s go through the same exercise using lmRob().
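The outlier experiment above can be sketched roughly like this. The data is simulated, and the distortion (multiplying the response by 1.5) is an arbitrary choice of mine, so the influential point and the percentage change will not match the numbers quoted in this post.

```r
# Simulated stand-in data (hypothetical, for illustration only).
set.seed(7)
n   <- 60
ip  <- rnorm(n, mean = 100, sd = 5)
gdp <- 10 + 2 * ip + rnorm(n, sd = 4)
dat <- data.frame(gdp = gdp, ip = ip)

fit_clean <- lm(gdp ~ ip, data = dat)

# Most influential observation according to Cook's distance.
worst <- unname(which.max(cooks.distance(fit_clean)))

# Distort that single observation (the factor 1.5 is an arbitrary assumption).
dat_out <- dat
dat_out$gdp[worst] <- dat_out$gdp[worst] * 1.5

fit_out <- lm(gdp ~ ip, data = dat_out)

# Percentage change in the slope caused by one distorted point.
slope_clean <- unname(coef(fit_clean)["ip"])
slope_out   <- unname(coef(fit_out)["ip"])
pct_change  <- 100 * (slope_out - slope_clean) / slope_clean
print(worst)
print(pct_change)
```

Picking the point with the largest Cook’s distance is a deliberate worst case: it is the observation whose removal would move the fitted values the most.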

lmRob()

Let’s first fit the data and see what the coefficients look like and then we can bootstrap the estimates.
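As a hedged sketch of the comparison, the snippet below fits OLS and a robust model to simulated data with one planted outlier. It calls lmRob() from the robust package when that package is installed; otherwise it falls back to MASS::rlm() with method = "MM", which is a different function implementing a related MM-estimator and not lmRob() itself. The data, the true slope of 2 and the size of the outlier are all my assumptions.

```r
library(MASS)  # fallback only; recommended package shipped with standard R

# Simulated stand-in data with a known slope of 2 (hypothetical).
set.seed(123)
n   <- 100
ip  <- rnorm(n)
gdp <- 1 + 2 * ip + rnorm(n, sd = 0.5)
dat <- data.frame(gdp = gdp, ip = ip)

# Plant one gross outlier at the largest ip value.
worst <- which.max(dat$ip)
dat$gdp[worst] <- dat$gdp[worst] + 30

fit_ols <- lm(gdp ~ ip, data = dat)

fit_rob <- if (requireNamespace("robust", quietly = TRUE)) {
  robust::lmRob(gdp ~ ip, data = dat)
} else {
  rlm(gdp ~ ip, data = dat, method = "MM")  # swapped-in MM-estimator, not lmRob()
}

# Slopes side by side: the robust fit should stay closer to the true value of 2.
slopes <- c(ols    = unname(coef(fit_ols)["ip"]),
            robust = unname(coef(fit_rob)["ip"]))
print(slopes)

# Residual of the planted outlier under each fit.
print(c(ols    = unname(residuals(fit_ols)[worst]),
        robust = unname(residuals(fit_rob)[worst])))
```

The same statistic-function pattern used with boot() for lm() works here too: swap the lm() call inside the function for the robust fit and bootstrap the ip coefficient as before.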

Note the lower R-squared value. Now let’s introduce the outlier as we did earlier and see what happens.

Note how the coefficient on the Industrial Production Index is the same as in the case with no outlier. The bootstrapped estimates are fairly normally distributed. But note that the bootstrapped estimates have a higher bias but a lower standard error. In the presence of outliers, robust estimation might make more sense. This effect is much better seen in the following plot of the residuals from the two (OLS vs. robust) fits.

Thoughts? Feel free to comment below! Thanks!