Comparison Between Theil-Sen Regressor and Simple Linear Regression
Theil-Sen Regression
Objective: To show that the Theil-Sen regressor is more robust to outliers than simple linear regression.
# Import models and plotting tools
from sklearn.linear_model import TheilSenRegressor, LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import r2_score
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
np.random.seed(42)
Create a dataset with X as the number of coin flips and y as the number of heads observed in those flips.
Then add 11 outliers whose head counts are far from what a fair coin would produce (some even impossibly high, with more heads than flips).
# Create a dataset: y counts the heads in i fair-coin flips, for i = 1..100
def coin():
    return np.random.choice(['Heads', 'Tails'])

X = list(range(1, 101))
df = pd.DataFrame({
    'X': X,
    'y': [sum(coin() == 'Heads' for _ in range(i)) for i in X]
})
# Add outliers to the dataframe
df_outliers = pd.DataFrame({
    'X': [50, 30, 40, 35, 45, 80, 90, 100, 100, 60, 70],
    'y': [100, 70, 80, 20, 5, 80, 95, 10, 120, 6, 8]
})
df = pd.concat((df, df_outliers))
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 111 entries, 0 to 10
Data columns (total 2 columns):
X 111 non-null int64
y 111 non-null int64
dtypes: int64(2)
memory usage: 2.6 KB
Plot the dataset to see the outliers.
sns.lmplot(x='X', y='y', data=df);
Here, I split the data 50/50 into training and test sets.
# Train-test split
X = df[['X']]
y = df['y']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, random_state=42)
Instantiate the linear regression model and fit it on X_train and y_train.
Then score the model on both the training and test sets, and run 5-fold cross-validation on the full dataset to get the mean R² score.
# Create linear model
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)
print(linear_model.score(X_train, y_train))
print(linear_model.score(X_test, y_test))
cross_val_score(linear_model, X, y, cv=5).mean()
0.3788042856591475
0.7084611406650192
-0.28135931886744714
Instantiate the Theil-Sen regressor and fit it on X_train and y_train.
Then score it the same way: on the training and test sets, plus 5-fold cross-validation on the full dataset.
# Create Theil-Sen model
theilsen_model = TheilSenRegressor()
theilsen_model.fit(X_train, y_train)
print(theilsen_model.score(X_train, y_train))
print(theilsen_model.score(X_test, y_test))
cross_val_score(theilsen_model, X, y, cv=5).mean()
0.3525989639206396
0.8499956799863967
0.2970249639222725
Plot each model's predictions on X_train to get a better sense of what happened.
yhat_train_linear = linear_model.predict(X_train)
yhat_train_theilsen = theilsen_model.predict(X_train)
plt.figure(figsize=(10,5))
plt.xlabel('Number of coin flips')
plt.ylabel('Number of heads')
plt.plot(X_train, yhat_train_linear, color='r', label='Linear Model')
plt.plot(X_train, yhat_train_theilsen, color='b', label='TheilSen Model')
plt.legend()
plt.scatter(X_train, y_train);
Plot each model's predictions on X_test.
yhat_test_linear = linear_model.predict(X_test)
yhat_test_theilsen = theilsen_model.predict(X_test)
plt.figure(figsize=(10,5))
plt.xlabel('Number of coin flips')
plt.ylabel('Number of heads')
plt.plot(X_test, yhat_test_linear, color='r', label='Linear Model')
plt.plot(X_test, yhat_test_theilsen, color='b', label='TheilSen Model')
plt.legend()
plt.scatter(X_test, y_test);
We can see that although linear regression scores slightly better on the training set, it does worse on the test set and much worse in cross-validation: its mean R² is negative (about -0.28, i.e. worse than always predicting the mean of y), versus about 0.30 for Theil-Sen.
From the plots, the Theil-Sen model largely ignores the outliers and fits the bulk of the y values, whereas the linear model's line is pulled up or down by the outliers.
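As a quick sanity check (a sketch using the models fitted above; the exact numbers depend on the random seed and split): a fair coin comes up heads about half the time, so the true slope should be close to 0.5, and the robust fit should sit nearer to it.
# For a fair coin, E[y] = 0.5 * X, so a good fit should have slope ~0.5
print('Linear slope:   ', linear_model.coef_[0])
print('Theil-Sen slope:', theilsen_model.coef_[0])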
Additional notes from research:
- In the simple (two-dimensional) linear regression case, the Theil-Sen estimator has a breakdown point of about 29.3%, meaning it can tolerate arbitrarily corrupted data (outliers) making up roughly 29.3% of the sample; see the first sketch after this list.
- The slope/coefficient is the median of the slopes of all lines through pairs of sample points; a minimal implementation follows the first sketch.
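A rough illustration of the breakdown point, per the first note above. This is a sketch rather than part of the original analysis: it assumes a clean y = 0.5x relationship, pins the outliers at an arbitrary value of 500, and reuses the imports from the top of the notebook. Below roughly 29.3% corruption the Theil-Sen slope should stay near 0.5; past that point robustness is no longer guaranteed, although a particular corruption pattern may still happen to be survivable.
# Corrupt a growing fraction of a clean y = 0.5 * x dataset and compare slopes
rng = np.random.default_rng(0)
x_clean = np.arange(1, 101, dtype=float).reshape(-1, 1)
y_clean = 0.5 * x_clean.ravel()
for frac in [0.10, 0.25, 0.40]:
    y_bad = y_clean.copy()
    bad_idx = rng.choice(len(y_bad), int(frac * len(y_bad)), replace=False)
    y_bad[bad_idx] = 500  # gross outliers at an arbitrary constant
    ts_slope = TheilSenRegressor(random_state=0).fit(x_clean, y_bad).coef_[0]
    lr_slope = LinearRegression().fit(x_clean, y_bad).coef_[0]
    print(f'{frac:.0%} corrupted: Theil-Sen slope {ts_slope:.2f}, '
          f'linear slope {lr_slope:.2f}')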
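The second note can be made concrete in a few lines of NumPy. pairwise_median_slope below is a hypothetical helper, not scikit-learn API, and scikit-learn's TheilSenRegressor actually generalizes this idea via a spatial median, so its coefficients can differ slightly from the classic pairwise estimate.
# Classic Theil-Sen estimate: the median slope over all pairs of points
def pairwise_median_slope(x, y):
    n = len(x)
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i in range(n) for j in range(i + 1, n)
              if x[j] != x[i]]            # skip pairs sharing an x value
    slope = np.median(slopes)
    intercept = np.median(y - slope * x)  # one common choice of intercept
    return slope, intercept

print(pairwise_median_slope(df['X'].to_numpy(dtype=float),
                            df['y'].to_numpy(dtype=float)))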