TL;DR: Learn what regression is, how to train a simple model to solve a regression problem, and how to evaluate it

In this tutorial, you'll learn:

- What regression is
- How to train a simple regression model
- How to evaluate regression models

Dwight is doing very well for himself. He starts thinking about expanding his business. The farm is too small to accommodate his desire for total world domination. He needs more farmland.

He collected some offers from a local webpage (scrantonfarmers.com) and made a spreadsheet out of them. Dwight wants to know which offers are good:

```python
import pandas as pd

data = pd.DataFrame(
    dict(
        area=[100, 200, 300],
        clay=[0.2, 0.15, 0.3],
        sand=[0.2, 0.4, 0.5],
        soil_depth=[2, 3, 1],
        price=[20000, 40000, 25000],
    )
)
data
```

| area | clay | sand | soil_depth | price |
|---|---|---|---|---|
| 100 | 0.20 | 0.2 | 2 | 20000 |
| 200 | 0.15 | 0.4 | 3 | 40000 |
| 300 | 0.30 | 0.5 | 1 | 25000 |

He is hunting for a deal: something that has a high predicted price but is selling for cheap. Dwight wants fertile soil, with just the right amounts of clay and sand and a sufficient soil depth.

According to his research, this means around 20 percent clay, 40 percent sand and 20 percent silt. The depth should be more than 4 feet. He is looking for value (and the possibility to brag about his achievements in front of colleagues).

The plan is to make a model that tries to predict the price of the land. Later on, when an ad for a new property comes around, we'll compare the predicted price with that from the ad. If the predicted price is significantly lower - Dwight will take a look at the property.
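The comparison itself can be sketched as a small helper. Note that the 10 percent discount threshold here is an assumption for illustration, not something from Dwight's notes:

```python
# A minimal sketch of the deal-screening rule. The default 10% discount
# threshold is a made-up example value, not part of the original plan.
def looks_like_a_deal(predicted_price, advertised_price, discount=0.10):
    """Flag a listing whose ad price is at least `discount` below the model's prediction."""
    return advertised_price <= predicted_price * (1 - discount)

print(looks_like_a_deal(50000, 44000))  # priced well below the prediction -> True
print(looks_like_a_deal(50000, 49000))  # roughly a fair price -> False
```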

We'll start by splitting the features from the target (predicted) variables:

```python
X = data[["area", "clay", "sand", "soil_depth"]]
y = data.price

X
```

| area | clay | sand | soil_depth |
|---|---|---|---|
| 100 | 0.20 | 0.2 | 2 |
| 200 | 0.15 | 0.4 | 3 |
| 300 | 0.30 | 0.5 | 1 |

```python
y
```

```
0    20000
1    40000
2    25000
Name: price, dtype: int64
```

Using the `LinearRegression` model from scikit-learn is a simple way to build a regression model:

```python
from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(X, y)
```

Dwight found a new ad and collected the data:

```python
new_land_data = dict(area=150, clay=0.2, sand=12, soil_depth=4)
new_land_data
```

{'area': 150, 'clay': 0.2, 'sand': 12, 'soil_depth': 4}

And got the predicted price from the trained model:

```python
new_data = [list(new_land_data.values())]
predictions = regressor.predict(new_data)
predictions
```

array([51910.44815217])

The advertised price of the property is 53,000 - the model did well.

Our model learns parameter values (just as it did in the classification example) based on the training data. Each feature gets a parameter:

```python
regressor.coef_
```

array([ 82.81744737, -773.46661605, 386.87823856, 11602.20628431])

Another parameter (specific to this model) is the intercept, which we also need to take into account:

```python
regressor.intercept_
```

-11408.839630300161

To obtain the final prediction, the model calculates a weighted sum of the feature values and the learned parameters, then adds the intercept:

```python
import numpy as np

np.dot(new_data[0], regressor.coef_.flatten()) + regressor.intercept_
```

51910.44815217175

To evaluate how well our regression model is doing, we can call the `score()` method:

```python
regressor.score(X, y)
```

1.0

Under the hood, the scoring method uses the $R^2$ coefficient of determination. Intuitively, $R^2$ tells us how accurate the model is: the closer it is to 1.0, the better. In other words, it tells us how much of the possible error is eliminated by our model.
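To see what "eliminating error" means, consider a baseline model that always predicts the mean price. Such a model removes none of the error, so its $R^2$ is exactly 0. This toy check is a quick aside, not part of Dwight's workflow:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([20000, 40000, 25000])
# A baseline "model" that always predicts the mean price:
baseline = np.full_like(y_true, y_true.mean(), dtype=float)
print(r2_score(y_true, baseline))  # 0.0
```

Any model with a positive $R^2$ is doing better than this constant-mean baseline; a negative $R^2$ means it is doing worse.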

The coefficient is defined as:

$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$

where $SS_{tot}$ is the total sum of squares:

$SS_{tot} = \sum_{i}(y_i - \overline{y})^2$

where $\overline{y}$ is the mean of the observed data:

$\overline{y} = \frac{1}{n}\sum_{i=1}^n y_i$

Let's calculate $SS_{tot}$ for our data:

```python
y = y.to_numpy()
mean_price = np.mean(y)
print(f"Property mean price: {mean_price}")

squared_differences = np.square(y - mean_price)
ss_tot = np.sum(squared_differences)
print(f"Total sum of squares: {ss_tot}")
```

```
Property mean price: 28333.333333333332
Total sum of squares: 216666666.66666666
```

$SS_{res}$ is the residual sum of squares:

$SS_{res}=\sum_i(y_i - f_i)^2$

where $f_i$ is the prediction at position $i$. We define the residuals as the difference between the real and predicted value:

```python
predictions = regressor.predict(X)
residuals = y - predictions
residuals
```

array([-3.63797881e-12, 0.00000000e+00, 3.63797881e-12])

```python
predictions
```

array([20000., 40000., 25000.])

Now we can calculate the residual sum of squares:

```python
ss_res = np.sum(np.square(residuals))
print(f"Residual sum of squares: {ss_res}")
```

Residual sum of squares: 2.6469779601696886e-23


We have all the components to calculate the coefficient of determination $R^2$:

```python
r2 = 1 - (ss_res / ss_tot)
r2
```

1.0

According to this metric, our model is just perfect (on the training data). We can double-check that using the `r2_score()` function from scikit-learn:

```python
from sklearn.metrics import r2_score

r2_score(y, predictions)
```

1.0

Looks like our calculations pan out. Now you know about one commonly used way to evaluate regression models!

In this tutorial, you learned: