## What is Regression

TL;DR: Learn what regression is, how to train a simple model to solve a regression problem, and how to evaluate it.

In this tutorial, you'll learn:

• What is regression?
• Train a simple regression model
• How to evaluate regression tasks

Dwight is doing very well for himself. He starts thinking about expanding his business. The farm is too small to accommodate his desire for total world domination. He needs more farmland.

He collected some offers from a local webpage (scrantonfarmers.com) and made a spreadsheet out of them. Dwight wants to know which offers are good:

import pandas as pd

data = pd.DataFrame(
    dict(
        area=[100, 200, 300],
        clay=[0.2, 0.15, 0.3],
        sand=[0.2, 0.4, 0.5],
        soil_depth=[2, 3, 1],
        price=[20000, 40000, 25000],
    )
)
data

| area | clay | sand | soil_depth | price |
|------|------|------|------------|-------|
| 100  | 0.20 | 0.2  | 2          | 20000 |
| 200  | 0.15 | 0.4  | 3          | 40000 |
| 300  | 0.30 | 0.5  | 1          | 25000 |

He is hunting for a deal. Something that has a high predicted price and is selling for cheap. Dwight wants fertile soil - with just the right amounts of clay, sand and soil depth.

According to his research, this means around 20 percent clay, 40 percent sand and 20 percent silt. The depth should be more than 4 feet. He is looking for value (and the possibility to brag about his achievements in front of colleagues).

The plan is to build a model that predicts the price of the land. Later on, when an ad for a new property comes around, we'll compare the predicted price with the advertised one. If the predicted price is significantly higher than the asking price, the land is selling for less than the model thinks it is worth, and Dwight will take a look at the property.
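The comparison step can be sketched as a small helper. The function name `is_worth_a_look` and the 10 percent `margin` are assumptions made for illustration, not part of Dwight's spreadsheet; the rule follows the stated goal of finding land with a high predicted price and a cheap asking price.

```python
def is_worth_a_look(predicted_price, advertised_price, margin=0.10):
    """Flag an ad when the model values the land well above its asking price.

    `margin` is a hypothetical threshold: 0.10 means the predicted price
    must exceed the advertised price by more than 10 percent.
    """
    return predicted_price > advertised_price * (1 + margin)


# Illustrative numbers only, not taken from real ads.
print(is_worth_a_look(predicted_price=60_000, advertised_price=50_000))  # True
print(is_worth_a_look(predicted_price=51_000, advertised_price=53_000))  # False
```

How large the margin should be is a judgment call; a bigger value means fewer, but more promising, properties to visit.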

We'll start by separating the features from the target (the variable we want to predict):

X = data[["area", "clay", "sand", "soil_depth"]]
y = data.price
X

| area | clay | sand | soil_depth |
|------|------|------|------------|
| 100  | 0.20 | 0.2  | 2          |
| 200  | 0.15 | 0.4  | 3          |
| 300  | 0.30 | 0.5  | 1          |

y
0    20000
1    40000
2    25000
Name: price, dtype: int64

## Training a model

Using the LinearRegression model from scikit-learn is a simple way to build a regression model:

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor = regressor.fit(X, y)

Dwight found a new ad and collected the data:

new_land_data = dict(area=150, clay=0.2, sand=12, soil_depth=4)
new_land_data
{'area': 150, 'clay': 0.2, 'sand': 12, 'soil_depth': 4}

And got the predicted price from the trained model:

new_data = [list(new_land_data.values())]
predictions = regressor.predict(new_data)
predictions
array([51910.44815217])

The advertised price of the property is 53,000, so the model's estimate of about 51,910 is close to what the seller is asking.
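Note that `list(new_land_data.values())` works here because the dictionary keys happen to be in the same order as the training columns. A slightly more defensive sketch builds the row from an explicit column list; the names `feature_names` and `to_feature_row` are introduced only for this example.

```python
# Column order used when the model was trained.
feature_names = ["area", "clay", "sand", "soil_depth"]


def to_feature_row(record, names=feature_names):
    """Return the record's values in training-column order.

    Raises KeyError if a feature is missing from the record.
    """
    return [record[name] for name in names]


# Same ad as above, but with the keys deliberately shuffled.
shuffled_ad = dict(soil_depth=4, sand=12, area=150, clay=0.2)
new_data = [to_feature_row(shuffled_ad)]
print(new_data)  # [[150, 0.2, 12, 4]]
```

The resulting `new_data` can be passed to `regressor.predict()` exactly as before.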

## How does it work?

Our model learns parameter values (just as it did in the classification example) based on the training data. Each feature gets a parameter:

regressor.coef_
array([   82.81744737,  -773.46661605,   386.87823856, 11602.20628431])

Another parameter (specific to this model) is the intercept that we should take into account:

regressor.intercept_
-11408.839630300161

To obtain the final prediction, the model computes a weighted sum of the feature values, using the learned coefficients as weights, and adds the intercept:

import numpy as np
np.dot(new_data[0], regressor.coef_.flatten()) + regressor.intercept_
51910.44815217175
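The same calculation can be written out term by term without NumPy. The sketch below copies the coefficient and intercept values as they were printed above, so the result matches the model's prediction only up to rounding.

```python
# Values copied from the outputs printed above (rounded as displayed).
coefficients = [82.81744737, -773.46661605, 386.87823856, 11602.20628431]
intercept = -11408.839630300161
features = [150, 0.2, 12, 4]  # area, clay, sand, soil_depth

# Multiply each feature by its coefficient, sum the products, add the intercept.
weighted_sum = sum(w * x for w, x in zip(coefficients, features))
prediction = weighted_sum + intercept
print(round(prediction, 2))  # 51910.45
```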

## How are Regression models evaluated?

To evaluate how well our regression model is doing, we can call the score() method:

regressor.score(X, y)
1.0

Under the hood, score() uses the coefficient of determination, $R^2$. Intuitively, $R^2$ tells us what fraction of the variation in the target the model explains, compared with always predicting the mean: the closer to 1.0, the better. A score of 0 means the model does no better than predicting the mean, and the score can be negative for a model that does worse.

The coefficient is defined as:

$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$

where $SS_{tot}$ is the total sum of squares:

$SS_{tot} = \sum_{i}(y_i - \overline{y})^2$

where $\overline{y}$ is the mean of the observed data:

$\overline{y} = \frac{1}{n}\sum_{i=1}^n y_i$

Let's calculate $SS_{tot}$ for our data:

y = y.to_numpy()
mean_price = np.mean(y)
print(f"Property mean price: {mean_price}")
squared_differences = np.square((y - mean_price))
ss_tot = np.sum(squared_differences)
print(f"Total sum of squares: {ss_tot}")
Property mean price: 28333.333333333332
Total sum of squares: 216666666.66666666

$SS_{res}$ is the residual sum of squares:

$SS_{res}=\sum_i(y_i - f_i)^2$

where $f_i$ is the prediction at position $i$. We define the residuals as the differences between the real and predicted values:

predictions = regressor.predict(X)
residuals = (y - predictions)
residuals
array([-3.63797881e-12,  0.00000000e+00,  3.63797881e-12])
predictions
array([20000., 40000., 25000.])
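Residuals this close to zero are expected here: with three rows and five free parameters (four coefficients plus the intercept), a linear model can reproduce these three target values, or any other three, exactly. The sketch below illustrates this with NumPy's least-squares solver rather than scikit-learn, and the alternative prices in `other_prices` are made up for the demonstration.

```python
import numpy as np

# Training features from the spreadsheet, with a leading column of ones
# that plays the role of the intercept.
X_aug = np.array([
    [1.0, 100, 0.20, 0.2, 2],
    [1.0, 200, 0.15, 0.4, 3],
    [1.0, 300, 0.30, 0.5, 1],
])

# The real prices plus a made-up set of prices for comparison.
real_prices = np.array([20000.0, 40000.0, 25000.0])
other_prices = np.array([5000.0, 90000.0, 1000.0])

for prices in (real_prices, other_prices):
    # lstsq returns the minimum-norm solution of this underdetermined system.
    weights, *_ = np.linalg.lstsq(X_aug, prices, rcond=None)
    fitted = X_aug @ weights
    print(np.allclose(fitted, prices))  # True for both targets
```

This is why a perfect fit on three training rows says little about how the model will price new land.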

Now we can calculate the residual sum of squares:

ss_res = np.sum(np.square(residuals))
print(f"Residual sum of squares: {ss_res}")
Residual sum of squares: 2.6469779601696886e-23
y
array([20000, 40000, 25000])

We have all the components to calculate the coefficient of determination $R^2$:

r2 = 1 - (ss_res / ss_tot)
r2
1.0

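According to this metric, our model is perfect on the training data. That is less impressive than it sounds: with only three training examples and four features, a linear model can fit the data exactly, so this score tells us nothing about how it will do on new properties. We can double-check the calculation using the r2_score() function from scikit-learn: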

from sklearn.metrics import r2_score
r2_score(y, predictions)
1.0

Looks like our calculations pan out. Now you know about one commonly used way to evaluate regression models!
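To see how $R^2$ behaves when the fit is not perfect, the formula can be sketched in plain Python. The function name `r_squared` and the two sets of predictions are hypothetical, chosen only to show the range of the score.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    ss_res = sum((y - f) ** 2 for y, f in zip(y_true, y_pred))
    return 1 - ss_res / ss_tot


prices = [20000, 40000, 25000]
mean_price = sum(prices) / len(prices)

# Always predicting the mean explains none of the variation.
print(r_squared(prices, [mean_price] * 3))  # 0.0

# Made-up predictions that are each off by 2,000.
print(r_squared(prices, [22000, 38000, 27000]))  # about 0.945
```

Predictions that are off by 2,000 each still score well here because the prices themselves vary by much more than that.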

## Summary

In this tutorial, you learned: