TL;DR: Learn how to approach classification problems and how classification models work under the hood.

Classifying an object is the action of picking a category for it, from a set of predefined ones. In the simplest case, you must pick from two categories - this is known as **binary classification**. When you have a more general problem (more than two categories), the problem is known as **multiclass classification**.

Classification problems are common in practice. A wide variety of algorithms exist that can solve them. We'll look at some of them in this tutorial.

In this tutorial, you'll learn:

- What classification is
- How to classify objects into two categories
- How to classify objects into multiple categories
- How simple classification models work under the hood

From a Machine Learning perspective, classification means choosing one label from a set of existing ones, based on the features that describe the item.

Binary classification is one of the most commonly used types of classification. It answers yes/no questions. Some examples include:

- Is this email spam?
- Is this transaction fraudulent?
- Is this customer going to sign up for the pro plan?

Examples of multiclass classification include:

- Classifying tweet sentiment (negative, neutral or positive)
- Assigning a category to a news article
- Choosing the type of expert that should handle a customer support ticket

There is a crisis on Dwight's farm. The beets are ready for harvesting and sorting. Of course, different sizes of beets should have different prices, so they need to be sorted. With more than 10,000 beets, doing this by hand is not going to be easy. Can Machine Learning help?

Let's look at some of the data that Dwight has collected:

```python
import pandas as pd

data = pd.DataFrame(
    dict(
        width=[1.2, 1.3, 1.5, 1.3],
        height=[1.4, 1.6, 2.2, 2.0],
        leaves_count=[4, 6, 7, 8],
        price=["LOW", "LOW", "HIGH", "HIGH"],
    )
)
data
```

width | height | leaves_count | price |
---|---|---|---|
1.2 | 1.4 | 4 | LOW |
1.3 | 1.6 | 6 | LOW |
1.5 | 2.2 | 7 | HIGH |
1.3 | 2.0 | 8 | HIGH |

We have the width, height, and number of leaves for some beets (the features). Our job is to predict the price bracket for each one - low or high (the target variable - the one we want to predict).

Let's split those into two different variables:

```python
X = data[["width", "height", "leaves_count"]]
y = data.price
```

X

width | height | leaves_count |
---|---|---|

1.2 | 1.4 | 4 |

1.3 | 1.6 | 6 |

1.5 | 2.2 | 7 |

1.3 | 2.0 | 8 |

y

```
0     LOW
1     LOW
2    HIGH
3    HIGH
Name: price, dtype: object
```

Training a model consists of finding good parameter values (turning the knobs to the correct positions) using the data we have. We'll use one of the available classifiers from the scikit-learn library to apply a Machine Learning algorithm for us:

```python
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression()
classifier = classifier.fit(X, y)
```

We'll go into the details of how `LogisticRegression` works in later tutorials. For now, let's check how well it works on the training data.

How good is your model? That's a fundamental question you'll have to answer time and time again. One rough estimate (when doing classification) is accuracy: the number of correct predictions divided by the total number of examples. This gives a number between 0 and 1.

```python
correct_predictions = 8
total_examples = 12
correct_predictions / total_examples
```

0.6666666666666666

The `score()` method does exactly that:

classifier.score(X, y)

1.0

Wow, 100% correct. Is this good? Is it realistic? You should **always** be suspicious when your metrics show 100% accuracy. In this example, our model is powerful enough to memorize all the data (we have only 4 data points).
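A more honest estimate comes from scoring the model on data it never saw during training. Here's a minimal sketch of that idea - the extra beet rows are made up for illustration, since the original four aren't enough to split:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# The original four beets plus a few made-up rows, so there is
# something left over to hold out for testing.
data = pd.DataFrame(
    dict(
        width=[1.2, 1.3, 1.5, 1.3, 1.1, 1.6, 1.4, 1.2],
        height=[1.4, 1.6, 2.2, 2.0, 1.3, 2.1, 2.3, 1.5],
        leaves_count=[4, 6, 7, 8, 5, 9, 8, 4],
        price=["LOW", "LOW", "HIGH", "HIGH", "LOW", "HIGH", "HIGH", "LOW"],
    )
)
X = data[["width", "height", "leaves_count"]]
y = data.price

# Hold out 25% of the rows; stratify keeps both classes in each part.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

classifier = LogisticRegression().fit(X_train, y_train)
test_score = classifier.score(X_test, y_test)  # accuracy on unseen beets
```

A score measured this way can drop well below the perfect training accuracy - which is exactly the point.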

We can get predictions by calling the `predict()` method on our model. Here's the data for a new beet whose price range we want to predict:

```python
new_beet_data = dict(width=1.5, height=3, leaves_count=12)
new_beet_data
```

{'height': 3, 'leaves_count': 12, 'width': 1.5}

Note that we're supplying everything except the price bracket (which we're trying to predict). We need to transform this dictionary to a 2D array and pass that to our model:

```python
new_data = [list(new_beet_data.values())]
predictions = classifier.predict(new_data)
predictions
```

array(['HIGH'], dtype=object)

Dwight is happy with the prediction that the model is making. He thinks that the task is complete. But something keeps him awake at night. It ain't Moses trying to run barely clothed in the field. How is the model doing its thing?

In reality, our model knows that we have two different classes - low and high price. Here they are:

classifier.classes_

array(['HIGH', 'LOW'], dtype=object)

And it assigns a probability (belief) to each one:

```python
for class_name, probability in zip(
    classifier.classes_, classifier.predict_proba(new_data)[0]
):
    print(f"{class_name}: {probability}")

```
HIGH: 0.995383561239819
LOW: 0.004616438760180996
```

So, the model is very confident that this beet deserves a high price. That's great, but how are those probabilities computed?

Recall that during training, the parameters (knobs) of our model get adjusted to make accurate predictions on the training data. Here are the final parameter values after the training is complete:

classifier.coef_

array([[-0.07588991, -0.28935367, -0.88655414]])

This particular model has another parameter, called intercept (we'll discuss the details of Logistic Regression in another tutorial):

classifier.intercept_[0]

6.2470409466048835

new_data[0]

[1.5, 3, 12]

To obtain the prediction, we need to multiply each feature value by its corresponding parameter and sum the results:

```python
dot_product = (-0.07588991 * 1.5) + (-0.28935367 * 3.0) + (-0.88655414 * 12.0)
dot_product
```

-11.620545555

This is known as the dot product of two vectors. You can also use the `dot()` function from NumPy:

```python
import numpy as np

np.dot(new_data[0], classifier.coef_.flatten())
```

-11.62054552144412

Finally, we need to add the intercept to get the final value:

np.dot(new_data[0], classifier.coef_.flatten()) + classifier.intercept_[0]

-5.373504574839236
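As a sanity check, scikit-learn exposes this raw value directly through the `decision_function()` method. The snippet below rebuilds the tutorial's tiny model so it runs on its own; the learned coefficients may differ slightly from the ones printed above depending on the library version, but the identity holds either way:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Rebuild the tutorial's model so this snippet is self-contained.
data = pd.DataFrame(
    dict(
        width=[1.2, 1.3, 1.5, 1.3],
        height=[1.4, 1.6, 2.2, 2.0],
        leaves_count=[4, 6, 7, 8],
        price=["LOW", "LOW", "HIGH", "HIGH"],
    )
)
classifier = LogisticRegression().fit(
    data[["width", "height", "leaves_count"]], data.price
)
new_data = [[1.5, 3, 12]]

# decision_function computes dot(features, coef_) + intercept_ for us.
raw_score = classifier.decision_function(new_data)[0]
manual_score = np.dot(new_data[0], classifier.coef_.flatten()) + classifier.intercept_[0]
```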

But this doesn't look like a probability. There is another trick that will convert it into one - the `sigmoid` function:

$Sigmoid(x) = \frac{1}{1 + \rm e^{-x}}$

Here is a straightforward implementation of it:

```python
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
```

Let's plot this to get a better understanding of it:

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
x = np.linspace(-10, 10, 100)
z = sigmoid(x)
plt.axvline(0, color="black", linewidth=0.3)
plt.axhline(0.5, color="black", linewidth=0.3)
plt.plot(x, z)
plt.yticks(np.arange(0, 1.01, step=0.25))
plt.xlabel("x")
plt.ylabel("Sigmoid(x)");
```

The `sigmoid` function squashes all values between 0 and 1. Note that the output for 0 is 0.5. Let's apply it to the output of our model:

```python
low_price_probability = sigmoid(dot_product + classifier.intercept_[0])
"{:f}".format(low_price_probability)
```

0.004616

And here are the probability predictions of our model:

```python
for class_name, probability in zip(
    classifier.classes_, classifier.predict_proba(new_data)[0]
):
    print(f"{class_name}: {probability}")
```

```
HIGH: 0.995383561239819
LOW: 0.004616438760180996
```

Note that this matches the probability our model predicted for the low price bracket. And this is how our model comes up with the final probabilities.
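We can also verify the whole chain programmatically: pushing the model's raw score through `sigmoid` should reproduce the column of `predict_proba` that corresponds to `classes_[1]` (here `LOW`, since the classes are sorted alphabetically). A self-contained sketch that rebuilds the same small model:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# The same tiny training set, repeated so the snippet runs on its own.
data = pd.DataFrame(
    dict(
        width=[1.2, 1.3, 1.5, 1.3],
        height=[1.4, 1.6, 2.2, 2.0],
        leaves_count=[4, 6, 7, 8],
        price=["LOW", "LOW", "HIGH", "HIGH"],
    )
)
classifier = LogisticRegression().fit(
    data[["width", "height", "leaves_count"]], data.price
)
new_data = [[1.5, 3, 12]]

# Raw score -> sigmoid -> probability of classes_[1] ("LOW").
manual_proba = sigmoid(classifier.decision_function(new_data))
sklearn_proba = classifier.predict_proba(new_data)[:, 1]
```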

Recall that our model can predict the final price bracket for the new data:

classifier.predict(new_data)

array(['HIGH'], dtype=object)

To convert a probability into an actual prediction, we'll apply a *threshold* - any probability above it is considered high enough:

threshold = 0.5

A `threshold` of 0.5 is a common choice. For some practical applications, you might want to use a different value - depending on the type of errors you're willing to tolerate.
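To see why the threshold matters, here's a small sketch with made-up probabilities. Raising it makes the model predict a class only when it is very confident - fewer false alarms, at the cost of more misses:

```python
import numpy as np

# Made-up low-price probabilities for four beets.
low_price_probabilities = np.array([0.2, 0.45, 0.6, 0.9])

# Default threshold: predict LOW whenever it is the more likely class.
default_predictions = low_price_probabilities > 0.5

# Stricter threshold: only predict LOW when the model is very confident.
strict_predictions = low_price_probabilities > 0.8
```

With the stricter threshold, the beet with 0.6 probability is no longer labeled as low-priced.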

To get the final prediction, we'll compare the probability of the low price bracket with the `threshold`:

low_price_probability > threshold

False

So, the model thinks this is not a low-price beet - it should be in the high price bracket. This result seems reasonable to Dwight, and he is happy with the work he did.

Dwight's business is growing. The model is working great, and life is good. But he wants something a bit extra. What if there was a higher price bracket for some of the premium beets? So, an idea was born. Here is some of the new data:

```python
data = pd.DataFrame(
    dict(
        width=[2.0, 1.3, 1.5, 1.3, 1.2],
        height=[2.0, 1.1, 2.2, 2.0, 1.3],
        weight=[4.0, 1.3, 2.0, 2.0, 1.7],
        leaves_count=[12, 6, 9, 10, 4],
        price=["PREMIUM", "LOW", "HIGH", "HIGH", "LOW"],
    )
)
data
```

width | height | weight | leaves_count | price |
---|---|---|---|---|
2.0 | 2.0 | 4.0 | 12 | PREMIUM |
1.3 | 1.1 | 1.3 | 6 | LOW |
1.5 | 2.2 | 2.0 | 9 | HIGH |
1.3 | 2.0 | 2.0 | 10 | HIGH |
1.2 | 1.3 | 1.7 | 4 | LOW |

Again, he needs to split the data into features and a target variable (the price bracket):

```python
X = data[["width", "height", "weight", "leaves_count"]]
y = data.price
```

Now, Dwight has three categories to choose from, so the model he trained earlier won't work here - it only knows about two classes. Luckily, `LogisticRegression` supports multiclass classification out of the box, so he can train a new model on the new data:

```python
classifier = LogisticRegression()
classifier = classifier.fit(X, y)
```

Dwight got some more data to test his new model on:

```python
new_beet_data = dict(width=2.0, height=2.5, weight=4.5, leaves_count=14)
new_beet_data
```

{'height': 2.5, 'leaves_count': 14, 'weight': 4.5, 'width': 2.0}

Just as before, we'll pass this to the trained model:

```python
new_data = [list(new_beet_data.values())]
predictions = classifier.predict(new_data)
predictions
```

array(['PREMIUM'], dtype=object)

The premium price bracket prediction seems reasonable to Dwight. But how does it work?

The inner workings of this model are a bit more involved than taking a single dot product - there is now one score per class - so we'll skip the details for now. In the end, a function that does a similar job to the `sigmoid` is used to turn those scores into probabilities. Its name is `softmax`:

$Softmax(x_i) = \frac{\rm e^{x_i}}{\sum_{j=1}^{n}\rm e^{x_j}}$

Here is an implementation:

```python
def softmax(x):
    return np.exp(x) / np.sum(np.exp(x))
```

The job of this function is to convert unnormalized values to a probability distribution. Let's look at an example:

```python
low_price_prediction = 1.2
high_price_prediction = 4.5
premium_price_prediction = 10.42
probabilities = softmax(
    [low_price_prediction, high_price_prediction, premium_price_prediction]
)
print(["{:f}".format(p) for p in probabilities])
```

['0.000099', '0.002678', '0.997223']

`Softmax` squashes the vector of predictions so that its values sum to 1.
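Two properties of `softmax` are worth checking: the outputs always sum to 1, and shifting every score by the same constant leaves them unchanged, since the constant cancels in the ratio. The second property is why practical implementations subtract the maximum score before exponentiating, to avoid numerical overflow. A quick self-contained check:

```python
import numpy as np

def softmax(x):
    return np.exp(x) / np.sum(np.exp(x))

scores = np.array([1.2, 4.5, 10.42])
probabilities = softmax(scores)
total = probabilities.sum()  # always 1, up to floating point error

# Subtracting a constant (here the max) from every score cancels out
# in the ratio, so the probabilities stay the same.
shifted_probabilities = softmax(scores - scores.max())
```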

Our new model can also output probabilities:

classifier.classes_

array(['HIGH', 'LOW', 'PREMIUM'], dtype=object)

```python
for class_name, probability in zip(
    classifier.classes_, classifier.predict_proba(new_data)[0]
):
    print(f"{class_name}: {probability}")
```

```
HIGH: 0.09047630577203172
LOW: 0.0002293276653952118
PREMIUM: 0.909294366562573
```

The model is quite certain that this beet belongs to the premium price bracket, though it assigns some probability to the high price bracket too. This is good - we'll see why in later tutorials.

Classification problems are very common when doing real-world Machine Learning work. Recognizing them will allow you to use a large toolbox of predefined models that can solve them. Of course, every real-world problem has a bit of "strangeness" around it, but knowing the fundamentals will be of huge help!

In this tutorial, you learned:

- What classification is
- How to classify objects into two categories (with an example using Logistic Regression)
- How to do multiclass classification (with Logistic Regression over three classes)