MLExpert

What is an imbalanced dataset? How can you deal with it?

In an imbalanced dataset, the target categories appear in heavily skewed proportions. Anomaly detection datasets often have this property.

Let's say you have a dataset of X-ray images of patients with suspected pneumonia. About 90% of the images show no signs of illness, and the remaining 10% do.

Dealing with imbalanced datasets

By far, the easiest way to deal with an imbalanced dataset is to collect more data, provided you know the new data will balance it. Of course, that might not be feasible in practice. The tactics you'll use depend on the problem you want to solve, but there are some general approaches:

Oversampling and undersampling

Sampling strategies try to balance the dataset by changing how often examples from each category are drawn, instead of sampling all examples uniformly.

Undersampling - removes samples from the category that contains many more examples:

[Figure: Undersampling]

Oversampling - adds samples to the category that contains few examples, typically by repeating existing ones:

[Figure: Oversampling]
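
To make the two sampling strategies concrete, here is a minimal sketch in plain Python. The function names `undersample` and `oversample`, the toy labels, and the fixed seed are assumptions made for illustration, and dedicated libraries offer more refined variants. The sketch uses random selection only: undersampling shrinks every category to the size of the rarest one, and oversampling repeats random examples until every category matches the largest one.

```python
import random
from collections import Counter


def group_by_label(samples, labels):
    """Collect the samples of each category in a dict: label -> list of samples."""
    groups = {}
    for sample, label in zip(samples, labels):
        groups.setdefault(label, []).append(sample)
    return groups


def undersample(samples, labels, seed=0):
    """Randomly drop examples until every category is as small as the rarest one."""
    rng = random.Random(seed)
    groups = group_by_label(samples, labels)
    target = min(len(group) for group in groups.values())
    balanced = []
    for label, group in groups.items():
        # rng.sample draws without replacement, so no example is repeated.
        balanced.extend((sample, label) for sample in rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced


def oversample(samples, labels, seed=0):
    """Randomly repeat examples until every category is as large as the biggest one."""
    rng = random.Random(seed)
    groups = group_by_label(samples, labels)
    target = max(len(group) for group in groups.values())
    balanced = []
    for label, group in groups.items():
        # rng.choices draws with replacement, so rare examples appear several times.
        extra = rng.choices(group, k=target - len(group))
        balanced.extend((sample, label) for sample in group + extra)
    rng.shuffle(balanced)
    return balanced


# Toy stand-in for the X-ray example: 90 healthy images and 10 with pneumonia.
samples = list(range(100))
labels = ["healthy"] * 90 + ["pneumonia"] * 10

under = undersample(samples, labels)
over = oversample(samples, labels)
print(Counter(label for _, label in under))  # 10 examples per category
print(Counter(label for _, label in over))   # 90 examples per category
```

Note the trade-off: undersampling throws away 80 of the 90 healthy images here, while oversampling keeps everything but shows the model the same 10 pneumonia images many times, which can encourage overfitting to them.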

Data augmentation

Another approach is to create synthetic data for the underrepresented categories. In our example, you can flip, rotate, and crop images of pneumonia patients. One popular library for vision-related data augmentation is Albumentations.
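
As a rough illustration of what flipping, rotating, and cropping do to the pixels, here is a small sketch that treats an image as a nested list of numbers. This is a dependency-free stand-in, not how an augmentation library is used in practice, and the helper names `horizontal_flip`, `rotate_90_clockwise`, and `crop` are made up for this example.

```python
def horizontal_flip(image):
    """Mirror every row left to right."""
    return [row[::-1] for row in image]


def rotate_90_clockwise(image):
    """Rotate by 90 degrees clockwise: reverse the row order, then transpose."""
    return [list(column) for column in zip(*image[::-1])]


def crop(image, top, left, height, width):
    """Cut out a height x width window whose top-left corner is (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]


# A tiny 2 x 3 "image"; real X-rays would be much larger arrays.
image = [
    [1, 2, 3],
    [4, 5, 6],
]

flipped = horizontal_flip(image)      # [[3, 2, 1], [6, 5, 4]]
rotated = rotate_90_clockwise(image)  # [[4, 1], [5, 2], [6, 3]]
cropped = crop(image, 0, 1, 2, 2)     # [[2, 3], [5, 6]]
```

Each transformed copy keeps the label of the original image, so applying a few of these to every pneumonia image increases the number of examples in the rare category. Whether a given transformation preserves the label is a domain question worth checking for medical images.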

You can use some simple data augmentation techniques when working with text, too! Some of them are synonym replacement, synonym insertion at random places, and random word deletion.
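
Two of the text techniques above, synonym replacement and random word deletion, can be sketched in a few lines. The `SYNONYMS` dictionary is a hypothetical toy thesaurus invented for this example (a real project would need a proper synonym source), and the function names and the deletion probability are assumptions as well.

```python
import random

# Hypothetical toy thesaurus, for illustration only.
SYNONYMS = {
    "quick": ["fast", "rapid"],
    "happy": ["glad", "cheerful"],
}


def synonym_replacement(words, rng):
    """Swap every word that has an entry in SYNONYMS for a random synonym."""
    return [rng.choice(SYNONYMS[word]) if word in SYNONYMS else word for word in words]


def random_deletion(words, p, rng):
    """Drop each word independently with probability p, keeping at least one word."""
    kept = [word for word in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]


rng = random.Random(0)
words = "the quick dog looks happy today".split()

replaced = synonym_replacement(words, rng)
shortened = random_deletion(words, p=0.3, rng=rng)
print(" ".join(replaced))
print(" ".join(shortened))
```

The augmented sentences inherit the label of the original one. Aggressive deletion can change the meaning of a short text, so the probability is usually kept small.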

Changing the metrics

Using accuracy to measure the performance of a trained model on our dataset can give us a false sense of a well-performing model. A model that always predicts "no illness" reaches 90% accuracy (due to the dataset imbalance) without detecting a single pneumonia case.

Metrics like Precision, Recall, and F1-score capture the model performance better when dealing with imbalanced datasets.
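
The sketch below works through the arithmetic on the 90/10 example, treating pneumonia as the positive class (label 1). The `scores` helper, the convention of reporting 0.0 when a ratio is undefined, and the predictions of the second model are all assumptions made for illustration; they are not results from a real classifier.

```python
def scores(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    tn = sum(t != positive and p != positive for t, p in pairs)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    # Assumed convention: report 0.0 when the denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# 90 healthy images (0) followed by 10 pneumonia images (1).
y_true = [0] * 90 + [1] * 10

# Baseline that labels every image as healthy.
always_healthy = [0] * 100

# Hypothetical model: 5 false alarms among the healthy, 8 of 10 pneumonia cases found.
some_model = [0] * 85 + [1] * 5 + [1] * 8 + [0] * 2

baseline_scores = scores(y_true, always_healthy)
model_scores = scores(y_true, some_model)
print(baseline_scores)
print(model_scores)
```

The baseline scores 0.9 accuracy but 0.0 precision, recall, and F1. The hypothetical model is only slightly ahead on accuracy (0.93), yet its recall of 0.8 and precision of 8/13 show that it actually finds most of the pneumonia cases, which is the difference accuracy alone hides.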

Copyright © 2021 MLExpert by Venelin Valkov. All rights reserved.