An imbalanced dataset has heavily skewed proportions of the target categories. Often, anomaly detection datasets have such properties.
Let's say you have a dataset of X-ray images from patients screened for pneumonia. About 90% of the images show no signs of illness, while the remaining 10% do.
By far, the easiest way to deal with an imbalanced dataset is to collect more data, when you know that doing so will balance it. Of course, that might not be feasible in practice. The tactics you'll use depend on the problem you want to solve, but there are some general approaches:
Sampling strategies try to balance the dataset by changing the default uniform sampling.
Undersampling - removes samples from the majority category (the one with many more examples).
Oversampling - adds (duplicates) samples from the minority category (the one with few examples).
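Both strategies can be sketched with the standard library alone. This is a minimal toy example (the 90/10 split mirrors the X-ray dataset above; the labels and counts are made up for illustration):

```python
import random

random.seed(42)

# Toy dataset: 90 "normal" samples and 10 "pneumonia" samples.
normal = [("normal", i) for i in range(90)]
pneumonia = [("pneumonia", i) for i in range(10)]

# Undersampling: randomly drop majority samples down to the minority size.
undersampled = random.sample(normal, len(pneumonia)) + pneumonia

# Oversampling: randomly duplicate minority samples up to the majority size.
oversampled = normal + random.choices(pneumonia, k=len(normal))

print(len(undersampled))  # 20 samples, balanced 10/10
print(len(oversampled))   # 180 samples, balanced 90/90
```

Note the trade-off: undersampling throws away data, while oversampling repeats minority samples and can encourage overfitting to them.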
Another approach is to create synthetic data from the underrepresented categories. In our example, you can flip, rotate, and crop the images of pneumonia patients. One popular library for vision-related data augmentation is Albumentations.
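The core transforms are easy to illustrate without any extra dependency. Here is a minimal NumPy sketch of flipping, rotating, and cropping, using a small random array as a stand-in for an X-ray image (in practice you would apply such transforms via a library like Albumentations):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a grayscale X-ray image (hypothetical 8x8 pixel array).
image = rng.integers(0, 256, size=(8, 8))

flipped = np.fliplr(image)   # horizontal flip
rotated = np.rot90(image)    # 90-degree rotation
cropped = image[1:7, 1:7]    # central 6x6 crop

print(flipped.shape, rotated.shape, cropped.shape)
```

Each transformed array is a new, label-preserving training sample for the minority category.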
You can use some simple data augmentation techniques when working with text, too! Some of them are synonym replacement, synonym insertion at random places, and random word deletion.
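Two of these text techniques can be sketched in a few lines. The synonym table below is a made-up example (a real setup might pull synonyms from a resource like WordNet):

```python
import random

random.seed(7)

# Hypothetical synonym table for illustration only.
SYNONYMS = {"quick": ["fast", "rapid"], "happy": ["glad", "joyful"]}

def synonym_replacement(words):
    # Swap each word that has known synonyms for a random synonym.
    return [random.choice(SYNONYMS[w]) if w in SYNONYMS else w for w in words]

def random_deletion(words, p=0.2):
    # Drop each word with probability p, keeping at least one word.
    kept = [w for w in words if random.random() > p]
    return kept or [random.choice(words)]

sentence = "the quick fox is happy".split()
print(synonym_replacement(sentence))
print(random_deletion(sentence))
```

As with image augmentation, the goal is to generate plausible new minority-class examples without changing their labels.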
Using accuracy to measure the performance of a model trained on this dataset gives a false sense that it performs well: a trivial model that always predicts the majority category achieves 90% accuracy, purely because of the imbalance.
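You can verify this with a quick calculation on the 90/10 example (labels here are toy data: 0 for healthy, 1 for pneumonia):

```python
# 90 healthy (0) and 10 pneumonia (1) labels, matching the example above.
y_true = [0] * 90 + [1] * 10

# A trivial model that always predicts "healthy".
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # 0.9 -- looks good, yet the model never detects pneumonia
```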
Metrics like Precision, Recall, and F1-score capture the model performance better when dealing with imbalanced datasets.
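Computing these metrics by hand for the same always-healthy model shows why they are more honest here (in practice you would use `sklearn.metrics`; this is a from-scratch sketch):

```python
y_true = [0] * 90 + [1] * 10  # toy labels: 0 = healthy, 1 = pneumonia
y_pred = [0] * 100            # the always-healthy model

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(precision, recall, f1)  # all 0.0 -- the imbalance is no longer hidden
```

While accuracy reports 90%, precision, recall, and F1 are all zero for the pneumonia class, exposing that the model is useless for the task we actually care about.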