Scaling Data

Cindy Reiner
Jan 25, 2021 · 6 min read

Data scaling is a useful tool in machine learning. It puts the numeric features of a dataset on a common scale and thus eliminates the problem of features with larger values having a stronger influence on the machine learning model.

There are many ways to scale data, each of which changes the data’s values and distribution in different ways. Here, we will look at several of the scaling methods available in scikit-learn. To examine how each method changes the data, I have created a dataset with 3 features with different distributions as follows:

skewed = [2, 2, 4, 6, 8, 8, 10, 10, 10]

normal = [2, 4, 4, 6, 6, 6, 8, 8, 10]

outliers = [2, 4, 4, 6, 6, 8, 8, 10, 40]

As you can see in the DataFrame below, I rearranged the numbers so that the first 5 numbers match across features. This will make it easier to see how the different methods of scaling data change the same value depending on the distribution of the feature as a whole. We will also look at the kernel density estimation plot to see how the feature distributions are changed with each scaling method.
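For reference, here is a minimal sketch of how this starting DataFrame and its kde plots can be built with pandas (the column names are my own):

import pandas as pd

# the three example features, one column each
skewed = [2, 2, 4, 6, 8, 8, 10, 10, 10]
normal = [2, 4, 4, 6, 6, 6, 8, 8, 10]
outliers = [2, 4, 4, 6, 6, 8, 8, 10, 40]
df = pd.DataFrame({'skewed': skewed, 'normal': normal, 'outliers': outliers})

df.plot.kde()  # kernel density estimate of each feature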

Starting data and corresponding kernel density estimation (kde) graphs

Standard Scaler

The Standard Scaler uses the mean and standard deviation of the feature to scale the data around a mean of 0 and a standard deviation of 1 using the equation:

x_scaled = (x − mean(x)) / std(x)

With the Standard Scaler, most values fall into the range of -3 to 3. Because this scaler uses the mean and standard deviation to scale the feature, skewness of the distribution and outliers will affect the outcome. This is evident in the outlier data, where the bulk of the data is squeezed into the range of -0.71 to 0.02 so that the outlier can fit into the -3 to 3 range. In the skewed data, the values are also different from their counterparts in the normally distributed data.

The Standard Scaler is best used on features with a normal distribution and no outliers. More information on how to use the StandardScaler from sklearn can be found in the scikit-learn documentation.
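As a quick sketch of the basic usage (the array below is the “normal” feature from the example data):

from sklearn.preprocessing import StandardScaler
import numpy as np

X = np.array([[2], [4], [4], [6], [6], [6], [8], [8], [10]])  # one column = one feature
print(StandardScaler().fit_transform(X).ravel())  # mean 0, standard deviation 1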

Data scaled using sklearn’s StandardScaler and corresponding kde graphs

Min Max Scaler

The Min Max Scaler uses the minimum and maximum values of the feature to scale the data into a range between 0 and 1 using this equation:

x_scaled = (x − min(x)) / (max(x) − min(x))

Because outliers affect the minimum and maximum values of the feature, this method is sensitive to outliers. However, skewness of the distribution will not affect the scaled values.

The Min Max Scaler is best for features with a skewed distribution and no outliers. More information on how to use the MinMaxScaler from sklearn can be found in the scikit-learn documentation.
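A similar sketch using the outlier feature shows the sensitivity to outliers:

from sklearn.preprocessing import MinMaxScaler
import numpy as np

X = np.array([[2], [4], [4], [6], [6], [8], [8], [10], [40]])  # the 'outliers' feature
print(MinMaxScaler().fit_transform(X).ravel())
# the outlier (40) maps to 1.0 and compresses the rest of the data toward 0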

Data scaled using sklearn’s MinMaxScaler and corresponding kde graphs

Robust Scaler

The Robust Scaler uses the feature’s median (Q2(x)) and interquartile range (Q3(x) − Q1(x)) to scale the data so that most of the data fall between -1 and 1, using the equation:

x_scaled = (x − Q2(x)) / (Q3(x) − Q1(x))

Outliers do not affect the feature’s median or interquartile range. Skewness, however, can affect these measures, so it will affect the outcome.

The Robust Scaler is best used on features with a normal distribution that contain outliers that cannot be removed. More information on how to use the RobustScaler from sklearn can be found in the scikit-learn documentation.
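A minimal sketch with the outlier feature (by default RobustScaler centers on the median and divides by the 25th–75th percentile range):

from sklearn.preprocessing import RobustScaler
import numpy as np

X = np.array([[2], [4], [4], [6], [6], [8], [8], [10], [40]])  # the 'outliers' feature
print(RobustScaler().fit_transform(X).ravel())
# the bulk of the data stays near -1 to 1; the outlier remains an outlier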

Data scaled using sklearn’s RobustScaler and corresponding kde graphs

Max Absolute Scaler

The Max Absolute Scaler uses the maximum absolute value of the feature to scale the data into a range between -1 and 1 (0 and 1 when all values are non-negative) using this equation:

x_scaled = x / max(|x|)

Because this scaler uses only the maximum absolute value to scale the data, it is affected by outliers but not by skewness. The Max Abs Scaler was specifically designed to preserve the sparsity of the data.

The Max Abs Scaler is best used on features with sparse data and no outliers. More information on how to use the MaxAbsScaler from sklearn can be found in the scikit-learn documentation.
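A quick sketch with the skewed feature:

from sklearn.preprocessing import MaxAbsScaler
import numpy as np

X = np.array([[2], [2], [4], [6], [8], [8], [10], [10], [10]])  # the 'skewed' feature
print(MaxAbsScaler().fit_transform(X).ravel())
# every value is divided by max(|x|) = 10, so zeros stay zeros (sparsity is preserved)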

Data scaled using sklearn’s MaxAbsScaler and corresponding kde graphs

Normalizer

Unlike the previously mentioned scalers, which change feature values based on properties of the feature, the Normalizer changes values based on the other values in the row so that each row has unit norm. That is, with the default L2 norm, the squared values in a row sum to 1. The equation for a dataset with features x, y, and z is:

x' = x / √(x² + y² + z²) (and likewise for y and z)

The Normalizer is a feature engineering method, not a scaling method. It is useful for finding patterns in how the features of a sample relate to each other. More information on how to use the Normalizer from sklearn can be found in the scikit-learn documentation.
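A small sketch to illustrate the row-wise behavior (the two sample rows here are made up for the example):

from sklearn.preprocessing import Normalizer
import numpy as np

# rows are samples with features x, y, z; each row is scaled to unit length
X = np.array([[2.0, 2.0, 2.0],
              [4.0, 6.0, 4.0]])
normed = Normalizer(norm='l2').fit_transform(X)  # 'l2' is the default norm
print((normed ** 2).sum(axis=1))  # [1. 1.] -- each row's squared values sum to 1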

Data scaled using sklearn’s Normalizer and corresponding kde graphs

Power Transformer

The Power Transformer both scales and normalizes data using either the Yeo-Johnson (default) or Box-Cox transformation. The outcome is a more Gaussian distribution centered around 0, with most data falling into a range of -2 to 2. The Yeo-Johnson transformation can be used on any values, while the Box-Cox transformation can only be used when feature values are greater than 0.

These transformations are useful for dealing with data with high heteroskedasticity, or when the data contains outliers and you want a normal distribution. More information on how to use the PowerTransformer from sklearn can be found in the scikit-learn documentation.
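A minimal sketch with the skewed feature:

from sklearn.preprocessing import PowerTransformer
import numpy as np

X = np.array([[2], [2], [4], [6], [8], [8], [10], [10], [10]])  # the 'skewed' feature
pt = PowerTransformer(method='yeo-johnson')  # the default; 'box-cox' requires values > 0
print(pt.fit_transform(X).ravel())  # more Gaussian, standardized around 0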

Data scaled using sklearn’s PowerTransformer and corresponding kde graphs

Quantile Transformer

The Quantile Transformer is a non-linear transformation that places all values in the range 0 to 1. This transformation brings all outliers to the edges of the range, which can cause saturation artifacts. Distances and correlations between and within the features are distorted.

More information on how to use the QuantileTransformer from sklearn can be found in the scikit-learn documentation.
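A quick sketch with the outlier feature (n_quantiles is lowered to match the sample size, since it may not exceed the number of samples):

from sklearn.preprocessing import QuantileTransformer
import numpy as np

X = np.array([[2], [4], [4], [6], [6], [8], [8], [10], [40]])  # the 'outliers' feature
qt = QuantileTransformer(n_quantiles=9, output_distribution='uniform')
print(qt.fit_transform(X).ravel())  # ranks mapped onto [0, 1]; the outlier lands at 1.0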

Data scaled using sklearn’s QuantileTransformer and corresponding kde graphs

