
Skewness


Skewness - "Lack of Symmetry"
    We study skewness to get an idea of the shape of the curve we draw from the given data. If for a given dataset Mean = Median = Mode, we say the data is Symmetrical, or not skewed. The data is said to be Asymmetrical, or Skewed, if:
1. The Mean, Median and Mode do not fall at the same point.
2. The Quartiles are not equidistant.
3. The curve drawn is not symmetrical but is stretched more to one side than the other.
Such asymmetrical distributions can be either Positively Skewed or Negatively Skewed.
a) Symmetrical Distribution
In a symmetrical distribution we have Mean = Median = Mode, so the data is distributed symmetrically about the center (the Mean). We usually refer to this kind of distribution as the Normal Distribution or Gaussian Distribution, which has a bell-shaped curve. The Gaussian Distribution is very important from a statistical point of view and has a wide range of applications in the field.
b) Positively Skewed Distribution

A given data distribution is said to be Positively Skewed or Right-Skewed if the data frequencies (the number of times each value occurs in the data) are spread over a greater range of values on the right-hand side than on the left-hand side.
In Positively Skewed data we have Mean > Median > Mode.
c) Negatively Skewed Distribution

A given data distribution is said to be Negatively Skewed or Left-Skewed if the data frequencies (the number of times each value occurs in the data) are spread over a greater range of values on the left-hand side than on the right-hand side.
In Negatively Skewed data we have Mean < Median < Mode.
Measure of Skewness
Of the various measures of skewness, one of the most important and widely used is Karl Pearson's Coefficient of Skewness, given by:
Karl Pearson's Coefficient of Skewness = 3(Mean - Median) / Standard Deviation (S.D.)
and the coefficient lies between -3 and +3.
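As a quick sketch, the coefficient can be computed directly from the definitions above. The `pearson_skewness` helper below is my own illustration, not code from the original post:

```python
import numpy as np

def pearson_skewness(values):
    """Karl Pearson's coefficient: 3 * (Mean - Median) / S.D."""
    values = np.asarray(values, dtype=float)
    return 3 * (values.mean() - np.median(values)) / values.std()

# A right-skewed sample: the outlier 20 pulls the mean above the median,
# so the coefficient comes out positive.
print(pearson_skewness([1, 2, 2, 3, 3, 3, 4, 20]))

# A symmetric sample gives (approximately) zero.
print(pearson_skewness([1, 2, 3, 4, 5]))
```

A positive value indicates right skew, a negative value left skew, and zero indicates symmetry.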
To demonstrate with an example, I have taken the Big Mart Sales dataset from the Analytics Vidhya ML competitions.

To visualize, I have taken the target feature of the data, viz. Total Outlet Sales. From the figure we can clearly see that our data is Positively Skewed: the majority of values lie between 0 and 4000 (the left side), while the frequencies of higher values are spread over a greater range, producing a long tail towards the right side.
Now, What happens if our data is skewed and Why do we care?
Let us recall the actual definition of Machine Learning: unlike traditional programming, where we pass a set of explicit rules, in Machine Learning the machine learns the rules for a task by observing and analyzing the patterns in the training data we pass to it.
Assume we pass skewed data to our machine. The data points are concentrated on either the left or the right side of the distribution. The machine then trains more on the region where the data is concentrated (the left-hand side in the example above) and, consequently, trains less on the data points lying in the tail of the distribution. When we perform predictions with a model built on such skewed data, a data point lying in the tail may not be predicted well. This leads to larger error terms and non-constant error variance, which violates one of the Linear Regression assumptions, namely constant error-term variance (Homoscedasticity), and in turn results in poor model metrics.
So, What can we do?
We have transformation methods to make our skewed data look more like a Normal (Gaussian) distribution. (We are transforming the data to look like a Normal Distribution, not exactly converting it into one.) The important transformation techniques we use are:
1. Log Transformation:
Here we apply the logarithm to our skewed data. Let's apply the Log Transformation to the example discussed above and see the result. (All transformations are plotted in ax3, the third sub-plot.)
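The original post's plotting code isn't reproduced here, but the transformation itself can be sketched numerically. Since the real dataset isn't bundled with this post, the snippet below uses a synthetic right-skewed (lognormal) sample as a stand-in for the sales column and compares the skewness before and after taking logs:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-in for the sales column: strictly positive and
# right-skewed (lognormal), since the real dataset isn't included here.
sales = rng.lognormal(mean=8, sigma=0.8, size=5000)

log_sales = np.log(sales)  # log transformation; requires values > 0

print("skewness before:", stats.skew(sales))
print("skewness after :", stats.skew(log_sales))
```

For this particular synthetic sample the log brings the skewness close to zero; on real data, as the post notes, it can overshoot and leave the distribution slightly left-skewed.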
Output of above code: 
The top-left figure shows the distribution of the original data and the bottom-left figure shows the log-transformed data. The Log Transformation does change the shape of our skewed data, but the result doesn't look exactly Normal: it is now slightly skewed to the left, i.e. negatively skewed. The Log Transformation didn't work very well on our data, so we try other transformations.
(We will discuss the probability plots on the right side of the figure in upcoming blogs. If you are new to Data Science, please ignore them for now and continue.)
2. Square Root Transformation:
Here we perform the Square Root operation on our skewed data. Below is the code to implement the transformation.
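A minimal numerical sketch of the square root transformation, again using a synthetic right-skewed sample in place of the real sales column:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Synthetic right-skewed stand-in for the sales column.
sales = rng.lognormal(mean=8, sigma=0.8, size=5000)

sqrt_sales = np.sqrt(sales)  # square root transformation; requires values >= 0

print("skewness before:", stats.skew(sales))
print("skewness after :", stats.skew(sqrt_sales))
```

The square root is a gentler transformation than the log: it reduces the right skew rather than removing it entirely.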
Output of above code:
The Square Root Transformation made a significant change to our skewed distribution: it now looks almost Normal, which is quite satisfying. But wait until we try the most important transformation, coming next.
3. Box - Cox Transformation:
The Box-Cox Transformation is one of the most advanced and widely used transformation techniques in regression analysis for improving the normality of data. The math behind the transformation is a bit complex and not important at this stage. It can be easily implemented using the Scikit-Learn or SciPy library; here I use SciPy in the code below:
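A minimal sketch using `scipy.stats.boxcox`, which fits the transformation parameter lambda by maximum likelihood and returns the transformed data together with the fitted lambda (a synthetic lognormal sample again stands in for the real sales column):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Strictly positive, right-skewed stand-in data (Box-Cox requires > 0).
sales = rng.lognormal(mean=8, sigma=0.8, size=5000)

# boxcox returns (transformed data, fitted lambda).
transformed, fitted_lambda = stats.boxcox(sales)

print("fitted lambda  :", fitted_lambda)
print("skewness before:", stats.skew(sales))
print("skewness after :", stats.skew(transformed))
```

For lognormal-like data the fitted lambda comes out near zero, at which point Box-Cox essentially reduces to the log transformation.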
Output of above code:
The Box-Cox Transformation gives a better approximation to Normality than the previous two. Our distribution now looks much more like a Normal, with slight disturbances on the left side of the curve. We are more satisfied now, aren't we?
But there is a catch! All the transformations we have seen so far work only if the data contains STRICTLY POSITIVE values. They cannot be used if there is even a single negative value in the data.
So what now? Suppose our data contains negative values and is skewed, and we need to transform it to improve its normality. What do we do?
To answer this, Yeo and Johnson came up with a new technique called the Yeo-Johnson Transformation, an extended version of the Box-Cox Transformation.
4. Yeo-Johnson Transformation: 
Yeo-Johnson is the transformation technique we use to improve the normality of skewed data that contains negative values. The math behind it is not required at this stage. Although we don't have any negative values in our data, we will still try transforming it with the Yeo-Johnson Transformation. It can also be easily implemented using the Scikit-Learn or SciPy library; here I use SciPy in the code below:
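A minimal sketch using `scipy.stats.yeojohnson`. To show that it really does accept negative values, the synthetic skewed sample is shifted below zero before transforming:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Shift the synthetic skewed sample below zero: Box-Cox would fail here,
# but Yeo-Johnson handles zero and negative values.
sales = rng.lognormal(mean=8, sigma=0.8, size=5000) - 5000

# yeojohnson returns (transformed data, fitted lambda), like boxcox.
transformed, fitted_lambda = stats.yeojohnson(sales)

print("fitted lambda  :", fitted_lambda)
print("skewness before:", stats.skew(sales))
print("skewness after :", stats.skew(transformed))
```

Running `stats.boxcox` on this shifted sample would raise a ValueError, which is exactly the gap Yeo-Johnson fills.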
Output of above code:
We haven't observed much change from the Box-Cox to the Yeo-Johnson transformation. Yeo-Johnson mainly comes into play when the data contains negative values; otherwise, Box-Cox is already a strong transformation technique.

Thank you. See you in the next blog.



