Statistics Basics for Machine Learning


Statistics has become increasingly popular in recent years, with rapid growth in its applications across fields such as Economics, Business, Healthcare, Logistics, Risk Management, Policy Making, Government Institutions and practically every industry where data is generated. Data is the new oil: by blending cutting-edge technologies with Statistical Methods, we can solve many complex business problems. With the advent of Big Data, Machine Learning and Artificial Intelligence, a majority of companies are incorporating data analytics and data-driven solutions into their strategic applications and decision making.

What is Statistics?

"Statistics is a branch of applied mathematics which specializes in data." - This is the simplest and most straightforward definition of statistics.

Definitions by Great Statisticians:
1. "Statistics is the science of counting. The science of averages." - A.L. Bowley
2. "Statistics is the science of Estimates and Probabilities." - Boddington
3. "Statistics is the science and art of handling aggregates of facts – observing, enumeration, recording, classifying and otherwise systematically treating them." - Harlow
These are a few of the many definitions of Statistics, which have changed over time.

In this series of blog posts, I'll be sharing the important statistical concepts and methods that are used in present-day Machine Learning projects.

Let's now look at the data types that we use in Statistics.

Types of Data

    We generally face two types of data in our machine learning projects:

1.       Numerical Data

2.       Categorical Data

Numerical Data: Numerical data is further classified into two types -
1. Discrete data – which takes only countable, whole-number values, just like an int variable.
Example: Age of a person in completed years,
    Number of Subjects passed in a semester,
    Number of Children in a family, etc.
2. Continuous data - which can take any value in a range, including decimal values, just like a float variable.
Example: Height and weight of a person,
    Average speed of your car trip,
    Temperature in a room, etc.

Categorical Data: Unlike numerical data, categorical data doesn’t use numerical values to represent the data; instead it uses “Names” or “Labels”. Categorical data is again divided into two types:
1. Nominal Data – In Nominal Data, the Names/Labels have no inherent order and all have equal importance: no Label is placed higher or lower than another. (We cannot put the labels on a scale and measure them.)
Example: a) What is your gender?
You choose an option such as “Male” or “Female”; the Names/Labels are equally ranked, and no Label has priority over the other.
(We cannot measure gender.)
b) In grammar, the parts of speech: Noun, Verb, Preposition, Article, Pronoun, etc.
(We cannot measure the parts of speech.)
2. Ordinal Data – In Ordinal Data, the Names/Labels have a natural order and are ranked on an Ordinal Scale, although the gaps between ranks need not be equal. (You often see these as rating scales in different applications.)
Example: a) After an order is delivered, you are asked to rate the service on a scale of 1-10, 1 being Very Poor and 10 being Fantastic.
(Here we are ranking the service on a scale.)
b) How is your Coffee? - Very Strong, Strong, Moderate, Light, Very Light.
(Here we are ranking the strength of the coffee.)
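
The distinction matters in practice because the two kinds of labels are usually encoded differently before modelling. The sketch below is only an illustration: the responses and variable names are hypothetical, and it uses plain Python rather than any particular library. Ordinal labels are mapped to ranks that preserve their order, while nominal labels are one-hot encoded so that no artificial order is introduced.

```python
# Hypothetical responses, for illustration only.
coffee_strength = ["Strong", "Light", "Very Strong", "Moderate", "Light"]
part_of_speech = ["Noun", "Verb", "Noun", "Pronoun"]

# Ordinal data: the labels have a natural order, so map each label to its rank.
strength_order = ["Very Light", "Light", "Moderate", "Strong", "Very Strong"]
strength_rank = {label: rank for rank, label in enumerate(strength_order)}
ordinal_codes = [strength_rank[label] for label in coffee_strength]

# Nominal data: the labels have no order, so use one indicator column per label.
pos_labels = sorted(set(part_of_speech))
one_hot = [[int(value == label) for label in pos_labels] for value in part_of_speech]

print(ordinal_codes)  # [3, 1, 4, 2, 1]
print(pos_labels)     # ['Noun', 'Pronoun', 'Verb']
print(one_hot)        # [[1, 0, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]]
```

Note that the ranks only say which label is stronger; they do not claim that "Strong" is exactly one unit above "Moderate".
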
Presenting the data in descriptive form is the foremost step in any Statistical Analysis or in building a Machine Learning model. It gives insights into the data that we are working on and allows us to spot any patterns in it.

Descriptive Statistics

Measures of Central Tendency

Mean: The Arithmetic Mean, or simply the Mean, is defined as the sum of the observations divided by the number of observations.
Output: 16.2
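
As a minimal sketch of this definition in Python: the five-value sample below is hypothetical (it is not the original data) and was chosen so that its mean matches the 16.2 quoted above.

```python
import statistics

# Hypothetical sample, chosen only for illustration.
data = [10, 15, 17, 19, 20]

# Mean = sum of the observations / number of observations.
mean = sum(data) / len(data)
print(mean)  # 16.2

# The standard library gives the same result.
print(statistics.mean(data))  # 16.2
```
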

Median: The Median is the value that divides the ordered data into two equal parts, one part comprising all values greater than the Median and the other all values lower than it.
Output: 17
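
A small sketch of how the median can be computed by hand, assuming the same kind of hypothetical five-value sample (not the original data), whose median is also 17. The helper name `median` is illustrative. For an even number of observations, the usual convention is to average the two middle values.

```python
import statistics

def median(values):
    """Return the middle value of the sorted data (mean of the two middle values if even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

# Hypothetical sample, deliberately unsorted to show that ordering comes first.
data = [20, 10, 17, 15, 19]

print(median(data))              # 17
print(median([10, 15, 17, 19]))  # 16.0
print(statistics.median(data))   # 17
```
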

Quartiles: The Quartiles divide the ordered data into four equal parts. There are three quartiles. The second quartile divides the data into two halves and is therefore the same as the Median. The first (lower) quartile (Q1) marks off the first one-fourth of the data, and the third (upper) quartile (Q3) marks off the first three-fourths.

Inter-Quartile Range: It is defined as the difference between the third (upper) quartile (Q3) and the first (lower) quartile (Q1).
    Interquartile Range = Q3 - Q1
Quartile Deviation: It is half of the difference between the third (upper) quartile (Q3) and the first (lower) quartile (Q1). Hence, it is also called the Semi Inter-Quartile Range.
    Quartile Deviation = (Q3 - Q1)/2
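
The quartile-based measures can be sketched with Python's standard library. The sample is hypothetical, and the quartiles come from `statistics.quantiles` with its default "exclusive" method; other interpolation conventions can give slightly different values for Q1 and Q3 on the same data.

```python
import statistics

# Hypothetical sample, for illustration only.
data = [10, 15, 17, 19, 20]

# n=4 asks for the three cut points that split the data into four parts.
q1, q2, q3 = statistics.quantiles(data, n=4)

iqr = q3 - q1                     # Interquartile Range = Q3 - Q1
quartile_deviation = (q3 - q1) / 2  # Quartile Deviation = (Q3 - Q1)/2

print(q1, q2, q3)          # 12.5 17.0 19.5
print(iqr)                 # 7.0
print(quartile_deviation)  # 3.5
```

Here Q2 equals the median of the sample, as expected.
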

Mode: The Mode is the value in the data that occurs most frequently. It is an actual value, which has the highest concentration of items in and around it.
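
A short sketch of finding the mode by counting occurrences, using hypothetical data. A data set can have more than one mode, so the second call uses `statistics.multimode`, which returns every value tied for the highest frequency.

```python
import statistics
from collections import Counter

# Hypothetical shoe sizes, for illustration only.
shoe_sizes = [7, 8, 8, 9, 8, 10, 9]

# Count how often each value occurs and take the most frequent one.
counts = Counter(shoe_sizes)
mode_value, mode_count = counts.most_common(1)[0]
print(mode_value, mode_count)  # 8 3

# When several values tie for the highest frequency, all of them are modes.
print(statistics.multimode([1, 1, 2, 2, 3]))  # [1, 2]
```
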

Measures of Dispersion

Mean Deviation: Mean Deviation is the arithmetic mean of the deviations of a series computed from any measure of central tendency, i.e., the Mean, Median or Mode. All the deviations are taken as positive, i.e., signs are ignored.
According to Clark and Schkade: "Average deviation is the average amount of scatter of the items in a distribution from either the mean or the median, ignoring the signs of the deviations."
We can compute the mean deviation from any one of the three averages. The Median is theoretically preferred, because the sum of absolute deviations is smallest when taken about the Median. In general practice, however, owing to the wide application of the Mean, the mean deviation is usually computed from the Mean.
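
The calculation can be sketched as follows, using a hypothetical sample and an illustrative helper named `mean_deviation`. It computes the mean deviation about both the mean and the median, and on this sample the value about the median is the smaller of the two.

```python
import statistics

def mean_deviation(values, center):
    """Average of the absolute deviations of the values from the given center."""
    return sum(abs(x - center) for x in values) / len(values)

# Hypothetical sample, for illustration only.
data = [10, 15, 17, 19, 20]

md_about_mean = mean_deviation(data, statistics.mean(data))
md_about_median = mean_deviation(data, statistics.median(data))

print(round(md_about_mean, 2))  # 2.96
print(md_about_median)          # 2.8
```
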
Standard Deviation: Karl Pearson introduced the concept of Standard Deviation in 1893. It is the most important measure of dispersion and is widely used in many statistical formulae. Standard Deviation is also called the Root-Mean-Square Deviation, since it is the root mean square of the deviations from the mean. The closely related Root-Mean-Square Error (RMSE) applies the same calculation to the differences between predicted and actual values, and is one of the most important metrics we use while checking the fit of our Machine Learning Regression models.
Definition: "It is defined as the positive square-root of the arithmetic mean of the squares of the deviations of the given data from their arithmetic mean."

Variance: Square of the Standard Deviation is called Variance.
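
A minimal sketch of both measures, following the definition above step by step on a hypothetical sample. It divides by the number of observations, which is the population form; the sample form divides by n - 1 instead and is available as `statistics.variance` and `statistics.stdev`.

```python
import math
import statistics

# Hypothetical sample, for illustration only.
data = [10, 15, 17, 19, 20]

n = len(data)
mean = sum(data) / n

# Squares of the deviations of the data from their arithmetic mean.
squared_deviations = [(x - mean) ** 2 for x in data]

variance = sum(squared_deviations) / n  # arithmetic mean of the squared deviations
std_dev = math.sqrt(variance)           # positive square root of the variance

print(round(variance, 2))  # 12.56
print(round(std_dev, 3))   # 3.544

# The standard library's population versions agree with the manual calculation.
print(math.isclose(variance, statistics.pvariance(data)))  # True
print(math.isclose(std_dev, statistics.pstdev(data)))      # True
```
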

