
Overview of PCA

 

Image from Pixabay
What is PCA?

Working with high-dimensional data is challenging. Today we can capture data on far more aspects (variables) than ever before, and a single instance may be described by thousands of variables. Analyzing all of these variables and estimating each one's effect (coefficient) on the target variable requires substantial computing power and energy, and even then we cannot be sure that the resulting model is the best one. Principal Component Analysis addresses this issue by giving us a smaller set of new variables, the Principal Components. This blog discusses the terminology and implementation of Principal Component Analysis.

Overview of PCA

Principal Component Analysis is an unsupervised Machine Learning technique used to reduce the dimensionality of data while preserving as much of its statistical information (variance) as possible. It finds new variables, the Principal Components, which are linear combinations of the existing variables in the dataset. These components successively maximize the variance and are uncorrelated with each other. There are as many Principal Components as there are variables in the data, but because each successive component explains less variance than the one before it, we usually keep only the first few, which together explain most of the variability. Adding further components contributes little.
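The idea above can be sketched in a few lines of NumPy. This is a minimal illustration and not a production implementation: it assumes PCA is computed from the eigendecomposition of the covariance matrix, and the function name `pca` and the synthetic three-variable dataset are made up for this example.

```python
import numpy as np

def pca(X, n_components):
    """Project X (n_samples x n_features) onto its top principal components."""
    # Center each variable so that variance is measured around the mean.
    X_centered = X - X.mean(axis=0)
    # Covariance matrix of the variables (columns).
    cov = np.cov(X_centered, rowvar=False)
    # eigh suits symmetric matrices; it returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Reorder so the component with the largest variance comes first.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Share of the total variance explained by each component.
    explained_ratio = eigvals / eigvals.sum()
    # Each column of `components` holds the weights of one linear combination.
    components = eigvecs[:, :n_components]
    scores = X_centered @ components
    return scores, components, explained_ratio

# Hypothetical data: the third variable is almost the sum of the first two,
# so two components should capture nearly all of the variance.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X = np.column_stack([a, b, a + b + 0.05 * rng.normal(size=200)])

scores, components, ratio = pca(X, n_components=2)
print("explained variance ratio:", ratio.round(4))
print("covariance of the scores:\n", np.cov(scores, rowvar=False).round(4))
```

The explained variance ratio is sorted from largest to smallest, which is the "successively maximize the variance" property, and the off-diagonal entries of the score covariance are numerically zero, which is the "uncorrelated" property.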

Curse of Dimensionality

In a Machine Learning model, adding more variables often improves performance at first, but too much of anything is bad. If we keep adding variables until they eventually outnumber our instances, many algorithms struggle to build efficient models. This is termed the Curse of Dimensionality.
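One way to see why having more variables than instances is a problem is the small experiment below. It is only a sketch under stated assumptions: the features and the target are pure random noise, the model is ordinary least squares with an intercept, and the helper names `r_squared` and `fit_noise` are invented for this illustration.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Fraction of the variance in y_true explained by y_pred."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def fit_noise(n_samples, n_features, seed=0):
    """Fit least squares to a target that is unrelated to the features."""
    rng = np.random.default_rng(seed)
    features = rng.normal(size=(n_samples, n_features))
    y = rng.normal(size=n_samples)  # pure noise, there is nothing to learn
    # Add an intercept column so that R^2 is well defined.
    X = np.column_stack([np.ones(n_samples), features])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return r_squared(y, X @ coef)

# 20 instances, with a growing number of variables.
train_r2 = {p: fit_noise(20, p) for p in (2, 10, 30)}
for p, r2 in train_r2.items():
    print(f"{p:>2} variables: training R^2 = {r2:.3f}")
```

Once the variables outnumber the 20 instances, the model reproduces the training target exactly even though the target is noise, so a perfect training fit tells us nothing about how the model will behave on new data.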



