Machine Learning



----------------------------------------------------------------------------------------------------------------------
BASIC TERMS

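Mean: The arithmetic average of numerical values, i.e. the sum divided by the count (e.g. Salary, Marks)

Median: The middle value once the observations are sorted (e.g. Height)

Mode: The value that occurs most frequently (e.g. Absenteeism)

Note: All three are measures of central tendency (types of average)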
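
A minimal sketch of the three averages using Python's built-in statistics module. The marks list is made-up data for illustration only.

```python
import statistics

# Hypothetical exam marks (illustrative data, not from any real source).
marks = [55, 60, 60, 70, 75, 80, 90]

mean_mark = statistics.mean(marks)      # sum of values / number of values
median_mark = statistics.median(marks)  # middle value of the sorted data
mode_mark = statistics.mode(marks)      # most frequently occurring value

print("Mean:", mean_mark)
print("Median:", median_mark)
print("Mode:", mode_mark)
```

Here the mean and median are both 70, while the mode is 60 because 60 is the only mark that appears twice.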
----------------------------------------------------------------------------------------------------------------------
Data distribution:

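When the lines of the Mean, Median and Mode coincide, the data is symmetric around a single peak. This is a characteristic of normally distributed data, although the three coinciding does not by itself prove that the data is normal.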
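
A small sketch comparing a symmetric sample with a skewed one, to show when the three measures coincide. Both samples and the helper name central_values are made up for illustration.

```python
import statistics

def central_values(data):
    """Return (mean, median, mode) for a list of numbers."""
    return statistics.mean(data), statistics.median(data), statistics.mode(data)

# Symmetric, single-peaked sample: the three measures coincide.
symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]

# Right-skewed sample: the long tail pulls the mean above the median and mode.
skewed = [1, 1, 1, 2, 2, 3, 4, 8, 14]

print("Symmetric:", central_values(symmetric))
print("Skewed:   ", central_values(skewed))
```

For the symmetric sample all three values are 3. For the skewed sample the mean (4) is larger than the median (2), which is larger than the mode (1).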
----------------------------------------------------------------------------------------------------------------------
4 types of Analysis:
  • Descriptive: What happened? BI reporting
  • Diagnostic: Why did it happen?
  • Predictive: What can happen?
  • Prescriptive: What should be done to get a better outcome?
----------------------------------------------------------------------------------------------------------------------
Types of Machine Learning:




Supervised: Provide features & labels; the model learns to predict the label

Unsupervised: Provide features only (no labels); the model finds groups or patterns on its own
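
A toy sketch of the difference, in plain Python. The supervised part uses a 1-nearest-neighbour rule and the unsupervised part uses a two-group k-means on one feature. Both techniques, the function names and the data are chosen only for illustration; real projects would normally use a library implementation.

```python
def nearest_neighbour_predict(train_features, train_labels, x):
    """Supervised: predict the label of x from the closest labelled example."""
    distances = [abs(f - x) for f in train_features]
    return train_labels[distances.index(min(distances))]

def two_means_1d(values, iterations=10):
    """Unsupervised: split unlabelled 1-D values into two groups (k-means, k=2).

    Assumes the values contain at least two distinct numbers.
    """
    low, high = min(values), max(values)  # initial group centres
    for _ in range(iterations):
        group_low = [v for v in values if abs(v - low) <= abs(v - high)]
        group_high = [v for v in values if abs(v - low) > abs(v - high)]
        low = sum(group_low) / len(group_low)
        high = sum(group_high) / len(group_high)
    return group_low, group_high

# Hypothetical data: feature = hours studied, label = exam result.
hours = [1, 2, 3, 8, 9, 10]
results = ["fail", "fail", "fail", "pass", "pass", "pass"]

# Supervised: features AND labels are given; we predict a label for new data.
print(nearest_neighbour_predict(hours, results, 7))

# Unsupervised: only features are given; the groups are discovered.
print(two_means_1d(hours))
```

The supervised call needs the results list, while the unsupervised call only ever sees the hours.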
----------------------------------------------------------------------------------------------------------------------
  
Statistical technique lifecycle:

1) Data collection

2) Data processing, tabulation & presentation

3) Measure of central tendency (Average=Mean, Median, Mode)

4) Measure of dispersion: Standard deviation and variance

5) Correlation

6) Regression (Relating the independent feature, on the X axis, to the dependent feature, on the Y axis)

7) Advanced statistical techniques
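
Steps 4 to 6 can be sketched with the standard library as follows. The data is made up, and the correlation and regression are written out by hand (Pearson correlation and a least-squares straight line) purely to show the arithmetic.

```python
import math
import statistics

# Hypothetical data: x = years of experience (independent feature),
# y = salary in thousands (dependent feature).
x = [1, 2, 3, 4, 5]
y = [30, 35, 45, 50, 65]

# Step 4: measures of dispersion for y (population variance and std deviation).
variance_y = statistics.pvariance(y)
std_dev_y = statistics.pstdev(y)

# Step 5: Pearson correlation coefficient between x and y.
mean_x, mean_y = statistics.mean(x), statistics.mean(y)
cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
ss_x = sum((a - mean_x) ** 2 for a in x)
ss_y = sum((b - mean_y) ** 2 for b in y)
correlation = cov_xy / math.sqrt(ss_x * ss_y)

# Step 6: least-squares regression line y = slope * x + intercept.
slope = cov_xy / ss_x
intercept = mean_y - slope * mean_x

print("Variance:", variance_y, "Std deviation:", round(std_dev_y, 2))
print("Correlation:", round(correlation, 3))
print("Regression line: y =", slope, "* x +", intercept)
```

A correlation close to 1 says x and y rise together; the regression line then gives a way to estimate y for a new x.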
----------------------------------------------------------------------------------------------------------------------

TYPES OF GRAPH:






HISTOGRAM

  


Box plot:


Outliers in Box plot (e.g.):

A person with a very low salary (Cleaning staff)

A person with a very high salary (CEO)
----------------------------------------------------------------------------------------------------------------------

Minimum Score

The lowest score, excluding outliers (shown at the end of the left whisker).

Lower Quartile

Twenty-five percent of scores fall below the lower quartile value (also known as the first quartile).

Median

The median marks the mid-point of the data and is shown by the line that divides the box into two parts (sometimes known as the second quartile). Half the scores are greater than or equal to this value and half are less.

Upper Quartile

Seventy-five percent of the scores fall below the upper quartile value (also known as the third quartile). Thus, 25% of data are above this value.

Maximum Score

The highest score, excluding outliers (shown at the end of the right whisker).

Whiskers

The upper and lower whiskers represent scores outside the middle 50% (i.e. the lower 25% of scores and the upper 25% of scores).

The Interquartile Range (or IQR)

This is the box in the box plot, showing the middle 50% of scores (i.e., the range between the 25th and 75th percentile).
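
The box plot values above can be computed as in the sketch below. The salary data is made up, and the 1.5 × IQR rule used to flag outliers is a common convention assumed here, not something these notes specify. The function name box_plot_summary is also just illustrative.

```python
import statistics

def box_plot_summary(scores):
    """Return quartiles, whisker ends and outliers for a list of numbers.

    Outliers are flagged with the common 1.5 * IQR rule (an assumption).
    """
    data = sorted(scores)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return {
        "minimum": inside[0],    # end of the lower whisker (outliers excluded)
        "q1": q1,                # lower quartile
        "median": median,        # second quartile
        "q3": q3,                # upper quartile
        "maximum": inside[-1],   # end of the upper whisker (outliers excluded)
        "iqr": iqr,              # width of the box
        "outliers": outliers,
    }

# Hypothetical salaries in thousands: one very low and one very high value.
salaries = [5, 40, 42, 45, 48, 50, 52, 55, 300]
summary = box_plot_summary(salaries)
print(summary)
```

With this data the very low salary (5) and the very high salary (300) fall outside the fences and are reported as outliers, so the whiskers end at 40 and 55.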


 ----------------------------------------------------------------------------------------------------------------------  
Python Libraries used for ML:

  • NumPy: Used for multidimensional arrays
  • Pandas: Used to read data from different data sources
  • Matplotlib: Visualisation library
  • Seaborn: Visualisation library built on Matplotlib
  • Scikit-learn: Contains ML algorithms
---------------------------------------------------------------------To be continued