Data Science

From Sinfronteras
 
</accesscontrol>
 
  
 
<br />
 
 
==Projects portfolio==
 
 
<div style="margin-left: 20px; width: 550pt; margin-top: 50px !important">
 
  
 
===Data Levels and Measurement===
 
Levels of Measurement - Measurement scales
 
 
https://www.statisticssolutions.com/data-levels-and-measurement/
 
 
 
 
 
There are four Levels of Measurement in research and statistics: Nominal, Ordinal, Interval, and Ratio.
 
 
 
 
 
In Practice:
 
* Most schemes accommodate just two levels of measurement: nominal and ordinal
 
* There is one special case: dichotomy (otherwise known as a "boolean" attribute)
 
 
 
 
 
{| class="wikitable"
 
! rowspan="2" |
 
! rowspan="2" |
 
! rowspan="2" style="width:80px; background-color:#E6B0AA" |Values have meaningful order
 
! rowspan="2" style="width:80px; background-color:#A9DFBF" |Distance between values is defined
 
! colspan="3" style="width:80px; background-color:#FDEBD0" |'''Mathematical operations make sense'''
 
(Values can be used to perform '''mathematical operations''')
 
! rowspan="2" style="width:80px; background-color:#AED6F1" |There is a meaningful zero-point
 
! colspan="5" style="width:80px; background-color:#D7BDE2" |Values can be used to perform statistical computations
 
! rowspan="2" |Example
 
|-
 
! style="width:80px; background-color:#FDEBD0" | '''Comparison operators'''
 
! style="width:80px; background-color:#FDEBD0" | Addition and subtraction

! style="width:80px; background-color:#FDEBD0" | Multiplication and division

! style="width:80px; background-color:#D7BDE2" | "Counts", aka "Frequency of Distribution"
 
! style="width:80px; background-color:#D7BDE2" | Mode
 
! style="width:80px; background-color:#D7BDE2" | Median
 
! style="width:80px; background-color:#D7BDE2" | Mean
 
! style="width:80px; background-color:#D7BDE2" | Std
 
|-
 
!'''Nominal'''
 
|Values serve only as labels. Also called "categorical", "enumerated", or "discrete". However, "enumerated" and "discrete" imply order
 
| colspan="11" style="margin: 0; padding: 0;" |
 
{| class="mw-collapsible mw-collapsed wikitable" style="margin: 0; padding: 0;"
 
|- style="vertical-align:middle;"
 
| style="height:100px; text-align:center; width:80px;" |<div style="text-align:center;"><span style="color:grey; font-size:15pt;">✘</span></div>

| style="height:100px; text-align:center; width:80px;" |<div style="text-align:center;"><span style="color:grey; font-size:15pt;">✘</span></div>

| style="height:100px; text-align:center; width:80px;" |<div style="text-align:center;"><span style="color:grey; font-size:15pt;">✘</span></div>

| style="height:100px; text-align:center; width:80px;" |<div style="text-align:center;"><span style="color:grey; font-size:15pt;">✘</span></div>

| style="height:100px; text-align:center; width:80px;" |<div style="text-align:center;"><span style="color:grey; font-size:15pt;">✘</span></div>

| style="height:100px; text-align:center; width:80px;" |<div style="text-align:center;"><span style="color:grey; font-size:15pt;">✘</span></div>

| style="height:100px; text-align:center; width:80px;" |<div style="text-align:center;"><span style="color:blue; font-size:20pt;">✔</span></div>

| style="height:100px; text-align:center; width:80px;" |<div style="text-align:center;"><span style="color:blue; font-size:20pt;">✔</span></div>

| style="height:100px; text-align:center; width:80px;" |<div style="text-align:center;"><span style="color:grey; font-size:15pt;">✘</span></div>

| style="height:100px; text-align:center; width:80px;" |<div style="text-align:center;"><span style="color:grey; font-size:15pt;">✘</span></div>

| style="height:100px; text-align:center; width:80px; vertical-align:top; padding-top:70px;" |<div style="text-align:center;"><span style="color:grey; font-size:15pt;">✘</span></div>
 
|- style="vertical-align:top;"
 
| style="height:100px; text-align:left; width:80px;" |
 
Values don't have any meaningful order
 
| style="height:100px; text-align:left; width:80px;" |
 
No distance between values is defined
 
| colspan="3" style="height:100px; text-align:left; width:80px;" |
 
Values don't carry any mathematical meaning
 
|
 
| style="height:100px; text-align:left; width:80px;" |
 
| style="height:100px; text-align:left; width:80px;" |
 
| colspan="3" style="height:100px; text-align:left; width:80px;" |
 
Values cannot be used to perform many statistical computations, such as mean and standard deviation
 
|-
 
| colspan="11" |These constraints apply even if the values are numbers.  For example, to categorize males and females we could use 1 for male and 2 for female. However, the values 1 and 2 in this case don't have any meaningful order or carry any mathematical meaning. They are simply used as labels. <nowiki>https://www.statisticssolutions.com/data-levels-and-measurement/</nowiki>
 
|}
 
|For an '''«outlook»''' attribute in weather data, potential values could be "sunny", "overcast", and "rainy".
 
|-
 
!'''Ordinal'''
 
|Ordinal attributes are sometimes called "numeric" or "continuous"; however, "continuous" implies mathematical continuity
 
| colspan="11" style="margin: 0; padding: 0;" |
 
{| class="mw-collapsible mw-collapsed wikitable" style="margin: 0; padding: 0;"
 
|- style="vertical-align:middle; margin: 0; padding: 0;"
 
| style="height:100px; text-align:center; width:80px;" |<div style="text-align:center;"><span style="color:blue; font-size:20pt;">✔</span></div>

| style="height:100px; text-align:center; width:80px;" |<div style="text-align:center;"><span style="color:grey; font-size:15pt;">✘</span></div>

| style="height:100px; text-align:center; width:80px;" |<div style="text-align:center;"><span style="color:blue; font-size:20pt;">✔</span></div>

| style="height:100px; text-align:center; width:80px;" |<div style="text-align:center;"><span style="color:grey; font-size:15pt;">✘</span></div>

| style="height:100px; text-align:center; width:80px;" |<div style="text-align:center;"><span style="color:grey; font-size:15pt;">✘</span></div>

| style="height:100px; text-align:center; width:80px;" |<div style="text-align:center;"><span style="color:grey; font-size:15pt;">✘</span></div>

| style="height:100px; text-align:center; width:80px;" |<div style="text-align:center;"><span style="color:blue; font-size:20pt;">✔</span></div>

| style="height:100px; text-align:center; width:80px;" |<div style="text-align:center;"><span style="color:blue; font-size:20pt;">✔</span></div>

| style="height:100px; text-align:center; width:80px;" |<div style="text-align:center;"><span style="color:blue; font-size:20pt;">✔</span></div>

| style="height:100px; text-align:center; width:80px;" |<div style="text-align:center;"><span style="color:grey; font-size:15pt;">✘</span></div>

| style="height:100px; text-align:center; width:80px; vertical-align:top; padding-top:70px;" |<div style="text-align:center;"><span style="color:grey; font-size:15pt;">✘</span></div>
 
|- style="vertical-align:top;"
 
| style="height:100px; text-align:left; width:80px;" |
 
Values have a meaningful order
 
| style="height:100px; text-align:left; width:80px;" |
 
No distance between values is defined
 
| style="height:100px; text-align:left; width:80px;" |
 
Only comparison operators make sense
 
| colspan="2" |Mathematical operations such as addition, subtraction, multiplication, etc. do not make sense
 
|
 
| style="height:100px; text-align:left; width:80px;" |
 
| style="height:100px; text-align:left; width:80px;" |
 
| style="height:100px; text-align:left; width:80px;" |
 
|
 
|
 
|-
 
| colspan="11" |For example, an '''«Education level»''' attribute with possible values of '''«high school»''', '''«undergraduate degree»''', and '''«graduate degree»'''. There is a definitive order to the categories (i.e., graduate is higher than undergraduate, and undergraduate is higher than high school), but we cannot make any other arithmetic assumption.  For instance, we cannot assume that the difference in education level between undergraduate and high school is the same as the difference between graduate and undergraduate.
 
 
 
The distinction between nominal and ordinal is not always clear (e.g., the "outlook" attribute)
 
|}
 
|A '''«temperature»''' attribute in weather data with potential values of "hot" > "warm" > "cool"
 
|-
 
!'''Interval'''
 
|
 
| colspan="11" style="margin: 0; padding: 0;" |
 
{| class="mw-collapsible mw-collapsed wikitable" style="margin: 0; padding: 0;"
 
|- style="vertical-align:middle;"
 
| style="height:100px; text-align:center; width:80px;" |<div style="text-align:center;"><span style="color:blue; font-size:20pt;">✔</span></div>

| style="height:100px; text-align:center; width:80px;" |<div style="text-align:center;"><span style="color:blue; font-size:20pt;">✔</span></div>

| style="height:100px; text-align:center; width:80px;" |<div style="text-align:center;"><span style="color:blue; font-size:20pt;">✔</span></div>

| style="height:100px; text-align:center; width:80px;" |<div style="text-align:center;"><span style="color:blue; font-size:20pt;">✔</span></div>

| style="height:100px; text-align:center; width:80px;" |<div style="text-align:center;"><span style="color:grey; font-size:15pt;">✘</span></div>

| style="height:100px; text-align:center; width:80px;" |<div style="text-align:center;"><span style="color:grey; font-size:15pt;">✘</span></div>

| style="height:100px; text-align:center; width:80px;" |<div style="text-align:center;"><span style="color:blue; font-size:20pt;">✔</span></div>

| style="height:100px; text-align:center; width:80px;" |<div style="text-align:center;"><span style="color:blue; font-size:20pt;">✔</span></div>

| style="height:100px; text-align:center; width:80px;" |<div style="text-align:center;"><span style="color:blue; font-size:20pt;">✔</span></div>

| style="height:100px; text-align:center; width:80px;" |<div style="text-align:center;"><span style="color:blue; font-size:20pt;">✔</span></div>

| style="height:100px; text-align:center; width:80px; vertical-align:top; padding-top:60px;" |<div style="text-align:center;"><span style="color:blue; font-size:20pt;">✔</span></div>
 
|- style="vertical-align:top;"
 
| style="height:100px; text-align:left; width:80px;" |
 
| style="height:100px; text-align:left; width:80px;" |
 
Distance between values is defined. In other words, we can quantify the difference between values
 
| style="height:100px; text-align:left; width:80px;" |
 
Comparison operators make sense
 
|Addition and subtraction make sense

|Multiplication and division do not make sense
 
|Interval variables often do not have a meaningful zero-point.
 
| style="height:100px; text-align:left; width:80px;" |
 
| style="height:100px; text-align:left; width:80px;" |
 
| style="height:100px; text-align:left; width:80px;" |
 
|
 
|(not sure)
 
|-
 
| colspan="11" |An example of an interval variable would be a '''«Temperature»''' attribute.  We can correctly assume that the difference between 70 and 80 degrees is the same as the difference between 80 and 90 degrees.  However, the mathematical operations of multiplication and division do not apply to interval variables.  For instance, we cannot accurately say that 100 degrees is twice as hot as 50 degrees.  Additionally, interval variables often do not have a meaningful zero-point.  For example, a temperature of zero degrees (on Celsius and Fahrenheit scales) does not mean a complete absence of heat.
 
 
 
 
 
An interval variable can be used to compute commonly used statistical measures such as the average (mean), standard deviation, and the Pearson correlation coefficient. <nowiki>https://www.statisticssolutions.com/data-levels-and-measurement/</nowiki>
 
|}
 
|A '''«temperature»''' attribute composed of numeric measurements of that property
 
|-
 
!'''Ratio'''
 
|
 
| colspan="11" style="margin: 0; padding: 0;" |
 
{| class="mw-collapsible mw-collapsed wikitable" style="margin: 0; padding: 0;"
 
|- style="vertical-align:middle;"
 
| style="height:100px; text-align:center; width:80px;" |<div style="text-align:center;"><span style="color:blue; font-size:20pt;">✔</span></div>

| style="height:100px; text-align:center; width:80px;" |<div style="text-align:center;"><span style="color:blue; font-size:20pt;">✔</span></div>

| style="height:100px; text-align:center; width:80px;" |<div style="text-align:center;"><span style="color:blue; font-size:20pt;">✔</span></div>

| style="height:100px; text-align:center; width:80px;" |<div style="text-align:center;"><span style="color:blue; font-size:20pt;">✔</span></div>

| style="height:100px; text-align:center; width:80px;" |<div style="text-align:center;"><span style="color:blue; font-size:20pt;">✔</span></div>

| style="height:100px; text-align:center; width:80px;" |<div style="text-align:center;"><span style="color:blue; font-size:20pt;">✔</span></div>

| style="height:100px; text-align:center; width:80px;" |<div style="text-align:center;"><span style="color:blue; font-size:20pt;">✔</span></div>

| style="height:100px; text-align:center; width:80px;" |<div style="text-align:center;"><span style="color:blue; font-size:20pt;">✔</span></div>

| style="height:100px; text-align:center; width:80px;" |<div style="text-align:center;"><span style="color:blue; font-size:20pt;">✔</span></div>

| style="height:100px; text-align:center; width:80px;" |<div style="text-align:center;"><span style="color:blue; font-size:20pt;">✔</span></div>

| style="height:100px; text-align:center; width:80px; vertical-align:top; padding-top:60px;" |<div style="text-align:center;"><span style="color:blue; font-size:20pt;">✔</span></div>
 
|- style="vertical-align:top;"
 
| style="height:100px; text-align:left; width:80px;" |
 
| style="height:100px; text-align:left; width:80px;" |
 
| colspan="3" style="height:100px; text-align:left; width:80px;" |
 
All arithmetic operations are possible on a ratio variable
 
|Ratio variables have a meaningful zero-point
 
| style="height:100px; text-align:left; width:80px;" |
 
| style="height:100px; text-align:left; width:80px;" |
 
| style="height:100px; text-align:left; width:80px;" |
 
|
 
|
 
|-
 
| colspan="11" |An example of a ratio variable would be weight (e.g., in pounds).  We can accurately say that 20 pounds is twice as heavy as 10 pounds.  Additionally, ratio variables have a meaningful zero-point (e.g., exactly 0 pounds means the object has no weight).
 
 
 
 
 
A ratio variable can be used as a dependent variable for most parametric statistical tests such as t-tests, F-tests, correlation, and regression. <nowiki>https://www.statisticssolutions.com/data-levels-and-measurement/</nowiki>
 
|}
 
|The '''«weight»''' (e.g., in pounds)
 
 
 
Other examples: gross sales and income of a company.
 
|}
 
 
 
 
 
<br />
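The distinctions in the table can be illustrated in code. Below is a minimal Python sketch (the example values are hypothetical) showing which summary statistics are appropriate at each level of measurement:

```python
# A sketch of the levels-of-measurement table, using hypothetical example values.
from statistics import mean, median, mode, stdev

# Nominal: values are labels only -> frequency counts and the mode make sense
outlook = ["sunny", "overcast", "rainy", "sunny", "sunny"]
print(mode(outlook))                  # most frequent label: 'sunny'

# Ordinal: values have a meaningful order -> comparisons and the median make sense
# (1 = high school < 2 = undergraduate < 3 = graduate)
education = [1, 2, 2, 3]
print(median(education))              # 2.0

# Interval: distances are defined -> mean and standard deviation make sense,
# but ratios do not (20 degrees is not "twice as hot" as 10 degrees)
temperature_c = [10.0, 20.0, 30.0]
print(mean(temperature_c), stdev(temperature_c))

# Ratio: a meaningful zero-point -> ratios make sense
weights_lb = [10.0, 20.0]
print(weights_lb[1] / weights_lb[0])  # 20 lb really is twice as heavy as 10 lb
```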
 
 
 
===What is an example===
 
[[File:Observations-data_sciences.png|250px|thumb|right|Taken from https://www.youtube.com/watch?v=XAdTLtvrkFM]]
 
 
 
An example, also known in statistics as '''an observation''', is an instance of the phenomenon that we are studying. An observation is characterized by one or a set of attributes (variables).
 
 
 
 
 
In data science, we record observations on the rows of a table.
 
 
 
 
 
For example, imagine that we are recording the vital signs of a patient. For each observation, we would record the «date of the observation», the patient's «heart rate», and the «temperature».
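A minimal sketch of such observations as rows of a table (the values below are hypothetical):

```python
# Sketch: each observation (row) records the vital signs of a patient
# at a given date. All values are hypothetical.
observations = [
    # date,        heart rate (bpm), temperature (C)
    ("2020-01-01", 72,               36.6),
    ("2020-01-02", 80,               37.1),
]

# Each row is one observation; each column is one attribute (variable).
for date, heart_rate, temperature in observations:
    print(date, heart_rate, temperature)
```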
 
 
 
 
 
<br />
 
 
 
===What is a dataset===
 
[Noel Cosgrave slides]
 
 
 
* A dataset is typically a matrix of observations (in rows) and their attributes (in columns).
 
 
 
* It is usually stored as:
 
:* Flat file, such as comma-separated values (CSV) or tab-separated values (TSV). A flat file can be a plain text file or a binary file.
 
:* Spreadsheets
 
:* Database table
 
 
 
* It is by far the most common form of data used in practical data mining and predictive analytics. However, it is a restrictive form of input as it is impossible to represent relationships between observations.
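As a sketch, a flat-file dataset in CSV form can be read into a matrix of observations (rows) and attributes (columns) with only the standard library (the weather values are hypothetical):

```python
# Sketch: reading a flat-file (CSV) dataset into rows and columns.
import csv
import io

# Hypothetical weather data; in practice this would be an open() file handle.
raw = io.StringIO(
    "outlook,temperature,play\n"
    "sunny,hot,no\n"
    "overcast,warm,yes\n"
)

reader = csv.reader(raw)
header = next(reader)   # attribute names (the columns)
rows = list(reader)     # one list per observation (the rows)
print(header)           # ['outlook', 'temperature', 'play']
print(rows[0])          # ['sunny', 'hot', 'no']
```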
 
 
 
 
 
<br />
 
===What is Metadata===
 
Metadata is information about the background of the data. It can be thought of as "data about the data" and contains: [Noel Cosgrave slides]
 
 
 
* Description of the variables.
 
* Information about the data types for each variable in the data.
 
* Restrictions on values the variables can hold.
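The three items above can be sketched in code as a small metadata record (the variable names, types, and restrictions below are hypothetical):

```python
# Sketch of metadata as "data about the data": a description, a data type,
# and value restrictions for each variable. All names are hypothetical.
metadata = {
    "outlook": {
        "description": "Weather outlook for the day",
        "dtype": "nominal",
        "allowed_values": {"sunny", "overcast", "rainy"},
    },
    "temperature": {
        "description": "Air temperature in degrees Celsius",
        "dtype": "interval",
        "min": -50.0,
        "max": 60.0,
    },
}

def is_valid(variable, value):
    """Check a value against the restrictions recorded in the metadata."""
    meta = metadata[variable]
    if "allowed_values" in meta:
        return value in meta["allowed_values"]
    return meta["min"] <= value <= meta["max"]

print(is_valid("outlook", "sunny"))    # True
print(is_valid("temperature", 100.0))  # False: outside the recorded range
```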
 
 
 
 
 
<br />
 
 
 
==What is Data Science==
 
There are many different terms that are related and sometimes even used as synonyms. It is actually hard to define and differentiate all these related disciplines such as:
 
 
 
Data Science - Data Analysis - Data Analytics - Predictive Data Analytics - Data Mining - Machine Learning - Big Data - AI and even more - Business Analytics.
 
 
 
https://www.loginworks.com/blogs/top-10-small-differences-between-data-analyticsdata-analysis-and-data-mining/
 
 
 
https://www.quora.com/What-is-the-difference-between-Data-Analytics-Data-Analysis-Data-Mining-Data-Science-Machine-Learning-and-Big-Data-1
 
 
 
 
 
[[File:Data_science-Data_analytics-Data_mining.png|500px|thumb|right|]]
 
 
 
 
 
[[File:Data_science-Data_mining.jpg|500px|thumb|right|]]
 
 
 
 
 
<br />
 
'''Data Science'''
 
<blockquote>
 
I think the broadest term is Data Science. It is a very broad discipline (an umbrella term) that encompasses many subsets, such as Data Analysis, Data Analytics, Data Mining, Machine Learning, Big Data, and several other related disciplines.
 
 
 
A general definition could be that Data Science is a multi-disciplinary field that uses aspects of statistics, computer science, applied mathematics, data visualization techniques, and even business analysis, with the goal of uncovering useful information and new knowledge from vast amounts of data, which can help in deriving conclusions and, usually, in making business decisions. http://makemeanalyst.com/what-is-data-science/  https://www.loginworks.com/blogs/top-10-small-differences-between-data-analyticsdata-analysis-and-data-mining/
 
</blockquote>
 
 
 
 
 
<br />
 
'''Data analysis'''
 
<blockquote>
 
Data analysis is itself a very broad process that includes many multi-disciplinary stages, some of which are not usually associated with Data Analytics or Data Mining:
 
 
 
* Defining a Business objective
 
* Data collection (Extracting the data)
 
* Data Storage
 
* Data Integration: Multiple data sources are combined. http://troindia.in/journal/ijcesr/vol3iss3/36-40.pdf
 
* Data Transformation: The data is transformed or consolidated into forms that are appropriate or valid for mining by performing various aggregation operations. http://troindia.in/journal/ijcesr/vol3iss3/36-40.pdf
 
 
 
* Data cleansing, Data modeling, Data mining, and Data visualization, with the goal of uncovering useful information that can help in deriving conclusions and usually in making business decisions. [EDUCBA] https://www.loginworks.com/blogs/top-10-small-differences-between-data-analyticsdata-analysis-and-data-mining/
 
:* This is the stage where we could use Data mining and ML techniques.
 
 
 
* Optimisation: Making the results more precise or accurate over time.
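The Data Transformation step above can be sketched as a simple aggregation operation (the records below are hypothetical):

```python
# Sketch of a data-transformation step: consolidating raw records by an
# aggregation operation (here, total sales per region; data is hypothetical).
from collections import defaultdict

raw_records = [
    {"region": "north", "sales": 100.0},
    {"region": "south", "sales": 250.0},
    {"region": "north", "sales": 50.0},
]

totals = defaultdict(float)
for record in raw_records:
    totals[record["region"]] += record["sales"]

print(dict(totals))  # {'north': 150.0, 'south': 250.0}
```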
 
</blockquote>
 
 
 
 
 
<br />
 
'''Data Mining''' (Data Analytics - Predictive Data Analytics) (So far, I believe these terms are practically the same)
 
<blockquote>
 
 
 
https://www.loginworks.com/blogs/top-10-small-differences-between-data-analyticsdata-analysis-and-data-mining/
 
 
 
We can say that Data Mining is a subset of Data Analysis. It is the process of (1) discovering hidden patterns in data and (2) developing predictive models, by using statistics, learning algorithms, and data visualization techniques.
 
 
 
 
 
For common methods in data mining, see the Styles of Learning - Types of Machine Learning section.
 
 
 
</blockquote>
 
 
 
 
 
<br />
 
'''Big Data'''
 
<blockquote>
 
Big data describes a massive amount of data that has the potential to be mined for information but is too large to be processed and analyzed using traditional data tools.
 
</blockquote>
 
 
 
 
 
<br />
 
'''Machine Learning'''
 
<blockquote>
 
While trying to find a definition for Machine Learning, I realized that many experts agree that there is no standard definition of ML.
 
 
 
This post explains the definition of ML well: https://machinelearningmastery.com/what-is-machine-learning/
 
 
 
These videos are also excellent for understanding what ML is:
 
 
 
:https://www.youtube.com/watch?v=f_uwKZIAeM0
 
:https://www.youtube.com/watch?v=ukzFI9rgwfU
 
:https://www.youtube.com/watch?v=WXHM_i-fgGo
 
:https://www.coursera.org/lecture/machine-learning/what-is-machine-learning-Ujm7v
 
 
 
 
 
One of the most cited definitions is Tom Mitchell's. In his book ''Machine Learning'', he provides a definition in the opening line of the preface:
 
 
 
<blockquote>
 
{| style="color: black; background-color: white; width: 100%; padding: 0px 0px 0px 0px; border:1px solid #ddddff;"
 
| style="width: 20%; height=10px; background-color: #D8BFD8; padding: 0px 5px 0px 10px; border:1px solid #ddddff; vertical-align:center; moz-border-radius: 0px; webkit-border-radius: 0px; border-radius:0px;" |
 
<!--==============================================================================-->
 
<span style="color:#0000FF">
 
'''''Tom Mitchell'''''
 
</span>
 
<!--==============================================================================-->
 
|-
 
| style="width: 20%; background-color: #2F4F4F; padding: 5px 5px 5px 10px; border:1px solid #ddddff; vertical-align:top;" |
 
<!--==============================================================================-->
 
<span style="color:#FFFFFF">
 
'''The field of machine learning is concerned with the question of how to construct computer programs that automatically improve with experience.'''
 
</span>
 
<!--==============================================================================-->
 
|}
 
 
 
'''So, in short, we can say that ML is about writing''' <span style="background:#D8BFD8">'''computer programs that improve themselves'''</span>.
 
</blockquote>
 
 
 
 
Tom Mitchell also provides a more complex and formal definition:
 
 
 
<blockquote>
 
{| style="color: black; background-color: white; width: 100%; padding: 0px 0px 0px 0px; border:1px solid #ddddff;"
 
| style="width: 20%; height=10px; background-color: #D8BFD8; padding: 0px 5px 0px 10px; border:1px solid #ddddff; vertical-align:center; moz-border-radius: 0px; webkit-border-radius: 0px; border-radius:0px;" |
 
<!--==============================================================================-->
 
<span style="color:#0000FF">
 
'''''Tom Mitchell'''''
 
</span>
 
<!--==============================================================================-->
 
|-
 
| style="width: 20%; background-color: #2F4F4F; padding: 5px 5px 5px 10px; border:1px solid #ddddff; vertical-align:top;" |
 
<!--==============================================================================-->
 
<span style="color:#FFFFFF">
 
'''A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.'''
 
</span>
 
<!--==============================================================================-->
 
|}
 
 
 
Don't let the definition of terms scare you off; this is a very useful formalism. It can be used as a design tool to help us think clearly about:
 
 
 
:'''E:''' What data to collect.
 
:'''T:''' What decisions the software needs to make.
 
:'''P:''' How we will evaluate its results.
 
 
 
Suppose your email program watches which emails you do or do not mark as spam, and based on that learns how to better filter spam. In this case: https://www.coursera.org/lecture/machine-learning/what-is-machine-learning-Ujm7v
 
 
 
:'''E:''' Watching you label emails as spam or not spam.
 
:'''T:''' Classifying emails as spam or not spam.
 
:'''P:''' The number (or fraction) of emails correctly classified as spam/not spam.
 
</blockquote>
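The E/T/P decomposition above can be made concrete in code. A minimal sketch (with hypothetical labels) of the performance measure P for the spam-filtering task T:

```python
# Sketch of Mitchell's performance measure P for the spam task T:
# the fraction of emails classified correctly. Labels are hypothetical.
def performance(predicted, actual):
    """P: fraction of emails whose predicted label matches the true label."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

actual    = ["spam", "not spam", "spam", "not spam"]
predicted = ["spam", "not spam", "not spam", "not spam"]
print(performance(predicted, actual))  # 0.75: three of four correct
```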
 
 
 
 
 
''Machine Learning'' and ''Data Mining'' are terms that overlap. This is logical because both use the same techniques (well, many of the techniques used in Data Mining are also used in ML). I'm talking about Supervised and Unsupervised Learning algorithms (which are also called Supervised ML and Unsupervised ML algorithms).
 
 
 
The difference is that in ML we want to construct computer programs that automatically improve with experience (computer programs that improve themselves).
 
 
 
We can, for instance, use a Supervised learning algorithm (Naive Bayes, for example) to build a model that classifies emails as spam or non-spam. We can use labeled training data to build the classifier and then use it to classify unlabeled data.
 
 
 
So far, even if this classifier is usually called an ML classifier, it is NOT strictly an ML program. It is just a Data Mining or Predictive Data Analytics task. It is not a strict ML program because the classifier is not automatically improving itself with experience.
 
 
 
Now, if we were able to use this classifier to develop a program that automatically gathers and adds more training data, rebuilds the classifier, and updates it when its performance improves, this would be a strict ML program, because the program automatically gathers new training data and updates the model so that it improves its performance.
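This "strict ML program" idea can be sketched as a retraining loop that only replaces the deployed model when its measured performance improves. The `train` and `evaluate` functions below are hypothetical stand-ins, not a real classifier:

```python
# Sketch: retrain on newly gathered data and keep the new model only when
# the performance measure P improves. train/evaluate are hypothetical stand-ins.
def retrain_if_better(current_model, current_score, train, evaluate, new_data):
    candidate = train(new_data)   # rebuild the classifier on the new data
    score = evaluate(candidate)   # measure its performance P
    if score > current_score:     # it improves with experience E -> deploy it
        return candidate, score
    return current_model, current_score

# Toy stand-ins: the "model" is just its training-set size, and evaluation
# rewards models trained on more data.
train = lambda data: len(data)
evaluate = lambda model: model / 100
model, score = retrain_if_better(5, 0.05, train, evaluate, list(range(10)))
print(model, score)  # the candidate (trained on 10 examples) replaces the old model
```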
 
 
 
 
 
</blockquote>
 
 
 
 
 
<br />
 
 
 
==Styles of Learning - Types of Machine Learning==
 
<br />
 
[[File:Machine_learning_types.jpg|700px|thumb|center|]]
 
 
 
 
 
<br />
 
===Supervised Learning===
 
Supervised Learning (Supervised ML):
 
 
 
https://en.wikipedia.org/wiki/Supervised_learning
 
https://machinelearningmastery.com/supervised-and-unsupervised-machine-learning-algorithms/
 
 
 
 
 
Supervised learning is the process of using training data (labeled data: input (<math>x</math>) - output (<math>y</math>) pairs) to produce a mapping function (<math>f</math>) that maps from the input variable (<math>x</math>) to the output variable (<math>y</math>).
 
 
 
Put more simply, in Supervised learning we have input variables (<math>x</math>) and an output variable (<math>y</math>) and we use an algorithm that is able to produce an inferred mapping function from the input to the output.
 
 
 
<math>y = f(x)</math>
 
 
 
The goal is to approximate the mapping function so well that when we have new input data (<math>x</math>), we can predict the output variable <math>y</math>.
 
 
 
 
 
The dependent variable is the variable that is to be predicted (<math>y</math>). An independent variable is a variable that is used to predict or explain the dependent variable (<math>x</math>).
 
 
 
 
 
<span style='color:red; font-size:13pt'>It is not so easy to see and understand the mathematical conceptual difference between Regression and Classification techniques. In both methods, we determine a function from an input variable to an output variable. It is clear that regression methods predict <span style='color:blue'>continuous</span> variables (the output variable is continuous), and classification predicts <span style='color:blue'>discrete</span> variables. Now, if we think about the mathematical conceptual difference, we must notice that '''regression is estimating the mathematical function that most closely fits the data'''. In some classification methods, it is clear that we are not estimating a mathematical function that fits the data, but just a method/algorithm/mapping function (I am not sure which term is most appropriate) that allows us to map the input to the output. This is, for example, clear in K-Nearest Neighbors, where the algorithm doesn't generate a mathematical function that fits the data but only a mapping function (again, I am not sure this is the best term) that actually (in the case of KNN) relies on the data (KNN determines the class of a given unlabeled observation by identifying the k nearest labeled observations to it). So, the mapping function obtained in KNN is attached to the training data. In this case, it is clear that KNN is not returning a mathematical function that fits the data. In Naïve Bayes, the mapping function obtained is not attached to the data. That is to say, applying the mapping function generated by NB doesn't require the training data (of course we require the training data to build the NB mapping function, but not to apply the generated function to classify a new unlabeled observation, which is the case in KNN).
However, we can see that the mathematical concept behind NB is not about finding a mathematical function that fits the data; it relies instead on a probabilistic approach (I need to analyze further what I have said here about NB). Now, when it comes to an algorithm like Decision Trees, it is not so clear to see and understand the mathematical conceptual difference between Regression and Classification. In DT, I think that (even if the output is a discrete variable) we are generating a mathematical function that fits the data. I can see that the method of doing so is not as clear as in the case of, for example, Linear regression, but it would in the end be a mathematical function that fits the data. I think this is why, with simple variations in the algorithm, decision trees can also be used as a regression method.</span>
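
To make the KNN point concrete, here is a minimal sketch (added for illustration; the toy data and the <code>knn_predict</code> name are invented): the "mapping function" produced by KNN cannot be applied without carrying the training data along.

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    # The KNN "mapping function": it cannot be applied without
    # the training data itself -- it is attached to the data.
    neighbours = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Two toy clusters of labeled observations (hypothetical data):
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B")]

print(knn_predict(train, (2, 2)))  # A -- the 3 nearest neighbours are all "A"
print(knn_predict(train, (8, 7)))  # B
```

Notice that there is no fitted formula anywhere: classifying a new point always requires re-scanning the stored training observations.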
 
 
 
 
 
<span style='color:blue; font-size:13pt'>In fact, Regression and classification methods are so closely related that: </span>
 
 
 
* <span style='color:blue; font-size:13pt'>Some algorithms can be used for both classification and regression with small modifications, such as Decision trees, SVM, and Artificial neural networks.</span> https://machinelearningmastery.com/classification-versus-regression-in-machine-learning/
 
 
 
* <span style='color:blue; font-size:13pt'>A regression algorithm may predict a discrete value, but the discrete value in the form of an integer quantity.</span> https://machinelearningmastery.com/classification-versus-regression-in-machine-learning/
 
 
 
* <span style='color:blue; font-size:13pt'>A classification algorithm may predict a continuous value, but the continuous value is in the form of a probability for a class label. </span>
 
:* Logistic Regression: Contrary to popular belief, logistic regression IS a regression model. It builds a regression model to predict the probability that a given data entry belongs to the category numbered as "1". Just like Linear regression assumes that the data follows a linear function, Logistic regression models the data using the sigmoid function. https://www.geeksforgeeks.org/understanding-logistic-regression/
 
 
 
* <span style='color:blue; font-size:13pt'>There are methods for implementing Regression using classification algorithms: </span>
 
:* https://www.sciencedirect.com/science/article/abs/pii/S1088467X97000139
 
:* https://www.dcc.fc.up.pt/~ltorgo/Papers/RegrThroughClass.pdf
 
 
 
 
 
 
 
<br />
 
<blockquote>
 
* '''Regression techniques''' (Correlation methods)
 
 
 
: A regression algorithm is able to approximate a mapping function (<math>f</math>) from input variables (<math>x</math>) '''to a <span style='color:red'>continuous</span> output variable''' (<math>y</math>). https://machinelearningmastery.com/classification-versus-regression-in-machine-learning/
 
: For example, the price of a house may be predicted using regression techniques.
 
 
 
 
 
: Perhaps we could say that regression analysis is the process of finding the mathematical function that most closely fits the data. The most common form of regression analysis is linear regression, which is the process of finding the line that most closely fits the data.
 
 
 
 
 
: The purpose of regression analysis is to: [Noel]
 
:* Predict the value of the dependent variable as a function of the value(s) of at least one independent variable.
 
:* Explain how changes in an independent variable are manifested in the dependent variable
 
 
 
 
 
<br />
 
:* Linear Regression
 
:* Decision Tree Regression
 
:* Support Vector Machines (SVM): It can be used for classification and regression analysis
 
:* Neural Network Regression
 
 
 
 
 
:: '''Regression algorithms are used for:'''
 
::* Prediction of continuous variables: future prices/cost, incomes, etc.
 
::* Housing Price Prediction: For example, a regression model could be used to predict the value of a house based on location, number of rooms, lot size, and other factors.
 
::* Weather forecasting: for example, a «temperature» attribute of weather data.
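
As a minimal illustration (invented data, not from the original text), NumPy's <code>polyfit</code> finds the least-squares line, i.e. the line that most closely fits the data:

```python
import numpy as np

# Hypothetical training pairs that happen to lie exactly on y = 2x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Least-squares fit of a degree-1 polynomial (a straight line).
m, b = np.polyfit(x, y, deg=1)
print(m, b)  # slope close to 2, intercept close to 0

# The fitted mapping function predicts a continuous output for a new input:
print(m * 6.0 + b)  # close to 12
```

The fitted slope and intercept are the whole model: unlike KNN, prediction no longer needs the training data.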
 
 
 
 
 
* '''Classification techniques'''
 
: Classification is the process of identifying to which of a set of categories a new observation belongs. https://en.wikipedia.org/wiki/Statistical_classification
 
 
 
: A classification algorithm is able to approximate a mapping function (<math>f</math>) from input variables (<math>x</math>) '''to a <span style='color:red'>discrete</span> output variable''' (<math>y</math>). So, the mapping function predicts the class/label/category of a given observation. https://machinelearningmastery.com/classification-versus-regression-in-machine-learning/
 
 
 
 
 
: For example, an email can be classified as "spam" or "not spam".
 
 
 
 
 
:* K-Nearest Neighbors
 
:* Decision Trees
 
:* Random Forest
 
:* Naive Bayes
 
:* Logistic Regression
 
:* Support Vector Machines (SVM): It can be used for classification and regression analysis
 
:* Neural Network Classification
 
:* ...
 
 
 
 
 
:: '''Classification algorithms are used for:'''
 
::* Text/Image classification
 
::* Medical Diagnostics
 
::* Weather forecasting: for example, an «outlook» attribute of weather data with potential values of "sunny", "overcast", and "rainy"
 
::* Fraud Detection
 
::* Credit Risk Analysis
 
</blockquote>
 
 
 
 
 
<br />
 
 
 
===Unsupervised Learning===
 
Unsupervised Learning (Unsupervised ML)
 
<br />
 
<blockquote>
 
 
 
 
 
* '''Clustering'''
 
: It is the task of dividing the data into groups that contain similar data (grouping data that is similar together).
 
: For example, in a library, we can use clustering to group similar books together, so customers interested in a particular kind of book can see other similar books.
 
 
 
 
 
:* K-Means Clustering
 
 
 
:* Mean-Shift Clustering
 
 
 
:* Density-based spatial clustering of applications with noise (DBSCAN)
 
 
 
 
 
:: Clustering methods are used for:
 
 
 
::* Recommendation Systems: Recommendation systems are designed to recommend new items to users/customers based on users' previous preferences. They use clustering algorithms to predict a user's preferences based on the preferences of other users in the same cluster.
 
::: For example, Netflix collects user-behavior data from its more than 100 million customers. This data helps Netflix to understand what the customers want to see. Based on the analysis, the system recommends movies (or tv-shows) that users would like to watch. This kind of analysis usually results in higher customer retention. https://www.youtube.com/watch?v=dK4aGzeBPkk
 
 
 
 
 
::* Customer Segmentation
 
 
 
 
 
::* Targeted Marketing
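
As an illustrative sketch (not from the original text; the data and the <code>kmeans_1d</code> helper are invented), the assignment/update loop at the heart of K-Means can be written in a few lines of plain Python:

```python
def kmeans_1d(points, centroids, iters=10):
    """Plain k-means on 1-D data, starting from user-supplied centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [1.0, 1.2, 0.8, 8.0, 8.2, 7.8]
centroids, clusters = kmeans_1d(points, centroids=[0.0, 10.0])
print(centroids)  # close to [1.0, 8.0] -- the means of the two similar groups
```

With the two toy groups above, the centroids converge to the group means after the first iteration.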
 
 
 
 
 
 
 
* '''Dimensionality reduction'''
 
 
 
 
 
:: Dimensionality reduction methods are used for:
 
 
 
::* Big Data Visualisation
 
::* Meaningful compression
 
::* Structure Discovery
 
 
 
 
 
 
 
* '''Association Rules'''
 
 
 
</blockquote>
 
 
 
 
 
<br />
 
 
 
===Reinforcement Learning===
 
 
 
 
 
<br />
 
 
 
==Some real-world examples of big data analysis==
 
 
 
* '''Credit card real-time data:'''
 
: Credit card companies collect and store the real-time data of when and where the credit cards are being swiped. This data helps them in fraud detection. Suppose a credit card is used at location A for the first time. Then after 2 hours the same card is being used at location B which is 5000 kilometers from location A. Now it is practically impossible for a person to travel 5000 kilometers in 2 hours, and hence it becomes clear that someone is trying to fool the system. https://www.youtube.com/watch?v=dK4aGzeBPkk
 
 
 
 
 
<br />
 
 
 
==Statistics==
 
 
 
* Probability vs Likelihood: https://www.youtube.com/watch?v=pYxNSUDSFH4
 
 
 
 
 
<br />
 
==Descriptive Data Analysis==
 
Rather than find hidden information in the data, descriptive data analysis looks to summarize the dataset.
 
 
 
* Some of the measures commonly included in descriptive data analysis:
 
 
 
:* '''Central tendency''': Mean, Median, Mode
 
:* '''Variability''' (Measures of variation): Range, Quartile, Standard deviation, Z-Score
 
:* '''Shape of distribution''': Probabilistic distribution plot, Histogram, Skewness, Kurtosis
 
 
 
 
 
<br />
 
===Central tendency===
 
https://statistics.laerd.com/statistical-guides/measures-central-tendency-mean-mode-median.php
 
 
 
A central tendency (or measure of central tendency) is a single value that attempts to describe a variable by identifying the central position within that data (the most typical value in the data set).
 
 
 
'''The mean''' (often called the average) is the most popular measure of the central tendency, but there are others, such as the '''median''' and the '''mode'''.
 
 
 
'''The mean, median, and mode''' are all valid measures of central tendency, but under different conditions, some measures of central tendency are more appropriate to use than others.
 
 
 
 
 
[[File:Visualisation_mode_median_mean.png|right|thumb|300pt|Geometric visualisation of the mode, median and mean of an arbitrary probability density function Taken from https://en.wikipedia.org/wiki/Probability_density_function <br /> [[:File:Visualisation_mode_median_mean.svg]] ]]
 
 
 
 
 
<br />
 
====Mean====
 
The mean (or average) is the most popular measure of central tendency.
 
 
 
 
 
The mean is equal to the sum of all the values in the data set divided by the number of values in the data set.
 
 
 
 
 
The mean is usually denoted as <math>\mu</math> (population mean) or <math>\bar{x}</math> (pronounced "x bar") (sample mean):
 
 
 
<math>\bar{x} = \frac{(x_1 + x_2 +...+ x_n)}{n} = \frac{\sum x}{n}</math>
 
 
 
 
 
An important property of the mean is that it includes every value in your data set as part of the calculation. In addition, the mean is the only measure of central tendency where the sum of the deviations of each value from the mean is always zero.
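
Both properties are easy to verify with a quick Python sketch (invented data, added for illustration):

```python
data = [2, 4, 6, 8]

# The mean: sum of all values divided by the number of values.
mean = sum(data) / len(data)
print(mean)             # 5.0

# The deviations of each value from the mean always sum to zero.
deviations = [x - mean for x in data]
print(sum(deviations))  # 0.0
```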
 
 
 
 
 
<br />
 
=====When not to use the mean=====
 
'''When the data has values that are unusual (too small or too big) compared to the rest of the data set (outliers) the mean is usually not a good measure of the central tendency.'''
 
 
 
 
 
For example, consider the wages of the employees in a factory:
 
 
 
{| class="wikitable"
 
!Staff
 
!1
 
!2
 
!3
 
!4
 
!5
 
!6
 
!7
 
!8
 
!9
 
!10
 
|-
 
!'''Salary'''
 
!<math>15k</math>
 
!<math>18k</math>
 
!<math>16k</math>
 
!<math>14k</math>
 
!<math>15k</math>
 
!<math>15k</math>
 
!<math>12k</math>
 
!<math>17k</math>
 
!<math>90k</math>
 
!<math>95k</math>
 
|}
 
 
 
 
 
The mean salary for these ten employees is $30.7k. However, inspecting the data we can see that this mean value might not be the best way to accurately reflect the typical salary of an employee, as most workers have salaries in a range between $12k and $18k. The mean is being '''skewed''' by the two large salaries. As we will find out later, taking the median would be a better measure of central tendency in this situation.
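
These figures are easy to verify with Python's standard <code>statistics</code> module (a quick sketch added for illustration):

```python
import statistics

salaries = [15, 18, 16, 14, 15, 15, 12, 17, 90, 95]  # in $k

print(statistics.mean(salaries))    # 30.7 -- pulled upwards by the two large salaries
print(statistics.median(salaries))  # 15.5 -- much closer to the typical salary
```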
 
 
 
 
 
<br />
 
'''Another case when we usually prefer the median over the mean (or mode) is when our data is skewed (i.e., the frequency distribution for our data is skewed).'''
 
 
 
'''If we consider the normal distribution - as this is the most frequently assessed in statistics - when the data is perfectly normal, the mean, median, and mode are identical'''. Moreover, they all represent the most typical value in the data set. '''However, as the data becomes skewed the mean loses its ability to provide the best central location for the data'''. Therefore, in the case of skewed data, the median is typically the best measure of the central tendency because it is not as strongly influenced by the skewed values.
 
 
 
 
 
<br />
 
 
 
====Median====
 
The median is the middle score for a set of data that has been arranged in order of magnitude. The median is less affected by outliers and skewed data. In order to calculate the median, suppose we have the data below:
 
 
 
{| class="wikitable"
 
!65
 
!55
 
!89
 
!56
 
!35
 
!14
 
!56
 
!55
 
!87
 
!45
 
!92
 
|}
 
 
 
 
 
We first need to rearrange that data in order of magnitude:
 
 
 
{| class="wikitable"
 
!14
 
!35
 
!45
 
!55
 
!55
 
!<span style="color:#FF0000">56</span>
 
!56
 
!65
 
!87
 
!89
 
!92
 
|}
 
 
 
 
 
Then, the Median is the middle score. In this case, 56. This works fine when you have an odd number of scores, but what happens when you have an even number of scores? What if you had only 10 scores? Well, you simply have to take the middle two scores and average the result. So, if we look at the example below:
 
 
 
{| class="wikitable"
 
!65
 
!55
 
!89
 
!56
 
!35
 
!14
 
!56
 
!55
 
!87
 
!45
 
|}
 
 
 
{| class="wikitable"
 
!14
 
!35
 
!45
 
!55
 
!<span style="color:#FF0000">55</span>
 
!<span style="color:#FF0000">56</span>
 
!56
 
!65
 
!87
 
!89
 
|}
 
 
 
We can now take the 5th and 6th scores and calculate the mean. So the Median would be 55.5.
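
Python's <code>statistics.median</code> applies exactly this rule; checking the ten-score example above:

```python
import statistics

scores = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45]

# With an even number of scores, the median is the mean
# of the two middle scores of the sorted data: (55 + 56) / 2.
print(statistics.median(scores))  # 55.5
```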
 
 
 
 
 
<br />
 
 
 
====Mode====
 
The mode is the most frequent score in our data set.
 
 
 
On a histogram, it represents the highest bar. For continuous variables, we usually define a bin size, so every bar in the histogram represents a range of values depending on the bin size.
 
 
 
 
 
<br />
 
[[File:Mode-1.png|center|thumb|359x359px]]
 
 
 
 
 
Normally, the mode is used for categorical data where we wish to know which is the most common category, as illustrated below:
 
[[File:Mode-1a.png|center|thumb|380x380px]]
 
 
 
 
 
We can see above that the most common form of transport, in this particular data set, is the bus. However, one of the problems with the mode is that it is not unique, so it leaves us with problems when we have two or more values that share the highest frequency, such as below:
 
<br />
 
[[File:Mode-2.png|center|thumb|379x379px]]
 
 
 
 
 
We are now stuck as to which mode best describes the central tendency of the data. This is particularly problematic when we have continuous data because we are more likely not to have any one value that is more frequent than another. For example, consider measuring 30 people's weight (to the nearest 0.1 kg). How likely is it that we will find two or more people with '''exactly''' the same weight (e.g., 67.4 kg)? The answer is probably very unlikely - many people might be close, but with such a small sample (30 people) and a large range of possible weights, you are unlikely to find two people with exactly the same weight; that is, to the nearest 0.1 kg. This is why the mode is very rarely used with continuous data.
 
 
 
 
 
Another problem with the mode is that it will not provide us with a very good measure of central tendency when the most common mark is far away from the rest of the data in the data set, as depicted in the diagram below:
 
<br />
 
[[File:Mode-3.png|center|thumb|379px]]
 
 
 
 
 
In the above diagram the mode has a value of 2. We can clearly see, however, that the mode is not representative of the data, which is mostly concentrated around the 20 to 30 value range. To use the mode to describe the central tendency of this data set would be misleading.
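
In Python, <code>statistics.mode</code> returns the most frequent value, and <code>statistics.multimode</code> (Python 3.8+) covers the case, discussed above, where two or more values share the highest frequency (illustrative data):

```python
import statistics

transport = ["bus", "car", "bus", "bike", "bus", "car"]
print(statistics.mode(transport))   # bus -- the most common category

# When two values share the highest frequency, multimode returns both:
print(statistics.multimode([1, 1, 2, 2, 3]))  # [1, 2]
```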
 
 
 
 
 
<br />
 
 
 
====Skewed Distributions and the Mean and Median====
 
We often test whether our data is normally distributed because this is a common assumption underlying many statistical tests. An example of a normally distributed set of data is presented below:
 
<br />
 
[[File:Skewed-1.png|center|thumb|379px]]
 
 
 
When you have a normally distributed sample you can legitimately use either the mean or the median as your measure of central tendency. In fact, in any symmetrical distribution the mean, median and mode are equal. However, in this situation, the mean is widely preferred as the best measure of central tendency because it is the measure that includes all the values in the data set for its calculation, and any change in any of the scores will affect the value of the mean. This is not the case with the median or mode.
 
 
 
However, when our data is skewed, for example, as with the right-skewed data set below:
 
<br />
 
[[File:Skewed-2.png|center|thumb|379px]]
 
 
 
we find that the mean is being dragged in the direction of the skew. In these situations, the median is generally considered to be the best representative of the central location of the data. The more skewed the distribution, the greater the difference between the median and mean, and the greater emphasis should be placed on using the median as opposed to the mean. A classic example of the above right-skewed distribution is income (salary), where higher-earners provide a false representation of the typical income if expressed as a mean and not a median.
 
 
 
If dealing with a normal distribution, and tests of normality show that the data is non-normal, it is customary to use the median instead of the mean. However, this is more a rule of thumb than a strict guideline. Sometimes, researchers wish to report the mean of a skewed distribution if the median and mean are not appreciably different (a subjective assessment), and if it allows easier comparisons to previous research to be made.
 
 
 
 
 
<br />
 
====Summary of when to use the mean, median and mode====
 
Please use the following summary table to know what the best measure of central tendency is with respect to the different types of variable:
 
{| class="wikitable"
 
!'''Type of Variable'''
 
!'''Best measure of central tendency'''
 
|-
 
|Nominal
 
|Mode
 
|-
 
|Ordinal
 
|Median
 
|-
 
|Interval/Ratio (not skewed)
 
|Mean
 
|-
 
|Interval/Ratio (skewed)
 
|Median
 
|}
 
 
 
For answers to frequently asked questions about measures of central tendency, please go to: https://statistics.laerd.com/statistical-guides/measures-central-tendency-mean-mode-median-faqs.php
 
 
 
 
 
<br />
 
 
 
===Measures of Variation===
 
<span style="background: yellow">The Variation or Variability is a measure of the '''spread of the data (of a variable)''': a measure of '''how widely the values are distributed around the mean (the deviation of a variable from its mean)'''.</span>
 
 
 
 
 
<br />
 
====Range====
 
The range is the difference between the maximum and minimum values of a variable (it is often reported simply as the min and max values themselves).
 
 
 
 
 
Range can be used on '''''Ordinal, Ratio and Interval''''' scales
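
A quick sketch in Python (invented data):

```python
data = [12, 15, 14, 18, 90]

# The range is often reported as the min and max themselves...
print(min(data), max(data))   # 12 90

# ...or as their difference.
print(max(data) - min(data))  # 78
```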
 
 
 
 
 
<br />
 
====Quartile====
 
https://statistics.laerd.com/statistical-guides/measures-of-spread-range-quartiles.php
 
 
 
 
 
The Quartile is a measure of the spread of a data set. To calculate the Quartiles we follow the same logic as for the Median. Remember that when calculating the Median, we first sort the data from the lowest to the highest value, so the Median is the value in the middle of the sorted data. In the case of the Quartiles, we also sort the data from the lowest to the highest value, but we break the data set into quarters and take 3 values to describe the data: the value corresponding to 25% of the data, the one corresponding to 50% (which is the Median), and the one corresponding to 75% of the data.
 
 
 
 
 
A first example:

<pre>
[2  3  9  1  9  3  5  2  5  11  3]
</pre>

Sorting the data from the lowest to the highest value (the three quartiles are quoted):

<pre>
       25%        50%        75%
[1  2  "2"  3  3  "3"  5  5  "9"  9  11]
</pre>

The Quartile is '''[2  3  9]'''
 
 
 
 
 
<br />
 
Another example. Consider the marks of 100 students who have been ordered from the lowest to the highest scores.
 
 
 
 
 
*'''The first quartile (Q1):''' Lies between the 25th and 26th student's marks.
 
**So, if the 25th and 26th student's marks are 45 and 45, respectively:
 
***(Q1) = (45 + 45) ÷ 2 = 45
 
*'''The second quartile (Q2):''' Lies between the 50th and 51st student's marks.
 
**If the 50th and 51st student's marks are 58 and 59, respectively:
 
***(Q2) = (58 + 59) ÷ 2 = 58.5
 
*'''The third quartile (Q3):''' Lies between the 75th and 76th student's marks.
 
**If the 75th and 76th student's marks are 71 and 71, respectively:
 
***(Q3) = (71 + 71) ÷ 2 = 71
 
 
 
 
 
In the above example, we have an even number of scores (100 students, rather than an odd number, such as 99 students). This means that when we calculate the quartiles, we take the sum of the two scores around each quartile and then halve them (hence Q1 = (45 + 45) ÷ 2 = 45). However, if we had an odd number of scores (say, 99 students), we would only need to take one score for each quartile (that is, the 25th, 50th and 75th scores). You should recognize that the second quartile is also the median.
 
 
 
 
 
Quartiles are a useful measure of spread because they are much less affected by outliers or a skewed data set than the equivalent measures of mean and standard deviation. For this reason, quartiles are often reported along with the median as the best choice of measure of spread and central tendency, respectively, when dealing with skewed and/or data with outliers. A common way of expressing quartiles is as an interquartile range. The interquartile range describes the difference between the third quartile (Q3) and the first quartile (Q1), telling us about the range of the middle half of the scores in the distribution. Hence, for our 100 students:
 
 
 
 
 
<math>Interquartile\ range = Q3 - Q1 = 71 - 45 = 26</math>
 
 
 
 
 
However, it should be noted that in journals and other publications you will usually see the interquartile range reported as 45 to 71, rather than the calculated <math>Interquartile\ range</math>.


A slight variation on this is the <math>Semi{\text{-}}interquartile\ range</math>, which is half the <math>Interquartile\ range</math>. Hence, for our 100 students:
 
 
 
 
 
<math>Semi{\text{-}}Interquartile\ range = \frac{Q3 - Q1}{2} = \frac{71 - 45}{2} = 13</math>
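
As an illustrative sketch with invented marks, NumPy's <code>percentile</code> function computes quartiles directly; note that NumPy interpolates between ranks by default, so its results can differ slightly from the hand-calculation convention described above:

```python
import numpy as np

marks = np.arange(1, 101)  # hypothetical marks of 100 students: 1..100

q1, q2, q3 = np.percentile(marks, [25, 50, 75])
print(q1, q2, q3)     # 25.75 50.5 75.25 (linear interpolation between ranks)

print(q3 - q1)        # 49.5  -- interquartile range
print((q3 - q1) / 2)  # 24.75 -- semi-interquartile range
```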
 
 
 
 
 
<br />
 
 
 
====Box Plots====
 
 
 
<syntaxhighlight lang="r">
boxplot(iris$Sepal.Length,
        col = "blue",
        main = "iris dataset",
        ylab = "Sepal Length")
</syntaxhighlight>
 
 
 
 
 
<br />
 
====Variance====
 
https://statistics.laerd.com/statistical-guides/measures-of-spread-absolute-deviation-variance.php
 
 
 
The variance is a measure of the deviation of a variable from the mean.
 
 
 
 
 
The deviation of a value from the mean is:


<math>x_{i} - \mu</math>


The variance is the mean of the squared deviations:


<math>Var = \sigma^{2} = \frac{\sum_{i=1}^{N}(x_{i} - \mu)^2}{N}</math>
 
 
 
<math>\mu: \text{Population mean};\ \ \ x: \text{Score};\ \ \ N: \text{Number of scores in the population}</math>
 
 
 
<br />
 
* Unlike the '''Absolute deviation''', which uses the absolute value of the deviation in order to "rid itself" of the negative values, the variance achieves positive values by squaring the deviation of each value.
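
For example, checking the population variance with NumPy (an illustrative sketch; the data set is invented):

```python
import numpy as np

data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean = 5

# Squared deviations: 9+1+1+1+0+0+4+16 = 32, divided by N = 8.
print(np.var(data))  # 4.0 -- the population variance
```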
 
 
 
 
 
<br />
 
 
 
====Standard Deviation====
 
https://statistics.laerd.com/statistical-guides/measures-of-spread-standard-deviation.php
 
 
 
The Standard Deviation is the square root of the variance. This measure is the most widely used to express deviation from the mean in a variable.
 
 
 
 
 
<br />
 
: '''Population standard deviation''' (<math>\sigma</math>)
 
<blockquote>
 
<math>\sigma = \sqrt{\frac{\sum_{i=1}^{N}(x_{i} - \mu)^2}{N}}</math>
 
 
 
<math>\mu: \text{population mean};\ \ \ N: \text{Number of scores in the population}</math>
 
</blockquote>
 
 
 
 
 
<br />
 
: '''Sample standard deviation formula'''  (<math>s</math>)
 
<blockquote>
 
Sometimes our data is only a sample of the whole population. In this case, we can still estimate the Standard deviation; but when we use a sample as an estimate of the whole population, the Standard deviation formula changes to this:
 
 
 
<math>s = \sqrt{\frac{\sum_{i=1}^{n}(x_{i} - \bar{x})^2}{n -1}}</math>
 
 
 
 
 
<math>\bar{x}: \text{Sample mean};\ \ \ n: \text{Number of scores in the sample}</math>
 
</blockquote>
 
 
 
See Bessel's correction: https://en.wikipedia.org/wiki/Bessel%27s_correction
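
In NumPy, the <code>ddof</code> argument switches between the population formula (divide by <math>N</math>) and the sample formula with Bessel's correction (divide by <math>n-1</math>). An illustrative sketch with invented data:

```python
import numpy as np

data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean = 5, population variance = 4

print(np.std(data))          # 2.0 -- population sigma (ddof=0, divide by N)
print(np.std(data, ddof=1))  # sample s with Bessel's correction (divide by n-1),
                             # always slightly larger than the population value
```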
 
 
 
 
 
<br />
 
 
 
* A value of zero means that there is no variability; All the numbers in the data set are the same.
 
 
 
* A higher standard deviation indicates more widely distributed values around the mean.
 
 
 
* Assuming the frequency distribution is approximately normal, about <math>68\%</math> of all observations are within <math>-1</math> and <math>+1</math> standard deviation from the mean.
 
 
 
<br />
 
 
 
==== Z Score ====
 
Z-Score represents how far from the mean a particular value is based on the number of standard deviations. In other words, a z-score tells us how many standard deviations away a value is from the mean.
 
 
 
Z-Scores are also known as standardized residuals.
 
 
 
Note: mean and standard deviation are sensitive to outliers.
 
 
 
 
 
We use the following formula to calculate a z-score. https://www.statology.org/z-score-python/
 
 
 
<math>
 
z = (x - \mu)/\sigma
 
</math>
 
 
 
* <math>x</math>      is a single raw data value
 
* <math>\mu</math>    is the population mean
 
* <math>\sigma</math> is the population standard deviation
 
 
 
 
 
<br />
 
In Python:
 
 
 
scipy.stats.zscore(a, axis=0, ddof=0, nan_policy='propagate') https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.zscore.html
 
 
 
Compute the z score of each value in the sample, relative to the sample mean and standard deviation.
 
 
 
<syntaxhighlight lang="python3">
import numpy as np
from scipy import stats

a = np.array([0.7972, 0.0767, 0.4383, 0.7866, 0.8091,
              0.1954, 0.6307, 0.6599, 0.1065, 0.0508])

stats.zscore(a)

# Output:
# array([ 1.12724554, -1.2469956 , -0.05542642,  1.09231569,  1.16645923,
#        -0.8558472 ,  0.57858329,  0.67480514, -1.14879659, -1.33234306])
</syntaxhighlight>
 
 
 
 
 
<br />
 
 
 
===Shape of Distribution===
 
The shape of the distribution of a variable is visualized by building a probability distribution plot or a histogram. There are also some numerical measures (like the Skewness and the Kurtosis) that describe, with a single value, some features of the shape of the distribution of a variable. [Adelo]
 
 
 
 
 
<br />
 
====Probability distribution====
 
I am not sure what the correct terms are for this type of plot (Density - Distribution plots)
 
 
 
https://en.wikipedia.org/wiki/Probability_distribution#Continuous_probability_distribution
 
 
 
https://en.wikipedia.org/wiki/Probability_density_function
 
 
 
https://towardsdatascience.com/histograms-and-density-plots-in-python-f6bda88f5ac0
 
 
 
 
 
<br />
 
=====The Normal Distribution=====
 
https://www.youtube.com/watch?v=rzFX5NWojp0
 
 
 
https://en.wikipedia.org/wiki/Normal_distribution
 
 
 
 
 
<br />
 
====Histograms====
 
 
 
 
 
 
<br />
 
====Skewness====
 
https://en.wikipedia.org/wiki/Skewness
 
 
 
https://www.investopedia.com/terms/s/skewness.asp
 
 
 
https://towardsdatascience.com/histograms-and-density-plots-in-python-f6bda88f5ac0
 
 
 
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.skew.html
 
 
 
 
 
Skewness is a method for quantifying the lack of symmetry in the probability distribution of a variable.
 
 
 
* <span style="background:#E6E6FA">'''Skewness = 0</span> : Normally distributed'''.
 
 
 
* <span style="background:#E6E6FA">'''Skewness < 0</span> : Negative skew: The left tail is longer.''' The mass of the distribution is concentrated on the right of the figure. The distribution is said to be left-skewed, left-tailed, or skewed to the left, despite the fact that the curve itself appears to be skewed or leaning to the right; left instead refers to the left tail being drawn out and, often, the mean being skewed to the left of a typical center of the data. A left-skewed distribution usually appears as a right-leaning curve. https://en.wikipedia.org/wiki/Skewness
 
 
 
* <span style="background:#E6E6FA">'''Skewness > 0</span> : Positive skew: The right tail is longer.''' The mass of the distribution is concentrated on the left of the figure. The distribution is said to be right-skewed, right-tailed, or skewed to the right, despite the fact that the curve itself appears to be skewed or leaning to the left; right instead refers to the right tail being drawn out and, often, the mean being skewed to the right of a typical center of the data. A right-skewed distribution usually appears as a left-leaning curve.
 
 
 
 
 
[[File:Skewness.png|400px|thumb|center|]]
 
 
 
 
 
[[File:Relationship_between_mean_and_median_under_different_skewness.png|600px|thumb|center|Taken from https://en.wikipedia.org/wiki/Skewness]]
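
A quick check with SciPy (illustrative only; the toy data sets are invented):

```python
from scipy import stats

# Perfectly symmetric data has zero skewness.
print(stats.skew([1, 2, 3, 4, 5]))       # 0.0

# A long right tail (the value 10) gives a positive skew.
print(stats.skew([1, 1, 1, 2, 10]) > 0)  # True
```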
 
 
 
 
 
<br />
 
 
 
====Kurtosis====
 
https://www.statisticshowto.com/probability-and-statistics/statistics-definitions/kurtosis-leptokurtic-platykurtic/
 
 
 
https://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm
 
 
 
https://en.wikipedia.org/wiki/Kurtosis
 
 
 
https://www.simplypsychology.org/kurtosis.html
 
 
 
 
 
The kurtosis is a measure of the "tailedness" of the probability distribution. https://en.wikipedia.org/wiki/Kurtosis
 
 
 
We can say that Kurtosis is a measure of the concentration of values in the tail of the distribution. This, of course, also gives you an idea of the concentration of values at the peak of the distribution; but it is important to know that the measure provided by the kurtosis is related to the tail. [Adelo]
 
 
 
 
 
* '''The kurtosis of any univariate normal distribution is 3.''' A univariate normal distribution is usually called just normal distribution.
 
 
 
* '''Platykurtic:''' Kurtosis less than 3 (Negative Kurtosis if we talk about the adjusted version of Pearson's kurtosis, the Excess kurtosis).
 
:* A negative value means that the distribution has light tails compared to the normal distribution (there is little data in the tails).
 
:* An example of a platykurtic distribution is the uniform distribution, which does not produce outliers.
 
 
 
* '''Leptokurtic:''' Kurtosis greater than 3 (Positive Excess kurtosis).
 
:* A positive kurtosis tells us that the distribution has heavy tails (outliers), meaning that there is a lot of data in the tails.
 
:* An example of a leptokurtic distribution is the Laplace distribution, which has tails that asymptotically approach zero more slowly than a Gaussian and therefore produce more outliers than the normal distribution.
 
 
 
* This heaviness or lightness in the tails usually means that your data looks flatter (or less flat) compared to the normal distribution.
 
 
 
* It is also common practice to use the adjusted version of Pearson's kurtosis, the excess kurtosis, which is the kurtosis minus 3, to provide the comparison to the standard normal distribution. Some authors use "kurtosis" by itself to refer to the excess kurtosis. https://en.wikipedia.org/wiki/Kurtosis
 
 
 
* It must be noted that the Kurtosis is related to the tails of the distribution, not its peak; hence, the sometimes-seen characterization of kurtosis as "peakedness" is incorrect.  https://en.wikipedia.org/wiki/Kurtosis
 
 
 
 
 
<br />
 
In Python: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kurtosis.html
 
 
 
scipy.stats.kurtosis(a, axis=0, fisher=True, bias=True, nan_policy='propagate')
 
: Compute the kurtosis (Fisher or Pearson) of a dataset.
 
 
 
 
 
<syntaxhighlight lang="python3">
import numpy as np
from scipy.stats import kurtosis, norm

data = norm.rvs(size=1000, random_state=3)
data2 = np.random.randn(1000)

kurtosis(data2)
</syntaxhighlight>
 
 
 
 
 
<syntaxhighlight lang="python3">
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
from scipy.stats import kurtosis

x = np.linspace(-5, 5, 100)
ax = plt.subplot()
distnames = ['laplace', 'norm', 'uniform']

for distname in distnames:
    if distname == 'uniform':
        dist = getattr(stats, distname)(loc=-2, scale=4)
    else:
        dist = getattr(stats, distname)
    data = dist.rvs(size=1000)
    kur = kurtosis(data, fisher=True)
    y = dist.pdf(x)
    ax.plot(x, y, label="{}, {}".format(distname, round(kur, 3)))
    ax.legend()
</syntaxhighlight>
 
 
 
 
 
[[File:Kurtosis.png|400px|thumb|center|Recreated from https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kurtosis.html]]
 
 
 
 
 
<br />
 
 
 
====Visualization of measure of variations on a Normal distribution====
 
https://en.wikipedia.org/wiki/Probability_density_function
 
 
 
https://en.wikipedia.org/wiki/Normal_distribution
 
 
 
 
 
[[File:Boxplot_vs_PDF.svg|600px|thumb|center|Visualization of measure of variations on a Normal distribution. Each band has a width of 1 standard deviation]]
 
 
 
 
 
<br />
 
 
 
==Simple and Multiple regression==
 
 
 
* 17/06: Recorded class - Correlation & Regression
 
:* https://drive.google.com/drive/folders/1TW494XF-laGGJiLFApz8bJfMstG4Yqk_
 
 
 
 
 
<br />
 
===Correlation===
 
In statistics, correlation or dependence is any statistical relationship, whether causal or not, between two random variables or bivariate data. https://en.wikipedia.org/wiki/Correlation_and_dependence
 
 
 
 
 
When moderate to strong correlations are found, we can use this to create a regression model to make predictions about one of the variables given that the other variable is known.
 
 
 
The following are examples of correlations:
 
* There is a correlation between ice cream sales and temperature.
 
* Blood alcohol level and the odds of being involved in a traffic accident
 
* Phytoplankton population at a given latitude and surface sea temperature
 
 
 
 
 
[[File:Correlation1.png|800px|thumb|center|]]
 
 
 
 
 
<br />
 
====Measuring Correlation====
 
 
 
 
 
<br />
 
=====Pearson correlation coefficient - Pearson's r=====
 
The Pearson correlation coefficient (PCC), also referred to as Pearson's r or the Pearson product-moment correlation coefficient (PPMCC), is named after Karl Pearson (1857-1936).
 
 
 
 
 
The Pearson correlation coefficient is a measure of the degree and direction of a '''linear''' correlation between two variables.
 
 
 
 
 
<math>
 
r = \frac{\sum_{i=1}^{n}((x_i - \bar{x})(y_i - \bar{y}))}{\sqrt{\sum_{i=1}^n(x_i - \bar{x})^2\sum_{i=1}^n(y_i - \bar{y})^2}}
 
</math>
 
 
 
 
 
* Where <math>\bar{x}</math> and <math>\bar{y}</math> are the means of the <math>x</math> (independent) and <math>y</math> (dependent) variables, respectively, and <math>x_i</math> and <math>y_i</math> are the individual observations for each variable.
 
 
 
* Values of Pearson's r range between -1 and +1.
 
 
 
 
 
* '''The direction of the correlation:'''
 
:* Values greater than zero indicate a positive correlation, with 1 being a perfect positive correlation.
 
:* Values less than zero indicate a negative correlation, with -1 being a perfect negative correlation.
 
 
 
 
 
* '''The degree of the correlation:'''
 
<blockquote>
 
{| class="wikitable"
 
|+
 
!Degree of correlation
 
!Interpretation
 
|-
 
|0.8  to  1.0
 
|Very strong
 
|-
 
|0.6  to  0.8
 
|Strong
 
|-
 
|0.4  to 0.6
 
|Moderate
 
|-
 
|0.2  to 0.4
 
|Weak
 
|-
 
|0  to  0.2
 
|Very weak or non-existent
 
|}
 
</blockquote>
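The formula above can be applied directly in Python and checked against <code>scipy.stats.pearsonr</code>. A minimal sketch with illustrative data:

```python
import numpy as np
from scipy.stats import pearsonr

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# Pearson's r from the definition: sum of products of deviations over the
# square root of the product of the sums of squared deviations
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_scipy, _ = pearsonr(x, y)
print(r_manual, r_scipy)  # both close to +1: a very strong positive correlation
```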
 
 
 
 
 
<br />
 
 
 
=====The coefficient of determination <math>R^2</math>=====
 
[Noel]  https://en.wikipedia.org/wiki/Coefficient_of_determination  https://en.wikipedia.org/wiki/Total_sum_of_squares  https://en.wikipedia.org/wiki/Residual_sum_of_squares
 
 
 
 
 
<math>R^2</math> (R squared) is a measure of how well the regression predictions approximate the actual data values. An <math>R^2</math> of 1 means that predicted values perfectly fit the actual data.
 
 
 
 
 
<math>R^2</math> is termed the '''coefficient of determination''' because it measures the proportion of variance in the dependent variable that is determined by its relationship with the independent variables. This is calculated from two values: [Noel]
 
 
 
 
 
* '''The total sum of squares:'''    <math> SS_{tot} = TSS = \sum_{i=1}^n (y_i - \bar{y}_i)^2 </math>
 
: This is the sum of the squared differences between the actual <math>y</math> values and their mean.
 
: Proportional to the '''variance''' of the data.
 
 
 
 
 
* '''The residual sum of squares:''' <math> SS_{res} = RSS = \sum_{i=1}^n (y_i - \hat{y}_i)^2 </math>  =  <math>\sum_{i=1}^n (y_i - f(x_i))^2 </math>
 
: This is the sum of the squared differences between the predicted <math>y</math> values (<math>\hat{y}_i</math>) and their respective actual values.
 
 
 
 
 
* '''The coefficient of determination:'''
 
: <math>
 
R^2 = 1 - \frac{SS_{res}}{SS_{tot}}
 
</math>
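The two sums of squares translate directly into code. A short sketch, using made-up actual and predicted values:

```python
import numpy as np

y = np.array([3.0, 5.0, 7.0, 10.0, 11.0])      # actual values
y_hat = np.array([3.2, 5.1, 7.0, 9.4, 11.3])   # model predictions

ss_tot = ((y - y.mean()) ** 2).sum()   # total sum of squares
ss_res = ((y - y_hat) ** 2).sum()      # residual sum of squares

r_squared = 1 - ss_res / ss_tot
print(r_squared)  # close to 1, so the predictions fit the data well
```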
 
 
 
 
 
<br />
 
 
 
====Correlation <math>\neq</math> Causation====
 
 
 
Even if you find the strongest of correlations, you should never interpret it as more than just that... a correlation.
 
 
 
 
 
Causation indicates a relationship between two events where one event is affected by the other. In statistics, when the value of one variable increases or decreases as a result of the value of another variable, it is said that there is causation.
 
 
 
 
 
Let's say you have a job and get paid a certain rate per hour. The more hours you work, the more income you will earn, right? This means there is a relationship between the two events and also that a change in one event (hours worked) causes a change in the other (income). This is causation! https://study.com/academy/lesson/causation-in-statistics-definition-examples.html
 
 
 
 
 
Given any two correlated events A and B, the following options are possible:
 
* A causes B
 
* B causes A
 
* A and B are both the product of a common underlying event, but do not cause each other
 
* Any relationship between A and B is simply the result of coincidence (pure chance)
 
 
 
 
 
<br />
 
'''Some examples: Causality or coincidence?'''
 
 
 
<div style="text-align: center;">
 
<pdf width="2000" height="600">File:Correlation_examples-Causality_vs_coincidence.pdf</pdf>
 
[[File:Correlation_examples-Causality_vs_coincidence.pdf]]
 
</div>
 
 
 
 
 
<br />
 
 
 
====Testing the "generalizability" of the correlation ====
 
See this source to try to understand this section: https://online.stat.psu.edu/stat501/lesson/1/1.9
 
 
 
 
 
Having determined the value of the correlation coefficient ('''r''') for a pair of variables, you should next determine the '''likelihood''' that the value of '''r''' occurred purely by chance. In other words, what is the likelihood that the relationship in your sample reflects a real relationship in the population?
 
 
 
Before carrying out any test, the alpha (<math>\alpha</math>) level should be set. This is a measure of how willing we are to be wrong when we say that there is a relationship between two variables. A commonly-used <math>\alpha</math> level in research is 0.05.
 
 
 
An <math>\alpha</math> level of 0.05 means that you could be wrong up to 5 times out of 100 when you state that there is a relationship in the population based on a correlation found in the sample.
 
 
 
In order to test whether the correlation in the sample can be generalized to the population, we must first identify the null hypothesis <math>H_0</math> and the alternative hypothesis <math>H_A</math>.
 
 
 
This is a test against the population correlation co-efficient (<math>\rho</math>), so these hypotheses are:
 
 
 
 
 
* <math> H_0 : \rho = 0 </math> - There is no correlation in the population

* <math> H_A : \rho \neq 0 </math> - There is a correlation in the population
 
 
 
 
 
Next, we calculate the value of the test statistic using the following equation:
 
 
 
 
 
<math>
 
t^* = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}
 
</math>
 
 
 
 
 
So for a correlation coefficient <math>r</math> value of 0.8 and a sample size of 102 (giving <math>r^2 = 0.64</math> and <math>n - 2 = 100</math> degrees of freedom), this would be:


<math>
t^* = \frac{0.8\sqrt{100}}{\sqrt{1-0.64}} = \frac{8}{0.6} = 13.33
</math>


Checking the t-tables for an <math>\alpha</math> level of 0.05 and a two-tailed test (because we are testing whether <math>\rho</math> is less than or greater than 0) with 100 degrees of freedom, we get a critical value of approximately 1.984. As the value of the test statistic (13.33) is greater than the critical value, we can reject the null hypothesis and conclude that there is likely to be a correlation in the population.
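The whole test can be written as a small helper (the values of <math>r</math> and <math>n</math> below are illustrative). <code>scipy.stats.t.sf</code> gives the upper-tail probability, so doubling it yields a two-tailed p-value that can be compared against <math>\alpha</math> directly instead of consulting t-tables:

```python
import numpy as np
from scipy.stats import t

def correlation_t_test(r, n):
    """Test statistic and two-tailed p-value for H0: rho = 0, with n - 2 df."""
    t_star = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
    p_value = 2 * t.sf(abs(t_star), df=n - 2)
    return t_star, p_value

t_star, p = correlation_t_test(r=0.8, n=102)
print(t_star, p)  # a large t* and a p-value far below alpha = 0.05
```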
 
 
 
 
 
<br />
 
 
 
===Simple Linear Regression===
 
https://www.youtube.com/watch?v=nk2CQITm_eo&t=267s
 
 
 
 
 
In general, there are 3 main concepts in Linear regression:
 
 
 
: '''1'''. Using '''Least-squares''' to fit a line to the data
 
 
 
: '''2'''. Calculating <math>R^2</math>
 
 
 
: '''3'''. Calculating a <math>p-value</math> for <math>R^2</math>
 
 
 
 
 
<br />
 
: '''1. Using Least-squares to fit a line to the data'''
 
<blockquote>
 
[[File:Linear_regression1.png|400px|thumb|right|Taken from https://www.youtube.com/watch?v=nk2CQITm_eo&t=267s]]
 
<!-- [[File:SimpleLinearRegression2.png|600px|center|]] -->
 
 
 
: When we use least-squares to fit a line to the data, what we do is the following:
 
 
 
:* First, we define a line through the data.
 
 
 
:* Then, we calculate the '''Residual sum of squares''' for that line. To do so, we measure the distance from each data point to the fit line (residual), square each distance, and then add them up.
 
::: The distance from a line to a data point is called a '''residual'''
 
 
 
:* Then, we rotate the line a little bit and calculate again the RSS for the new line.
 
 
 
:* The algorithm does the same many times, so it tests many different lines.
 
 
 
:* ...
 
 
 
:* Then, the line that most closely fits the data (the line of best fit) is the one corresponding to the rotation that has the least RSS.
 
 
 
 
 
:* '''The linear regression equation is:'''
 
 
 
:: <math> y = a + bx </math>
 
 
 
:: The equation is composed of 2 parameters:
 
 
 
::* Slope: <math> b </math>
 
:::  The slope is the amount of change in units of <math>y</math> for each unit change in <math>x</math>.
 
 
 
::* The <math>y-axis</math> intercept: <math> a </math>
 
</blockquote>
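The fitting procedure described above does not need to be written by hand: <code>numpy.polyfit</code> with degree 1 returns the slope <math>b</math> and intercept <math>a</math> that minimise the residual sum of squares. A sketch with illustrative data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# Degree-1 polynomial fit: returns the least-squares slope b and intercept a
b, a = np.polyfit(x, y, 1)

# Residual sum of squares for the fitted line y = a + b*x
rss = ((y - (a + b * x)) ** 2).sum()
print(a, b, rss)
```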
 
 
 
 
 
<br />
 
: '''2. Calculating <math>R^2</math>'''
 
<blockquote>
 
 
 
In the following example, they are using different terminology to the one that we saw in Section [[Data_Science#The_coefficient_of_determination_R.5E2]]
 
 
 
It is very important to note how the result of <math>R^2</math> is interpreted. In our example: <span style='color: red'>''' There is a 60% reduction in variance when we take the mouse weight into account '''</span> or <span style='color: red'>''' Mouse weight "explains" 60% of the variation in mouse size. '''</span>
 
 
 
[[File:Linear_regression2.png|600px|thumb|center|Taken from https://www.youtube.com/watch?v=nk2CQITm_eo&t=267s]]




[[File:Linear_regression3.png|600px|thumb|center|Taken from https://www.youtube.com/watch?v=nk2CQITm_eo&t=267s]]
 
 
 
</blockquote>
 
 
 
 
 
<br />
 
: '''3. Calculating a <math>p-value</math> for <math>R^2</math>'''
 
<blockquote>
 
We need a way to determine if the <math>R^2</math> value is statistically significant. So, we need a <math>p-value</math>. In other words, we need to test the "generalizability" of the correlation.
(I am not completely sure, but I believe that what Noel explained as "generalizability" is the same thing this StatQuest covers when it refers to determining whether the <math>R^2</math> value is statistically significant.)
 
 
 
</blockquote>
 
 
 
 
 
<br />
 
 
 
===Multiple Linear Regression===
 
With Simple Linear Regression, we saw that we could use a single independent variable (x) to predict one dependent variable (y). Multiple Linear Regression is a development of Simple Linear Regression predicated on the assumption that if one variable can be used to predict another with a reasonable degree of accuracy then using two or more variables should improve the accuracy of the prediction.
 
 
 
 
 
<br />
 
'''Uses for Multiple Linear Regression:'''
 
 
 
When implementing Multiple Linear Regression, variables added to the model should make a unique contribution towards explaining the dependent variable. In other words, the multiple independent variables in the model should be able to predict the dependent variable better than any one of the variables would do in a Simple Linear Regression model.
 
 
 
 
 
<br />
 
'''The Multiple Linear Regression Model:'''
 
 
 
<math>
 
y = a + b_1 x_1 + b_2 x_2 ... + b_n x_n
 
</math>
 
 
 
 
 
<br />
 
'''Multicollinearity:'''


Before adding variables to the model, it is necessary to check for correlation between the independent variables themselves. The greater the degree of correlation between two independent variables, the more information they hold in common about the dependent variable. This is known as '''multicollinearity'''.
 
 
 
 
 
Because it is difficult to properly apportion the information each independent variable carries about the dependent variable, including highly correlated independent variables in the model can result in unstable estimates for the coefficients. Unstable coefficient estimates result in unrepeatable studies.
 
 
 
 
 
<br />
 
'''Adjusted <math>R^2</math>:'''
 
 
 
Recall that the coefficient of determination is a measure of how well our model as a whole explains the values of the dependent variable. Because models with larger numbers of independent variables will inevitably explain more variation in the dependent variable, the adjusted <math>R^2</math> value penalises models with a large number of independent variables. As such, adjusted <math>R^2</math> can be used to compare the performance of models with different numbers of independent variables.
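The usual adjustment formula is <math>\bar{R}^2 = 1 - (1 - R^2)\frac{n-1}{n-k-1}</math>, where <math>n</math> is the number of observations and <math>k</math> the number of independent variables. A sketch with illustrative numbers:

```python
def adjusted_r_squared(r_squared, n, k):
    """Penalise R^2 for the number of independent variables k, given n observations."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

# The same raw R^2 is penalised more heavily as predictors are added
print(adjusted_r_squared(0.75, n=50, k=2))   # ≈ 0.739
print(adjusted_r_squared(0.75, n=50, k=10))  # ≈ 0.686
```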
 
 
 
 
 
<br />
 
 
 
===RapidMiner Linear Regression examples===
 
<br />
 
* '''Example 1:'''
 
 
 
::* In the parameters for the Split Data operator, click on the Edit Enumerations button and enter two rows in the dialog box that opens. The first value should be 0.7 and the second should be 0.3. You can, of course, choose other values for the train and test split, provided that they sum to 1.
 
 
 
::* If you want the regression to be reproducible, check the Use Local Random Seed box and enter a seed value of your choosing in the local random seed box.
 
 
 
 
 
::* Linear Regression operator:
 
:::* Set feature selection to none.
 
:::* If you are doing multiple linear regression, check the eliminate collinear features box.
 
:::* If you want to have a Y-intercept calculated, check the use bias box.
 
:::* Set the ridge parameter to 0.
 
 
 
 
 
::* After running the model, clicking on the Linear Regression tab in the results will show you the coefficient values, the t statistic and the <math>p</math>-value.

::* Note that if the <math>p</math>-value is less than your chosen <math>\alpha</math> level, you can also reject the null hypothesis.
 
 
 
 
 
[[File:RapidMiner_Linear_regression-examples1.png|900px|thumb|center|Screencast at [[File:RapidMiner Linear_regression-examples1.mp4]]<br /> Data: [[File:Cost_of_Heating.zip]] ]]
 
 
 
<!-- <gallery mode=packed-overlay>
 
File:RapidMiner_Linear_regression-examples1_fig1.png|1
 
File:RapidMiner_Linear_regression-examples1_fig2.png|2
 
File:RapidMiner_Linear_regression-examples1_fig3.png|3
 
</gallery> -->
 
 
 
[[File:RapidMiner_Linear_regression-examples1_fig1.png|750px|center|]]
 
 
 
 
 
[[File:RapidMiner_Linear_regression-examples1_fig2.png|750px|center|]]
 
 
 
 
 
[[File:RapidMiner_Linear_regression-examples1_fig3.png|750px|center|]]
 
 
 
 
 
[[File:RapidMiner_Linear_regression-examples1_fig4.png|750px|thumb|center|[[:File:RapidMiner_Linear_regression-examples1_fig4.svg]] ]]
 
 
 
 
 
<br />
 
 
 
==K-Nearest Neighbour==
 
 
 
* Recorded Noel class (15/06):
 
 
 
:* https://drive.google.com/drive/folders/1BaordCV9vw-gxLdJBMbWioX2NW7Ty9Lm
 
 
 
 
 
 
* StatQuest: https://www.youtube.com/watch?v=HVXime0nQeI
 
 
 
 
 
{| class="wikitable"
 
|+
 
! colspan="6" style="text-align: left; font-weight: normal" |
 
KNN classifies a new data point based on the points that are closest in distance to the new point. The principle behind KNN is to find a predefined number of training samples (''K'') closest in distance to the new data point. Then, the class of the new data point will be the most common class in the k nearest training samples. https://scikit-learn.org/stable/modules/neighbors.html [Adelo]
 
In other words, KNN determines the class of a given unlabeled observation by identifying the most common class among the k-nearest labeled observations to it.
 
 
 
This is a simple method, but extremely powerful.
 
|-
 
!style="width: 17%"|'''Regression/Classification'''
 
!style="width: 17%"|'''Applications'''
 
!style="width: 17%"|Strengths
 
!style="width: 17%"|Weaknesses
 
!style="width: 17%"|Comments
 
!style="width: 15%"|Improvements
 
|-style="vertical-align: text-top;"
 
|
 
KNN can be used for both classification and regression predictive problems. However, it is more widely used in classification problems in the industry. <nowiki>https://www.analyticsvidhya.com/blog/2018/03/introduction-k-neighbours-algorithm-clustering/</nowiki>
 
|
 
* Face recognition
 
* Optical character recognition
 
 
 
* Recommendation systems
 
* Pattern detection in genetic data
 
|
 
* The algorithm is simple and effective
 
* Fast training phase
 
* Capable of reflecting complex relationships
 
* Unlike many other methods, no assumptions about the distribution of the data are made
 
|
 
* Slow classification phase. Requires lots of memory

* The method does not produce a model, which limits the potential for insight into the relationships between features

* Cannot handle nominal features or missing data without additional pre-processing
 
|
 
k-NN is ideal for classification tasks where relationships among the attributes and target classes are:
 
 
 
* numerous
 
* complex
 
* difficult to interpret and
 
* where instances of a class are fairly homogeneous
 
|
 
:* Weighting training examples based on their distance
 
:* Alternative measures of "nearness"
 
:* Finding "close" examples in a large training set quickly
 
|}
 
 
 
 
 
<br />
 
'''Basic Implementation:'''
 
 
 
* Training Algorithm:
 
:* Simply store the training examples
 
 
 
 
 
* Prediction Algorithm:
 
:# Calculate the distance from the new data point to all points in the data.
 
:# Sort the points in your data by increasing the distance from the new data point.
 
:# Determine the most frequent class among the k nearest points.
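The three prediction steps above can be sketched in a few lines of numpy (the training points and class labels here are invented for illustration):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # 1. Distance from the new point to every point in the training data
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # 2. Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # 3. Most frequent class among those k neighbours
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
y_train = ['blue', 'blue', 'red', 'red', 'red']

print(knn_predict(X_train, y_train, np.array([5.1, 5.0]), k=3))  # 'red'
```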
 
 
 
 
 
<br />
 
<img src="https://upload.wikimedia.org/wikipedia/commons/e/e7/KnnClassification.svg" style="display: block; margin-left: auto; margin-right: auto; width: 300pt;" />
 
 
 
<div style="text-align: left; display:block; margin-right: auto; margin-left: auto; width:500pt">Example of k-NN classification. The test sample (green dot) should be classified either to blue squares or to red triangles. If k = 3 (solid line circle) it is assigned to the red triangles because there are 2 triangles and only 1 square inside the inner circle. If k = 5 (dashed line circle) it is assigned to the blue squares (3 squares vs. 2 triangles inside the outer circle). Taken from https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm</div>
 
 
 
 
 
[[File:KNearest_Neighbors_from_the_Udemy_course_Pierian_data1.mp4|800px|thumb|center|Udemy course, Pierian data https://www.udemy.com/course/python-for-data-science-and-machine-learning-bootcamp/]]
 
 
 
 
 
<br />
 
 
 
==Decision Trees==
 
Noel's lecture (10/06): https://drive.google.com/drive/folders/1TW494XF-laGGJiLFApz8bJfMstG4Yqk_
 
 
 
StatQuest: https://www.youtube.com/watch?v=7VeUPuFGJHk  &nbsp; &nbsp;  https://www.youtube.com/watch?v=wpNl-JwwplA  &nbsp; &nbsp;  https://www.youtube.com/watch?v=g9c66TUylZ4  &nbsp; &nbsp;  https://www.youtube.com/watch?v=q90UDEgYqeI
 
 
 
 
 
[[File:Decision_Trees_terminology1.png|400px|thumb|right|]]
 
 
 
[[File:Decision_Trees_terminology2.png|400px|thumb|right|]]
 
 
 
[[File:Decision_Trees_terminology3.png|400px|thumb|right|]]
 
 
 
 
 
A DT is a predictive algorithm that builds models in the form of a tree structure composed of a series of branching Boolean tests (tests for which the answer is true or false). The principle is to use these Boolean tests to split the data into smaller and smaller subsets in order to identify patterns that can be used for prediction. [Noel Cosgrave slides]
 
 
 
In a DT, each '''internal node''' represents a "test" on an attribute (e.g. whether a coin flip comes up heads or tails), each '''branch''' represents the outcome of the test, and each '''leaf node''' represents a class label (decision taken after computing all attributes). The paths from root to leaf are called '''decision rules/classification rules'''. https://en.wikipedia.org/wiki/Decision_tree
 
 
 
 
 
<br />
 
'''A dataset can have many possible decision trees'''
 
 
 
In practice, we want small & accurate trees
 
 
 
* What is the inductive bias in Decision Tree learning?
 
:* Shorter trees are preferred over longer trees: Smaller trees are more general, usually more accurate, and easier to understand by humans (and to communicate!)
 
 
 
:* Prefer trees that place high information gain attributes close to the root
 
 
 
 
 
* More succinct hypotheses are preferred. Why?
 
:* There are fewer succinct hypotheses than complex ones
 
:* If a succinct hypothesis explains the data this is unlikely to be a coincidence
 
:* However, not every succinct hypothesis is a reasonable one.
 
 
 
 
 
<br />
 
'''Example 1:'''
 
 
 
A decision tree model can be used to decide whether or not to provide a loan to a customer (whether or not a customer is likely to repay a loan).
 
 
 
{|
 
<!-- |[[File:DecisionTree-NoelCosgraveSlides_3.png|350px|thumb|center|]] -->
 
|[[File:Decision_tree_ex1.jpeg|550px|thumb|center|]]
 
| style="vertical-align: text-top" |
 
* <math>16</math> training examples
 
* The <math>x/y</math> values mean that <math>x</math> out of the <math>y</math> training examples that reach this leaf node have the class of the leaf. This is the confidence
 
* The <math>x</math> value is the support count
 
* <math>x/total</math> training examples is the support
 
|}
 
 
 
 
 
<br />
 
'''Example 2:'''
 
 
 
A model for predicting the future success of a movie:
 
 
 
{|
 
|
 
{| class="wikitable"
 
!Movie
 
!Number of celebrities
 
!Budget
 
!Success (True label)
 
!Success (Predicted label)
 
|- style="background-color:#ff794d"
 
|Movie 1
 
|low
 
|high
 
|boxoffice flop
 
|boxoffice flop
 
|- style="background-color:#ff794d"
 
|Movie 2
 
|low
 
|low
 
|boxoffice flop
 
|boxoffice flop
 
|- style="background-color:#ff794d"
 
|Movie 3
 
|low
 
|high
 
|boxoffice flop
 
|boxoffice flop
 
|- style="background-color:#ff794d"
 
|Movie 4
 
|low
 
|low
 
|boxoffice flop
 
|boxoffice flop
 
|- style="background-color:#ff794d"
 
|Movie 5
 
|low
 
|high
 
|boxoffice flop
 
|boxoffice flop
 
|- style="background-color:#ff794d"
 
|Movie 6
 
|low
 
|low
 
|boxoffice flop
 
|boxoffice flop
 
|- style="background-color:#ff794d"
 
|Movie 7
 
|low
 
|high
 
|boxoffice flop
 
|boxoffice flop
 
|- style="background-color:#ff794d"
 
|Movie 8
 
|low
 
|low
 
|boxoffice flop
 
|boxoffice flop
 
|- style="background-color:#ff794d"
 
|Movie 9
 
|low
 
|high
 
|boxoffice flop
 
|boxoffice flop
 
|- style="background-color:#ff794d"
 
|Movie 10
 
|low
 
|low
 
|boxoffice flop
 
|boxoffice flop
 
|- style="background-color:#66a3ff"
 
|Movie 11
 
|high

|high
 
|mainstream hit
 
|mainstream hit
 
|- style="background-color:#66a3ff"
 
|Movie 12
 
|high

|high
 
|mainstream hit
 
|mainstream hit
 
|- style="background-color:#66a3ff"
 
|Movie 13
 
|high

|high
 
|mainstream hit
 
|mainstream hit
 
|- style="background-color:#66a3ff"
 
|'''Movie 14'''
 
|'''low'''
 
|'''high'''
 
|'''mainstream hit'''
 
|style="background-color:#ff794d"|'''boxoffice flop'''
 
|- style="background-color:#66a3ff"
 
|Movie 15
 
|high

|high
 
|mainstream hit
 
|mainstream hit
 
|- style="background-color:#66a3ff"
 
|Movie 16
 
|high

|high
 
|mainstream hit
 
|mainstream hit
 
|- style="background-color:#66a3ff"
 
|Movie 17
 
|high

|high
 
|mainstream hit
 
|mainstream hit
 
|- style="background-color:#66a3ff"
 
|Movie 18
 
|high

|high
 
|mainstream hit
 
|mainstream hit
 
|- style="background-color:#66a3ff"
 
|Movie 19
 
|high

|high
 
|mainstream hit
 
|mainstream hit
 
|- style="background-color:#66a3ff"
 
|Movie 20
 
|high

|high
 
|mainstream hit
 
|mainstream hit
 
|- style="background-color:#00e6b0"
 
|Movie 21
 
|high
 
|low
 
|critical success
 
|critical success
 
|- style="background-color:#00e6b0"
 
|Movie 22
 
|high
 
|low
 
|critical success
 
|critical success
 
|- style="background-color:#00e6b0"
 
|Movie 23
 
|high
 
|low
 
|critical success
 
|critical success
 
|- style="background-color:#00e6b0"
 
|'''Movie 24'''
 
|'''low'''
 
|'''high'''
 
|'''critical success'''
 
|style="background-color:#ff794d"|'''boxoffice flop'''
 
|- style="background-color:#00e6b0"
 
|Movie 25
 
|high
 
|low
 
|critical success
 
|critical success
 
|- style="background-color:#00e6b0"
 
|'''Movie 26'''
 
|'''high'''

|'''high'''
 
|'''critical success'''
 
|style="background-color:#66a3ff"|'''mainstream hit'''
 
|- style="background-color:#00e6b0"
 
|Movie 27
 
|high
 
|low
 
|critical success
 
|critical success
 
|- style="background-color:#00e6b0"
 
|Movie 28
 
|high
 
|low
 
|critical success
 
|critical success
 
|- style="background-color:#00e6b0"
 
|Movie 29
 
|high
 
|low
 
|critical success
 
|critical success
 
|- style="background-color:#00e6b0"
 
|Movie 30
 
|high
 
|low
 
|critical success
 
|critical success
 
|}
 
|
 
[[File:DecisionTree-NoelCosgraveSlides_2.png|500px|thumb|center|]]
 
 
 
[[File:DecisionTree-NoelCosgraveSlides_1.png|500px|thumb|center|]]
 
|}
 
 
 
 
 
 
 
<br />
 
===The algorithm===
 
 
 
 
 
<br />
 
====Basic explanation of the algorithm====
 
https://www.youtube.com/watch?v=7VeUPuFGJHk
 
 
 
 
 
This is a basic but nice explanation of the algorithm.
 
 
 
 
 
In this example, we want to create a tree that uses '''chest pain''', '''good blood circulation''', and '''blocked artery status''' to predict whether or not a patient has heart disease:
 
 
 
 
 
<center>
 
<div style="width: 420pt;">{{#ev:youtube|https://www.youtube.com/watch?v=7VeUPuFGJHk|||||start=110}}
 
</div>
 
</center>
 
 
 
 
 
[[File:Decision_tree-heart_disease_example.png|600px|thumb|center|Taken from https://www.youtube.com/watch?v=7VeUPuFGJHk]]
 
 
 
 
 
'''The algorithm to build the model is based on the following steps:'''
 
 
 
* '''We first need to determine which will be the Root:'''
 
 
 
:* The attribute (Chest pain, Good blood circulation, Blocked arteries) that determines better whether a Patient has '''Heart Disease''' or not will be chosen as the Root.
 
 
 
:* To do so, we need to evaluate the three attributes by calculating what is known as '''Impurity''', which is a measure of how poorly an attribute separates (determines) our label attribute (Heart Disease in our case).
 
 
 
:* There are many ways to measure impurity; one of the most popular is the '''Gini''' impurity.
 
 
 
 
 
:* '''So, let's calculate the Gini impurity for our «Chest pain» attribute:'''
 
 
 
::* We look at chest pain for all 303 patients in our data:
 
 
 
[[File:Decision_trees-StatQuest1.png|600px|thumb|center|Taken from https://www.youtube.com/watch?v=7VeUPuFGJHk]]
 
 
 
 
 
::* We calculate the Gini impurity for each branch (True: The patient has chest pain | False: The patient does not have chest pain):
 
 
 
:::<div style='font-size:15pt'><math>\color{blue}{
 
Gini\ Impurity = 1 - \sum_{i=1}^{N} P_{i}^{2}
 
}
 
</math>
 
</div>
 
 
 
 
 
::* For the '''True''' branch:
 
 
 
:::<math>
Gini\ Impurity_{True} = 1 - (Probability\ of\ Yes)^{2} - (Probability\ of\ No)^{2}
</math>
 
 
 
:::<math>
 
Gini\ Impurity_{True} = 1 - \biggl(\frac{105}{105 + 39}\biggl)^2 - \biggl(\frac{39}{105 + 39}\biggl)^2 = 0.395
 
</math>
 
 
 
 
 
::* For the '''False''' branch:
 
 
 
:::<math>
 
Gini\ Impurity_{False} = 0.336
 
</math>
 
 
 
 
 
::*<div style='font-size:13pt'><math>\color{blue}{
 
Total\ Gini\ Impurity = Weighted\ average\ of\ Gini\ impurities\ for\ the\ leaf\ nodes\ (branches)}
 
</math></div>
 
 
 
:::<math>
 
Gini\ Impurity_{Chest pain} = \biggl( \frac{144}{144 + 159} \biggl)0.395\ +\ \biggl( \frac{159}{144 + 159} \biggl)0.336 = 0.364
 
</math>
 
 
 
 
 
:* '''Good Blood Circulation'''
 
::<math>
 
Gini\ Impurity_{Good\ Blood\ Circulation} = 0.360
 
</math>
 
 
 
 
 
:* '''Blocked Arteries'''
 
::<math>
 
Gini\ Impurity_{Blocked\ Arteries} = 0.381
 
</math>
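The Gini calculations above can be reproduced in a few lines of Python. Note that the class counts for the False branch (34 patients with heart disease, 125 without) are not shown above; they are assumed from the source video so that the quoted 0.336 comes out:

```python
def gini(counts):
    """Gini impurity of a node, given the class counts in that node."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# Chest pain branches: True branch has 105 with heart disease and 39 without;
# the False branch counts (34 / 125) are an assumption taken from the video
g_true = gini([105, 39])
g_false = gini([34, 125])

# Total impurity: weighted average of the impurities of the two leaf nodes
n_true, n_false = 144, 159
g_total = (n_true * g_true + n_false * g_false) / (n_true + n_false)

print(round(g_true, 3), round(g_false, 3), round(g_total, 3))  # 0.395 0.336 0.364
```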
 
 
 
 
 
[[File:Decision_trees-StatQuest2.png|600px|thumb|center|Taken from https://www.youtube.com/watch?v=7VeUPuFGJHk]]
 
 
 
 
 
* '''Then, we follow the same method to determine the following nodes'''
 
 
 
 
 
* '''A node becomes a leaf when its Gini impurity is lower than the Gini impurity that would result from splitting on any remaining attribute'''
 
 
 
[[File:Decision_trees-StatQuest3.png|600px|thumb|center|Taken from https://www.youtube.com/watch?v=7VeUPuFGJHk]]
 
 
 
 
 
* '''At the end, our DT looks like this:'''
 
 
 
[[File:Decision_trees-StatQuest4.png|600px|thumb|center|Taken from https://www.youtube.com/watch?v=7VeUPuFGJHk]]
 
 
 
 
 
 
 
* '''For numeric data:'''
 
: Ex.: Patient weight
 
 
 
 
 
* '''Ranked data and Multiple choice data:'''
 
: Ranked data: For example, "Rank my jokes on a scale of 1 to 4".
 
: Multiple choice data: For example, "Which color do you like: red, blue or green".
 
 
 
 
 
 
 
<br /> <br />
 
Decision Trees, Part 2 - Feature Selection and Missing Data: https://www.youtube.com/watch?v=wpNl-JwwplA
 
 
 
 
 
<br />
 
Regression Trees: https://www.youtube.com/watch?v=g9c66TUylZ4
 
 
 
 
 
<br />
 
 
 
====Algorithms addressed in Noel's Lecture====
 
 
 
 
 
<br />
 
=====The ID3 algorithm=====
 
ID3 (Quinlan, 1986) is an early algorithm for learning Decision Trees. The tree is learned top-down. The algorithm is greedy: at each step it considers only a single attribute and its gain. This may fail when a combination of attributes is needed to improve the purity of a node.
 
 
 
 
 
At each split, the question is "which attribute should be tested next? Which logical test gives us more information?". This is determined by the measures of '''entropy''' and '''information gain'''. These are discussed later.
 
 
 
 
 
A new decision node is then created for each outcome of the test and examples are partitioned according to this value. The process is repeated for each new node until all the examples are classified correctly or there are no attributes left.
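As a rough sketch of one greedy ID3 step, entropy and information gain can be computed like this (the attribute name and toy data below are made up purely for illustration):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy reduction obtained by splitting the examples on one attribute."""
    gain = entropy(labels)
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        gain -= len(subset) / len(labels) * entropy(subset)
    return gain

# Hypothetical toy data: a single boolean attribute that predicts the class
rows = [{"windy": True}, {"windy": True}, {"windy": False}, {"windy": False}]
labels = ["no", "no", "yes", "yes"]

# ID3 would compute this for every attribute and split on the highest gain
print(information_gain(rows, labels, "windy"))  # 1.0: a perfect split
```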
 
 
 
 
 
<br />
 
=====The C5.0 algorithm=====
 
C5.0 (Quinlan, 1993) is a refinement of C4.5, which in itself improved upon ID3. '''It is the industry standard for producing decision trees'''. It performs well for most problems out-of-the-box. Unlike many other machine-learning techniques, it has high explanatory power: it can be understood and explained.
 
 
 
 
 
 
 
 
<br />
 
 
 
===Example in RapidMiner===
 
<div style="text-align: center;">
 
 
 
[[File:Decision_tree_RapidMiner_example-Iris_data.mp4|1500px|thumb|center|]]
 
 
 
 
 
{| style="margin: 0 auto;" |
 
|[[File:decision_tree_RapidMiner_example-Iris_data3.png|300px|thumb|center|]]
 
|[[File:decision_tree_RapidMiner_example-Iris_data2.png|500px|thumb|center|]]
 
|-
 
|<img style='width:300px' src='https://upload.wikimedia.org/wikipedia/commons/7/78/Petal-sepal.jpg' />
 
|[[File:decision_tree_RapidMiner_example-Iris_data1.png|500px|thumb|center|]]
 
|}
 
 
 
 
 
<pdf width="810" height="500">File:Decision_tree_Example_RapidMiner.pdf</pdf>
 
 
 
This is the guide provided by Noel [[File:Decision_tree_Example_RapidMiner.pdf]]
 
 
 
</div>
 
 
 
<br />
 
 
 
==Random Forests==
 
https://www.youtube.com/watch?v=J4Wdy0Wc_xQ&t=4s
 
 
 
 
 
<br />
 
==Naive Bayes==
 
Multinomial Naive Bayes: https://www.youtube.com/watch?v=O2L2Uv9pdDA
 
 
 
Gaussian Naive Bayes: https://www.youtube.com/watch?v=uHK1-Q8cKAw  &nbsp; &nbsp;  https://www.youtube.com/watch?v=H3EjCKtlVog
 
 
 
 
 
 
 
https://www.youtube.com/watch?v=Q8l0Vip5YUw
 
 
 
https://www.youtube.com/watch?v=l3dZ6ZNFjo0
 
 
 
https://en.wikipedia.org/wiki/Naive_Bayes_classifier
 
 
 
https://scikit-learn.org/stable/modules/naive_bayes.html
 
 
 
 
 
 
 
Noel's Lecture and Tutorial:
 
https://moodle.cct.ie/mod/scorm/player.php?a=4&currentorg=tuto&scoid=8&sesskey=wc2PiHQ6F5&display=popup&mode=normal
 
 
 
Note: in all the Naive Bayes examples given, the Performance operator used is Performance (Binomial Classification)
 
 
 
 
 
<br />
 
'''Naive Bayes classifiers''' are a family of "probabilistic classifiers" that apply Bayes' theorem to calculate the conditional probability of an event A given that another event B (or several other events) has occurred.
 
 
 
 
 
'''The Naïve Bayes algorithm''' is named as such because it makes a couple of naïve assumptions about the data. In particular, '''it assumes that all of the features in a dataset are equally important and independent''' (strong independence assumptions between the features, hence «naïve»; the features are the conditional events).
 
 
 
 
 
These assumptions are rarely true in real-world applications. However, in most cases when these assumptions are violated, Naïve Bayes still performs fairly well. This is true even in extreme circumstances where strong dependencies are found among the features.
 
 
 
 
 
Bayesian classifiers utilize training data to calculate an observed probability for each class based on feature values (the values of the conditional events). When such classifiers are later used on unlabeled data, they use those observed probabilities to predict the most likely class, given the features in the new data.
 
 
 
 
 
Due to the algorithm's versatility and accuracy across many types of conditions, Naïve Bayes is often a strong first candidate for classification learning tasks.
 
 
 
 
 
<br />
 
'''Bayesian classifiers have been used for:'''
 
* '''Text classification:'''
 
:* Spam filtering: It uses the frequency of the occurrence of words in past emails to identify junk email.
 
:* Author identification, and Topic modeling
 
 
 
 
 
* '''Weather forecast:''' The chance of rain describes the proportion of prior days with similar measurable atmospheric conditions in which precipitation occurred. A 60 percent chance of rain, therefore, suggests that in 6 out of 10 days on record where there were similar atmospheric conditions, it rained.
 
 
 
 
 
* Diagnosis of medical conditions, given a set of observed symptoms.
 
 
 
 
 
* Intrusion detection and anomaly detection on computer networks
 
 
 
 
 
<br />
 
===Probability===
 
The probability of an event can be estimated from observed data by dividing the number of trials in which an event occurred by the total number of trials.
 
 
 
 
 
* '''Events'''  are possible outcomes, such as a heads or tails result in a coin flip, sunny or rainy weather, or <math>Spam</math> and <math>Non\text{-}spam</math> email messages.
 
 
 
* '''A trial''' is a single opportunity for the event to occur, such as a coin flip, a day's weather, or an email message.
 
 
 
 
 
* '''Examples:'''
 
:* If it rained 3 out of 10 days, the probability of rain can be estimated as 30 percent.
 
:* If 10 out of 50 email messages are spam, then the probability of spam can be estimated as 20 percent.
 
 
 
 
 
* '''The notation''' <math>P(A)</math> is used to denote the probability of event <math>A</math>, as in <math>P(spam) = 0.20</math>
 
 
 
 
 
<br />
 
 
 
===Independent and dependent events===
 
If the two events are totally unrelated, they are called '''independent events'''. For instance, the outcome of a coin flip is independent of whether the weather is rainy or sunny.
 
 
 
On the other hand, a rainy day and the presence of clouds are '''dependent events''': the presence of clouds is likely to be predictive of a rainy day. In the same way, the appearance of the word <math>Viagra</math> is predictive of a <math>Spam</math> email.
 
 
 
If all events were independent, it would be impossible to predict any event using data about other events. Dependent events are the basis of predictive modeling.
 
 
 
 
 
<br />
 
===Mutually exclusive and collectively exhaustive===
 
<!-- Events are '''mutually exclusive''' and '''collectively exhaustive'''. -->
 
 
 
In probability theory and logic, a set of events is '''mutually exclusive''' or '''disjoint''' if no two of them can occur at the same time. A clear example is the set of outcomes of a single coin toss, which can result in either heads or tails, but not both. https://en.wikipedia.org/wiki/Mutual_exclusivity
 
 
 
 
 
A set of events is '''jointly''' or '''collectively exhaustive''' if at least one of the events must occur. For example, when rolling a six-sided die, the events <math>1, 2, 3, 4, 5,\ and\ 6</math> (each consisting of a single outcome) are collectively exhaustive, because they encompass the entire range of possible outcomes. https://en.wikipedia.org/wiki/Collectively_exhaustive_events
 
 
 
 
 
If a set of events is mutually exclusive and collectively exhaustive, such as <math>heads</math> or <math>tails</math>, or <math>Spam</math> and <math>Non\text{-}spam</math>, then knowing the probability of <math>n-1</math> outcomes reveals the probability of the remaining one. In other words, if there are two outcomes and we know the probability of one, then we automatically know the probability of the other: For example, given the value <math>P(Spam) = 0.20</math>, we are able to calculate <math>P(Non\text{-}spam) = 1 - 0.20 = 0.80</math>
 
 
 
<br />
 
===Marginal probability===
 
The marginal probability is the probability of a single event occurring, independent of other events. A conditional probability, on the other hand, is the probability that an event occurs given that another specific event has already occurred. https://en.wikipedia.org/wiki/Marginal_distribution
 
 
 
 
 
<br />
 
===Joint Probability===
 
Joint Probability (Independence)
 
 
 
 
 
For any two independent events A and B, the probability of both happening (Joint Probability) is:
 
 
 
 
 
<div style="font-size: 14pt; text-align: center; margin-left:0px">
 
<math>P(A \cap B) = P(A) \times P(B)</math>
 
</div>
 
 
 
 
 
[[File:Joint_probability1.png|400px|thumb|center|Taken from https://corporatefinanceinstitute.com/resources/knowledge/other/joint-probability/ <br /> See also: [[Mathematics#Union - Intersection - Complement]] ]]
 
 
 
 
 
Often, we are interested in monitoring several non-mutually exclusive events for the same trial. If some other events occur at the same time as the event of interest, we may be able to use them to make predictions.
 
 
 
 
 
In the case of Spam detection, consider, for instance, a second event based on the outcome that the email message contains the word Viagra. This word is likely to appear in a Spam message. Its presence in a message is therefore a very strong piece of evidence that the email is Spam.
 
 
 
 
 
We know that <math>20\%</math> of all messages were <math>Spam</math> and <math>5\%</math> of all messages contain the word <math>Viagra</math>. Our job is to quantify the degree of overlap between these two probabilities. In other words, we hope to estimate the probability of both <math>Spam</math> and the word <math>Viagra</math> co-occurring, which can be written as <math>P(Spam \cap Viagra)</math>.
 
 
 
 
 
If we assume that <math>P(Spam)</math> and <math>P(Viagra)</math> are '''independent''' (note, however, that they are not independent), we could then easily calculate the probability of both events happening at the same time, which can be written as <math>P(Spam \cap Viagra)</math>
 
 
 
 
 
Because <math>20\%</math> of all messages are Spam, and <math>5\%</math> of all emails contain the word Viagra, we could assume that <math>5\%</math> of the <math>20\%</math> of Spam messages contain the word <math>Viagra</math>. Thus, <math>5\%</math> of the <math>20\%</math> represents <math>1\%</math> of all messages <math>( 0.05 \times 0.20 = 0.01 )</math>. So, <math>1\%</math> of all messages are <math>Spams\ that\ contain\ the\ word\ Viagra \ \rightarrow \ P(Spam \cap Viagra) = 1\%</math>
 
 
 
 
 
In reality, it is far more likely that <math>P(Spam)</math> and <math>P(Viagra)</math> are highly '''dependent''', which means that this calculation is incorrect. Hence the importance of the '''conditional probability'''.
 
 
 
 
 
<br />
 
 
 
===Conditional probability===
 
Conditional probability is a measure of the probability of an event occurring, given that another event has already occurred. If the event of interest is <math>A</math> and the event <math>B</math> is known or assumed to have occurred, "the conditional probability of <math>A</math> given <math>B</math>", or "the probability of <math>A</math> under the condition <math>B</math>", is usually written as <math>P(A|B)</math>, or sometimes <math>P_{B}(A)</math> or <math>P(A/B)</math>. https://en.wikipedia.org/wiki/Conditional_probability
 
 
 
 
 
For example, the probability that any given person has a cough on any given day may be only <math>5\%</math>. But if we know or assume that the person is sick, then they are much more likely to be coughing. For example, the conditional probability that someone sick is coughing might be <math>75\%</math>, in which case we would have that <math>P(Cough) = 5\%</math> and <math>P(Cough|Sick) = 75\%</math>. https://en.wikipedia.org/wiki/Conditional_probability
 
 
 
 
 
<br />
 
====Kolmogorov definition of Conditional probability====
 
The most common definition appears to be Kolmogorov's.
 
 
 
 
 
Given two events <math>A</math> and <math>B</math> from the sigma-field of a probability space, with the unconditional probability of <math>B</math> being greater than zero (i.e., <math>P(B)>0</math>), the conditional probability of <math>A</math> given <math>B</math> is defined to be the quotient of the probability of the joint of events <math>A</math> and <math>B</math>, and the probability of <math>B</math>: https://en.wikipedia.org/wiki/Conditional_probability
 
 
 
 
 
<div style="font-size: 14pt; text-align: center; margin-left:-100px">
 
<math>
 
P(A \mid B) = \frac{P(A \cap B)}{P(B)}
 
</math>
 
</div>
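As a quick numeric check of this definition, using counts consistent with the spam example developed below (4 of 100 emails are Spam messages containing the word Viagra, and 5 of 100 contain the word at all):

```python
# P(A|B) = P(A ∩ B) / P(B), with A = Spam and B = "contains Viagra"
p_spam_and_viagra = 4 / 100   # 4 of 100 emails are Spam AND contain the word
p_viagra = 5 / 100            # 5 of 100 emails contain the word at all

p_spam_given_viagra = p_spam_and_viagra / p_viagra
print(round(p_spam_given_viagra, 2))  # 0.8
```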
 
 
 
 
 
<br />
 
 
 
====Bayes' theorem====
 
Also called Bayes' rule and Bayes' formula
 
 
 
 
 
'''Thomas Bayes (1763)''': An essay towards solving a problem in the doctrine of chances, Philosophical Transactions of the Royal Society, 370-418.
 
 
 
 
 
Bayes's Theorem provides a way of calculating the conditional probability when we know the conditional probability in the other direction.
 
 
 
 
 
It cannot be assumed that <math>P(A|B) \approx P(B|A)</math>. Now, very often we know a conditional probability in one direction, say <math>P(B|A)</math>, but we would like to know the conditional probability in the other direction, <math>P(A|B)</math>. https://web.stanford.edu/class/cs109/reader/3%20Conditional.pdf. So, we can say that Bayes' theorem provides a way of reversing conditional probabilities: how to find <math>P(A|B)</math> from <math>P(B|A)</math> and vice-versa.
 
 
 
 
 
Bayes's Theorem is stated mathematically as the following equation:
 
 
 
 
 
<div style="font-size: 14pt; text-align: center; margin-left:-100px">
 
<math>
 
P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)}
 
</math>
 
</div>
 
 
 
 
 
<math>P(A \mid B)</math> can be read as the probability of event <math>A</math> given that event <math>B</math> occurred. This is known as conditional probability since the probability of <math>A</math> is dependent or '''conditional''' on the occurrence of event <math>B</math>.
 
 
 
 
 
'''The terms are usually called:'''
 
 
 
{|
 
|* <math>\bold{P(B|A)}</math>
 
|:
 
|Likelihood <ref name=":1" />;
 
|
 
|
 
Also called Update <ref name=":2" />
 
|
 
|
 
|-
 
|* <math>\bold{P(B)}</math>
 
|:
 
|Marginal likelihood;
 
|
 
|
 
Also called Evidence <ref name=":1" />;
 
|
 
|Also called Normalization constant <ref name=":2" />
 
|-
 
|* <math>\bold{P(A)}</math>
 
|:
 
|Prior probability <ref name=":1" />;
 
|
 
|Also called Prior<ref name=":2" />
 
|
 
|
 
|-
 
|* <math>\bold{P(A|B)}</math>
 
|:
 
|Posterior probability <ref name=":1" />;
 
|
 
|Also called Posterior <ref name=":2" />
 
|
 
|
 
|}
 
 
 
 
 
<br />
 
=====Likelihood and Marginal Likelihood=====
 
When we are calculating the probabilities of discrete data, like individual words in our example, and not the probability of something continuous, like weight or height, these '''Probabilities''' are also called '''Likelihoods'''. However, in some sources, you can find the use of the term '''Probability''' even when talking about discrete data. https://www.youtube.com/watch?v=O2L2Uv9pdDA
 
 
 
 
 
In our example:
 
* The probability that the word <math>Viagra</math> was used in previous Spam messages is called the '''Likelihood'''.
 
* The probability that the word <math>Viagra</math> appeared in any email (<math>Spam</math> or <math>Non\text{-}spam</math>) is known as the '''Marginal likelihood.'''
 
 
 
 
 
<br />
 
 
 
=====Prior Probability=====
 
Suppose that you were asked to guess the probability that an incoming email was Spam. Without any additional evidence (other dependent events), the most reasonable guess would be the probability that any prior message was Spam (that is, 20% in the preceding example). This estimate is known as the prior probability. It is sometimes referred to as the «initial guess»
 
 
 
 
 
<br />
 
=====Posterior Probability=====
 
Now suppose that you obtained an additional piece of evidence. You are told that the incoming email contains the word <math>Viagra</math>.
 
 
 
By applying Bayes' theorem to the evidence, we can compute the posterior probability that measures how likely the message is to be Spam.
 
 
 
In the case of Spam classification, if the posterior probability is greater than 50%, the message is more likely to be <math>Spam</math> than <math>Non\text{-}spam</math>, and it can potentially be filtered out.
 
 
 
The following equation is the Bayes' theorem for the given evidence:
 
 
 
 
 
<div style="font-size: 14pt; text-align: center; margin-left:-150px">
 
<math>
 
\overbrace{ P(Spam | Viagra) }^{\bold{\color{salmon}{\text{Posterior probability}}}} = \frac{ \overbrace{ P(Viagra|Spam) }^{\bold{\color{salmon}\text{Likelihood}}} \overbrace{P(Spam)}^{\bold{\color{salmon}\text{Prior probability}}} } { \underbrace{P(Viagra) }_{\bold{\color{salmon}\text{Marginal likelihood}}} }
 
</math>
 
</div>
 
<!-- [[File:BayesTheorem-Posterior_probability.png|500px|thumb|center|]] -->
 
 
 
 
 
<br />
 
===Applying Bayes' Theorem===
 
https://stats.stackexchange.com/questions/66079/naive-bayes-classifier-gives-a-probability-greater-than-1
 
 
 
Let's say that we are training a Spam classifier.
 
 
 
We need information about the frequency of words in <math>Spam</math> and <math>Non\text{-}spam</math> emails (a <math>Non\text{-}spam</math> email is also referred to as a <math>Ham</math> email or just a <math>Normal</math> email). We will assume that the Naïve Bayes learner was trained by constructing a likelihood table for the appearance of these four words in 100 emails, as shown in the following table:
 
 
 
 
 
<div style="text-align: center; margin-left:-150px; font-size: 12pt">
 
{| class="wikitable" style="width: 20px; height: 20px; margin: 0 auto; border: 0px"
 
|+
 
|style="background:white; border: 0px"|
 
! colspan="2" |Viagra
 
! colspan="2" |Money
 
! colspan="2" |Groceries
 
! colspan="2" |Unsubscribe
 
|style="background:white; border: 0px"|
 
|-
 
|style="background:white; border: 0px"|
 
|'''Yes'''
 
|'''No'''
 
|'''Yes'''
 
|'''No'''
 
|'''Yes'''
 
|'''No'''
 
|'''Yes'''
 
|'''No'''
 
|'''Total'''
 
|- style="background: #f7a8b8" |
 
|'''Spam'''
 
|4/20
 
|16/20
 
|10/20
 
|10/20
 
|0/20
 
|20/20
 
|12/20
 
|8/20
 
|'''20'''
 
|- style="background: #92bce8" |
 
|'''Normal'''
 
|1/80
 
|79/80
 
|14/80
 
|66/80
 
|8/80
 
|72/80
 
|23/80
 
|57/80
 
|'''80'''
 
|-
 
|'''Total'''
 
|5/100
 
|95/100
 
|24/100
 
|76/100
 
|8/100
 
|92/100
 
|35/100
 
|65/100
 
|'''100'''
 
|}
 
</div>
 
 
 
 
 
As new messages are received, the posterior probability must be calculated to determine whether the messages are more likely to be Spam or Normal, given the likelihood of the words found in the message text.
 
 
 
 
 
<br />
 
====Scenario 1 - A single feature====
 
Suppose we received a message that contains the word <math>\bold{Viagra}</math>:
 
 
 
We can define the problem as shown in the equation below, which captures the probability that a message is Spam, given that the word 'Viagra' is present:
 
 
 
<div style="font-size: 14pt; text-align: center; margin-left:-150px">
 
<math>
 
P(Spam|Viagra) = \frac{P(Viagra|Spam)P(Spam)}{P(Viagra)}
 
</math>
 
</div>
 
 
 
{|
 
|
 
* <math>\bold{P(Viagra|Spam)}</math>
 
|
 
|(Likelihood)
 
|<div style="margin:  5pt">:</div>
 
|The probability that a Spam message contains the term <math>Viagra</math>
 
|<div style="margin: 10pt"><math>\rightarrow</math></div>
 
|<math>4/20 = 0.20 = 20\%</math>
 
|-
 
|
 
|
 
|
 
|
 
|
 
|
 
|
 
|-
 
|
 
* <math>\bold{P(Viagra)}</math>
 
|
 
|(Marginal likelihood)
 
|<div style="margin:  5pt">:</div>
 
|The probability that the word <math>Viagra</math> appeared in any email (Spam or Normal)
 
|<div style="margin: 10pt"><math>\rightarrow</math></div>
 
|<math>5/100 = 0.05 = 5\%</math>
 
|-
 
|
 
|
 
|
 
|
 
|
 
|
 
|
 
|-
 
|
 
* <math>\bold{P(Spam)}</math>
 
|
 
|(Prior probability)
 
|<div style="margin:  5pt">:</div>
 
|The probability that an email is Spam
 
|<div style="margin: 10pt"><math>\rightarrow</math></div>
 
|<math>20/100 = 0.20 = 20\%</math>
 
|-
 
|
 
|
 
|
 
|
 
|
 
|
 
|
 
|-
 
|
 
* <math>\bold{P(Spam|Viagra)}</math>
 
|
 
|(Posterior probability)
 
|<div style="margin:  5pt">:</div>
 
|The probability that an email is Spam given that it contains the word <math>Viagra</math>
 
|<div style="margin: 10pt"><math>\rightarrow</math></div>
 
|<math>\frac{0.2 \times 0.2}{0.05} = 0.8 = 80\%</math>
 
|-
 
|
 
|
 
|
 
|
 
|
 
|
 
|
 
|-
 
| colspan="7" |
 
* '''The probability that a message is Spam, given that it contains the word "Viagra", is <math>{\bold{80\%}}</math>. Therefore, any message containing this term should be filtered.'''
 
|}
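The same calculation as a minimal Python sketch, with the values read from the likelihood table above:

```python
# Values read from the likelihood table
p_viagra_given_spam = 4 / 20    # likelihood
p_spam = 20 / 100               # prior probability
p_viagra = 5 / 100              # marginal likelihood

# Bayes' theorem: posterior = likelihood * prior / marginal likelihood
p_spam_given_viagra = p_viagra_given_spam * p_spam / p_viagra
print(round(p_spam_given_viagra, 2))  # 0.8
```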
 
 
 
 
 
<br />
 
 
 
====Scenario 2 - Class-conditional independence====
 
Suppose we received a new message that contains the words <math>\bold{Viagra,\ Money}</math> and <math>\bold{Unsubscribe}</math>:
 
 
 
 
 
<div style="font-size: 14pt; text-align: center; margin-left:-150px">
 
<math>
 
P(Spam|Viagra \cap Money \cap Unsubscribe) = \frac{P(Viagra \cap Money \cap Unsubscribe | Spam)P(Spam)}{P(Viagra \cap Money \cap Unsubscribe)}
 
</math>
 
</div>
 
 
 
 
 
<span style="color: #007bff">For a number of reasons, this is computationally difficult to solve. As additional features are added, tremendous amounts of memory are needed to store probabilities for all of the possible intersecting events. Therefore, '''Class-conditional independence''' can be assumed to simplify the problem.</span>
 
 
 
 
 
<br />
 
'''Class-conditional independence'''
 
 
 
The work becomes much easier if we can exploit the fact that Naïve Bayes assumes independence among events. Specifically, Naïve Bayes assumes '''class-conditional independence''', which means that events are independent so long as they are conditioned on the same class value.
 
 
 
Assuming conditional independence allows us to simplify the equation using the probability rule for independent events <math>P(A \cap B) = P(A) \times P(B)</math>. This results in a much easier-to-compute formulation:
 
 
 
 
 
<div style="font-size: 12pt; text-align: left; margin-left:107px">
 
<math>
 
P(Spam\ |\ Viagra \cap Money \cap Unsubscribe) \ \ = \frac{P(Viagra|Spam) \cdot P(Money|Spam) \cdot P(Unsubscribe|Spam) \cdot P(Spam)}{P(Viagra \cap Money \cap Unsubscribe)}
 
</math>
 
</div>
 
 
 
 
 
<div style="font-size: 12pt; text-align: left; margin-left:107px">
 
<math>
 
P(Normal|Viagra \cap Money \cap Unsubscribe) = \frac{P(Viagra|Normal) \cdot P(Money|Normal) \cdot P(Unsubscribe|Normal) \cdot P(Normal)}{P(Viagra \cap Money \cap Unsubscribe)}
 
</math>
 
</div>
 
 
 
 
 
<div style="background: #ededf2; padding: 5px">
 
<span style="color:#007bff; font-weight: bold"> Es <span style="color: red">EXTREMADAMENTE IMPORTANTE</span> notar que the independence assumption made in Naïve Bayes is <span style="color: red; font-weight: bold">Class-conditional</span>. This means that the words a and b appear independently, given that the message is Spam (and also, given that the message is not Spam). This is why we cannot apply this assumption to the denominator of the equation. This is, we CANNOT assume that <span style="border: 0px solid blue; padding: 5px 0px 5px 0px"><math>P(word\ a \cap word\ b) = P(word\ a)P(word\ b)</math></span> because in this case the words are not conditioned to belong to one class (Span or Non-spam). Esto no me queda del todo claro. See this post:</span> https://stats.stackexchange.com/questions/66079/naive-bayes-classifier-gives-a-probability-greater-than-1
 
</div>
 
 
 
 
 
So, we are not able to simplify the denominator. Therefore, what is done in Naïve Bayes is to calculate the numerator for both classes (<math>Spam</math> and <math>Normal</math>). Because the denominator is the same for both classes, the class whose numerator is greater has the greater conditional probability and is therefore the more likely class for the given features.
 
 
 
 
 
<div style="font-size: 10pt; text-align: left;">
 
<math>
 
P(Viagra|Spam) \cdot P(Money|Spam) \cdot P(Unsubscribe|Spam) \cdot P(Spam) = \frac{4}{20} \cdot \frac{10}{20} \cdot \frac{12}{20} \cdot \frac{20}{100} = 0.012
 
</math>
 
</div>
 
 
 
 
 
<div style="font-size: 10pt; text-align: left;">
 
<math>
 
P(Viagra|Normal) \cdot P(Money|Normal) \cdot P(Unsubscribe|Normal) \cdot P(Normal) = \frac{1}{80} \cdot \frac{14}{80} \cdot \frac{23}{80} \cdot \frac{80}{100} = 0.0005
 
</math>
 
</div>
 
 
 
 
 
Because <math>0.012/0.0005 \approx 24</math>, we can say that this message is 24 times more likely to be <math>Spam</math> than <math>Normal</math>.
 
 
 
 
 
Finally, the probability of Spam is equal to the likelihood that the message is Spam divided by the likelihood that the message is either <math>Spam</math> or <math>Normal</math>:
 
 
 
 
 
<div style="font-size: 10pt; text-align: left;">
 
<math>
 
\text{The probability that the message is}\ Spam\ \text{is} = \frac{0.012}{(0.012 + 0.0005)} = 0.96 = 96\%
 
</math>
 
</div>
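A sketch of this numerator comparison and the final normalization, using the table values:

```python
# Class-conditional likelihoods from the table, times each class prior
spam_score   = (4/20) * (10/20) * (12/20) * (20/100)   # ≈ 0.012
normal_score = (1/80) * (14/80) * (23/80) * (80/100)   # ≈ 0.0005

# The shared denominator cancels, so the two scores can be compared
# directly, and normalizing them yields the posterior probability
p_spam = spam_score / (spam_score + normal_score)
print(round(p_spam, 2))  # 0.96
```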
 
 
 
 
 
<br />
 
 
 
====Scenario 3 - Laplace Estimator====
 
<!-- Naïve Bayes problem -->
 
 
 
Suppose we received another message, this time containing the terms: <math>Viagra</math>, <math>Money</math>, <math>Groceries</math>, and <math>Unsubscribe</math>.
 
 
 
 
 
<div style="font-size: 10pt; text-align: left;">
 
<math>
 
P(Viagra|Spam) \cdot P(Money|Spam) \cdot P(Groceries|Spam) \cdot P(Unsubscribe|Spam) \cdot P(Spam) = \frac{4}{20} \cdot \frac{10}{20} \cdot \frac{0}{20} \cdot \frac{12}{20} \cdot \frac{20}{100} = 0
 
</math>
 
</div>
 
 
 
 
 
Surely this is a misclassification, right? This problem arises if an event never occurs for one or more levels of the class. For instance, the term Groceries had never previously appeared in a Spam message. Consequently, <math>P(Groceries|Spam) = 0</math>


This zero value causes the posterior probability of <math>Spam</math> to be zero, giving the presence of the word <math>Groceries</math> the ability to effectively nullify and overrule all of the other evidence.
 
 
 
Even if the email was otherwise overwhelmingly expected to be Spam, the zero likelihood for the word <math>Groceries</math> will always result in a probability of <math>Spam</math> being zero.
 
 
 
 
 
A solution to this problem involves using the '''Laplace estimator'''
 
 
 
 
 
The '''Laplace estimator''', named after the French mathematician Pierre-Simon Laplace, essentially adds a small number to each of the counts in the frequency table, which ensures that each feature has a nonzero probability of occurring with each class.
 
 
 
Typically, the Laplace estimator is set to 1, which ensures that each class-feature combination is found in the data at least once. The Laplace estimator can be set to any value and does not necessarily even have to be the same for each of the features.
 
 
 
Using a value of 1 for the Laplace estimator, we add one to each numerator in the likelihood function. The sum of all the 1s added to the numerator must then be added to each denominator. The likelihood of <math>Spam</math> is therefore:
 
 
 
 
 
<div style="font-size: 10pt; text-align: left;">
 
<math>
 
P(Viagra|Spam) \cdot P(Money|Spam) \cdot P(Groceries|Spam) \cdot P(Unsubscribe|Spam) \cdot P(Spam) = \frac{5}{20} \cdot \frac{11}{20} \cdot \frac{1}{20} \cdot \frac{13}{20} \cdot \frac{20}{100} = 0.0009
 
</math>
 
</div>
 
 
 
 
 
While the likelihood of Normal is:
 
 
 
 
 
<div style="font-size: 10pt; text-align: left;">
 
<math>
 
P(Viagra|Normal) \cdot P(Money|Normal) \cdot P(Groceries|Normal) \cdot P(Unsubscribe|Normal) \cdot P(Normal) = \frac{2}{80} \cdot \frac{15}{80} \cdot \frac{9}{80} \cdot \frac{24}{80} \cdot \frac{80}{100} = 0.0001
 
</math>
 
</div>
 
 
 
 
 
<div style="font-size: 10pt; text-align: left;">
 
<math>
 
\text{The probability that the message is}\ Spam\ \text{is} = \frac{0.0009}{(0.0009 + 0.0001)} = 0.899 \approx 90%
 
</math>
 
</div>
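A sketch of the smoothed computation, adding 1 to each numerator and, per the stated rule, the four added 1s to each class denominator (20 + 4 = 24 for Spam, 80 + 4 = 84 for Normal):

```python
def laplace_likelihoods(counts, class_total, n_features, laplace=1):
    """Smoothed likelihoods: add `laplace` to each count and
    `laplace * n_features` to the class total."""
    denom = class_total + laplace * n_features
    return [(c + laplace) / denom for c in counts]

# Word counts for (Viagra, Money, Groceries, Unsubscribe) in each class
spam_counts, spam_total = [4, 10, 0, 12], 20
normal_counts, normal_total = [1, 14, 8, 23], 80

spam_score = 20 / 100                      # prior P(Spam)
for p in laplace_likelihoods(spam_counts, spam_total, 4):
    spam_score *= p                        # ≈ 0.0004

normal_score = 80 / 100                    # prior P(Normal)
for p in laplace_likelihoods(normal_counts, normal_total, 4):
    normal_score *= p                      # ≈ 0.0001

p_spam = spam_score / (spam_score + normal_score)
print(round(p_spam, 2))  # 0.81
```

Note that no likelihood is zero anymore, so the word Groceries can no longer single-handedly veto the classification.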
 
 
 
 
 
<div class="mw-collapsible mw-collapsed" style="width:100%; background: #ededf2; padding: 1px 5px 1px 5px">
 
''' The presentation shows this example this way. I think there are mistakes in this presentation: '''
 
<div class="mw-collapsible-content">
 
* Let's extend our Spam filter by adding a few additional terms to be monitored: "money", "groceries", and "unsubscribe".
 
* We will assume that the Naïve Bayes learner was trained by constructing a likelihood table for the appearance of these four words in 100 emails, as shown in the following table:
 
 
 
 
 
[[File:ApplyingBayesTheorem-Example.png|800px|thumb|center|]]
 
 
 
 
 
As new messages are received, the posterior probability must be calculated to determine whether the messages are more likely to be Spam or Normal, given the likelihood of the words found in the message text.
 
 
 
 
 
We can define the problem as shown in the equation below, which captures the probability that a message is Spam, given that the words 'Viagra' and Unsubscribe are present and that the words 'Money' and  'Groceries' are not.
 
 
 
 
 
[[File:ApplyingBayesTheorem-ClassConditionalIndependance.png|800px|thumb|center|]]
 
 
 
Using the values in the likelihood table, we can start filling numbers in these equations. Because the denominator is the same in both cases, it can be ignored for now. The overall likelihood of Spam is then:
 
 
 
 
 
<math>
 
\frac{4}{20} \cdot \frac{10}{20} \cdot \frac{20}{20} \cdot \frac{12}{20} \cdot \frac{20}{100} = 0.012
 
</math>
 
 
 
 
 
While the likelihood of Normal given the occurrence of these words is:
 
 
 
 
 
<math>
 
\frac{1}{80} \cdot \frac{60}{80} \cdot \frac{72}{80} \cdot \frac{23}{80} \cdot \frac{80}{100} = 0.002
 
</math>
 
 
 
 
 
Because 0.012/0.002 = 6, we can say that this message is six times more likely to be Spam than Normal. However, to convert these numbers to probabilities, we need one last step.
 
 
 
 
 
The probability of Spam is equal to the likelihood that the message is  Spam divided by the likelihood that the message is either Spam or  Normal:
 
 
 
 
 
<math>
 
\frac{0.012}{(0.012 + 0.002)} = 0.857
 
</math>
 
 
 
 
 
The probability that the message is Spam is 0.857. As this is over the threshold of 0.5, the message is classified as Spam.
 
</div>
 
</div>
 
 
 
 
 
<br />
 
 
 
===Naïve Bayes - Numeric Features===
 
Because Naïve Bayes uses frequency tables for learning the data, each feature must be categorical in order to create the combinations of class and feature values comprising the matrix.
 
 
 
Since numeric features do not have categories of values, the preceding algorithm does not work directly with numeric data.
 
 
 
One easy and effective solution is to discretize numeric features, which simply means that the numbers are put into categories known as bins. For this reason, discretization is also sometimes called '''binning'''.
 
 
 
This method is ideal when there are large amounts of training data, a common condition when working with Naïve Bayes.
 
 
 
There is also a version of Naïve Bayes that uses a '''kernel density estimator''' that can be used on numeric features with a normal distribution.
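A minimal binning sketch in plain Python (the feature and the bin edges below are hypothetical, chosen only for illustration):

```python
def discretize(value, edges):
    """Return the index of the bin a numeric value falls into,
    given a sorted list of bin edges."""
    for i, edge in enumerate(edges):
        if value < edge:
            return i
    return len(edges)

# Hypothetical numeric feature: hour of the day an email was sent,
# binned into night / morning / afternoon / evening
edges = [6, 12, 18]
hours = [2, 9, 14, 23]
bins = [discretize(h, edges) for h in hours]
print(bins)  # [0, 1, 2, 3]
```

The categorical bin indices can then be counted in a frequency table exactly like the word features above.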
 
 
 
 
 
[[File:NaiveBayes-NumericFeatures.mp4|700px|thumb|center|]]
 
 
 
 
 
<br />
 
 
 
===RapidMiner Examples===
 
 
 
* '''Example 1:'''
 
:* [[File:NaiveBayes-RapidMiner_Example1.zip]]
 
 
 
 
 
<br />
 
* '''Example 2:'''
 
[[File:NaiveBayes-RapidMiner_Example2_1.png|950px|thumb|center|Download the directory including the data, video explanation and RapidMiner process file at [[File:NaiveBayes-RapidMiner_Example2.zip]] ]]
 
 
 
 
 
<br />
 
* '''Example 3:'''
 
:* [[File:NaiveBayes-RapidMiner_Example3.zip]]
 
 
 
 
 
<br />
 
 
 
==Perceptrons - Neural Networks and Support Vector Machines==
 
 
 
* 22/06: Recorded class - The Perceptron Algorithm - Support Vector Machines & Neural Networks
 
:* https://drive.google.com/drive/folders/1BaordCV9vw-gxLdJBMbWioX2NW7Ty9Lm
 
 
 
 
 
<br />
 
 
 
==Boosting==
 
 
 
 
 
<br />
 
===Gradient boosting===
 
https://medium.com/mlreview/gradient-boosting-from-scratch-1e317ae4587d
 
 
 
https://freakonometrics.hypotheses.org/tag/gradient-boosting
 
 
 
https://en.wikipedia.org/wiki/Gradient_boosting
 
 
 
https://www.researchgate.net/publication/326379229_Exploring_the_clinical_features_of_narcolepsy_type_1_versus_narcolepsy_type_2_from_European_Narcolepsy_Network_database_with_machine_learning
 
 
 
http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html
 
 
 
 
 
Boosting is a machine learning ensemble meta-algorithm for primarily reducing bias, and also variance in supervised learning, and a family of machine learning algorithms that convert weak learners to strong ones. Boosting is based on the question posed by Kearns and Valiant (1988, 1989): "Can a set of weak learners create a single strong learner?" A weak learner is defined to be a classifier that is only slightly correlated with the true classification (it can label examples better than random guessing). In contrast, a strong learner is a classifier that is arbitrarily well-correlated with the true classification. https://en.wikipedia.org/wiki/Gradient_boosting
 
 
 
Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function.
 
 
 
 
 
Boosting is a sequential process; i.e., trees are grown using the information from a previously grown tree one after the other. This process slowly learns from data and tries to improve its prediction in subsequent iterations. Let's look at a classic classification example:
 
 
 
 
 
[[File:How_does_boosting_work.png|600px|thumb|center|]]
 
 
 
 
 
Four classifiers (in 4 boxes), shown above, are trying hard to classify + and - classes as homogeneously as possible. Let's understand this picture well:
 
 
 
*Box 1: The first classifier creates a vertical line (split) at D1. It says anything to the left of D1 is + and anything to the right of D1 is -. However, this classifier misclassifies three + points.
 
 
 
*Box 2: The next classifier says don't worry I will correct your mistakes. Therefore, it gives more weight to the three + misclassified points (see the bigger size of +) and creates a vertical line at D2. Again it says, anything to the right of D2 is - and left is +.  Still, it makes mistakes by incorrectly classifying three - points.
 
 
 
*Box 3: The next classifier continues to bestow support. Again, it gives more weight to the three - misclassified points and creates a horizontal line at D3. Still, this classifier fails to classify the points (in a circle) correctly.
 
 
 
*Remember that each of these classifiers has a misclassification error associated with them.
 
 
 
*Boxes 1,2, and 3 are weak classifiers. These classifiers will now be used to create a strong classifier Box 4.
 
 
 
*Box 4: It is a weighted combination of the weak classifiers. As you can see, it does a good job of classifying all the points correctly.
 
 
 
That's the basic idea behind boosting algorithms. The very next model capitalizes on the misclassification/error of the previous model and tries to reduce it.
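The reweighting scheme illustrated in Boxes 1-4 is essentially AdaBoost. A minimal numpy sketch using axis-aligned decision stumps as the weak learners (function and variable names are our own illustrative choices, not any library's API):

```python
import numpy as np

def fit_boosted_stumps(X, y, n_rounds=10):
    """Minimal AdaBoost-style boosting with decision stumps; y must be in {-1, +1}."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)              # start with equal weights (Box 1)
    learners = []                        # (feature, threshold, sign, alpha)
    for _ in range(n_rounds):
        best = None
        for j in range(d):               # pick the stump with the lowest *weighted* error
            for t in np.unique(X[:, j]):
                for s in (1, -1):
                    pred = s * np.where(X[:, j] <= t, 1, -1)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, t, s, pred)
        err, j, t, s, pred = best
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)   # weight of this weak learner
        w *= np.exp(-alpha * y * pred)          # misclassified points get bigger (Boxes 2-3)
        w /= w.sum()
        learners.append((j, t, s, alpha))
    return learners

def predict_boosted(learners, X):
    # Box 4: a weighted vote of all the weak classifiers
    score = sum(a * s * np.where(X[:, j] <= t, 1, -1) for j, t, s, a in learners)
    return np.sign(score)
```

Each round adds the stump that minimises the weighted error and then increases the weights of the points it misclassified, so the next stump focuses on exactly those points.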
 
 
 
 
 
<br />
 
==K Means Clustering==
 
 
 
K Means Clustering is an unsupervised learning algorithm that will attempt to group similar clusters together in your data. So, the overall goal is to divide data into distinct groups such that observations within each group are similar. (Jose Portilla)
 
 
 
K Means Clustering is an unsupervised learning algorithm that tries to cluster data based on their similarity. Unsupervised learning means that there is no outcome to be predicted, and the algorithm just tries to find patterns in the data. In k means clustering, we have to specify the number of clusters we want the data to be grouped into. The algorithm randomly assigns each observation to a cluster and finds the centroid of each cluster. Then, the algorithm iterates through two steps:
 
Reassign data points to the cluster whose centroid is closest. Calculate the new centroid of each cluster. These two steps are repeated until the within-cluster variation cannot be reduced any further. The within-cluster variation is calculated as the sum of the Euclidean distance between the data points and their respective cluster centroids. (Jose Portilla)
 
 
 
 
 
 
 
<div style="width: 420pt; margin: 0 auto">{{#ev:youtube|https://www.youtube.com/watch?v=4b5d3muPQmA&t=215s}} StatQuest: https://www.youtube.com/watch?v=4b5d3muPQmA&t=215s</div>
 
 
 
 
 
 
 
<br />
 
'''So what does a typical clustering problem look like:''' (Jose Portilla)
 
* Cluster similar documents
 
* cluster customers based on features
 
* Market Segmentation
 
* Identify similar physical groups
 
 
 
 
 
 
 
<br />
 
'''The algorithm:''' (StatQuest)
 
<blockquote>
 
* '''Step 1: Select the number of clusters you want to identify in your data. This is the "K"'''
 
 
 
* '''Step 2: Randomly select K distinct data points (3 in this example):''' These will be the initial clusters
 
 
 
* '''Step 3: Measure the distance between every value in the data (data point) and the three initial clusters'''
 
<blockquote>
 
:* 1st point: Measure the distance between the 1st point and the three initial clusters.
 
:* 2nd point: Measure the distance between the 2nd point and the three initial clusters.
 
:* 3rd point: Measure the distance between the 3rd point and the three initial clusters.
 
:: .
 
:: .
 
:: .
 
:* nth point: Measure the distance between the nth point and the three initial clusters. At this stage, all the points (values) will be assigned to a cluster.
 
</blockquote>
 
 
 
* '''Step 4: Calculate the mean of each cluster:''' Now the means become the three cluster reference points.
 
 
 
* '''Step 5: Repeat Step 3 but using the new three cluster reference points (the means of each cluster)'''
 
<blockquote>
 
:* 1st point: Measure the distance between the 1st point and the new three cluster reference points (the means of each cluster)
 
:: .
 
:: .
 
:: .
 
:* nth point ...
 
</blockquote>
 
 
 
* '''Step 6: Repeat Steps 4 and 5 until the clustering doesn't change with respect to the previous iteration'''
 
 
 
* '''Step 7: Calculate the «Total variation/variance» which is given by the sum of the variations of each cluster:''' The «Total variation/variance» will give us a measure of the quality of the cluster. A lower «Total variation/variance» means a better cluster.
 
 
 
* '''Step 8: Repeat the process from Step 2 to 7:''' So it will do the whole thing over again with different starting points. This will be repeated as many times as you specify.
 
 
 
* '''Step 9: The final clustering will be the one with the lowest «Total variation/variance».'''
 
</blockquote>
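The steps above can be sketched in numpy. This is a minimal illustration of the algorithm, not a production implementation (the function name and the empty-cluster guard are our own choices):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: randomly select k distinct data points as the initial cluster centres
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Steps 3 and 5: assign every point to its nearest centre
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centre as the mean of its cluster
        new_centres = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        # Step 6: stop when the clustering no longer changes
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    # Step 7: total within-cluster variation (lower is better)
    total_variation = sum(((X[labels == j] - centres[j]) ** 2).sum() for j in range(k))
    return labels, centres, total_variation
```

Steps 8-9 amount to calling this function with several different seeds and keeping the run with the lowest total variation.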
 
 
 
 
 
 
 
<br />
 
'''How to calculate the value of K:''' (StatQuest)
 
 
 
 
 
 
 
<br />
 
===Clustering class of the Noel course===
 
Clustering is the task of finding groups of data that are similar when no class label is available.
 
 
 
 
 
This is a type of '''unsupervised learning''' because there is no training stage. Also, because it is unsupervised learning there is no "ground truth", so the results are frequently subjective.
 
 
 
 
 
Clustering can be used as an exploratory technique to discover naturally occurring groups that can be later used in classification.
 
 
 
 
 
X-means clustering is a development of k-means that refines cluster assignment, using an information criterion such as the Akaike information criterion (AIC) or the Bayesian information criterion (BIC) to keep the best splits.
 
 
 
 
 
Unlike supervised learning and in common with all unsupervised approaches, a clustering algorithm runs on the whole data set. There is no train/test split.
 
 
 
It creates cluster labels, usually just a, b, c,... or 1, 2, 3,..., and assigns each observation to one of the cluster labels (exclusive clustering) or to one or more cluster labels (fuzzy clustering). As such, there is no intrinsic meaning to cluster labels.
 
 
 
 
 
The assignment of an observation to a cluster label is inferred from some similarity (or dissimilarity) measure.
 
 
 
 
 
No model is generated, so if we obtain new data we have to go through the whole process again from the beginning.
 
 
 
 
 
'''For example''', let's say that we have a list of customers and we want to divide the customers into a few groups. In this case, we can use a clustering algorithm to try to find groups (the best way to separate our customers).
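As a quick sketch of this customer example (the data and feature names are invented for illustration), scikit-learn's KMeans can be run on the whole data set at once, with no train/test split:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customers described by [annual spend, visits per month]
customers = np.array([
    [200, 2], [220, 3], [250, 2],      # a low-spend group
    [900, 10], [950, 12], [880, 11],   # a high-spend group
])

# The whole data set is clustered at once - there is no training stage
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(km.labels_)   # e.g. [1 1 1 0 0 0] - the labels have no intrinsic meaning
```

If new customers arrive, the whole clustering has to be run again from scratch, since no reusable model is produced.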
 
 
 
 
 
<br />
 
====RapidMiner example 1====
 
<br />
 
[[File:RapidMiner_Clustering_example1-Using_the_Iris_data.mp4]]
 
 
 
* We can try with different values of k (number of clusters): 2, 3, 4, 5 and compare performances.
 
 
 
 
 
<br />
 
==Principal Component Analysis PCA==
 
<div style="width: 420pt; margin: 0 auto">{{#ev:youtube|https://www.youtube.com/watch?v=FgakZw6K1QQ&t=1013s}} StatQuest: https://www.youtube.com/watch?v=FgakZw6K1QQ&t=1013s</div>
 
 
 
 
 
<br />
 
 
 
==Association Rules - Market Basket Analysis==
 
 
 
 
 
<br />
 
===Association Rules example in RapidMiner===
 
Ensure you have the Weka extension installed. To do this, click on Get more operators from the marketplace at the bottom left of the RapidMiner window.
 
 
 
 
 
This kind of analysis is done with the goal of discovering patterns in data.
 
 
 
 
 
<br />
 
==Time Series Analysis==
 
 
 
E-Lecture: https://moodle.cct.ie/mod/scorm/view.php?id=61932
 
 
 
 
 
<br />
 
 
 
==[[Text Analytics|Text Analytics / Mining]]==
 
<br />
 
 
 
<br />
 
==Model Evaluation==
 
E-learning link: https://moodle.cct.ie/mod/scorm/player.php?a=5&currentorg=tuto&scoid=10&sesskey=4EXk0T1DT7&display=popup&mode=normal
 
 
 
 
 
<br />
 
===Why evaluate models===
 
When we build machine learning models, whether for classification or regression, we need some indication of how the model will perform on previously unseen data. We need a measure of model quality.
 
 
 
 
 
Also, when we build multiple models of different types (Naïve Bayes and Decision Tree, for example) we need a means of inter-comparing the performance of the models.
 
 
 
 
 
<br />
 
===Evaluation of regression models===
 
Understanding Regression Error Metrics in Python: https://www.dataquest.io/blog/understanding-regression-error-metrics/
 
<br />
 
{| class="wikitable"
 
|+
 
|-
 
! colspan="3" style="vertical-align:top;" |
 
<div style="text-align:left; vertical-align:top; font-weight: normal">
 
'''Regression Error'''
 
 
 
The evaluation of regression models involves calculations on the errors (also known as residuals or innovations).
 
 
 
Errors are the differences between the predicted values, represented as <math>\hat{y}</math> and the actual values, denoted <math>y</math>.
 
<div style="margin: auto; width: 50%; border: 0px solid blue; padding: 10px;">
 
{|
 
![[File:Regression_errors.png|300px|center|link=Special:FilePath/Regression_errors.png]]
 
!
 
{| class="wikitable" style="width: 20px; height: 20px; margin: 0 auto;"
 
!<math>y</math>
 
!<math>\hat{y}</math>
 
!<math>\left \vert y - \hat{y} \right \vert</math>
 
|-
 
|5
 
|6
 
|1
 
|-
 
|6.5
 
|5.5
 
|1
 
|-
 
|8
 
|9.5
 
|1.5
 
|-
 
|8
 
|6
 
|2
 
|-
 
|7.5
 
|10
 
|2.5
 
|}
 
|}
 
</div>
 
</div>
 
|-
 
! style="vertical-align:top;" |<h5 style="text-align:left; vertical-align:top">Mean Absolute Error - MAE</h5>
 
|The Mean Absolute Error (MAE) is calculated taking the sum of the absolute differences between the actual and predicted values (i.e. the errors with the sign removed) and multiplying it by the reciprocal of the number of observations.
 
 
 
Note that the value returned by the equation is dependent on the range of the values in the dependent variable. It is '''scale dependent'''.
 
 
 
MAE is preferred by many as the evaluation metric of choice as it gives equal weight to all errors, irrespective of their magnitude.
 
|
 
<div class="mw-collapsible mw-collapsed" data-expandtext="+/-" data-collapsetext="+/-">
 
<math>
 
MAE = \frac{1}{n} \sum_{i=1}^{n} \left \vert Y_i - \hat{Y}_i \right \vert
 
</math>
 
<div class="mw-collapsible-content">
 
<br /><math>
 
MAE = \frac{1}{5} \times 8 = 1.6
 
</math>
 
</div>
 
</div>
 
|-
 
! style="vertical-align:top;" |<h5 style="text-align:left; vertical-align:top">Mean Squared Error - MSE</h5>
 
|
 
<div class="mw-collapsible mw-collapsed" data-expandtext="+/-" data-collapsetext="+/-">
 
The Mean Squared Error (MSE) is very similar to the MAE, except that it is calculated taking the sum of the squared differences between the actual and predicted values and multiplying it by the reciprocal of the number of observations. Note that squaring the differences also removes their sign.
 
<div class="mw-collapsible-content">
 
<br />
 
As with MAE, the value returned by the equation is dependent on the range of the values in the dependent variable. It is '''scale dependent'''.
 
</div>
 
</div>
 
|
 
<div class="mw-collapsible mw-collapsed" data-expandtext="+/-" data-collapsetext="+/-">
 
<math>
 
MSE = \frac{1}{n} \sum_{i=1}^n (Y_i - \hat{Y}_i)^2
 
</math>
 
<div class="mw-collapsible-content">
 
<br />
 
<math>
 
MSE = \frac{1}{5} \times 14.5 = 2.9
 
</math>
 
</div>
 
</div>
 
|-
 
! style="vertical-align:top;" |<h5 style="text-align:left; vertical-align:top">Root Mean Squared Error</h5>
 
|
 
<div class="mw-collapsible mw-collapsed" data-expandtext="+/-" data-collapsetext="+/-">
 
The Root Mean Squared Error (RMSE) is basically the same as MSE, except that it is calculated by taking the square root of the mean of the squared differences between the actual and predicted values.
 
<div class="mw-collapsible-content">
 
<br />
 
As with MAE and MSE, the value returned by the equation is dependent on the range of the values in the dependent variable. It is '''scale dependent'''.
 
 
 
 
 
MSE and its related metric, RMSE, have been both criticized because they both give heavier weight to larger magnitude errors (outliers). However, this property may be desirable in some circumstances, where large magnitude errors are undesirable, even in small numbers.
 
</div>
 
</div>
 
|
 
<div class="mw-collapsible mw-collapsed" data-expandtext="+/-" data-collapsetext="+/-">
 
<math>
 
RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^n (Y_i - \hat{Y}_i)^2 }
 
</math>
 
<div class="mw-collapsible-content">
 
<br />
 
<math>RMSE = \sqrt{ \frac{1}{5} \times 14.5} = \sqrt{2.9} = 1.7029</math>
 
</div>
 
</div>
 
|-
 
! style="vertical-align:top;" |<h5 style="text-align:left; vertical-align:top">Relative Error</h5>
 
|<div class="mw-collapsible mw-collapsed" data-expandtext="+/-" data-collapsetext="+/-">
 
The relative error (also known as approximation error) is an average measure of the difference between an actual value and the estimate of that value. It is given by the average of the absolute difference between the values divided by the actual value.
 
<div class="mw-collapsible-content">
 
<br />
 
'''Verify this formula, as there may be an error in the professor's slide.'''
 
</div>
 
</div>
 
|<div class="mw-collapsible mw-collapsed" data-expandtext="+/-" data-collapsetext="+/-">
 
<math>
 
RE = \frac{1}{n} \sum_{i=1}^n \left \vert \frac{Y_i - \hat{Y}_i}{Y_i} \right \vert
 
</math>
 
<div class="mw-collapsible-content">
 
<br />
 
<math>RE = \frac{1}{5} \times 1.1247 = 0.2249</math>
 
</div>
 
</div>
 
|-
 
! style="vertical-align:top;" |<h5 style="text-align:left; vertical-align:top">Mean Absolute Percentage Error</h5>
 
|<div class="mw-collapsible mw-collapsed" data-expandtext="+/-" data-collapsetext="+/-">
 
Mean Absolute Percentage Error (MAPE) is a '''scale-independent''' measure of the performance of a regression model. It is calculated by summing the absolute values of the differences between the actual and predicted values divided by the actual values, multiplying this by the reciprocal of the number of observations, and finally multiplying by 100 to obtain a percentage.
 
<div class="mw-collapsible-content">
 
<br />
 
Although it offers a scale-independent measure, MAPE is not without problems:
 
* It cannot be employed if any of the actual values are exactly zero, as this would result in a division-by-zero error.
 
* Where predicted values frequently exceed the actual values, the percentage error can exceed 100%
 
* It penalizes negative errors more than positive errors, meaning that models that routinely predict below the actual values will have a higher MAPE.
 
</div>
 
</div>
 
|
 
<div class="mw-collapsible mw-collapsed" data-expandtext="+/-" data-collapsetext="+/-">
 
<math>
 
MAPE = \frac{1}{n} \sum_{i=1}^n \left \vert \frac{Y_i - \hat{Y}_i}{Y_i} \right \vert \times 100
 
</math>
 
<div class="mw-collapsible-content">
 
<br />
 
<math>MAPE = \frac{1}{5} \times 1.1247 \times 100 = 22.4936</math>
 
</div>
 
</div>
 
|-
 
! style="vertical-align:top;" |<h5 style="text-align:left; vertical-align:top">R squared</h5>
 
|<div class="mw-collapsible mw-collapsed" data-expandtext="+/-" data-collapsetext="+/-">
 
<math>R^2</math>, or the Coefficient of Determination, is the ratio of the amount of variance explained by a model to the total amount of variance in the dependent variable, and lies in the range [0,1].
 
 
 
 
 
Values close to 1 indicate that a model will be better at predicting the dependent variable.
 
<div class="mw-collapsible-content">
 
<br />
 
R squared is calculated by summing up the squared differences between the predicted values and the actual values (the top part of the equation) and dividing that by the squared deviation of the actual values from their mean (the bottom part of the equation). The resulting value is then subtracted from 1.
 
 
 
 
 
A high <math>R^2</math> is not necessarily an indicator of a good model, as it could be the result of overfitting.
 
</div>
 
</div>
 
|
 
<div class="mw-collapsible mw-collapsed" data-expandtext="+/-" data-collapsetext="+/-">
 
<math>
 
R^2 = 1 - \frac{SS_{res}}{SS_{tot}}
 
= 1 - \frac{\sum_{i=1}^n(y_i - \hat{y}_i)^2}{\sum_{i=1}^n(y_i - \bar{y})^2}
 
</math>
 
<div class="mw-collapsible-content">
 
<br />
 
<math>
 
...
 
</math>
 
</div>
 
</div>
 
|-
 
! style="vertical-align:top;" |<h5 style="text-align:left; vertical-align:top">Spearman’s rho</h5>
 
|<div class="mw-collapsible mw-collapsed" data-expandtext="+/-" data-collapsetext="+/-">
 
Spearman’s rho <math>\rho</math> is a measure of the strength of the monotonic relationship between two variables. Although similar to Pearson’s correlation, it differs in that the value is calculated after the numeric values are replaced with their ranks.
 
 
 
 
 
Converting the values to ranks results in the smallest value of <math>x</math> having a rank of 1, the second smallest having a rank of 2, and so on. The same ranking is carried out on the <math>y</math> values. A standard Pearson’s correlation is then carried out on the ranked data.
 
<div class="mw-collapsible-content">
 
<br />
 
...
 
</div>
 
</div>
 
|
 
{|
 
!Given the data in the table below:         
 
 
 
{| class="wikitable"
 
!<math>x</math>
 
!<math>y</math>
 
|-
 
|7
 
|2
 
|-
 
|3
 
|5
 
|-
 
|9
 
|11
 
|-
 
|11
 
|10
 
|}
 
!After ranking the data would be:
 
{| class="wikitable"
 
!<math>x</math>
 
!<math>y</math>
 
|-
 
|2
 
|1
 
|-
 
|1
 
|2
 
|-
 
|3
 
|4
 
|-
 
|4
 
|3
 
|}
 
|}
 
When the correlation between rankings is ##
 
|}
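The worked examples in the table above can be reproduced in a few lines of numpy, using the five <math>(y, \hat{y})</math> pairs from the first row of the table:

```python
import numpy as np

y = np.array([5.0, 6.5, 8.0, 8.0, 7.5])        # actual values from the table
y_hat = np.array([6.0, 5.5, 9.5, 6.0, 10.0])   # predicted values from the table

errors = y - y_hat
mae = np.abs(errors).mean()            # → 1.6
mse = (errors ** 2).mean()             # → 2.9
rmse = np.sqrt(mse)                    # → 1.7029...
re = np.abs(errors / y).mean()         # → 0.2249...
mape = re * 100                        # → 22.4936...
```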
 
 
 
 
 
<br />
 
 
 
===Evaluation of classification models===
 
<br />
 
{| class="wikitable"
 
|+
 
|-
 
! style="vertical-align:top;" |<h5 style="text-align:left; vertical-align:top">Confusion Matrix or Coincidence Matrix</h5>
 
|[[File:ConfusionMatrix.png|center|400px]]
 
 
 
 
 
It is very important to notice that the Confusion Matrix can also be the transpose of the above:
 
 
 
{|
 
|-style="border:none; margin: 0 auto;"|
 
|style="border:none; margin: 0 auto;"|
 
{| class="wikitable" style="border:none; margin: 0 auto;"
 
!style="background:white; border:none;" colspan="2" rowspan="2"|
 
!colspan="2" style="background:none;"| Predicted class
 
|-
 
!P
 
!N
 
|-
 
!rowspan="2" style="height:6em;"|<div style="{{transform-rotate|-90}}">Actual<br>class</div>
 
!P
 
|'''TP'''
 
|FN
 
|-
 
!N
 
|FP
 
|'''TN'''
 
|-
 
|}
 
|
 
|}
 
where: P = positive; N = Negative; TP = True Positive; FP = False Positive; TN = True Negative; FN = False Negative. https://en.wikipedia.org/wiki/Confusion_matrix
 
 
 
<span style="background:#E6E6FA ">For example, the «confusion_matrix» function from «sklearn.metrics» uses this last structure</span>. See how to create a proper Confusion Matrix using that function and pandas at https://stackoverflow.com/questions/50325786/sci-kit-learn-how-to-print-labels-for-confusion-matrix
 
 
 
|[[File:ConfusionMatrix-Example.png|center|400px]]
 
|-
 
! style="vertical-align:top;" |<h5 style="text-align:left; vertical-align:top">Accuracy</h5>
 
|This is the number of examples correctly predicted as a fraction of the total number.
 
|
 
<div class="mw-collapsible mw-collapsed" data-expandtext="+/-" data-collapsetext="+/-">
 
<math>
 
Accuracy = \frac{TP + TN}{TP + TN + FP + FN}
 
</math>
 
<div class="mw-collapsible-content">
 
<br /><math>
 
Accuracy = \frac{72 + 24}{72 + 24 + 16 + 6} = \frac{96}{120} = 0.8
 
</math>
 
</div>
 
</div>
 
|-
 
! style="vertical-align:top;" |<h5 style="text-align:left; vertical-align:top">Balanced Accuracy</h5>
 
|
 
<div class="mw-collapsible mw-collapsed" data-expandtext="+/-" data-collapsetext="+/-">
 
If the balance in your response variable is close to perfect, i.e. if the numbers of examples for each class to be predicted are close to each other, and if your emphasis is on the number of correct predictions, then accuracy is an appropriate metric. However, if your dataset exhibits '''class imbalance''', accuracy is likely to give misleading results. In such cases, '''Balanced Accuracy''' is likely to give a much better indication of how well classes are being predicted.
 
<div class="mw-collapsible-content">
 
<br />
 
In fact, Accuracy is often not a good measure of the performance of a model. Take the example of predicting a nasty, but treatable, illness. 1 in every 10000 people has some disposition to the illness. If we detect it, it is treatable; if not, it is fatal.
 
 
 
If we assume our classifier always predicts 'no' as it is lazy and doesn't take into account the data, it will be correct 99.99% of the time. So it will have 99.99% accuracy.
 
 
 
Such a classifier is clearly not doing what it was designed to do, and because it fails to detect the condition of interest, it is, therefore, worse than useless.
 
 
 
This is a problem of class imbalance: when one or more classes are (often massively) more prevalent than others.
 
 
 
 
 
For reasons such as this, we need other notions of performance and quality for Data Mining and Machine learning methods.
 
</div>
 
</div>
 
|
 
<div class="mw-collapsible mw-collapsed" data-expandtext="+/-" data-collapsetext="+/-">
 
<math>
 
Balanced Accuracy = \frac{ \frac{TP}{TP + FN} + \frac{TN}{TN + FP} }{2}
 
</math>
 
<div class="mw-collapsible-content">
 
<br />
 
<math>
 
Balanced Accuracy = \frac{\frac{72}{72 + 8} + \frac{24}{24 + 16} }{2} = \frac{0.9 + 0.6}{2} = 0.75
 
</math>
 
</div>
 
</div>
 
|-
 
! style="vertical-align:top;" |<h5 style="text-align:left; vertical-align:top">Sensitivity and Specificity</h5>
 
|
 
<div class="mw-collapsible mw-collapsed" data-expandtext="+/-" data-collapsetext="+/-">
 
'''Sensitivity:''' Proportion of positive examples correctly classified.
 
 
 
 
 
'''Specificity:''' Proportion of negative examples correctly classified
 
<div class="mw-collapsible-content">
 
<br />
 
Classification is often a balance between conservative and aggressive making.
 
 
 
For example, we could predict that everybody has the fatal disease or we could predict that nobody has the disease. Sensitivity and Specificity capture this trade-off. These terms come from the medical domain.
 
</div>
 
</div>
 
|
 
<div class="mw-collapsible mw-collapsed" data-expandtext="+/-" data-collapsetext="+/-">
 
<math>Sensitivity = \frac{TP}{(TP + FN)}</math>
 
 
 
<math>Specificity = \frac{TN}{(TN + FP)}</math>
 
<div class="mw-collapsible-content">
 
<br />
 
<math>Sensitivity = \frac{72}{72 + 8} = 0.9</math>
 
 
 
<math>Specificity = \frac{24}{24 + 16} = 0.6</math>
 
</div>
 
</div>
 
|-
 
! style="vertical-align:top;" |<h5 style="text-align:left; vertical-align:top">Precision and Recall</h5>
 
|
 
<div class="mw-collapsible mw-collapsed" data-expandtext="+/-" data-collapsetext="+/-">
 
These are very closely related to sensitivity and specificity; but whereas the former come from the medical domain, these come from the domain of information retrieval.
 
As for sensitivity and specificity, for most real-world problems, it is difficult to have a model that is both highly precise and exhibits high recall.
 
 
 
 
 
'''Precision:'''
 
 
 
Otherwise termed the positive predictive value, it is the proportion of predicted positive examples that are truly positive. High precision means that only very likely positives are predicted as positive.
 
Precise models are trustworthy.
 
 
 
For the fatal disease case, high precision means that those identified as sufferers really are sufferers.
 
 
 
 
 
'''Recall:'''
 
 
 
Recall is a measure of how complete the results are.
 
 
 
Basically the same as sensitivity, but with a subtle difference in interpretation.
 
 
 
High recall means capturing a large portion of the positive examples.
 
 
 
For predicting the fatal disease, high recall means the majority of those who have the disease are identified
 
<div class="mw-collapsible-content">
 
<br />
 
...
 
</div>
 
</div>
 
|
 
<div class="mw-collapsible mw-collapsed" data-expandtext="+/-" data-collapsetext="+/-">
 
<math>Precision = \frac{TP}{(TP + FP)}</math>
 
 
 
 
 
<math>Recall = \frac{TP}{(TP + FN)}</math>
 
<div class="mw-collapsible-content">
 
<br />
 
<math>Precision = \frac{72}{72 + 16} = 0.8182</math>
 
 
 
 
 
<math>Recall = \frac{72}{72 + 8} = 0.90</math>
 
</div>
 
</div>
 
|-
 
! style="vertical-align:top;" |<h5 style="text-align:left; vertical-align:top">The F1-Score</h5>
 
|
 
<div class="mw-collapsible mw-collapsed" data-expandtext="+/-" data-collapsetext="+/-">
 
The F1-Score (also called the F-Score or the F-measure) is a way to combine both precision and recall into a single measure. It is a value in the range [0,1], with 1 indicating perfect precision and recall.
 
<div class="mw-collapsible-content">
 
<br />
 
This makes it easier to compare models, but it does not address the trade-off between precision and recall as it regards them to be equally important.
 
 
 
The F1-Score uses the harmonic mean instead of the arithmetic mean in order to place a higher emphasis on the positive count.
 
 
 
 
 
We could assign weights to the precision or recall elements of the F1-Score, but it is difficult to do this without the weights being arbitrary.
 
 
 
Instead of weighting the F1-Score, we can use it in combination with other more globally encapsulating measures of a model's strengths and weaknesses.
 
</div>
 
</div>
 
|
 
<div class="mw-collapsible mw-collapsed" data-expandtext="+/-" data-collapsetext="+/-">
 
<math>
 
F1Score = 2 \times \frac{Precision \times Recall}{Precision + Recall} = \frac{2 \times TP}{2 \times TP + FP + FN}
 
</math>
 
<div class="mw-collapsible-content">
 
<br />
 
<math>
 
F1Score = \frac{2 \times 72}{2 \times 72 + 16 + 8} = 0.8571
 
</math>
 
</div>
 
</div>
 
|-
 
! style="vertical-align:top;" |<h5 style="text-align:left; vertical-align:top">Matthews Correlation Coefficient</h5>
 
|
 
<div class="mw-collapsible mw-collapsed" data-expandtext="+/-" data-collapsetext="+/-">
 
The F1-Score is adequate as a metric when precision and recall are considered equally important, or when the relative weighting between the two can be determined non-arbitrarily.
 
 
 
An alternative for cases where that does not apply is the '''Matthews Correlation Coefficient'''. It returns a value in the interval <math>[-1,+1]</math>, where -1 suggests total disagreement between predicted values and actual values, 0 is indicative that any agreement is the product of random chance and +1 suggests perfect prediction.
 
<div class="mw-collapsible-content">
 
<br />
 
So, if the value is -1, every value that is true will be predicted as false and every value that is false will be predicted as true. If the value is +1, every value that is true will be predicted as such and every value that is false will be predicted as such.
 
 
 
 
 
Unlike any of the metrics we have seen in previous slides, the Matthews Correlation coefficient takes into account all four categories in the confusion matrix.
 
</div>
 
</div>
 
|
 
<div class="mw-collapsible mw-collapsed" data-expandtext="+/-" data-collapsetext="+/-">
 
<math>
 
MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP) \times (TP + FN) \times (TN + FP) \times (TN + FN)}}
 
</math>
 
<div class="mw-collapsible-content">
 
<br />
 
<math>
 
MCC = \frac{1728 - 128}{\sqrt{88 \times 80 \times 40 \times 32}} = 0.5330
 
</math>
 
</div>
 
</div>
 
|-
 
! style="vertical-align:top;" |<h5 style="text-align:left; vertical-align:top">Cohen's Kappa</h5>
 
|
 
<div class="mw-collapsible mw-collapsed" data-expandtext="+/-" data-collapsetext="+/-">
 
Cohen's Kappa is a measure of the amount of agreement between two raters classifying N items into C mutually-exclusive categories.
 
 
 
 
 
It is defined by the equation given below, where <math>\rho_0</math> is the observed agreement between raters and <math>\rho_e</math> is the hypothetical agreement that would be expected to occur by random chance.
 
 
 
 
 
Landis and Koch (1977) suggest an interpretation of the magnitude of the results as follows:
 
 
 
<math>
 
\begin{array}{lcl}
 
0            & = & \text{agreement equivalent to chance}  \\
 
0.10 - 0.20  & = & \text{slight agreement}      \\
 
0.21 - 0.40  & = & \text{fair agreement}        \\
 
0.41 - 0.60  & = & \text{moderate agreement}    \\
 
0.61 - 0.80  & = & \text{substantial agreement}  \\
 
0.81 - 0.99  & = & \text{near perfect agreement} \\
 
1            & = & \text{perfect agreement}      \\
 
\end{array}
 
</math>
 
<div class="mw-collapsible-content">
 
<br />
 
...
 
</div>
 
</div>
 
|
 
<div class="mw-collapsible mw-collapsed" data-expandtext="+/-" data-collapsetext="+/-">
 
<math>K = 1  - \frac{1 - \rho_0}{1 - \rho_e}</math>
 
<div class="mw-collapsible-content">
 
<br />
 
*'''Calculate <math>\rho_0</math>'''<blockquote>The agreement on the positive class is 72 instances and on the negative class is 24 instances. So the agreement is 96 instances out of a total of 120: <math>\rho_0 = \frac{96}{120} = 0.8</math>. Note this is the same as the accuracy
 
 
 
*'''Calculate the probability of random agreement on the «positive» class:'''
 
 
 
<blockquote>The probability that both actual and predicted would agree on the positive class at random is the proportion of the total the positive class makes up for each of actual and predicted.
 
 
 
For the actual class, this is:
 
 
 
<math>\frac{(72 + 8)}{120} = 0.6666</math>
 
 
 
For the predicted class this is:
 
 
 
<math>\frac{(72 + 16)}{120} = 0.7333</math>
 
 
 
The total probability that both actual and predicted will randomly agree on the positive class is <math>0.6666 \times 0.7333 = 0.4888</math></blockquote>
 
 
 
*'''Calculate the probability of random agreement on the «negative» class:'''
 
 
 
<blockquote>The probability that both actual and predicted would agree on the negative class at random is the product of the proportions of the total that the negative class makes up in each of the actual and predicted labels.
 
 
 
For the actual class, this is
 
 
 
<math>\frac{(16 + 24)}{120} = 0.3333</math>
 
 
 
For the predicted class this is
 
 
 
<math>\frac{(8 + 24)}{120} = 0.2666</math>
 
 
 
The total probability that both actual and predicted will randomly agree on the negative class is <math>0.3333 \times 0.2666 = 0.0888</math></blockquote>
 
 
 
*'''Calculate <math>\rho_e</math>'''
 
 
 
<blockquote>The probability <math>\rho_e</math> is simply the sum of the two random-agreement probabilities calculated above:
 
 
 
<math>\rho_e = 0.4888 + 0.0888 = 0.5776</math></blockquote>
 
 
 
*'''Calculate kappa:'''
 
 
 
<blockquote>
 
<math>
K = 1 - \frac{1 - \rho_0}{1 - \rho_e} = 1 - \frac{1 - 0.8}{1 - 0.5776} = 0.5265
</math>
 
 
 
This indicates 'moderate agreement' according to the scale suggested by Landis and Koch (1977)
 
</blockquote>
 
</div>
 
</div>
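The worked example above can be checked with a short script. This is a minimal sketch (not part of the original notes) that recomputes Cohen's kappa directly from the confusion-matrix counts used in the example (72, 8, 16, and 24); exact arithmetic gives a kappa of about 0.526 (the prose arrives at a slightly different last digit because it rounds the intermediate probabilities).

```python
# Hypothetical confusion matrix from the worked example:
# rows = actual, columns = predicted, classes = [positive, negative]
matrix = [[72, 8],
          [16, 24]]

def cohens_kappa(m):
    """Compute Cohen's kappa for an NxN confusion matrix."""
    total = sum(sum(row) for row in m)
    # Observed agreement p0: proportion of items on the diagonal
    p0 = sum(m[i][i] for i in range(len(m))) / total
    # Expected agreement pe: for each class, (row total * column total) / total^2
    pe = sum(
        sum(m[i]) * sum(row[i] for row in m)
        for i in range(len(m))
    ) / total ** 2
    return (p0 - pe) / (1 - pe)

print(round(cohens_kappa(matrix), 4))  # → 0.5263
```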
 
|-
 
! style="vertical-align:top;" |<h5 style="text-align:left; vertical-align:top">The Receiver Operating Characteristic Curve</h5>
 
|
 
<div class="mw-collapsible mw-collapsed" data-expandtext="+/-" data-collapsetext="+/-">
 
The Receiver Operating Characteristic (ROC) Curve has its origins in radio transmission, but in this context it is a method to visually evaluate the performance of a classifier. It is a 2D plot with the false positive rate on the x-axis and the true positive rate on the y-axis.
 
 
 
 
 
'''There are 4 key points on a ROC curve:'''

* (0,0): the classifier never predicts the positive class
* (1,1): the classifier always predicts the positive class
* (0,1): a perfect classifier that never issues a false positive
* The line y = x: random classification (a coin toss); the standard baseline
 
 
 
 
 
'''Any classifier is:'''
 
 
 
* better the closer it is to the point (0,1)
 
* conservative if it is on the left-hand side of the graph
* liberal if it is on the upper right of the graph
 
<div class="mw-collapsible-content">
 
<br />
 
'''To create a ROC curve we do the following:'''
 
 
 
* Rank the predictions of the classifier by confidence in (or probability of) correct classification
* Order them (highest first)
* Plot each prediction's impact on the true positive rate and false positive rate
 
 
 
 
 
Classifiers are considered conservative if they make positive classifications only in the presence of strong evidence, so they make fewer false-positive errors, typically at the cost of a low true-positive rate.
 
 
 
Classifiers are considered liberal if they make positive classifications even on weak evidence, so they classify nearly all positives correctly, typically at the cost of a high false-positive rate.
 
 
 
 
 
Many real-world data sets are dominated by negative instances, so the left-hand side of the ROC curve is the more interesting region.
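The threshold-sweep procedure described above can be sketched in a few lines. This is an illustrative example, not from the original notes; the scores and labels are made up.

```python
# Made-up illustration data: 1 = positive class, scores = classifier confidence
labels = [1, 1, 0, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3]

# Rank predictions by confidence (highest first), then sweep the threshold:
# each prediction taken as positive moves the curve up (a TP) or right (an FP).
ranked = sorted(zip(scores, labels), reverse=True)
P = sum(labels)            # number of actual positives
N = len(labels) - P        # number of actual negatives

tp = fp = 0
roc = [(0.0, 0.0)]         # the curve starts at (FPR, TPR) = (0, 0)
for score, label in ranked:
    if label == 1:
        tp += 1
    else:
        fp += 1
    roc.append((fp / N, tp / P))   # (false positive rate, true positive rate)

print(roc[-1])             # the curve always ends at (1.0, 1.0)
```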
 
</div>
 
</div>
 
|
 
[[File:ROC_curve.png|center|350px]]
 
|-
 
!style="vertical-align:top;" |<h5 style="text-align:left; vertical-align:top">The Area Under the ROC Curve - AUC</h5>
 
|
 
<div class="mw-collapsible mw-collapsed" data-expandtext="+/-" data-collapsetext="+/-">
 
Although the ROC curve can provide a quick visual indication of the performance of a classifier, it can be difficult to interpret.
 
 
 
 
 
It is possible to reduce the curve to a meaningful number (a scalar) by computing the area under the curve.
 
 
 
 
 
AUC falls in the range [0,1], with 1 indicating a perfect classifier, 0.5 a classifier no better than a random choice and 0 a classifier that predicts everything incorrectly.
 
 
 
 
 
A convention for interpreting AUC is:
 
 
 
* 0.9 - 1.0 = A (outstanding)
 
* 0.8 - 0.9 = B (excellent / good)
 
* 0.7 - 0.8 = C (acceptable / fair)
 
* 0.6 - 0.7 = D (poor)
 
* 0.5 - 0.6 = F (no discrimination)
 
 
 
 
 
Note that ROC curves with similar AUCs may be shaped very differently, so the AUC can be misleading and shouldn't be computed without some qualitative examination of the ROC curve itself.
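Since the ROC curve is piecewise linear, the area under it can be computed with the trapezoidal rule. The sketch below is illustrative, not from the original notes; the (FPR, TPR) points are invented.

```python
# Made-up, sorted (FPR, TPR) points of a hypothetical ROC curve
roc_points = [(0.0, 0.0), (0.0, 0.5), (0.25, 0.75), (0.5, 1.0), (1.0, 1.0)]

def auc_trapezoid(points):
    """Area under a piecewise-linear curve given as sorted (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2  # area of one trapezoid
    return area

print(auc_trapezoid(roc_points))  # → 0.875
```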
 
<div class="mw-collapsible-content">
 
<br />
 
...
 
</div>
 
</div>
 
|
 
|}
 
 
 
 
 
<br />
 
 
 
===References===
 
Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977 Mar;33(1):159-174. DOI: 10.2307/2529310.
 
 
 
 
 
<br />
 
==[[Python for Data Science]]==
 
 
 
 
 
<br />
 
===[[NumPy and Pandas]]===
 
 
 
 
 
<br />
 
===[[Data Visualization with Python]]===
 
 
 
 
 
<br />
 
 
 
===[[Text Analytics in Python]]===
 
 
 
 
 
<br />
 
 
 
===[[Dash - Plotly]]===
 
 
 
 
 
<br />
 
===[[Scrapy]]===
 
 
 
 
 
<br />
 
 
 
==[[R]]==
 
 
 
 
 
<br />
 
===[[R tutorial]]===
 
 
 
 
 
<br />
 
 
 
==[[RapidMiner]]==
 
<br />
 
 
 
<br />
 
 
 
==Assessments==
 
<br />
 
* [[Media:Exploration_of_the_Darts_dataset_using_statistics.pdf]]
 
* [[Media:Exploration_of_the_Darts_datase_using_statistics.zip]]
 
 
 
 
 
<br />
 
===Diploma in Predictive Data Analytics assessment===
 
* Assessment brief: /home/adelo/1-system/1-disco_local/1-mis_archivos/.stockage/desktop-dis/it_cct/Diploma_in_Predictive_Data_Analytics/0-PredictiveAnalyticsProject.pdf
 
 
 
 
 
* Possible sources of data for the project
 
: https://moodle.cct.ie/mod/page/view.php?id=61395
 
 
 
 
 
* User Review Datasets
 
: https://kavita-ganesan.com/user-review-datasets/#.Xw-CWXVKhaQ
 
:: http://www.cs.cornell.edu/people/pabo/movie-review-data/
 
 
 
 
 
<br />
 
 
 
==Notas==
 
* There is an error on slide 41. MAE = ... (see the recording)
 
 
 
 
 
<br />
 
==References==
 
<ref name=":1">
 
{{Cite web
 
|title=A Gentle Introduction to Bayes Theorem for Machine Learning
 
|website=machinelearningmastery
 
|url=https://machinelearningmastery.com/bayes-theorem-for-machine-learning/
 
|url-status=live
 
|last=Brownlee
 
|first=Jason
 
|date=Oct 2019
 
|access-date=
 
}}
 
</ref>
 
<!--
 
 
 
 
 
-->
 
<ref name=":2">
 
{{Cite web
 
|title=Conditional Probability
 
|website=
 
|url=https://web.stanford.edu/class/cs109/reader/3%20Conditional.pdf
 
|url-status=live
 
|last=Chris Piech and Mehran Sahami
 
|first=
 
|date=Oct 2017
 
|access-date=
 
}}
 
</ref>
 
<!--
 
 
 
 
 
-->
 

Revision as of 15:49, 22 February 2023


