Machine Learning Complete Elite Course (projects with python included) | Beginner to Expert | CF

(Subtitles Auto Generated) Hello and welcome to machine learning
with Python in this course you’ll learn how machine learning is used in many key
fields and industries for example in the healthcare industry data scientists use
machine learning to predict whether a human cell that is believed to be at
risk of developing cancer is either benign or malignant as such machine
learning can play a key role in determining a person’s health and
welfare you’ll also learn about the value of decision trees and how building
a good decision tree from historical data helps doctors to prescribe the
proper medicine for each of their patients you’ll learn how bankers use
machine learning to make decisions on whether to approve loan applications,
and you will learn how to use machine learning to do bank customer
segmentation, where it is not usually easy to run analysis on such huge volumes of
data. In this course you’ll see how machine learning helps websites such as
YouTube Amazon or Netflix develop recommendations to their customers about
various products or services such as which movies they might be interested in
going to see or which books to buy there is so much that you can do with machine
learning here you’ll learn how to use popular Python libraries to build your
model for example given an automobile data set we can use the scikit-learn
library to estimate the co2 emission of cars using their engine size or
cylinders we could even predict what the co2 emissions will be for a car that
hasn’t even been produced yet and we’ll see how the telecommunications
industries can predict customer churn hello and welcome in this video I will
give you a high-level introduction to machine learning so let’s get started
this is a human cell sample extracted from a patient and this cell has
characteristics for example its clump thickness is 6 its uniformity of cell
size is 1 its marginal adhesion is 1 and so on one of the interesting questions
we can ask at this point is is this a benign or malignant cell in contrast
with a benign tumor a malignant tumor is a tumor that may invade its surrounding
tissue or spread around the body and diagnosing it early might be the key to
a patient’s survival one could easily presume that only a doctor with years of
experience could diagnose that tumor and say if the patient is developing
cancer or not right well imagine that you’ve obtained the dataset containing
characteristics of thousands of human cell samples extracted from patients who
were believed to be at risk of developing cancer analysis of the
original data showed that many of the characteristics differed significantly
between benign and malignant samples you can use the values of these cell
characteristics in samples from other patients to give an early indication of
whether a new sample might be benign or malignant you should clean your data
select a proper algorithm for building a prediction model and train your model to
understand patterns of benign or malignant cells within the data once the
model has been trained by going through data iteratively it can be used to
predict your new or unknown cell with rather high accuracy this is machine
learning it is the way that a machine learning model can do a doctor’s task or
at least help that doctor make the process faster now let me give a formal
definition of machine learning machine learning is the subfield of computer
science that gives computers the ability to learn without being explicitly
programmed let me explain what I mean when I say without being explicitly
programmed assume that you have a data set of images of animals such as cats
and dogs and you want to have software or an application that could recognize
and differentiate them the first thing that you have to do here is interpret
the images as a set of feature sets for example does the image show the animal’s
eyes if so what is their size does it have ears what about a tail how many
legs does it have wings prior to machine learning each image would be transformed
to a vector of features then traditionally we had to write down some
rules or methods in order to get computers to be intelligent and detect
the animals but it was a failure why well as you can guess it needed a lot of
rules highly dependent on the current data set and not generalized enough to
detect out-of-sample cases this is when machine learning entered the scene using
machine learning allows us to build a model that looks at all the feature sets
and their corresponding type of animals and it learns the pattern of each animal
it is a model built by machine learning algorithms to detect the animal type without explicitly
being programmed to do so in essence machine learning follows the same
process that a four-year-old child uses to learn understand and differentiate
animals so machine learning algorithms inspired by the human learning process
iteratively learn from data and allow computers to find hidden insights these
models help us in a variety of tasks such as object recognition summarization
recommendation and so on machine learning impacts society in a very
influential way here are some real life examples first how do you think Netflix
and Amazon recommend videos movies and TV shows to its users they use machine
learning to produce suggestions that you might enjoy this is similar to how your
friends might recommend a television show to you based on their knowledge of
the types of shows you like to watch how do you think banks make a decision when
approving a loan application they use machine learning to predict the
probability of default for each applicant and then approve or refuse the
loan application based on that probability telecommunication companies
use their customers demographic data to segment them or predict if they will
unsubscribe from their company the next month there are many other applications
of machine learning that we see every day in our daily life such as chatbots,
logging into our phones or even computer games using face recognition
each of these use different machine learning techniques and algorithms so
let’s quickly examine a few of the more popular techniques the regression
estimation technique is used for predicting a continuous value for
example predicting things like the price of a house based on its characteristics
or to estimate the co2 emission from a car’s engine a classification technique
is used for predicting the class or category of a case for example if a cell
is benign or malignant or whether or not a customer will churn clustering groups
of similar cases for example can find similar patients or it can be used for
customer segmentation in the banking field Association technique is used for
finding items or events that often co-occur
for example grocery items that are usually bought together by a particular
customer anomaly detection is used to discover abnormal and unusual cases for
example it is used for credit card fraud detection
sequence mining is used for predicting the next event, for instance the clickstream
in websites. Dimension reduction is used to reduce the size of data and
finally recommendation systems this associates people’s preferences with
others who have similar tastes and recommends new items to them such as
books or movies we will cover some of these techniques in the next videos by
this point I’m quite sure this question has crossed your mind what is the
difference between these buzzwords that we keep hearing these days such as
artificial intelligence or AI machine learning and deep learning well let me
explain what is different between them in brief AI tries to make computers
intelligent in order to mimic the cognitive functions of humans so
artificial intelligence is a general field with a broad scope including
computer vision language processing creativity and summarization machine
learning is the branch of AI that covers the statistical part of artificial
intelligence it teaches the computer to solve problems by looking at hundreds or
thousands of examples learning from them and then using that experience to solve
the same problem in new situations and deep learning is a very special field of
machine learning where computers can actually learn and make intelligent
decisions on their own deep learning involves a deeper level of automation in
comparison with most machine learning algorithms now that we’ve completed the
introduction to machine learning subsequent videos will focus on
reviewing two main components first you’ll be learning about the purpose of
machine learning and where it can be applied in the real world and second
you’ll get a general overview of machine learning topics such as supervised
versus unsupervised learning model evaluation and various machine learning
algorithms so now that you have a sense of what’s in store on this journey let’s
continue our exploration of machine learning. Hello and welcome in this video
we’ll talk about how to use Python for machine learning so let’s get started
Python is a popular and powerful general-purpose programming language
that recently emerged as the preferred language among data scientists you can
write your machine learning algorithms using Python and it works very well
however there are a lot of modules and libraries already implemented in Python
that could make your life much easier we try to introduce the Python packages
in this course and use it in the labs to give you better hands-on experience the
first package is numpy which is a math library to work with n dimensional
arrays in Python it enables you to do computation efficiently and effectively
it is better than regular Python because of its amazing capabilities for example
for working with arrays dictionaries functions data types and working with
images you need to know NumPy. SciPy is a collection of numerical algorithms and
domain-specific tool boxes including signal processing optimization
statistics and much more. SciPy is a good library for scientific
and high-performance computation matplotlib is a very popular plotting
package that provides 2d plotting as well as 3d plotting basic knowledge
about these three packages which are built on top of Python is a good asset
for data scientists who want to work with real-world problems if you’re not
familiar with these packages I recommend that you take the data analysis with
Python course first this course covers most of the useful topics in these
packages pandas library is a very high-level Python library that provides
high performance easy to use data structures it has many functions for
data importing manipulation and analysis in particular it offers data structures
and operations for manipulating numerical tables and time series.
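As a quick illustration of how these packages fit together, here is a minimal sketch; the column names and values are made up and just stand in for whatever data set you work with.

```python
import numpy as np
import pandas as pd

# NumPy: efficient math on n-dimensional arrays
engine_sizes = np.array([1.6, 2.0, 2.4, 3.0])
print(engine_sizes.mean())          # average engine size

# pandas: labeled, table-like data structures
df = pd.DataFrame({
    "ENGINESIZE": engine_sizes,
    "CO2EMISSIONS": [150, 180, 214, 250],   # made-up values for illustration
})
print(df.describe())                # quick summary statistics
```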
scikit-learn is a collection of algorithms and tools for machine
learning which is our focus here and which you’ll learn to use within this
course as we’ll be using scikit-learn quite a bit in the labs
let me explain more about it and show you why it is so popular among data
scientists scikit-learn is a free machine learning library for the Python
programming language it has most of the classification regression and clustering
algorithms and is designed to work with the Python numerical and scientific
libraries NumPy and SciPy. It also includes very good documentation. On top
of that implementing machine learning models with scikit-learn
is really easy with just a few lines of Python code most of the tasks
that need to be done in a machine learning pipeline are implemented
already in scikit-learn including pre-processing of data
feature selection feature extraction train test splitting defining the
algorithms fitting models tuning parameters prediction evaluation and
exporting the model. Let me show you an example of what scikit-learn
looks like when you use this library. You don’t have to understand the code for
now but just see how easily you can build a model with just a few lines of
code basically machine learning algorithms benefit from standardization
of the data set if there are some outliers or fields with different scales in
your data set you have to fix them the pre-processing package of scikit-learn
provides several common utility functions and transformer classes to
change raw feature vectors into a suitable form of vector for modeling you
have to split your data set into train and test sets to train your model and
then test the models accuracy separately scikit-learn can split arrays or
matrices into random train and test subsets for you in one line of code then
you can set up your algorithm for example you can build a classifier using
a support vector classification algorithm we call our estimator instance
CLF and initialize its parameters now you can train your model with the train set
by passing our training set to the fit method the CLF model learns to classify
unknown cases then we can use our test set to run predictions and the result
tells us what the class of each unknown value is also you can use the different
metrics to evaluate your model accuracy for example using a confusion matrix to
show the results and finally you save your model you may find all or some of
these machine learning terms confusing but don’t worry we’ll talk about all of
these topics in the following videos the most important point to remember is that
the entire process of a machine learning task can be done simply in a few lines
of code using scikit-learn.
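To make the narrated steps concrete, here is a minimal sketch of that workflow; the arrays X and y are placeholders for whatever feature matrix and labels your data set provides, and the parameter choices are only illustrative.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
import joblib

# placeholder data: six samples, two features, and binary labels
X = np.array([[2.0, 1.0], [1.0, 1.0], [6.0, 8.0],
              [7.0, 9.0], [2.0, 2.0], [8.0, 7.0]])
y = np.array([0, 0, 1, 1, 0, 1])

# standardize the raw feature vectors
X = preprocessing.StandardScaler().fit(X).transform(X)

# split the data into random train and test subsets in one line
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=4)

# build a support vector classifier, then train it on the train set
clf = SVC(kernel="rbf")
clf.fit(X_train, y_train)

# predict the unknown cases and evaluate with a confusion matrix
y_hat = clf.predict(X_test)
print(confusion_matrix(y_test, y_hat))

# finally, save the model for later use
joblib.dump(clf, "clf_model.joblib")
```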
Please notice that, though it is possible, it would not be that easy if you wanted to do all of this using the NumPy or SciPy
packages and of course it needs much more coding if you use pure Python
programming to implement all of these tasks hello and welcome in this video
we’ll introduce supervised algorithms versus unsupervised algorithms so let’s
get started an easy way to begin grasping the concept of supervised
learning is by looking directly at the words that make it up supervised means
to observe and direct the execution of a task project or activity obviously we
aren’t going to be supervising a person instead we’ll be supervising a machine
learning model that might be able to produce classification regions like we
see here so how do we supervise a machine learning model we do this by
teaching the model that is we load the model with knowledge so that we can have
it predict future instances but this leads to the next question which is how
exactly do we teach a model we teach the model by training it with some data from
a labeled data set. It’s important to note that the data is labeled. And what does a
labeled data set look like? Well, it could look something like this. This example is
taken from the cancer data set as you can see we have some historical data for
patients and we already know the class of each row let’s start by introducing
some components of this table the names up here which are called clump thickness
uniformity of cell size uniformity of cell shape marginal adhesion and so on
are called attributes the columns are called features which include the data
if you plot this data and look at a single data point on a plot
it’ll have all of these attributes that would make a row on this chart also
referred to as an observation looking directly at the value of the data you
can have two kinds the first is numerical when dealing with machine
learning the most commonly used data is numeric the second is categorical that
is it’s non numeric because it contains characters rather than numbers in this
case it’s categorical because this data set is made for classification there are
two types of supervised learning techniques they are classification and
regression. Classification is the process of predicting a discrete class label or
category regression is the process of predicting a continuous value as opposed
to predicting a categorical value in classification look at this data set it
is related to co2 emissions of different cars it includes engine size, cylinders,
fuel consumption and co2 emission of various models of automobiles given this
data set you can use regression to predict the co2 emission of a new car by
using other fields such as engine size or number of cylinders since we know the
meaning of supervised learning, what do you think unsupervised learning means?
Yes, unsupervised learning is exactly as it
sounds we do not supervise the model but we let the model work on its own to
discover information that may not be visible to the human eye it means the
unsupervised algorithm trains on the data set and draws conclusions on
unlabeled data generally speaking unsupervised learning has more difficult
algorithms than supervised learning since we know little to no information
about the data or the outcomes that are to be expected dimension reduction
density estimation Market Basket analysis and clustering are the most
widely used unsupervised machine learning techniques dimensionality
reduction and or feature selection play a large role in this by reducing
redundant features that make the classification easier market basket
analysis is a modeling technique based upon the theory that if you buy a
certain group of items you’re more likely to buy another group of items
density estimation is a very simple concept that is mostly used to explore
the data to find some structure within it and finally clustering clustering is
considered to be one of the most popular unsupervised machine learning techniques
used for grouping data points or objects that are somehow similar cluster
analysis has many applications in different domains whether it be a bank’s
desire to segment its customers based on certain characteristics or helping an
individual to organize and group his or her favorite types of music generally
speaking though clustering is used mostly for discovering structure
summarization, and anomaly detection.
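As a small illustration of the clustering idea described above, the sketch below groups made-up customer records with k-means from scikit-learn; the feature values and the choice of three clusters are purely illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

# made-up customer features: [age, annual income in $1000s]
customers = np.array([
    [25, 30], [27, 32], [24, 28],    # younger, lower income
    [45, 90], [48, 95], [50, 88],    # middle-aged, higher income
    [65, 40], [70, 42], [68, 38],    # older, moderate income
])

# no labels are given; the algorithm discovers the groups on its own
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
segments = kmeans.fit_predict(customers)
print(segments)   # cluster index assigned to each customer
```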
So to recap, the biggest difference between supervised and unsupervised learning is that
supervised learning deals with labeled data while unsupervised learning deals
with unlabeled data in supervised learning we have machine learning
algorithms for classification and regression in unsupervised learning we
have methods such as clustering in comparison to supervised learning
unsupervised learning has fewer models and fewer evaluation methods that can be
used to ensure that the outcome of the model is accurate as such unsupervised
learning creates a less controllable environment as the machine is creating
outcomes for us hello and welcome in this video we’ll be giving a brief
introduction to regression so let’s get started look at this data set it’s
related to co2 emissions from different cars it includes engine size number of
cylinders fuel consumption and co2 emission from various automobile models
the question is given this data set can we predict the co2 emission of a car
using other fields such as engine size or cylinders let’s assume we have some
historical data from different cars and assume that a car such as in row 9 has
not been manufactured yet but we’re interested in estimating its approximate
co2 emission after production is it possible we can use regression methods
to predict a continuous value such as co2 emission using some other variables
indeed regression is the process of predicting a continuous value in
regression there are two types of variables a dependent variable and one
or more independent variables the dependent variable can be seen as the
state target or final goal we study and try to predict and the independent
variables also known as explanatory variables can be seen as the causes of
those states. The independent variables are shown conventionally by X and the
dependent variable is notated by Y our regression model relates Y or the
dependent variable to a function of X ie the independent variable
the key point in the regression is that our dependent value should be continuous
and cannot be a discrete value however the independent variable or variables
can be measured on either a categorical or continuous measurement scale so what
we want to do here is to use the historical data of some cars using one
or more of their features and from that data make a model we use regression to
build such a regression estimation model then the model is used to predict the
expected co2 emission for a new or unknown car basically there are two
types of regression models simple regression and multiple regression
simple regression is when one independent variable is used to estimate
a dependent variable it can be either linear or non-linear
for example predicting co2 emission using the variable of engine size
the linearity of regression is based on the nature of the relationship between
independent and dependent variables when more than one independent variable is
present the process is called multiple linear regression for example predicting
co2 emission using engine size and the number of cylinders in any given car
again depending on the relation between dependent and independent variables it
can be either linear or non-linear regression let’s examine some sample
applications of regression essentially we use regression when we want to
estimate a continuous value for instance one of the applications of regression
analysis could be in the area of sales forecasting you can try to predict a
salesperson’s total yearly sales from independent variables such as age
education and years of experience it can also be used in the field of psychology
for example to determine individual satisfaction based on demographic and
psychological factors we can use regression analysis to predict the price
of a house in an area based on its size number of bedrooms and so on we can even
use it to predict employment income for independent variables such as hours of
work education occupation sex age years of experience and so on indeed you can
find many examples of the usefulness of regression analysis
in these and many other fields or domains such as finance healthcare
retail and more we have many regression algorithms each
of them has its own importance and a specific condition to which their
application is best suited and while we’ve covered just a few of them in this
course it gives you enough base knowledge for you to explore different
regression techniques hello and welcome in this video we’ll be covering linear
regression you don’t need to know any linear algebra to understand topics in
linear regression this high-level introduction will give you enough
background information on linear regression to be able to use it
effectively on your own problems so let’s get started let’s take a look at
this data set it’s related to the co2 emission of different cars it includes
engine size, cylinders, fuel consumption and co2 emissions for various car models
the question is given this data set can we predict the co2 emission of a car
using another field such as engine size quite simply yes we can use linear
regression to predict a continuous value such as co2 emission by using other
variables linear regression is the approximation of a linear model used to
describe the relationship between two or more variables in simple linear
regression there are two variables a dependent variable and an independent
variable the key point in the linear regression is that our dependent value
should be continuous and cannot be a discrete value however the independent
variables can be measured on either a categorical or continuous measurement
scale there are two types of linear regression models they are simple
regression and multiple regression simple linear regression is when one
independent variable is used to estimate a dependent variable for example
predicting co2 emission using the engine size variable when more than one
independent variable is present the process is called multiple linear
regression for example predicting co2 emission using engine size and cylinders
of cars our focus in this video is on simple linear regression now let’s see
how linear regression works okay so let’s look at our data set again to
understand linear regression we can plot our variables here we show engine size
as an independent variable and emission as the target value that
we would like to predict a scatterplot clearly shows the relation between
variables where changes in one variable explained or possibly caused changes in
the other variable also it indicates that these variables are linearly
related with linear regression you can fit a line through the data for instance
as the engine size increases so do the emissions with linear regression you can
model the relationship of these variables a good model can be used to
predict what the approximate emission of each car is how do we use this line for
prediction now let us assume for a moment that the line is a good fit of
the data we can use it to predict the emission of an unknown car for example
for a sample car with engine size 2.4 you can find the emission is 214 now
let’s talk about what the fitting line actually is we’re going to predict the
target value Y in our case using the independent variable engine size
represented by X 1. The fit line is shown traditionally as a polynomial. In a
simple regression problem (a single X), the form of the model would be y hat equals theta 0 plus
theta 1 X 1. In this equation, y hat is the dependent variable or the predicted
value and X 1 is the independent variable theta 0 and theta 1 are the
parameters on the line that we must adjust theta 1 is known as the slope or
gradient of the fitting line and theta 0 is known as the intercept theta 0 and
theta 1 are also called the coefficients of the linear equation you can interpret
this equation as y hat being a function of X 1 or Y hat being dependent of X 1
now the questions are how would you draw a line through the points and how do you
determine which line fits best linear regression estimates the coefficients of
the line this means we must calculate theta 0 and theta 1 to find the best
line to fit the data this line would best estimate the emission of the
unknown data points let’s see how we can find this line or to be more precise how
we can adjust the parameters to make the line the best fit for the data for a
moment let’s assume we’ve already found the best fit line for
now let’s go through all the points and check how well they align with this line
best fit here means that if we have for instance a car with engine size x1
equals 5.4 and actual co2 equals 250 its co2 should be predicted very close to
the actual value which is y equals 250 based on historical data but if we use
the fit line or better to say using our polynomial with known parameters to
predict the co2 emission it will return y hat equals 340 now if you compare the
actual value of the emission of the car with what we’ve predicted using our
model you will find out that we have a ninety unit error this means our
prediction line is not accurate this error is also called the residual error
so we can say the error is the distance from the data point to the fitted
regression line the mean of all residual errors shows how poorly the line fits
with the whole data set. Mathematically, it can be shown by the mean squared error equation, shown as MSE.
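For reference, the mean squared error takes the standard form below, where $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, and $n$ is the number of data points:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)^2$$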
Our objective is to find a line where the mean of all these errors is minimized. In other words, the mean error of the prediction using
the fit line should be minimized let’s reword it more technically the
objective of linear regression is to minimize this MSE equation and to
minimize it we should find the best parameters theta0 and theta1 now the
question is how to find theta 0 and theta 1 in such a way that it minimizes
this error how can we find such a perfect line or set another way how
should we find the best parameters for our line should we move the line a lot
randomly and calculate the MSE value every time and choose the minimum one
not really actually we have two options here option one we can use a mathematic
approach or option two we can use an optimization approach let’s see how we
can easily use a mathematic formula to find the theta zero and theta one as
mentioned before theta 0 and theta 1 in the simple linear regression are the
coefficients of the fit line we can use a simple equation to estimate these
coefficients that is given that it’s a simple linear regression with only two
parameters, and knowing that theta 0 and theta 1 are the
intercept and slope of the line we can estimate them directly from our data it
requires that we calculate the mean of the independent and dependent or target
columns from the data set notice that all of the data must be available to
traverse and calculate the parameters it can be shown that the intercept and
slope can be calculated using the equations shown below.
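For reference, the standard least-squares estimates referred to here are, with $\bar{x}$ and $\bar{y}$ denoting the means of the independent and dependent columns:

$$\theta_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad \theta_0 = \bar{y} - \theta_1\,\bar{x}$$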
We can start off by estimating the value for theta 1. This is how you can find the slope of a line based on
the data X bar is the average value for the engine size in our data set please
consider that we have nine rows here row 0 to 8 first we calculate the average of
x1 and average of Y then we plug it into the slope equation to find theta 1 the X
I and y i in the equation refer to the fact that we need to repeat these
calculations across all values in our data set and i refers to the i-th value
of X or Y applying all values we find theta 1 equals 39 it is our second
parameter it is used to calculate the first parameter which is the intercept
of the line now we can plug theta 1 into the line equation to find theta 0 it is
easily calculated that theta 0 equals 125.74 so these are the
two parameters for the line where theta 0 is also called the bias coefficient
and theta 1 is the coefficient for the engine size column as a side note you
really don’t need to remember the formula for calculating these parameters
as most of the libraries used for machine learning in Python, R, and Scala
can easily find these parameters for you but it’s always good to understand how
it works now we can write down the polynomial of the line so we know how to
find the best fit for our data and it’s equation now the question is how can we
use it to predict the emission of a new car based on its engine size after we
found the parameters of the linear equation making predictions is as simple
as solving the equation for a specific set of inputs imagine we are predicting
co2-emission or Y from engine size or X for the automobile in record number 9
our linear regression model representation for this problem would be
y hat equals theta 0 plus theta 1 x1 or if we map it to our data set it would be
co2-emission equals theta 0 plus theta 1 engine size as we saw we can find theta
0 theta 1 using the equations that we just talked about once found we can plug
in the equation of the linear model for example let’s use theta 0 equals 125 and
theta 1 equals 39 so we can rewrite the linear model as co2 emission equals 125
plus 39 engine size now let’s plug in the ninth row of our data set and
calculate the co2 emission for a car with an engine size of 2.4 so co2
emission equals 125 plus 39 times 2.4 therefore we can predict that the co2
emission for this specific car would be 218.6.
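To tie the formulas and the worked example together, here is a minimal sketch that estimates theta 0 and theta 1 from data and then predicts a new value; the engine size and emission arrays are made-up stand-ins for the course’s data set, so the fitted coefficients will not exactly match the 39 and 125.74 quoted above.

```python
import numpy as np

# made-up training data: engine size (x) and co2 emission (y)
x = np.array([2.0, 2.4, 1.5, 3.5, 3.6, 3.5, 3.5, 3.7, 3.7])
y = np.array([196, 221, 136, 255, 244, 230, 232, 255, 267])

# least-squares estimates of the slope (theta 1) and intercept (theta 0)
x_bar, y_bar = x.mean(), y.mean()
theta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
theta0 = y_bar - theta1 * x_bar
print(theta0, theta1)

# predict the emission of a car with engine size 2.4
engine_size = 2.4
co2_hat = theta0 + theta1 * engine_size
print(co2_hat)
```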
Let’s talk a bit about why linear regression is so useful. Quite simply, it is the most basic regression
to use and understand in fact one reason why the linear regression is so useful
is that it’s fast it also doesn’t require tuning of
parameters so something like tuning the K parameter and K nearest neighbors or
the learning rate in neural networks isn’t something to worry about linear
regression is also easy to understand and highly interpretable. Hello and
welcome in this video we’ll be covering model evaluation so let’s get started
the goal of regression is to build a model to accurately predict an unknown
case to this end we have to perform regression evaluation after building the
model in this video we’ll introduce and discuss two types of evaluation
approaches that can be used to achieve this goal these approaches are train and
test on the same data set, and train/test split. We’ll talk about what
each of these are as well as the pros and cons of using each of these models
also we’ll introduce some metrics for accuracy of regression models let’s look
at the first approach when considering evaluation models we clearly want to
choose the one that will give us the most accurate results so the question is
how can we calculate the accuracy of our model in other words how much can we
trust this model for prediction of an unknown sample using a given data set
and having built a model such as linear regression one of the solutions is to
select a portion of our data set for testing for instance assume that we have
10 records in our data set we use the entire data set for training and we
build a model using this training set now we select a small portion of the
data set such as row number six to nine but without the labels this set is
called a test set which has the labels but the labels are not used for
prediction and are used only as ground truth. The labels are called the actual
values of the test set now we pass the feature set of the testing portion to
our built model and predict the target values finally we compare the predicted
values by our model with the actual values in the test set this indicates
how accurate our model actually is there are different metrics to report the
accuracy of the model but most of them work generally based on the similarity
of the predicted and actual values let’s look at one of the simplest metrics to
calculate the accuracy of our regression model as mentioned we just compare the
actual values Y with the predicted values which is noted as y hat for the
testing set the error of the model is calculated as the average difference
between the predicted and actual values for all the rows we can write this error
as an equation so the first evaluation approach we just talked about is the
simplest one train and test on the same data set essentially the name of this
approach says it all you train the model on the entire data set then you test it
using a portion of the same data set in a general sense when you test with a
data set in which you know the target value for each data point you’re able to
obtain a percentage of accurate predictions for the model this
evaluation approach would most likely have a high training
accuracy and a low out-of-sample accuracy since the model knows all of
the testing data points from the training what is training accuracy and
out-of-sample accuracy we said that training and testing on the same data
set produces a high training accuracy but what exactly is training accuracy
training accuracy is the percentage of correct predictions that the model makes
when using the test data set however a high training accuracy isn’t
necessarily a good thing for instance having a high training accuracy may
result in an overfit of the data this means that the model is overly trained
to the data set which may capture noise and produce a non generalized model
out-of-sample accuracy is the percentage of correct predictions that the model
makes on data that the model has not been trained on doing a training test on
the same data set will most likely have low out-of-sample accuracy due to the
likelihood of being overfit it’s important that our models have high
out-of-sample accuracy because the purpose of our model is of course to
make correct predictions on unknown data so how can we improve out-of-sample
accuracy one way is to use another evaluation approach called train test
split in this approach we select a portion of our data set for training for
example row 0 to 5 and the rest is used for testing for example row 6 to 9 the
model is built on the training set then the test feature set is passed to the
model for prediction and finally the predicted values for the test set are
compared with the actual values of the testing set the second evaluation
approach is called train test split train test split involves splitting the
data set into training and testing sets respectively which are mutually
exclusive after which you train with the training
set and test with the testing set this will provide a more accurate evaluation
on out-of-sample accuracy because the testing data set is not part of the data
set that has been used to train the model it is more realistic for real-world
problems this means that we know the outcome of each data point in the data
set making it great to test with and since this data has not been used to
train the model, the model has no knowledge of the outcome of these
data points so in essence it’s truly out-of-sample testing however please
ensure that you train your model with the testing set afterwards as you don’t
want to lose potentially valuable data the issue with train/test split is that
it’s highly dependent on the data sets on which the data was trained and tested
the variation of this causes train/test split to have a better out-of-sample
prediction than training and testing on the same data set but it still has some
problems due to this dependency another evaluation model called k-fold
cross-validation resolves most of these issues how do you fix a high variation
that results from a dependency well you use averaging let me explain the basic
concept of k-fold cross-validation to see how we can solve this problem the
entire dataset is represented by the points in the image at the top left if
we have K equals four folds then we split up this dataset as shown here in
the first fold for example we use the first 25% of the data set for testing
and the rest for training the model is built using the training set and is
evaluated using the test set then in the next round or in the second fold the
second 25% of the data set is used for testing and the rest for training the
model again the accuracy of the model is calculated we continue for all folds
finally the result of all four evaluations is averaged, that is the
accuracy of each fold is then averaged, keeping in mind that each fold is
distinct where no training data in one fold is used in another k-fold
cross-validation in its simplest form performs multiple train/test splits
using the same data set where each split is different then the result is averaged
to produce a more consistent out-of-sample accuracy we wanted to show
you an evaluation model that addresses some of the issues we’ve described in
the previous approaches; however, going in depth with the k-fold cross-validation model is out of the scope of this course.
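As a rough sketch of the two evaluation approaches just described, the snippet below does a single train/test split and then a 4-fold cross-validation on the same regression model; X and y are synthetic placeholders for your feature matrix and target column.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score

# placeholder data: one feature (e.g. engine size) and a target (e.g. co2 emission)
rng = np.random.default_rng(0)
X = rng.uniform(1.0, 4.0, size=(40, 1))
y = 125 + 39 * X[:, 0] + rng.normal(0, 10, size=40)

# train/test split: train on one portion, test on the held-out portion
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)
model = LinearRegression().fit(X_train, y_train)
print("out-of-sample R^2:", model.score(X_test, y_test))

# k-fold cross-validation (k=4): four different splits, results averaged
scores = cross_val_score(LinearRegression(), X, y, cv=4)
print("average R^2 across folds:", scores.mean())
```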
Hello and welcome. In this video we’ll be covering accuracy metrics for model evaluation, so let’s get started.
evaluation metrics are used to explain the performance of a model let’s talk
more about the model evaluation metrics that are used for regression
as mentioned basically we can compare the actual values and predicted values
to calculate the accuracy of a regression model evaluation metrics
play a key role in the development of a model, as they provide insight into areas
that require improvement we’ll be reviewing a number of model evaluation
metrics including mean absolute error mean squared error and root mean squared
error but before we get into defining these we need to define what an error
actually is in the context of regression the error of the model is the difference
between the data points and the trendline generated by the algorithm
since there are multiple data points, an error can be determined in multiple ways.
Mean absolute error is the mean of the absolute value of the errors this is the
easiest of the metrics to understand since it’s just the average error mean
squared error is the mean of the squared error it’s more popular than mean
absolute error because the focus is geared more towards large errors this is
due to the squared term exponentially increasing larger errors in comparison
to smaller ones root mean squared error is the square root of the mean squared
error this is one of the most popular of the evaluation metrics because root mean
squared error is interpretable in the same units as the response vector or Y
units making it easy to relate its information relative absolute error also
known as residual sum of squares, where Y bar is the mean value of Y, takes the total
absolute error and normalizes it by dividing by the total absolute error of
the simple predictor relative squared error is very similar to relative
absolute error but is widely adopted by the data science community as it is used
for calculating R squared R squared is not an error per se but is a popular
metric for the accuracy of your model it represents how close the data values are
to the fitted regression line the higher the r-squared the better the
model fits your data. Each of these metrics can be used for quantifying the accuracy of your predictions. The choice of metric completely depends on the type of model, your data type, and your domain of knowledge. Unfortunately, further review is out of the scope of this course.
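For reference, here is a small sketch of how these metrics can be computed for a set of actual and predicted values; the two arrays are made-up numbers used only to show the calculations.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# made-up actual and predicted values for a regression test set
y_true = np.array([196, 221, 136, 255, 244])
y_pred = np.array([205, 214, 150, 248, 236])

mae = mean_absolute_error(y_true, y_pred)        # mean absolute error
mse = mean_squared_error(y_true, y_pred)         # mean squared error
rmse = np.sqrt(mse)                              # root mean squared error
r2 = r2_score(y_true, y_pred)                    # R-squared

print(mae, mse, rmse, r2)
```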
Hello and welcome. In this video we’ll be covering multiple linear regression. As you know, there are two types of linear
regression models simple regression and multiple regression simple linear
regression is when one independent variable is used to estimate a dependent
variable for example predicting co2-emission using the variable of
engine size in reality there are multiple variables that predict the
co2-emission when multiple independent variables are present the process is
called multiple linear regression for example predicting co2-emission using
engine size and the number of cylinders in the car’s engine our focus in this
video is on multiple linear regression the good thing is that multiple linear
regression is the extension of the simple linear regression model so I
suggest you go through the simple linear regression video first if you haven’t
watched it already before we dive into a sample data set and see how multiple
linear regression works I want to tell you what kind of problems it can solve
when we should use it and specifically what kind of questions we can answer
using it basically there are two applications for multiple linear
regression first it can be used when we would like to identify the strength of
the effect that the independent variables have on a dependent variable
for example do revision time, test anxiety, lecture attendance, and gender
have any effect on exam performance of students second it can be used to
predict the impact of changes that is to understand how the dependent variable
changes when we change the independent variables for example if we were
reviewing a person’s health data a multiple linear regression can tell you
how much that person’s blood pressure goes up or down for every unit increase
or decrease in a patient’s body mass index holding other factors constant as
is the case with simple linear regression multiple linear regression is
a method of predicting a continuous variable it uses multiple variables
called independent variables or predictors that best predict the value
of the target variable which is also called the dependent variable in
multiple linear regression the target value Y is a linear combination of
independent variables for example you can predict how much co2
a car might emit due to independent variables such as the car’s engine size
number of cylinders and fuel consumption multiple linear regression is very
useful because you can examine which variables are significant predictors of
the outcome variable also you can find out how each feature impacts the outcome
variable and again as is the case in simple linear regression if you manage
to build such a regression model you can use it to predict the emission amount of
an unknown case such as record number nine
generally the model is of the form y hat equals theta 0 plus theta 1 x1 plus
theta 2 x2 and so on up to theta n X n mathematically we can show it as a
vector form as well this means it can be shown as a dot product of two vectors
the parameters vector and the feature set vector generally we can show the
equation for a multi-dimensional space as theta transpose X where theta is an N
by 1 vector of unknown parameters in a multi-dimensional space and X is the
vector of the feature sets. As theta is a vector of coefficients and is supposed
to be multiplied by X, conventionally it is shown as theta transpose. Theta is
also called the parameters or weight vector of the regression equation both
these terms can be used interchangeably and X is the feature set which
represents a car for example x1 for engine size or x2 for cylinders and so
on the first element of the feature set would be set to 1 because it turns the
theta 0 into the intercept or bias parameter when the vector is multiplied
by the parameter vector please notice that theta transpose X in a one
dimensional space is the equation of a line it is what we use in simple linear
regression in higher dimensions when we have more than one input or X the line
is called a plane or a hyperplane and this is what we use for multiple linear
regression so the whole idea is to find the best fit hyperplane for our data to
this end and as is the case in linear regression we should estimate the values
for theta vector that best predict the value of the target field in each row
to achieve this goal we have to minimize the error of the prediction now the
question is how to find the optimized parameters. To find the optimized
parameters for our model we should first understand what the optimized parameters
are then we will find a way to optimize the parameters in short optimize
parameters are the ones which lead to a model with the fewest errors let’s
assume for a moment that we have already found the parameter vector of our model
it means we already know the values of theta vector now we can use the model
and the feature set of the first row of our data set to predict the co2 emission
for the first car correct if we plug the feature set values into the model
equation we find y hat let’s say for example it returns 140 as the predicted
value for this specific row what is the actual value y equals 196 how different
is the predicted value from the actual value of 196 well we can calculate it
quite simply as 196 subtract 140 which of course equals 56 this is the error of
our model only for one row or one car in our case as is the case in linear
regression we can say the error here is the distance from the data point to the
fitted regression model the mean of all residual errors shows how poorly the model
represents the data set it is called the mean squared error or MSE
mathematically MSE can be shown by an equation while this is not the only way
to expose the error of a multiple linear regression model it is one of the most
popular ways to do so the best model for our dataset is the one with minimum
error for all prediction values so the objective of multiple linear regression
is to minimize the MSE equation to minimize it we should find the best
parameters theta but how okay how do we find the
parameter or coefficients for multiple linear regression there are many ways to
estimate the value of these coefficients however the most common methods are the
ordinary least squares and optimization approach ordinary least squares tries to
estimate the values of the coefficients by minimizing the mean square error this
approach uses the data as a matrix and uses linear algebra
operations to estimate the optimal values for the theta the problem with
this technique is the time complexity of calculating matrix operations as it can
take a very long time to finish when the number of rows in your data set is less
than 10,000 you can think of this technique as an option however for
greater values you should try other faster approaches the second option is
to use an optimization algorithm to find the best parameters that is you can use
a process of optimizing the values of the coefficients by iteratively
minimizing the error of the model on your training data for example you can
use gradient descent which starts optimization with random values for each
coefficient, then calculates the errors and tries to minimize them by wisely
changing the coefficients over multiple iterations. Gradient descent is a proper
approach if you have a large data set please understand however that there are
other approaches to estimate the parameters of the multiple linear
regression that you can explore on your own after you find the best parameters
for your model you can go to the prediction phase after we found the
parameters of the linear equation making predictions is as simple as solving the
equation for a specific set of inputs imagine we are predicting co2-emission
or y from other variables for the automobile in record number 9 our linear
regression model representation for this problem would be y hat equals theta
transpose x once we find the parameters we can plug them into the equation of
the linear model for example let’s use theta 0 equals 125, theta 1 equals 6.2,
theta 2 equals 14, and so on. If we map it to our data set, we can rewrite
the linear model as co2 emission equals 125 plus 6.2 multiplied by engine
size plus 14 multiplied by cylinder and so on as you can see multiple linear
regression estimates the relative importance of predictors for example it
shows cylinder has higher impact on co2 emission amounts in comparison with
engine size now let’s plug in the ninth row of our
data set and calculate the co2 emission for a car with the engine size of 2.4 so
co2 emission equals 125 plus 6.2 times 2.4 plus 14 times 4, and so
on. We can predict that the co2 emission for this specific car would be 214.1.
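Here is a minimal sketch of fitting a multiple linear regression with scikit-learn; the feature values are made-up stand-ins for the engine size and cylinder columns discussed above, so the learned coefficients will only roughly resemble the 6.2 and 14 used in the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# made-up training data: columns are [engine size, cylinders]
X = np.array([[2.0, 4], [2.4, 4], [1.5, 4], [3.5, 6],
              [3.6, 6], [3.5, 6], [3.7, 6], [3.0, 8]])
y = np.array([196, 221, 136, 255, 244, 230, 255, 280])   # co2 emission

model = LinearRegression()
model.fit(X, y)
print(model.intercept_, model.coef_)   # theta 0 and [theta 1, theta 2]

# predict the emission for a car with engine size 2.4 and 4 cylinders
print(model.predict([[2.4, 4]]))
```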
now let me address some concerns that you might already be having regarding
multiple linear regression as you saw you can use multiple independent
variables to predict a target value in multiple linear regression it sometimes
results in a better model compared to using a simple linear regression which
uses only one independent variable to predict the dependent variable now the
question is how many independent variables should we use for the
prediction should we use all the fields in our data set does adding independent
variables to a multiple linear regression model always increase the
accuracy of the model basically adding too many independent variables without
any theoretical justification may result in an overfit model an overfit model is
a real problem because it is too complicated for your data set and not
general enough to be used for a prediction so it is recommended to avoid
using many variables for a prediction there are different ways to avoid
overfitting a model in regression however that is outside the scope of
this video the next question is should independent
variables be continuous basically categorical independent variables can be
incorporated into a regression model by converting them into numerical variables
for example given a binary variable such as car type we can code a dummy 0 for manual
and 1 for automatic cars as a last point remember that multiple linear regression
is a specific type of linear regression so there needs to be a linear
relationship between the dependent variable and each of your independent
variables there are a number of ways to check for linear relationship for
example you can use scatter plots and then visually check for linearity if the
relationship displayed in your scatter plot is not linear then you need to use
nonlinear regression hello and welcome in this video we’ll be
covering nonlinear regression basics so let’s get started these data points
correspond to China’s gross domestic product or GDP from 1960 to 2014 the
first column is the years and the second is China’s corresponding annual gross
domestic income in u.s. dollars for that year this is what the data points look
like now we have a couple of interesting questions first can gdp be predicted
based on time and second can we use a simple linear regression to model it
indeed if the data shows a curvy trend then linear regression will not produce
very accurate results when compared to a nonlinear regression simply because as
the name implies linear regression presumes that the data is linear the
scatterplot shows that there seems to be a strong relationship between GDP and
time but the relationship is not linear as you can see the growth starts off
slowly then from 2005 onward the growth is very significant and finally it
decelerates slightly in the 2010s it kind of looks like either a logistic or
exponential function so it requires a special estimation method of the
nonlinear regression procedure for example if we assume that the model for
these data points is an exponential function such as y hat equals theta 0
plus theta 1 times theta 2 to the power of x, our job is to estimate
the parameters of the model, i.e. the thetas, and use the fitted model to predict GDP
for unknown or future cases in fact many different regressions exist that can be
used to fit whatever the data set looks like you can see a quadratic and cubic
regression lines here and it can go on and on to infinite degrees in essence we
can call all of these polynomial regression where the relationship
between the independent variable X and the dependent variable Y is modeled as
an nth degree polynomial in X with many types of regression to choose from
there’s a good chance that one will fit your data set well
remember it’s important to pick a regression that fits the data the best
so what is polynomial regression polynomial regression fits a curved line
to your data a simple example of polynomial with degree three is shown as
y hat equals theta zero plus theta 1 X plus theta two x squared plus theta
three x cubed, or x to the power of three, where the thetas are parameters to be
estimated that makes the model fit perfectly to the underlying data though
the relationship between x and y is nonlinear here and polynomial regression
can fit them a polynomial regression model can still be expressed as linear
regression I know it’s a bit confusing but let’s look at an example given the
third degree polynomial equation by defining x1 equals x and x2 equals x
squared or X to the power of two and so on
the model is converted to a simple linear regression with new variables as
Y hat equals theta 0 plus theta 1 x1 plus theta 2 x2 plus theta 3 x3 this
model is linear in the parameters to be estimated right
therefore this polynomial regression is considered to be a special case of
traditional multiple linear regression so you can use the same mechanism as
linear regression to solve such a problem
therefore polynomial regression models can be fit using the method of least squares
least squares is a method for estimating the unknown parameters in a linear
regression model by minimizing the sum of the squares of the differences
between the observed dependent variable in the given data set and those
predicted by the linear function.
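The trick of turning a polynomial into a linear regression over new variables is exactly what scikit-learn’s PolynomialFeatures does; the sketch below, with made-up x and y values, shows one way it might look in practice.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# made-up data with a curved (roughly cubic) trend
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0]).reshape(-1, 1)
y = np.array([2.1, 9.5, 28.0, 66.0, 126.0, 217.0])

# expand x into [x, x^2, x^3]: the new columns play the role of x1, x2, x3
poly = PolynomialFeatures(degree=3, include_bias=False)
x_poly = poly.fit_transform(x)

# now an ordinary (multiple) linear regression can fit the curve
model = LinearRegression().fit(x_poly, y)
print(model.coef_)                               # estimated thetas for x, x^2, x^3
print(model.predict(poly.transform([[7.0]])))    # predict for a new x
```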
So what is nonlinear regression exactly? First, nonlinear regression is a method to model a nonlinear relationship between
the dependent variable and a set of independent variables second for a model
to be considered non linear Y hat must be a nonlinear function of the
parameters theta, not necessarily the features X. When it comes to a nonlinear
equation, it can take the shape of exponential,
logarithmic and logistic or many other types as you can see in all of these
equations the change of Y hat depends on changes in the parameters theta not
necessarily on X only. That is, in nonlinear regression, a model is
nonlinear in its parameters. In contrast to linear regression, we cannot use the
ordinary least-squares method to fit the data in nonlinear regression and in
general estimation of the parameters is not easy let me answer two important
questions here first how can I know if a problem is linear or non-linear in an
easy way? To answer this question we have to do two things the first is to
visually figure out if the relation is linear or non-linear it’s best to plot
bivariate plots of output variables with each input variable also you can
calculate the correlation coefficient between independent and dependent
variables and if for all variables it is 0.7 or higher there is a linear tendency
and thus it’s not appropriate to fit a nonlinear regression the second thing we
have to do is to use nonlinear regression instead of linear regression
when we cannot accurately model the relationship with linear parameters the
second important question is how should I model my data if it displays a nonlinear
relationship on a scatter plot? Well, to address this, you have to use either polynomial
regression, a nonlinear regression model, or transform your data, which is
not in scope for this course.
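Since ordinary least squares no longer applies, nonlinear parameters are typically estimated numerically; one common option is SciPy’s curve_fit, sketched below on synthetic data generated from the exponential form y hat equals theta 0 plus theta 1 times theta 2 to the power of x mentioned earlier (the data and starting guesses are made up for illustration).

```python
import numpy as np
from scipy.optimize import curve_fit

# the nonlinear model: y_hat = theta0 + theta1 * theta2 ** x
def exp_model(x, theta0, theta1, theta2):
    return theta0 + theta1 * np.power(theta2, x)

# synthetic data roughly following that exponential shape
x = np.arange(0, 10, dtype=float)
y = 2.0 + 0.5 * np.power(1.6, x) + np.random.normal(0, 0.3, size=x.size)

# estimate the thetas numerically, starting from an initial guess p0
params, _ = curve_fit(exp_model, x, y, p0=[1.0, 1.0, 1.2])
print(params)                        # fitted theta0, theta1, theta2
print(exp_model(11.0, *params))      # predict for a future x
```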
Hello and welcome. In this video we’ll learn about a machine learning method called logistic regression, which is used for
classification. In examining this method, we’ll specifically answer these three
questions what is logistic regression what kind of problems can be solved by
logistic regression and in which situations do we use logistic regression
so let’s get started logistic regression is a statistical and
machine learning technique for classifying records of a data set based
on the values of the input fields let’s say we have a telecommunication data set
that we’d like to analyze in order to understand which customers might leave
us next month this is historical customer data where each row represents
one customer imagine that you’re an analyst at this company and you have to
find out who is leaving and why you’ll use the data set to build a model based
on historical records and use it to predict the future churn within the
customer group the data set includes information about services that each
customer has signed up for customer account information demographic
information about customers like gender and age range and also customers who’ve
left the company within the last month the column is called churn we can use
logistic regression to build a model for predicting customer churn using the
given features in logistic regression we use one or more independent variables
such as tenure age and income to predict an outcome such as churn which we call a
dependent variable representing whether or not customers will stop using the
service logistic regression is analogous to linear regression but tries to
predict a categorical or discrete target field instead of a numeric one in linear
regression we might try to predict a continuous value of variables such as
the price of a house blood pressure of the patient or fuel consumption of a car
but in logistic regression we predict a variable which is binary such as yes/no
true/false successful or not successful pregnant not pregnant and so on all of
which can be coded as 0 or 1 in logistic regression independent variables should be
continuous if categorical they should be dummy or indicator coded this means we
have to transform them to some continuous value please note that
logistic regression can be used for both binary classification and multi-class
classification but for simplicity in this video we’ll focus on binary
classification let’s examine some applications of logistic regression
before we explain how they work as mentioned logistic regression is a type
of classification algorithm so it can be used in different situations for example
to predict the probability of a person having a heart attack within a specified
time period based on our knowledge of the person’s
age sex and body mass index or to predict the chance of mortality in an
injured patient or to predict whether a patient has a given disease such as
diabetes based on observed characteristics of that patient such as
weight height blood pressure and results of various blood tests and so on in a
marketing context we can use it to predict the likelihood of a customer
purchasing a product or halting a subscription as we've done in our churn
example we can also use logistic regression to predict the probability of
failure of a given process system or product we can even use it to predict
the likelihood of a homeowner defaulting on a mortgage these are all good
examples of problems that can be solved using logistic regression notice that in
all these examples not only do we predict the class of each case we also
measure the probability of a case belonging to a specific class there are
different machine learning algorithms which can classify or estimate a variable the
question is when should we use logistic regression here are four situations in
which logistic regression is a good candidate first when the target field in
your data is categorical or specifically is binary such as 0 1 yes/no churn or no
churn positive negative and so on second you need the probability of your
prediction for example if you want to know what the probability is of a
customer buying a product logistic regression returns a probability score
between 0 & 1 for a given sample of data in fact
logistic regression predicts the probability of that sample and we’ve
mapped the cases to a discrete class based on that probability third if your
data is linearly separable the decision boundary of logistic
regression is a line or a plane or a hyperplane a classifier will classify
all the points on one side of the decision boundary as belonging to one
class and all those on the other side as belonging to the other class for example
if we have just two features and are not applying any polynomial processing we
can obtain an inequality like theta 0 plus theta 1 X 1 plus theta 2 X 2 is
greater than 0 which is a half-plane please note that in
using logistic regression we can also achieve a complex decision boundary
using polynomial processing as well which is out of scope
here you’ll get more insight from decision boundaries when you understand
how logistic regression works fourth you need to understand the impact of a
feature you can select the best features based on the statistical significance of
the logistic regression model coefficients or parameters that is after
finding the optimum parameters a feature X with the weight theta one close to
zero has a smaller effect on the prediction than features with large
absolute values of theta one indeed it allows us to understand the impact an
independent variable has on the dependent variable while controlling
other independent variables let’s look at our data set again we define the
independent variables as X and dependent variable as Y notice that for the sake
of simplicity we can code the target or dependent values to 0 or 1 the goal of
logistic regression is to build a model to predict the class of each sample
which in this case is a customer as well as the probability of each sample
belonging to a class given that let’s start to formalize the problem X is our
data set in the space of real numbers of M by n that is of M dimensions or
features and n records and Y is the class that we want to predict which can
be either 0 or 1 ideally a logistic regression model so
called y hat can predict that the class of the customer is 1 given its features
X it can also be shown quite easily that the probability of a customer being in
class 0 can be calculated as 1 minus the probability that the class of the
customer is 1
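As a minimal sketch, assuming a tiny made-up churn-style table with illustrative tenure, age and income values, scikit-learn's LogisticRegression can return both the predicted class and the two class probabilities, which sum to 1 as described above.

```python
# Sketch: logistic regression on a tiny synthetic churn-style dataset.
# Feature names (tenure, age, income in thousands) and values are illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[2, 25, 30], [48, 52, 82], [5, 33, 41],
              [60, 60, 99], [1, 22, 28], [36, 45, 70]])
y = np.array([1, 0, 1, 0, 1, 0])            # 1 = churned, 0 = stayed

clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)

new_customer = np.array([[4, 29, 35]])      # tenure, age, income of an unseen customer
print("predicted class:", clf.predict(new_customer)[0])
print("P(churn=0), P(churn=1):", clf.predict_proba(new_customer)[0])   # the two columns sum to 1
```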
classification so let’s get started it machine learning classification is a
supervised learning approach which can be thought of as a means of categorizing
or classifying some unknown items into a discrete set of classes classification
attempts to learn the relationship between a set of feature variables and a
target variable of interest the target attribute and classification is a
categorical variable with discrete values so how does classification and
classifiers work given a set of training data points along with the target labels
classification determines the class label for an unlabeled test case let’s
explain this with an example a good sample of classification is the loan
default prediction suppose a bank is concerned about the
potential for loans not to be repaid if previous loan default data can be used
to predict which customers are likely to have problems repaying loans these bad
risk customers can either have their loan application declined or offered
alternative products the goal of a loan default predictor is to use existing
loan default data which is information about the customers such as age income
education and so on to build a classifier pass a new customer or
potential future defaulter to the model and then label it ie the data points as
defaulter or not defaulter or for example 0 or 1 this is how a classifier
predicts an unlabeled test case please notice that this specific example was
about a binary classifier with two values we can also build classifier
models for both binary classification and multi-class classification for
example imagine that you’ve collected data about a set of patients all of whom
suffered from the same illness during their course of treatment each patient
responded to one of three medications you can use this labelled data set with
a classification algorithm to build a classification model then you can use it
to find out which drug might be appropriate for a future patient with
the same illness as you can see it is a sample of multi-class classification
classification has different business use cases as well for example to predict
the category to which a customer belongs for churn detection where we predict
whether a customer switches to another provider or brand or to predict whether
or not a customer responds to a particular advertising campaign data
classification has several applications in a wide variety of industries
essentially many problems can be expressed as associations between
feature and target variables especially when label data is available this
provides a broad range of applicability for classification for example
classification can be used for email filtering speech recognition handwriting
recognition biometric identification document classification and much more
here we have the types of classification algorithms in machine learning
they include decision trees naive Bayes linear discriminant analysis k nearest
neighbor logistic regression neural networks and support vector
machines there are many types of classification algorithms we will only
cover a few in this course hello and welcome in this video we’ll be covering
the K nearest neighbors algorithm so let’s get started
imagine that a telecommunications provider has segmented his customer base
by service usage patterns categorizing the customers into four groups
if demographic data can be used to predict group membership the company can
customize offers for individual prospective customers this is a
classification problem that is given the data set with predefined labels we need
to build a model to be used to predict the class of a new or unknown case the
example focuses on using demographic data such as region age and marital
status to predict usage patterns the Target Field called cust cat has four
possible values that correspond to the four customer groups as follows basic
service E-service plus service and total service our objective is to build a
classifier for example using the row 0 to 7 to predict the class of row 8 we
will use a specific type of classification called K nearest neighbor
just for sake of demonstration let’s use only two fields as predictors
specifically age and income and then plot the customers based on their group
membership now let’s say that we have a new customer for example record number 8
with a known age and income how can we find the class of this customer can we
find one of the closest cases and assign the same class label to our new customer
can we also say that the class of our new customer is most probably group 4 ie
total service because its nearest neighbor is also of class 4 yes we can
in fact it is the first nearest neighbor now the question is to what extent can
we trust our judgment which is based on the first nearest neighbor it might be a
poor judgment especially if the first nearest neighbor is a very specific case
or an outlier correct now let’s look at our scatter plot again rather than
choose the first nearest neighbor what if we chose the five nearest neighbors
and did a majority vote among them to define the class of
our new customer in this case we see that three out of five nearest neighbors
tell us to go for class three which is plus service doesn’t this make more
sense yes in fact it does in this case the value of K in the K nearest
neighbors algorithm is five this example highlights the intuition behind the K
nearest neighbors algorithm now let’s define the K nearest neighbors the K
nearest neighbors algorithm is a classification algorithm that takes a
bunch of labeled points and uses them to learn how to label other points this
algorithm classifies cases based on their similarity to other cases in K
nearest neighbors data points that are near each other are said to be neighbors
k nearest neighbors is based on this paradigm similar cases with the same
class labels are near each other thus the distance between two cases is a
measure of their dissimilarity there are different ways to calculate the
similarity or conversely the distance or dissimilarity of two data points for
example this can be done using Euclidean distance now let’s see how the K nearest
neighbors algorithm actually works in a classification problem the K nearest
neighbors algorithm works as follows one pick a value for K two calculate the
distance from the new case holdout to each of the cases in the data set three
search for the K observations in the training data that are nearest to the
measurements of the unknown data point and four predict the response of the
unknown data point using the most popular response value from the K
nearest neighbors
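Here is a short sketch of those four steps using scikit-learn's KNeighborsClassifier on made-up age and income values (all numbers and group labels are illustrative).

```python
# Sketch: the four KNN steps via scikit-learn, on made-up (age, income) data.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X = np.array([[25, 30], [30, 35], [45, 90], [50, 100], [23, 28],
              [41, 85], [52, 110], [28, 33], [47, 95], [26, 31]], dtype=float)
y = np.array([1, 1, 4, 4, 1, 3, 4, 1, 3, 1])          # customer group labels (1..4)

scaler = StandardScaler().fit(X)                      # normalize features before measuring distance
X_norm = scaler.transform(X)

knn = KNeighborsClassifier(n_neighbors=5)             # step 1: pick K
knn.fit(X_norm, y)                                    # steps 2-3 happen inside predict: distances, then the nearest K
new_case = scaler.transform([[27, 32]])               # an unknown customer
print("predicted group:", knn.predict(new_case)[0])   # step 4: majority vote of the K neighbors
```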
there are two parts in this algorithm that might be a bit confusing first how to select the correct K and second how to compute the
similarity between cases for example among customers let’s first start with a
second concern that is how can we calculate the similarity between two
data points assume that we have two customers customer 1 and customer 2 and
for a moment assume that these two customers have only one feature age we
can easily use a specific type of Minkowski distance to calculate the
distance of these two customers indeed the Euclidean distance
of X 1 from X 2 is the root of 34 minus 30 to the power of 2 which is 4 what about if we
have more than one feature for example age and income if we have income and age
for each customer we can still use the same formula but this time we’re using
it in a two dimensional space we can also use the same distance metric for
multi-dimensional vectors of course we have to normalize our feature set to get
the accurate dissimilarity measure
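A quick numpy sketch of the distance calculation, with illustrative ages and incomes:

```python
# Sketch: Euclidean (Minkowski with p=2) distance for one and two features.
import numpy as np

# one feature: age only
age1, age2 = 34, 30
print(np.sqrt((age1 - age2) ** 2))                 # 4.0

# two features: age and income (illustrative values, income in thousands;
# in practice the features would be normalized first)
cust1 = np.array([34, 190.0])
cust2 = np.array([30, 200.0])
print(np.linalg.norm(cust1 - cust2))               # same formula in a two dimensional space
```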
there are other dissimilarity measures as well that can be used for this purpose but as mentioned it is highly dependent on data
type and also the domain that classification is done in as
mentioned K in K nearest neighbors is the number of nearest neighbors to
examine it is supposed to be specified by the user so how do we choose the
right K assume that we want to find the class of the customer noted as
question mark on the chart what happens if we choose a very low value of K let’s
say k equals 1 the first nearest point would be blue which is class 1 this
would be a bad prediction since more of the points around it are magenta or
class 4 in fact since its nearest neighbor is blue we can say that we
captured the noise in the data or we chose one of the points that was an
anomaly in the data a low value of K causes a highly complex model as well
which might result in overfitting of the model
it means the prediction process is not generalized enough to be used for
out-of-sample cases out-of-sample data is data that is outside of the data set
used to train the model in other words it could not be trusted to be used for
prediction of unknown samples it’s important to remember that overfitting
is bad as we want a general model that works for any data not just the data
used for training now on the opposite side of the spectrum if we choose a very
high value of K such as K equals 20 then the model becomes overly generalized
so how can we find the best value for K the general solution is to reserve a
part of your data for testing the accuracy of the model once you’ve done
so choose K equals 1 and then use the training part for
modeling and calculate the accuracy of prediction using all samples in your
test set repeat this process increasing the K and
see which K is best for your model for example in our case k equals four will
give us the best accuracy
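A minimal sketch of that procedure, using synthetic data from make_classification as a stand-in for the real customer records:

```python
# Sketch: hold out a test set, then try several K values and keep the most accurate.
from sklearn.datasets import make_classification   # synthetic stand-in for the customer data
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=1, n_classes=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

for k in range(1, 11):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    acc = accuracy_score(y_test, knn.predict(X_test))
    print(f"K={k:2d}  test accuracy={acc:.3f}")
```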
nearest neighbors analysis can also be used to compute values for a continuous target in this situation the average or median
target value of the nearest neighbors is used to obtain the predicted value for
the new case for example assume that you are predicting the price of a home based
on its feature set such as number of rooms square footage the year it was
built and so on you can easily find the three nearest neighbor houses of course
not only based on distance but also based on all the attributes and then
predict the price of the house as the median of its neighbors
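A small sketch of the regression variant on a made-up house table; note that scikit-learn's KNeighborsRegressor averages the neighbors' target values rather than taking the median.

```python
# Sketch: K nearest neighbors for a continuous target (a made-up house-price example),
# predicting with the average of the nearest neighbors' prices.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# columns: rooms, square footage, year built (illustrative values)
X = np.array([[3, 1200, 1995], [4, 1800, 2005], [2, 900, 1980],
              [5, 2400, 2015], [3, 1400, 2000], [4, 2000, 2010]], dtype=float)
prices = np.array([210, 340, 150, 460, 250, 380], dtype=float)   # in thousands

reg = KNeighborsRegressor(n_neighbors=3)       # average of the 3 nearest houses
reg.fit(X, prices)
print(reg.predict([[3, 1500, 2002]]))          # predicted price for a new house
```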
this video we’ll be covering evaluation metrics for classifiers so let’s get
started evaluation metrics explain the
performance of a model let’s talk more about the model evaluation metrics that
are used for classification imagine that we have an historical data set which
shows the customer churn for a telecommunication company we have
trained the model and now we want to calculate its accuracy using the test
set we pass the test set to our model and find the predicted labels now the
question is how accurate is this model basically we compare the actual values
in the test set with the values predicted by the model to calculate the
accuracy of the model evaluation metrics play a key role in the development of
a model as they provide insight into areas that might require improvement there are
different model evaluation metrics but we'll just talk about three of them here
specifically the Jaccard index f1 score and log loss let's first look at one of the
simplest accuracy measurements the Jaccard index also known as the Jaccard
similarity coefficient let’s say Y shows the true labels of the churn data set
and Y hat shows the predicted values by our classifier then we can define
Jaccard as the size of the intersection divided by the size of the union
of the two label sets for example for a test set of size 10 with 8 correct predictions or
8 intersections the accuracy by the Jaccard index would be zero point six
six if the entire set of predicted labels for a sample strictly matches
with the true set of labels then the subset accuracy is one point zero
otherwise it is zero point zero
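A small sketch reproducing that arithmetic with made-up labels (8 correct predictions out of a test set of 10):

```python
# Sketch: the Jaccard index as used here, |y ∩ y_hat| / |y ∪ y_hat|,
# i.e. correct predictions / (size of y + size of y_hat - correct predictions).
import numpy as np

y_true = np.array([1, 1, 0, 0, 1, 0, 1, 0, 1, 0])   # 10 test labels (illustrative)
y_pred = np.array([1, 1, 0, 0, 0, 0, 1, 1, 1, 0])   # 8 of them predicted correctly

correct = np.sum(y_true == y_pred)                   # size of the intersection
jaccard = correct / (len(y_true) + len(y_pred) - correct)
print(correct, round(jaccard, 2))                    # 8 and 8/12, the ~0.66 from the example
```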
another way of looking at accuracy of classifiers is to look at a confusion matrix for example let's assume that our
test set has only 40 rows this matrix shows the correct and wrong
predictions in comparison with the actual labels each confusion matrix row
shows the actual true labels in the test set and the columns show the predicted
labels by classifier let’s look at the first row the first row is for customers
whose actual churn value in the test set is 1 as you can calculate out of 40
customers the churn value of 15 of them is 1 and out of these 15 that classifier
correctly predicted 6 of them as 1 and 9 of them as 0 this means that for 6
customers the actual churn value was 1 in the test set and the classifier also
correctly predicted those as 1 however while the actual label of 9 customers
was 1 the classifier predicted those as 0 which is not very good we can consider
this as an error of the model for the first row what about the customers with
a churn value 0 let’s look at the second row it looks like there were 25
customers whose churn value was 0 the classifier correctly predicted 24 of
them 0 and one of them wrongly predicted as 1 so it has done a good job in
predicting the customers with a churn value of 0 a good thing about the
confusion matrix is that it shows the model’s ability to correctly predict or
separate the classes in the specific case of a binary classifier such as this
example we can interpret these numbers as the count
true positives false negatives true negatives and false positives based on
the count of each section we can calculate the precision and recall of
each label precision is a measure of the accuracy provided that a class label has
been predicted it is defined by precision equals true positive divided
by true positive plus false positive and recall is the true positive rate it is
defined as recall equals true positive divided by true positive plus false
negative so we can calculate the precision and recall of each class now
we’re in the position to calculate the f1 scores for each label based on the
precision and recall of that label the f1 score is the harmonic average of the
precision and recall where an f1 score reaches its best value at 1 which
represents perfect precision and recall and it’s worst at 0 it is a good way to
show that a classifier has a good value for both recall and precision it is
defined using the f1 score equation for example the f1 score for class 0 ie
churn equals 0 is zero point eight three and the f1 score for class one ie churn
equals one is zero point five five and finally we can tell the average accuracy
for this classifier is the average of the f1 score for both labels which is
zero point seven two in our case please notice that both Jaccard and f1 score
can be used for multi class classifiers as well which is out of scope for this
course now let’s look at another accuracy metric for classifiers
now let's look at another accuracy metric for classifiers
sometimes the output of a classifier is the probability of a class label instead
of the label for example in logistic regression the output can be the
probability of customer churn ie yes or churn equals one this probability is a
value between 0 & 1 logarithmic loss also known as log loss
measures the performance of a classifier where the predicted output is a
probability value between 0 & 1 so for example predicting a probability of
0.13 when the actual label is 1 would be bad and would result in a high log loss
we can calculate the log loss for each row using the log loss equation which
measures how far each prediction is from the actual label then we calculate the
average log loss across all rows of the test set it is obvious that more ideal
classifiers have progressively smaller values of log loss so the classifier
with the lower log loss has better accuracy
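A short sketch with illustrative probabilities, including the 0.13-versus-label-1 case mentioned above:

```python
# Sketch: log loss for predicted probabilities against the true labels.
from sklearn.metrics import log_loss

y_true = [1, 1, 0, 0, 1]
good_probs = [0.9, 0.8, 0.1, 0.2, 0.7]    # confident and mostly right -> low log loss
bad_probs = [0.13, 0.4, 0.6, 0.7, 0.2]    # e.g. 0.13 when the label is 1 -> high log loss

print(log_loss(y_true, good_probs))
print(log_loss(y_true, bad_probs))
```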
we’re going to introduce and examine decision trees so let’s get started what
exactly is a decision tree how do we use them to help us classify how can I grow
my own decision tree these may be some of the questions that you have in mind
from hearing the term decision tree hopefully you’ll soon be able to answer
these questions and many more by watching this video
imagine that you’re a medical researcher compiling data for a study you’ve
already collected data about a set of patients all of whom suffered from the
same illness during their course of treatment each patient responded to one
of two medications we'll call them drug A and drug B part of your job is to build
a model to find out which drug might be appropriate for a future patient with
the same illness the feature sets of this data set are age gender blood
pressure and cholesterol of our group of patients and the target is the drug that
each patient responded to it is a sample of a binary classifier and you can use
the training part of the data set to build a decision tree and then use it to
predict a class of an unknown patient in essence to come up with a decision on
which drug to prescribe to a new patient let’s see how a decision tree is built
for this data set decision trees are built by splitting the training set into
distinct nodes where one node contains all of or most of one category of the
data if we look at the diagram here we can see that it’s a patient’s classifier
so as mentioned we want to prescribe a drug to a new patient but the decision
to choose drug a or B will be influenced by the patient’s situation we start with
age which could be young middle aged or senior if the patient is middle-aged
then we’ll definitely go for drug B on the other hand if he is a young or a
senior patient we’ll need more details to help us determine which drug to
prescribe the additional decision variables can be things such as
cholesterol levels gender or blood pressure for example if the patient is
female we will recommend drug A but if the patient is male then we'll go for
drug B as you can see decision trees are about testing an attribute and branching
the cases based on the result of the test each internal node corresponds to a
test and each branch corresponds to a result of the test and each leaf node
assigns a patient to a class now the question is how can we build such a
decision tree here is the way that a decision tree is built a decision tree
can be constructed by considering the attributes one by one first choose an
attribute from our data set calculate the significance of the attribute in the
splitting of the data in the next video we will explain how to calculate the
significance of an attribute to see if it’s an effective attribute or not next
split the data based on the value of the best attribute then go to each branch
and repeat it for the rest of the attributes after building this tree you
can use it to predict the class of unknown cases or in our case the proper
drug for a new patient based on his or her characteristics
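Here is a hedged sketch of that idea on a tiny made-up version of the drug table (the column names, values and the simple integer encoding are illustrative, not the course's actual data set):

```python
# Sketch: a decision tree on a tiny, made-up version of the drug data set
# (age, sex, blood pressure, cholesterol), with categorical fields integer-encoded.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.DataFrame({
    "Age": [23, 47, 61, 35, 52, 29, 68, 41],
    "Sex": ["F", "M", "F", "M", "F", "M", "M", "F"],
    "BP": ["HIGH", "LOW", "NORMAL", "HIGH", "NORMAL", "LOW", "HIGH", "NORMAL"],
    "Cholesterol": ["HIGH", "HIGH", "NORMAL", "NORMAL", "HIGH", "NORMAL", "HIGH", "NORMAL"],
    "Drug": ["A", "B", "B", "A", "B", "A", "B", "B"],
})
X = df[["Age", "Sex", "BP", "Cholesterol"]].copy()
for col in ["Sex", "BP", "Cholesterol"]:
    X[col] = X[col].astype("category").cat.codes      # simple integer encoding for the sketch

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3)
tree.fit(X, df["Drug"])
print(export_text(tree, feature_names=list(X.columns)))

new_patient = pd.DataFrame([[50, 0, 1, 0]], columns=X.columns)  # encoded like the training data
print("suggested drug:", tree.predict(new_patient)[0])
```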
this video we’ll be covering the process of building decision trees so let’s get
started consider the drug data set again the question is how do we build a
decision tree based on that data set decision trees are built using recursive
partitioning to classify the data let’s say we have 14 patients in our dataset
the algorithm chooses the most predictive feature to split the data on
what is important in making a decision tree is to determine which attribute is
the best or most predictive to split the data on
let’s say we pick cholesterol as the first attribute to split data it will
split our data into two branches as you can see if the patient has high
cholesterol we cannot say with high confidence that
drug B might be suitable for him also if the patient’s cholesterol is normal we
still don’t have sufficient evidence or information to determine if either drug
a or drug B is in fact suitable it is a sample of bad attribute selection for
splitting data so let’s try another attribute again we have our 14 cases
this time we pick the sex attribute of patients it will split our data into two
branches male and female as you can see if the patient is female we can say
drug B might be suitable for her with high certainty but if the patient is
male we don't have sufficient evidence or information to determine if drug A or
drug B is suitable however it is still a better choice in comparison with
the cholesterol attribute because the resulting nodes are more pure
it means nodes that are either mostly drug a or drug B so we can say the sex
attribute is more significant than cholesterol or in other words it’s more
predictive than the other attributes indeed predictiveness is based on
decrease in impurity of nodes we're looking for the best feature to decrease
the impurity of patients in the leaves after splitting them up based on that
feature so the sex feature is a good candidate in the following case because
it almost found the pure patients let’s go one step further for the male patient
branch we again test other attributes to split the subtree we test cholesterol
again here as you can see it results in even more pure leaves so we can easily
make a decision here for example if a patient is male and his cholesterol is
high we can certainly prescribe drug A but if it is normal we can prescribe
drug B with high confidence as you might notice the choice of attribute to split
data is very important and it is all about purity of the leaves after the
split a node in the tree is considered pure if
in 100% of the cases the nodes fall into a specific category of the target field
in fact the method uses recursive partitioning to split the training
records into segments by minimizing the impurity at each step impurity of nodes
is calculated by entropy of data in the node so what is entropy entropy is the
amount of information disorder or the amount of randomness in the data the
entropy in the node depends on how much random data is in that node and is
calculated for each node in decision trees we’re looking for trees that have
the smallest entropy in their nodes the entropy is used to calculate the
homogeneity of the samples in that node if the samples are completely
homogeneous the entropy is zero and if the samples are equally divided it has
an entropy of one this means if all the data in a node are either drug a or drug
B then the entropy is zero but if half of the data are drug A and the other half are
B then the entropy is one you can easily calculate the entropy of a node using
the frequency table of the attribute through the entropy formula where p is
for the proportion or ratio of a category such as drug a or b please
remember though that you don’t have to calculate these as it’s easily
calculated by the libraries or packages that you use as an example let’s
calculate the entropy of the data set before splitting it we have nine
occurrences of drug B and five of drug A you can embed these numbers into the
entropy formula to calculate the impurity of the target attribute before
splitting it in this case it is 0.94 so what is entropy after splitting now we
can test different attributes to find the one with the most predictiveness
which results in two more pure branches let's first select the cholesterol of
the patient and see how the data gets split based on its values for example
when it is normal we have 6 for drug B and 2 for drug a we can calculate the
entropy of this node based on the distribution of drug a and B which is
0.8 in this case but when cholesterol is high the data is
split into three for drug B and three for drug a calculating its entropy we
can see it would be 1.0 we should go through all the attributes and calculate
the entropy after the split and then choose the best attribute okay
let’s try another field let’s choose the sex attribute for the next check as you
can see when we use the sex attribute to split the data when its value is female
we have three patients that respond to drug B and four patients that respond
to drug A the entropy for this node is 0.98 which is not very promising
however on the other side of the branch when the value of the sex attribute is
male the result is more pure with six for drug B and only one for drug a the
entropy for this group is zero point five nine now the question is between
the cholesterol and sex attributes which one is a better choice which one is
better at the first attribute to divide the data set into two branches or in
other words which attribute results in more pure nodes for our drugs or in
which tree do we have less entropy after splitting rather than before splitting
the sex attribute with entropy of 0.98 and 0.59 or the cholesterol attribute
with entropy of 0.81 and 1.0 in its branches
the answer is the tree with the higher information gain after splitting so
what is information gain information gain is the information that can
increase the level of certainty after splitting it is the entropy of a tree
before the split minus the weighted entropy after the split by an attribute
we can think of information gain and entropy as opposites as entropy or the
amount of randomness decreases the information gained or amount of
certainty increases and vice versa so constructing a decision tree is all
about finding attributes that return the highest information gain let's see how
information gain is calculated for the sex attribute as mentioned the
information gain is the entropy of the tree before the split minus the weighted
entropy after the split the entropy of the tree
before the split is 0.94 the proportion of female patients is 7 out of 14 and its
entropy is 0.985 also the proportion of men is 7 out of 14 and the entropy of
the male node is 0.59 the result of the square bracket here is the weighted
entropy after the split so the information gain of the tree if we use
the sex attribute to split the data set is 0.151 as you can see we will
consider the entropy over the distribution of samples falling under
each leaf node and we’ll take a weighted average of that entropy weighted by the
proportion of samples falling under that leaf we can calculate the information
gain of the tree if we use cholesterol as well it is zero point zero four eight now
the question is which attribute is more suitable well as mentioned the tree with
the higher information gain after splitting this means the sex attribute
so we select the sex attribute as the first splitter now what is the next
attribute after branching by the sex attribute well as you can guess we
should repeat the process for each branch and test each of the other
attributes to continue to reach the most pure leaves this is the way you build a
decision tree
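As a quick check of the numbers in the drug example above (9 drug B and 5 drug A overall, then the sex and cholesterol splits), here is a small numpy sketch of entropy and information gain:

```python
# Sketch: reproducing the entropy and information-gain arithmetic from the drug example.
import numpy as np

def entropy(counts):
    p = np.array(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return float(-(p * np.log2(p)).sum())

before = entropy([9, 5])                                   # ~0.940 before any split
female, male = entropy([3, 4]), entropy([6, 1])            # ~0.985 and ~0.592
sex_gain = before - (7/14) * female - (7/14) * male        # ~0.151
normal, high = entropy([6, 2]), entropy([3, 3])            # ~0.811 and 1.0
chol_gain = before - (8/14) * normal - (6/14) * high       # ~0.048

print(round(before, 3), round(sex_gain, 3), round(chol_gain, 3))
```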
hello and welcome in this video we will learn the difference between linear
regression and logistic regression we go over linear regression and see why it
cannot be used properly for some binary classification problems we also look at
the sigmoid function which is the main part of logistic regression let’s start
let’s look at the telecommunication data set again the goal of logistic
regression is to build a model to predict the class of each customer and
also the probability of each sample belonging to a class ideally we want to
build a model y hat that can estimate that the class of a customer is 1 given
its features X I want to emphasize that Y is the labels vector also called
actual values that we would like to predict and Y hat is the vector of the
predicted values by our model mapping the class labels to integer numbers can
we use linear regression to solve this problem first let’s recall how linear
regression works to better understand logistic regression forget about the
churn prediction for a minute and assume our goal is to predict the income of
customers in the data set this means that instead of predicting churn which
is a categorical value let’s predict income which is a
continuous value so how can we do this let’s select an independent variable
such as customer age and predict a dependent variable such as income of
course we can have more features but for the sake of simplicity let’s just take
one feature here we can plot it and show age as an independent variable and
income as the target value we would like to predict with linear regression
you can fit a line or polynomial through the data we can find this line through
training our model or calculating it mathematically based on the sample sets
we’ll say this is a straight line through the sample set this line has an
equation shown as a plus B x1 now use this line to predict the continuous
value Y that is use this line to predict the income of an unknown customer based
on his or her age and it is done what if we want to predict churn can we use the
same technique to predict a categorical field such as churn okay let’s see say
we’re given data on customer churn and our goal this time is to predict the
churn of customers based on their age we have a feature age denoted as x1 and a
categorical feature churn with two classes churn is yes and churn is no
as mentioned we could map yes and no to integer values 0 and 1 how can we model
it now well graphically we could represent our data with a scatter plot
but this time we have only two values for the y axis in this plot class 0 is
denoted in red and class 1 is denoted in blue our goal here is to make a model
based on existing data to predict if a new customer is red or blue let’s do the
same technique that we use for linear regression here to see if we can solve
the problem for a categorical attribute such as churn with linear regression you
again can fit a polynomial through the data which is shown traditionally as a
plus B X this polynomial can also be shown traditionally as theta 0 plus
theta 1 x 1 this line has two parameters which are shown with vector theta where
the values of the vector are theta 0 and theta 1 we can also show the equation of this
line formally as theta transpose X and generally we can show the equation for a
multi-dimensional space as theta transpose X where theta is the
parameters of the line in two-dimensional space or parameters of a
plane in three-dimensional space and so on as theta is a vector of parameters
and is supposed to be multiplied by X it is shown conventionally as transpose
theta theta is also called the weight vector or coefficients of the equation
with both these terms used interchangeably and X is the feature set
which represents a customer anyway given a dataset and all the feature sets X the theta
parameters can be calculated through an optimization algorithm or mathematically
which results in the equation of the fitting line for example the parameters
of this line are minus 1 and 0.1 and the equation for the line is minus 1 plus
0.1 X 1 now we can use this regression line to predict the churn of a new
customer for example for our customer or let’s say a data point with X value of
age equals 13 we can plug the value into the line formula and the Y value is
calculated and returns a number for instance for point 1 we have theta
transpose x equals minus 1 plus 0.1 times x1 equals minus 1 plus 0.1 times 13 equals
0.3 we can show it on our graph now we can define a threshold here for example
at 0.5 to define the class so we write a rule here for our model Y hat which
allows us to separate class 0 from class 1 if the value of theta transpose X is
less than 0.5 then the class is 0 otherwise if the value of theta
transpose X is more than 0.5 then the class is 1 and because our customers Y
value is less than the threshold we can say it belongs to class 0 based on our
model but there is one problem here what is the probability that this customer
belongs to class 0 as you can see it’s not the best model to solve this problem
also there are some other issues which verify that linear regression is not the
proper method for classification problems so as mentioned if we use the
regression line to calculate the class of a point it always returns a number
such as three or negative two and so on then we should use a threshold for
example zero point five to assign that point to either class of zero or one
this threshold works as a step function that outputs zero or one regardless of
how big or small positive or negative the input is so using the threshold we
can find the class of a record notice that in the step function no matter how
big the value is as long as it’s greater than 0.5 it simply equals 1 and vice
versa regardless of how small the value Y is the output would be zero if it is
less than 0.5 in other words there is no difference between a customer who has a
value of 1 or 1000 the outcome would be 1 instead of having this step function
wouldn’t it be nice if we had a smoother line one that would project these values
between 0 and 1 indeed the existing method does not really give us the
probability of a customer belonging to a class which is very desirable we need a
method that can give us the probability of falling in a class as well so what is
the scientific solution here well if instead of using theta transpose
X we use a specific function called sigmoid then sigmoid of theta transpose
X gives us the probability of a point belonging to a class instead of the
value of y directly I’ll explain the sigmoid function in a second but for now
please accept that it will do the trick instead of calculating the value of
theta transpose X directly it returns the probability that a theta transpose X
is very big or very small it always returns a value between 0 and 1
depending on how large the theta transpose X actually is now our model is
sigmoid of theta transpose X which represents the probability that the
output is 1 given X now the question is what is the sigmoid function let me
explain in detail what sigmoid really is the sigmoid function also called the
logistic function resembles the step function and is used
by the following expression in the logistic regression the sigmoid function
looks a bit complicated at first but don’t worry about remembering this
equation it’ll make sense to you after working with it notice that in the
sigmoid equation when theta transpose X is very big the e power minus theta
transpose X in the denominator of the fraction becomes almost zero and the
value of the sigmoid function gets closer to one if theta transpose X is
very small the sigmoid function gets closer to zero as depicted in the
sigmoid plot when theta transpose X gets bigger the value of the sigmoid function
gets closer to 1 and also if the theta transpose X is very small the sigmoid
function gets closer to 0 so the sigmoid functions output is always between 0 and
1 which makes it proper to interpret the results as probabilities it is obvious
that when the outcome of the sigmoid function gets closer to 1 the
probability of y equals 1 given X goes up and in contrast when the sigmoid
value is closer to 0 the probability of y equals 1 given X is very small so what
is the output of our model when we use the sigmoid function in logistic
regression we model the probability that an input X belongs to the default class
y equals 1 and we can write this formula as probability of y equals 1 given X we
can also write probability of Y belongs to class 0 given X is 1 minus
probability of y equals 1 given X for example the probability of a customer
leaving the company can be shown as probability of churn equals 1 given a
customer’s income and age which can be for instance 0.8 and the probability of
churn is 0 for the same customer given a customer’s income and age can be
calculated as 1 minus 0.8 equals 0.2
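A tiny numpy sketch of the sigmoid and its limiting behaviour (z stands in for theta transpose x; the values are illustrative):

```python
# Sketch: the sigmoid (logistic) function and its behaviour for large and small inputs.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))      # z stands in for theta^T x

for z in [-6, -2, 0, 0.3, 2, 6]:
    print(f"z = {z:5.1f}  ->  sigmoid(z) = {sigmoid(z):.3f}")
# large positive theta^T x -> output close to 1; large negative -> output close to 0
```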
so now our job is to train the model to set its parameter values in such a way that our model is a good estimate of
probability of y equals 1 given X in fact this is what a good classifier
model built by logistic regression is supposed to do for us
also it should be a good estimate of probability of y belongs to class zero
given X that can be shown as 1 minus Sigma of theta transpose X now the
question is how can we achieve this we can find theta through the training
process so let’s see what the training process is step 1 initialize theta
vector with random values as with most machine learning algorithms for example
minus 1 or 2 step 2 calculate the model output which is sigmoid of theta
transpose X for a sample customer in your training set X in theta transpose X
is the feature vector values for example the age and income of the customer for
instance 2 & 5 and theta is the confidence or weight that you’ve set in
the previous step the output of this equation is the prediction value in
other words the probability that the customer belongs to class 1 step 3
compare the output of our model y hat which could be a value of let’s say zero
point 7 with the actual label of the customer which is for example one for
churn then record the difference as our model's error for this customer which
would be one minus zero point seven which of course equals zero point three
this is the error for only one customer out of all the customers in the training
set step four calculate the error for all customers as we did in the previous
steps and add up these errors the total error is the cost of your model and is
calculated by the models cost function the cost function by the way basically
represents how to calculate the error of the model which is the difference
between the actual and the models predicted values so the cost shows how
poorly the model is estimating the customers labels therefore the lower the
cost the better the model is at estimating the customers labels
correctly and so what we want to do is to try to minimize this cost step 5 but
because the initial values for theta were chosen randomly it’s very likely
that the cost function is very high so we change the theta in such a way to
hopefully reduce the total cost step 6 after changing the values of theta we go
back to step 2 then we start another iteration and
calculate the cost of the model again and we keep doing those steps over and
over changing the values of theta each time until the cost is low enough so
this brings up two questions first how can we change the values of theta so
that the cost is reduced across iterations and second when should we
stop the iterations there are different ways to change the values of theta but
one of the most popular ways is gradient descent also there are various ways to
stop iterations but essentially you stop training by calculating the accuracy of
your model and stop it when it's satisfactory
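Putting those six steps together, here is a from-scratch sketch on synthetic data; the parameter update used here is a plain gradient-descent step and the cost is the log loss, both of which are explained in the next videos, so treat the details as a preview rather than the course's exact recipe.

```python
# Sketch of the training loop described above: random theta, predict with the
# sigmoid, measure the error, then repeatedly nudge theta (here with a plain
# gradient-descent update) until the cost stops improving. Data are synthetic.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                       # two features, e.g. scaled age and income
true_theta = np.array([1.5, -2.0])
y = (1 / (1 + np.exp(-(X @ true_theta))) > 0.5).astype(float)

theta = rng.normal(size=2)                          # step 1: random initial parameters
lr = 0.1
for step in range(1, 501):
    y_hat = 1 / (1 + np.exp(-(X @ theta)))          # step 2: model output, sigmoid(theta^T x)
    errors = y_hat - y                              # step 3: difference from the actual labels
    cost = np.mean(-y * np.log(y_hat + 1e-12) - (1 - y) * np.log(1 - y_hat + 1e-12))  # step 4
    theta -= lr * (X.T @ errors) / len(y)           # steps 5-6: change theta to reduce the cost
    if step % 100 == 0:
        print(f"iteration {step:3d}  cost {cost:.4f}")
```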
video we’ll learn more about training a logistic regression model also we’ll be
discussing how to change the parameters of the model to better estimate the
outcome finally we talked about the cost function and gradient descent in
logistic regression as a way to optimize the model so let’s start the main
objective of training in logistic regression is to change the parameters
of the model so as to be the best estimation of the labels of the samples
in the dataset for example the customer churn how do we do that in brief first
we have to look at the cost function and see what the relation is between the
cost function and the parameters theta so we should formulate the cost function
then using the derivative of the cost function we can find how to change the
parameters to reduce the cost or rather the error let’s dive into it to see how
it works but before I explain it I should highlight for you that it needs
some basic mathematical background to understand it however you shouldn’t
worry about it as most data science languages like
Python R and Scala have some packages or libraries that calculate these
parameters for you so let’s take a look at it let’s first find the cost function
equation for a sample case to do this we can use one of the customers in the
churn problem there’s normally a general equation for calculating the cost the
cost function is the difference between the actual values of y and our model
output Y hat this is a general rule for most cost functions in machine learning
we can show this as the cost of our model comparing it with actual labels
which is the difference between the predicted value of our model and actual
value of the target field where the predicted value of our model is sigmoid
of theta transpose X usually the square of this equation is
used because of the possibility of the negative result and for the sake of
simplicity half of this value is considered as the cost function through
the derivative process now we can write the cost function for all the samples in
our training set for example for all customers we can write it as the average
sum of the cost functions of all cases it is also called the mean squared error
and as it is a function of a parameter vector theta it is shown as J of theta
okay good we have the cost function now how do we find or set the best weights
or parameters that minimize this cost function
the answer is we should calculate the minimum point of this cost function and
it’ll show us the best parameters for our model although we can find the
minimum point of a function using the derivative of a function there’s not an
easy way to find the global minimum point for such an equation given this
complexity describing how to reach the global minimum for this equation is
outside the scope of this video so what is the solution well we should find
another cost function instead one which has the same behavior that is easier to
find its minimum point let’s plot the desirable cost function for our model
recall that our model is y hat our actual value is y which equals 0 or 1
and our model tries to estimate it as we want to find a simple cost function for
our model for a moment assume that our desired value for Y is 1 this means our
model is best if it estimates y equals 1 in this case we need a cost function
that returns zero if the outcome of our model is one which is the same as the
actual label and the cost should keep increasing as the outcome of our model
is farther from one and cost should be very large if the outcome of our model
is close to zero we can see that the minus log function provides such a cost
function for us it means if the actual value is 1 and the model also predicts 1
the minus log function returns zero cost but if the prediction is smaller than 1
the minus log function returns a larger cost value so we can use the minus log
function for calculating the cost of our logistic regression model so if you
recall we previously noted that in general it is difficult to calculate the
derivative of the cost function well we can now change it with a minus
log of our model we can easily prove that in the case that desirable y is 1
the cost can be calculated as minus log y hat and in the case that desirable y
is 0 the cost can be calculated as minus log 1 minus y hat now we can plug it
into our total cost function and rewrite it as this function so this is the
logistic regression cost function as you can see for yourself it penalizes
situations in which the class is 0 and the model output is 1 and vice versa
remember however that Y hat does not return a class as output but it’s a
value of 0 or 1 which should be assumed as a probability now we can easily use
this function to find the parameters of our model in such a way as to minimize
the cost ok let’s recap what we’ve done our objective was to find a model that
ok let's recap what we've done our objective was to find a model that best estimates the actual labels finding the best model means finding the best
parameters theta for that model so the first question was how do we find the
best parameters for our model well by finding and minimizing the cost function
of our model in other words to minimize the J of theta we just defined the next
question is how do we minimize the cost function
the answer is using an optimization approach there are different
optimization approaches but we use one of the most famous and effective
approaches here gradient descent the next question is what is gradient
descent generally gradient descent is an iterative approach to finding the
minimum of a function specifically in our case gradient descent is a technique
to use the derivative of a cost function to change the parameter values to
minimize the cost or error let’s see how it works the main objective of gradient
descent is to change the parameter values so as to minimize the cost
how can gradient descent do that think of the parameters or weights in our
model to be in a two dimensional space for example theta 1 theta 2 for two
feature sets age and income recall the cost function J that we discussed in the
previous slides we need to minimize the cost function J which is a function of
variables theta 1 and theta 2 so let’s add a dimension for the cost or error J function let’s assume that if we plot the cost function based on all
possible values of theta 1 and theta 2 we can see something like this it
represents the error value for different values of parameters that is error which
is a function of the parameters this is called your error curve or error bowl of your cost function recall that we want to use this error bowl to find the best
parameter values that result in minimizing the cost value now the
question is which point is the best point for your cost function yes you
should try to minimize your position on the error curve so what should you do
you have to find the minimum value of the cost by changing the parameters but
which way will you add some value to your weights or deduct some value and
how much would that value be you can select random parameter values that
locate a point on the bowl you can think of our starting point being the yellow
point you change the parameters by delta theta1 and delta theta2 and take one
step on the surface let’s assume we go down one step in the bowl as long as
we’re going downwards we can go one more step the steeper the slope the further we
can step and we can keep taking steps as we approach the lowest point the slope
diminishes so we can take smaller steps until we reach a flat surface this is
the minimum point of our curve and the optimum theta1 theta2
what are these steps really I mean in which direction should we take these
steps to make sure we descend and how big should the steps be to find the
direction and size of these steps in other words to find how to update the
parameters you should calculate the gradient of the cost function at that
point the gradient is the slope of the surface
at every point and the direction of the gradient is the direction of the
greatest uphill now the question is how do we calculate the gradient of a cost
function at a point if you select a random point on this surface for example
the yellow point and take the partial derivative of J of theta with respect to
each parameter at that point it gives you the slope of the move for each
parameter at that point now if we move in the opposite direction of that slope
it guarantees that we go down in the error curve for example if we calculate
the derivative of J with respect to theta one we find out that it is a
positive number this indicates that the function is
increasing as theta one increases so to decrease J we should move in the
opposite direction this means to move in the direction of
the negative derivative for theta one ie slope we have to calculate it for other
parameters as well at each step the gradient value also indicates how big of
a step to take if the slope is large we should take a large step because we’re
far from the minimum if the slope is small we should take a smaller step
gradient descent takes increasingly smaller steps towards the minimum with
each iteration the partial derivative of the cost function J is calculated using
this expression if you want to know how the derivative of the J function is
calculated you need to know the derivative concept which is beyond our
scope here but to be honest you don’t really need to remember all the details
about it as you can easily use this equation to calculate the gradients so
in a nutshell this equation returns the slope of that point and we should update
the parameter in the opposite direction of the slope a vector of all these
slopes is the gradient vector and we can use this vector to change or update all
the parameters we take the previous values of the parameters and subtract
the error derivative this results in the new parameters for theta that we know
will decrease the cost also we multiply the gradient value by a constant value
mu which is called the learning rate learning rate gives us additional
control on how fast we move on the surface in sum we can simply say
gradient descent is like taking steps in the current direction of the slope and
the learning rate is like the length of the step you take so these would be our
new parameters notice that it’s an iterative operation and in each iteration we update the parameters and minimize the cost until the algorithm converges on an acceptable minimum
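Written out in the usual notation (this is the standard form for logistic regression, not copied from the slide), the partial derivative just mentioned and the update step with learning rate μ are

$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}\big(\hat{y}^{(i)} - y^{(i)}\big)\,x_j^{(i)}, \qquad \theta_j := \theta_j - \mu\,\frac{\partial J(\theta)}{\partial \theta_j}$$

okay let’s recap what we’ve done to this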
point by going through the training algorithm again step by step step one we
initialize the parameters with random values step two we feed the cost
function with the training set and calculate the cost we expect a high
error rate as the parameters are set randomly step three we calculate the
gradient of the cost function keeping in mind that we have to use a partial
derivative so to calculate the gradient vector we need all the training data to
feed the equation for each parameter of course this is an expensive part of
the algorithm but there are some solutions for this step 4
we update the weights with new parameter values step 5
here we go back to step 2 and feed the cost function again which has new
parameters as was explained earlier we expect less error as we’re going down
the error surface we continue this loop until we reach a small value of cost or a limited number of iterations step 6 the parameters should be roughly found after some iterations this means the model is ready and we can use it to predict the probability of a customer staying or leaving
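As a rough sketch of the six steps above, here is a minimal NumPy training loop for logistic regression with gradient descent; the two-feature data set and the learning rate are made up for illustration, and the intercept term is omitted for brevity.

```python
import numpy as np

# toy data: two features per customer (e.g. age, income) and a 0/1 churn label
X = np.array([[25, 30], [40, 60], [35, 45], [50, 80], [29, 38], [55, 90]], dtype=float)
y = np.array([1, 0, 1, 0, 1, 0], dtype=float)
X = (X - X.mean(axis=0)) / X.std(axis=0)       # normalize the features

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.random.randn(X.shape[1])            # step 1: random parameters
mu = 0.1                                       # learning rate
for _ in range(1000):                          # steps 2 to 5, repeated
    y_hat = sigmoid(X @ theta)                 # model output as probabilities
    cost = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    grad = X.T @ (y_hat - y) / len(y)          # step 3: gradient of the cost
    theta -= mu * grad                         # step 4: update the parameters
print(theta, cost)                             # step 6: roughly optimal parameters
```

hello and welcome in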
this video we will learn a machine learning method called support vector
machine or SVM which is used for classification so let’s get started
imagine that you’ve obtained a dataset containing characteristics of thousands
of human cell samples extracted from patients who were believed to be at risk
of developing cancer analysis of the original data showed that many of the
characteristics differed significantly between benign and malignant samples you
can use the values of the cell characteristics and samples from other
patients to give an early indication of whether a new sample might be benign or
malignant you can use support vector machine or SVM as a classifier to train
your model to understand patterns within the data that might show benign or
malignant cells once the model has been trained it can be used to predict your
new or unknown cell with rather high accuracy now let me give you a formal
definition of SVM a support vector machine is a supervised algorithm that
can classify cases by finding a separator SVM works by first mapping
data to a high dimensional feature space so that data points can be categorized
even when the data are not otherwise linearly separable then a separator is
estimated for the data the data should be transformed in such a way that a
separator could be drawn as a hyperplane for example consider the following
figure which shows the distribution of a small set of cells only based on their
unit size and clump thickness as you can see the data points fall into two
different categories the two categories can be separated with a curve but not with a line that is it represents a linearly non-separable data set which is the case
for most real-world data sets we can transfer this data to a higher
dimensional space for example mapping it to a three dimensional space after the
transformation the boundary between the two categories can be defined by a
hyperplane as we are now in three dimensional space the separator is shown
as a plane this plane can be used to classify new or unknown cases therefore
the SVM algorithm outputs an optimal hyperplane that categorizes new examples
now there are two challenging questions to consider first how do we transfer
data in such a way that a separator could be drawn as a hyperplane and two
how can we find the best or optimized hyperplane separator after
transformation let’s first look at transforming data to see how it works
for the sake of simplicity imagine that our data set is one dimensional data
this means we have only one feature X as you can see it is not linearly separable
so what can we do here well we can transfer it into a two dimensional space
for example you can increase the dimension of data by mapping X into a
new space using a function with outputs x and x squared now the data is linearly
separable right notice that as we are in a two dimensional space the hyperplane is a line dividing a plane into two parts where each class lies on either side now we can use this line to classify new cases
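A tiny NumPy sketch of the mapping just described, with made-up one-dimensional points: class 1 sits at both ends of the line, so no single threshold on x separates the classes, but after mapping x to (x, x squared) a horizontal line does.

```python
import numpy as np

x = np.array([-3, -2, -1, 0, 1, 2, 3], dtype=float)
y = np.array([ 1,  1,  0, 0, 0, 1, 1])        # not linearly separable on the line

# map each point x into the two-dimensional space (x, x**2)
X_mapped = np.column_stack([x, x ** 2])

# in the new space the horizontal line x**2 = 2.5 acts as the separating hyperplane
print((X_mapped[:, 1] > 2.5).astype(int))     # reproduces y exactly
```

basically mapping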
data into a higher dimensional space is called kernelling and the mathematical function used for the transformation is known as the kernel function and can be
of different types such as linear polynomial radial basis function or RBF
and sigmoid each of these functions has its own characteristics its pros and
cons and its equation but the good news is that you don’t need to know them as
most of them are already implemented in libraries of data science programming
languages also as there’s no easy way of knowing which function performs best
with any given data set we usually choose different functions in turn
and compare the results now we get to another question specifically how do we
find the right or optimized separator after transformation basically SVMs are
based on the idea of finding a hyperplane that best divides a dataset
into two classes as shown here as we’re in a two dimensional space you can think
of the hyperplane as a line that linearly separates the blue points from
the red points one reasonable choice as the best hyperplane is the one that
represents the largest separation or margin between the two classes so the
goal is to choose a hyperplane with as big a margin as possible examples
closest to the hyperplane are support vectors it is intuitive that only
support vectors matter for achieving our goal and thus other training examples
can be ignored we try to find the hyperplane in such a way that it has the
maximum distance to support vectors please note that the hyperplane and
boundary decision lines have their own equations so finding the optimized
hyperplane can be formalized using an equation which involves quite a bit more
math so I’m not going to go through it here in detail that said the hyperplane
is learned from training data using an optimization procedure that maximizes
the margin and like many other problems this optimization problem can also be
solved by gradient descent which is out of scope of this video therefore the
output of the algorithm is the values W and B for the line you can make
classifications using this estimated line it is enough to plug in input
values into the line equation then you can calculate whether an unknown point
is above or below the line if the equation returns a value greater than 0
then the point belongs to the first class which is above the line and vice
versa the two main advantages of support vector machines are that they’re
accurate in high dimensional spaces and they use a subset of training points in
the decision function called support vectors so it’s also memory efficient
the disadvantages of support vector machines include the fact that the
algorithm is prone to overfitting if the number of features is much greater than the number of samples also SVMs do not directly
provide probability estimates which are desirable in most classification
problems and finally svms are not very efficient computationally if your
dataset is very big such as when you have more than 1,000 rows and now our
final question is in which situation should I use SVM well SVM is good for
image analysis tasks such as image classification and handwritten digit
recognition also SVM is very effective in text mining tasks particularly due to its effectiveness in dealing with high dimensional data for example it is used for detecting spam text category assignment and sentiment analysis
another application of SVM is in gene expression data classification again
because of its power in high dimensional data classification SVM can also be used
for other types of machine learning problems such as regression outlier
detection and clustering I’ll leave it to you to explore more about these particular problems
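As a hedged illustration of the ideas above, the short scikit-learn sketch below trains an SVM classifier on the breast cancer data set that ships with the library (standing in for the cell-sample data mentioned earlier) and tries the common kernel functions in turn, since there is no easy way to know in advance which one performs best.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# benign vs malignant cell samples, bundled with scikit-learn
X, y = datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)

scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# try different kernel functions in turn and compare the results
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    print(kernel, accuracy_score(y_test, clf.predict(X_test)))
```

hello and welcome in this video we’ll give you a high-level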
introduction to clustering its applications and different types of
clustering algorithms let’s get started imagine that you have a customer data
set and you need to apply customer segmentation on this historical data
customer segmentation is the practice of partitioning a customer base into groups
of individuals that have similar characteristics it is a significant
strategy as it allows a business to target specific groups of customers so
as to more effectively allocate marketing resources for example one
group might contain customers who are high profit and low risk that is more likely to purchase products or subscribe to a service knowing this
information allows a business to devote more time and attention to retaining
these customers another group might include customers from nonprofit
organizations and so on a general segmentation process is not
usually feasible for large volumes of varied data therefore you need an
analytical approach to deriving segments and groups from large data sets
customers can be grouped based on several factors including age gender
interests spending habits and so on the important requirement is to use the available data to understand and identify how customers are
similar to each other let’s learn how to divide a set of customers into
categories based on characteristics they share one of the most adopted approaches
that can be used for customer segmentation is clustering clustering can group data in an unsupervised way based only on the similarity of customers to each
other it will partition your customers into mutually exclusive groups for
example into three clusters the customers in each cluster are similar to
each other demographically now we can create a profile for each group
considering the common characteristics of each cluster for example the first
group is made up of affluent and middle-aged customers the second is made
up of young educated and middle-income customers and the third group includes
young and low-income customers finally we can assign each individual in our
dataset to one of these groups or segments of customers now imagine that you cross join this segmented data set with the data set of the products or
services that customers purchase from your company this information would
really help to understand and predict the differences in individual customers
preferences and their buying behaviors across various products indeed having
this information would allow your company to develop highly personalized
experiences for each segment customer segmentation is one of the popular
usages of clustering cluster analysis also has many other applications in
different domains so let’s first define clustering and then we’ll look at other
applications clustering means finding clusters in a data set unsupervised so
what is a cluster a cluster is a group of data points or objects in a data set
that are similar to other objects in the group and dissimilar to data points in
other clusters now the question is what is different between clustering and
classification let’s look at our customer data set again classification
algorithms predict categorical class labels this means assigning instances to
predefined classes such as defaulted or not defaulted for example if an analyst
wants to analyze customer data in order to know which customers might default on
their payments she uses a labeled data set as training data and uses
classification approaches such as a decision tree support vector machine
or SVM or logistic regression to predict the default value for a new or unknown
customer generally speaking classification is a
supervised learning where each training data instance belongs to a particular
class in clustering however the data is unlabeled and the process is
unsupervised for example we can use a clustering algorithm such as k-means to
group similar customers as mentioned and assign them to a cluster based on
whether they share similar attributes such as age education and so on while
I’ll be giving you some examples in different industries I’d like you to
think about more samples of clustering in the retail industry clustering is
used to find associations among customers based on their demographic
characteristics and use that information to identify buying patterns of various
customer groups also it can be used in recommendation systems to find a group
of similar items or similar users and use it for collaborative filtering to
recommend things like books or movies to customers in banking analysts find
clusters of normal transactions to find the patterns of fraudulent credit card
usage also they use clustering to identify clusters of customers for
instance to find loyal customers versus churn customers in the insurance
industry clustering is used for fraud detection in claims analysis or to
evaluate the insurance risk of certain customers based on their segments in
publication media clustering is used to auto categorize news based on its
content or to tag news then cluster it so as to recommend similar news articles
to readers in medicine it can be used to characterize patient behavior based on
their similar characteristics so as to identify successful medical therapies
for different illnesses or in biology clustering is used to group genes with
similar expression patterns or to cluster genetic markers to identify
family ties if you look around you can find many other applications of
clustering but generally clustering can be used for one of the following
purposes exploratory data analysis summary generation or reducing the scale
outlier detection especially to be used for fraud detection or noise removal
finding duplicates in datasets or as a pre-processing step for either
prediction other data mining tasks or as part of a
complex system let’s briefly look at different clustering algorithms and
their characteristics partition based clustering is a group of clustering
algorithms that produces sphere-like clusters such as k-means k median or
fuzzy c means these algorithms are relatively efficient and are used for
medium and large sized databases hierarchical clustering algorithms
produce trees of clusters such as agglomerative and divisive algorithms
this group of algorithms are very intuitive and are generally good for use
with small size datasets density based clustering algorithms produce arbitrary
shaped clusters they are especially good when dealing with spatial clusters or when there is noise in your data set for example the DBSCAN algorithm
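As a quick, hedged comparison of these three families, the scikit-learn sketch below runs a partition based, a hierarchical and a density based algorithm on the same synthetic two-moons data set, chosen because its clusters have arbitrary shapes.

```python
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

# two interleaving half-moons: an arbitrary-shaped clustering problem
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

partition_based = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
hierarchical    = AgglomerativeClustering(n_clusters=2).fit_predict(X)
density_based   = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)   # -1 marks noise

print(set(partition_based), set(hierarchical), set(density_based))
```

hello and welcome in this video we’ll be covering k-means clustering so let’s get started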
imagine that you have a customer data set and you need to apply a customer
segmentation on this historical data customer segmentation is the practice of
partitioning a customer base into groups of individuals that have similar
characteristics one of the algorithms that can be used for customer
segmentation is k-means clustering k-means can group data in an unsupervised way based only on the similarity of customers to each other let’s define this technique
more formally there are various types of clustering algorithms such as
partitioning hierarchical or density based clustering k-means is a type of
partitioning clustering that is it divides the data into K non-overlapping subsets or clusters without any cluster internal structure
or labels this means it’s an unsupervised algorithm objects within a
cluster are very similar and objects across different clusters are very
different or dissimilar as you can see for using k-means we have to find
similar samples for example similar customers now we face a couple of key
questions first how can we find the similarity of samples in clustering and
then how do we measure how similar two customers are with regard to their
demographics though the objective of k-means is to form clusters in such a
way that similar samples go into a cluster and dissimilar samples fall in
different clusters it can be shown that instead of a similarity metric we can
use dissimilarity metrics in other words conventionally the distance of samples
from each other is used to shape the clusters so we can say k-means tries to
minimize the intra cluster distances and maximize the inter cluster distances now
the question is how can we calculate the dissimilarity or distance of two cases such as two customers assume that we have two customers we’ll call them
customer one and two let’s also assume that we have only one feature for each
of these two customers and that feature is age we can easily use a specific type
of Minkowski distance to calculate the distance of these two customers indeed
it is the Euclidean distance the distance of x1 from x2 is the square root of (34 minus 30) squared which is 4 what about if we have more than one feature for example age
and income for example if we have income and age for each customer we can still
use the same formula but this time in a two dimensional space also we can use the same distance metric for multi-dimensional vectors of course we
have to normalize our feature set to get the accurate dissimilarity measure there
are other dissimilarity measures as well that can be used for this purpose but it
is highly dependent on data type and also the domain that clustering is done
for it for example you may use Euclidean
distance cosine similarity average distance and so on indeed the similarity
measure highly controls how the clusters are formed so it is recommended to understand the domain knowledge of your data set and the data type of its features and then choose a meaningful distance measurement
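As a small worked sketch of the distance calculation above (the income figures are made up), in NumPy:

```python
import numpy as np

# one feature (age): customers aged 34 and 30
print(np.sqrt((34 - 30) ** 2))              # 4.0, the Euclidean distance

# two features (age, income)
c1 = np.array([34.0, 190.0])
c2 = np.array([30.0, 200.0])
print(np.sqrt(np.sum((c1 - c2) ** 2)))      # Euclidean distance in two dimensions

# normalizing the feature set first keeps income from dominating the distance
both = np.vstack([c1, c2])
scaled = (both - both.mean(axis=0)) / both.std(axis=0)
print(np.sqrt(np.sum((scaled[0] - scaled[1]) ** 2)))
```

now let’s see how k-means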
clustering works for the sake of simplicity let’s assume that our dataset
has only two features the age and income of customers this means it’s a two
dimensional space we can show the distribution of customers using a
scatterplot the y axis indicates age and the x axis shows income of customers we
try to cluster the customer data set into distinct groups or clusters based
on these two dimensions in the first step we should determine the number of
clusters the key concept of the k-means algorithm is that it randomly picks a
center point for each cluster it means we must initialize K which represents
number of clusters essentially determining the number of clusters in a
data set or K is a hard problem in k-means that we will discuss later for
now let’s put k equals three here for our sample data set in this slide we have
three representative points for our clusters these three data points are
called centroids of clusters and should be of the same feature size as our customer feature set there are two approaches to choose these centroids one we can
randomly choose three observations out of the data set and use these
observations as the initial means or two we can create three random points as centroids of the clusters which is our choice that is shown in the plot with
red color after the initialization step which was defining the centroid of each
cluster we have to assign each customer to the closest center for this
purpose we have to calculate the distance of each data point or in our
case each customer from the centroid points as mentioned before depending on
the nature of the data and the purpose for which clustering is being used
different measures of distance may be used to place items into clusters
therefore you will form a matrix where each row represents the distance of a
customer from each centroid it is called the distance matrix the main objective
of k-means clustering is to minimize the distance of data points from the
centroid of its own cluster and maximize the distance from other cluster centroids
so in this step we have to find the closest centroid to each data point we
can use the distance matrix to find the nearest centroid to data points after finding the closest centroid for each data point we assign each data point to that
cluster in other words all the customers will fall to a cluster based on their
distance from centroids we can easily say that it does not result in good clusters because the centroids were chosen randomly at first indeed the model would have a high error here error is the total distance of each point from its centroid it can be shown as
within cluster sum of squares error intuitively we try to reduce this error
it means we should shape the clusters in such a way that the distance of all members of a cluster from its centroid is minimized now the
question is how can we turn it into better clusters with less error okay we
move the centroids in the next step each cluster center will be updated to be the mean of the data points in its cluster indeed each centroid moves according to
their cluster members in other words the centroid of each of the three clusters
becomes the new mean for example if point A's coordinates are 7.4 and 3.6 and point B's features are 7.8 and 3.8 the new centroid of this cluster with two points would be the average of them which is 7.6 and 3.7 now we have new
centroids as you can guess once again we will have to calculate the distance of
all points from the new centroids the points are re-clustered and the centroids move again this continues until the centroids no longer move please note that whenever a centroid moves each point's distance to
the centroid needs to be measured again yes k-means is an iterative algorithm
and we have to repeat steps two to four until the algorithm converges in each
iteration it will move the centroids calculate the distances from new
centroids and assign data points to the nearest centroid it results in the
clusters with minimum error or the most dense clusters however as it is a
heuristic algorithm there is no guarantee that it will converge to the
global optimum and the result may depend on the initial clusters it means this
algorithm is guaranteed to converge to a result but the result may be a local
optimum ie not necessarily the best possible outcome to solve this problem
it is common to run the whole process multiple times with different starting
conditions this means with randomized starting centroids it may give a better outcome and as the algorithm is usually very fast it wouldn’t be any problem to run it multiple times
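A minimal scikit-learn sketch of the whole procedure on a made-up age and income table; the n_init parameter re-runs k-means with different random starting centroids and keeps the best result, which is the multiple-run advice given above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# made-up customer data: columns are age and income
X = np.array([[25, 30], [27, 35], [45, 80], [48, 90], [33, 40], [52, 95]], dtype=float)
X = StandardScaler().fit_transform(X)

km = KMeans(n_clusters=3, init="random", n_init=12, random_state=0).fit(X)
print(km.labels_)           # cluster assignment of each customer
print(km.cluster_centers_)  # final centroids
```

hello and welcome in this video we’ll look at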
k-means accuracy and characteristics let’s get started let’s define the
algorithm more concretely before we talk about its accuracy a k-means algorithm
works by randomly placing K centroids one for each cluster
the farther apart the clusters are placed the better the next step is to
calculate the distance of each data point or object from the centroids
Euclidean distance is used to measure the distance from the object to the
centroid please note however that you can also use different types of distance
measurements not just Euclidean distance Euclidean distance is used because it’s
the most popular then assign each data point or object to its closest centroid
creating a group next once each data point has been classified to a group
recalculate the position of the K centroids the new centroid position is
determined by the mean of all points in the group finally this continues until
the centroids no longer move now the question is how can we evaluate the
goodness of the clusters formed by k-means in other words how do we
calculate the accuracy of k-means clustering one way is to compare the
clusters with the ground truth if it’s available
however because k-means is an unsupervised algorithm we usually don’t have ground truth in real world problems to be used but there is still a
way to say how bad each cluster is based on the objective of the k-means this
value is the average distance between data points within a cluster also
average of the distances of data points from their cluster centroids can be used
as a metric of error for the clustering algorithm essentially determining the
number of clusters in a dataset or K as in the k-means algorithm is a frequent
problem in data clustering the correct choice of K is often ambiguous because
it’s very dependent on the shape and scale of the distribution of points in a
data set there are some approaches to address this problem but one of the
techniques that is commonly used is to run the clustering across different values of K and look at a metric of accuracy for clustering this metric can be the mean distance between data points and their cluster centroid which indicates
how dense our clusters are or to what extent we minimize the error of
clustering then looking at the change of this metric we can find the best value
for K but the problem is that with increasing the number of clusters the
distance of centroids two data points will always reduce this means increasing
K will always decrease the error so the value of the metric as a function of K
is plotted and the elbow point is determined where the rate
of decrease sharply shifts it is the right K for clustering this method is called the elbow method
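A short sketch of the elbow method in scikit-learn on a synthetic data set; inertia_ is the within-cluster sum of squares, so printing or plotting it for several values of K and looking for the point where its rate of decrease sharply shifts gives the elbow.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# run k-means for a range of K values and record the clustering error
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))
```

so let’s recap k-means clustering k-means is a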
partition based clustering which is relatively efficient on medium and large-sized datasets b it produces sphere-like clusters because the clusters are shaped around the centroids and c its drawback is that we should pre
specify the number of clusters and this is not an easy task hello and welcome in
this video we’ll be covering hierarchical clustering so let’s get
started let’s look at this chart an international team of scientists led by
UCLA biologists use this dendrogram to report genetic data from more than 900
dogs from 85 breeds and more than 200 wild gray wolves worldwide including
populations from North America Europe the Middle East and East Asia they use
molecular genetic techniques to analyze more than 48,000 genetic markers this
diagram shows hierarchical clustering of these animals based on the similarity in
their genetic data hierarchical clustering algorithms build a hierarchy
of clusters where each node is a cluster consisting of the clusters of its
daughter nodes strategies for hierarchical clustering generally fall
into two types divisive and agglomerative divisive is top-down
so you start with all observations in a large cluster and break it down into
smaller pieces think about divisive as dividing the cluster agglomerative is the opposite of divisive so it is bottom-up where each
observation starts in its own cluster and pairs of clusters are merged
together as they move up the hierarchy agglomeration means to amass or collect
things which is exactly what this does with the cluster the agglomerative
approach is more popular among data scientists and so it is the main subject
of this video let’s look at a sample of agglomerative clustering this method
builds the hierarchy from the individual elements by progressively merging
clusters in our example let’s say we want to cluster 6 cities in Canada based
on their distances from one another they are Toronto Ottawa Vancouver
Montreal Winnipeg and Edmonton we construct a distance matrix at this
stage where the number in row i and column j is the distance between the ith and jth cities in fact this table shows the distances between each pair of
cities the algorithm is started by assigning each city to its own cluster
so if we have six cities we have six clusters each containing just one city
let’s note each city by showing the first two characters of its name the
first step is to determine which cities let’s call them clusters from now on to
merge into a cluster usually we want to take the two closest clusters according
to the chosen distance looking at the distance matrix Montreal and Ottawa are
the closest clusters so we make a cluster out of them please notice that
we just use a simple one-dimensional distance feature here but our objects can be multi-dimensional and the distance measurement can be Euclidean Pearson average distance or many others depending on data type and domain knowledge anyhow we have to merge these two closest cities in the distance
matrix as well so rows and columns are merged as the cluster is constructed as
you can see in the distance matrix rows and columns related to Montreal and
Ottawa cities are merged as the cluster is constructed then the distances from
all cities to this new merged cluster get updated but how for example how do
we calculate the distance from Winnipeg to the Ottawa Montreal cluster well
there are different approaches but let’s assume for example we just select the
distance from the centre of the Ottawa Montreal cluster to Winnipeg updating
the distance matrix we now have one less cluster next we look for the closest
clusters once again in this case Ottawa Montreal and Toronto are the closest
ones which creates another cluster in the next step the closest distance is
between the Vancouver cluster and the Edmonton cluster forming a new cluster
their data in the matrix table gets updated essentially the rows and columns
are merged as the clusters are merged and the distance updated this is a common way to implement this type of clustering and has the benefit of caching distances between clusters in the same way the agglomerative
algorithm proceeds by merging clusters and we repeat it until all clusters are
merged and the tree becomes completed it means until all cities are clustered
into a single cluster of size 6 hierarchical clustering is typically
visualized as a dendrogram as shown on this slide
each merge is represented by a horizontal line the y coordinate of the
horizontal line is the similarity of the two clusters that were merged where
cities are viewed as singleton clusters by moving up from the bottom layer to
the top node a dendrogram allows us to reconstruct the history of merges that
resulted in the depicted clustering essentially hierarchical clustering does
not require a pre specified number of clusters however in some applications we
want a partition of disjoint clusters just as in flat clustering in those
cases the hierarchy needs to be cut at some point for example here by cutting at a specific level of similarity we create three clusters of similar cities
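For reference, a small SciPy sketch of the same bottom-up process; the city distances below are made-up round numbers, not the real figures from the slide.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

cities = ["Toronto", "Ottawa", "Vancouver", "Montreal", "Winnipeg", "Edmonton"]
# illustrative symmetric distance matrix (km)
D = np.array([
    [   0,  450, 4370,  540, 2230, 3470],
    [ 450,    0, 4790,  200, 2160, 3580],
    [4370, 4790,    0, 4900, 2300, 1160],
    [ 540,  200, 4900,    0, 2300, 3760],
    [2230, 2160, 2300, 2300,    0, 1300],
    [3470, 3580, 1160, 3760, 1300,    0],
], dtype=float)

# agglomerative clustering on the condensed distance matrix
Z = linkage(squareform(D), method="average")

# cut the dendrogram so that we obtain three clusters of similar cities
print(dict(zip(cities, fcluster(Z, t=3, criterion="maxclust"))))
```

hello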
and welcome in this video we’ll be covering more details about hierarchical
clustering let’s get started let’s look at the agglomerative algorithm for hierarchical clustering remember that agglomerative clustering is a bottom-up
approach let’s say our dataset has n data points first we want to create n
clusters one for each data point then each point is assigned as a cluster next
we want to compute the distance proximity matrix which will be an N by n
table after that we want to iteratively run the following steps until the
specified cluster number is reached or until there is only one cluster left
first merge the two nearest clusters distances are computed already in the
proximity matrix second update the proximity matrix with the new values we
stop after we’ve reached the specified number of clusters or there is
only one cluster remaining with the results stored in the dendrogram so in
the proximity matrix we have to measure the distances between clusters and also
merge the clusters that are nearest so the key operation is the computation
of the proximity between the clusters with one point and also clusters with
multiple data points at this point there are a number of key questions that need
to be answered for instance how do we measure the distances between these
clusters and how do we define the nearest among clusters we also can ask
which points do we use first let’s see how to calculate the distance between
two clusters with one point each let’s assume that we have a data set of
patients and we want to cluster them using hierarchical clustering so our data
points are patients with a feature set of three dimensions for example age body
mass index or BMI and blood pressure we can use different distance measurements
to calculate the proximity matrix for instance Euclidean distance so if we
have a data set of n patients we can build an N by n dissimilarity distance
matrix it will give us the distance of clusters with one data point however as
mentioned we merge clusters in agglomerative clustering now the
question is how can we calculate the distance between clusters when there are
multiple patients in each cluster we can use different criteria to find the
closest clusters and merge them in general it completely depends on the
data type dimensionality of data and most importantly the domain knowledge of
the data set in fact different approaches to defining the distance
between clusters distinguish the different algorithms as you might
imagine there are multiple ways we can do this
the first one is called single linkage clustering single linkage is defined as
the shortest distance between two points in each cluster such as point A and point B next up is complete linkage clustering this time we are finding the longest distance between the points in each cluster such as the distance between point A and B the third type of linkage is average
linkage clustering or the mean distance this means we’re looking at the average
distance of each point from one cluster to every point in another cluster the
final linkage type to be reviewed is centroid linkage clustering centroid is
the average of the feature sets of points in a cluster this linkage takes
into account the centroid of each cluster when determining the minimum distance there are three main advantages to using hierarchical clustering first
we do not need to specify the number of clusters required for the algorithm
second hierarchical clustering is easy to implement and third the dendrogram
produced is very useful in understanding the data there are some disadvantages as
well first the algorithm can never undo any previous steps so for example if the algorithm clusters two points and later on we see that the connection was not a
good one the program cannot undo that step second the time complexity for the
clustering can result in very long computation times in comparison with
efficient algorithms such as k-means finally if we have a large data set it
can become difficult to determine the correct number of clusters by the
dendrogram now let’s compare hierarchical clustering with k-means
k-means is more efficient for large datasets in contrast to k-means
hierarchical clustering does not require the number of clusters to be specified
hierarchical clustering gives more than one partitioning depending on the
resolution whereas K means gives only one partitioning of the data
hierarchical clustering always generates the same clusters in contrast with
k-means that returns different clusters each time it is run due to random
initialization of centroids
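A hedged scikit-learn sketch comparing the linkage criteria described above on a made-up patient table (age, BMI, blood pressure); note that centroid linkage is not offered by AgglomerativeClustering, though SciPy's linkage function provides it.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering

# made-up patient records: age, BMI, blood pressure
X = np.array([[34, 22.1, 118], [41, 27.5, 130], [29, 21.0, 115],
              [55, 31.2, 145], [62, 29.8, 150], [47, 26.4, 138]], dtype=float)
X = StandardScaler().fit_transform(X)

# compare single, complete and average linkage on the same data
for link in ("single", "complete", "average"):
    labels = AgglomerativeClustering(n_clusters=2, linkage=link).fit_predict(X)
    print(link, labels)
```

hello and welcome in this video we’ll be covering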
DBSCAN a density based clustering algorithm which is appropriate to use
when examining spatial data so let’s get started most of the traditional
clustering techniques such as k-means hierarchical and fuzzy clustering can be
used to group data in an unsupervised way however when applied to tasks with
arbitrary shaped clusters or clusters within clusters traditional techniques
might not be able to achieve good results that is elements in the same
cluster might not share enough similarity or the performance may be
poor additionally while partitioning based algorithms such as k-means may be
easy to understand and implement in practice the algorithm has no notion of
outliers that is all points are assigned to a cluster even if they do not belong
in any in the domain of anomaly detection this causes problems as
anomalous points will be assigned to the same cluster as normal data points the anomalous points pull the cluster centroid towards them
making it harder to classify them as anomalous points in contrast density
based clustering locates regions of high density that are separated from one
another by regions of low density density in this context is defined as
the number of points within a specified radius a specific and very popular type
of density based clustering is DBSCAN DBSCAN is particularly effective for
tasks like class identification in a spatial context the wonderful attribute
of the DB scan algorithm is that it can find out any arbitrary shaped cluster
without getting affected by noise for example this map shows the location of
weather stations in Canada DB scan can be used here to find the group of
stations which show the same weather condition as you can see it not only
finds different arbitrary shape clusters it can find the denser part of data
centered samples by ignoring less dense areas or noises now let’s look at this
clustering algorithm to see how it works DB scan stands for density based spatial
clustering of applications with noise this technique is one of the most common
clustering algorithms which works based on the density of objects DBSCAN works on
the idea that if a particular point belongs to a cluster it should be near
to lots of other points in that cluster it works based on two parameters radius and minimum points R determines a specified radius such that if it includes enough points within it we call it a dense area M determines the minimum number of data points we want in a neighborhood to define the cluster let’s
define radius as two units for the sake of simplicity assume it as radius of two
centimeters around a point of interest also let’s set the minimum point
or M to be 6 points including the point of interest to see how DB scan works we
have to determine the type of points each point in our data set can be either
a core border or outlier point don’t worry I’ll explain what these points are
in a moment but the whole idea behind the DB scan algorithm is to visit each
point and find its type first then we group points as clusters based on their types let’s pick a point randomly first we check to see whether it’s a
core data point so what is a core point a data point is a core point if within its R neighborhood there are at least M points for example as there are six points in the two centimeter neighborhood of the red point we mark this point as a core point okay what happens if it’s not a core point let’s look at
another point is this point a core point no as you can see there are only five
points in this neighborhood including the yellow point so what kind of point
is this one in fact it is a border point what is a border point a data point is a
border point if a its neighborhood contains fewer than M data points or b it is reachable from some core point here reachability means it is within R distance from a core point it means that even though the yellow point is within
the 2-centimeter neighborhood of the red point it is not by itself a core point
because it does not have at least six points in its neighborhood we continue
with the next point as you can see it is also a core point and all points around it which are not core points are border points next core point and next core
point let’s pick this point you can see it is not a core point nor is it a
border point so we label it as an outlier what is an outlier an outlier is
a point that is not a core point and also is not close enough to be reachable
from a core point we continue and visit all the points in the data set and label
them as either core border or outlier the next step is to connect core points that
are neighbors and put them in the same cluster so a cluster is formed as at
least one core point plus all reachable core points plus all their borders it
simply shapes all the clusters and finds outliers as well let’s review this one more time to see why DBSCAN is cool DBSCAN can find arbitrarily shaped clusters it can even find a cluster completely surrounded by a different cluster DBSCAN has a notion of noise and is robust to outliers on top of that DBSCAN is very practical for use in many real-world problems because it does not require one to specify the number of
clusters such as K in k-means
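As a short, hedged illustration, the scikit-learn sketch below runs DBSCAN on a synthetic two-moons data set, with eps playing the role of the radius R and min_samples the role of M, and then counts the core, border and outlier points the algorithm found.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

# arbitrary-shaped clusters with a little noise
X, _ = make_moons(n_samples=300, noise=0.08, random_state=0)

db = DBSCAN(eps=0.2, min_samples=6).fit(X)

labels = db.labels_                      # -1 marks outliers (noise)
core = np.zeros(len(X), dtype=bool)
core[db.core_sample_indices_] = True     # core points found by the algorithm
border = (labels != -1) & ~core          # in a cluster but not core

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters:", n_clusters, "core:", core.sum(),
      "border:", border.sum(), "outliers:", (labels == -1).sum())
```

hello and welcome in this video we’ll be going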
through a quick introduction to recommendation systems so let’s get
started even though people’s tastes may vary
they generally follow patterns by that I mean that there are similarities in the things that people tend to like or another way to look at it is that people
tend to like things in the same category or things that share the same
characteristics for example if you’ve recently purchased a book on machine
learning and Python and you’ve enjoyed reading it it’s very likely that you’ll
also enjoy reading a book on data visualization people also tend to have
similar taste to those of the people they’re close to in their lives
recommender systems try to capture these patterns and similar behaviors to help
predict what else you might like recommender systems have many
applications that I’m sure you’re already familiar with indeed recommender
systems are usually at play on many websites for example suggesting books on
Amazon and movies on Netflix in fact everything on Netflix's website is driven by customer selection if a certain movie gets viewed frequently
enough Netflix’s recommender system ensures
that that movie gets an increasing number of recommendations another
example can be found in a daily use mobile app where a recommender engine is
used to recommend anything from where to eat or what job to apply to on social
media sites like Facebook or LinkedIn regularly recommend friendships
recommender systems are even used to personalize your experience on the web
for example when you go to a news platform website a recommender system
will make note of the types of stories that you clicked on and make
recommendations on which types of stories you might be interested in
reading in future there are many of these types of examples and they are
growing in number every day so let’s take a closer look at the main benefits
of using a recommendation system one of the main advantages of using
recommendation systems is that users get a broader exposure to many different
products they might be interested in this exposure encourages users towards
continual usage or purchase of their product not only does this provide a
better experience for the user but it benefits the service provider as
well with increased potential revenue and better security for its customers
there are generally two main types of recommendation systems content-based and
collaborative filtering the main difference between each can be summed up
by the type of statement that a consumer might make for instance the main
paradigm of a content-based recommendation system is driven by the
statement show me more of the same of what I’ve liked before content based
systems try to figure out what a user’s favorite aspects of an item are and then
make recommendations on items that share those aspects collaborative filtering is
based on a user saying tell me what’s popular among my neighbors because I might like it too collaborative filtering techniques find
similar groups of users and provide recommendations based on similar tastes
within that group in short it assumes that a user might be interested in what
similar users are interested in also there are hybrid recommender systems
which combine various mechanisms in terms of implementing recommender
systems there are two types memory based and model based in memory based
approaches we use the entire user item data set to generate a recommendation
system it uses statistical techniques to approximate users or items examples of
these techniques include Pearson correlation cosine similarity and
Euclidean distance among others in model-based approaches a model of users
is developed in an attempt to learn their preferences models can be created
using machine learning techniques like regression clustering classification and
so on hello and welcome in this video we’ll be covering content-based
recommender systems so let’s get started a content-based recommendation system
tries to recommend items to users based on their profile the user's profile revolves around that user's preferences and tastes it is shaped based on user
ratings including the number of times that user has clicked on different items
or perhaps even liked those items the recommendation process is based on the
similarity between those items similarity or closeness of items is
measured based on the similarity in the content of those items when we say
content we’re talking about things like the item's category tag genre and so on for example if we have four movies and if the user
likes or rates the first two items and if item three is similar to item one in
terms of their genre the engine will also recommend item three to the user in
essence this is what content-based recommender system engines do now let’s
dive into a content-based recommender system to see how it works let’s assume
we have a data set of only six movies this data set shows movies that our user
has watched and also the genre of each of the movies for example Batman vs
Superman is in the adventure superhero genre and guardians of the galaxy is in
the comedy adventure superhero and science fiction genres let’s say the
user has watched and rated three movies so far and she has given a rating of 2 out of 10 to the first movie 10 out of 10 to the second movie and 8 out of 10 to the third the task of the recommender engine is to recommend one of the three
candidate movies to this user or in other words we want to predict what the user's possible rating would be of the three candidate movies if she were to
watch them to achieve this we have to build the user profile first we create a
vector to show the users ratings for the movies that she’s already watched we
call it input user ratings then we encode the movies through the one-hot encoding approach the genres of the movies are used here as a feature set we use the first
three movies to make this matrix which represents the movie feature set matrix
if we multiply these two matrices we can get the weighted feature set for the
movies let’s take a look at the result this matrix is also called the weighted genre matrix and represents the interests of the user for each genre
based on the movies that she’s watched now given the weighted genre matrix we can shape the profile of our active user essentially we can aggregate the
weight of genres and then normalize them to find the user profile it clearly
indicates that she likes superhero movies more than other genres we use
this profile to figure out what movie is proper to recommend to this user
recall that we also had three candidate movies for recommendation that haven’t been watched by the user we encode these movies as well now we’re in the position where we have to figure out which of them is most
suited to be recommended to the user to do this we simply multiply the user
profile matrix by the candidate movie matrix which results in the weighted
movies matrix it shows the weight of each genre with
respect to the user profile now if we aggregate these weighted ratings we get
the active user's possible interest level in these three movies in essence it is our recommendation list which we can sort to rank the movies and recommend them to
the user for example we can say that the Hitchhiker’s Guide to the galaxy has the
highest score in our list and is proper to recommend to the user
now you can come back and fill the predicted ratings for the user so to
recap what we’ve discussed so far the recommendation in a content based system
is based on the user's taste and the content or feature set of the items such a model is
very efficient however in some cases it doesn’t work for example assume that we
have a movie in the drama genre which the user has never watched so this genre
would not be in her profile therefore she'll only get recommendations related to genres that are already in her profile and the recommender engine
may never recommend any movie within other genres this problem can be solved
by other types of recommender systems such as collaborative filtering
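A compact NumPy sketch of the weighted-genre calculation described above; the one-hot encodings and ratings are made-up stand-ins for the movies in the example, not the actual values from the slides.

```python
import numpy as np

# one-hot genre matrix for the three movies the user has already rated
# columns: adventure, superhero, comedy, sci-fi, drama
watched = np.array([
    [1, 1, 0, 0, 0],
    [1, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
], dtype=float)
ratings = np.array([2.0, 10.0, 8.0])      # the user's ratings for those movies

# weighted genre matrix, aggregated and normalized into the user profile
profile = ratings @ watched
profile = profile / profile.sum()

# candidate movies the user has not watched yet, encoded the same way
candidates = np.array([
    [0, 0, 1, 1, 0],
    [1, 0, 0, 0, 1],
    [0, 1, 0, 0, 1],
], dtype=float)

# recommendation scores: a higher value means a closer match to the profile
print(candidates @ profile)
```

hello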
and welcome in this video we’ll be covering a recommender system technique
called collaborative filtering so let’s get started
collaborative filtering is based on the fact that relationships exist between
products and people’s interests many recommendation systems use collaborative
filtering to find these relationships and to give an accurate recommendation
of a product that the user might like or be interested in collaborative filtering
has basically two approaches user based and item based user based collaborative
filtering is based on the users similarity or neighborhood item based
collaborative filtering is based on similarity among items let’s first look
at the intuition behind the user based approach in user based collaborative
filtering we have an active user for whom the recommendation is aimed the collaborative filtering engine first looks for users who are similar that is users who share the active user's rating patterns collaborative filtering bases this similarity on things like history
preference and choices that users make when buying watching or enjoying
something for example movies that similar users have rated highly then it
uses the ratings from these similar users to predict the possible ratings by
the active user for a movie that she had not previously watched for instance if
two users are similar or are neighbors in terms of the movies they are interested in we can recommend a movie to the active user that her neighbor has already seen
now let’s dive into the algorithm to see how all of this works assume that we
have a simple user item matrix which shows the ratings of four users for five
different movies let’s also assume that our active user has watched and rated
three out of these five movies let’s find out which of the two movies that
our active user hasn’t watched should be recommended to her the first step is to
discover how similar the active user is to the other users how do we do this
well this can be done through several different statistical and vectorial techniques such as distance or similarity measurements including
Euclidean distance Pearson correlation cosine similarity and so on to calculate
the level of similarity between two users we use the three movies that both
the users have rated in the past regardless of what we use for similarity
measurement let’s say for example the similarity could be 0.7 0.9 and 0.4
between the active user and other users these numbers represent similarity
weights or proximity of the active user to other users in the dataset the next
step is to create a weighted rating matrix we just calculated the similarity
of users to our active user in the previous slide now we can use it to
calculate the possible opinion of the active user about our two target movies
this is achieved by multiplying the similarity weights by the user ratings it results in a weighted ratings matrix which represents the active user's neighbors'
opinion about our two candidate movies for recommendation in fact it
incorporates the behavior of other users and gives more weight to the ratings of
those users who are more similar to the active user now we can generate the
recommendation matrix by aggregating all of the weighted ratings however as three
users rated the first potential movie and two users rated the second movie we
have to normalize the weighted rating values we do this by dividing it by the
sum of the similarity index for users the result is the potential rating that
our active user will give to these movies based on her similarity to other
users it is obvious that we can use it to rank the movies for providing
recommendation to our active user now let’s examine what’s different between
user based and item based collaborative filtering in the user based approach the
recommendation is based on users of the same neighborhood with whom he or she
shares common preferences for example as user 1 and user 3 both liked item 3 and
item 4 we consider them as similar or neighbor users and recommend item 1
which is positively rated by user 1 to user 3 in the item based approach
similar items build neighborhoods on the behavior of users please note however
that it is not based on their contents for example item 1 and item 3 are
considered neighbors as they were positively rated by both user 1 and user
2 so item 1 can be recommended to user 3 as he has already shown interest in item
3 therefore the recommendations here are based on the items in the neighborhood
that a user might prefer collaborative filtering is a very effective
recommendation system however there are some challenges with it as well one of
them is data sparsity data sparsity happens when you have a large data set
of users who generally rate only a limited number of items as mentioned
collaborative based recommenders can only predict the scoring of an item if there are other users who have rated it due to sparsity we might not have enough
ratings in the user item data set which makes it impossible to provide proper
recommendations another issue to keep in mind is something called cold start cold
start refers to the difficulty the recommendation system has when there is
a new user and as such a profile doesn’t exist for them yet cold start can also happen when we have a new item which has not received a rating scalability can
become an issue as well as the number of users or items increases and the amount
of data expands collaborative filtering algorithms will begin to suffer drops in
performance simply due to growth in the similarity computation there are some
solutions for each of these challenges such as using hybrid based recommender
systems but they are out of scope of this course
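To make the user based calculation above concrete, here is a small NumPy sketch with made-up similarity weights and neighbour ratings (np.nan marks a movie a neighbour has not rated); it reproduces the weight, aggregate and normalize steps described in the video.

```python
import numpy as np

# similarity of the active user to three neighbours (e.g. from Pearson correlation)
similarity = np.array([0.7, 0.9, 0.4])

# the neighbours' ratings for the two movies the active user has not seen
ratings = np.array([
    [8.0, np.nan],
    [9.0, 7.0],
    [np.nan, 6.0],
])

rated = ~np.isnan(ratings)
weighted = np.where(rated, ratings, 0.0) * similarity[:, None]   # weighted ratings

# normalize by the summed similarity of the users who actually rated each movie
predicted = weighted.sum(axis=0) / (rated * similarity[:, None]).sum(axis=0)
print(predicted)   # predicted ratings the active user might give each movie
```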
thanks for watching


🎒 Contents & Time Line ⏱
  🎞️ 0:00:10 – Welcome To ML with Python
  🎞️ 0:01:42 – High-Level Introduction to ML
  🎞️ 0:08:27 – Python For Machine Learning
  🎞️ 0:13:21 – Supervised Vs Unsupervised Learning
  🎞️ 0:18:07 – 📂 Regression
  🎞️ 0:21:59 – └📁 Linear Regression
  🎞️ 0:32:01 – └📁 Model Evaluation Approaches
  🎞️ 0:38:34 – └📁 Evaluation Metrics in RM
  🎞️ 0:41:01 – └📁 Multiple Linear Regression
  🎞️ 0:51:54 – └📁 Non-Linear Regression
  🎞️ 0:57:57 – └📁 Logistic Regression
  🎞️ 1:04:15 – 📂 Classification
  🎞️ 1:07:20 – └📁 K-Nearest Neighbours
  🎞️ 1:14:40 – └📁 Evaluation Metrics (Classification)
  🎞️ 1:20:22 – └📁 Decision Trees
  🎞️ 1:23:32 – └📁 Building Decision Trees
  🎞️ 1:32:00 – └📁 Logistic Regression
  🎞️ 1:38:18 – └📁 Linear Vs Logistic Regression
  🎞️ 1:50:41 – └📁 Logistic Regression Model Training
  🎞️ 2:01:42 – └📁 Support Vector Machine (SVM)
  🎞️ 2:08:45 – 📂 Clustering
  🎞️ 2:15:07 – └📁 K-Means Clustering Basic
  🎞️ 2:22:52 – └📁 K-Means Clustering Advanced
  🎞️ 2:25:53 – └📁 Hierarchical Clustering Basic
  🎞️ 2:30:52 – └📁 Hierarchical Clustering Advanced
  🎞️ 2:35:32 – └📁 DBSCAN Clustering
  🎞️ 2:41:03 – 📂 Recommendation Systems (RS)
  🎞️ 2:44:39 – └📁 Content-Based RS
  🎞️ 2:48:48 – └📁 Collaborative Filtering
  🎞️ 2:54:31 – 📂 Recommended Projects
  🎞️ 2:54:36 – └📁 Twitter Sentiment Analysis
  🎞️ 2:56:11 – └📁 Handwriting Recognition
  🎞️ 2:57:26 – └📁 Stock Price Prediction
  🎞️ 2:58:22 – └📁 Football Match Prediction
  🎞️ 2:59:36 – └📁 Movie Recommendation System
  🎞️ 3:00:10 – Where To Go From Here?
