Hello and welcome to Machine Learning with Python! In this course you'll learn how machine learning is used in many key

fields and industries for example in the healthcare industry data scientists use

machine learning to predict whether a human cell that is believed to be at

risk of developing cancer is either benign or malignant as such machine

learning can play a key role in determining a person’s health and

welfare you’ll also learn about the value of decision trees and how building

a good decision tree from historical data helps doctors to prescribe the

proper medicine for each of their patients you’ll learn how bankers use

machine learning to make decisions on whether to approve loan applications

and you will learn how to use machine learning to do bank customer segmentation, where it is usually not easy to run for huge volumes of data. In this course you'll see how machine learning helps websites such as

YouTube Amazon or Netflix develop recommendations to their customers about

various products or services such as which movies they might be interested in

going to see or which books to buy there is so much that you can do with machine

learning here you’ll learn how to use popular Python libraries to build your

model for example given an automobile data set we can use the scikit-learn

library to estimate the co2 emission of cars using their engine size or

cylinders we could even predict what the co2 emissions will be for a car that

hasn't even been produced yet. And we'll see how the telecommunications industry can predict customer churn. Hello and welcome! In this video I will

give you a high-level introduction to machine learning so let’s get started

this is a human cell sample extracted from a patient and this cell has

characteristics for example its clump thickness is 6 its uniformity of cell

size is 1, its marginal adhesion is 1, and so on. One of the interesting questions we can ask at this point is: is this a benign or malignant cell? In contrast

with a benign tumor a malignant tumor is a tumor that may invade its surrounding

tissue or spread around the body and diagnosing it early might be the key to

a patient’s survival one could easily presume that only a doctor with years of

experience could diagnose that tumor and say if the patient is developing

cancer or not right well imagine that you’ve obtained the dataset containing

characteristics of thousands of human cell samples extracted from patients who

were believed to be at risk of developing cancer analysis of the

original data showed that many of the characteristics differed significantly

between benign and malignant samples you can use the values of these cell

characteristics in samples from other patients to give an early indication of

whether a new sample might be benign or malignant you should clean your data

select a proper algorithm for building a prediction model and train your model to

understand patterns of benign or malignant cells within the data once the

model has been trained by going through data iteratively it can be used to

predict a new or unknown cell with rather high accuracy this is machine

learning it is the way that a machine learning model can do a doctor’s task or

at least help that doctor make the process faster now let me give a formal

definition of machine learning machine learning is the subfield of computer

science that gives computers the ability to learn without being explicitly

programmed let me explain what I mean when I say without being explicitly

programmed assume that you have a data set of images of animals such as cats

and dogs and you want to have software or an application that could recognize

and differentiate them the first thing that you have to do here is interpret

the images as a set of feature sets. For example, does the image show the animal's eyes? If so, what is their size? Does it have ears? What about a tail? How many legs? Does it have wings? Prior to machine learning, each image would be transformed

to a vector of features then traditionally we had to write down some

rules or methods in order to get computers to be intelligent and detect

the animals but it was a failure why well as you can guess it needed a lot of

rules highly dependent on the current data set and not generalized enough to

detect out-of-sample cases this is when machine learning entered the scene using

machine learning allows us to build a model that looks at all the feature sets

and their corresponding type of animals and it learns the pattern of each animal

it is a model built by machine learning algorithms to detect the type of animal without explicitly being programmed to do so. In essence, machine learning follows the same

process that a four-year-old child uses to learn understand and differentiate

animals so machine learning algorithms inspired by the human learning process

iteratively learn from data and allow computers to find hidden insights these

models help us in a variety of tasks such as object recognition summarization

recommendation and so on. Machine learning impacts society in a very

influential way here are some real life examples first how do you think Netflix

and Amazon recommend videos, movies, and TV shows to their users? They use machine

learning to produce suggestions that you might enjoy this is similar to how your

friends might recommend a television show to you based on their knowledge of

the types of shows you like to watch how do you think banks make a decision when

approving a loan application they use machine learning to predict the

probability of default for each applicant and then approve or refuse the

loan application based on that probability telecommunication companies

use their customers' demographic data to segment them or predict whether they will

unsubscribe from their company the next month there are many other applications

of machine learning that we see every day in our daily life such as chatbots,

logging into our phones or even computer games using face recognition

each of these uses different machine learning techniques and algorithms so

let’s quickly examine a few of the more popular techniques the regression

estimation technique is used for predicting a continuous value for

example predicting things like the price of a house based on its characteristics

or to estimate the co2 emission from a car’s engine a classification technique

is used for predicting the class or category of a case for example if a cell

is benign or malignant, or whether or not a customer will churn. Clustering groups similar cases; for example, it can find similar patients, or it can be used for customer segmentation in the banking field. The association technique is used for

finding items or events that often co-occur

for example grocery items that are usually bought together by a particular

customer anomaly detection is used to discover abnormal and unusual cases for

example it is used for credit card fraud detection

sequence mining is used for predicting the next event, for instance the clickstream in websites. Dimension reduction is used to reduce the size of data, and

finally recommendation systems this associates people’s preferences with

others who have similar tastes and recommends new items to them such as

books or movies we will cover some of these techniques in the next videos by

this point I’m quite sure this question has crossed your mind what is the

difference between these buzzwords that we keep hearing these days such as

artificial intelligence or AI machine learning and deep learning well let me

explain what is different between them in brief AI tries to make computers

intelligent in order to mimic the cognitive functions of humans so

artificial intelligence is a general field with a broad scope including

computer vision language processing creativity and summarization machine

learning is the branch of AI that covers the statistical part of artificial

intelligence it teaches the computer to solve problems by looking at hundreds or

thousands of examples learning from them and then using that experience to solve

the same problem in new situations and deep learning is a very special field of

machine learning where computers can actually learn and make intelligent

decisions on their own deep learning involves a deeper level of automation in

comparison with most machine learning algorithms now that we’ve completed the

introduction to machine learning subsequent videos will focus on

reviewing two main components first you’ll be learning about the purpose of

machine learning and where it can be applied in the real world and second

you’ll get a general overview of machine learning topics such as supervised

versus unsupervised learning model evaluation and various machine learning

algorithms. So now that you have a sense of what's in store on this journey, let's continue our exploration of machine learning. Hello and welcome! In this video

we’ll talk about how to use Python for machine learning so let’s get started

Python is a popular and powerful general-purpose programming language

that recently emerged as the preferred language among data scientists you can

write your machine learning algorithms using Python and it works very well

however there are a lot of modules and libraries already implemented in Python

that could make your life much easier. We introduce these Python packages in this course and use them in the labs to give you better hands-on experience. The

first package is numpy which is a math library to work with n dimensional

arrays in Python it enables you to do computation efficiently and effectively
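As a quick, hypothetical illustration of the kind of n-dimensional array computation NumPy provides (a toy sketch, not from the course labs):

```python
import numpy as np

# engine sizes (in liters) for a few hypothetical cars
engine_sizes = np.array([1.6, 2.0, 2.4, 3.0])

# vectorized arithmetic operates on the whole array at once,
# with no explicit Python loop
doubled = engine_sizes * 2
mean_size = engine_sizes.mean()
```

Operations like these run in optimized compiled code under the hood, which is a large part of why NumPy underpins the other scientific Python libraries.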

it is better than regular Python because of its amazing capabilities; for example, for working with arrays, dictionaries, functions, data types, and images, you need to know NumPy. SciPy is a collection of numerical algorithms and

domain-specific tool boxes including signal processing optimization

statistics, and much more. SciPy is a good library for scientific

and high-performance computation. Matplotlib is a very popular plotting package that provides 2D plotting as well as 3D plotting. Basic knowledge

about these three packages which are built on top of Python is a good asset

for data scientists who want to work with real-world problems if you’re not

familiar with these packages I recommend that you take the data analysis with

Python course first this course covers most of the useful topics in these

packages pandas library is a very high-level Python library that provides

high performance easy to use data structures it has many functions for

data importing manipulation and analysis in particular it offers data structures

and operations for manipulating numerical tables and time series
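For example, a miniature, made-up version of the kind of automobile table discussed in this course could be handled with pandas like this (the column names are illustrative, not the actual lab dataset):

```python
import pandas as pd

# hypothetical automobile data: engine size, cylinders, CO2 emission
df = pd.DataFrame({
    "ENGINESIZE":   [2.0, 2.4, 1.5, 3.5],
    "CYLINDERS":    [4, 4, 4, 6],
    "CO2EMISSIONS": [196, 221, 136, 255],
})

# typical pandas operations: filter rows and summarize a column
four_cyl = df[df["CYLINDERS"] == 4]
avg_co2 = df["CO2EMISSIONS"].mean()
```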

scikit-learn is a collection of algorithms and tools for machine

learning which is our focus here and which you’ll learn to use within this

course, as we'll be using scikit-learn quite a bit in the labs

let me explain more about it and show you why it is so popular among data

scientists scikit-learn is a free machine learning library for the Python

programming language it has most of the classification regression and clustering

algorithms and is designed to work with the Python numerical and scientific libraries NumPy and SciPy. It also includes very good documentation. On top

of that implementing machine learning models with scikit-learn

is really easy, with just a few lines of Python code. Most of the tasks that need to be done in a machine learning pipeline are already implemented in scikit-learn, including pre-processing of data,

feature selection feature extraction train test splitting defining the

algorithms, fitting models, tuning parameters, prediction, evaluation, and exporting the model. Let me show you an example of what scikit-learn looks like when you use this library. You don't have to understand the code for

now but just see how easily you can build a model with just a few lines of

code basically machine learning algorithms benefit from standardization

of the data set. If there are outliers or fields with different scales in your data set, you have to fix them. The pre-processing package of scikit-learn

provides several common utility functions and transformer classes to

change raw feature vectors into a suitable form of vector for modeling you

have to split your data set into train and test sets to train your model and

then test the model's accuracy separately. scikit-learn can split arrays or

matrices into random train and test subsets for you in one line of code. Then you can set up your algorithm; for example, you can build a classifier using a support vector classification algorithm. We call our estimator instance clf and initialize its parameters. Now you can train your model with the train set

by passing our training set to the fit method. The clf model learns to classify

unknown cases then we can use our test set to run predictions and the result

tells us what the class of each unknown value is. You can also use different

metrics to evaluate your model accuracy for example using a confusion matrix to

show the results. And finally, you can save your model. You may find all or some of

these machine learning terms confusing but don’t worry we’ll talk about all of

these topics in the following videos the most important point to remember is that

the entire process of a machine learning task can be done simply in a few lines

of code using scikit-learn. Please note that, though it is possible, it would not be that easy if you wanted to do all of this using the NumPy or SciPy

packages and of course it needs much more coding if you use pure Python

programming to implement all of these tasks.
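Putting the steps from this video together, a minimal end-to-end sketch with scikit-learn might look like the following. The data here is synthetic (two random clusters), not the course's cell-sample dataset:

```python
import numpy as np
from sklearn import preprocessing, svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

# synthetic two-class data: two well-separated clusters of points
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + [4, 4]])
y = np.array([0] * 50 + [1] * 50)

# standardize the features so they share a common scale
X = preprocessing.StandardScaler().fit(X).transform(X)

# split into random train and test subsets in one line
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=4)

# set up the estimator (a support vector classifier) and train it
clf = svm.SVC(gamma="auto")
clf.fit(X_train, y_train)

# predict the class of the held-out test samples
yhat = clf.predict(X_test)

# evaluate the predictions with a confusion matrix
cm = confusion_matrix(y_test, yhat)
```

Each step here corresponds to one stage of the pipeline described above: pre-processing, splitting, defining the algorithm, fitting, prediction, and evaluation.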

We'll now introduce supervised algorithms versus unsupervised algorithms, so let's

get started an easy way to begin grasping the concept of supervised

learning is by looking directly at the words that make it up supervised means

to observe and direct the execution of a task project or activity obviously we

aren’t going to be supervising a person instead we’ll be supervising a machine

learning model that might be able to produce classification regions like we

see here so how do we supervise a machine learning model we do this by

teaching the model that is we load the model with knowledge so that we can have

it predict future instances but this leads to the next question which is how

exactly do we teach a model we teach the model by training it with some data from

a labeled data set; it's important to note that the data is labeled. And what does a labeled data set look like? Well, it could look something like this. This example is

taken from the cancer data set as you can see we have some historical data for

patients and we already know the class of each row let’s start by introducing

some components of this table. The names up here, which are clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, and so on, are called attributes. The columns are called features, which include the data. If you plot this data and look at a single data point on a plot,

it’ll have all of these attributes that would make a row on this chart also

referred to as an observation looking directly at the value of the data you

can have two kinds the first is numerical when dealing with machine

learning the most commonly used data is numeric the second is categorical that

is it’s non numeric because it contains characters rather than numbers in this

case it’s categorical because this data set is made for classification there are

two types of supervised learning techniques: they are classification and regression. Classification is the process of predicting a discrete class label or

category regression is the process of predicting a continuous value as opposed

to predicting a categorical value in classification look at this data set it

is related to CO2 emissions of different cars; it includes engine size, cylinders,

fuel consumption and co2 emission of various models of automobiles given this

data set you can use regression to predict the co2 emission of a new car by

using other fields such as engine size or number of cylinders since we know the

meaning of supervised learning, what do you think unsupervised learning means? Yes, unsupervised learning is exactly as it

sounds we do not supervise the model but we let the model work on its own to

discover information that may not be visible to the human eye. It means the

unsupervised algorithm trains on the data set and draws conclusions on

unlabeled data generally speaking unsupervised learning has more difficult

algorithms than supervised learning since we know little to no information

about the data or the outcomes that are to be expected dimension reduction

density estimation, market basket analysis, and clustering are the most

widely used unsupervised machine learning techniques dimensionality

reduction and or feature selection play a large role in this by reducing

redundant features to make the classification easier. Market basket

analysis is a modeling technique based upon the theory that if you buy a

certain group of items you’re more likely to buy another group of items

density estimation is a very simple concept that is mostly used to explore

the data to find some structure within it and finally clustering clustering is

considered to be one of the most popular unsupervised machine learning techniques

used for grouping data points or objects that are somehow similar cluster

analysis has many applications in different domains whether it be a bank’s

desire to segment its customers based on certain characteristics or helping an

individual to organize and group his or her favorite types of music generally

speaking though clustering is used mostly for discovering structure

summarization and anomaly detection so to recap the biggest difference between

supervised and unsupervised learning is that supervised learning deals with labeled data, while unsupervised learning deals

with unlabeled data in supervised learning we have machine learning

algorithms for classification and regression in unsupervised learning we

have methods such as clustering in comparison to supervised learning

unsupervised learning has fewer models and fewer evaluation methods that can be

used to ensure that the outcome of the model is accurate as such unsupervised

learning creates a less controllable environment as the machine is creating

outcomes for us hello and welcome in this video we’ll be giving a brief

introduction to regression so let’s get started look at this data set it’s

related to co2 emissions from different cars it includes engine size number of

cylinders fuel consumption and co2 emission from various automobile models

the question is given this data set can we predict the co2 emission of a car

using other fields such as engine size or cylinders let’s assume we have some

historical data from different cars and assume that a car such as in row 9 has

not been manufactured yet but we’re interested in estimating its approximate

co2 emission after production is it possible we can use regression methods

to predict a continuous value such as co2 emission using some other variables

indeed regression is the process of predicting a continuous value in

regression there are two types of variables a dependent variable and one

or more independent variables the dependent variable can be seen as the

state target or final goal we study and try to predict and the independent

variables also known as explanatory variables can be seen as the causes of

those states. The independent variables are shown conventionally by x, and the

dependent variable is notated by Y our regression model relates Y or the

dependent variable, to a function of x, i.e., the independent variable.

the key point in the regression is that our dependent value should be continuous

and cannot be a discrete value however the independent variable or variables

can be measured on either a categorical or continuous measurement scale so what

we want to do here is to use the historical data of some cars, using one

or more of their features and from that data make a model we use regression to

build such a regression estimation model then the model is used to predict the

expected co2 emission for a new or unknown car basically there are two

types of regression models simple regression and multiple regression

simple regression is when one independent variable is used to estimate

a dependent variable it can be either linear or non-linear

for example predicting co2 emission using the variable of engine size

The linearity of regression is based on the nature of the relationship between independent and dependent variables. When more than one independent variable is

present the process is called multiple linear regression for example predicting

co2 emission using engine size and the number of cylinders in any given car

again depending on the relation between dependent and independent variables it

can be either linear or non-linear regression let’s examine some sample

applications of regression essentially we use regression when we want to

estimate a continuous value for instance one of the applications of regression

analysis could be in the area of sales forecasting you can try to predict a

salesperson's total yearly sales from independent variables such as age,

education and years of experience it can also be used in the field of psychology

for example to determine individual satisfaction based on demographic and

psychological factors we can use regression analysis to predict the price

of a house in an area based on its size number of bedrooms and so on we can even

use it to predict employment income for independent variables such as hours of

work education occupation sex age years of experience and so on indeed you can

find many examples of the usefulness of regression analysis

in these and many other fields or domains such as finance healthcare

retail and more we have many regression algorithms each

of them has its own importance and a specific condition to which their

application is best suited and while we’ve covered just a few of them in this

course it gives you enough base knowledge for you to explore different

regression techniques hello and welcome in this video we’ll be covering linear

regression you don’t need to know any linear algebra to understand topics in

linear regression this high-level introduction will give you enough

background information on linear regression to be able to use it

effectively on your own problems so let’s get started let’s take a look at

this data set it’s related to the co2 emission of different cars it includes

engine size, cylinders, fuel consumption, and CO2 emissions for various car models

the question is given this data set can we predict the co2 emission of a car

using another field such as engine size quite simply yes we can use linear

regression to predict a continuous value such as co2 emission by using other

variables linear regression is the approximation of a linear model used to

describe the relationship between two or more variables in simple linear

regression there are two variables a dependent variable and an independent

variable the key point in the linear regression is that our dependent value

should be continuous and cannot be a discrete value however the independent

variables can be measured on either a categorical or continuous measurement

scale there are two types of linear regression models they are simple

regression and multiple regression simple linear regression is when one

independent variable is used to estimate a dependent variable for example

predicting co2 emission using the engine size variable when more than one

independent variable is present the process is called multiple linear

regression for example predicting co2 emission using engine size and cylinders

of cars our focus in this video is on simple linear regression now let’s see

how linear regression works okay so let’s look at our data set again to

understand linear regression we can plot our variables here we show engine size

as an independent variable and emission as the target value that

we would like to predict a scatterplot clearly shows the relation between

variables, where changes in one variable explain or possibly cause changes in

the other variable also it indicates that these variables are linearly

related with linear regression you can fit a line through the data for instance

as the engine size increases so do the emissions with linear regression you can

model the relationship of these variables a good model can be used to

predict what the approximate emission of each car is how do we use this line for

prediction now let us assume for a moment that the line is a good fit of

the data we can use it to predict the emission of an unknown car for example

for a sample car with engine size 2.4 you can find the emission is 214 now

let’s talk about what the fitting line actually is we’re going to predict the

target value Y in our case using the independent variable engine size

represented by x1. The fit line is traditionally shown as a polynomial; in a simple regression problem with a single x, the form of the model would be y hat equals theta 0 plus theta 1 x1. In this equation, y hat is the dependent variable or the predicted

value and X 1 is the independent variable theta 0 and theta 1 are the

parameters on the line that we must adjust theta 1 is known as the slope or

gradient of the fitting line and theta 0 is known as the intercept theta 0 and

theta 1 are also called the coefficients of the linear equation you can interpret

this equation as y hat being a function of X 1 or Y hat being dependent of X 1
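In code, this fitted line is a one-liner. The parameter values used below (theta 0 = 125, theta 1 = 39) are purely illustrative:

```python
# y_hat = theta0 + theta1 * x1: the simple linear regression line
def predict(theta0, theta1, x1):
    """Predicted value of the dependent variable for a given x1."""
    return theta0 + theta1 * x1

# e.g. an estimated CO2 emission for an engine size of 2.4
y_hat = predict(125, 39, 2.4)  # about 218.6
```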

now the questions are how would you draw a line through the points and how do you

determine which line fits best linear regression estimates the coefficients of

the line this means we must calculate theta 0 and theta 1 to find the best

line to fit the data. This line would best estimate the emission of the

unknown data points let’s see how we can find this line or to be more precise how

we can adjust the parameters to make the line the best fit for the data for a

moment, let's assume we've already found the best-fit line for our data. Now let's go through all the points and check how well they align with this line.

best fit here means that if we have for instance a car with engine size x1

equals 5.4 and actual co2 equals 250 its co2 should be predicted very close to

the actual value which is y equals 250 based on historical data but if we use

the fit line or better to say using our polynomial with known parameters to

predict the co2 emission it will return y hat equals 340 now if you compare the

actual value of the emission of the car with what we've predicted using our

model you will find out that we have a ninety unit error this means our

prediction line is not accurate this error is also called the residual error

so we can say the error is the distance from the data point to the fitted

regression line. The mean of all residual errors shows how poorly the line fits the whole data set; mathematically, it can be shown by the mean squared error equation, shown as MSE. Our objective is to find a line where the mean of all

these errors is minimized in other words the mean error of the prediction using

the fit line should be minimized let’s reword it more technically the

objective of linear regression is to minimize this MSE equation and to

minimize it we should find the best parameters theta0 and theta1 now the

question is how to find theta 0 and theta 1 in such a way that it minimizes

this error how can we find such a perfect line or set another way how

should we find the best parameters for our line? Should we move the line a lot, randomly, and calculate the MSE value every time to choose the minimum one?
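Before answering, here is what the residual errors and MSE just described look like in code, using made-up actual and predicted values (including a 90-unit error like the one above), not the course data:

```python
# actual CO2 values and the values a candidate line predicts for them
y_actual    = [250.0, 200.0, 230.0]
y_predicted = [340.0, 210.0, 220.0]

# residual error for each point: actual minus predicted
residuals = [ya - yp for ya, yp in zip(y_actual, y_predicted)]

# mean squared error: the average of the squared residuals
mse = sum(r * r for r in residuals) / len(residuals)
```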

Not really. Actually, we have two options here: option one, we can use a mathematical approach, or option two, we can use an optimization approach. Let's see how we can easily use a mathematical formula to find theta 0 and theta 1. As

mentioned before theta 0 and theta 1 in the simple linear regression are the

coefficients of the fit line we can use a simple equation to estimate these

coefficients that is given that it’s a simple linear regression with only two

parameters, and knowing that theta 0 and theta 1 are the

intercept and slope of the line we can estimate them directly from our data it

requires that we calculate the mean of the independent and dependent or target

columns from the data set notice that all of the data must be available to

traverse and calculate the parameters it can be shown that the intercept and

slope can be calculated using these equations we can start off by estimating

the value for theta 1 this is how you can find the slope of a line based on

the data X bar is the average value for the engine size in our data set please

consider that we have nine rows here row 0 to 8 first we calculate the average of

x1 and average of Y then we plug it into the slope equation to find theta 1 the X

I and y i in the equation refer to the fact that we need to repeat these

calculations across all values in our data set, and i refers to the i-th value of x or y. Applying all values, we find theta 1 equals 39. It is our second

parameter it is used to calculate the first parameter which is the intercept

of the line now we can plug theta 1 into the line equation to find theta 0 it is

easily calculated that theta 0 equals 125 point seven four so these are the

two parameters for the line, where theta 0 is also called the bias coefficient and theta 1 is the coefficient for the engine-size column. As a side note, you

really don’t need to remember the formula for calculating these parameters

as most of the libraries used for machine learning in Python, R, and Scala

can easily find these parameters for you but it’s always good to understand how

it works now we can write down the polynomial of the line so we know how to
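For reference, the closed-form estimates just described can be written directly in a few lines. The small dataset below is made up (it is not the nine-row table from the video), chosen so the fitted line comes out exact:

```python
# ordinary least squares for simple linear regression:
#   theta1 = sum((x_i - x_mean) * (y_i - y_mean)) / sum((x_i - x_mean)**2)
#   theta0 = y_mean - theta1 * x_mean
def fit_simple_linear(xs, ys):
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    theta1 = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) \
        / sum((x - x_mean) ** 2 for x in xs)
    theta0 = y_mean - theta1 * x_mean
    return theta0, theta1

# made-up engine sizes and CO2 emissions lying exactly on y = 120 + 40x
engine = [1.0, 2.0, 3.0, 4.0]
co2 = [160.0, 200.0, 240.0, 280.0]
theta0, theta1 = fit_simple_linear(engine, co2)
```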

find the best fit for our data and its equation. Now the question is: how can we

use it to predict the emission of a new car based on its engine size after we

found the parameters of the linear equation, making predictions is as simple as solving the equation for a specific set of inputs. Imagine we are predicting

co2-emission or Y from engine size or X for the automobile in record number 9

our linear regression model representation for this problem would be

y hat equals theta 0 plus theta 1 x1 or if we map it to our data set it would be

co2-emission equals theta 0 plus theta 1 engine size as we saw we can find theta

0 theta 1 using the equations that we just talked about once found we can plug

in the equation of the linear model for example let’s use theta 0 equals 125 and

theta 1 equals 39 so we can rewrite the linear model as co2 emission equals 125

plus 39 engine size now let’s plug in the ninth row of our data set and

calculate the co2 emission for a car with an engine size of 2.4 so co2

emission equals 125 plus 39 times 2.4 therefore we can predict that the co2

emission for this specific car would be 218.6. Let's talk a bit about why

linear regression is so useful quite simply it is the most basic regression

to use and understand in fact one reason why the linear regression is so useful

is that it’s fast it also doesn’t require tuning of

parameters, so something like tuning the k parameter in k-nearest neighbors or

the learning rate in neural networks isn’t something to worry about linear

regression is also easy to understand and highly interpretable. Hello and

welcome in this video we’ll be covering model evaluation so let’s get started

the goal of regression is to build a model to accurately predict an unknown

case to this end we have to perform regression evaluation after building the

model in this video we’ll introduce and discuss two types of evaluation

approaches that can be used to achieve this goal these approaches are train and

test on the same data set and train test split we'll talk about what

each of these are as well as the pros and cons of using each of these models

also we'll introduce some metrics for the accuracy of regression models let's look

at the first approach when considering evaluation models we clearly want to

choose the one that will give us the most accurate results so the question is

how can we calculate the accuracy of our model in other words how much can we

trust this model for prediction of an unknown sample using a given data set

and having built a model such as linear regression one of the solutions is to

select a portion of our data set for testing for instance assume that we have

10 records in our data set we use the entire data set for training and we

build a model using this training set now we select a small portion of the

data set such as row number six to nine but without the labels this set is

called a test set which has the labels but the labels are not used for

prediction they are used only as ground truth the labels are called the actual

values of the test set now we pass the feature set of the testing portion to

our built model and predict the target values finally we compare the predicted

values by our model with the actual values in the test set this indicates

how accurate our model actually is there are different metrics to report the

accuracy of the model but most of them work generally based on the similarity

of the predicted and actual values let’s look at one of the simplest metrics to

calculate the accuracy of our regression model as mentioned we just compare the

actual values Y with the predicted values which is noted as y hat for the

testing set the error of the model is calculated as the average difference

between the predicted and actual values for all the rows we can write this error

as an equation so the first evaluation approach we just talked about is the

simplest one train and test on the same data set essentially the name of this

approach says it all you train the model on the entire data set then you test it

using a portion of the same data set in a general sense when you test with a

data set in which you know the target value for each data point you’re able to

obtain a percentage of accurate predictions for the model this

evaluation approach will most likely have a high training

accuracy and a low out-of-sample accuracy since the model knows all of

the testing data points from the training what is training accuracy and

out-of-sample accuracy we said that training and testing on the same data

set produces a high training accuracy but what exactly is training accuracy

training accuracy is the percentage of correct predictions that the model makes

when using the test data set however a high training accuracy isn't

necessarily a good thing for instance having a high training accuracy may

result in an overfit of the data this means that the model is overly trained

to the data set which may capture noise and produce a non generalized model

out-of-sample accuracy is the percentage of correct predictions that the model

makes on data that the model has not been trained on training and testing on

the same data set will most likely have low out-of-sample accuracy due to the

likelihood of being overfit it’s important that our models have high

out-of-sample accuracy because the purpose of our model is of course to

make correct predictions on unknown data so how can we improve out-of-sample

accuracy one way is to use another evaluation approach called train test

split in this approach we select a portion of our data set for training for

example row 0 to 5 and the rest is used for testing for example row 6 to 9 the

model is built on the training set then the test feature set is passed to the

model for prediction and finally the predicted values for the test set are

compared with the actual values of the testing set the second evaluation

approach is called train test split train test split involves splitting the

data set into training and testing sets respectively which are mutually

exclusive after which you train with the training

set and test with the testing set this will provide a more accurate evaluation

on out-of-sample accuracy because the testing data set is not part of the data

set that has been used to train the model it is more realistic for real-world

problems this means that we know the outcome of each data point in the data

set making it great to test with and since this data has not been used to

train the model the model has no knowledge of the outcome of these

data points so in essence it’s truly out-of-sample testing however please

ensure that you train your model with the testing set afterwards as you don’t

want to lose potentially valuable data the issue with train test split is that

it's highly dependent on the data sets on which the model was trained and tested

the variation of this causes train test split to have a better out-of-sample

prediction than training and testing on the same data set but it still has some

problems due to this dependency another evaluation model called k-fold

cross-validation resolves most of these issues how do you fix a high variation

that results from a dependency well with averaging let me explain the basic

concept of k-fold cross-validation to see how we can solve this problem the

entire dataset is represented by the points in the image at the top left if

we have K equals four folds then we split up this dataset as shown here in

the first fold for example we use the first 25% of the data set for testing

and the rest for training the model is built using the training set and is

evaluated using the test set then in the next round or in the second fold the

second 25% of the data set is used for testing and the rest for training the

model again the accuracy of the model is calculated we continue for all folds
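The four folds just described can be written out by hand to make the procedure concrete; the engine-size and CO2 numbers below are invented, and in practice a library routine such as scikit-learn's cross_val_score would handle the splitting:

```python
# A minimal sketch of 4-fold cross-validation written out by hand, using the
# closed-form simple linear regression from earlier videos. The engine-size
# and CO2 numbers are invented for illustration.

engine_size = [1.6, 2.0, 2.4, 3.0, 3.5, 4.0, 4.4, 5.0]
co2 = [190.0, 210.0, 230.0, 255.0, 280.0, 300.0, 320.0, 345.0]

def fit_simple_lr(xs, ys):
    """Return (theta0, theta1) minimizing mean squared error."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    theta1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
             sum((x - mx) ** 2 for x in xs)
    return my - theta1 * mx, theta1

k = 4
fold_size = len(engine_size) // k
fold_errors = []
for fold in range(k):
    # each fold holds out a distinct slice for testing, trains on the rest
    test_idx = range(fold * fold_size, (fold + 1) * fold_size)
    train_x = [x for i, x in enumerate(engine_size) if i not in test_idx]
    train_y = [y for i, y in enumerate(co2) if i not in test_idx]
    t0, t1 = fit_simple_lr(train_x, train_y)
    # mean absolute error on the held-out fold
    mae = sum(abs(co2[i] - (t0 + t1 * engine_size[i])) for i in test_idx) / fold_size
    fold_errors.append(mae)

print(fold_errors)           # one error per fold, each fold distinct
print(sum(fold_errors) / k)  # the averaged out-of-sample error
```

Averaging the per-fold errors at the end is exactly the step the video describes next.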

finally the results of all four evaluations are averaged that is the

accuracy of each fold is then averaged keeping in mind that each fold is

distinct where no training data in one fold is used in another k-fold

cross-validation in its simplest form performs multiple train test splits

using the same data set where each split is different then the result is averaged

to produce a more consistent out-of-sample accuracy we wanted to show

you an evaluation model that addresses some of the issues we've described in

the previous approaches however going in depth with the k-fold cross-validation model

is out of scope for this course hello and welcome in this video we'll be

covering accuracy metrics for model evaluation so let’s get started

evaluation metrics are used to explain the performance of a model let’s talk

more about the model evaluation metrics that are used for regression

as mentioned basically we can compare the actual values and predicted values

to calculate the accuracy of a regression model evaluation metrics

play a key role in the development of a model as they provide insight into areas

that require improvement we’ll be reviewing a number of model evaluation

metrics including mean absolute error mean squared error and root mean squared

error but before we get into defining these we need to define what an error

actually is in the context of regression the error of the model is the difference

between the data points and the trendline generated by the algorithm

since there are multiple data points an error can be determined in multiple ways
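As a concrete illustration that the error can be measured in multiple ways, here is a sketch computing several common metrics (mean absolute error, mean squared error, root mean squared error, and R-squared) on invented actual and predicted test-set values:

```python
# Several ways of measuring regression error, computed on made-up actual (y)
# and predicted (y_hat) test-set values.
import math

y     = [196.0, 221.0, 136.0, 255.0, 244.0]   # actual values (invented)
y_hat = [205.0, 213.0, 149.0, 245.0, 251.0]   # model predictions (invented)
n = len(y)

mae  = sum(abs(a - p) for a, p in zip(y, y_hat)) / n    # mean absolute error
mse  = sum((a - p) ** 2 for a, p in zip(y, y_hat)) / n  # mean squared error
rmse = math.sqrt(mse)                                   # same units as y

# R-squared: how close the data are to the fitted line (1.0 is a perfect fit)
y_bar = sum(y) / n
r2 = 1 - sum((a - p) ** 2 for a, p in zip(y, y_hat)) / \
         sum((a - y_bar) ** 2 for a in y)

print(mae, mse, rmse, r2)
```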

mean absolute error is the mean of the absolute value of the errors this is the

easiest of the metrics to understand since it’s just the average error mean

squared error is the mean of the squared error it’s more popular than mean

absolute error because the focus is geared more towards large errors this is

due to the squared term exponentially increasing larger errors in comparison

to smaller ones root mean squared error is the square root of the mean squared

error this is one of the most popular of the evaluation metrics because root mean

squared error is interpretable in the same units as the response vector or Y

units making it easy to relate its information relative absolute error also

known as residual sum of squares where Y bar is the mean value of Y takes the total

absolute error and normalizes it by dividing by the total absolute error of

the simple predictor relative squared error is very similar to relative

absolute error but is widely adopted by the data science community as it is used

for calculating R squared R squared is not an error per se but is a popular

metric for the accuracy of your model it represents how close the data values are

to the fitted regression line the higher the r-squared the better the

model fits your data each of these metrics can be used for quantifying the accuracy of

your prediction the choice of metric completely depends on the type of model

your data type and domain knowledge unfortunately further review is out of

scope of this course hello and welcome in this video we’ll be

covering multiple linear regression as you know there are two types of linear

regression models simple regression and multiple regression simple linear

regression is when one independent variable is used to estimate a dependent

variable for example predicting co2-emission using the variable of

engine size in reality there are multiple variables that predict the

co2-emission when multiple independent variables are present the process is

called multiple linear regression for example predicting co2-emission using

engine size and the number of cylinders in the car’s engine our focus in this

video is on multiple linear regression the good thing is that multiple linear

regression is the extension of the simple linear regression model so I

suggest you go through the simple linear regression video first if you haven’t

watched it already before we dive into a sample data set and see how multiple

linear regression works I want to tell you what kind of problems it can solve

when we should use it and specifically what kind of questions we can answer

using it basically there are two applications for multiple linear

regression first it can be used when we would like to identify the strength of

the effect that the independent variables have on a dependent variable

for example does revision time test anxiety lecture attendance and gender

have any effect on exam performance of students second it can be used to

predict the impact of changes that is to understand how the dependent variable

changes when we change the independent variables for example if we were

reviewing a person’s health data a multiple linear regression can tell you

how much that person’s blood pressure goes up or down for every unit increase

or decrease in a patient’s body mass index holding other factors constant as

is the case with simple linear regression multiple linear regression is

a method of predicting a continuous variable it uses multiple variables

called independent variables or predictors that best predict the value

of the target variable which is also called the dependent variable in

multiple linear regression the target value Y is a linear combination of

independent variables for example you can predict how much co2

a car might emit due to independent variables such as the car's engine size

number of cylinders and fuel consumption multiple linear regression is very

useful because you can examine which variables are significant predictors of

the outcome variable also you can find out how each feature impacts the outcome

variable and again as is the case in simple linear regression if you manage

to build such a regression model you can use it to predict the emission amount of

an unknown case such as record number nine
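A sketch of what building and using such a model looks like in code, with invented engine-size, cylinder, and CO2 numbers standing in for the course's fuel-consumption data set (plain least squares via NumPy; scikit-learn's LinearRegression would do the same job):

```python
# Building a multiple linear regression model and predicting an unseen
# "record 9" style case. All numbers are invented for illustration.
import numpy as np

# feature columns: engine size, cylinders
X = np.array([[1.6, 4], [2.0, 4], [2.4, 4], [3.0, 6],
              [3.5, 6], [4.0, 8], [4.4, 8], [5.0, 8]], dtype=float)
y = np.array([190, 210, 230, 260, 280, 310, 325, 350], dtype=float)

# prepend a column of ones so theta[0] becomes the intercept (bias) term
X1 = np.column_stack([np.ones(len(X)), X])
theta, *_ = np.linalg.lstsq(X1, y, rcond=None)   # ordinary least squares

# predict an unknown case, e.g. engine size 2.4 with 4 cylinders
new_car = np.array([1.0, 2.4, 4.0])
print(theta)              # [theta0, theta1, theta2]
print(new_car @ theta)    # predicted CO2 emission for the unseen car
```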

generally the model is of the form y hat equals theta 0 plus theta 1 x1 plus

theta 2 x2 and so on up to theta n X n mathematically we can show it as a

vector form as well this means it can be shown as a dot product of two vectors

the parameters vector and the feature set vector generally we can show the

equation for a multi-dimensional space as theta transpose X where theta is an N

by 1 vector of unknown parameters in a multi-dimensional space and X is the

vector of the featured sets as theta is a vector of coefficients and is supposed

to be multiplied by X conventionally it is shown as theta transpose theta is

also called the parameters or weight vector of the regression equation both

these terms can be used interchangeably and X is the feature set which

represents a car for example x1 for engine size or x2 for cylinders and so

on the first element of the feature set would be set to 1 because it turns the

theta 0 into the intercept or bias parameter when the vector is multiplied

by the parameter vector please notice that theta transpose X in a one

dimensional space is the equation of a line it is what we use in simple linear

regression in higher dimensions when we have more than one input or X the line

is called a plane or a hyperplane and this is what we use for multiple linear

regression so the whole idea is to find the best fit hyperplane for our data to

this end and as is the case in linear regression we should estimate the values

for theta vector that best predict the value of the target field in each row

to achieve this goal we have to minimize the error of the prediction now the

question is how to find the optimized parameters to find the optimized

parameters for our model we should first understand what the optimized parameters

are then we will find a way to optimize the parameters in short optimized

parameters are the ones which lead to a model with the fewest errors let’s

assume for a moment that we have already found the parameter vector of our model

it means we already know the values of theta vector now we can use the model

and the feature set of the first row of our data set to predict the co2 emission

for the first car correct if we plug the feature set values into the model

equation we find y hat let’s say for example it returns 140 as the predicted

value for this specific row what is the actual value y equals 196 how different

is the predicted value from the actual value of 196 well we can calculate it

quite simply as 196 subtract 140 which of course equals 56 this is the error of

our model only for one row or one car in our case as is the case in linear

regression we can say the error here is the distance from the data point to the

fitted regression model the mean of all residual errors shows how bad the model

is representing the data set it is called the mean squared error or MSE

mathematically MSE can be shown by an equation while this is not the only way

to express the error of a multiple linear regression model it is one of the most

popular ways to do so the best model for our dataset is the one with minimum

error for all prediction values so the objective of multiple linear regression

is to minimize the MSE equation to minimize it we should find the best

parameters theta but how okay how do we find the

parameter or coefficients for multiple linear regression there are many ways to

estimate the value of these coefficients however the most common methods are the

ordinary least squares and an optimization approach ordinary least squares tries to

estimate the values of the coefficients by minimizing the mean square error this

approach uses the data as a matrix and uses linear algebra

operations to estimate the optimal values for the theta the problem with

this technique is the time complexity of calculating matrix operations as it can

take a very long time to finish when the number of rows in your data set is less

than 10,000 you can think of this technique as an option however for

greater values you should try other faster approaches the second option is

to use an optimization algorithm to find the best parameters that is you can use

a process of optimizing the values of the coefficients by iteratively

minimizing the error of the model on your training data for example you can

use gradient descent which starts optimization with random values for each

coefficient then calculates the errors and tries to minimize it through wisely

changing the coefficients in multiple iterations gradient descent is a proper

approach if you have a large data set please understand however that there are

other approaches to estimate the parameters of the multiple linear

regression that you can explore on your own after you find the best parameters

for your model you can go to the prediction phase after we found the

parameters of the linear equation making predictions is as simple as solving the

equation for a specific set of inputs imagine we are predicting co2-emission

or y from other variables for the automobile in record number 9 our linear

regression model representation for this problem would be y hat equals theta

transpose x once we find the parameters we can plug them into the equation of

the linear model for example let’s use theta 0 equals 125 theta 1 equals 6

point 2 theta 2 equals 14 and so on if we map it to our data set we can rewrite

the linear model as co2 emissions equals 125 plus 6 point 2 multiplied by engine

size plus 14 multiplied by cylinder and so on as you can see multiple linear

regression estimates the relative importance of predictors for example it

shows cylinders have a higher impact on co2 emission amounts in comparison with

engine size now let’s plug in the ninth row of our

data set and calculate the co2 emission for a car with the engine size of 2.4 so

co2 emission equals 125 plus 6 point 2 times 2 point 4 plus 14 times 4 and so

on we can predict that the co2 emission for this specific car would be 214 point 1
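The plug-in step is just the dot product theta-transpose x. Here is a sketch using only the two coefficients actually quoted (6.2 for engine size, 14 for cylinders); the video's "and so on" covers further terms that aren't listed, so this partial sum comes out lower than the 214.1 quoted above:

```python
# The plug-in step written as the dot product theta-transpose x. Only the two
# coefficients quoted in the video (theta1 = 6.2 for engine size, theta2 = 14
# for cylinders) are used; the video's "and so on" refers to further unlisted
# terms, so this partial total differs from the 214.1 quoted.

theta = [125.0, 6.2, 14.0]   # [intercept, engine size, cylinders]
x     = [1.0, 2.4, 4.0]      # leading 1 multiplies the intercept

y_hat = sum(t * xi for t, xi in zip(theta, x))
print(y_hat)   # 125 + 6.2*2.4 + 14*4 = 195.88
```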

now let me address some concerns that you might already be having regarding

multiple linear regression as you saw you can use multiple independent

variables to predict a target value in multiple linear regression it sometimes

results in a better model compared to using a simple linear regression which

uses only one independent variable to predict the dependent variable now the

question is how many independent variables should we use for the

prediction should we use all the fields in our data set does adding independent

variables to a multiple linear regression model always increase the

accuracy of the model basically adding too many independent variables without

any theoretical justification may result in an overfit model an overfit model is

a real problem because it is too complicated for your data set and not

general enough to be used for a prediction so it is recommended to avoid

using many variables for a prediction there are different ways to avoid

overfitting a model in regression however that is outside the scope of

this video the next question is should independent

variables be continuous basically categorical independent variables can be

incorporated into a regression model by converting them into numerical variables

for example given a binary variable such as car type we can code 0 for manual

and 1 for automatic cars as a last point remember that multiple linear regression

is a specific type of linear regression so there needs to be a linear

relationship between the dependent variable and each of your independent

variables there are a number of ways to check for linear relationship for

example you can use scatter plots and then visually check for linearity if the

relationship displayed in your scatter plot is not linear then you need to use

nonlinear regression hello and welcome in this video we’ll be

covering nonlinear regression basics so let's get started these data points

correspond to China’s gross domestic product or GDP from 1960 to 2014 the

first column is the years and the second is China’s corresponding annual gross

domestic income in u.s. dollars for that year this is what the data points look

like now we have a couple of interesting questions first can gdp be predicted

based on time and second can we use a simple linear regression to model it

indeed if the data shows a curvy trend then linear regression will not produce

very accurate results when compared to a nonlinear regression simply because as

the name implies linear regression presumes that the data is linear the

scatterplot shows that there seems to be a strong relationship between GDP and

time but the relationship is not linear as you can see the growth starts off

slowly then from 2005 onward the growth is very significant and finally it

decelerates slightly in the 2010s it kind of looks like either a logistic or

exponential function so it requires a special estimation method of the

nonlinear regression procedure for example if we assume that the model for

these data points is an exponential function such as y hat equals theta 0

plus theta 1 times theta 2 to the power of X our job is to estimate

the parameters of the model that is the thetas and use the fitted model to predict GDP

for unknown or future cases in fact many different regressions exist that can be

used to fit whatever the data set looks like you can see quadratic and cubic

regression lines here and it can go on and on to infinite degrees in essence we

can call all of these polynomial regression where the relationship

between the independent variable X and the dependent variable Y is modeled as

an nth degree polynomial in X with many types of regression to choose from

there’s a good chance that one will fit your data set well
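To illustrate trying regressions of different degrees, here is a sketch fitting first-, second-, and third-degree polynomials to a synthetic curvy trend and comparing their errors:

```python
# Trying several polynomial degrees on a curvy trend and comparing fit.
# The "GDP-like" data here are synthetic, generated from a known quadratic.
import numpy as np

x = np.linspace(0, 5, 30)
y = 2.0 + 0.5 * x + 1.2 * x**2          # a genuinely quadratic trend

errors = {}
for degree in (1, 2, 3):
    coeffs = np.polyfit(x, y, degree)   # least-squares polynomial fit
    errors[degree] = np.mean((y - np.polyval(coeffs, x)) ** 2)
    print(degree, errors[degree])       # degree >= 2 drives the error to ~0
```

A straight line leaves a large error on this data, while the second- and third-degree fits match it almost exactly, which is why picking a regression that fits the data matters.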

remember it’s important to pick a regression that fits the data the best

so what is polynomial regression polynomial regression fits a curved line

to your data a simple example of polynomial with degree three is shown as

y hat equals theta zero plus theta 1 X plus theta two x squared plus theta

three x cubed or to the power of three where Thetas are parameters to be

estimated that makes the model fit perfectly to the underlying data though

the relationship between x and y is nonlinear here and polynomial regression

can fit it a polynomial regression model can still be expressed as linear

regression I know it’s a bit confusing but let’s look at an example given the

third degree polynomial equation by defining x1 equals x and x2 equals x

squared or X to the power of two and so on

the model is converted to a simple linear regression with new variables as

Y hat equals theta 0 plus theta 1 x1 plus theta 2 x2 plus theta 3 x3 this

model is linear in the parameters to be estimated right

therefore this polynomial regression is considered to be a special case of

traditional multiple linear regression so you can use the same mechanism as

linear regression to solve such a problem
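The substitution just described can be checked directly: build the new variables x1 = x, x2 = x squared, x3 = x cubed, and solve with ordinary least squares as if it were multiple linear regression. The data below are synthetic, generated from known thetas:

```python
# The substitution x1 = x, x2 = x^2, x3 = x^3 turns a degree-three polynomial
# fit into ordinary multiple linear regression. Data are synthetic, generated
# from a known cubic so we can check the recovered thetas.
import numpy as np

x = np.linspace(-2, 2, 25)
y = 1.0 - 2.0 * x + 0.5 * x**2 + 3.0 * x**3   # known thetas: 1, -2, 0.5, 3

# new "linear" feature set: a column of ones, then x1, x2, x3
X = np.column_stack([np.ones_like(x), x, x**2, x**3])
theta, *_ = np.linalg.lstsq(X, y, rcond=None)  # plain least squares

print(theta)   # recovers approximately [1.0, -2.0, 0.5, 3.0]
```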

therefore polynomial regression models can be fit using the method of least squares

least squares is a method for estimating the unknown parameters in a linear

regression model by minimizing the sum of the squares of the differences

between the observed dependent variable in the given data set and those

predicted by the linear function so what is nonlinear regression exactly first

nonlinear regression is a method to model a nonlinear relationship between

the dependent variable and a set of independent variables second for a model

to be considered non linear Y hat must be a nonlinear function of the

parameters theta not necessarily of the features X when it comes to nonlinear

equations they can take the shape of exponential

logarithmic and logistic or many other types as you can see in all of these

equations the change of Y hat depends on changes in the parameters theta not

necessarily on X only that is in nonlinear regression a model is

nonlinear in its parameters in contrast to linear regression we cannot use the

ordinary least-squares method to fit the data in nonlinear regression and in

general estimation of the parameters is not easy let me answer two important

questions here first how can I know if a problem is linear or non-linear in an

easy way to answer this question we have to do two things the first is to

visually figure out if the relation is linear or non-linear it’s best to plot

bivariate plots of output variables with each input variable also you can

calculate the correlation coefficient between independent and dependent

variables and if for all variables it is 0.7 or higher there is a linear tendency

and thus it’s not appropriate to fit a nonlinear regression the second thing we

have to do is to use nonlinear regression instead of linear regression

when we cannot accurately model the relationship with linear parameters the

second important question is how should I model my data if it displays a

nonlinear shape on a scatter plot well to address this you have to use either polynomial

regression a nonlinear regression model or transform your data which is

not in scope for this course hello and welcome in this video we’ll learn a

machine learning method called logistic regression which is used for

classification in examining this method we'll specifically answer these three

questions what is logistic regression what kind of problems can be solved by

logistic regression and in which situations do we use logistic regression

so let’s get started logistic regression is a statistical and

machine learning technique for classifying records of a data set based

on the values of the input fields let’s say we have a telecommunication data set

that we’d like to analyze in order to understand which customers might leave

us next month this is historical customer data where each row represents

one customer imagine that you’re an analyst at this company and you have to

find out who is leaving and why you’ll use the data set to build a model based

on historical records and use it to predict the future churn within the

customer group the data set includes information about services that each

customer has signed up for customer account information demographic

information about customers like gender and age range and also customers who’ve

left the company within the last month the column is called churn we can use

logistic regression to build a model for predicting customer churn using the

given features in logistic regression we use one or more independent variables

such as tenure age and income to predict an outcome such as churn which we call a

dependent variable representing whether or not customers will stop using the

service logistic regression is analogous to linear regression but tries to

predict a categorical or discrete target field instead of a numeric one in linear

regression we might try to predict a continuous value of variables such as

the price of a house blood pressure of the patient or fuel consumption of a car

but in logistic regression we predict a variable which is binary such as yes/no

true/false successful or not successful pregnant not pregnant and so on all of

which can be coded as 0 or 1 in logistic regression independent variables should be

continuous if categorical they should be dummy or indicator coded this means we

have to transform them to some continuous value please note that

logistic regression can be used for both binary classification and multi-class

classification but for simplicity in this video we’ll focus on binary

classification let’s examine some applications of logistic regression

before we explain how they work as mentioned logistic regression is a type

of classification algorithm so it can be used in different situations for example

to predict the probability of a person having a heart attack within a specified

time period based on our knowledge of the person’s

age sex and body mass index or to predict the chance of mortality in an

injured patient or to predict whether a patient has a given disease such as

diabetes based on observed characteristics of that patient such as

weight height blood pressure and results of various blood tests and so on in a

marketing context we can use it to predict the likelihood of a customer

purchasing a product or halting a subscription as we've done in our churn

example we can also use logistic regression to predict the probability of

failure of a given process system or product we can even use it to predict

the likelihood of a homeowner defaulting on a mortgage these are all good

examples of problems that can be solved using logistic regression notice that in

all these examples not only do we predict the class of each case we also

measure the probability of a case belonging to a specific class there are

different machine learning algorithms which can classify or estimate a variable the

question is when should we use logistic regression here are four situations in

which logistic regression is a good candidate first when the target field in

your data is categorical or specifically is binary such as 0 1 yes/no churn or no

churn positive negative and so on second you need the probability of your

prediction for example if you want to know what the probability is of a

customer buying a product logistic regression returns a probability score

between 0 & 1 for a given sample of data in fact

logistic regression predicts the probability of that sample and we

map the cases to a discrete class based on that probability third if your

data is linearly separable the decision boundary of logistic

regression is a line or a plane or a hyperplane a classifier will classify

all the points on one side of the decision boundary as belonging to one

class and all those on the other side as belonging to the other class for example

if we have just two features and are not applying any polynomial processing we

can obtain an inequality like theta 0 plus theta 1 X 1 plus theta 2 X 2 is

greater than 0 which is a half-plane easily achievable please note that in

using logistic regression we can also achieve a complex decision boundary

using polynomial processing as well which is out of scope

here you’ll get more insight from decision boundaries when you understand

how logistic regression works fourth you need to understand the impact of a

feature you can select the best features based on the statistical significance of

the logistic regression model coefficients or parameters that is after

finding the optimum parameters a feature X with the weight theta one close to

zero has a smaller effect on the prediction than features with large

absolute values of theta one indeed it allows us to understand the impact an

independent variable has on the dependent variable while controlling

other independent variables let’s look at our data set again we define the

independent variables as X and dependent variable as Y notice that for the sake

of simplicity we can code the target or dependent values to 0 or 1 the goal of

logistic regression is to build a model to predict the class of each sample

which in this case is a customer as well as the probability of each sample

belonging to a class given that let’s start to formalize the problem X is our

data set in the space of real numbers of M by n that is of M dimensions or

features and n records and Y is the class that we want to predict which can

be either 0 or 1 ideally a logistic regression model so

called y hat can predict the probability that the class of the customer is 1 given its features

X it can also be shown quite easily that the probability of a customer being in

class 0 can be calculated as 1 minus the probability that the class of the

customer is 1 hello in this video we’ll give you an introduction to

classification so let's get started in machine learning classification is a

supervised learning approach which can be thought of as a means of categorizing

or classifying some unknown items into a discrete set of classes classification

attempts to learn the relationship between a set of feature variables and a

target variable of interest the target attribute in classification is a

categorical variable with discrete values so how do classification and

classifiers work given a set of training data points along with the target labels

classification determines the class label for an unlabeled test case let’s

explain this with an example a good sample of classification is the loan

default prediction suppose a bank is concerned about the

potential for loans not to be repaid if previous loan default data can be used

to predict which customers are likely to have problems repaying loans these bad

risk customers can either have their loan application declined or offered

alternative products the goal of a loan default predictor is to use existing

loan default data which is information about the customers such as age income

education and so on to build a classifier pass a new customer or

potential future defaulter to the model and then label it ie the data points as

defaulter or non-defaulter or for example 0 or 1 this is how a classifier

predicts an unlabeled test case please notice that this specific example was

about a binary classifier with two values we can also build classifier

models for both binary classification and multi-class classification for

example imagine that you’ve collected data about a set of patients all of whom

suffered from the same illness during their course of treatment each patient

responded to one of three medications you can use this labelled data set with

a classification algorithm to build a classification model then you can use it

to find out which drug might be appropriate for a future patient with

the same illness as you can see it is a sample of multi-class classification
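as a purely illustrative sketch of what multi-class data looks like, here is a hypothetical labelled set of patients with three possible drug responses, together with a trivial baseline classifier that always predicts the most common class (all values are made up, and real classifiers of course use the feature values):

```python
from collections import Counter

# hypothetical labelled training data: (age, blood pressure) -> drug responded to
X_train = [(23, 110), (47, 140), (31, 120), (65, 150), (52, 135), (29, 115)]
y_train = ["drug A", "drug B", "drug C", "drug B", "drug B", "drug A"]  # three classes

# a trivial multi-class "classifier": ignore the features and always
# predict the most frequent class seen in the training labels
def majority_class(labels):
    return Counter(labels).most_common(1)[0][0]

print(majority_class(y_train))  # drug B -- the most common label in this made-up data
```

later videos in the course build real multi-class models of exactly this patient-and-drug kind using decision trees.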

classification has different business use cases as well for example to predict

the category to which a customer belongs for churn detection where we predict

whether a customer switches to another provider or brand or to predict whether

or not a customer responds to a particular advertising campaign data

classification has several applications in a wide variety of industries

essentially many problems can be expressed as associations between

feature and target variables especially when label data is available this

provides a broad range of applicability for classification for example

classification can be used for email filtering speech recognition handwriting

recognition biometric identification document classification and much more

here we have the types of classification algorithms in machine learning

they include decision trees naive Bayes linear discriminant analysis k nearest

neighbor logistic regression neural networks and support vector

machines there are many types of classification algorithms we will only

cover a few in this course hello and welcome in this video we’ll be covering

the K nearest neighbors algorithm so let’s get started

imagine that a telecommunications provider has segmented his customer base

by service usage patterns categorizing the customers into four groups

if demographic data can be used to predict group membership the company can

customize offers for individual prospective customers this is a

classification problem that is given the data set with predefined labels we need

to build a model to be used to predict the class of a new or unknown case the

example focuses on using demographic data such as region age and marital

status to predict usage patterns the Target Field called cust cat has four

possible values that correspond to the four customer groups as follows basic

service E-service plus service and total service our objective is to build a

classifier for example using rows 0 to 7 to predict the class of row 8 we

will use a specific type of classification called K nearest neighbor
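the task just described can be sketched from scratch as follows; the ages, incomes, and group labels below are invented, and this toy function only stands in for what a library would do:

```python
import math
from collections import Counter

# hypothetical customers: ((age, income), service group 1-4)
known = [((34, 30), 1), ((28, 25), 1), ((40, 60), 2), ((38, 55), 2),
         ((45, 80), 3), ((50, 90), 3), ((60, 120), 4), ((58, 110), 4)]
new_customer = (57, 115)  # the unknown case whose group we want to predict

def knn_predict(known, point, k):
    # sort the labelled cases by Euclidean distance to the new point
    by_distance = sorted(known, key=lambda case: math.dist(case[0], point))
    # majority vote among the k nearest neighbours
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

print(knn_predict(known, new_customer, k=3))  # 4 -- its nearest neighbours are group 4
```

scikit-learn’s KNeighborsClassifier implements the same idea behind a fit/predict interface.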

just for the sake of demonstration let’s use only two fields as predictors

specifically age and income and then plot the customers based on their group

membership now let’s say that we have a new customer for example record number 8

with a known age and income how can we find the class of this customer can we

find one of the closest cases and assign the same class label to our new customer

can we also say that the class of our new customer is most probably group 4 ie

total service because its nearest neighbor is also of class 4 yes we can

in fact it is the first nearest neighbor now the question is to what extent can

we trust our judgment which is based on the first nearest neighbor it might be a

poor judgment especially if the first nearest neighbor is a very specific case

or an outlier correct now let’s look at our scatter plot again rather than

choose the first nearest neighbor what if we chose the five nearest neighbors

and did a majority vote among them to define the class of

our new customer in this case we see that three out of five nearest neighbors

tell us to go for class three which is plus service doesn’t this make more

sense yes in fact it does in this case the value of K in the K nearest

neighbors algorithm is five this example highlights the intuition behind the K

nearest neighbors algorithm now let’s define the K nearest neighbors the K

nearest neighbors algorithm is a classification algorithm that takes a

bunch of labeled points and uses them to learn how to label other points this

algorithm classifies cases based on their similarity to other cases in K

nearest neighbors data points that are near each other are said to be neighbors
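nearness here is usually measured with Euclidean distance in feature space; a minimal sketch with made-up ages and incomes, first in one dimension and then in two:

```python
import math

# one feature (age): distance between customers aged 34 and 30
print(math.sqrt((34 - 30) ** 2))  # 4.0

# two features (age, income): the same formula in two dimensions
a, b = (34, 190), (30, 200)
print(math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2))  # sqrt(116), about 10.77
```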

k nearest neighbors is based on this paradigm similar cases with the same

class labels are near each other thus the distance between two cases is a

measure of their dissimilarity there are different ways to calculate the

similarity or conversely the distance or dissimilarity of two data points for

example this can be done using Euclidean distance now let’s see how the K nearest

neighbors algorithm actually works in a classification problem the K nearest

neighbors algorithm works as follows one pick a value for K two calculate the

distance from the new case to each of the cases in the data set three

search for the K observations in the training data that are nearest to the

measurements of the unknown data point and four predict the response of the

unknown data point using the most popular response value from the K

nearest neighbors there are two parts in this algorithm that might be a bit

confusing first how to select the correct K and second how to compute the

similarity between cases for example among customers let’s first start with a

second concern that is how can we calculate the similarity between two

data points assume that we have two customers customer 1 and customer 2 and

for a moment assume that these two customers have only one feature age we

can easily use a specific type of Minkowski distance to calculate the

distance of these two customers indeed the Euclidean distance

of X 1 from X 2 is the root of 34 minus 30 to the power of 2 which is 4 what about if we

have more than one feature for example age and income if we have income and age

for each customer we can still use the same formula but this time we’re using

it in a two dimensional space we can also use the same distance metric for

multi-dimensional vectors of course we have to normalize our feature set to get

the accurate dissimilarity measure there are other dissimilarity measures as well

that can be used for this purpose but as mentioned it is highly dependent on data

type and also the domain that classification is done for as

mentioned K in K nearest neighbors is the number of nearest neighbors to

examine it is supposed to be specified by the user so how do we choose the

right K assume that we want to find the class of the customer noted as

question mark on the chart what happens if we choose a very low value of K let’s

say k equals 1 the first nearest point would be blue which is class 1 this

would be a bad prediction since more of the points around it are magenta or

class 4 in fact since its nearest neighbor is blue we can say that we

captured the noise in the data or we chose one of the points that was an

anomaly in the data a low value of K causes a highly complex model as well

which might result in overfitting of the model

it means the prediction process is not generalized enough to be used for

out-of-sample cases out-of-sample data is data that is outside of the data set

used to train the model in other words it could not be trusted to be used for

prediction of unknown samples it’s important to remember that overfitting

is bad as we want a general model that works for any data not just the data

used for training now on the opposite side of the spectrum if we choose a very

high value of K such as K equals 20 then the model becomes overly generalized

so how can we find the best value for K the general solution is to reserve a

part of your data for testing the accuracy of the model once you’ve done

so choose K equals 1 and then use the training part for

modeling and calculate the accuracy of prediction using all samples in your

test set repeat this process increasing the K and

see which K is best for your model for example in our case k equals four will

give us the best accuracy nearest neighbors analysis can also be used to

compute values for a continuous target in this situation the average or median

target value of the nearest neighbors is used to obtain the predicted value for

the new case for example assume that you are predicting the price of a home based

on its feature set such as number of rooms square footage the year it was

built and so on you can easily find the three nearest neighbor houses of course

not only based on distance but also based on all the attributes and then

predict the price of the house as the median of neighbors hello and welcome in

this video we’ll be covering evaluation metrics for classifiers so let’s get

started evaluation metrics explain the

performance of a model let’s talk more about the model evaluation metrics that

are used for classification imagine that we have an historical data set which

shows the customer churn for a telecommunication company we have

trained the model and now we want to calculate its accuracy using the test

set we pass the test set to our model and find the predicted labels now the

question is how accurate is this model basically we compare the actual values

in the test set with the values predicted by the model to calculate the

accuracy of the model evaluation metrics play a key role in the development of

a model as they provide insight to areas that might require improvement there are

different model evaluation metrics but we’ll just talk about three of them here

specifically the Jaccard index f1 score and log loss let’s first look at one of the

simplest accuracy measurements the Jaccard index also known as the Jaccard

similarity coefficient let’s say Y shows the true labels of the churn data set

and Y hat shows the predicted values by our classifier then we can define

Jaccard as the size of the intersection divided by the size of the union

of the two label sets for example for a test set of size 10 with 8 correct predictions or

8 intersections the accuracy by the Jaccard index would be zero point six

six if the entire set of predicted labels for a sample strictly matches

with the true set of labels then the subset accuracy is one point zero

otherwise it is zero point zero another way of looking at accuracy of

classifiers is to look at a confusion matrix for example let’s assume that our

test set has only 40 rows this matrix shows the correct and wrong

predictions in comparison with the actual labels each confusion matrix row

shows the actual true labels in the test set and the columns show the predicted

labels by classifier let’s look at the first row the first row is for customers

whose actual churn value in the test set is 1 as you can calculate out of 40

customers the churn value of 15 of them is 1 and out of these 15 the classifier

correctly predicted 6 of them as 1 and 9 of them as 0 this means that for 6

customers the actual churn value was 1 in the test set and the classifier also

correctly predicted those as 1 however while the actual label of 9 customers

was 1 the classifier predicted those as 0 which is not very good we can consider

this as an error of the model for the first row what about the customers with

a churn value 0 let’s look at the second row it looks like there were 25

customers whose churn value was 0 the classifier correctly predicted 24 of

them as 0 and wrongly predicted one of them as 1 so it has done a good job in

predicting the customers with a churn value of 0 a good thing about the

confusion matrix is that it shows the model’s ability to correctly predict or

separate the classes in the specific case of a binary classifier such as this

example we can interpret these numbers as the count of

true positives false negatives true negatives and false positives based on

the count of each section we can calculate the precision and recall of

each label precision is a measure of the accuracy provided that a class label has

been predicted it is defined by precision equals true positive divided

by true positive plus false positive and recall is the true positive rate it is

defined as recall equals true positive divided by true positive plus false

negative so we can calculate the precision and recall of each class now

we’re in the position to calculate the f1 scores for each label based on the

precision and recall of that label the f1 score is the harmonic average of the

precision and recall where an f1 score reaches its best value at 1 which

represents perfect precision and recall and it’s worst at 0 it is a good way to

show that a classifier has a good value for both recall and precision it is

defined using the f1 score equation for example the f1 score for class 0 ie

churn equals 0 is zero point eight three and the f1 score for class one ie churn

equals one is zero point five five and finally we can tell the average accuracy

for this classifier is the average of the f1 score for both labels which is

zero point seven two in our case please notice that both Jaccard and f1 score

can be used for multi class classifiers as well which is out of scope for this

course now let’s look at another accuracy metric for classifiers

sometimes the output of a classifier is the probability of a class label instead

of the label for example in logistic regression the output can be the

probability of customer churn ie yes or equals to one this probability is a

value between 0 & 1 logarithmic loss also known as log loss

measures the performance of a classifier where the predicted output is a

probability value between 0 & 1 so for example predicting a probability of 0.13

when the actual label is 1 would be bad and would result in a high log loss
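the penalty just described can be sketched directly from the usual single-row log loss formula, minus the quantity y times log of y hat plus 1 minus y times log of 1 minus y hat (the 0.95 case is added here only for contrast):

```python
import math

def log_loss_single(y, y_hat):
    # heavily penalizes predictions that are confident and wrong
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

print(round(log_loss_single(1, 0.13), 2))  # 2.04 -- actual label 1, predicted only 0.13
print(round(log_loss_single(1, 0.95), 2))  # 0.05 -- confident and correct, low loss
```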

we can calculate the log loss for each row using the log loss equation which

measures how far each prediction is from the actual label then we calculate the

average log loss across all rows of the test set it is obvious that more ideal

classifiers have progressively smaller values of log loss so the classifier

with the lower log loss has better accuracy hello and welcome in this video

we’re going to introduce and examine decision trees so let’s get started what

exactly is a decision tree how do we use them to help us classify how can I grow

my own decision tree these may be some of the questions that you have in mind

from hearing the term decision tree hopefully you’ll soon be able to answer

these questions and many more by watching this video

imagine that you’re a medical researcher compiling data for a study you’ve

already collected data about a set of patients all of whom suffered from the

same illness during their course of treatment each patient responded to one

of two medications we’ll call them drug a and drug B part of your job is to build

a model to find out which drug might be appropriate for a future patient with

the same illness the feature sets of this data set are age gender blood

pressure and cholesterol of our group of patients and the target is the drug that

each patient responded to it is a sample of a binary classifier and you can use

the training part of the data set to build a decision tree and then use it to

predict a class of an unknown patient in essence to come up with a decision on

which drug to prescribe to a new patient let’s see how a decision tree is built

for this data set decision trees are built by splitting the training set into

distinct nodes where one node contains all of or most of one category of the

data if we look at the diagram here we can see that it’s a patient classifier

so as mentioned we want to prescribe a drug to a new patient but the decision

to choose drug a or B will be influenced by the patient’s situation we start with

age which could be young middle aged or senior if the patient is middle-aged

then we’ll definitely go for drug B on the other hand if he is a young or a

senior patient we’ll need more details to help us determine which drug to

prescribe the additional decision variables can be things such as

cholesterol levels gender or blood pressure for example if the patient is

female we will recommend drug a but if the patient is male then we’ll go for

drug B as you can see decision trees are about testing an attribute and branching

the cases based on the result of the test each internal node corresponds to a

test and each branch corresponds to a result of the test and each leaf node

assigns a patient to a class now the question is how can we build such a

decision tree here is the way that a decision tree is built a decision tree

can be constructed by considering the attributes one by one first choose an

attribute from our data set calculate the significance of the attribute in the

splitting of the data in the next video we will explain how to calculate the

significance of an attribute to see if it’s an effective attribute or not next

split the data based on the value of the best attribute then go to each branch

and repeat it for the rest of the attributes after building this tree you

can use it to predict the class of unknown cases or in our case the proper

drug for a new patient based on his or her characteristics hello and welcome in

this video we’ll be covering the process of building decision trees so let’s get

started consider the drug data set again the question is how do we build a

decision tree based on that data set decision trees are built using recursive

partitioning to classify the data let’s say we have 14 patients in our dataset

the algorithm chooses the most predictive feature to split the data on

what is important in making a decision tree is to determine which attribute is

the best or most predictive to split data based on the feature

let’s say we pick cholesterol as the first attribute to split data it will

split our data into two branches as you can see if the patient has high

cholesterol we cannot say with high confidence that

drug B might be suitable for him also if the patient’s cholesterol is normal we

still don’t have sufficient evidence or information to determine if either drug

a or drug B is in fact suitable it is a sample of bad attribute selection for

splitting data so let’s try another attribute again we have our 14 cases

this time we pick the sex attribute of patients it will split our data into two

branches male and female as you can see if the patient is female we can say

drug B might be suitable for her with high certainty but if the patient is

male we don’t have sufficient evidence or information to determine if drug a or

drug B is suitable however it is still a better choice in comparison with

the cholesterol attribute because the resulting nodes are more pure

it means nodes that are either mostly drug a or drug B so we can say the sex

attribute is more significant than cholesterol or in other words it’s more

predictive than the other attributes indeed predictiveness is based on

decrease in impurity of nodes we’re looking for the best feature to decrease

the impurity of patients in the leaves after splitting them up based on that

feature so the sex feature is a good candidate in the following case because

it almost found the pure patients let’s go one step further for the male patient

branch we again test other attributes to split the subtree we test cholesterol

again here as you can see it results in even more pure leaves so we can easily

make a decision here for example if a patient is male and his cholesterol is

high we can certainly prescribe drug a but if it is normal we can prescribe

drug B with high confidence as you might notice the choice of attribute to split

data is very important and it is all about purity of the leaves after the

split a node in the tree is considered pure if

in 100% of the cases the nodes fall into a specific category of the target field

in fact the method uses recursive partitioning to split the training

records into segments by minimizing the impurity at each step impurity of nodes

is calculated by entropy of data in the node so what is entropy entropy is the

amount of information disorder or the amount of randomness in the data the

entropy in the node depends on how much random data is in that node and is

calculated for each node in decision trees we’re looking for trees that have

the smallest entropy in their nodes the entropy is used to calculate the

homogeneity of the samples in that node if the samples are completely

homogeneous the entropy is zero and if the samples are equally divided it has

an entropy of one this means if all the data in a node are either drug a or drug

B then the entropy is zero but if half of the data are drug a and the other half are

B then the entropy is one you can easily calculate the entropy of a node using

the frequency table of the attribute through the entropy formula where p is

for the proportion or ratio of a category such as drug a or b please

remember though that you don’t have to calculate these as it’s easily

calculated by the libraries or packages that you use as an example let’s

calculate the entropy of the data set before splitting it we have nine

occurrences of drug B and five of drug a you can embed these numbers into the

entropy formula to calculate the impurity of the target attribute before

splitting it in this case it is 0.94 so what is entropy after splitting now we

can test different attributes to find the one with the most predictiveness

which results in two more pure branches let’s first select the cholesterol of

the patient and see how the data gets split based on its values for example

when it is normal we have 6 for drug B and 2 for drug a we can calculate the

entropy of this node based on the distribution of drug a and B which is

0.8 in this case but when cholesterol is high the data is

split into three for drug B and three for drug a calculating its entropy we

can see it would be 1.0 we should go through all the attributes and calculate

the entropy after the split and then choose the best attribute okay
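the entropy values quoted above can be reproduced in a few lines; this is a sketch of the standard formula, entropy equals minus the sum of p times log base 2 of p over the class proportions:

```python
import math

def entropy(counts):
    total = sum(counts)
    # -sum(p * log2(p)) over the proportion p of each class
    return -sum(c / total * math.log2(c / total) for c in counts if c)

print(round(entropy([9, 5]), 2))  # 0.94 -- whole data set: 9 drug B, 5 drug A
print(round(entropy([6, 2]), 2))  # 0.81 -- cholesterol normal: 6 drug B, 2 drug A
print(round(entropy([3, 3]), 2))  # 1.0  -- cholesterol high: perfectly mixed
```

as the transcript notes, you rarely compute this by hand; tree-building libraries do it for you at every candidate split.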

let’s try another field let’s choose the sex attribute for the next check as you

can see when we use the sex attribute to split the data when its value is female

we have three patients that respond to drug B and four patients that respond

to drug a the entropy for this node is 0.98 which is not very promising

however on the other side of the branch when the value of the sex attribute is

male the result is more pure with six for drug B and only one for drug a the

entropy for this group is zero point five nine now the question is between

the cholesterol and sex attributes which one is a better choice which one is

better as the first attribute to divide the data set into two branches or in

other words which attribute results in more pure nodes for our drugs or in

which tree do we have less entropy after splitting rather than before splitting

the sex attribute with entropy of 0.98 and 0.59 or the cholesterol attribute

with entropy of 0.81 and 1.0 in its branches

the answer is the tree with the higher information gain after splitting so

what is information gain information gain is the information that can

increase the level of certainty after splitting it is the entropy of a tree

before the split minus the weighted entropy after the split by an attribute

we can think of information gain and entropy as opposites as entropy or the

amount of randomness decreases the information gain or amount of

certainty increases and vice versa so constructing a decision tree is all

about finding attributes that return the highest information gain let’s see how

information gain is calculated for the sex attribute as mentioned the

information gain is the entropy of the tree before the split minus the weighted

entropy after the split the entropy of the tree

before the split is 0.94 the portion of female patients is 7 out of 14 and its

entropy is 0.985 also the portion of men is 7 out of 14 and the entropy of

the male node is 0.59 the result of the square bracket here is the weighted

entropy after the split so the information gain of the tree if we use

the sex attribute to split the data set is 0.151 as you can see we will

consider the entropy over the distribution of samples falling under

each leaf node and we’ll take a weighted average of that entropy weighted by the

proportion of samples falling under that leaf we can calculate the information

gain of the tree if we use cholesterol as well it is zero point zero four eight now

the question is which attribute is more suitable well as mentioned the tree with

the higher information gain after splitting this means the sex attribute

so we select the sex attribute as the first splitter now what is the next

attribute after branching by the sex attribute well as you can guess we

should repeat the process for each branch and test each of the other

attributes to continue to reach the most pure leaves this is the way you build a

decision tree hello and welcome in this video we’ll learn a machine learning

method called logistic regression which is used for classification in examining

this method we’ll specifically answer these three questions what is logistic

regression what kind of problems can be solved by logistic regression and in

which situations do we use logistic regression so let’s get started logistic

regression is a statistical and machine learning technique for classifying

records of a data set based on the values of the input fields let’s say we

have a telecommunication data set that we’d like to analyze in order to

understand which customers might leave us next month this is historical

customer data where each row represents one customer imagine that you’re an

analyst at this company and you have to find out who is leaving and why you’ll

use the data set to build a model based on historical records and use it to

predict the future churn within the customer group the dataset includes

information about services that each customer has signed up for customer

account information demographic information about customers like gender

and age range and also customers who’ve left the company within the last month

the column is called churn we can use logistic regression to build a model for

predicting customer churn using the given features in logistic regression we

use one or more independent variables such as tenure age and income to predict

an outcome such as churn which we call a dependent variable representing whether

or not customers will stop using the service logistic regression is analogous

to linear regression but tries to predict a categorical or discrete target

field instead of a numeric one in linear regression we might try to predict a

continuous value of variables such as the price of a house blood pressure of

the patient or fuel consumption of a car but in logistic regression we predict a

variable which is binary such as yes/no true/false successful or not successful

pregnant not pregnant and so on all of which can be coded as 0 or 1 in logistic

regression independent variables should be continuous if categorical they should be

dummy or indicator coded this means we have to transform them to some

continuous value please note that logistic regression can be used for both

binary classification and multi-class classification but for simplicity in

this video we’ll focus on binary classification let’s examine some

applications of logistic regression before we explain how they work

as mentioned logistic regression is a type of classification algorithm so it

can be used in different situations for example to predict the probability of a

person having a heart attack within a specified time period based on our

knowledge of the person’s age sex and body mass index or to predict the chance

of mortality in an injured patient or to predict whether a patient has a given

disease such as diabetes based on observed characteristics of that patient

such as weight height blood pressure and results of various blood tests and so on

in a marketing context we can use it to predict the likelihood of a customer

purchasing a product or halting a subscription as we’ve done in our churn

example we can also use logistic regression to predict the probability of

failure of a given process system or product we can even use it to predict

the likelihood of a homeowner defaulting on a mortgage these are all good

examples of problems that can be solved using logistic regression notice that in

all these examples not only do we predict the class of each case we also

measure the probability of a case belonging to a specific class there are

different machine learning algorithms which can classify or estimate a variable the

question is when should we use logistic regression here are four situations in

which logistic regression is a good candidate first when the target field in

your data is categorical or specifically is binary such as 0 1 yes/no churn or no

churn positive negative and so on second you need the probability of your

prediction for example if you want to know what the probability is of a

customer buying a product logistic regression returns a probability score

between 0 & 1 for a given sample of data in fact logistic regression predicts the

probability of that sample and we map the cases to a discrete class

based on that probability third if your data is linearly separable the decision

boundary of logistic regression is a line or a plane or a hyperplane a

classifier will classify all the points on one side of the decision boundary as

belonging to one class and all those on the other side as belonging to the other

class for example if we have just two features and are not applying any

polynomial processing we can obtain an inequality like theta 0 plus theta 1 x1

plus theta 2 x2 is greater than 0 which is a half-plane easily plottable please

note that in using logistic regression we can also achieve a complex decision

boundary using polynomial processing as well which is out of scope here you’ll

get more insight from decision boundaries when you understand how

logistic regression works fourth you need to understand the impact of a

feature you can select the best features based on the statistical significance of

the logistic regression model coefficients or parameters that is after

finding the optimum parameters a feature X with the weight theta 1 close to 0 has

a smaller effect on the prediction than features with large absolute values of

theta 1 indeed it allows us to understand the

impact an independent variable has on the dependent variable while controlling

other independent variables let’s look at our data set again we define the

independent variables as X and dependent variable as Y notice that for the sake

of simplicity we can code the target or dependent values to 0 or 1 the goal of

logistic regression is to build a model to predict the class of each sample

which in this case is a customer as well as the probability of each sample

belonging to a class given that let’s start to formalize the problem X is our

data set in the space of real numbers of M by n that is of M dimensions or

features and n records and Y is the class that we want to predict which can

be either 0 or 1 ideally a logistic regression model so called y hat can

predict that the class of the customer is 1 given its features X it can also be

shown quite easily that the probability of a customer being in class 0 can be

calculated as 1 minus the probability that the class of the customer is 1
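This one-minus relationship is exactly what scikit-learn exposes through `predict_proba`, which returns both class probabilities side by side. A minimal sketch, using a tiny churn data set that is made up purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical churn data: each row is [tenure, age]; label 1 means churn.
X = np.array([[1, 22], [2, 25], [3, 30], [8, 45], [9, 50], [10, 52]])
y = np.array([1, 1, 1, 0, 0, 0])

model = LogisticRegression().fit(X, y)

# Columns of predict_proba are [P(y=0 | x), P(y=1 | x)].
p0, p1 = model.predict_proba(X[:1])[0]
print(p0, p1)  # the two probabilities always sum to 1
```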

hello and welcome in this video we will learn the difference between linear

regression and logistic regression we go over linear regression and see why it

cannot be used properly for some binary classification problems we also look at

the sigmoid function which is the main part of logistic regression let’s start

let’s look at the telecommunication data set again the goal of logistic

regression is to build a model to predict the class of each customer and

also the probability of each sample belonging to a class ideally we want to

build a model y hat that can estimate that the class of a customer is 1 given

its features X I want to emphasize that Y is the labels vector also called

actual values that we would like to predict and Y hat is the vector of the

predicted values by our model mapping the class labels to integer numbers can

we use linear regression to solve this problem first let’s recall how linear

regression works to better understand logistic regression forget about the

churn prediction for a minute and assume our goal is to predict the income of

customers in the data set this means that instead of predicting churn which

is a categorical value let’s predict income which is a

continuous value so how can we do this let’s select an independent variable

such as customer age and predict a dependent variable such as income of

course we can have more features but for the sake of simplicity let’s just take

one feature here we can plot it and show age as an independent variable and

income as the target value we would like to predict with linear regression

you can fit a line or polynomial through the data we can find this line through

training our model or calculating it mathematically based on the sample sets

we’ll say this is a straight line through the sample set this line has an

equation shown as a plus B x1 now use this line to predict the continuous

value Y that is use this line to predict the income of an unknown customer based

on his or her age and it is done what if we want to predict churn can we use the

same technique to predict a categorical field such as churn okay let’s see say

we’re given data on customer churn and our goal this time is to predict the

churn of customers based on their age we have a feature age denoted as x1 and a

categorical feature churn with two classes churn is yes and churn is no

as mentioned we could map yes and no to integer values 0 and 1 how can we model

it now well graphically we could represent our data with a scatter plot

but this time we have only two values for the y axis in this plot class 0 is

denoted in red and class 1 is denoted in blue our goal here is to make a model

based on existing data to predict if a new customer is red or blue let’s do the

same technique that we use for linear regression here to see if we can solve

the problem for a categorical attribute such as churn with linear regression you

again can fit a polynomial through the data which is shown traditionally as a

plus B X this polynomial can also be shown traditionally as theta 0 plus

theta 1 x 1 this line has two parameters which are shown with vector theta where

the values of the vector are theta 0 and theta 1 we can also show the equation of this

line formally as theta transpose X and generally we can show the equation for a

multi-dimensional space as theta transpose X where theta is the

parameters of the line in two-dimensional space or parameters of a

plane in three-dimensional space and so on as theta is a vector of parameters

and is supposed to be multiplied by X it is shown conventionally as theta

transpose theta is also called the weight vector or coefficients of the equation

with both these terms used interchangeably and X is the feature set

which represents a customer anyway given a dataset all the feature sets X theta

parameters can be calculated through an optimization algorithm or mathematically

which results in the equation of the fitting line for example the parameters

of this line are minus 1 and 0.1 and the equation for the line is minus 1 plus

0.1 X 1 now we can use this regression line to predict the churn of a new

customer for example for our customer or let’s say a data point with X value of

age equals 13 we can plug the value into the line formula and the Y value is

calculated and returns a number for instance for point p1 we have theta

transpose x equals minus 1 plus 0.1 times x1 equals minus 1 plus 0.1 times 13 equals

0.3 we can show it on our graph now we can define a threshold here for example

at 0.5 to define the class so we write a rule here for our model Y hat which

allows us to separate class 0 from class 1 if the value of theta transpose X is

less than 0.5 then the class is 0 otherwise if the value of theta

transpose X is more than 0.5 then the class is 1 and because our customers Y

value is less than the threshold we can say it belongs to class 0 based on our

model but there is one problem here what is the probability that this customer

belongs to class 0 as you can see it’s not the best model to solve this problem
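The line-plus-threshold approach just described can be written out in a few lines of Python, reusing the example parameters from the video (theta 0 = -1, theta 1 = 0.1) and the customer aged 13:

```python
# Line parameters taken from the example in the video.
theta0, theta1 = -1.0, 0.1

def predict_class(age, threshold=0.5):
    score = theta0 + theta1 * age   # theta transpose x for one feature
    return 1 if score >= threshold else 0

score = theta0 + theta1 * 13        # -1 + 0.1 * 13 = 0.3
print(round(score, 2), predict_class(13))  # 0.3 is below 0.5, so class 0
```

Note that the score is a raw number, not a probability, which is exactly the shortcoming being discussed.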

also there are some other issues which verify that linear regression is not the

proper method for classification problems so as mentioned if we use the

regression line to calculate the class of a point it always returns a number

such as three or negative two and so on then we should use a threshold for

example zero point five to assign that point to either class of zero or one

this threshold works as a step function that outputs zero or one regardless of

how big or small positive or negative the input is so using the threshold we

can find the class of a record notice that in the step function no matter how

big the value is as long as it’s greater than 0.5 it simply equals 1 and vice

versa regardless of how small the value Y is the output would be zero if it is

less than 0.5 in other words there is no difference between a customer who has a

value of 1 or 1,000 the outcome would be 1 instead of having this step function

wouldn’t it be nice if we had a smoother line one that would project these values

between 0 and 1 indeed the existing method does not really give us the

probability of a customer belonging to a class which is very desirable we need a

method that can give us the probability of falling in a class as well so what is

the scientific solution here well if instead of using theta transpose

X we use a specific function called sigmoid then sigmoid of theta transpose

X gives us the probability of a point belonging to a class instead of the

value of y directly I’ll explain the sigmoid function in a second but for now

please accept that it will do the trick instead of calculating the value of

theta transpose X directly it returns the probability that a theta transpose X

is very big or very small it always returns a value between 0 and 1

depending on how large the theta transpose X actually is now our model is

sigmoid of theta transpose X which represents the probability that the

output is 1 given X now the question is what is the sigmoid function let me

explain in detail what sigmoid really is the sigmoid function also called the

logistic function resembles the step function and is defined

by the following expression in the logistic regression the sigmoid function

looks a bit complicated at first but don’t worry about remembering this

equation it’ll make sense to you after working with it notice that in the

sigmoid equation when theta transpose X is very big the e power minus theta

transpose X in the denominator of the fraction becomes almost zero and the

value of the sigmoid function gets closer to one if theta transpose X is

very small the sigmoid function gets closer to zero as depicted in the

sigmoid plot when theta transpose X gets bigger the value of the sigmoid function

gets closer to 1 and also if the theta transpose X is very small the sigmoid

function gets closer to 0 so the sigmoid functions output is always between 0 and

1 which makes it proper to interpret the results as probabilities it is obvious

that when the outcome of the sigmoid function gets closer to 1 the

probability of y equals 1 given X goes up and in contrast when the sigmoid

value is closer to 0 the probability of y equals 1 given X is very small so what

is the output of our model when we use the sigmoid function in logistic

regression we model the probability that an input X belongs to the default class

y equals 1 and we can write this formula as probability of y equals 1 given X we

can also write probability of Y belongs to class 0 given X is 1 minus

probability of y equals 1 given X for example the probability of a customer

leaving the company can be shown as probability of churn equals 1 given a

customer’s income and age which can be for instance 0.8 and the probability of

churn is 0 for the same customer given a customer’s income and age can be

calculated as 1 minus 0.8 equals 0.2 so now our job is to train the model to set

its parameter values in such a way that our model is a good estimate of

probability of y equals 1 given X in fact this is what a good classifier

model built by logistic regression is supposed to do for us
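The sigmoid-as-probability idea can be checked numerically; the snippet below is a minimal sketch using only the standard library.

```python
import math

def sigmoid(z):
    """Logistic function: squeezes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# A large positive theta-transpose-x gives a probability near 1,
# a large negative one gives a probability near 0, and 0 gives exactly 0.5.
print(sigmoid(10), sigmoid(-10), sigmoid(0))

# P(y = 0 | x) is simply 1 minus P(y = 1 | x) for any input z.
z = 2.0
print(sigmoid(z) + (1 - sigmoid(z)))  # always 1
```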

also it should be a good estimate of probability of y belongs to class zero

given X that can be shown as 1 minus sigmoid of theta transpose X now the

question is how can we achieve this we can find theta through the training

process so let’s see what the training process is step 1 initialize theta

vector with random values as with most machine learning algorithms for example

minus 1 or 2 step 2 calculate the model output which is sigmoid of theta

transpose X for a sample customer in your training set X in theta transpose X

is the feature vector for example the age and income of the customer for

instance 2 & 5 and theta is the coefficient or weight vector that you've set in

the previous step the output of this equation is the prediction value in

other words the probability that the customer belongs to class 1 step 3

compare the output of our model y hat which could be a value of let’s say zero

point 7 with the actual label of the customer which is for example one for

churn then record the difference as our model's error for this customer which

would be one minus zero point seven which of course equals zero point three
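Steps 2 and 3 for a single customer can be sketched as follows; the feature values [2, 5] come from the example, while the weights here are made up so that the prediction lands near the 0.7 of the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

theta = [0.1, 0.13]   # made-up weights, chosen so y_hat comes out near 0.7
x = [2, 5]            # one customer's features, e.g. scaled age and income

z = sum(t * xi for t, xi in zip(theta, x))  # theta transpose x = 0.85
y_hat = sigmoid(z)                          # model output, about 0.7
y = 1                                       # actual label: the customer churned
error = y - y_hat                           # about 0.3 for this customer
print(round(y_hat, 2), round(error, 2))
```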

this is the error for only one customer out of all the customers in the training

set step four calculate the error for all customers as we did in the previous

steps and add up these errors the total error is the cost of your model and is

calculated by the models cost function the cost function by the way basically

represents how to calculate the error of the model which is the difference

between the actual and the models predicted values so the cost shows how

poorly the model is estimating the customers labels therefore the lower the

cost the better the model is at estimating the customers labels

correctly and so what we want to do is to try to minimize this cost step 5 but

because the initial values for theta were chosen randomly it’s very likely

that the cost function is very high so we change the theta in such a way to

hopefully reduce the total cost step 6 after changing the values of theta we go

back to step 2 then we start another iteration and

calculate the cost of the model again and we keep doing those steps over and

over changing the values of theta each time until the cost is low enough so

this brings up two questions first how can we change the values of theta so

that the cost is reduced across iterations and second when should we

stop the iterations there are different ways to change the values of theta but

one of the most popular ways is gradient descent also there are various ways to

stop iterations but essentially you stop training by calculating the accuracy of

your model and stop it when it’s satisfactory hello and welcome in this

video we’ll learn more about training a logistic regression model also we’ll be

discussing how to change the parameters of the model to better estimate the

outcome finally we'll talk about the cost function and gradient descent in

logistic regression as a way to optimize the model so let’s start the main

objective of training in logistic regression is to change the parameters

of the model so as to be the best estimation of the labels of the samples

in the dataset for example the customer churn how do we do that in brief first

we have to look at the cost function and see what the relation is between the

cost function and the parameters theta so we should formulate the cost function

then using the derivative of the cost function we can find how to change the

parameters to reduce the cost or rather the error let’s dive into it to see how

it works but before I explain it I should highlight for you that it needs

some basic mathematical background to understand it however you shouldn’t

worry about it as most data science languages like

Python R and Scala have some packages or libraries that calculate these

parameters for you so let’s take a look at it let’s first find the cost function

equation for a sample case to do this we can use one of the customers in the

churn problem there’s normally a general equation for calculating the cost the

cost function is the difference between the actual values of y and our model

output Y hat this is a general rule for most cost functions in machine learning

we can show this as the cost of our model comparing it with actual labels

which is the difference between the predicted value of our model and actual

value of the target field where the predicted value of our model is sigmoid

of theta transpose X usually the square of this equation is

used because of the possibility of the negative result and for the sake of

simplicity of the derivative process half of this value is considered as

the cost function now we can write the cost function for all the samples in

our training set for example for all customers we can write it as the average

sum of the cost functions of all cases it is also called the mean squared error

and as it is a function of a parameter vector theta it is shown as J of theta
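J of theta as just described, half the mean squared error of the sigmoid outputs, can be computed directly; the customers and candidate weights below are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mse_cost(theta, X, y):
    """Half the mean squared difference between labels and model outputs."""
    total = 0.0
    for xi, yi in zip(X, y):
        y_hat = sigmoid(sum(t * f for t, f in zip(theta, xi)))
        total += (yi - y_hat) ** 2
    return 0.5 * total / len(y)

# Made-up customers with two scaled features each and churn labels.
X = [[2.0, 5.0], [1.0, 2.0], [4.0, 1.0]]
y = [1, 0, 1]
print(mse_cost([0.1, 0.13], X, y))   # cost for one candidate theta
```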

okay good we have the cost function now how do we find or set the best weights

or parameters that minimize this cost function

the answer is we should calculate the minimum point of this cost function and

it’ll show us the best parameters for our model although we can find the

minimum point of a function using the derivative of a function there’s not an

easy way to find the global minimum point for such an equation given this

complexity describing how to reach the global minimum for this equation is

outside the scope of this video so what is the solution well we should find

another cost function instead one which has the same behavior that is easier to

find its minimum point let’s plot the desirable cost function for our model

recall that our model is y hat our actual value is y which equals 0 or 1

and our model tries to estimate it as we want to find a simple cost function for

our model for a moment assume that our desired value for Y is 1 this means our

model is best if it estimates y equals 1 in this case we need a cost function

that returns zero if the outcome of our model is one which is the same as the

actual label and the cost should keep increasing as the outcome of our model

is farther from one and cost should be very large if the outcome of our model

is close to zero we can see that the minus log function provides such a cost

function for us it means if the actual value is 1 and the model also predicts 1

the minus log function returns zero cost but if the prediction is smaller than 1

the minus log function returns a larger cost value so we can use the minus log

function for calculating the cost of our logistic regression model so if you

recall we previously noted that in general it is difficult to calculate the

derivative of the cost function well we can now change it with a minus

log of our model we can easily prove that in the case that desirable y is 1

the cost can be calculated as minus log y hat and in the case that desirable y

is 0 the cost can be calculated as minus log 1 minus y hat now we can plug it

into our total cost function and rewrite it as this function so this is the

logistic regression cost function as you can see for yourself it penalizes

situations in which the class is 0 and the model output is 1 and vice versa
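That penalty behavior is easy to see numerically: a model that is confidently right pays almost nothing, while one that is confidently wrong pays a large cost. The labels and predictions below are made up.

```python
import math

def log_loss(y, y_hat):
    """-log(y_hat) when y is 1, -log(1 - y_hat) when y is 0, averaged."""
    costs = [-math.log(p) if label == 1 else -math.log(1 - p)
             for label, p in zip(y, y_hat)]
    return sum(costs) / len(costs)

y = [1, 0, 1]
confident_right = [0.9, 0.1, 0.9]   # predictions close to the true labels
confident_wrong = [0.1, 0.9, 0.1]   # predictions far from the true labels

print(round(log_loss(y, confident_right), 3))  # small cost
print(round(log_loss(y, confident_wrong), 3))  # much larger cost
```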

remember however that Y hat does not return a class as output but it’s a

value between 0 and 1 which should be interpreted as a probability now we can easily use

this function to find the parameters of our model in such a way as to minimize

the cost ok let’s recap what we’ve done our objective was to find a model that

best estimates the actual labels finding the best model means finding the best

parameters theta for that model so the first question was how do we find the

best parameters for our model well by finding and minimizing the cost function

of our model in other words to minimize the J of theta we just defined the next

question is how do we minimize the cost function

the answer is using an optimization approach there are different

optimization approaches but we use one of the most famous and effective

approaches here gradient descent the next question is what is gradient

descent generally gradient descent is an iterative approach to finding the

minimum of a function specifically in our case gradient descent is a technique

to use the derivative of a cost function to change the parameter values to

minimize the cost or error let’s see how it works the main objective of gradient

descent is to change the parameter values so as to minimize the cost

how can gradient descent do that think of the parameters or weights in our

model to be in a two dimensional space for example theta 1 theta 2 for two

feature sets age and income recall the cost function J that we discussed in the

previous slides we need to minimize the cost function J which is a function of

variables theta 1 and theta 2 so let's add a dimension for the observed cost

or error J function let’s assume that if we plot the cost function based on all

possible values of theta 1 and theta 2 we can see something like this it

represents the error value for different values of parameters that is error which

is a function of the parameters this is called your error curve or error bowl of

your cost function recall that we want to use this error Bowl to find the best

parameter values that result in minimizing the cost value now the

question is which point is the best point for your cost function yes you

should try to minimize your position on the error curve so what should you do

you have to find the minimum value of the cost by changing the parameters but

which way should you go will you add some value to your weights or deduct some value and

how much would that value be you can select random parameter values that

locate a point on the bowl you can think of our starting point being the yellow

point you change the parameters by delta theta1 and delta theta2 and take one

step on the surface let’s assume we go down one step in the bowl as long as

we’re going downwards we can go one more step the steeper the slope the further we

can step and we can keep taking steps as we approach the lowest point the slope

diminishes so we can take smaller steps until we reach a flat surface this is

the minimum point of our curve and the optimum theta1 theta2

what are these steps really I mean in which direction should we take these

steps to make sure we descend and how big should the steps be to find the

direction and size of these steps in other words to find how to update the

parameters you should calculate the gradient of the cost function at that

point the gradient is the slope of the surface

at every point and the direction of the gradient is the direction of the

greatest uphill now the question is how do we calculate the gradient of a cost

function at a point if you select a random point on this surface for example

the yellow point and take the partial derivative of J of theta with respect to

each parameter at that point it gives you the slope of the move for each

parameter at that point now if we move in the opposite direction of that slope

it guarantees that we go down in the error curve for example if we calculate

the derivative of J with respect to theta one we find out that it is a

positive number this indicates that the function is

increasing as theta one increases so to decrease J we should move in the

opposite direction this means to move in the direction of

the negative derivative for theta one that is the slope we have to calculate it for the other

parameters as well at each step the gradient value also indicates how big of

a step to take if the slope is large we should take a large step because we’re

far from the minimum if the slope is small we should take a smaller step
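The step rule just described, move opposite the slope with a step length tied to it, can be sketched with NumPy; the data set, learning rate, and iteration count below are all made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient(theta, X, y):
    """Partial derivatives of the log-loss cost with respect to each theta."""
    y_hat = sigmoid(X @ theta)
    return X.T @ (y_hat - y) / len(y)

# Made-up data: a column of 1s for the intercept plus two scaled features.
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 1.0],
              [1.0, 4.0, 5.0],
              [1.0, 5.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

theta = np.zeros(3)
mu = 0.1                                        # learning rate: the step length
for _ in range(500):
    theta = theta - mu * gradient(theta, X, y)  # step opposite the slope

print(sigmoid(X @ theta).round(2))              # predictions approach the labels
```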

gradient descent takes increasingly smaller steps towards the minimum with

each iteration the partial derivative of the cost function J is calculated using

this expression if you want to know how the derivative of the J function is

calculated you need to know the derivative concept which is beyond our

scope here but to be honest you don’t really need to remember all the details

about it as you can easily use this equation to calculate the gradients so

in a nutshell this equation returns the slope of that point and we should update

the parameter in the opposite direction of the slope a vector of all these

slopes is the gradient vector and we can use this vector to change or update all

the parameters we take the previous values of the parameters and subtract

the error derivative this results in the new parameters for theta that we know

will decrease the cost also we multiply the gradient value by a constant value

mu which is called the learning rate learning rate gives us additional

control on how fast we move on the surface in sum we can simply say

gradient descent is like taking steps in the current direction of the slope and

the learning rate is like the length of the step you take so these would be our

new parameters notice that it’s an iterative operation and in each

iteration we update the parameters and minimize the cost until the algorithm

converges on an acceptable minimum okay let’s recap what we’ve done to this

point by going through the training algorithm again step by step step one we

initialize the parameters with random values step two we feed the cost

function with the training set and calculate the cost we expect a high

error rate as the parameters are set randomly step three we calculate the

gradient of the cost function keeping in mind that we have to use a partial

derivative so to calculate the gradient vector we need all the training data to

feed the equation for each parameter of course this is an expensive part of

the algorithm but there are some solutions for this step 4

we update the weights with new parameter values step 5

here we go back to step 2 and feed the cost function again which has new

parameters as was explained earlier we expect less error as we’re going down

the error surface we continue this loop until we reach a small value of cost or

some limited number of iterations step 6 the parameters should be roughly found

after some iterations this means the model is ready and we can use it to

predict the probability of a customer staying or leaving hello and welcome in

this video we will learn a machine learning method called support vector

machine or SVM which is used for classification so let’s get started

imagine that you’ve obtained a dataset containing characteristics of thousands

of human cell samples extracted from patients who were believed to be at risk

of developing cancer analysis of the original data showed that many of the

characteristics differed significantly between benign and malignant samples you

can use the values of the cell characteristics and samples from other

patients to give an early indication of whether a new sample might be benign or

malignant you can use support vector machine or SVM as a classifier to train

your model to understand patterns within the data that might show benign or

malignant cells once the model has been trained it can be used to predict your

new or unknown cell with rather high accuracy now let me give you a formal

definition of SVM a support vector machine is a supervised algorithm that

can classify cases by finding a separator SVM works by first mapping

data to a high dimensional feature space so that data points can be categorized

even when the data are not otherwise linearly separable then a separator is

estimated for the data the data should be transformed in such a way that a

separator could be drawn as a hyperplane for example consider the following

figure which shows the distribution of a small set of cells only based on their

unit size and clump thickness as you can see the data points fall into two

different categories the two categories can be separated with a curve but not

a line that is it represents a linearly non-separable data set which is the case

for most real-world data sets we can transfer this data to a higher

dimensional space for example mapping it to a three dimensional space after the

transformation the boundary between the two categories can be defined by a

hyperplane as we are now in three dimensional space the separator is shown

as a plane this plane can be used to classify new or unknown cases therefore

the SVM algorithm outputs an optimal hyperplane that categorizes new examples
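In scikit-learn that trained hyperplane comes from the SVC class; the cell-sample numbers below are made up purely for illustration.

```python
import numpy as np
from sklearn import svm

# Hypothetical cell samples: [clump_thickness, unit_size];
# label 0 means benign, 1 means malignant (values are made up).
X = np.array([[1, 1], [2, 1], [1, 2], [8, 9], [9, 8], [9, 9]])
y = np.array([0, 0, 0, 1, 1, 1])

# The RBF kernel maps the data into a higher-dimensional space;
# in practice you would try several kernels and compare results.
clf = svm.SVC(kernel="rbf").fit(X, y)

print(clf.predict([[2, 2], [8, 8]]))   # classify new, unknown samples
print(len(clf.support_vectors_))       # only support vectors are kept
```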

now there are two challenging questions to consider first how do we transfer

data in such a way that a separator could be drawn as a hyperplane and two

how can we find the best or optimized hyperplane separator after

transformation let’s first look at transforming data to see how it works

for the sake of simplicity imagine that our data set is one dimensional data

this means we have only one feature X as you can see it is not linearly separable

so what can we do here well we can transfer it into a two dimensional space

for example you can increase the dimension of data by mapping X into a

new space using a function with outputs x and x squared now the data is linearly

separable right notice that as we are in a two dimensional space the hyperplane

is a line dividing a plane into two parts where each class lies on either

side now we can use this line to classify new cases basically mapping

data into a higher dimensional space is called kernelling and the mathematical

function used for the transformation is known as the kernel function and can be

of different types such as linear polynomial radial basis function or RBF

and sigmoid each of these functions has its own characteristics its pros and

cons and its equation but the good news is that you don’t need to know them as

most of them are already implemented in libraries of data science programming

languages also as there’s no easy way of knowing which function performs best

with any given data set we usually choose different functions in turn

and compare the results now we get to another question specifically how do we

find the right or optimized separator after transformation basically SVMs are

based on the idea of finding a hyperplane that best divides a dataset

into two classes as shown here as we’re in a two dimensional space you can think

of the hyperplane as a line that linearly separates the blue points from

the red points one reasonable choice as the best hyperplane is the one that

represents the largest separation or margin between the two classes so the

goal is to choose a hyperplane with as big a margin as possible examples

closest to the hyperplane are support vectors it is intuitive that only

support vectors matter for achieving our goal and thus other training examples

can be ignored we try to find the hyperplane in such a way that it has the

maximum distance to support vectors please note that the hyperplane and

decision boundary lines have their own equations so finding the optimized

hyperplane can be formalized using an equation which involves quite a bit more

math so I’m not going to go through it here in detail that said the hyperplane

is learned from training data using an optimization procedure that maximizes

the margin and like many other problems this optimization problem can also be

solved by gradient descent which is out of scope of this video therefore the

output of the algorithm is the values W and B for the line you can make

classifications using this estimated line it is enough to plug in input

values into the line equation then you can calculate whether an unknown point

is above or below the line if the equation returns a value greater than 0

then the point belongs to the first class which is above the line and vice

versa the two main advantages of support vector machines are that they’re

accurate in high dimensional spaces and they use a subset of training points in

the decision function called support vectors so it’s also memory efficient

the disadvantages of support vector machines include the fact that the

algorithm is prone to overfitting if the number of features is much greater

than the number of samples also SVMs do not directly

provide probability estimates which are desirable in most classification

problems and finally SVMs are not very efficient computationally if your

dataset is very big such as when you have more than 1,000 rows and now our

final question is in which situation should I use SVM well SVM is good for

image analysis tasks such as image classification and handwritten digit

recognition also SVM is very effective in text mining tasks particularly due to

its effectiveness in dealing with high dimensional data for example it is used

for detecting spam text category assignment and sentiment analysis

another application of SVM is in gene expression data classification again

because of its power in high dimensional data classification SVM can also be used

for other types of machine learning problems such as regression outlier

detection and clustering I’ll leave it to you to explore more about these

particular problems hello and welcome in this video we’ll give you a high-level

introduction to clustering its applications and different types of

clustering algorithms let’s get started imagine that you have a customer data

set and you need to apply customer segmentation on this historical data

customer segmentation is the practice of partitioning a customer base into groups

of individuals that have similar characteristics it is a significant

strategy as it allows a business to target specific groups of customers so

as to more effectively allocate marketing resources for example one

group might contain customers who are high profit and low risk that is more

likely to purchase products or subscribe to a service knowing this

information allows a business to devote more time and attention to retaining

these customers another group might include customers from nonprofit

organizations and so on a general segmentation process is not

usually feasible for large volumes of varied data therefore you need an

analytical approach to deriving segments and groups from large data sets

customers can be grouped based on several factors including age gender

interest spending habits and so on the important requirement is to use the

available data to understand and identify how customers are

similar to each other let’s learn how to divide a set of customers into

categories based on characteristics they share one of the most adopted approaches

that can be used for customer segmentation is clustering clustering

can group data in an unsupervised way based on the similarity of customers to each

other it will partition your customers into mutually exclusive groups for

example into three clusters the customers in each cluster are similar to

each other demographically now we can create a profile for each group

considering the common characteristics of each cluster for example the first

group is made up of affluent and middle-aged customers the second is made

up of young educated and middle-income customers and the third group includes

young and low-income customers finally we can assign each individual in our

dataset to one of these groups or segments of customers now imagine that

you cross-join this segmented data set with the data set of the products or

services that customers purchase from your company this information would

really help to understand and predict the differences in individual customers

preferences and their buying behaviors across various products indeed having

this information would allow your company to develop highly personalized

experiences for each segment customer segmentation is one of the popular

usages of clustering cluster analysis also has many other applications in

different domains so let’s first define clustering and then we’ll look at other

applications clustering means finding clusters in a data set unsupervised so

what is a cluster a cluster is a group of data points or objects in a data set

that are similar to other objects in the group and dissimilar to data points in

other clusters now the question is what is different between clustering and

classification let’s look at our customer data set again classification

algorithms predict categorical class labels this means assigning instances to

predefined classes such as defaulted or not defaulted for example if an analyst

wants to analyze customer data in order to know which customers might default on

their payments she uses a labeled data set as training data and uses

classification approaches such as a decision tree support vector machine

or SVM or logistic regression to predict the default value for a new or unknown

customer generally speaking classification is a

supervised learning where each training data instance belongs to a particular

class in clustering however the data is unlabeled and the process is

unsupervised for example we can use a clustering algorithm such as k-means to

group similar customers as mentioned and assign them to a cluster based on

whether they share similar attributes such as age education and so on while

I’ll be giving you some examples in different industries I’d like you to

think about more samples of clustering in the retail industry clustering is

used to find associations among customers based on their demographic

characteristics and use that information to identify buying patterns of various

customer groups also it can be used in recommendation systems to find a group

of similar items or similar users and use it for collaborative filtering to

recommend things like books or movies to customers in banking analyst find

clusters of normal transactions to find the patterns of fraudulent credit card

usage also they use clustering to identify clusters of customers for

instance to find loyal customers versus churn customers in the insurance

industry clustering is used for fraud detection in claims analysis or to

evaluate the insurance risk of certain customers based on their segments in

publication media clustering is used to auto categorize news based on its

content or to tag news then cluster it so as to recommend similar news articles

to readers in medicine it can be used to characterize patient behavior based on

their similar characteristics so as to identify successful medical therapies

for different illnesses or in biology clustering is used to group genes with

similar expression patterns or to cluster genetic markers to identify

family ties if you look around you can find many other applications of

clustering but generally clustering can be used for one of the following

purposes exploratory data analysis summary generation or reducing the scale

outlier detection especially to be used for fraud detection or noise removal

finding duplicates in datasets or as a pre-processing step for either

prediction other data mining tasks or as part of a

complex system let’s briefly look at different clustering algorithms and

their characteristics partition based clustering is a group of clustering

algorithms that produces sphere-like clusters such as k-means k median or

fuzzy c means these algorithms are relatively efficient and are used for

medium and large sized databases hierarchical clustering algorithms

produce trees of clusters such as agglomerative and divisive algorithms

this group of algorithms are very intuitive and are generally good for use

with small size datasets density based clustering algorithms produce arbitrary

shaped clusters they are especially good when dealing with spatial clusters or

when there is noise in your data set for example the DBSCAN algorithm hello and

welcome in this video we’ll be covering k-means clustering so let’s get started

imagine that you have a customer data set and you need to apply a customer

segmentation on this historical data customer segmentation is the practice of

partitioning a customer base into groups of individuals that have similar

characteristics one of the algorithms that can be used for customer

segmentation is k-means clustering k-means can group data in an unsupervised way

based on the similarity of customers to each other let’s define this technique

more formally there are various types of clustering algorithms such as

partitioning hierarchical or density based clustering k-means is a type of

partitioning clustering that is it divides the data into K

non-overlapping subsets or clusters without any cluster internal structure

or labels this means it’s an unsupervised algorithm objects within a

cluster are very similar and objects across different clusters are very

different or dissimilar as you can see for using k-means we have to find

similar samples for example similar customers now we face a couple of key

questions first how can we find the similarity of samples in clustering and

then how do we measure how similar two customers are with regard to their

demographics though the objective of k-means is to form clusters in such a

way that similar samples go into a cluster and dissimilar samples fall in

different clusters it can be shown that instead of a similarity metric we can

use dissimilarity metrics in other words conventionally the distance of samples

from each other is used to shape the clusters so we can say k-means tries to

minimize the intra cluster distances and maximize the inter cluster distances now

the question is how can we calculate the dissimilarity or distance of two cases

such as two customers assume that we have two customers will call them

customer one and two let’s also assume that we have only one feature for each

of these two customers and that feature is age we can easily use a specific type

of Minkowski distance to calculate the distance of these two customers indeed

it is the Euclidean distance the distance of x1 from x2 is the square root of

(34 minus 30) squared which is 4 what about if we have more than one feature for example age

and income for example if we have income and age for each customer we can still

use the same formula but this time in a two dimensional space also we can use

the same distance matrix for multi-dimensional vectors of course we

have to normalize our feature set to get the accurate dissimilarity measure there

are other dissimilarity measures as well that can be used for this purpose but it

is highly dependent on data type and also the domain that clustering is done

in for example you may use Euclidean

distance cosine similarity average distance and so on indeed the similarity

measure highly controls how the clusters are formed so it is recommended to

understand the domain knowledge of your data set and data type of features and

then choose the meaningful distance measurement now let’s see how K means

clustering works for the sake of simplicity let’s assume that our dataset

has only two features the age and income of customers this means it’s a two

dimensional space we can show the distribution of customers using a

scatterplot the y axis indicates age and the x axis shows income of customers we

try to cluster the customer data set into distinct groups or clusters based

on these two dimensions in the first step we should determine the number of

clusters the key concept of the k-means algorithm is that it randomly picks a

center point for each cluster it means we must initialize K which represents

number of clusters essentially determining the number of clusters in a

data set or K is a hard problem in k-means that we will discuss later for

now let’s put k equals three here for our sample data set in this slide we have

three representative points for our clusters these three data points are

called centroids of clusters and should be of the same feature size as our customer

feature set there are two approaches to choose these centroids one we can

randomly choose three observations out of the data set and use these

observations as the initial means or two we can create three random points as

centroids of the clusters which is our choice that is shown in the plot with

red color after the initialization step which was defining the centroid of each

cluster we have to assign each customer to the closest center for this

purpose we have to calculate the distance of each data point or in our

case each customer from the centroid points as mentioned before depending on

the nature of the data and the purpose for which clustering is being used

different measures of distance may be used to place items into clusters

therefore you will form a matrix where each row represents the distance of a

customer from each centroid it is called the distance matrix the main objective

of k-means clustering is to minimize the distance of data points from the

centroid of its cluster and maximize the distance from other cluster centroids

so in this step we have to find the closest centroid to each data point we

can use the distance matrix to find the nearest centroid to data points after finding

the closest centroids for each data point we assign each data point to that

cluster in other words all the customers will fall to a cluster based on their

distance from centroids we can easily say that it does not

result in good clusters because the centroids were chosen randomly at

first indeed the model would have a high error here

error is the total distance of each point from its centroid it can be shown as

within cluster sum of squares error intuitively we try to reduce this error
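this within-cluster sum of squares error can be computed directly; a small numpy sketch with a tiny hypothetical dataset and a fixed cluster assignment:

```python
# Sketch: within-cluster sum of squares (WCSS) for a tiny hypothetical
# 2-D dataset with a fixed cluster assignment.
import numpy as np

points = np.array([[7.4, 3.6], [7.8, 3.8], [1.0, 1.2], [1.2, 0.8]])
labels = np.array([0, 0, 1, 1])  # cluster index assigned to each point

# Each centroid is the mean of the points assigned to it.
centroids = np.array([points[labels == k].mean(axis=0) for k in (0, 1)])

# Sum of squared Euclidean distances of every point from its own centroid;
# k-means tries to shape clusters so this total is as small as possible.
wcss = sum(np.sum((points[labels == k] - centroids[k]) ** 2) for k in (0, 1))
print("centroids:", centroids)
print("WCSS:", wcss)
```
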

it means we should shape clusters in such a way that the

distance of all members of a cluster from its centroid be minimized now the

question is how can we turn it into better clusters with less error okay we

move centroids in the next step each cluster Center will be updated to be the

mean of the data points in its cluster indeed each centroid moves according to

their cluster members in other words the centroid of each of the three clusters

becomes the new mean for example if point A’s coordinates are seven point

four and three point six and point B features are seven point eight and three

point eight the new centroid of this cluster with two points would be the

average of them which is seven point six and three point seven now we have new

centroids as you can guess once again we will have to calculate the distance of

all points from the new centroids the points are re-clustered and the centroids

move again this continues until the centroids no

longer move please note that whenever a centroid moves each point’s distance to

the centroid needs to be measured again yes k-means is an iterative algorithm

and we have to repeat steps two to four until the algorithm converges in each

iteration it will move the centroids calculate the distances from new

centroids and assign data points to the nearest centroid it results in the

clusters with minimum error or the most dense clusters however as it is a

heuristic algorithm there is no guarantee that it will converge to the

global optimum and the result may depend on the initial clusters it means this

algorithm is guaranteed to converge to a result but the result may be a local

optimum ie not necessarily the best possible outcome to solve this problem

it is common to run the whole process multiple times with different starting

conditions this means with randomized starting centroids it may give a better

outcome and as the algorithm is usually very fast it wouldn’t be any

problem to run it multiple times hello and welcome in this video we’ll look at

k-means accuracy and characteristics let’s get started let’s define the

algorithm more concretely before we talk about its accuracy a k-means algorithm

works by randomly placing K centroids one for each cluster

the farther apart the clusters are placed the better the next step is to

calculate the distance of each data point or object from the centroids

Euclidean distance is used to measure the distance from the object to the

centroid please note however that you can also use different types of distance

measurements not just Euclidean distance Euclidean distance is used because it’s

the most popular then assign each data point or object to its closest centroid

creating a group next once each data point has been classified to a group

recalculate the position of the K centroids the new centroid position is

determined by the mean of all points in the group finally this continues until

the centroids no longer move now the question is how can we evaluate the

goodness of the clusters formed by k-means in other words how do we

calculate the accuracy of k-means clustering one way is to compare the

clusters with the ground truth if it’s available
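when ground truth labels do exist scikit-learn offers agreement scores for exactly this comparison; a sketch using the adjusted Rand index (the specific metric and the synthetic dataset are my assumptions, the course does not name them):

```python
# Sketch: evaluate k-means clusters against known ground-truth labels.
# The choice of adjusted Rand index and the synthetic data are assumptions.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, y_true = make_blobs(n_samples=300, centers=3, random_state=42)

km = KMeans(n_clusters=3, n_init=10, random_state=42)
y_pred = km.fit_predict(X)

# 1.0 means perfect agreement with ground truth, ~0.0 means random labeling.
print("adjusted Rand index:", adjusted_rand_score(y_true, y_pred))

# Without ground truth, km.inertia_ (within-cluster sum of squares) can be
# used as an internal error measure instead.
print("inertia:", km.inertia_)
```
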

however because k-means is an unsupervised algorithm we usually don’t

have ground truth in real world problems to be used but there is still a

way to say how bad each cluster is based on the objective of the k-means this

value is the average distance between data points within a cluster also

average of the distances of data points from their cluster centroids can be used

as a metric of error for the clustering algorithm essentially determining the

number of clusters in a dataset or K as in the k-means algorithm is a frequent

problem in data clustering the correct choice of K is often ambiguous because

it’s very dependent on the shape and scale of the distribution of points in a

data set there are some approaches to address this problem but one of the

techniques that is commonly used is to run the clustering across the different

values of K and looking at a metric of accuracy for clustering this metric can

be mean distance between data points and their cluster’s centroid which indicates

how dense our clusters are or to what extent we minimize the error of

clustering then looking at the change of this metric we can find the best value

for K but the problem is that with increasing the number of clusters the

distance of centroids to data points will always reduce this means increasing

K will always decrease the error so the value of the metric as a function of K

is plotted and the elbow point is determined where the rate

of decrease sharply shifts it is the right K for clustering this method is

called the elbow method so let’s recap k-means clustering k-means is a

partition based clustering which is A relatively efficient on medium and

large-sized datasets B produces sphere-like clusters because the

clusters are shaped around the centroids and C its drawback is that we should pre

specify the number of clusters and this is not an easy task hello and welcome in

this video we’ll be covering hierarchical clustering so let’s get

started let’s look at this chart an international team of scientists led by

UCLA biologists use this dendrogram to report genetic data from more than 900

dogs from 85 breeds and more than 200 wild gray wolves worldwide including

populations from North America Europe the Middle East and East Asia they use

molecular genetic techniques to analyze more than 48,000 genetic markers this

diagram shows hierarchical clustering of these animals based on the similarity in

their genetic data hierarchical clustering algorithms build a hierarchy

of clusters where each node is a cluster consisting of the clusters of its

daughter nodes strategies for hierarchical clustering generally fall

into two types divisive and agglomerative divisive is top-down

so you start with all observations in a large cluster and break it down into

smaller pieces think about divisive as dividing the

cluster agglomerative is the opposite of divisive so it is bottom-up where each

observation starts in its own cluster and pairs of clusters are merged

together as they move up the hierarchy agglomeration means to amass or collect

things which is exactly what this does with the cluster the agglomerative

approach is more popular among data scientists and so it is the main subject

of this video let’s look at a sample of agglomerative clustering this method

builds the hierarchy from the individual elements by progressively merging

clusters in our example let’s say we want to cluster 6 cities in Canada based

on their distances from one another they are Toronto Ottawa Vancouver

Montreal Winnipeg and Edmonton we construct a distance matrix at this

stage where the number in row i column j is the distance between the i

and j cities in fact this table shows the distances between each pair of

cities the algorithm is started by assigning each city to its own cluster

so if we have six cities we have six clusters each containing just one city
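as a sketch the distance matrix and the first merge step can be written in a few lines of numpy; the city distances below are rough illustrative road distances in kilometers, not the figures from the slides:

```python
# Sketch: a symmetric distance matrix for the six cities and the first
# agglomerative step (find the closest pair of clusters). Distances are
# rough illustrative values, not the exact course figures.
import numpy as np

cities = ["TO", "OT", "VA", "MO", "WI", "ED"]
D = np.array([
    [   0,  450, 4370,  540, 2100, 3490],   # Toronto
    [ 450,    0, 4510,  200, 2210, 3580],   # Ottawa
    [4370, 4510,    0, 4680, 2300, 1160],   # Vancouver
    [ 540,  200, 4680,    0, 2300, 3680],   # Montreal
    [2100, 2210, 2300, 2300,    0, 1300],   # Winnipeg
    [3490, 3580, 1160, 3680, 1300,    0],   # Edmonton
], dtype=float)

# Mask the zero diagonal, then take the smallest remaining entry.
masked = D + np.diag([np.inf] * len(cities))
i, j = np.unravel_index(np.argmin(masked), masked.shape)
print("first merge:", cities[i], cities[j])
```

with these illustrative numbers the smallest off-diagonal entry is the Ottawa-Montreal pair, so those two singleton clusters are merged first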

let’s note each city by showing the first two characters of its name the

first step is to determine which cities let’s call them clusters from now on to

merge into a cluster usually we want to take the two closest clusters according

to the chosen distance looking at the distance matrix Montreal and Ottawa are

the closest clusters so we make a cluster out of them please notice that

we just use a simple one-dimensional distance feature here but our objects can

be multi-dimensional and the distance measurement can be either Euclidean

Pearson average distance or many others depending on data type and domain

knowledge anyhow we have to merge these two closest cities in the distance

matrix as well so rows and columns are merged as the cluster is constructed as

you can see in the distance matrix rows and columns related to Montreal and

Ottawa cities are merged as the cluster is constructed then the distances from

all cities to this new merged cluster get updated but how for example how do

we calculate the distance from Winnipeg to the Ottawa Montreal cluster well

there are different approaches but let’s assume for example we just select the

distance from the centre of the Ottawa Montreal cluster to Winnipeg updating

the distance matrix we now have one less cluster next we look for the closest

clusters once again in this case Ottawa Montreal and Toronto are the closest

ones which creates another cluster in the next step the closest distance is

between the Vancouver cluster and the Edmonton cluster forming a new cluster

the data in the matrix table gets updated essentially the rows and columns

are merged as the clusters are merged and the distance updated this is a

common way to implement this type of clustering and has the benefit of

caching distances between clusters in the same way the agglomerative

algorithm proceeds by merging clusters and we repeat it until all clusters are

merged and the tree becomes completed it means until all cities are clustered

into a single cluster of size 6 hierarchical clustering is typically

visualized as a dendrogram as shown on this slide
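a sketch of building such a dendrogram with SciPy for the six-city example (illustrative distances; with matplotlib installed, dropping no_plot=True actually draws the tree):

```python
# Sketch: hierarchical clustering and a dendrogram with SciPy.
# The distance values are illustrative, not the exact course figures.
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

cities = ["TO", "OT", "VA", "MO", "WI", "ED"]
D = np.array([
    [   0,  450, 4370,  540, 2100, 3490],
    [ 450,    0, 4510,  200, 2210, 3580],
    [4370, 4510,    0, 4680, 2300, 1160],
    [ 540,  200, 4680,    0, 2300, 3680],
    [2100, 2210, 2300, 2300,    0, 1300],
    [3490, 3580, 1160, 3680, 1300,    0],
], dtype=float)

# linkage expects a condensed (upper-triangle) distance vector.
Z = linkage(squareform(D), method="average")

# Each of the n-1 rows of Z records one merge: the two clusters joined,
# the distance at which they merged, and the new cluster's size.
dn = dendrogram(Z, labels=cities, no_plot=True)
print("merges:", len(Z))
print("leaf order:", dn["ivl"])
```
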

each merge is represented by a horizontal line the y coordinate of the

horizontal line is the similarity of the two clusters that were merged where

cities are viewed as singleton clusters by moving up from the bottom layer to

the top node a dendrogram allows us to reconstruct the history of merges that

resulted in the depicted clustering essentially hierarchical clustering does

not require a pre specified number of clusters however in some applications we

want a partition of disjoint clusters just as in flat clustering in those

cases the hierarchy needs to be cut at some point for example here cutting at a

specific level of similarity we create three clusters of similar cities hello

and welcome in this video we’ll be covering more details about hierarchical

clustering let’s get started let’s look at the agglomerative algorithm for

hierarchical clustering remember that agglomerative clustering is a bottom-up

approach let’s say our dataset has n data points first we want to create n

clusters one for each data point then each point is assigned as a cluster next

we want to compute the distance proximity matrix which will be an N by n

table after that we want to iteratively run the following steps until the

specified cluster number is reached or until there is only one cluster left

first merge the two nearest clusters distances are computed already in the

proximity matrix second update the proximity matrix with the new values we

stop after we’ve reached the specified number of clusters or there is

only one cluster remaining with the results stored in the dendrogram so in

the proximity matrix we have to measure the distances between clusters and also

merge the clusters that are nearest so the key operation is the computation

of the proximity between the clusters with one point and also clusters with

multiple data points at this point there are a number of key questions that need

to be answered for instance how do we measure the distances between these

clusters and how do we define the nearest among clusters we also can ask

which points do we use first let’s see how to calculate the distance between

two clusters with one point each let’s assume that we have a data set of

patients and we want to cluster them using hierarchical clustering so our data

points are patients with a feature set of three dimensions for example age body

mass index or BMI and blood pressure we can use different distance measurements

to calculate the proximity matrix for instance Euclidean distance so if we

have a data set of n patients we can build an N by n dissimilarity distance

matrix it will give us the distance of clusters with one data point however as

mentioned we merge clusters in agglomerative clustering now the

question is how can we calculate the distance between clusters when there are

multiple patients in each cluster we can use different criteria to find the

closest clusters and merge them in general it completely depends on the

data type dimensionality of data and most importantly the domain knowledge of

the data set in fact different approaches to defining the distance

between clusters distinguish the different algorithms as you might

imagine there are multiple ways we can do this
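the distance between two multi-point clusters can be summarized from the pairwise point distances in several ways; a numpy sketch using two small hypothetical 2-D clusters:

```python
# Sketch: different ways to summarize the distance between two
# multi-point clusters, using small hypothetical 2-D clusters.
import numpy as np

A = np.array([[1.0, 1.0], [2.0, 1.0]])
B = np.array([[5.0, 1.0], [6.0, 1.0], [7.0, 1.0]])

# All pairwise Euclidean distances between points of A and points of B.
pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

print("shortest pair distance:   ", pairwise.min())    # 3.0
print("longest pair distance:    ", pairwise.max())    # 6.0
print("average of all pairs:     ", pairwise.mean())   # 4.5
print("centroid-to-centroid:     ",
      np.linalg.norm(A.mean(axis=0) - B.mean(axis=0)))  # 4.5
```
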

the first one is called single linkage clustering single linkage is defined as

the shortest distance between two points in each cluster such as point a and b

next up is complete linkage clustering this time we are finding the longest

distance between the points in each cluster

such as the distance between point A and B the third type of linkage is average

linkage clustering or the mean distance this means we’re looking at the average

distance of each point from one cluster to every point in another cluster the

final linkage type to be reviewed is centroid linkage clustering centroid is

the average of the feature sets of points in a cluster this linkage takes

into account the centroid of each cluster when determining the minimum

distance there are three main advantages to using hierarchical clustering first

we do not need to specify the number of clusters required for the algorithm

second hierarchical clustering is easy to implement and third the dendrogram

produced is very useful in understanding the data there are some disadvantages as

well first the algorithm can never undo any previous steps so for example the

algorithm clusters two points and later on we see that the connection was not a

good one the program cannot undo that step second the time complexity for the

clustering can result in very long computation times in comparison with

efficient algorithms such as k-means finally if we have a large data set it

can become difficult to determine the correct number of clusters by the

dendrogram now let’s compare hierarchical clustering with k-means

k-means is more efficient for large datasets in contrast to k-means

hierarchical clustering does not require the number of clusters to be specified

hierarchical clustering gives more than one partitioning depending on the

resolution whereas K means gives only one partitioning of the data
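a sketch of running agglomerative clustering in scikit-learn; because there is no random initialization, repeated runs on the same data return the same partition (the dataset and linkage choice are illustrative):

```python
# Sketch: sklearn's agglomerative clustering run twice on the same data;
# with no random initialization it returns the same partition each time.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=7)

# linkage can be "ward", "complete", "average" or "single".
labels_a = AgglomerativeClustering(n_clusters=3, linkage="average").fit_predict(X)
labels_b = AgglomerativeClustering(n_clusters=3, linkage="average").fit_predict(X)

print("identical partitions:", np.array_equal(labels_a, labels_b))  # True
```
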

hierarchical clustering always generates the same clusters in contrast with

k-means that returns different clusters each time it is run due to random

initialization of centroids hello and welcome in this video we’ll be covering

DBSCAN a density based clustering algorithm which is appropriate to use

when examining spatial data so let’s get started most of the traditional

clustering techniques such as k-means hierarchical and fuzzy clustering can be

used to group data in an unsupervised way however when applied to tasks with

arbitrary shaped clusters or clusters within clusters traditional techniques

might not be able to achieve good results that is elements in the same

cluster might not share enough similarity or the performance may be

poor additionally while partitioning based algorithms such as k-means may be

easy to understand and implement in practice the algorithm has no notion of

outliers that is all points are assigned to a cluster even if they do not belong

in any in the domain of anomaly detection this causes problems as

anomalous points will be assigned to the same cluster as

normal data points the anomalous points pull the cluster centroid towards them

making it harder to classify them as anomalous points in contrast density

based clustering locates regions of high density that are separated from one

another by regions of low density density in this context is defined as

the number of points within a specified radius a specific and very popular type

of density based clustering is DBSCAN DBSCAN is particularly effective for

tasks like class identification in a spatial context the wonderful attribute

of the DBSCAN algorithm is that it can find out any arbitrary shaped cluster

without getting affected by noise for example this map shows the location of

weather stations in Canada DBSCAN can be used here to find the group of

stations which show the same weather condition as you can see it not only

finds different arbitrary shape clusters it can find the denser part of data

centered samples by ignoring less dense areas or noises now let’s look at this

clustering algorithm to see how it works DBSCAN stands for density based spatial

clustering of applications with noise this technique is one of the most common

clustering algorithms which works based on density of objects DBSCAN works on

the idea that if a particular point belongs to a cluster it should be near

to lots of other points in that cluster it works based on two parameters radius

and minimum points R determines a specified radius that if it includes

enough points within it we call it a dense area M determines the minimum

number of data points we want in a neighborhood to define the cluster let’s

define radius as two units for the sake of simplicity assume it as radius of two

centimeters around a point of interest also let’s set the minimum point

or M to be 6 points including the point of interest to see how DBSCAN works we

have to determine the type of points each point in our data set can be either

a core border or outlier point don’t worry I’ll explain what these points are

in a moment but the whole idea behind the DBSCAN algorithm is to visit each

point and find its type first then we group points as clusters based

on their types let’s pick a point randomly first we check to see whether it’s a

core data point so what is a core point a data point is a core point if within

the R neighborhood of the point there are at least M points for example as there

are six points in the 2-centimeter neighborhood of the red point we mark this

point as a core point okay what happens if it’s not a core point let’s look at

another point is this point a core point no as you can see there are only five

points in this neighborhood including the yellow point so what kind of point

is this one in fact it is a border point what is a border point a data point is a

border point if A its neighborhood contains less than M data points or B it

is reachable from some core point here reachability means it is within R

distance from a core point it means that even though the yellow point is within

the 2-centimeter neighborhood of the red point it is not by itself a core point

because it does not have at least six points in its neighborhood we continue

with the next point as you can see it is also a core point and all points around

it which are not core points are border points next core point and next core

point let’s pick this point you can see it is not a core point nor is it a

border point so we label it as an outlier what is an outlier an outlier is

a point that is not a core point and also is not close enough to be reachable

from a core point we continue and visit all the points in the data set and label

them as either core border or outlier the next step is to connect core points that

are neighbors and put them in the same cluster so a cluster is formed as at

least one core point plus all reachable core points plus all their borders it

simply shapes all the clusters and finds outliers as well let’s review this one

more time to see why DBSCAN is cool DBSCAN can find arbitrarily shaped

clusters it can even find a cluster completely surrounded by a different

cluster DBSCAN has a notion of noise and is

robust to outliers on top of that DBSCAN makes it very practical for use in many

real-world problems because it does not require one to specify the number of

clusters such as K in k-means hello and welcome in this video we’ll be going

through a quick introduction to recommendation systems so let’s get

started even though people’s tastes may vary

they generally follow patterns by that I mean that there are similarities in the

things that people tend to like or another way to look at it is that people

tend to like things in the same category or things that share the same

characteristics for example if you’ve recently purchased a book on machine

learning and Python and you’ve enjoyed reading it it’s very likely that you’ll

also enjoy reading a book on data visualization people also tend to have

similar tastes to those of the people they're close to in their lives

recommender systems try to capture these patterns and similar behaviors to help

predict what else you might like recommender systems have many

applications that I’m sure you’re already familiar with indeed recommender

systems are usually at play on many websites for example suggesting books on

Amazon and movies on Netflix in fact everything on Netflix's website is

driven by customer selection if a certain movie gets viewed frequently

enough Netflix’s recommender system ensures

that that movie gets an increasing number of recommendations another

example can be found in a daily use mobile app where a recommender engine is

used to recommend anything from where to eat to what job to apply to. social

media sites like Facebook or LinkedIn regularly recommend friendships

recommender systems are even used to personalize your experience on the web

for example when you go to a news platform website a recommender system

will make note of the types of stories that you clicked on and make

recommendations on which types of stories you might be interested in

reading in the future there are many of these types of examples and they are

growing in number every day so let’s take a closer look at the main benefits

of using a recommendation system one of the main advantages of using

recommendation systems is that users get a broader exposure to many different

products they might be interested in this exposure encourages users towards

continual usage or purchase of their product not only does this provide a

better experience for the user but it benefits the service provider as

well with increased potential revenue and better security for its customers

there are generally two main types of recommendation systems content-based and

collaborative filtering the main difference between each can be summed up

by the type of statement that a consumer might make for instance the main

paradigm of a content-based recommendation system is driven by the

statement show me more of the same of what I’ve liked before content based

systems try to figure out what a user’s favorite aspects of an item are and then

make recommendations on items that share those aspects collaborative filtering is

based on a user saying tell me what’s popular among my

neighbors because I might like it too collaborative filtering techniques find

similar groups of users and provide recommendations based on similar tastes

within that group in short it assumes that a user might be interested in what

similar users are interested in also there are hybrid recommender systems

which combine various mechanisms in terms of implementing recommender

systems there are two types memory based and model based in memory based

approaches we use the entire user item data set to generate a recommendation

it uses statistical techniques to approximate users or items examples of

these techniques include Pearson correlation cosine similarity and

Euclidean distance among others in model-based approaches a model of users

is developed in an attempt to learn their preferences models can be created

using machine learning techniques like regression clustering classification and

so on hello and welcome in this video we’ll be covering content-based

recommender systems so let’s get started a content-based recommendation system

tries to recommend items to users based on their profile the user's profile

revolves around that user's preferences and tastes it is shaped based on user

ratings including the number of times that user has clicked on different items

or perhaps even liked those items the recommendation process is based on the

similarity between those items similarity or closeness of items is

measured based on the similarity in the content of those items when we say

content we're talking about things like the item's

category tag genre and so on for example if we have four movies and if the user

likes or rates the first two items and if item three is similar to item one in

terms of their genre the engine will also recommend item three to the user in

essence this is what content-based recommender system engines do now let’s

dive into a content-based recommender system to see how it works let’s assume

we have a data set of only six movies this data set shows movies that our user

has watched and also the genre of each of the movies for example Batman vs

Superman is in the adventure superhero genre and guardians of the galaxy is in

the comedy adventure superhero and science fiction genres let's say the

user has watched and rated three movies so far and she has given a rating of 2

out of 10 to the first movie 10 out of 10 to the second movie and 8 out of 10 to

the third the task of the recommender engine is to recommend one of the three

candidate movies to this user or in other words we want to predict what the

user's possible rating would be of the three candidate movies if she were to

watch them to achieve this we have to build the user profile first we create a

vector to show the user's ratings for the movies that she's already watched we

call it input user ratings then we encode the movies through the one-hot

encoding approach movie genres are used here as the feature set we use the first

three movies to make this matrix which represents the movie feature set matrix

if we multiply these two matrices we can get the weighted feature set for the

movies let's take a look at the result this matrix is also called the weighted

genre matrix and represents the interests of the user for each genre

based on the movies that she's watched now given the weighted genre matrix

we can shape the profile of our active user essentially we can aggregate the

weight of genres and then normalize them to find the user profile it clearly

indicates that she likes superhero movies more than other genres we use

this profile to figure out what movie is proper to recommend to this user
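the profile construction described above (a one-hot genre matrix weighted by the ratings vector, then aggregated and normalized) can be sketched in Python. this is a minimal sketch: the ratings 2, 10, and 8 follow the example in the video, but the exact genre columns and the third movie's encoding are illustrative assumptions, not the course's actual data set:

```python
import numpy as np

# Three watched movies, rated 2, 10, and 8 out of 10 (as in the example)
input_ratings = np.array([2.0, 10.0, 8.0])

# One-hot encoded genres of the watched movies
# (columns: adventure, superhero, comedy, sci-fi -- illustrative only)
movie_genres = np.array([
    [1, 1, 0, 0],   # movie 1: adventure, superhero
    [1, 1, 1, 1],   # movie 2: all four genres
    [0, 1, 0, 1],   # movie 3: superhero, sci-fi (assumed)
])

# Weighted genre matrix: each movie's genre row scaled by its rating
weighted_genres = movie_genres * input_ratings[:, np.newaxis]

# Aggregate the weights per genre, then normalize to get the user profile
genre_totals = weighted_genres.sum(axis=0)
user_profile = genre_totals / genre_totals.sum()

print(user_profile)
```

with these numbers the superhero column gets the largest share of the profile, matching the observation that she likes superhero movies more than other genres.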

recall that we also had three candidate movies for recommendation that haven’t

been watched by the user we encode these movies as well now we're in the position

where we have to figure out which of them is most

suited to be recommended to the user to do this we simply multiply the user

profile matrix by the candidate movie matrix which results in the weighted

movies matrix it shows the weight of each genre with

respect to the user profile now if we aggregate these weighted ratings we get

the active user's possible interest level in these three movies in essence this is

our recommendation list which we can sort to rank the movies and recommend them to

the user for example we can say that The Hitchhiker's Guide to the Galaxy has the

highest score in our list and is proper to recommend to the user
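the candidate scoring and ranking just described (multiply the user profile by each candidate's genre encoding, aggregate, then sort) can be sketched the same way; the movie titles, profile values, and genre encodings below are made up for illustration:

```python
import numpy as np

# User profile over genres, e.g. as computed from the weighted genre matrix
# (columns: adventure, superhero, comedy, sci-fi -- illustrative numbers)
user_profile = np.array([0.20, 0.33, 0.17, 0.30])

# One-hot encoded genres for three candidate (unwatched) movies
candidates = {
    "Movie A": np.array([1, 0, 1, 0]),
    "Movie B": np.array([0, 1, 0, 1]),
    "Movie C": np.array([1, 1, 0, 1]),
}

# Weight each candidate's genres by the profile and aggregate;
# the sum is the user's predicted interest level in that movie
scores = {title: float((genres * user_profile).sum())
          for title, genres in candidates.items()}

# Sort to get the recommendation list, highest score first
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

the top of the sorted list is the movie to recommend; note that a candidate whose genres never appear in the profile would score zero, which is exactly the limitation discussed next.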

now you can go back and fill in the predicted ratings for the user so to

recap what we’ve discussed so far the recommendation in a content based system

is based on the user's taste and the content or feature set of items such a model is

very efficient however in some cases it doesn’t work for example assume that we

have a movie in the drama genre which the user has never watched so this genre

would not be in her profile therefore she'll only get recommendations

related to genres that are already in her profile and the recommender engine

may never recommend any movie within other genres this problem can be solved

by other types of recommender systems such as collaborative filtering hello

and welcome in this video we’ll be covering a recommender system technique

called collaborative filtering so let’s get started

collaborative filtering is based on the fact that relationships exist between

products and people’s interests many recommendation systems use collaborative

filtering to find these relationships and to give an accurate recommendation

of a product that the user might like or be interested in collaborative filtering

has basically two approaches user based and item based user based collaborative

filtering is based on the user's similarity or neighborhood item based

collaborative filtering is based on similarity among items let’s first look

at the intuition behind the user based approach in user based collaborative

filtering we have an active user for whom the recommendation

is aimed the collaborative filtering engine first looks for users who are

similar that is users who share the active user's rating patterns
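one common way to measure how similar two users' rating patterns are is Pearson correlation, which the course lists alongside cosine similarity and Euclidean distance. here is a minimal sketch over the co-rated movies of two users; the rating values themselves are made-up illustrations, not data from the course:

```python
import numpy as np

def pearson_similarity(a, b):
    """Pearson correlation between two users' ratings on the items both rated."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    a_c, b_c = a - a.mean(), b - b.mean()          # center each user's ratings
    denom = np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum())
    return float((a_c * b_c).sum() / denom) if denom else 0.0

# Ratings on the three movies both users have rated (illustrative numbers)
active_user = [8, 3, 7]
other_user  = [9, 2, 6]

print(pearson_similarity(active_user, other_user))
```

a value near +1 means the two users rise and fall together in their ratings, so the second user is a good neighbor for the active user; values near -1 indicate opposite tastes.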

collaborative filtering bases this similarity on things like history

preference and choices that users make when buying watching or enjoying

something for example movies that similar users have rated highly then it

uses the ratings from these similar users to predict the possible ratings by

the active user for a movie that she had not previously watched for instance if

two users are similar or are neighbors in terms of their movie interests we

can recommend to the active user a movie that her neighbor has already seen

now let’s dive into the algorithm to see how all of this works assume that we

have a simple user item matrix which shows the ratings of four users for five

different movies let’s also assume that our active user has watched and rated

three out of these five movies let’s find out which of the two movies that

our active user hasn't watched should be recommended to her the first step is to

discover how similar the active user is to the other users how do we do this

well this can be done through several different statistical and vector-based

techniques such as distance or similarity measurements including

Euclidean distance Pearson correlation cosine similarity and so on to calculate

the level of similarity between two users we use the three movies that both

the users have rated in the past regardless of what we use for similarity

measurement let’s say for example the similarity could be 0.7 0.9 and 0.4

between the active user and other users these numbers represent similarity

weights or proximity of the active user to other users in the dataset the next

step is to create a weighted rating matrix we just calculated the similarity

of users to our active user in the previous slide now we can use it to

calculate the possible opinion of the active user about our two target movies

this is achieved by multiplying the similarity weights by the user ratings

it results in a weighted ratings matrix which represents the user's neighbors'

opinion about our two candidate movies for recommendation in fact it

incorporates the behavior of other users and gives more weight to the ratings of

those users who are more similar to the active user now we can generate the

recommendation matrix by aggregating all of the weighted rates however as three

users rated the first potential movie and two users rated the second movie we

have to normalize the weighted rating values we do this by dividing it by the

sum of the similarity indices of the users the result is the potential rating that

our active user will give to these movies based on her similarity to other

users it is obvious that we can use it to rank the movies for providing

recommendation to our active user now let’s examine what’s different between

user based and item based collaborative filtering in the user based approach the

recommendation is based on users of the same neighborhood with whom he or she

shares common preferences for example as user 1 and user 3 both liked item 3 and

item 4 we consider them as similar or neighbor users and recommend item 1

which is positively rated by user 1 to user 3 in the item based approach

similar items build neighborhoods based on the behavior of users please note however

that it is not based on their contents for example item 1 and item 3 are

considered neighbors as they were positively rated by both user 1 and user

2 so item 1 can be recommended to user 3 as he has already shown interest in item

3 therefore the recommendations here are based on the items in the neighborhood

that a user might prefer collaborative filtering is a very effective

recommendation system however there are some challenges with it as well one of

them is data sparsity data sparsity happens when you have a large data set

of users who generally rate only a limited number of items as mentioned

collaborative based recommenders can only predict the scoring of an item if there

are other users who have rated it due to sparsity we might not have enough

ratings in the user item data set which makes it impossible to provide proper

recommendations another issue to keep in mind is something called cold start cold

start refers to the difficulty the recommendation system has when there is

a new user and as such a profile doesn’t exist for them yet cold start can also

happen when we have a new item which has not received a rating scalability can

become an issue as well as the number of users or items increases and the amount

of data expands collaborative filtering algorithms will begin to suffer drops in

performance simply due to growth in the similarity computation there are some

solutions for each of these challenges such as using hybrid based recommender

systems but they are out of scope of this course
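to recap the user-based algorithm walked through earlier, the sketch below weights each neighbor's ratings by their similarity to the active user (using the illustrative weights 0.7, 0.9, and 0.4 from the example) and normalizes by the sum of the similarities of the users who actually rated each movie; the neighbor ratings themselves are made up:

```python
import numpy as np

# Similarity of three neighbor users to the active user (from the example)
similarities = np.array([0.7, 0.9, 0.4])

# Neighbors' ratings for two candidate movies; np.nan marks "not rated"
ratings = np.array([
    [8.0,    np.nan],  # user with similarity 0.7
    [9.0,    7.0],     # user with similarity 0.9
    [np.nan, 6.0],     # user with similarity 0.4
])

rated = ~np.isnan(ratings)  # mask of which users rated which movie
weighted = np.where(rated, ratings, 0.0) * similarities[:, np.newaxis]

# Normalize each movie's weighted-rating sum by the sum of the
# similarity weights of the users who rated that movie
predictions = weighted.sum(axis=0) / (rated * similarities[:, np.newaxis]).sum(axis=0)
print(predictions)
```

the resulting values are the potential ratings the active user would give each movie, which can then be sorted to produce the recommendation; normalizing by the similarity sums is what keeps a movie rated by three users comparable to one rated by two.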

thanks for watching

Contents & Time Line

0:00:10 – Welcome To ML with Python
0:01:42 – High-Level Introduction to ML
0:08:27 – Python For Machine Learning
0:13:21 – Supervised Vs Unsupervised Learning
0:18:07 – Regression
  0:21:59 – Linear Regression
  0:32:01 – Model Evaluation Approaches
  0:38:34 – Evaluation Metrics in Regression Models
  0:41:01 – Multiple Linear Regression
  0:51:54 – Non-Linear Regression
  0:57:57 – Logistic Regression
1:04:15 – Classification
  1:07:20 – K-Nearest Neighbours
  1:14:40 – Evaluation Metrics (Classification)
  1:20:22 – Decision Trees
  1:23:32 – Building Decision Trees
  1:32:00 – Logistic Regression
  1:38:18 – Linear Vs Logistic Regression
  1:50:41 – Logistic Regression Model Training
  2:01:42 – Support Vector Machine (SVM)
2:08:45 – Clustering
  2:15:07 – K-Means Clustering Basic
  2:22:52 – K-Means Clustering Advanced
  2:25:53 – Hierarchical Clustering Basic
  2:30:52 – Hierarchical Clustering Advanced
  2:35:32 – DBSCAN Clustering
2:41:03 – Recommendation Systems (RS)
  2:44:39 – Content-Based RS
  2:48:48 – Collaborative Filtering
2:54:31 – Recommended Projects
  2:54:36 – Twitter Sentiment Analysis
  2:56:11 – Handwriting Recognition
  2:57:26 – Stock Price Prediction
  2:58:22 – Football Match Prediction
  2:59:36 – Movie Recommendation System
3:00:10 – Where To Go From Here?
