Python
10 ** 2 → 100
3 * 'a' → 'aaa' → the string is repeated 3 times
Built-in function type() → checks the data type of any value → eg : type(1)
returns int
We don't need to write int a = 10 ⇒ only a = 10 (no type declaration).
A string can be defined with double quotes (") or single quotes (').
print((a*b) + (a/b)) → follows the BODMAS rule.
To see the documentation of a built-in function → Shift+Tab in a Jupyter notebook → eg : print
Printing a complex sentence :
1)Method 1
first_name='Krish'
last_name='Naik'
print("My first name is {} and last name is {}".format(first_name, last_name))
O/p : My first name is Krish and last name is Naik
The dot operator calls format() on the string. Each {} is replaced, in order, by the arguments inside the brackets.
2)Method 2
print("My First name is {first} and last name is {last}".format(last=last_name,first=first_name))
Expl : the output stays the same even if we change the order of the arguments, since each value is assigned to a named placeholder
O/p : My First name is Krish and last name is Naik
len('jay') → o/p 3 → length of the string
A list is a group of items, which can have different data types
Python data structure :
1)Boolean
2)Boolean and logical operators
3)Lists
4)Comparison operators
5)Dictionaries
6)Tuples and sets
Tab to see all the available methods . eg : typing str. and pressing Tab opens a drop down
print(my_str.isalnum()) # check if all chars are alphanumeric (letters and/or digits)
print(my_str.isalpha()) # check if all chars in the string are alphabetic
print(my_str.isdigit()) # test if all chars in the string are digits
print(my_str.istitle()) # test if the string is title-cased (each word starts with a capital)
print(my_str.isupper()) # test if all letters are upper case
print(my_str.islower()) # test if all letters are lower case
print(my_str.isspace()) # test if the string contains only whitespace
print(my_str.endswith('d')) # test if the string ends with a d → case sensitive
print(my_str.startswith('H')) # test if the string starts with H
→ each call returns True or False
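A small sketch of the string checks above. The value of my_str is a hypothetical sample, since the notes do not give one.

```python
# my_str is an assumed sample value, not taken from the notes.
my_str = "Hello World"

results = {
    "isalnum": my_str.isalnum(),            # False: the space is not alphanumeric
    "isalpha": my_str.isalpha(),            # False: the space is not a letter
    "istitle": my_str.istitle(),            # True: every word starts with a capital
    "endswith_d": my_str.endswith("d"),     # True
    "endswith_D": my_str.endswith("D"),     # False: the check is case sensitive
    "startswith_H": my_str.startswith("H"), # True
}

for name, value in results.items():
    print(name, value)

# A string made only of letters and digits passes isalnum().
alnum_example = "abc123".isalnum()
```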
Datatypes :
Lists :
Can store items of different data types.
A list is a mutable (changeable), ordered sequence of elements. Each element or value
inside a list is called an item.
Values are written between square brackets [ ].
Indexing → 0,1,2 . . . .
Slicing a list : to select several elements of
lst = ['maths','chem',100,'phy'] → lst[:] → selects all the elements. If we
want to select from chem to the end → lst[1:]
Select from chem to 100 → lst[1:3] → the element at index 3 is not included
Initialising a list :
1.type([])
2.l_eg=[]
type(l_eg)
3.lst=list()
type(lst)
4.lst=['maths','chem',100]
Functions in list :
1.Append : adds an item to the end of the list .
Eg : lst.append('phy')
['maths','chem',100,'phy']
2.To check what an element is : lst[1] → o/p 'chem'
3.Append with a list as the argument :
lst.append(['john']) → creates a nested list →
['maths','chem',100,'phy',['john']]
4.Insert : lst.insert(1,'pushkar') → ['maths','pushkar','chem',100,'phy']
5.Extend : adds each element of another list at the end of the list → lst.extend([8,9]) →
['maths','chem',100,8,9]
6.Sum : adds all the numbers in a list eg : lst2=[1,2,3] → sum(lst2) → 6
7.Pop : removes and returns the last element → lst2.pop() → lst2=[1,2]
OR lst2.pop(0) removes the element at index 0 → starting from [1,2,3] , lst2=[2,3]
8.Count : lst2.count(value) → number of occurrences of the given element in the list
9.Index : returns the index of the first occurrence . start and end are optional and limit the search range
Eg : lst2.index(1) → 0
Syntax : index(value, start , end)
10. Multiplication of a list : lst2*2 → [1,2,3,1,2,3]
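The list functions above can be tried together in one short sketch. The names lst and lst2 are assumed spellings of the list variables used in these notes.

```python
# lst and lst2 are assumed variable names for the lists in the notes.
lst = ['maths', 'chem', 100]

lst.append('phy')         # add one item at the end
lst.insert(1, 'pushkar')  # insert before index 1
lst.extend([8, 9])        # add every element of another list at the end
print(lst)

middle = lst[1:3]         # slice: index 1 up to, but not including, index 3

lst2 = [1, 2, 3]
total = sum(lst2)             # add all the numbers
first_index = lst2.index(1)   # index of the first occurrence of 1
last = lst2.pop()             # remove and return the last element
doubled = lst2 * 2            # repeat the remaining elements twice
```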
Sets :
An unordered collection data type that is iterable , mutable, and has no duplicate
elements . Python's set class represents the mathematical notion of a set . It is
based on a data structure known as a hash table.
A set does not support indexing or subscripting ie we cannot access an element as in a list →
eg : set1[0] → TypeError
CODE :
1)set1=set()
2)set1={1,2,3,3} → o/p on printing → {1,2,3} // duplicate elements are stored only once
Inbuilt functions:
1)Add : set1.add("jay") → adds the element . A set has no order, so there is no "last" position
We can also do unions and intersections on sets (as in maths).
2)Difference : set1={"jay","krishna","balram"}
set2={"jay"}
set1.difference(set2)
o/p → {"krishna","balram"}
This does not update set1 .
3)Difference update : set1.difference_update(set2) changes set1 itself → set1 = {"krishna","balram"}
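A minimal sketch of the set behaviour described above, using the same names. It shows which set is the caller of difference() and that only difference_update() changes the set in place.

```python
# Duplicate elements are stored only once.
numbers = {1, 2, 3, 3}

set1 = {"jay", "krishna", "balram"}
set2 = {"jay"}

# difference() returns a new set and leaves set1 untouched.
diff = set1.difference(set2)
size_after_difference = len(set1)

# Indexing a set is not supported and raises TypeError.
try:
    set1[0]
    indexing_failed = False
except TypeError:
    indexing_failed = True

# difference_update() removes the common elements from set1 itself.
set1.difference_update(set2)
```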
Dictionaries:
A collection that is changeable and indexed by key (insertion order is preserved since Python 3.7) . Written with curly braces,
it holds key : value pairs
Declaration :
dic={} ⇒ just use empty curly braces
Eg : dic={"car1":"audi","car2":"pagani"} ⇒ manual way of creating
Inbuilt function ⇒ dict() → creates an empty dict .
To access the elements of a dict we do NOT use an index
number
→ we use the key names (eg : dic["car1"], dic["car2"])
Eg : iterating through the keys → prints all the keys
for x in dic:
    print(x)
Eg : iterating through the values → prints all the values
for x in dic.values():
    print(x)
o/p : audi , pagani
Eg : iterating through key and value both → prints (key, value) pairs
for x in dic.items():
    print(x)
Nested dictionaries :
Eg:
car1_model={'Mercedes':1960}
car2_model={'Audi':1970}
car3_model={'Ambassador':1980}
car_type={'car1':car1_model,'car2':car2_model,'car3':car3_model}
print(car_type)
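The three ways of iterating and the nested lookup can be sketched as follows. The keys used for car_type are an assumption made for illustration.

```python
dic = {"car1": "audi", "car2": "pagani"}

keys = [x for x in dic]        # iterating the dict itself yields the keys
values = list(dic.values())    # only the values
items = list(dic.items())      # (key, value) tuples

# Nested dictionary: the outer keys 'car1' and 'car2' are assumed names.
car_type = {
    "car1": {"Mercedes": 1960},
    "car2": {"Audi": 1970},
}
year = car_type["car1"]["Mercedes"]  # access by key names, not by index number
print(keys, values, items, year)
```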
Tuples :
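● Not mutable : elements cannot be changed after creation
● Supports indexing by number , eg : tup[0]
● Supports different data types
● Written with round brackets
● Eg : tup=("jay","nitai","manohar")
● We cannot change a single element, but we can reassign the variable to a whole new tuple.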
Libraries :
NUMPY :
Provides a high dimensional array object and tools for working with arrays (a data structure
holding elements of the same data type)
After installing Python → command prompt → pip install numpy
Importing numpy : import numpy as np
np → alias for numpy
Reference type : two variables share the same value . So if we change one variable the other
is also updated . eg : arrays (see Copy & broadcasting).
Value type : eg an integer value . If we assign it to another variable and then
change the other variable , the original is not updated
Codes :
my_lst=[1,2,3,4,5]
arr=np.array(my_lst)
In : type(arr)
Out : numpy.ndarray
In : arr
Out : array([1,2,3,4,5])
● arr.shape → tells us the number of rows & columns
In : arr.shape
Out : (5,) → 1D array with 5 elements
Creating a 2D array
my_lst1=[1,2,3,4,5]
my_lst2=[2,3,4,5,6]
my_lst3=[9,7,6,8,9]
arr=np.array([my_lst1,my_lst2,my_lst3])
In : arr
Out : array([[1, 2, 3, 4, 5],
       [2, 3, 4, 5, 6],
       [9, 7, 6, 8, 9]]) → 2 closing brackets indicate a 2D array
In : arr.shape
Out : (3, 5) → (rows,columns)
● arr.reshape : converts the array to a new shape
● During reshape the number of elements must remain constant (here 3 x 5 = 5 x 3 = 15)
In : arr.reshape(5,3) → returns an array containing the same data with a new
shape
Out : array([[1, 2, 3],
       [4, 5, 2],
       [3, 4, 5],
       [6, 9, 7],
       [6, 8, 9]])
Indexing in array:
● arr[0] ⇒ for a 1D array
● arr[: , :] ⇒ picks up all the elements
array([[1, 2, 3, 4, 5],
       [2, 3, 4, 5, 6],
       [9, 7, 6, 8, 9]])
● arr[0:2, :] → we want the 0th row and the 1st row
(the stop index 2 is excluded) , with all the columns
o/p : array([[1, 2, 3, 4, 5],
       [2, 3, 4, 5, 6]])
● arr[0:2,0:2]
o/p : array([[1, 2],
       [2, 3]])
● arr[1:,3:]
o/p : array([[5, 6],
       [8, 9]])
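The shape, reshape and slicing results above can be reproduced with the same 3 x 5 array. This is only a sketch using the values shown in the outputs.

```python
import numpy as np

arr = np.array([[1, 2, 3, 4, 5],
                [2, 3, 4, 5, 6],
                [9, 7, 6, 8, 9]])

shape = arr.shape                 # (rows, columns)
reshaped = arr.reshape(5, 3)      # same 15 elements, new shape

first_two_rows = arr[0:2, :]      # rows 0 and 1, all columns
top_left = arr[0:2, 0:2]          # rows 0-1, columns 0-1
bottom_right = arr[1:, 3:]        # rows 1-2, columns 3-4
print(top_left)
print(bottom_right)
```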
● Linspace : syntax :
np.linspace(start, stop, num=50, endpoint=True, retstep=False, dtype=None, axis=0)
Used in ML
np.linspace(1,10,50) ⇒ from 1 to 10 we want 50 points → creates
EQUALLY spaced points
o/p : array([ 1. , 1.18367347, 1.36734694, ... , 9.81632653, 10. ])
● Copy & broadcasting :
Eg :
arr[3:] = 100 ⇒ from index 3 to the end, replace all values by 100 (broadcasting)
arr1 = arr
arr1[3:] = 500
print(arr)
o/p : [1,2,3,500,500,500] **Reference type : arr changed because arr1 refers to the same array
NOTE : To prevent this update we have the copy function : syntax :
arr1 = arr.copy() → this allocates separate memory for arr1 , so changing arr1
no longer changes arr .
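A minimal sketch of the difference between assigning an array to a second name and copying it. The six-element array is an assumed example chosen to match the output shown above.

```python
import numpy as np

# Assumed starting array with six elements.
arr = np.array([1, 2, 3, 4, 5, 6])

arr1 = arr            # no copy: arr1 is a second name for the same array
arr1[3:] = 500        # broadcasting: one value fills the whole slice
shared = arr.tolist() # arr has changed as well

arr2 = arr.copy()     # separate memory
arr2[3:] = 100
after_copy = arr.tolist()  # arr is unaffected by changes to arr2
print(shared, after_copy, arr2.tolist())
```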
Random distribution :
Type 1 :
● np.random.rand(3,3) → 3 x 3 array of random elements from a uniform distribution
● The elements lie in the range [0, 1) : never < 0 and never ≥ 1.
Type 2 :
● np.random.randn(4,4) → selects values from the standard normal
distribution (stats)
Type 3 :
● np.random.randint(low, high=None, size=None, dtype=int)
np.random.randint(0,100,8) ⇒ selects 8 random integers from 0 (inclusive)
to 100 (exclusive)
We can also reshape it ⇒ np.random.randint(0,100,8).reshape(2,4)
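The three random functions can be compared side by side. The seed below is an assumption added only to make the run repeatable.

```python
import numpy as np

np.random.seed(0)  # assumed seed, only for repeatable output

uniform = np.random.rand(3, 3)    # uniform values in [0, 1)
normal = np.random.randn(4, 4)    # standard normal values (can be negative or > 1)
ints = np.random.randint(0, 100, 8).reshape(2, 4)  # 8 integers in [0, 100)

print(uniform.min(), uniform.max())
print(ints)
```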
PANDAS :
● Importing pandas : import pandas as pd
import numpy as np
● Data Frames : a combination of columns and rows . A 2D representation
of data , similar to how it looks in an Excel sheet
○ Eg : df=pd.DataFrame(np.arange(0,20).reshape(5,4),
index=['Row1','Row2','Row3','Row4','Row5'],columns=['column1','column2','column3','column4'],dtype=int) ⇒ we are making a
2D array with 20 elements and arranging it in 5 rows and 4 columns
df.head() → shows the first 5 rows
Reading a CSV file : if the file uses a semicolon instead of a comma as the separator then we set sep
to ;
Eg : df=pd.read_csv('file.csv',sep=';')
Data conversion :
CSV :
Basic:
● from io import StringIO,BytesIO
● data = ('col1,col2,col3\n'
'x,y,1\n'
'a,b,2\n'
'c,d,3')
// "\n" starts a new line
// This data is in the form of a string . We could also save it as a CSV file
and then load it
● pd.read_csv(StringIO(data)) // StringIO wraps the text in a file-like object, so read_csv can parse it into a table
● // df=pd.read_csv(StringIO(data),usecols=lambda x: x.upper() in
['COL1','COL3'])
● df=pd.read_csv(StringIO(data),usecols=['col1','col3'])
● df
● // saving the above table back to a CSV file
● df.to_csv('test.csv') → saved as test.csv in the current working directory
● // if we do not want pandas to infer the data types ⇒
● df=pd.read_csv(StringIO(data),dtype=object) → now every column has the
object data type → in our case each value is read as a string
● df
● df['a']
○ o/p : '5' → gives a string value
● // eg 2
● pd.read_csv(StringIO(data),index_col=1) // uses column 1 as the row index
● data = ('index,a,b,c\n'
'4,apple,bat,5.7\n'
'8,orange,cow,10')
● pd.read_csv(StringIO(data))
○ // With the default index_col=None the columns keep the same order as the data and
pandas adds a numeric row index
● pd.read_csv(StringIO(data),index_col=False) // never use the first column as the index
● data = 'a,b\n"hello, \\"bob\\", ",5'
● pd.read_csv(StringIO(data),escapechar='\\') → the backslash is treated as an escape character, so the escaped quotes stay inside the field
● df=pd.read_csv('https://download.bis.gov.item',sep='\t')
● df.head()
● df.to_csv('wine.csv') → writes the DataFrame to a CSV file (this is how data loaded from JSON is converted to CSV)
● df.to_json(orient="index") → converts the object to a JSON string
○ df.to_json()
■ o/p
'{"employee_name":{"0":"James"},"email":{"0":"james@gmail.com"},"job_profile":{"0":{"title1":"Team Lead","title2":"Sr. Developer"}}}'
○ df.to_json(orient="records") IMP
■ '[{"employee_name":"James","email":"james@gmail.com","job_profile":{"title1":"Team Lead","title2":"Sr. Developer"}}]'
// makes the o/p record by record.
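A short sketch that ties the CSV and JSON steps together without touching any file on disk. The sample data is an assumed cleaned-up version of the string used in these notes.

```python
import json
from io import StringIO

import pandas as pd

# Assumed sample data in the col1/col2/col3 layout used above.
data = "col1,col2,col3\nx,y,1\na,b,2\nc,d,3"

# StringIO lets read_csv treat the string like a file.
df = pd.read_csv(StringIO(data), usecols=["col1", "col3"])

# dtype=object keeps every value as the text that was read.
df_obj = pd.read_csv(StringIO(data), dtype=object)

# orient="records" gives one JSON object per row.
records = json.loads(df.to_json(orient="records"))

# With no path, to_csv returns the CSV text instead of writing a file.
csv_text = df.to_csv(index=False)
print(records)
```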
PICKLING:
Pickle files are also used to save trained machine learning models.
The to_pickle method uses Python's pickle module to save data structures to
disk in the pickle format.
df_excel.to_pickle('df_excel')
df=pd.read_pickle('df_excel')
df.head()
o/p : // displays the content of the pickle
NOTE : search the pandas documentation . It has all the info on pandas
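A minimal round trip through a pickle file. The DataFrame contents and the file name are hypothetical, and a temporary folder is used so that nothing is left on disk.

```python
import os
import tempfile

import pandas as pd

# Hypothetical DataFrame standing in for df_excel.
df_excel = pd.DataFrame({"name": ["a", "b"], "score": [1, 2]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "df_excel.pkl")  # assumed file name
    df_excel.to_pickle(path)     # save the DataFrame in pickle format
    df = pd.read_pickle(path)    # load it back

same = df.equals(df_excel)       # the loaded copy matches the original
print(df.head())
```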
MatplotLib tutorial :
No need to memorise Matplotlib . Seaborn is more convenient .
A plotting library for Python and its numerical mathematics extension NumPy. It
provides an object-oriented API for embedding plots into applications using
general-purpose GUI toolkits like Tkinter , wxPython , Qt or GTK+
● import numpy as np
● import matplotlib.pyplot as plt
● x = np.arange(1,11)
● y=3*x+5
● plt.title("Matplotlib demo")
● plt.xlabel("x axis caption")
● plt.ylabel("y axis caption")
● plt.plot(x,y)
● plt.show()
●
● # Compute the x and y coordinates for points on a sine curve
● x = np.arange(0, 4 * np.pi, 0.1) # 0.1 is the step size
● y = np.sin(x)
● plt.title("sine wave form")
● # Plot the points using matplotlib
● plt.plot(x, y)
● plt.show()
●
● # Subplot()
● # Compute the x and y coordinates for points on sine and cosine curves
● x = np.arange(0, 5 * np.pi, 0.1)
● y_sin = np.sin(x)
● y_cos = np.cos(x)
● # Set up a subplot grid that has height 2 and width 1,
● # and set the first such subplot as active.
● plt.subplot(2, 1, 1)
● # Make the first plot
● plt.plot(x, y_sin)
● plt.title('sine')
● # Set the second subplot as active, and make the second plot.
● plt.subplot(2, 1, 2)
● plt.plot(x, y_cos)
● plt.title('Cosine')
● # show the figure
● plt.show()
Bar plot
● x = [2,8,10]
● y = [11,16,9]
● x2 = [3,9,11]
● y2 = [6,15,7]
● plt.bar(x, y)
● plt.bar(x2, y2, color = 'g')
● plt.title('Bar graph')
● plt.ylabel('Y axis')
● plt.xlabel('x axis')
● plt.show()
● //https://www.youtube.com/watch?v=czQO1_GEEos&list=PPSV
Histograms:
For ranges of numbers (bins) on the x axis , the y axis shows the count / density of values in each bin
By default a histogram has 10 bins .
● a = np.array([22, 87, 5, 43, 56, 73, 55, 54, 11, 20, 51, 5, 79, 31, 27])
● plt.hist(a)
● plt.title("histogram")
● plt.show()
● With 10 bins , the first bin (about 5 to 13) contains 3 values : 5 , 5 and 11.
● For 20 bins we write : plt.hist(a,bins=20) -> the whole range is
divided into 20 bins.
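The bin counts behind the histogram can be checked without drawing anything. This sketch uses np.histogram, which computes the same kind of counts and bin edges for the data above.

```python
import numpy as np

a = np.array([22, 87, 5, 43, 56, 73, 55, 54, 11, 20, 51, 5, 79, 31, 27])

# 10 equal-width bins between the smallest and the largest value.
counts, edges = np.histogram(a, bins=10)
bin_width = edges[1] - edges[0]

# The same data split into 20 narrower bins.
counts20, edges20 = np.histogram(a, bins=20)

print(counts)
print(edges)
```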
Pie chart :
● # Data to plot
● labels = 'Python', 'C++', 'Ruby', 'Java'
● sizes = [215, 130, 245, 210] // each size is converted to its percentage of the
total .
● colors = ['gold', 'yellowgreen', 'lightcoral', 'lightskyblue']
● explode = (0.1, 0, 0, 0) // explode the 1st slice : how far the slice is moved out .
If we write (0.1,0,0.2,0) then the third slice is also exploded / moved away
● # Plot
● plt.pie(sizes, explode=explode, labels=labels, colors=colors,
autopct='%1.1f%%', shadow=True) // autopct defines the format of the percentage
label on each slice . %1.1f -> float with 1 decimal place .
● plt.axis('equal')
● plt.show()
________________________________________________________________
SEABORN tutorial :
Statistical plotting tools .
Dataset : f1 , f2 , f3 , f4 . Depending on whether it is a classification or a regression problem we
divide the features into independent and dependent features eg : f1, f2 &
f3 , f4. Suppose f3 and f4 are the o/p features that we need to compute ->
dependent features . With 2 features f1 and f2 we can draw a 2D plot. 3
features -> 3D diagram . 4 features -> 4D diagram .
If we only have f1 -> univariate analysis
f1 & f2 -> bivariate analysis (wrt supervised ML)
Distribution plots :
Which distribution plot we use depends on how many features we analyse
at once .
● distplot : univariate
● jointplot : bivariate
● pairplot : more than 2 features
Practice problems on the IRIS dataset
● import seaborn as sns
● df = sns.load_dataset("tips") // load_dataset is an inbuilt function and tips is an inbuilt dataset
● df.head()
// we should be able to create a model that predicts the tip
based on the other features like total bill, day, sex etc . Here tip is the dependent
feature and all the others are independent features , since the tip depends on the day, time etc .
df.dtypes
total_bill     float64
tip            float64
sex           category
smoker        category
day           category
time          category
size             int64
dtype: object
________________________________________________________________
Correlation with heatmap :
1. Uses coloured cells , typically on a monochromatic scale , to show a 2D correlation
matrix (table) between two discrete dimensions or event types .
2. The correlation can only be computed if the values are integers or floats .
3. We can't compute it for categorical features because they are not numeric .
4. Values range from -1 to +1 (Pearson's correlation coefficient)
The data above is used to make the heatmap
● df.corr()
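A small sketch of the correlation matrix that a heatmap visualises. The DataFrame is a hypothetical stand-in for the tips data, and the numeric columns are selected explicitly because the categorical column cannot be correlated.

```python
import pandas as pd

# Hypothetical mini dataset, not the real tips data.
df = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0],
    "tip": [1.0, 2.0, 3.0, 4.0],
    "size": [4, 3, 2, 1],
    "sex": pd.Categorical(["Male", "Female", "Male", "Female"]),
})

# Keep only integer and float columns before computing Pearson's correlation.
corr = df.select_dtypes("number").corr()

bill_tip = corr.loc["total_bill", "tip"]    # moves together -> +1
bill_size = corr.loc["total_bill", "size"]  # moves in opposite directions -> -1
print(corr)
```

The resulting matrix is what would be passed to a heatmap function for colouring.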
________________________________________________________________
Joint plot :
A joint plot lets us study the relationship between 2 numeric variables . The
central chart displays their correlation . It is usually a scatterplot, a hexbin plot, a
2D histogram or a 2D density plot.
Bivariate analysis
sns.jointplot(x='tip', y='total_bill', data=df, kind='hex') // x and y are the two
features to compare . kind : the type (shape) of the plot drawn in the centre .
● // Major concentration -> dark spots . At the same points the histograms above and on the RHS
are higher .
● // This shows that many people have given a tip of around 2
dollars , and the majority of the bills are between 10 and 20.
● // There are outliers -> bills of more than 50$ with a tip of 10$
● sns.jointplot(x='tip', y='total_bill', data=df, kind='reg') // reg is regression .
It draws a probability density function , also known as KDE (kernel density
estimation) , and also a regression line .
________________________________________________________________
Pair plot (Scatterplot) :
Used for more than 2 independent features.
Each variable in a data row is plotted against every other variable's
value .
It covers all the possible pairwise combinations of the features
● sns.pairplot(df)
● // to colour the scatterplots by a category eg sex.
● sns.pairplot(df,hue='sex')
________________________________________________________________
Dist plot :
Helps us to check the distribution of a single column (feature)
● sns.distplot(df['tip'])
● sns.distplot(df['tip'],kde=False,bins=10) -> kde=False removes the continuous (KDE) line .
________________________________________________________________
Categorical plots
1) Boxplot
2) Violinplot
3) Countplot
4) Barplot
Count plot:
Shows the count of observations in each category
// using the same tips dataset
sns.countplot(x='sex',data=df)
● // with y='sex' the graph is horizontal .
Bar plot :
Give both x and y values.
sns.barplot(x='total_bill',y='sex',data=df)
Box plot :
● sns.boxplot(x='smoker',y='total_bill',data=df) // smoker -> x axis
● sns.boxplot(x="day",y="total_bill",data=df,palette='rainbow')
● // without palette='rainbow' we get the default colours (blue, yellow, green, red).
● sns.boxplot(x='total_bill',y='day',hue="smoker",data=df) // hue = smoker →
splits the points by smoker
Violin plot :
● // We are able to see the data in terms of both the kernel density estimation and the box
plot .
● sns.violinplot(x='total_bill',y='day',data=df,palette='rainbow')
● Try to practise with iris = sns.load_dataset('iris')
Read Kaggle kernels and problems from Medium
________________________________________________________________
Roughly 20 percent of the Age data is missing. The proportion of Age
missing is likely small enough for reasonable replacement with some form
of imputation.
Looking at the Cabin column, it looks like we are just missing too much of
that data to do something useful with at a basic level. We'll probably drop
this later, or
change it to another feature like "Cabin Known: 1 or 0"
Let's continue on by visualising some more of the data! Check out the video
for full explanations over these plots, this code is just to serve as reference.
● sns.set_style('whitegrid')
● sns.countplot(x='survived',data=train)
● sns.set_style('whitegrid')
● sns.countplot(x='survived',hue='sex',data=train,palette='RdBu_r')
● // on the ship women and children were given priority → more of them survived
● sns.set_style('whitegrid')
● sns.countplot(x='survived',hue='pclass',data=train,palette='rainbow')
○ // pclass is the passenger class. pclass 1 is the richest.
○ // First-class passengers survived in a higher proportion than third-class passengers.
● sns.displot(train['age'].dropna(),kde=False,color='darkred',bins=10)
● sns.countplot(x='sibsp',data=train) // 0 → no sibling or spouse on board
● train['fare'].hist(color='green',bins=40,figsize=(8,4))
Data cleaning :
Handling the null values , which are present in the age and cabin columns .
We first look at the relation between passenger class and age
● plt.figure(figsize=(12,7))
● sns.boxplot(x='pclass',y='age',data=train,palette='winter')
● def impute_age(cols): // impute_age is a function
    age = cols['age']
    pclass = cols['pclass']
    if pd.isnull(age):
        if pclass == 1:
            return 37 // because the average age of a passenger in 1st class
            is 37 , from the box plot graph
        elif pclass == 2:
            return 29
        else:
            return 24
    else:
        return age
● train['age'] = train[['age','pclass']].apply(impute_age,axis=1) // age and
pclass are passed on to the impute_age function .
● // we check the heat map
● sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')
● Replacing the cabin values would need a lot of feature engineering , so it is more
convenient to remove the column.
● train.drop('cabin',axis=1,inplace=True) // this completely removes the
cabin column
● train.head()
● sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')
○ Thus we have successfully handled all the NaN values
● Details like passenger id , name and ticket no. are not required .
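The age imputation step can be tried on a tiny table. The DataFrame below is a hypothetical stand-in for the training data, and the fill values 37, 29 and 24 are the ones quoted in these notes.

```python
import pandas as pd

# Hypothetical mini dataset with missing ages.
train = pd.DataFrame({
    "age": [22, None, None, None, 40],
    "pclass": [3, 1, 2, 3, 1],
})

def impute_age(cols):
    """Fill a missing age with the typical age of the passenger class."""
    age = cols["age"]
    pclass = cols["pclass"]
    if pd.isnull(age):
        if pclass == 1:
            return 37
        elif pclass == 2:
            return 29
        else:
            return 24
    return age

# axis=1 passes one row at a time to the function.
train["age"] = train[["age", "pclass"]].apply(impute_age, axis=1)
missing_after = int(train["age"].isnull().sum())
print(train)
```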
Converting categorical features :
We convert sex & embarked into integer (dummy) columns
using pandas get_dummies .
It converts each category of a column into its own dummy column .
● pd.get_dummies(train['embarked'],drop_first=True).head() // removes
the first column because the other 2 columns can represent the first
column.
Suppose we had 3 categories P , Q and S . Then 01 stands for S , 10
for Q and 00 for P . Therefore we can drop P.
● // Similarly for sex we do the same .
● sex = pd.get_dummies(train['sex'],drop_first=True)
● embark = pd.get_dummies(train['embarked'],drop_first=True)
● train.drop(['sex','embarked','name','ticket'],axis=1,inplace=True)
● train.head()
● train = pd.concat([train,sex,embark],axis=1) // adds Q & S for embark and
male for sex
● train.head()
● // survived is the dependent feature , all the rest are independent .
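A minimal sketch of dummy encoding with drop_first. The two small Series are hypothetical samples of the embarked and sex columns.

```python
import pandas as pd

# Hypothetical sample values.
embarked = pd.Series(["S", "C", "Q", "S"], name="embarked")
sex = pd.Series(["male", "female", "female", "male"], name="sex")

# drop_first=True drops the first category (C here); a row with all
# remaining dummies off therefore means "C".
embark = pd.get_dummies(embarked, drop_first=True)
sex_dummies = pd.get_dummies(sex, drop_first=True)

# Join the dummy columns side by side, as done with the training table.
encoded = pd.concat([sex_dummies, embark], axis=1)
print(encoded)
```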
________________________________________________________________
○
● from sklearn.model_selection import train_test_split
● X_train,X_test,y_train,y_test =
train_test_split(train.drop('survived',axis=1),train['survived'],test_size=0.30,
random_state=101)
Functions in Python :
● // Common code
num = 24
def even_odd(num):
    if num%2==0:
        print("the number is even")
    else:
        print("it is odd")
Map function :
1. 2 parameters : function & iterables
2. Uses lazy evaluation : the values are computed only when they are requested
● def even_or_odd(num):
    if num%2 == 0:
        return "The number {} is Even".format(num)
    else:
        return "The number {} is Odd".format(num)
● even_or_odd(24)
○ 'The number 24 is Even'
● lst=[1,2,3,4,5,6,7,8,9,24,56,78]
● map(even_or_odd,lst)
○ <map at 0x26164655c50> // a lazy map object : no result has been computed in
memory yet
● list(map(even_or_odd,lst))
○ ['The number 1 is Odd',
○ 'The number 2 is Even',
.......
○ 'The number 78 is Even']
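The lazy behaviour of map can be made visible by recording every call. The calls list is a hypothetical helper added only for this sketch.

```python
calls = []  # hypothetical helper: records each number the function receives

def even_or_odd(num):
    calls.append(num)
    if num % 2 == 0:
        return "The number {} is Even".format(num)
    return "The number {} is Odd".format(num)

lst = [1, 2, 3]

mapped = map(even_or_odd, lst)   # nothing is computed yet
calls_before = len(calls)

first = next(mapped)             # computes only the first result
calls_after_first = len(calls)

rest = list(mapped)              # computes the remaining results
print(first, rest)
```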
Lambda function :
1. Also called an anonymous function
2. A function with no name
3. It is not faster than a normal function , it is only shorter to write
4. If the function body is a single expression → it can be converted to a lambda
Eg : return a + b
5. Loosely similar to an inline function in C++
● addition = lambda a,b : a+b // function → stored in the variable "addition"
● addition(12,50) // a lambda can take multiple arguments (a,b,c . . . )
○ 62
● even1 = lambda a: a%2==0
● even1(12)
○ True
Filter function
● def even(num):
    if num%2==0:
        return True
● lst=[1,2,3,4,5,6,7,8,9,0]
● list(filter(even,lst))
○ [2, 4, 6, 8, 0]
● list(filter(lambda num:num%2==0,lst)) // best way
○ [2, 4, 6, 8, 0]
● list(map(lambda num:num%2==0,lst)) // map returns the result of the function for every element
○ [False, True . . . . . . . . ]
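A short sketch contrasting filter and map with the same lambda: filter keeps the elements for which the function returns True, while map returns the function's result for every element.

```python
lst = [1, 2, 3, 4, 5, 6, 7, 8, 9, 0]

is_even = lambda num: num % 2 == 0

evens = list(filter(is_even, lst))  # the elements that pass the test
flags = list(map(is_even, lst))     # one True/False per element
print(evens)
print(flags)
```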
List comprehension :
1. A concise way to create lists
2. An explicit loop needs more lines of code for the same result
3. It consists of brackets containing an expression followed by a for clause ,
then zero or more for or if clauses. The expression can be anything,
meaning you can put all kinds of objects in lists
● def lst_square(lst):
    lst1 = []
    for i in lst:
        lst1.append(i*i)
    return lst1
● lst_square([1,2,3,4,5,6,7])
○ [1,4,9,16,25,36,49]
● lst=[1,2,3,4,5,6,7]
● squares = [i * i for i in lst] // this replaces all the lines of code we have written in
the first bullet point
○ [1,4,9,16,25,36,49]
// If we want to square only the even nos
● even_squares = [i * i for i in lst if i%2 == 0]
● print(even_squares)
○ [4 , 16 , 36]
String formatting :
● def welcome(name, age):
    return "Welcome {name1}, your age is {age1}".format(name1=name,
    age1=age)
● welcome('jay',55)
○ 'Welcome jay, your age is 55'
● lst = [1,2,3,4,5,6,7]
● for i in lst:
    print(i)
○ 1
2
3
.. ..
● iter(lst) → *
* iter function : converts a list (an iterable) into an iterator . The values are then
not all initialised in memory at once . So we have to
call an inbuilt function called next , which fetches them one by one
● lst1 = iter(lst)
● next(lst1)
○ 1 → we execute / run again → o/p 2 ⇒ picks up the next element
● for i in lst1:
    print(i)
○ 3
4
5
.......
After we reach the last element , a further call to next raises StopIteration .
We don't get this error in a for loop because the loop handles it and stops at the last element
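The iterator behaviour described above, sketched on a short list: next() hands out one value at a time, a for loop continues from where next() stopped, and an exhausted iterator raises StopIteration.

```python
lst = [1, 2, 3, 4]

it = iter(lst)        # turn the iterable into an iterator
first = next(it)      # 1
second = next(it)     # 2

# The for loop picks up the remaining elements and stops quietly at the end.
rest = [i for i in it]

# Calling next() on an exhausted iterator raises StopIteration.
try:
    next(it)
    exhausted = False
except StopIteration:
    exhausted = True

print(first, second, rest, exhausted)
```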
OOPs in Python
● class car:
    pass → we don't have any properties defined yet
● car1=car()
● car1.windows=5
car1.doors=5
print(car1.windows) -> o/p 5
__init__ function → acts as a constructor
● class Dog:
    def __init__(self, name, age):
        self.name = name
        self.age = age
    def owner(self): → method inside a class → here self.name holds the name of this instance
        return "His name is {}".format(self.name)
● dog1=Dog('Trevor',6) → self refers to dog1
● dog1.owner() → 'His name is Trevor'
self parameter → reference to the current instance of the class , similar to this in
C++
Interview Q : Errors: include both syntax and runtime errors. Syntax errors must
be fixed in the code.
Exceptions: are runtime errors that can be handled to prevent program crashes.
Exception handling :
Finally block :
The finally block is written after the else block .
Its code is executed irrespective of whether an error is caught or not .
Use this block to close the database connection . We can't use the else block for that , because else only runs when no exception occurred .
o/p :
Age is valid
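A minimal sketch of try / except / else / finally built around an age check. The function name, the age limit of 18 and the "cleanup" step are assumptions for illustration; in real code the finally block is where a database connection would be closed.

```python
def check_age(age):
    """Return the messages produced while validating an age (assumed limit: 18)."""
    events = []
    try:
        if age < 18:
            raise ValueError("Age is not valid")
    except ValueError as err:
        events.append(str(err))          # runs only when an error was raised
    else:
        events.append("Age is valid")    # runs only when no error was raised
    finally:
        events.append("cleanup")         # always runs, error or not
    return events

valid_run = check_age(25)
invalid_run = check_age(10)
print(valid_run, invalid_run)
```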
________________________________________________________________
Access specifiers :
OOPs - public , private , protected
● class car():
    def __init__(self,windows,doors,enginetype):
        self._windows=windows
        self._doors=doors
        self._enginetype=enginetype
● bmw=car(4,5,"petrol")
NOTE: 1. Java , C# = strongly typed → access restrictions are enforced by the language
2. Python → access levels are only a naming convention and can be overridden
3. _ → a single leading underscore → PROTECTED
4. dir(bmw) → see all the attributes → o/p for a class whose attributes are written without an underscore :
{'__class__' // here functions are also displayed
'__dir__'
'doors' ,
'enginetype' ,
'windows' → not written with _ → public
}
Protected attributes should only be accessed from subclasses , by inheritance .
Overriding should only be done from the subclass
Continue from above :
● class truck(car):
    def __init__(self,windows,doors,enginetype,horsepower):
    // horsepower is a new parameter
        super().__init__(windows,doors,enginetype)
        self.horsepower=horsepower
    // super ⇒ to inherit all the parameters like windows , doors , engine type
● truck1=truck(4,4,"diesel",4000)
● dir(truck1)
○ '_doors' → protected
'_enginetype'
'horsepower' → public
'_windows'
5. Private → __ → double underscore → should not be accessed or modified
outside of the class . Python only renames (mangles) the attribute , so if we really want to modify it we can override it through the mangled name
6. Private attributes in dir → _car__doors
_car__enginetype
_car__windows
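A small sketch of the three naming conventions in one class, with a subclass that reuses the parent constructor through super(). The capitalised class names Car and Truck are assumptions, so the mangled name here is _Car__enginetype.

```python
class Car:
    def __init__(self, windows, doors, enginetype):
        self.windows = windows            # public
        self._doors = doors               # protected (by convention only)
        self.__enginetype = enginetype    # private: stored as _Car__enginetype


class Truck(Car):
    def __init__(self, windows, doors, enginetype, horsepower):
        super().__init__(windows, doors, enginetype)  # reuse the parent constructor
        self.horsepower = horsepower                  # new public attribute


truck = Truck(4, 4, "diesel", 4000)

# List the instance attributes without the built-in dunder names.
attrs = [name for name in dir(truck) if not name.startswith("__")]

# The private attribute is reachable only through its mangled name.
has_plain_name = hasattr(truck, "__enginetype")
mangled_value = truck._Car__enginetype
print(attrs)
```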
________________________________________________________________
Inheritance :
● class car():
    def __init__(self,windows,doors,enginetype):
        self.windows=windows
        self.doors=doors
        self.enginetype=enginetype
    def drive(self):
        print("can drive")
● car1=car(4,5,"electric")
● class audi(car):
    def __init__(self,windows,doors,enginetype,luxury):
        super().__init__(windows,doors,enginetype)
        self.luxury=luxury → luxury is a boolean
    def selfdriving(self):
        print("audi supports self driving")
● audiQ7=audi(5,5,"electric",True)
● audiQ7.selfdriving()
○ audi supports self driving
________________________________________________________________
Univariate :
Bivariate :
Using seaborn .
Pearson's correlation tells us how one feature is affected when the
other one changes . (range -1 → +1)
● import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
● df=pd.read_csv('https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv')
● df.head()
● df.shape
○ (150,5) → (records , features)
The 3 species are plotted in different colours
Z score :
μ (mu) = mean , σ (sigma) = standard deviation
Z score : $z = \frac{x_i - \mu}{\sigma}$
Normalization or standardisation , wherein we apply the formula to each
value x_i .
μ - σ to μ + σ → 68% of all data lies here
μ - 2σ to μ + 2σ → 95%
Sample Q :
μ = 75 , σ = 10 . Probability that a student scores > 60
z = (60 - 75) / 10 = -1.5
3rd region (left of z = -1.5) → from the z table the area is 0.0668 = 6.68 %
1st region → z from -1.5 to 0
2nd region → z from 0 → RHS → 50 %
Let region 1 be x
Total ⇒ 100 = x + 50 + 6.68
x = 43.32 %
P(score > 60) = x + 50 = 93.32 %
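The sample question can be checked numerically. This sketch computes the standard normal cumulative probability with math.erf instead of reading a z table; the helper name normal_cdf is an assumption.

```python
import math

def normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu, sigma = 75, 10
score = 60

z = (score - mu) / sigma     # z score of the threshold
p_below = normal_cdf(z)      # area left of z (the z-table value)
p_above = 1.0 - p_below      # probability of scoring more than 60

print(z, round(p_below, 4), round(p_above, 4))
```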
________________________________________________________________
____________________________________________________________
Linear regression in-depth math intuition
But we can't keep on applying this formula again and again . This is
discussed later .
Example :
Consider a graph with data points and the fitted line ŷ = mx + c
We generally consider c = 0 (ie the line passes through the origin) , otherwise we will have
to draw a 3D diagram (cost against both m and c)
Data points : (1, 1) , (2, 2) , (3, 3) . Putting x = 1 in our equation , ŷ = 1
m = 1 (consideration)
Cost function : $J(m) = \frac{1}{2n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2 = \frac{1}{2 \cdot 3}\left[(1 - 1)^2 + (2 - 2)^2 + (3 - 3)^2\right] = 0$ (n = number of data points)
Gradient descent
If the initial point is high on the cost curve (eg m = 0 , where the cost is about 2.3) , we have to go "downwards"
to reach the minimal value . So we use the convergence theorem .
Convergence theorem
$m = m - \alpha \frac{\partial J}{\partial m}$ (α = learning rate)
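The update rule can be sketched on the three points (1, 1), (2, 2), (3, 3) with c = 0. The learning rate, the starting slope and the number of steps are assumptions chosen so that the slope settles at m = 1.

```python
# Data points lying exactly on the line y = x.
xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]
n = len(xs)

def cost(m):
    """J(m) = 1/(2n) * sum of squared errors for the line y_hat = m * x."""
    return sum((m * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * n)

def gradient(m):
    """dJ/dm = 1/n * sum of (error * x)."""
    return sum((m * x - y) * x for x, y in zip(xs, ys)) / n

m = 0.0        # assumed starting slope (high on the cost curve)
alpha = 0.1    # assumed learning rate

for _ in range(100):
    m = m - alpha * gradient(m)   # step downhill on the cost curve

print(m, cost(m))
```

A learning rate that is too large would overshoot the minimum, while a very small one would need many more steps.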
● dataset.columns=df.feature_names
● dataset.head()
● df.target.shape
○ (506,) → 506 rows
We create a new column called price and add the target variable there
● dataset["price"]=df.target
● dataset.head()
● x=dataset.iloc[:,:-1] // independent features
● y=dataset.iloc[:,-1] // dependent feature
Ridge regression :
● from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
● ridge=Ridge()
● parameters={'alpha':[1e-15,1e-10,1e-8,1e-3,1e-2, 1, 5, 10, 20, 30, 35, 40,
45, 50, 55, 100]}
● ridge_regressor=GridSearchCV(ridge,parameters,scoring='neg_mean_squared_error', cv=5)
ridge_regressor.fit(x,y)
● print(ridge_regressor.best_params_) // best_params_ shows which
λ (alpha) value is the most suitable .
print(ridge_regressor.best_score_)
○ {'alpha': 100}
○ -29.871945115432595
1e-15 → 10^-15
Lasso regression :
● from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
lasso=Lasso()
parameters={'alpha':[1e-15,1e-10,1e-8,1e-3,1e-2, 1, 5, 10, 20, 30, 35, 40, 45,
50, 55, 100]}
● lasso_regressor=GridSearchCV(lasso, parameters,
scoring='neg_mean_squared_error', cv=5)
lasso_regressor.fit(x,y)
print(lasso_regressor.best_params_)
print(lasso_regressor.best_score_)
○ {'alpha': 1}
-35.491283263627095
● from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3,
random_state=0)
● prediction_lasso=lasso_regressor.predict(X_test)
prediction_ridge=ridge_regressor.predict(X_test)
● import seaborn as sns
sns.distplot(y_test-prediction_lasso)
● import seaborn as sns
sns.distplot(y_test-prediction_ridge)
● Thus ridge and lasso give a similar graph → good prediction
________________________________________________________________
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2,
random_state = 0)
EXPLANATION:
3 categories in the categorical feature → california , florida , new york
*** r2 score .
// the data shows the expenditure in various departments . Based on it we
have to predict the sales value .
We solve the problem with the help of ordinary least squares (OLS) → used in multiple
linear regression
OLS:
1. Estimates the coefficients of linear regression equations which
describe the relationship between one or more
independent quantitative variables and a dependent
variable
2. We minimise the sum of squared errors
instead of the sum of errors directly , because positive and negative errors would cancel each other out in a plain sum .
○ // below (column-wise) are β1 , β2 & β3. A coefficient indicates that if
the expenditure on TV increases by 1 unit , then the sales value increases
by 0.0458 units . newspaper → -ve coefficient → spending more on
newspaper does not increase the sales , so we don't
have to spend a lot there .
○ // std err → low → none of the features has a multicollinearity problem .
If the features are correlated then std err → bigger number .
○ // P value is small for all features except newspaper (0.860) , so newspaper is not significant
● import statsmodels.api as sm
● df_salary = pd.read_csv('data/Salary_data.csv')
● df_salary.head()
○ // years of experience and age are independent features , salary -> dependent feature
● x=df_salary[['YearsExperience','age']]
y=df_salary['salary']
● x=sm.add_constant(x)
model=sm.OLS(y,x).fit()
● model.summary()
○ // the coefficient of age shows how much the salary increases if we increase the age by 1 year .
1. R2 is high → the model fits well
2. But std err is a HUGE VALUE
3. If there is a multicollinearity problem then std error is a big value .
4. If we add another feature which is correlated with the other features
then the std err will be VERY VERY HIGH.
5. The P value is > 0.05 for age , so age and years of experience might
be correlated . To confirm it we write the code below .
● x.iloc[:,1:].corr()
○ // age and years of exp have 98% correlation . Thus we may drop the
age feature .
________________________________________________________________
y^ = mx + c
Y = value in graph
_________________________
A good model should have both low bias & low variance .
Ridge regression :
Note : steep slope ⇒ overfitting
Steep – a unit change on the x axis → a large change on the y axis
λ → any value ≥ 0 (λ = 0 gives back ordinary least squares)
Cost = \sum (y - \hat{y})^2 + \lambda (\text{slope})^2 ... (eq. 2)
Cost of line 2 < cost of line 1
Now the green line is the new best fit line
Note : We PENALIZE features with higher slopes (↑ m) & make the line less steep
y = mx + c → 1 feature, 1 slope
y = m1x1 + m2x2 + c → 2 slopes
Penalty term → \lambda (m_1^2 + m_2^2)
When we apply the line with the smaller slope to our test data → the difference
between predicted and actual values will be smaller than with the steeper slope.
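A minimal sketch of the ridge idea for one feature, assuming the penalised cost above with the intercept left unpenalised. The function name `ridge_fit` and the toy data are made up for illustration; for that cost the slope works out to Sxy / (Sxx + λ), so a bigger λ gives a flatter line.

```python
import numpy as np

def ridge_fit(x, y, lam):
    """Fit y ≈ m*x + c by minimising sum((y - y_hat)^2) + lam * m^2."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_c = x - x.mean()          # centre x so the intercept drops out
    y_c = y - y.mean()
    m = (x_c @ y_c) / (x_c @ x_c + lam)   # lam in the denominator shrinks the slope
    c = y.mean() - m * x.mean()           # intercept is not penalised
    return m, c

# Toy data lying exactly on y = 2x (hypothetical values)
x = [1, 2, 3]
y = [2, 4, 6]
m_ols, c_ols = ridge_fit(x, y, lam=0.0)      # lam = 0 -> ordinary least squares
m_ridge, c_ridge = ridge_fit(x, y, lam=2.0)  # lam > 0 -> less steep line
print(m_ols, c_ols, m_ridge, c_ridge)
```

With λ = 0 the slope is the usual least squares slope; raising λ trades a little training error for a smaller slope.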
Lasso regression :
→ reduces overfitting
→ feature selection (the penalty is λ|slope|, so some slopes shrink exactly to 0)
2 – overfitting condn
1)Decision tree
2)Random forest
– multiple decision trees in parallel
– Scenario – low Bias & high var
– also called BOOTSTRAP aggregation
– bootstrap agg.: we take a data set & give it to multiple models
– each model gets NOT the complete records but a random sample of them (drawn
with replacement); these samples go to multiple decision trees & we get the o/p.
– the o/p is then aggregated
– since the decision trees run in parallel and are aggregated, high variance → low variance
Q ) what kind of technique → xg boost has ? High bias , low var or low bias &
high var etc .
________________________________________________________________
R2 & adjusted R2 :
R2 = sample R2
P = no of predictors / indep features
N = total sample size
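With the symbols defined above, the standard adjusted R2 formula can be written as below. It only goes up when a new predictor improves the fit by more than chance would.

```latex
R^2_{\text{adj}} = 1 - \frac{(1 - R^2)\,(N - 1)}{N - P - 1}
```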
Hypothesis testing :-
Steps :
1.Make initial assumption(Ho)
2.Collect data
3.Gather evidence to reject or fail to reject the null hypothesis.
CONFUSION MATRIX :
Lets say we ask the q , what is the difference in proportion of male and female ?
→ say h1 → yes there is a difference in proportion .
→ the above is a sample data set
** continued below
●
● At 2.5% region → reject null hypo.
● Since value is far from mean value
** continued
We make a Ho,H1 & test table , table . lets say H1 - there is a diff b/w the
proportion of male and female .
Ho → there is no diff
Considering a test case with one categorical feature, we need a test that
answers: assuming the null hypo. is true, how likely is it to see data at
least as extreme as our sample? That probability is the p value.
We take significance level α = 0.05
1.For one sample feature the test -> One sample proportion test
The significance level (α) is selected before the test.
α = 0.05 has the same graph as mentioned above (2.5% rejection region in each tail).
If p < 0.05 -> reject null hypo
2.If we have 2 categorical feature -> test -> chi square test
3.T test
a.1st case continuous variable (eg : height )
b.2nd case 1 numerical var and cat var with only 2 categories (M & F)
4.2 numerical variable -> test -> correlation (eg: for pearsons value range ->
-1 to +1 if near to 0 -> no correlation )
5.Anova test
a. one numerical var and one categorical var.
b. the cat var has more than 2 categories (eg : age : adult, elderly,
young)
Features selected
H0 No difference
H1 Some diff.
T test :
A t test is a type of inferential statistics which is used to determine if there is a
significant difference between the means of two groups which may be related in
certain features.
1.One sample
Tells whether the sample and the population are different
Where,
𝜇= Proposed constant for the population mean
X(x bar)= Sample mean
n = Sample size (i.e., number of observations)
s = Sample standard deviation
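Using the symbols listed above, the one sample t statistic is the standard one, with n - 1 degrees of freedom:

```latex
t = \frac{\bar{x} - \mu}{s / \sqrt{n}}
```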
Poisson distribution
● import numpy as np
import pandas as pd
import scipy.stats as stats
import math
np.random.seed(6)
school_ages=stats.poisson.rvs(loc=18,mu=35, size=1500)
classA_ages=stats.poisson.rvs(loc=18,mu=30,size=60)
// loc=18 shifts the distribution so that our ages start from 18,
mu = mean of the Poisson part (so the expected age is loc + mu)
loc is therefore the extreme left (minimum) value of the distribution
● classA_ages.mean()
○ 46.9
● _,p_value=stats.ttest_1samp(a=classA_ages, popmean =
school_ages.mean())
● school_ages.mean()
○ 53.303333333333335
● p_value
○ 1.13e-13
● p value < 0.05 → we reject the null hypo (class A mean age differs from the school mean)
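To see what `ttest_1samp` does, here is a small sketch that computes the one sample t statistic by hand and compares it with SciPy. The sample values and the helper name `one_sample_t` are made up for illustration; they are not the class ages generated above.

```python
import numpy as np
from scipy import stats

def one_sample_t(sample, popmean):
    """Two-sided one sample t test computed from the textbook formula."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    s = sample.std(ddof=1)                      # sample standard deviation
    t_stat = (sample.mean() - popmean) / (s / np.sqrt(n))
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)  # both tails
    return t_stat, p_value

# Hypothetical sample of ages, tested against a proposed population mean
sample = [46, 50, 44, 49, 47, 52, 45, 48]
t_manual, p_manual = one_sample_t(sample, popmean=53.3)
t_scipy, p_scipy = stats.ttest_1samp(sample, popmean=53.3)
print(t_manual, p_manual)
print(t_scipy, p_scipy)
```

Both routes should give the same statistic and p value; the hand version only makes the formula visible.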
Two sample T test :
● np.random.seed(12)
ClassB_ages=stats.poisson.rvs(loc=18,mu=33,size=60)
ClassB_ages.mean()
○ 50.63333333333333
● _,p_value=stats.ttest_ind(a=classA_ages,b=ClassB_ages,equal_var=False)
// a value -> 1st grp
b value -> 2nd grp
equal_var=False -> Welch's t test (does not assume equal variances)
● weight1=[25,30,28,35, 28, 34, 26, 29, 30, 26, 28,32, 31, 30,45]
weight2=weight1+stats.norm.rvs(scale=5,loc=-1.25,size=15)
// weight2 is the weight of the same people after some years (paired samples)
● print(weight1)
print(weight2)
○ [25, 30, 28, 35, 28, 34, 26, 29, 30, 26, 28, 32, 31, 30, 45]
[30.57926457 . . . 41.32984284]
● weight_df=pd.DataFrame({"weight_10":np.array(weight1),
"weight_20":np.array(weight2),
"weight_change":np.array(weight2)-np.array(weight1)})
○
● _,p_value=stats.ttest_rel(a=weight1,b=weight2)
A -> previous wt
B -> recent wt
● print(p_value)
○ 0.5732936534411279
● if p_value < 0.05:# alpha value is 0.05 or 5%
print(" we are rejecting null hypothesis")
else:
print("we are accepting null hypothesis")
○ we are accepting null hypothesis
Correlation
● import seaborn as sns
● df=sns.load_dataset('iris')
● df.shape
○ (150, 5)
● df.corr(numeric_only=True) // numeric_only skips the text column 'species' (needed in recent pandas)
○
// from the above data we can see that sepal length and petal length are
highly correlated
If the value was nearer to 0 → not much corr.
● sns.pairplot(df)
○
● dataset.head()
○
● dataset_table=pd.crosstab(dataset['sex'],dataset['smoker'])
print(dataset_table)
○ smoker  Yes  No
sex
Male     60   97
Female   33   54
● dataset_table.values
○ array([[60, 97],
[33, 54]], dtype=int64)
● #Observed Values
Observed_Values = dataset_table.values
print("Observed Values :- \n",Observed_Values)
○ Observed Values :-
[[60 97]
[33 54]]
● val = stats.chi2_contingency(dataset_table)
^ chi square contingency function -> shift + tab shows its docs -> returns
(statistic, p value, degrees of freedom, expected values)
● val
○ (0.008763290531773594, 0.925417020494423, 1,
array([[59.84016393, 97.15983607],
[33.15983607, 53.84016393]]))
The array holds the expected values -> they differ only slightly from the observed values
● Expected_Values=val[3]
● no_of_rows=len(dataset_table.iloc[0:2,0]) // dataset_table = cross tab info
no_of_columns=len(dataset_table.iloc[0,0:2])
ddof=(no_of_rows-1)*(no_of_columns-1) // formula
print("Degree of Freedom :- ",ddof)
alpha = 0.05 // significance level: 5% chance of rejecting H0 when it is true
○ Degree of Freedom :- 1
●
O -> observed
E -> expected
● from scipy.stats import chi2
chi_square=sum([(o-e) ** 2./e for o,e in
zip(Observed_Values,Expected_Values)])
chi_square_statistic=chi_square[0]+chi_square[1]
● print("chi-square statistic :- ",chi_square_statistic)
○ chi-square statistic :- 0.001934818536627623
● critical_value=chi2.ppf(q=1-alpha,df=ddof)
//ppf -> percent point function (inverse of cdf).
print('critical_value:',critical_value)
○ critical_value: 3.841458820694124
● #p-value
p_value=1-chi2.cdf(x=chi_square_statistic,df=ddof)
print('p-value:',p_value)
print('Significance level: ',alpha)
print('Degree of Freedom: ',ddof)
print('p-value:',p_value)
○ p-value: 0.964915107315732
Significance level: 0.05
Degree of Freedom: 1
p-value: 0.964915107315732
● if chi_square_statistic>=critical_value:
    print("Reject H0,There is a relationship between 2 categorical variables")
else:
    print("Retain H0,There is no relationship between 2 categorical variables")
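The chi square steps above can be put together in one self-contained sketch, using the sex vs smoker counts from the cross tab. One thing worth knowing: for a 2 x 2 table `chi2_contingency` applies Yates' continuity correction by default, which is why its statistic (0.0088) differs from the hand computed one (0.0019). Passing `correction=False` makes the two agree. Variable names here are chosen for illustration.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Observed counts from the cross tab (rows: Male, Female; columns: Yes, No)
observed = np.array([[60, 97],
                     [33, 54]])

# Expected count for each cell = row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Chi square statistic: sum over all cells of (O - E)^2 / E
chi_square_statistic = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_value = chi2.sf(chi_square_statistic, df=dof)   # same as 1 - cdf

# Same test through SciPy, with the continuity correction switched off
stat_scipy, p_scipy, dof_scipy, expected_scipy = chi2_contingency(
    observed, correction=False
)
print(chi_square_statistic, p_value, dof)
```

Since the p value is far above 0.05, we retain H0: no evidence of a relationship between sex and smoking in this table.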
____________________________________________________________
Metrics in classification
1.Confusion mtx
2.FPR(type 1 error)
3.FNR(type 2 error)
4. RECALL(TPR, SENSITIVITY)
5.PRECISION(+VE PRED VAL)
6.Accuracy
7.F beta / F 1 score
8.Cohen kappa
9.ROC curve, AUC score
10. PR curve
1)Class labels
Eg : in binary classification there will be 2 classes A and B .
By default the THRESHOLD VALUE = 0.5 ⇒ say if value is > 0.5 then B
class else A class(<0.5)
2)Balanced vs imbalanced dataset
Balanced dataset → 1000 records ⇒ 600 yes, 400 no / 700 yes & 300 no
Yes and no are roughly equal, so when we provide our ML algorithm with
the data it will not be biased towards the majority output.
If we have 800 Y & 200 N → imbalanced dataset → biased o/p
1.Confusion mtx
2 X 2 mtx for binary classification where the top values are the actual values
LHS → predicted values
T → true
F → false
P → positive
N → negative
Imbalanced dataset :
Recall,precision , F beta
Out of the total actual positive values, how many did we correctly
predict as positive ⇒ recall / TPR (true positive rate) / sensitivity = TP / (TP + FN)
Out of the total predicted positive results, how many are actually positive
⇒ precision / positive prediction value = TP / (TP + FP)
Stock market / cancer prediction → use recall. If the case is actually +ve but the
model says -ve (a false negative) → disaster
statquest
WHENEVER FP is much more IMP use PRECISION
If FN imp → RECALL
F beta : whenever we need both FP & FN (recall & precision)
Fβ = (1 + β²)(precision x recall) / (β² x precision + recall)
If β = 1 → F1 score.
Similarly if β = 2 → F2 score (recall is weighted more)
when β = 1
F1 = 2(precision x recall) / (precision + recall) = HARMONIC MEAN
= 2xy / (x + y)
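A short sketch of these metrics computed directly from confusion matrix counts. The function name and the counts (TP = 40, FP = 10, FN = 20) are made up for illustration; the F beta line uses the general formula, which reduces to the harmonic mean when β = 1.

```python
def precision_recall_fbeta(tp, fp, fn, beta=1.0):
    """Return (precision, recall, F-beta) from confusion matrix counts."""
    precision = tp / (tp + fp)   # of the predicted positives, how many are right
    recall = tp / (tp + fn)      # of the actual positives, how many we found
    b2 = beta ** 2
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_beta

# Hypothetical counts
precision, recall, f1 = precision_recall_fbeta(tp=40, fp=10, fn=20)
_, _, f2 = precision_recall_fbeta(tp=40, fp=10, fn=20, beta=2.0)
print(precision, recall, f1, f2)
```

Here recall is lower than precision, so F2 (which leans towards recall) comes out lower than F1.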
Logistic regression :
Microsoft onenote
Time complexity (prop to )input
● Logistic Regression
used for Binary classification
Classification :
1.Binary
2.Multiclass
Why we call it regression?
● If wt is > 75 we say the person is obese.
If y >= 0.5 then we consider y = 1, i.e. we consider the person to be obese
If we can solve the classification problem with one line,
then why do we need logistic regression?
See above example of 90 kg. Reason :-
Above the plane the distance is always positive; below the plane it is -ve.
Which shows that if we have this kind of scenario.
From 1 we know that y -> +ve
Thus it is getting properly classified
As we saw in our first case, we could not use the best fit line directly. Sigmoid
function -> it takes any value, even a large -ve one (-500), and transforms it to a value b/w 0 and
+1.
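A minimal sketch of the sigmoid squashing step, assuming the usual definition 1 / (1 + e^(-z)) and the 0.5 threshold mentioned earlier. The input values are arbitrary examples.

```python
import numpy as np

def sigmoid(z):
    """Map any real value to the range 0 to 1."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

# Signed distances from the plane (arbitrary example values)
z = np.array([-500.0, -2.0, 0.0, 2.0, 500.0])
probabilities = sigmoid(z)
predicted_class = (probabilities >= 0.5).astype(int)  # threshold at 0.5
print(probabilities)
print(predicted_class)
```

Very negative inputs land close to 0, very positive ones close to 1, and 0 maps to exactly 0.5, so an extreme point can no longer drag the output outside the 0 to 1 range.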
IMP : the feature that gives the highest information gain -> used to split the decision tree
Gini impurity in decision trees
Graph -
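A small sketch of how Gini impurity, entropy and information gain can be computed for a node, using the standard definitions (Gini = 1 - Σp², entropy = -Σp log2 p). The helper names and the yes/no labels are made up for illustration.

```python
import numpy as np

def class_proportions(labels):
    """Fraction of records in each class at a node."""
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def gini_impurity(labels):
    p = class_proportions(labels)
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    p = class_proportions(labels)
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    """Entropy of the parent minus the weighted entropy of the two children."""
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted

parent = ["yes", "yes", "no", "no"]           # 50/50 node: most impure case
left, right = ["yes", "yes"], ["no", "no"]    # a split that separates the classes
print(gini_impurity(parent), entropy(parent), information_gain(parent, left, right))
```

For two classes Gini impurity peaks at 0.5 and entropy at 1, both at a 50/50 node; a pure node scores 0 on both.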
_______
ADABOOST:
In Adaboost the weights are assigned to the records
In our dataset we have features such as f1 , f2 , f3 and o/p .
All these records get sample weight .
Initially all the records are assigned the same weights .
We created our base learner with the help of decision trees in Adaboost .
Here the decision trees are created with a depth of only 1, i.e. with 2
leaf nodes
These decision trees are called stumps .
Here all the base learners are decision trees .
1.Algorithm
2.Metrics
Euclidean
Manhattan
3.Elbow method (selecting the K value )
Euclidean :
Hypotenuse distance (straight line between the two points)
Manhattan distance (sum of the other two sides instead of the hypotenuse) :
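The two distance metrics can be sketched in a few lines. The function names and the example points (0, 0) and (3, 4) are chosen for illustration, since they form the familiar 3-4-5 right triangle.

```python
import math

def euclidean(p, q):
    """Straight line (hypotenuse) distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Sum of the absolute differences along each axis."""
    return sum(abs(a - b) for a, b in zip(p, q))

p, q = (0, 0), (3, 4)
d_euclidean = euclidean(p, q)
d_manhattan = manhattan(p, q)
print(d_euclidean, d_manhattan)
```

The Manhattan distance is never smaller than the Euclidean one, because it walks along the two sides instead of cutting across.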
K means
It is able to find the similarity in a group of data points and group them into
clusters.
Centroids -- centres of the clusters
In the example on rhs, each point is assigned to the centroid it is closest
to. Thus the points are divided into two groups
Elbow method :
Lec 71
______
Hierarchical clustering intuition :
-- unsupervised ML
Trick : how to find the number of clusters. Find the longest vertical line such
that none of the horizontal lines passes through it
DBSCAN :
-- unsupervised ML algorithm
-- 1 epsilon
-- Minpts
we try to make clusters -- helps us to find out the most similar points in a
distribution
If the point a with radius ep. has at least 4 pts in it --> CORE PT.
2 cond for core pt :
1) boundary with ep.
2) number of pts inside the boundary should be >= MinPts (here 4)
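A rough sketch of the core point check, assuming Euclidean distance and assuming the count of neighbours includes the point itself (conventions differ between texts). The function name and the five points are made up for illustration.

```python
import numpy as np

def is_core_point(points, index, eps, min_pts):
    """True if at least min_pts points (including the point itself) lie within eps."""
    points = np.asarray(points, dtype=float)
    distances = np.linalg.norm(points - points[index], axis=1)
    return bool(np.sum(distances <= eps) >= min_pts)

# Four points close together and one far away (hypothetical data)
points = [(0, 0), (0.5, 0), (0, 0.5), (0.4, 0.4), (5, 5)]
core_first = is_core_point(points, index=0, eps=1.0, min_pts=4)
core_last = is_core_point(points, index=4, eps=1.0, min_pts=4)
print(core_first, core_last)
```

The first point has enough neighbours inside its ε boundary to be a core point; the isolated last point does not, so DBSCAN would treat it as an outlier unless it sits inside some core point's boundary.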
· Advantages of DBSCAN:
. Is great at separating clusters of high density versus clusters of low
density within a given dataset.
. Is great with handling outliers within the dataset.
· Disadvantages of DBSCAN:
. Does not work well when dealing with clusters of varying densities. While
DBSCAN is great at separating high density clusters from low density
clusters, DBSCAN struggles with clusters of similar density.
. Struggles with high dimensionality data. Although DBSCAN is great at
contorting the data into different dimensions and shapes, it can only go so
far: if given data with too many dimensions, DBSCAN suffers.
_______
Silhouette (clustering ) :
It is used to verify whether the clustering algorithm we have used works properly
or not
3 steps
1. take 1 data pt in cluster c1 and calculate its average distance ai to the other points of c1
using Euclidean or Manhattan distance.
2. take the nearest other cluster (c2). We calculate the average distance bi from
the point to the points of c2
3. If clustering is done properly then ( ai << bi )
The silhouette value is between -1 and +1
If the value is towards -ve then ai >> bi -- bad.
formula:
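The standard silhouette coefficient for a point i, with ai the average distance to points in its own cluster and bi the average distance to points in the nearest other cluster, is:

```latex
s(i) = \frac{b_i - a_i}{\max(a_i,\, b_i)}
```

When ai << bi the value approaches +1 (good clustering); when ai >> bi it approaches -1.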
Curse of Dimensionality :
-- dimensions (features ) -- attributes
2d feature → 1d feature
If we have some pts on x-y plane . we can draw a principal component line such
that all the pts can be projected on that line .
______________________________________________________________
● import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
● //Display all the columns of the dataframe
pd.set_option('display.max_columns', None)
● dataset=pd.read_csv('train.csv')
//Print shape of data set with rows and columns
print(dataset.shape)
○ (1460,81) rows, columns
● //Print the top 5 records
dataset.head()
○ //various categories which determine the house prices
Missing values :
Here we will check the percentage of nan values present in each feature
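1st step make the list of features which have missing values
● features_with_na=[feature for feature in dataset.columns if dataset[feature].isnull().sum()>1]
for feature in features_with_na:
    print(feature, np.round(dataset[feature].isnull().mean(),4), '% missing values')
// whatever value we get → round it to 4 decimal pts, and we print
the fraction of missing values for each feature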
○
Since they are many missing values we need to find the relationship b/w missing
values and sales price
We cant just drop the missing values since they might have some dependency
on the output
Numerical variables
How many features are numerical variable :
● numerical_features = [feature for feature in dataset.columns if
dataset[feature].dtypes != 'O'] → //If the feature is not object then it
is numerical
● print('Number of numerical variables: ', len(numerical_features))
○
Yr sold → temporal variable → data will be updated each yr
From the dataset we have 4 year variables. We have to extract information from the
datetime variables like no of years or no of days. One example is the diff in years
between the year the house was built and the year it was sold. This analysis is
done in feature engineering.
# List of variables that contain year information
● year_feature = [feature for feature in numerical_features if 'Yr' in feature or
'Year' in feature] // the feature names contain the keyword Yr or Year
● dataset.groupby('YrSold')['SalePrice'].median().plot()
plt.xlabel('Year Sold')
plt.ylabel('Median House Price')
plt.title("House Price vs YearSold")
○
● year_feature
○ ['YearBuilt', 'YearRemodAdd', 'GarageYrBlt', 'YrSold']
## Here we will compare the difference between ALL year features and
SalePrice
● for feature in year_feature:
    if feature != 'YrSold':
        data=dataset.copy()
        ## We will capture the difference between year variable and year the house was sold
        data[feature]=data['YrSold']-data[feature]
        plt.scatter(data[feature],data['SalePrice'])
        plt.xlabel(feature)
        plt.ylabel('SalePrice')
        plt.show()
○
Observe from above graph that if the house was very old (140 yr) low price
○
Similarly for other features
## Numerical variables are usually of 2 type
## 1. Continuous variable and Discrete Variables
●
● discrete_feature=[feature for feature in numerical_features if
len(dataset[feature].unique())<25 and feature not in year_feature+['Id']]
print("Discrete Variables Count: {}".format(len(discrete_feature)))
○ Discrete Variables Count: 17
● discrete_feature
○ ['MSSubClass',
'FullBath',
'HalfBath',
. . . . . . . . .]
● dataset[discrete_feature].head()
○
## Lets find the relationship between them and SalePrice
● for feature in discrete_feature:
    data=dataset.copy()
    data.groupby(feature)['SalePrice'].median().plot.bar()
    plt.xlabel(feature)
    plt.ylabel('SalePrice')
    plt.title(feature)
    plt.show()
○
//Similarly we have other graphs
Continuous Variable
● continuous_feature=[feature for feature in numerical_features if feature not
in discrete_feature+year_feature+['Id']]
● print("Continuous feature Count {}".format(len(continuous_feature)))
○ Continuous feature Count 16
○
Similarly the other graphs do not follow a gaussian distribution (they are skewed), so
we apply a log transform to bring them closer to a normal distribution
6. Outliers
● for feature in continuous_feature:
    data=dataset.copy()
    if 0 in data[feature].unique():
        pass  # log(0) is undefined, so skip features containing 0
    else:
        data[feature]=np.log(data[feature])
        data.boxplot(column=feature)
        plt.ylabel(feature)
        plt.title(feature)
        plt.show()
○
○ // many outliers for all features . this does not work for categorical
features . we only use for continuous f.
Categorical var
● categorical_features = [feature for feature in dataset.columns if dataset[feature].dtypes
=='O']
● categorical_features
○ ['MSZoning',
○ 'Street',
○ 'Alley', . . . . . .
● dataset[categorical_features].head()
○ // display top 5 results
○ // focus on cardinality values → how many different categories we
have inside categorical feature .
● for feature in categorical_features:
    print('The feature is {} and number of categories are {}'.format(feature,len(dataset[feature].unique())))
○ The feature is MSZoning and number of categories are 5
The feature is Street and number of categories are 2
The feature is Alley and number of categories are 3
The feature is LotShape and number of Categories are 4
........
○ ..........
____________
● import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# to visualize all the columns in the dataframe
pd.pandas.set_option('display.max_columns', None)
● dataset=pd.read_csv('train.csv')
dataset.head()
○
In a kaggle problem statement we have train data and test data. Usually in
kaggle we want very good accuracy, so we are tempted to combine train and test data
and then do the feature eng. on the combined set. Because of that there is data
leakage: some info flows from train data to test data and vice versa, so the accuracy
measured on the test data looks better than it really is.
Missing values
dataset=replace_cat_feature(dataset,features_nan)
dataset[features_nan].isnull().sum()
○ Alley: 0
MasVnrType:0
BsmtQual:0
## Now lets check for numerical variables that contain missing values
● numerical_with_nan=[feature for feature in dataset.columns if
dataset[feature].isnull().sum()>1 and dataset[feature].dtypes!='O']
## We will print the numerical nan variables and percentage of missing values
● dataset[numerical_with_nan].isnull().sum()
○ LotFrontage 0
MasVnrArea 0
GarageYrBlt 0
● dataset.head(50)
○
// date time variables → ['YearBuilt', 'YearRemodAdd', 'GarageYrBlt', 'YrSold']
○
● dataset[['YearBuilt', 'YearRemodAdd', 'GarageYrBlt']].head()
○
Feature eng- part 2 :
● dataset.head()
○
● import numpy as np
num_features=['LotFrontage', 'LotArea', '1stFlrSF', 'GrLivArea', 'SalePrice']
// apply a log transform (skewed features become closer to normal)
for feature in num_features:
    dataset[feature]=np.log(dataset[feature])
● dataset.head()
○
Feature scaling :
We have many features which are measured in different units.
MinMaxScaler : converts data to the range b/w 0 & 1
StandardScaler : converts data to mean 0 and std dev 1
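What the two scalers compute can be sketched with plain NumPy. These hand written functions are only an illustration of the formulas (scikit-learn's `MinMaxScaler` and `StandardScaler` do the same column by column); the function names and the small array are made up.

```python
import numpy as np

def min_max_scale(x):
    """Rescale each column to the range 0 to 1: (x - min) / (max - min)."""
    x = np.asarray(x, dtype=float)
    return (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

def standard_scale(x):
    """Rescale each column to mean 0 and standard deviation 1: (x - mean) / std."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean(axis=0)) / x.std(axis=0)

# Two features on very different scales (hypothetical values)
values = np.array([[1, 100],
                   [2, 200],
                   [3, 300]])
scaled_min_max = min_max_scale(values)
scaled_standard = standard_scale(values)
print(scaled_min_max)
print(scaled_standard)
```

After scaling, both columns look identical even though the raw units differed by a factor of 100, which is the point of feature scaling.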
Code :
● feature_scale=[feature for feature in dataset.columns if feature not in ['Id',
'SalePrice']]
○
● data.to_csv('X_train.csv',index=False)
○
● We drop SalePrice and Id, because Id is just a continuously increasing index and
SalePrice is the dependent feature
● ## Capture the dependent feature
y_train=dataset[['SalePrice']]
○ total features: 82
selected features: 21
features with coefficients shrank to zero: 61
● selected_feat
○ Index(['MSSubClass', 'MSZoning', 'Neighborhood', 'OverallQual',
'YearRemodAdd', . . . . . .
● X_train=X_train[selected_feat]
● X_train.head()
○
_____
___
○ [[4 1 1]
[3 6 0]
[6 2 2]]
● import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
○
○
##Using KNN
Remember that we are trying to come up with a model to predict whether
someone will TARGET CLASS or not. We'll start with k=1.
○
Choosing a K value :
● error_rate = []
for i in range(1,40):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train,y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))
● plt.figure(figsize=(10,6))
plt.plot(range(1,40), error_rate, color='blue', linestyle='dashed', marker='o',
markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')
○
##Here we can see that after around K>23 the error rate just tends to hover
around 0.06-0.05. Let's retrain the model with that and check the classification
report!
● knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train,y_train)
pred = knn.predict(X_test)
print('WITH K=1')
print('\n')
print(confusion_matrix(y_test,pred))
print('\n')
print(classification_report(y_test,pred))
○
● knn = KNeighborsClassifier(n_neighbors=23)
knn.fit(X_train,y_train)
pred = knn.predict(X_test)
print('WITH K=23')
print('\n')
print(confusion_matrix(y_test,pred))
print('\n')
print(classification_report(y_test,pred))
## here error rate has decreased
○
Working of K nearest
Regression usecase :
Only 1 output value: the average (mean) of the values of all the nearest neighbors
K=5
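A minimal sketch of KNN for regression, assuming Euclidean distance and a plain (unweighted) mean of the K nearest targets. The function name and the toy data are made up for illustration.

```python
import numpy as np

def knn_regress(X_train, y_train, x_query, k=5):
    """Predict the mean target of the k training points closest to x_query."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    distances = np.linalg.norm(X_train - np.asarray(x_query, dtype=float), axis=1)
    nearest = np.argsort(distances)[:k]   # indices of the k smallest distances
    return y_train[nearest].mean()

# One feature, six training records (hypothetical values)
X_train = [[1], [2], [3], [4], [5], [10]]
y_train = [10, 20, 30, 40, 50, 100]
prediction = knn_regress(X_train, y_train, x_query=[3], k=5)
print(prediction)
```

The far away record at x = 10 is not among the 5 nearest neighbours of the query, so it does not influence the prediction.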
__
Ensemble technique :
.Combining multiple models
1.Bagging(bootstrap aggregation)
a.Random forest
2.Boosting
a.ADABOOST
b.GRADIENT BOOSTING
c. Xgboost
1.Bagging
The output is 0 / 1 . here we will use voting classifier → majority of the votes by
models → considered
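The voting step can be sketched as below for 0 / 1 outputs. The predictions are made up; an odd number of models is assumed so that there are no ties.

```python
import numpy as np

def majority_vote(predictions):
    """Combine 0/1 predictions from several models; rows are models, columns are records."""
    predictions = np.asarray(predictions)
    # The mean of each column is the fraction of models that voted 1
    return (predictions.mean(axis=0) > 0.5).astype(int)

# Three models, four records (hypothetical predictions)
model_predictions = [[1, 0, 1, 1],
                     [1, 1, 0, 1],
                     [0, 0, 1, 1]]
final_prediction = majority_vote(model_predictions)
print(final_prediction)
```

Each record gets the class that most of the models agreed on, even when an individual model disagrees.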
Random forest :
In random forest the models (in bootstrap agg) are decision trees
Decision tree grown to its full depth -> low bias (gets properly trained - less training error), high
variance (prone to give large error with new test data) → Overfitting
When we combine all the decision trees (individually with high variance) the high
variance is converted to low variance.
Code :
● import numpy as np
import pandas as pd
import sklearn
import scipy
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import classification_report, accuracy_score
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM
from pylab import rcParams
rcParams['figure.figsize'] = 14, 8
RANDOM_SEED = 42
LABELS = ["Normal", "Fraud"]
○
Class → dependent feature
Remaining → independent feature
0 - normal
1 - fraudulent
● data.info()
○
○ (284807, 30)
(284807,)
Randomized search .
Line 40 : tries various parameter combinations and finds out for which parameters xg boost
works best.
Line 36 : we will take those parameters which are present inside XGB classifier
We give various values, eg learning rate [0.05,0.10,0.15]. The randomized search
algo samples random combinations of these values (grid search would try every combination).
We shouldn't lower the learning rate much below 0.05, otherwise many more boosting
rounds are needed and training takes more time.
colsample_bytree should be between 0 and 1 (it is a fraction of the columns); gamma is ≥ 0.
Line 41 : verbose → to give message abt the time and status of the job etc
Auto encoders and decoders are used for dimensionality reduction in case of deep
learning, but in case of machine learning we use PCA
Day 1
D2
D3
D4
D5
Convert it to
o/p :
d1 d2 d3 d4 d5
d6
Similarly if we want prediction of day 7 we will take day 2 to 6. Day 1 will be
removed
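The windowing described above can be sketched as follows: each row of inputs holds 5 consecutive days and the target is the day that follows. The function name `make_windows` and the series 1 to 7 are made up for illustration.

```python
import numpy as np

def make_windows(series, window=5):
    """Turn a 1-D series into (inputs, targets) using a sliding window."""
    series = np.asarray(series, dtype=float)
    # Row i holds days i .. i+window-1; its target is day i+window
    X = np.array([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X, y

# Days 1 to 7 (hypothetical values standing in for the daily readings)
X, y = make_windows([1, 2, 3, 4, 5, 6, 7], window=5)
print(X)
print(y)
```

The first row uses days 1 to 5 to predict day 6; the second row drops day 1 and uses days 2 to 6 to predict day 7.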