Python Codebook
aaryani6991@gmail.com
9T6KVYUXDH
                          Codebook
    Data Science is the art and science of solving real-world problems and making data-driven decisions. It involves an
    amalgamation of three aspects, and a good data scientist has expertise in all three of them. These are:
    A lack of expertise should not become an impediment in your journey in Data Science. With consistent effort, you
    can become fairly proficient in coding over a period of time. This Codebook is intended to help you become
    comfortable with the finer nuances of Python and can be used as a handy reference for data science code
    throughout the program and beyond.
    Please keep in mind that there is no single right way to write code to achieve an intended outcome. There can be multiple
    ways of doing things in Python. The examples presented in this document use just one of the approaches to perform
    the analysis. Please explore different ways of performing the same task yourself.
                            This file is meant for personal use by aaryani6991@gmail.com only.
                           Sharing or publishing the contents in part or full is liable for legal action.
        Contents
PREFACE
        Arithmetic Operations
          Addition
          Subtraction
          Multiplication
          Division
          Square
          Square Root
        Loops
          For Loop
          While Loop
        Index Error
        Type Error
PRE-PROCESSING
        Null Value Check
        Outlier Check
        Scaling
          Standard Scaler
          Min-Max Scaler
STATISTICS
VISUALISATIONS
        Histogram
        Pairplot
        Boxplot
    Table of Figures
    Figure 1: Syntax Error
    Figure 2: Syntax Error - EOF while parsing
    Figure 3: Index Error
    Figure 4: Module Not Found Error
    Figure 5: Import Error
    Figure 6: Key Error
    Figure 7: Value Error
    Figure 8: Type Error
    Figure 9: Name Error
    Figure 10: Indentation Error
    Figure 11: Zero Division Error
    Table of Equations
    Equation 1: Z-Statistic
    Equation 2: T-Statistic
    Equation 3: T-Statistic for 2 samples
    Equation 4: F-Statistic
      Python Basics
      Arithmetic Operations
      Addition
      a=4
      b=7
      a+b
      11
      Subtraction
      a-b
      -3
      Multiplication
      a*b
      28
      Division
      a/b
      0.5714285714285714
      Square
      a**2
      16
      Square Root
      a**0.5
      2.0
      Loops
      For Loop
      When the number of iterations required is known in advance (say n), the ‘for’ loop is used.
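As an illustration, here is a minimal sketch of a ‘for’ loop that runs a known number of times. The task (adding up the numbers 1 to n) and the names n and total are assumptions made for this example, not part of the original codebook.

```python
# Sum the integers from 1 to n with a 'for' loop.
n = 5          # the number of iterations is known up front
total = 0
for i in range(1, n + 1):   # i takes the values 1, 2, ..., n
    total += i
print(total)   # prints 15
```

range(1, n + 1) is used because range stops one short of its second argument.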
       While Loop
       When the number of iterations required is not known in advance, the ‘while’ loop is used.
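A minimal sketch of a ‘while’ loop follows. The task (halving a number until it drops below 1) and the names value and steps are illustrative assumptions; the point is that the loop runs until its condition becomes False, however many passes that takes.

```python
# Keep halving a number until it drops below 1.
value = 20.0
steps = 0
while value >= 1:        # the loop stops as soon as the condition is False
    value = value / 2
    steps += 1
print(steps, value)      # prints 5 0.625
```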
       Conditional Statements
       IF Statement
       This statement checks the condition provided and, if the condition is True, the program moves ahead with the defined steps.
     x=300
     if x>200:
        print('Hey, x is greater than 200!')
     Hey, x is greater than 200!
       IF-Else Statement
       This statement is an extension of the IF statement. If the ‘IF’ condition is not met, the program moves to the Else statement and
       performs the alternative steps defined there.
       x=100
       if x>200:
          print('Hey, x is greater than 200!')
       else: print('Hey, x is smaller than 200!')
       Hey, x is smaller than 200!
       IF-ELIF Statement
       This statement is an extension of the IF-Else statement. If the ‘IF’ condition is not met, the program moves to the ELIF (else if)
       statement and checks its condition; if that condition is not met either, the program moves on to the next ELIF or to the Else
       statement.
       Syntax along with an example is given below. In this example, the else branch would be executed only when x=y.
       x=100
       y=150
       if x>y:
          print('Hey, x is greater than y!')
       elif x<y:
          print('Hey, y is greater than x!')
       else: print('Hey, x is equal to y!')
       Hey, y is greater than x!
      User Defined Functions
      User-defined functions are very helpful in automating repetitive tasks, such as selecting the odd numbers from a series or
      converting a series of dates to timestamps.
      #A function is defined using 'def' followed by the function name and the arguments the function takes in
      def addition(x,y):
          #this is a function which returns the addition of two variables passed into it.
          return x+y
      #calling a function
      a=1
      b=2
      addition(a,b)
      3
      Importing Libraries/Modules
      A module is a file containing Python definitions and statements. Use ‘import’ and aliasing statement ‘as’.
      Example:
      import pandas as pd
      Here, we are importing pandas module with an alias ‘pd’.
      Python Error Debugging
      Syntax Error
      Raised when the syntax used is wrong. In the snapshot below, the print statement is missing its parentheses.
      It is also raised when one or more parentheses are missing; in the snapshot below, the closing ‘)’ is missing.
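A syntax error stops a script before any line runs, so this sketch passes the faulty source text to the built-in compile() in order to show the error safely. The string bad_source and the flag syntax_ok are illustrative names, not taken from the original snapshots.

```python
# Source text with a missing closing parenthesis.
bad_source = "print('hello'"
try:
    compile(bad_source, "<example>", "exec")   # parse only, nothing is executed
    syntax_ok = True
except SyntaxError as err:
    syntax_ok = False
    print("SyntaxError:", err.msg)
```

The exact wording of the message depends on the Python version.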
      Index Error
     Raised when the index position is out of bounds. In the snapshot below, lst[8] looks for index position 8, which is not present in the
     given list lst.
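A minimal sketch that reproduces this error; the contents of lst are made up for illustration.

```python
lst = [10, 20, 30]        # valid index positions are 0, 1 and 2
try:
    lst[8]                # position 8 does not exist
except IndexError as err:
    message = str(err)
    print("IndexError:", message)
```

Checking len(lst) before indexing is one simple way to avoid the error.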
      Import Error
      Raised when the specified name cannot be found in a module. In the snapshot below, ‘feature_importances_’ is not found in the module sklearn.tree.
                                                           Figure 5: Import Error
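The same kind of error can be reproduced with the standard library alone. In this sketch the name square_root is deliberately wrong (the math module provides sqrt); it is a made-up name used only to trigger the error.

```python
import_failed = False
try:
    from math import square_root   # math has sqrt, not square_root
except ImportError as err:
    import_failed = True
    print("ImportError:", err)

from math import sqrt              # the correct name imports fine
print(sqrt(16))                    # prints 4.0
```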
    Key Error
    Raised when the requested key is not found in the given dictionary. In the snapshot below, the key ‘d’ does not exist in the given dictionary.
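A minimal sketch of the error, using a made-up dictionary named grades. The last line shows dict.get, which returns a default value instead of raising.

```python
grades = {'a': 1, 'b': 2, 'c': 3}
try:
    grades['d']                      # 'd' is not a key of the dictionary
except KeyError as err:
    missing_key = err.args[0]
    print("KeyError:", missing_key)

fallback = grades.get('d', 0)        # no error; returns the default 0
```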
    Value Error
    Raised when an inappropriate value is passed into a function. In the snapshot below, ‘hello’ is inappropriately passed in for typecasting to
    float.
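A minimal sketch of the error. The names converted and number are illustrative.

```python
try:
    converted = float('hello')   # the text 'hello' cannot be read as a number
except ValueError as err:
    converted = None
    print("ValueError:", err)

number = float('3.5')            # a well-formed numeric string converts fine
```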
    Type Error
    Raised when an unsupported operation is performed on a type. In the snapshot below, a subtraction is attempted between the string ‘10’ and the integer 10.
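A minimal sketch of the error and of one possible fix, converting the string to an integer first. The name difference is illustrative.

```python
try:
    difference = '10' - 10           # str minus int is not supported
except TypeError as err:
    print("TypeError:", err)
    difference = int('10') - 10      # convert the string first, then subtract
print(difference)                    # prints 0
```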
    Name Error
    Raised when an undefined variable/object is used, as in the snapshot below.
                                                           Figure 9: Name Error
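A minimal sketch of the error. The name undefined_total is hypothetical and is deliberately never assigned.

```python
try:
    print(undefined_total)       # this name was never defined
except NameError as err:
    name_error_message = str(err)
    print("NameError:", name_error_message)
```

A common cause is a typo in a variable name or running a cell before the one that defines the variable.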
    Indentation Error
    Raised when the indentation is incorrect, as in the example below of a ‘for’ loop.
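Like a syntax error, an indentation error is detected before the code runs, so this sketch uses the built-in compile() on a source string. The string bad_loop is an illustrative stand-in for the lost snapshot, not a copy of it.

```python
# The body of the loop is not indented.
bad_loop = "for i in range(3):\nprint(i)"
try:
    compile(bad_loop, "<example>", "exec")   # parse only, nothing is executed
    indentation_ok = True
except IndentationError as err:
    indentation_ok = False
    print("IndentationError:", err.msg)
```

Indenting the print line under the for line removes the error.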
      Pre-processing
      Null Value Check
      df.isnull().sum()     #count of null values in each column of the data frame ‘df’
      To impute null values, use SimpleImputer, an imputation transformer for completing missing values. For numeric data, use
      ‘mean’ if there are no outliers and ‘median’ if there are; for categorical data, use ‘most_frequent’.
      Below is an example of imputation of null values in “object” type columns, hence the SimpleImputer strategy used is
      “most_frequent”.
      objects=df[cols].select_dtypes(include='object').columns
      non_objects=df[cols].select_dtypes(exclude='object').columns
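To make the two strategies concrete, the sketch below spells out in plain Python what ‘median’ and ‘most_frequent’ imputation compute. It is an illustration of the idea, not the SimpleImputer code from the original; the lists ages and cities are made-up data, and None stands in for a null value.

```python
from statistics import median, mode

ages = [25, None, 31, 40, None, 28]               # numeric column with gaps
cities = ['Pune', 'Delhi', None, 'Pune', 'Goa']   # object column with a gap

# Fill values are computed from the non-null entries only.
age_fill = median(v for v in ages if v is not None)
city_fill = mode(v for v in cities if v is not None)

ages_imputed = [age_fill if v is None else v for v in ages]
cities_imputed = [city_fill if v is None else v for v in cities]
print(ages_imputed)      # prints [25, 29.5, 31, 40, 29.5, 28]
print(cities_imputed)    # prints ['Pune', 'Delhi', 'Pune', 'Pune', 'Goa']
```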
      Outlier Check
      Use this custom function to check and treat outliers. It returns the bounds Q1 - 1.5*IQR and Q3 + 1.5*IQR, and values outside
      these bounds are capped at the bounds. A percentile-based alternative is the “95% - 5%” capping technique; based on the count of
      outliers, a decision could be made to use a different capping percentage like 99%-1%.
      import numpy as np
      def remove_outlier(col):
        #returns the lower and upper capping bounds for a numeric column
        Q1,Q3=np.percentile(col,[25,75])
        IQR=Q3-Q1
        lower_range= Q1-(1.5 * IQR)
        upper_range= Q3+(1.5 * IQR)
        return lower_range, upper_range
      lower_range,upper_range=remove_outlier(df[column])
      #cap values above the upper bound and below the lower bound
      df[column]=np.where(df[column]>upper_range,upper_range,df[column])
      df[column]=np.where(df[column]<lower_range,lower_range,df[column])
      The data (arrays or matrices) is split into random train and test subsets. The model is fitted on the train set, and predictions are
      made on the test set.
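The idea of a random split can be sketched in plain Python as below. In practice a library helper such as scikit-learn's train_test_split is the usual tool; this sketch only illustrates the mechanics, and the names rows, test_size and the seed value are assumptions made for the example.

```python
import random

rows = list(range(10))            # stand-in for 10 observations
rng = random.Random(1)            # fixed seed so the split is repeatable
shuffled = rows[:]
rng.shuffle(shuffled)             # random order, original list untouched

test_size = 0.3                   # 30% of the rows go to the test set
n_test = round(len(shuffled) * test_size)
test_rows = shuffled[:n_test]
train_rows = shuffled[n_test:]
print(len(train_rows), len(test_rows))   # prints 7 3
```

Every row ends up in exactly one of the two subsets, so the test set stays unseen while the model is fitted.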
      Scaling
Standard Scaler
      It is a scaling technique that rescales each feature so that it has a mean of zero and a standard deviation of 1.
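The arithmetic behind standard scaling is (value - mean) / standard deviation. The sketch below works it out in plain Python for one made-up feature; it illustrates the calculation rather than the scikit-learn StandardScaler call itself. The population standard deviation (dividing by n) is used here.

```python
from statistics import mean, pstdev

values = [2.0, 4.0, 6.0, 8.0]     # one made-up feature
mu = mean(values)
sigma = pstdev(values)            # population standard deviation (divide by n)
std_scaled = [(v - mu) / sigma for v in values]
print(std_scaled)
print(mean(std_scaled), pstdev(std_scaled))   # approximately 0 and 1
```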
Min-Max Scaler
      This estimator scales and translates each feature individually such that it is in the given range on the training set, e.g. between zero
      and one.
Source: scikit-learn
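For the range zero to one, min-max scaling reduces to (value - min) / (max - min). A plain-Python sketch with made-up numbers, shown as an illustration of the formula rather than of the scikit-learn estimator:

```python
values = [10, 20, 25, 50]         # one made-up feature
lo, hi = min(values), max(values)
minmax_scaled = [(v - lo) / (hi - lo) for v in values]
print(minmax_scaled)              # prints [0.0, 0.25, 0.375, 1.0]
```

The smallest value maps to 0 and the largest to 1, with everything else placed proportionally in between.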
     Statistics
Packages Required: Usually the packages required for this module are numpy, pandas, statsmodels, scipy.stats.
     Descriptive Statistics
     Descriptive Statistics is a collective term for the summary statistics of a data set. It is comprised of mean, median, mode, standard
     deviation, Inter Quartile Range (IQR) etc. and is studied through tables, graphs and charts.
     Population refers to the entire set of observations of a data set. Sample refers to a subset of the population on which most studies
     about the population are based. Sample is drawn randomly to make inferences about the population parameters.
     import numpy as np
     a = 100,56,29,90,102,134,809
     np.mean(a)
     188.57142857142858
     Alternatively, print(df.mean()) would give you the mean of all columns of the data frame ‘df’.
     Please note that you would need to import the numpy package for the np.mean function.
     Median represents the middle value.
     Syntax with example:
     import numpy as np
     a = 100,56,29,90,102,134,809
     np.median(a)
     Output: 100.0
     Alternatively, print(df.median()) would give you the median of all columns of a data set ‘df’.
     print(df.mode()) gives the mode of each column of the data frame df. If the columns have different numbers of modes, the shorter
     columns are padded with NaN in the output.
     Please note that you can generate the five-point summary which will display most of the measures at one place. It can be generated
     using df.describe() or df.describe().T
Measures of Dispersion
     Measures of Dispersion (Spread) are statistics that describe how data varies. Measure of dispersion gives us the sense of how
     much the data tends to diverge from the central tendency.
     Range: It shows the spread of the values contained in the variable. It is the difference between the maximum and minimum
     values.
       Interquartile Range (IQR)
       IQR gives a much better idea about the spread of the data because it is not affected by outliers. For this reason, IQR is
       usually preferred to Range.
       import numpy as np
       Q1,Q3=np.percentile(col,[25,75])          #col is the column of data, e.g. df['column']
       IQR=Q3-Q1
       IQR
       Correlation
       Correlation measures how strongly two variables are related to each other.
       df.corr()
       This gives the correlation table showing all variables against each other.
       Correlation Plot
       import seaborn as sns
       corr = df.corr()          #correlation table of the data frame ‘df’
       sns.heatmap(corr, annot=True)
       This presents a pictorial representation of the correlation data frame and is much easier to make sense of, for smaller data frames.
       For bigger data frames with a lot of variables, a correlation table would be preferable.
       Skewness
       Skewness shows the asymmetry in the data. It shows where most of the data points lie.
       import pandas as pd
       import scipy.stats as stats
       Skewness = pd.DataFrame({'Skewness' : [stats.skew(df.col1),stats.skew(df.col2),stats.skew(df.col3)]}, index=['col1','col2','col3'])
       This checks the skewness of col1, col2 and col3 of the data frame ‘df’. The values shown below are representative.
       Skewness
       col1        0.283729
       col2        0.055610
       col3        1.514180
      Visualisations
      Histogram:
      df.hist()                           # Histogram of all continuous variables of the data frame ‘df’
      df['column'].hist()      # Histogram of a particular column of the data frame ‘df’
      Pairplot:
      It is a powerful plot which is used to view the distributions of all variables of the data frame ‘df’ and the pairwise relationships between them.
      sns.pairplot(df)         # requires: import seaborn as sns
      Boxplot:
      df.boxplot(figsize=(15,4))
      Probability Distributions
      It is a statistical method that describes all the possible likelihoods for a random variable within a given range.
      Please note that loc=0 and scale =1 are default values in the codes below.
Normal Distribution
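The code for this subsection did not survive in this copy. As a stand-in, here is a sketch using statistics.NormalDist from the Python standard library, with mu=0 and sigma=1 playing the role of loc=0 and scale=1. This is a substitute chosen for illustration and is not the original scipy code; the variable names are assumptions.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)    # standard normal distribution
density_at_0 = std_normal.pdf(0)          # height of the curve at 0
prob_below_196 = std_normal.cdf(1.96)     # P(X <= 1.96)
z_95 = std_normal.inv_cdf(0.95)           # value with 95% of the area to its left
print(round(density_at_0, 4), round(prob_below_196, 4), round(z_95, 4))
```

pdf gives the density, cdf the cumulative probability, and inv_cdf the inverse of cdf (the percent point).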
      Binomial Distribution
      import scipy.stats as stats
      n,p = 10,0.22
      k=0
      stats.binom.pmf(k,n,p)
      0.083357758312362 #output
      n = number of trials
      p = probability of success in each trial
      k = number of successes (the value of the random variable)
For Cumulative:
      stats.binom.cdf(k,n,p)
      If k=0, then the outcome would be same as the above output.
       Poisson Distribution
       import scipy.stats as stats
       stats.poisson.pmf(k,mu)           #k = random variable, mu = average rate (lambda); 'lambda' is a reserved word in Python
       For Cumulative:
       stats.poisson.cdf(k,mu)
       Inferential Statistics
       The key idea in Inferential Statistics is inference: we draw a sample from the population and put the
       sample through various tests to make inferences about the population.
Hypothesis Testing
       The Null Hypothesis is often denoted by $H_0$ or $H_{Null}$ and the Alternate Hypothesis is denoted by $H_1$ or $H_A$.
       Example: $H_0$: population mean $\mu \le \mu_0$ ; $H_1$: population mean $\mu > \mu_0$, where $\mu_0$ denotes a hypothesised value.
       (Generally significance level is 5%, however it can be increased or decreased as per the situation at hand)
      Step 3: Identify the test to be undertaken and find the critical value
       Decide on which test is to be used, for example, t-test, z-test, etc.
       ** Please note that if Alpha is 5% (0.05), then for a one-tailed test the value of Alpha should be 0.05, but for a two-tailed test, it
       should be 0.025 in each tail.
       Golden Rule: If the p-value is low (that is, lower than Alpha), then the Null Hypothesis should be rejected (reject H0) in favour of
       the Alternative Hypothesis.
       If the p-value is more than Alpha, then we fail to reject the Null Hypothesis (the Alternate Hypothesis is not supported).
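The golden rule can be sketched with a right-tailed z-test. All numbers below are hypothetical, the population standard deviation is assumed known, and the z-statistic is taken to be the usual (sample mean - hypothesised mean) / (sigma / sqrt(n)); the standard library's NormalDist is used in place of scipy so that the sketch runs on its own.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical test: H0: mu <= 50, H1: mu > 50, population sigma known.
sample_mean, mu_0, sigma, n = 52.0, 50.0, 6.0, 36
alpha = 0.05

z = (sample_mean - mu_0) / (sigma / sqrt(n))   # z-statistic
p_value = 1 - NormalDist().cdf(z)              # right-tailed p-value
reject_h0 = p_value < alpha                    # golden rule: low p-value, reject H0
print(z, round(p_value, 4), reject_h0)
```

Here the p-value is below Alpha, so the Null Hypothesis is rejected at the 5% level.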
       z-Distribution
       import scipy.stats as stats
       cv = stats.norm.ppf(Alpha, 0, 1) #Alpha is the significance level, e.g. 0.05; loc=0, scale=1
      t-Distribution
      import scipy.stats as stats
      cv=stats.t.ppf(0.05, df) #df here is the degrees of freedom, n-1 (not the data frame)
      f- Distribution
      import scipy.stats as stats
      stats.f.ppf(0.95,dfn=4,dfd=26) #q=0.95
      2.7425941372218587
      Chi Square
      Sample Code:
For array, the following code can be used to calculate the z score
$$t_{statistic} = \frac{\bar{X} - \mu}{s / \sqrt{n}}$$
Equation 2: T-Statistic
Where $\bar{X}$ is the sample mean, $\mu$ is the population mean, $s$ is the sample standard deviation, and $n$ is the sample size.
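With these definitions, the usual one-sample t-statistic is (sample mean - population mean) / (s / sqrt(n)). The sketch below computes it in plain Python; the measurements and the hypothesised mean are made-up values used only for illustration.

```python
from math import sqrt
from statistics import mean, stdev

sample = [12.1, 11.8, 12.4, 12.0, 12.7]   # made-up measurements
mu = 12.0                                 # hypothesised population mean

x_bar = mean(sample)
s = stdev(sample)            # sample standard deviation (divide by n - 1)
n = len(sample)
t_statistic = (x_bar - mu) / (s / sqrt(n))
print(round(t_statistic, 3))
```

The statistic is then compared with the critical value of the t-distribution with n-1 degrees of freedom.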
$$t_{statistic} = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$
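The two-sample statistic divides the difference of the sample means by the square root of s1^2/n1 + s2^2/n2. A plain-Python sketch with two made-up samples follows; the names group_1 and group_2 are assumptions made for the example.

```python
from math import sqrt
from statistics import mean, variance

group_1 = [20.0, 22.0, 19.0, 24.0, 25.0]   # made-up samples
group_2 = [18.0, 17.0, 21.0, 16.0, 18.0]

x1, x2 = mean(group_1), mean(group_2)
v1, v2 = variance(group_1), variance(group_2)   # sample variances s1^2 and s2^2
n1, n2 = len(group_1), len(group_2)
t_two_sample = (x1 - x2) / sqrt(v1 / n1 + v2 / n2)
print(round(t_two_sample, 3))
```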
f-test