Department of Computer                                            Subject :
Engineering                                                       DSBDAL
                                              Group A
                                         Assignment No: 2
  Title of the Assignment: Data Wrangling, II
      Create an “Academic performance” dataset of students and perform the following
      operations using Python.
          1. Scan all variables for missing values and inconsistencies. If there are missing
              values and/or inconsistencies, use any of the suitable techniques to deal with
              them.
          2. Scan all numeric variables for outliers. If there are outliers, use any of the suitable
              techniques to deal with them.
          3. Apply data transformations on at least one of the variables. The purpose of this
              transformation should be one of the following reasons: to change the scale for
              better understanding of the variable, to convert a non-linear relation into a linear
              one, or to decrease the skewness and convert the distribution into a normal
              distribution.
              Reason and document your approach properly.
  Objective of the Assignment:         Students should be able to perform thedata wrangling
  operation using Python on any open source dataset
  Prerequisite:
   1. Basic of Python Programming
   2. Concept of Data Preprocessing, Data Formatting , Data Normalization and Data
      Cleaning.
Theory:
   1. Creation of Dataset using Microsoft Excel.
      The dataset is created in “CSV” format.
          ● The name of dataset is Students Performance
          ● The features of the dataset are: Math_Score, Reading_Score, Writing_Score,
              Placement_Score, Club_Join_Date .
Department of Computer                                            Subject :
Engineering                                                       DSBDAL
        ● Number of Instances: 10
        ● The response variable is: Placement_Offer_Count .
        ● Range of Values:
            Math_Score          [60-80],   Reading_Score[75-,95],      ,Writing_Score     [60,80],
            Placement_Score[75-100], Club_Join_Date [2018-2021].
2. Identification and Handling of Null Values
   In Pandas missing data is represented by two value:
        1. None: None is a Python singleton object that is often used for missing data in
            Python code.
        2. NaN : NaN (an acronym for Not a Number), is a special floating-point value
            recognized by all systems that use the standard IEEE                    floating-point
            representation.
            To facilitate the convention, there are several useful functions for detecting,
      removing, and replacing null values in Pandas DataFrame :
           ●        isnull()
           ●        notnull()
           ●        dropna()
           ●        fillna()
           ●        replace()
        1. Checking for missing values using isnull() and notnull()
                ●    Checking for missing values using isnull()
            df.isnull()
                ●    Checking for missing values using notnull()
                     df.notnull()
        2. Filling missing values using dropna(), fillna(), replace()
            In order to fill null values in a datasets, fillna(), replace() functions are used.
            These functions replace NaN values with some value of their own. All these
            functions help in filling null values in datasets of a DataFrame.
        ● Filling null values with a single value
                  ndf=df
                  ndf.fillna(0)
                     data['math score'] = data['math score'].fillna(data['math score'].mean())
Department of Computer                                              Subject :
Engineering                                                         DSBDAL
                    data[''math score''] = data[''math score''].fillna(data[''math
            score''].median())
                    data['math score''] = data[''math score''].fillna(data[''math score''].std())
            replacing missing values in forenoon column with minimum/maximum number
            of that column
                    data[''math score''] = data[''math score''].fillna(data[''math score''].min())
                    data[''math score''] = data[''math score''].fillna(data[''math score''].max())
        ● Filling a null values using replace() method
            Following line will replace Nan value in dataframe with value -99
            ndf.replace(to_replace = np.nan, value = -99)
        ● Deleting null values using dropna() method
          In order to drop null values from a dataframe, dropna() function is used. This
            function drops Rows/Columns of datasets with Null values in different ways.
            df.dropna()
            To Drop rows if all values in that row are missing
            df.dropna(how = 'all')
            To Drop columns with at least 1 null value.
            df.dropna(axis = 1)
3. Identification and Handling of Outliers
    3.1 Identification of Outliers
    One of the most important steps as part of data preprocessing is detecting and treating the
    outliers as they can negatively affect the statistical analysis and the training process of a
    machine learning algorithm resulting in lower accuracy.
    1. What are Outliers?
          We all have heard of the idiom ‘odd one out' which means something unusual in
    comparison to the others in a group.
            Similarly, an Outlier is an observation in a given dataset that lies far from the rest
    of the observations. That means an outlier is vastly larger or smaller than the remaining
    values in the set.
    2. Why do they occur?
Department of Computer                                            Subject :
Engineering                                                       DSBDAL
            An outlier may occur due to the variability in the data, or due to experimental
    error/human error.
            They may indicate an experimental error or heavy skewness in the data(heavy-
    tailed distribution).
    3. What do they affect?
          In statistics, we have three measures of central tendency namely Mean, Median,
    and Mode. They help us describe the data.
            Mean is the accurate measure to describe the data when we do not have any
    outliers present. Median is used if there is an outlier in the dataset. Mode is used if there
    is an outlier AND about ½ or more of the data is the same.
           ‘Mean’ is the only measure of central tendency that is affected by the outliers
    which in turn impacts Standard deviation.
           Example:
            Consider a small dataset, sample= [15, 101, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9]. By
    looking at it, one can quickly say ‘101’ is an outlier that is much larger than the other
    values.
                                 Fig. Computation with and without outlier
    From the above calculations, we can clearly say the Mean is more affected than the
    Median.
    4. Detecting Outliers
            If our dataset is small, we can detect the outlier by just looking at the dataset. But
    what if we have a huge dataset, how do we identify the outliers then? We need to use
    visualization and mathematical techniques.
    Below are some of the techniques of detecting outliers
        ●   Boxplots
        ●   Scatterplots
        ●   Z-score
        ●   Inter Quantile Range(IQR)
Department of Computer                                             Subject :
Engineering                                                        DSBDAL
                    col = ['math score', 'reading score' , 'writing
                    score','placement score']
                    df.boxplot(col)
            4.1 Detecting outliers using Scatterplot:
            It is used when you have paired numerical data, or when your dependent variable
    has multiple values for each reading independent variable, or when trying to determine
    the relationship between the two variables. In the process of utilizing the scatter plot, one
    can also use it for outlier detection.
            To plot the scatter plot one requires two variables that are somehow related to
    each other. So here Placement score and Placement count features are used.
            4.2 Detecting outliers using Z-Score:
                    Z-Score is also called a standard score. This value/score helps to
            understand how far is the data point from the mean. And after setting up a
            threshold value one can utilize z score values of data points to define the outliers.
                    Zscore = (data_point -mean) / std. deviation
            4.3 Detecting outliers using Inter Quantile Range(IQR):
                    IQR (Inter Quartile Range) Inter Quartile Range approach to finding the
            outliers is the most commonly used and most trusted approach used in the
            research field.
                    IQR = Quartile3 – Quartile1
                    To define the outlier base value is defined above and below datasets
            normal range namely Upper and Lower bounds, define the upper and the lower
            bound (1.5*IQR value is considered) :
                    upper = Q3 +1.5*IQR
                    lower = Q1 – 1.5*IQR
                    In the above formula as according to statistics, the 0.5 scale-up of IQR
            (new_IQR = IQR + 0.5*IQR) is taken.
Handling of Outliers:
            For removing the outlier, one must follow the same process of removing an entry
    from the dataset using its exact position in the dataset because in all the above methods of
    detecting the outliers end result is the list of all those data items that satisfy the outlier
    definition according to the method used.
      Department of Computer                                         Subject :
      Engineering                                                    DSBDAL
Below are some of the methods of treating the outliers
      ● Trimming/removing the outlier
      ● Quantile based flooring and capping
      ● Mean/Median imputation
      Data Transformation: Data transformation is the process of converting raw data into a
      format or structure that would be more suitable for model building and also data discovery in
      general. The process of data transformation can also be referred to as extract/transform/load
      (ETL). The data transformation involves steps that are.
           ●    Smoothing: It is a process that is used to remove noise from the dataset using some
               algorithms It allows for highlighting important features present in the dataset. It
               helps in predicting the patterns
           ● Aggregation: Data collection or aggregation is the method of storing and presenting
               data in a summary format. The data may be obtained from multiple data sources to
               integrate these data sources into a data analysis description. This is a crucial step
               since the accuracy of data analysis insights is highly dependent on the quantity and
               quality of the data used.
           ● Generalization: It converts low-level data attributes to high-level data attributes
               using concept hierarchy. For Example Age initially in Numerical form (22, 25) is
               converted into categorical value (young, old).
           ● Normalization: Data normalization involves converting all data variables into a
               given range. Some of the techniques that are used for accomplishing normalization
               are:
               ○ Min–max normalization: This transforms the original data linearly.
               ○ Z-score normalization: In z-score normalization (or zero-mean normalization)
                  the values of an attribute (A), are normalized based on the mean of A and its
                  standard deviation.
               ○ Normalization by decimal scaling: It normalizes the values of an attribute by
                  changing the position of their decimal points
   Department of Computer                                        Subject :
   Engineering                                                   DSBDAL
Conclusion: In this way we have explored the functions of the python library for Data
Identifying and handling the outliers. Data Transformations Techniques are explored with the
purpose of creating the new variable and reducing the skewness from datasets.
Viva Questions:
   1. Explain the methods to detect the outlier.
   2. Explain data transformation methods
   3. Write the algorithm to display the statistics of Null values present in the dataset.
   4. Write an algorithm to replace the outlier value with the mean of the variable.