LO2: Data Scraping and Data Wrangling
Python-Week 6
Data Wrangling
            • Data wrangling, also called data cleaning
              or data remediation, refers to a variety of
              processes designed to transform raw data
              into more readily used formats. The exact
              methods differ from project to project
              depending on the data you’re leveraging
              and the goal you’re trying to achieve.
            • The most commonly used data wrangling
              tasks include:
                • Merging multiple data sources into a
                  single dataset for analysis
                • Identifying gaps in data (for example,
                  empty cells in a spreadsheet) and either
                  filling or deleting them
                • Deleting data that’s either unnecessary
                  or irrelevant to the project you’re
                  working on
                • Identifying extreme outliers in data and
                  either explaining the discrepancies or
                  removing them so that analysis can
                  take place
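The four tasks above can be sketched in pandas roughly as follows. This is a minimal illustration, not a prescribed workflow: the two tables, their column names, the median fill, and the 1.5 × IQR outlier rule are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical data: two small sources that share a "player" key.
scores = pd.DataFrame({
    "player": ["A", "B", "C", "D", "E", "F"],
    "points": [12.0, 15.0, None, 14.0, 13.0, 250.0],
    "notes": ["", "", "", "", "", ""],
})
teams = pd.DataFrame({
    "player": ["A", "B", "C", "D", "E", "F"],
    "team": ["red", "red", "blue", "blue", "red", "blue"],
})

# 1. Merge multiple sources into a single dataset.
df = scores.merge(teams, on="player", how="left")

# 2. Identify gaps, then fill them (dropping the rows is the alternative).
n_missing = int(df["points"].isnull().sum())
df["points"] = df["points"].fillna(df["points"].median())

# 3. Delete data that is irrelevant to the project.
df = df.drop(columns=["notes"])

# 4. Flag extreme outliers with the 1.5 * IQR rule (one common convention).
q1, q3 = df["points"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["points"] < q1 - 1.5 * iqr) | (df["points"] > q3 + 1.5 * iqr)
outliers = df[is_outlier]     # inspect and explain these before removing
df_clean = df[~is_outlier]
```

Keeping `outliers` as its own frame makes it easier to explain a discrepancy before deciding to drop it.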
Data Wrangling Steps
• Each data project requires a unique
  approach to ensure its final dataset is
  reliable and accessible. That being
  said, several processes typically
  inform the approach. These are
  commonly referred to as data
  wrangling steps or activities:
                              Data Wrangling: What It Is & Why It’s Important (hbs.edu)
                 1.    Discovery: the process of
                       familiarizing yourself with the data so you can
                       conceptualize how you might use it. During
                       discovery, you may identify trends or
                       patterns in the data, along with obvious
                       issues, such as missing or incomplete
                       values that need to be addressed. This is an
                       important step, as it will inform every
                       activity that comes afterward.
                      • Useful pandas calls: df.head(), df.tail(),
                        df.columns, df.info(), df.shape, df.isnull()
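As a rough sketch of how the discovery calls listed above fit together, the snippet below runs them on a tiny made-up table. The column names and values are hypothetical and only serve to make the calls runnable.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with a couple of missing readings.
df = pd.DataFrame({
    "country": ["NZ", "AU", "NZ", "FJ"],
    "year": [2019, 2019, 2020, 2020],
    "co2": [7.3, np.nan, 6.9, np.nan],
})

print(df.head(2))    # first rows
print(df.tail(2))    # last rows
print(df.columns)    # column labels
df.info()            # dtypes and non-null counts (prints directly)

n_rows, n_cols = df.shape                 # size of the table
missing_per_column = df.isnull().sum()    # count of gaps in each column
```

Summing `df.isnull()` per column is usually more readable than the raw True/False table it returns.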
                 2.   Structuring: Raw data is typically unusable
                      as collected because it’s either
                      incomplete or misformatted for its intended
                      application. Data structuring is the process
                      of taking raw data and transforming it to be
                      more readily leveraged. The form your data
                      takes will depend on the analytical model
                      you use to interpret it.
                      • Quantile-based binning (numeric to
                        categorical):
                        pd.qcut(df['points'], q=[0, 0.16, 0.84, 0.9, 1])
                      • Encoding (categorical to numeric):
                        OneHotEncoder (scikit-learn)
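The two structuring operations above might look like this in practice. The `points` values and the tier labels are assumptions for illustration. To keep the sketch pandas-only, it uses `pd.get_dummies` for the one-hot step in place of scikit-learn's `OneHotEncoder`; both produce one indicator column per category.

```python
import pandas as pd

# Hypothetical scores 1..100.
df = pd.DataFrame({"points": list(range(1, 101))})

# Quantile-based binning with the edges from the text; labels are made up.
df["tier"] = pd.qcut(
    df["points"],
    q=[0, 0.16, 0.84, 0.9, 1],
    labels=["low", "mid", "high", "top"],
)
tier_counts = df["tier"].value_counts()

# One-hot encoding: one indicator column per category.
encoded = pd.get_dummies(df["tier"], prefix="tier")
```

Because the bins are defined by quantiles, each tier holds a fixed share of the rows (16%, 68%, 6%, 10%) regardless of the scale of `points`.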
                 3.   Cleaning: Data cleaning is the process of removing
                      inherent errors in data that might distort
                      your analysis or render it less valuable.
                      Cleaning can come in different forms,
                      including deleting empty cells or rows,
                      removing outliers, and standardizing
                      inputs. The goal of data cleaning is to
                      ensure there are no errors (or as few as
                      possible) that could influence your final
                      analysis. Identifying and removing bad
                      data at this stage greatly affects every
                      later wrangling step.
                          • df.drop_duplicates(inplace=True)
                          • df.dropna(inplace=True)
                          • df2['co2'] = df2['co2'].fillna(ave_co2)
                          • df2['co2'] = df2['co2'].interpolate()
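A small sketch comparing the cleaning calls above on the same column. The `df2` values are invented, and `ave_co2` is assumed here to be the mean of the observed `co2` values. Results are assigned back rather than using `inplace=True` on a selected column, since that pattern may not update the original frame in recent pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical readings with one duplicate row and one gap.
df2 = pd.DataFrame({
    "year": [2000, 2001, 2002, 2003, 2003],
    "co2": [10.0, np.nan, 14.0, 16.0, 16.0],
})

# Remove exact duplicate rows.
df2 = df2.drop_duplicates().reset_index(drop=True)

# Option A: fill the gap with the mean of the observed values.
ave_co2 = df2["co2"].mean()
filled_mean = df2["co2"].fillna(ave_co2)

# Option B: fill the gap by linear interpolation between neighbours.
filled_interp = df2["co2"].interpolate()

# Option C: drop any row that still has a missing value.
dropped = df2.dropna()
```

For ordered data such as a yearly series, interpolation uses the neighbouring values (here 10 and 14), while the mean fill ignores the ordering entirely.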
                 4.   Enriching: Once you understand your
                      existing data and have transformed it
                      into a more usable state, you must
                      determine whether you have all of the
                      data necessary for the project at hand.
                      If not, you may choose to enrich or
                      augment your data by incorporating
                      values from other datasets. For this
                      reason, it’s important to understand
                      what other data is available for use.
                      If you decide that enrichment is
                      necessary, you need to repeat the
                      steps above for any new data.
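One way enrichment can look in pandas is a left merge against a second dataset, with `indicator=True` to show which rows found no match. Both tables and their values below are hypothetical.

```python
import pandas as pd

# Existing data (hypothetical values).
emissions = pd.DataFrame({
    "country": ["NZ", "AU", "FJ"],
    "co2": [7.3, 15.4, 1.2],
})
# A second dataset used to enrich it (also hypothetical).
population = pd.DataFrame({
    "country": ["NZ", "AU"],
    "population_m": [5.1, 25.7],
})

# Left merge keeps every original row; "_merge" records the match status.
enriched = emissions.merge(population, on="country", how="left", indicator=True)
unmatched = enriched.loc[enriched["_merge"] == "left_only", "country"].tolist()
```

The unmatched rows now carry missing values, which is why the earlier discovery and cleaning steps have to be repeated on the enriched table.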
                 5.   Validating: Data validation refers to the
                      process of verifying that your data is both
                      consistent and of a high enough quality.
                      During validation, you may discover issues
                      you need to resolve or conclude that your
                      data is ready to be analyzed. Validation is
                      typically achieved through various
                      automated processes and requires
                      programming. Consistent means the data is
                      represented in a standard way throughout
                      the dataset.
                 For the data to be of high quality, it must be:
                 • Complete: The dataset contains all required
                   values and fields; nothing important is
                   missing.
                 • Unique: The data contains no duplicates or
                   redundant records.
                 • Valid: Data conforms to the syntax and
                   structure defined by the business
                   requirements.
                 • Timely: Data is sufficiently up to date for its
                   intended use.
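The four quality criteria above can be turned into automated checks. The sketch below is only one possible reading of them: the table, the "age must be between 5 and 10" rule, the reference date, and the 90-day freshness window are all assumptions standing in for real business requirements.

```python
import pandas as pd

# Hypothetical records: one duplicated row, one out-of-range age, one stale date.
df = pd.DataFrame({
    "id": [1, 2, 2, 4],
    "age": [7, 9, 9, 12],
    "updated": pd.to_datetime(
        ["2024-01-05", "2024-01-06", "2024-01-06", "2023-01-01"]
    ),
})
as_of = pd.Timestamp("2024-01-31")  # assumed reference date

checks = {
    # Complete: no missing values anywhere.
    "complete": bool(df.notnull().all().all()),
    # Unique: no fully duplicated rows.
    "unique": not df.duplicated().any(),
    # Valid: assumed business rule, age must lie in [5, 10].
    "valid": bool(df["age"].between(5, 10).all()),
    # Timely: assumed rule, every record updated within the last 90 days.
    "timely": bool((as_of - df["updated"] <= pd.Timedelta(days=90)).all()),
}
failed = [name for name, ok in checks.items() if not ok]
```

A non-empty `failed` list means the data goes back to cleaning rather than on to publishing.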
                 6.   Publishing: Once your data has been
                      validated, you can publish it. This
                      involves making it available to others
                      within your organization for analysis. The
                      format you use to share the information,
                      such as a written report or electronic
                      file, will depend on your data and the
                      organization’s goals.
Data Cleaning
            Data Wrangling vs. Data Cleaning
• Although the terms are often used
  interchangeably, data wrangling and
  data cleaning are two different
  processes. Data cleaning is one critical
  step within data wrangling: it removes
  inaccurate and inconsistent data.
  Data wrangling is the overall process
  of transforming raw data into a more
  usable form.
Is low or high variability better?
              • Whether low or high variability is preferable
                depends on the objectives and context of the
                analysis. Data analytics has to balance
                precision against covering the full range of
                variability in the data.
              • With low variability, a smaller dataset is
                needed to reach a given level of precision
                than with high variability. However, if the
                aim is to cover a broad range of scenarios,
                high variability is necessary, at the cost of
                needing a larger dataset.
              • In general, consider the specific task and
                situation to decide which level of variability
                is best suited.
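To make the link between variability and dataset size concrete, here is a small sketch using the standard sample-size formula for estimating a mean, n = (z·σ / E)². It assumes a known standard deviation σ, a normal approximation, and a 95% confidence level (z = 1.96). The function name and the numbers are illustrative only.

```python
import math

def required_sample_size(sigma, margin, z=1.96):
    """Smallest n so that z * sigma / sqrt(n) <= margin (normal approximation)."""
    return math.ceil((z * sigma / margin) ** 2)

# Same target precision (margin of error 1), different variability.
n_low = required_sample_size(sigma=5, margin=1)     # low variability
n_high = required_sample_size(sigma=20, margin=1)   # high variability
```

Under these assumptions, quadrupling the standard deviation raises the required sample size roughly sixteen-fold, from 97 to 1537 observations.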
Imputation
             Mean imputation
             • Replace each missing value with the mean
               of the observed (non-missing) values for
               that variable.
             • It has the advantage of keeping the
               same mean and the same sample size,
               but many, many disadvantages. Pretty
               much every method listed below is
               better than mean imputation.
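A minimal sketch of mean imputation on a made-up `age` variable. It also shows one well-known disadvantage: the mean is preserved, but the spread of the variable shrinks because every imputed value sits exactly at the centre.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing values.
age = pd.Series([5.0, 6.0, np.nan, 9.0, np.nan, 10.0])

# Replace each gap with the mean of the observed values.
imputed = age.fillna(age.mean())

mean_before, mean_after = age.mean(), imputed.mean()   # unchanged
std_before, std_after = age.std(), imputed.std()       # shrinks after imputation
```

The understated standard deviation is what later leads to standard errors that are too small.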
             Substitution
             • Impute the value from a new individual who was
               not selected to be in the sample. In other words,
               go find a new subject and use their value instead.
             Hot deck imputation
             • A randomly chosen value from an individual in the
               sample who has similar values on other variables.
               In other words, find all the sample subjects who
               are similar on other variables, then randomly
               choose one of their values on the missing
               variable.
             • One advantage is you are constrained to only
               possible values. In other words, if Age in your
               study is restricted to being between 5 and 10, you
               will always get a value between 5 and 10 this way.
             • Another advantage is the random component,
               which adds in some variability. This is important
               for accurate standard errors.
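Hot deck imputation could be sketched as below, treating rows with the same `group` value as "similar". The data, the `group` column, and the helper name `hot_deck` are hypothetical, and a real study would usually match on several variables rather than one.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the sketch is repeatable

# Hypothetical sample: "group" defines which subjects count as similar.
df = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "age": [5.0, 7.0, np.nan, 9.0, np.nan, 10.0],
})

def hot_deck(values, rng):
    """Fill each gap with a randomly chosen observed value from the same group."""
    donors = values.dropna().to_numpy()
    out = values.copy()
    missing = out.isnull()
    if len(donors) > 0 and missing.any():
        out[missing] = rng.choice(donors, size=int(missing.sum()))
    return out

df["age_imputed"] = df.groupby("group")["age"].transform(lambda s: hot_deck(s, rng))
```

Every imputed value is copied from a real donor, so it always stays within the range of values actually observed.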
             Cold deck imputation
             • A systematically chosen value from an
               individual who has similar values on other
               variables.
             • This is similar to Hot Deck in most ways,
               but removes the random variation. So for
               example, you may always choose the
               third individual in the same experimental
               condition and block.
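A cold deck variant of the same idea replaces the random draw with a fixed rule. In this sketch the rule is "use the first observed value in the same block", which is an assumption chosen for simplicity; the text's example of always taking the third individual would work the same way. The data and column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: "block" defines which subjects count as similar.
df = pd.DataFrame({
    "block": ["a", "a", "a", "b", "b", "b"],
    "age": [5.0, 7.0, np.nan, 9.0, np.nan, 10.0],
})

# Systematic donor: the first non-missing age within each block.
donor = df.groupby("block")["age"].transform("first")
df["age_imputed"] = df["age"].fillna(donor)
```

Running this twice always gives the same result, which is exactly the random variation that hot deck keeps and cold deck removes.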
             Regression imputation
             • The predicted value obtained by
               regressing the missing variable on other
               variables. So instead of just taking the
               mean, you’re taking the predicted value,
               based on other variables. This preserves
               relationships among variables involved in
               the imputation model.
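Regression imputation can be sketched with a simple one-predictor linear fit. The `height` and `weight` columns and their values are invented, and `np.polyfit` is used here only as a convenient way to fit a straight line; any regression model could play the same role.

```python
import numpy as np
import pandas as pd

# Hypothetical data: weight is missing for one subject.
df = pd.DataFrame({
    "height": [100.0, 110.0, 120.0, 130.0, 140.0],
    "weight": [20.0, 25.0, np.nan, 35.0, 40.0],
})

# Fit weight ~ height on the rows where weight is observed.
observed = df["weight"].notnull()
slope, intercept = np.polyfit(
    df.loc[observed, "height"], df.loc[observed, "weight"], deg=1
)

# Use the predicted value wherever weight is missing.
predicted = intercept + slope * df["height"]
df["weight_imputed"] = df["weight"].fillna(predicted)
```

Unlike the mean, the imputed value follows the height of that particular subject, so the relationship between the two variables is kept.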
Regression
Imputation