UNIT-IV - II IT - Python Libraries for Data Wrangling
12/09/2024 22UIT303-Data science 1
UNIT-IV
PYTHON LIBRARIES FOR DATA WRANGLING
Syllabus-UNIT-IV
Basics of Numpy arrays –Aggregations –Computations on arrays –
Comparisons, Masks, Boolean Logic – Fancy Indexing – Structured
Arrays – Data manipulation with Pandas – Data Indexing and Selection
– Operating on Data – Missing Data – Hierarchical indexing.
Basics of Numpy arrays
Data Wrangling
• Data wrangling is the process of transforming data from its original "raw" form into a more digestible format and organizing data sets from various sources into a single coherent whole for further processing.
• Data wrangling is also called data munging.
• The primary purpose of data wrangling is to get data into a coherent shape. In other words, it makes raw data usable and provides a foundation for further processing.
Data wrangling covers the following processes:
1. Gathering data from various sources into one place
2. Piecing the data together according to the determined structure
3. Cleaning the data of noise and of erroneous or missing elements
Data wrangling is the process of cleaning, structuring and enriching raw data into a desired format for better decision making in less time.
• There are typically six iterative steps that make up the data wrangling process:
1. Discovering: Before you can dive in deeply, you must understand what is in your data, which will inform how you want to analyze it. How you wrangle customer data, for example, may be informed by where customers are located, what they bought, or what promotions they received.
2. Structuring: This means organizing the data, which is necessary because raw data comes in many different shapes and sizes. A single column may turn into several rows for easier analysis, or one column may become two. Data is rearranged to make computation and analysis easier.
3. Cleaning: What happens when errors and outliers skew your data? You clean the data. What happens when state data is entered inconsistently as AP, Andhra Pradesh, or Arunachal Pradesh? You clean the data. Null values are replaced and standard formatting is applied, ultimately increasing data quality.
4. Enriching: Here you take stock of your data and strategize about how other, additional data might augment it. Questions asked during this step might be: what new types of data can I derive from what I already have, or what other information would better inform my decision making about this data?
5. Validating: Validation rules are repetitive programming sequences that verify data consistency, quality, and security. Examples of validation include ensuring the uniform distribution of attributes that should be distributed normally (e.g. birth dates) or confirming the accuracy of fields through cross-checks across the data.
6. Publishing: Analysts prepare the wrangled data for use downstream, whether by a particular user or software, and document any steps taken or logic used to wrangle the data. Experienced data wranglers understand that acting on insights depends on how easily the data can be accessed and used by others.
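The structuring and cleaning steps above can be sketched with pandas. The column names, state labels, and imputation choices below are illustrative assumptions, not a prescribed procedure:

```python
import numpy as np
import pandas as pd

# Illustrative raw data: inconsistent state labels and missing values
df = pd.DataFrame({
    "state": ["AP", "Andhra Pradesh", "Arunachal Pradesh", "AP", None],
    "price": [100.0, 250.0, np.nan, 175.0, 300.0],
})

# Cleaning: map the abbreviation to one standard form
df["state"] = df["state"].replace({"AP": "Andhra Pradesh"})

# Cleaning: replace null values - label the missing state,
# impute the missing price with the column mean
df["state"] = df["state"].fillna("Unknown")
df["price"] = df["price"].fillna(df["price"].mean())

print(df)
```

After these steps every state value uses one spelling and no column contains nulls, which is exactly the "increasing data quality" outcome the cleaning step describes.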
Introduction to Python
Introduction to Python
• Python is a high-level scripting language which can be used for a wide variety of text processing, system administration and internet-related tasks.
• Python is a true object-oriented language, and is available on a wide variety of platforms.
• A module is a file containing Python definitions and statements. The file name is the module name with the suffix .py appended.
• Python supports two basic modes: normal mode and interactive mode.
• Normal mode: the mode in which scripted and finished .py files are run by the Python interpreter. This mode is also called script mode.
• Interactive mode: a command-line shell which gives immediate feedback for each statement, while keeping previously entered statements in active memory.
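A minimal module illustrates both points above. The file name `greetings.py` is a hypothetical example; saved under that name, it becomes the module `greetings` and can be run in script (normal) mode or imported:

```python
# greetings.py - a module is just a file of Python definitions and statements.
# Saved as greetings.py, it is importable as the module "greetings".

def greet(name):
    """Return a simple greeting string."""
    return "Hello, " + name + "!"

# In script (normal) mode the interpreter runs this file top to bottom,
# and this guard block executes; on import, it is skipped.
if __name__ == "__main__":
    print(greet("world"))
```

In interactive mode the same function could be typed at the `>>>` prompt and called immediately, with each result echoed back.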
Features of Python Programming
1. Python is a high-level, interpreted, interactive and object-oriented scripting language.
2. It is simple and easy to learn.
3. It is portable.
4. Python is a free and open-source programming language.
5. Python can perform complex tasks using a few lines of code.
6. Python runs equally well on different platforms such as Windows, Linux, UNIX, and Macintosh.
7. It provides a vast range of libraries for various fields such as machine learning, web development, and scripting.
Advantages and Disadvantages of Python
Advantages of Python
• Ease of programming
• Minimizes the time to develop and maintain code
• Modular and object-oriented
• Large community of users
• A large standard library and many user-contributed libraries
Disadvantages of Python
• Interpreted, and therefore slower than compiled languages
• Decentralized package ecosystem
Aggregations
Aggregations
• An aggregation function is one which takes multiple individual values and returns a summary.
• In the majority of cases, this summary is a single value.
• The most common aggregation functions are a simple average or a summation of values.
Example
>>> import numpy as np
>>> arr1 = np.array([10, 20, 30, 40, 50])
>>> arr1
array([10, 20, 30, 40, 50])
>>> arr2 = np.array([[0, 10, 20], [30, 40, 50], [60, 70, 80]])
>>> arr2
array([[ 0, 10, 20],
       [30, 40, 50],
       [60, 70, 80]])
>>> arr3 = np.array([[14, 6, 9, -12, 19, 72], [-9, 8, 22, 0, 99, -11]])
>>> arr3
array([[ 14,   6,   9, -12,  19,  72],
       [ -9,   8,  22,   0,  99, -11]])
The NumPy sum() method calculates the sum of all the values in an array:
>>> arr1.sum()
150
>>> arr2.sum()
360
>>> arr3.sum()
217
• Python has built-in min and max functions used to find the minimum and maximum value of any given sequence; NumPy arrays provide equivalent min() and max() methods.
• Python's min() and max() are built-in functions which return the smallest number and the largest number of a list, respectively.
• min() can also be used to find the smaller of two variables or lists in a comparison, while max() is used to find the larger one.
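The behaviours described above can be shown side by side. The values here are arbitrary sample numbers:

```python
import numpy as np

nums = [14, 6, 9, -12, 19, 72]

# Built-in min()/max() work on any iterable, such as a list
smallest = min(nums)   # -12
largest = max(nums)    # 72

# They also compare two (or more) values directly
smaller = min(10, 25)  # 10
bigger = max(10, 25)   # 25

# NumPy arrays provide their own vectorized equivalents
arr = np.array(nums)
print(arr.min(), arr.max())  # -12 72
```

Note the distinction: `min(nums)` is the Python built-in applied to a list, while `arr.min()` is a NumPy method that runs in compiled code and is much faster on large arrays.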
Computations on Arrays
Computations on Arrays
• Computation on NumPy arrays can be very fast, or it can be very slow.
• Using vectorized operations, fast computation is possible; it is implemented through NumPy's universal functions (ufuncs).
• A universal function (ufunc) is a function that operates on ndarrays in an element-by-element fashion, supporting array broadcasting, type casting, and several other standard features.
• A ufunc is a "vectorized" wrapper for a function that takes a fixed number of specific inputs and produces a fixed number of specific outputs.
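A short sketch of the element-by-element behaviour described above, contrasting a plain Python loop with the equivalent ufunc calls (the sample array is arbitrary):

```python
import numpy as np

arr = np.arange(1, 6)           # array([1, 2, 3, 4, 5])

# A Python loop computes element-wise squares one value at a time
squares_loop = [x ** 2 for x in arr]

# The equivalent ufunc call is vectorized: one expression, evaluated
# in compiled code across the whole array (np.square is a ufunc)
squares_ufunc = np.square(arr)  # same as arr ** 2

# Ufuncs also support broadcasting: the scalar 100 is stretched
# across every element of the array
shifted = np.add(arr, 100)      # same as arr + 100

print(squares_ufunc)            # [ 1  4  9 16 25]
print(shifted)                  # [101 102 103 104 105]
```

Both forms produce the same numbers; the difference is that the ufunc version pushes the loop into compiled code, which is what makes NumPy computations fast.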