UNIT-IV - II IT - Python Libraries for Data Wrangling
12/09/2024 22UIT303-Data science 1
UNIT-IV
PYTHON LIBRARIES FOR DATA WRANGLING
Syllabus-UNIT-IV
Basics of Numpy arrays –Aggregations –Computations on arrays –
Comparisons, Masks, Boolean Logic – Fancy Indexing – Structured
Arrays – Data manipulation with Pandas – Data Indexing and Selection
– Operating on Data – Missing Data – Hierarchical indexing.
Basics of Numpy arrays
Data Wrangling
• Data wrangling is the process of transforming data from its original "raw" form into a more digestible format and organizing data sets from various sources into a single coherent whole for further processing.
• Data wrangling is also called data munging.
• The primary purpose of data wrangling is to get data into a coherent shape. In other words, it makes raw data usable and provides a foundation for further processing.
Data wrangling covers the following processes:
1. Gathering data from various sources into one place
2. Piecing the data together according to the determined structure
3. Cleaning the data of noise and of erroneous or missing elements
Data wrangling is the process of cleaning, structuring and enriching raw data into a desired format for better decision making in less time.
• There are typically six iterative steps that make up the data wrangling process:
1. Discovering: Before you can dive in deeply, you must understand what is in your data, which will inform how you want to analyze it. How you wrangle customer data, for example, may be informed by where customers are located, what they bought, or what promotions they received.
2. Structuring: This means organizing the data, which is necessary because raw data comes in many different shapes and sizes. A single column may turn into several rows for easier analysis, or one column may become two. Data is rearranged to make computation and analysis easier.
3. Cleaning: What happens when errors and outliers skew your data? You clean the data. What happens when state data is entered inconsistently as AP, Andhra Pradesh, or Arunachal Pradesh? You clean the data. Null values are replaced and standard formatting is applied, ultimately increasing data quality.
4. Enriching: Here you take stock of your data and strategize about how other, additional data might augment it. Questions asked during this step might be: what new types of data can I derive from what I already have, or what other information would better inform my decision making about this data?
5. Validating: Validation rules are repetitive programming sequences that verify data consistency, quality, and security. Examples of validation include ensuring the uniform distribution of attributes that should be distributed normally (e.g. birth dates) or confirming the accuracy of fields through cross-checks across the data.
6. Publishing: Analysts prepare the wrangled data for use downstream, whether by a particular user or software, and document any steps taken or logic used to wrangle the data. Experienced data wranglers understand that acting on insights depends on how easily the data can be accessed and used by others.
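The structuring and cleaning steps above can be sketched with pandas. The column names, state labels, and imputation choices below are illustrative assumptions, not a prescribed procedure:

```python
import numpy as np
import pandas as pd

# Illustrative raw data: inconsistent state labels and missing values
df = pd.DataFrame({
    "state": ["AP", "Andhra Pradesh", "Arunachal Pradesh", "AP", None],
    "price": [100.0, 250.0, np.nan, 175.0, 300.0],
})

# Cleaning: map the abbreviation to one standard form
df["state"] = df["state"].replace({"AP": "Andhra Pradesh"})

# Cleaning: replace null values - label the missing state,
# impute the missing price with the column mean
df["state"] = df["state"].fillna("Unknown")
df["price"] = df["price"].fillna(df["price"].mean())

print(df)
```

After these steps every state value uses one spelling and no column contains nulls, which is exactly the "increasing data quality" outcome the cleaning step describes.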
Introduction to Python
Introduction to Python
• Python is a high-level scripting language which can be used for a wide variety of text processing, system administration and internet-related tasks.
• Python is a true object-oriented language, and is available on a wide variety of platforms.
• A module is a file containing Python definitions and statements. The file name is the module name with the suffix .py appended.
• Python supports two basic modes: normal mode and interactive mode.
• Normal mode: the mode in which scripted and finished .py files are run by the Python interpreter. This mode is also called script mode.
• Interactive mode: a command-line shell which gives immediate feedback for each statement, while keeping previously entered statements in active memory.
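A minimal module illustrates both points above. The file name `greetings.py` is a hypothetical example; saved under that name, it becomes the module `greetings` and can be run in script (normal) mode or imported:

```python
# greetings.py - a module is just a file of Python definitions and statements.
# Saved as greetings.py, it is importable as the module "greetings".

def greet(name):
    """Return a simple greeting string."""
    return "Hello, " + name + "!"

# In script (normal) mode the interpreter runs this file top to bottom,
# and this guard block executes; on import, it is skipped.
if __name__ == "__main__":
    print(greet("world"))
```

In interactive mode the same function could be typed at the `>>>` prompt and called immediately, with each result echoed back.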
Features of Python Programming
1. Python is a high-level, interpreted, interactive and object-oriented scripting language.
2. It is simple and easy to learn.
3. It is portable.
4. Python is a free and open-source programming language.
5. Python can perform complex tasks using a few lines of code.
6. Python runs equally well on different platforms such as Windows, Linux, UNIX, and Macintosh.
7. It provides a vast range of libraries for various fields such as machine learning, web development, and scripting.
Advantages and Disadvantages of Python
Advantages of Python
• Ease of programming
• Minimizes the time to develop and maintain code
• Modular and object-oriented
• Large community of users
• A large standard library and many user-contributed libraries
Disadvantages of Python
• Interpreted, and therefore slower than compiled languages
• Decentralized package ecosystem
Aggregations
Aggregations
• An aggregation function is one which takes multiple individual values and returns a summary.
• In the majority of cases, this summary is a single value.
• The most common aggregation functions are a simple average or a summation of values.
Example
>>> import numpy as np
>>> arr1 = np.array([10, 20, 30, 40, 50])
>>> arr1
array([10, 20, 30, 40, 50])
>>> arr2 = np.array([[0, 10, 20], [30, 40, 50], [60, 70, 80]])
>>> arr2
array([[ 0, 10, 20],
       [30, 40, 50],
       [60, 70, 80]])
>>> arr3 = np.array([[14, 6, 9, -12, 19, 72], [-9, 8, 22, 0, 99, -11]])
>>> arr3
array([[ 14,   6,   9, -12,  19,  72],
       [ -9,   8,  22,   0,  99, -11]])
The NumPy sum() method calculates the sum of all the values in an array:
>>> arr1.sum()
150
>>> arr2.sum()
360
>>> arr3.sum()
217
• Python has built-in min and max functions used to find the minimum and maximum value of any given sequence; NumPy arrays provide equivalent min() and max() methods.
• Python's min() and max() are built-in functions which return the smallest number and the largest number of a list, respectively.
• min() can also be used to find the smaller of two variables or lists in a comparison, while max() is used to find the larger one.
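The behaviours described above can be shown side by side. The values here are arbitrary sample numbers:

```python
import numpy as np

nums = [14, 6, 9, -12, 19, 72]

# Built-in min()/max() work on any iterable, such as a list
smallest = min(nums)   # -12
largest = max(nums)    # 72

# They also compare two (or more) values directly
smaller = min(10, 25)  # 10
bigger = max(10, 25)   # 25

# NumPy arrays provide their own vectorized equivalents
arr = np.array(nums)
print(arr.min(), arr.max())  # -12 72
```

Note the distinction: `min(nums)` is the Python built-in applied to a list, while `arr.min()` is a NumPy method that runs in compiled code and is much faster on large arrays.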
Computations on Arrays
Computations on Arrays
• Computation on NumPy arrays can be very fast, or it can be very slow.
• Using vectorized operations, fast computation is possible; it is implemented through NumPy's universal functions (ufuncs).
• A universal function (ufunc) is a function that operates on ndarrays in an element-by-element fashion, supporting array broadcasting, type casting, and several other standard features.
• A ufunc is a "vectorized" wrapper for a function that takes a fixed number of specific inputs and produces a fixed number of specific outputs.
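A short sketch of the element-by-element behaviour described above, contrasting a plain Python loop with the equivalent ufunc calls (the sample array is arbitrary):

```python
import numpy as np

arr = np.arange(1, 6)           # array([1, 2, 3, 4, 5])

# A Python loop computes element-wise squares one value at a time
squares_loop = [x ** 2 for x in arr]

# The equivalent ufunc call is vectorized: one expression, evaluated
# in compiled code across the whole array (np.square is a ufunc)
squares_ufunc = np.square(arr)  # same as arr ** 2

# Ufuncs also support broadcasting: the scalar 100 is stretched
# across every element of the array
shifted = np.add(arr, 100)      # same as arr + 100

print(squares_ufunc)            # [ 1  4  9 16 25]
print(shifted)                  # [101 102 103 104 105]
```

Both forms produce the same numbers; the difference is that the ufunc version pushes the loop into compiled code, which is what makes NumPy computations fast.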