Feature Engineering - Introduction

Introduction to feature engineering

Uploaded by

jintugial

Feature Engineering

(Professional Elective 1 - Artificial Intelligence & Machine Learning Specialization)
Artificial Intelligence - AI
• Intelligence: The capacity to learn and solve problems.
• Artificial Intelligence (AI): Simulation of human intelligence by
machines.
• Computers with the ability to mimic or duplicate the functions of
the human brain
AI
• Artificial Intelligence is the mechanism of incorporating human intelligence into machines
through a set of rules (algorithms)
• AI is the study of training machines (computers) to mimic the human brain and its
thinking capabilities.
Machine Learning
• A field of study in which a system (computer) learns automatically from
experience and improves accordingly, without being explicitly programmed
• ML is an application or subset of AI
Eg:
• Recommendation system of Netflix
• "People you may know" suggestions on social media platforms
Deep Learning
• Subset of machine learning that uses multilayered neural
networks, called deep neural networks, to simulate the complex
decision-making power of the human brain.
• Eg:
• Detecting suspicious attempts to log into your accounts and notifying you
• Natural language processing - chatbots, language translators
Data Science
• Data science is the study of data that helps us
derive useful insight for business decision
making.
• Combines math, computer science, and
domain expertise to tackle real-world
challenges in a variety of fields.
Feature Engineering
Definition
• The process of creating new features or
transforming existing features to improve the
performance of a machine-learning model
• The process of transforming raw data into a
more effective set of inputs for machine learning
models
Feature Engineering
Data
• Raw facts and figures
• In the world of machine learning, data is the
foundation upon which models are built
• Its potential is hidden in the relationships
between features, the patterns within, and
the transformations that can unlock critical
insights; uncovering these is the job of feature engineering
Contd…
• Feature engineering involves creating new input variables from the
existing data to improve the performance of machine learning
algorithms
• Consider a housing dataset with features like size, number of rooms,
and price. Feature engineering could involve creating a new feature
such as price_per_sqft, which may have a stronger correlation with
the target variable than price alone.
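As a minimal sketch of the housing example, the snippet below derives price_per_sqft with pandas. The column names (size_sqft, rooms, price) and the three rows are made-up illustration values, not data from the course.

```python
import pandas as pd

# Hypothetical housing data; column names and values are illustrative only.
houses = pd.DataFrame({
    "size_sqft": [1000, 1500, 2000],
    "rooms": [2, 3, 4],
    "price": [200000, 270000, 330000],
})

# Feature engineering step: derive a new column from two existing ones.
houses["price_per_sqft"] = houses["price"] / houses["size_sqft"]

print(houses)
```

Note that if price is itself the value being predicted, a feature computed from it would leak the target into the inputs, so a column like this only makes sense when the model predicts something else.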
Feature
• Also known as variable or attribute
• Individual measurable property or characteristic of a data point that is
used as input for a machine learning algorithm
• Eg:
• In a dataset of housing prices, features could include the number of bedrooms,
the square footage, the location, and the age of the property
Contd..
• The choice and quality of features are critical in machine learning, as
they can greatly impact the accuracy and performance of the model.
Sources of Data
• Primary Data Source
• Surveys and Questionnaires
• Focus Groups
• Interviews
• Observations
• Experiments
• Secondary Data Source
• Government records
• Research Publications
• Industry Reports
• Academic Databases
• Transactional Data
• Point-of-Sale (POS) Systems
• Banking and Finance Records
• Web and Social Media Data
• Website Analytics
• Social Media Platforms
• Web Scraping
• Sensors and IoT Devices
• Smart Devices
• Things in IoT
• Open Data Sources
• Government Open Data
• NGOs and Non-Profit Organizations
• Business and Organizational Data
• Customer Relationship Management (CRM) Systems
• Enterprise Resource Planning (ERP) Systems
• Big Data Sources
• Streaming Data
• Clickstream Data
Unit 3
Exploratory Data Analysis (EDA) with NumPy, Pandas, Matplotlib and
Seaborn
Exploratory Data Analysis (EDA)
• An essential step where data scientists investigate datasets to
understand their structure, identify patterns, and uncover insights
• EDA helps to answer
• What type of data do we have? Are we working with numbers, text,
or dates?
• Are there outliers? These are unusual values that are very different
from the rest.
• Is anything missing? Are some parts of the dataset empty or
incomplete?
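The three questions above can be sketched with pandas as follows. The DataFrame is a hypothetical example, and the 1.5 x IQR rule used for outliers is one common convention chosen here for illustration, not something the slides prescribe.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with one missing age and one extreme income.
df = pd.DataFrame({
    "age": [25, 31, 29, np.nan, 27, 30],
    "city": ["Pune", "Delhi", "Pune", "Chennai", "Delhi", "Pune"],
    "income": [40, 42, 39, 41, 43, 400],
})

# What type of data do we have?
print(df.dtypes)

# Is anything missing? Count missing values per column.
missing_counts = df.isna().sum()
print(missing_counts)

# Are there outliers? Flag values outside 1.5 * IQR of the quartiles.
q1 = df["income"].quantile(0.25)
q3 = df["income"].quantile(0.75)
iqr = q3 - q1
is_outlier = (df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)
outliers = df[is_outlier]
print(outliers)
```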
Packages for EDA
• NumPy
• Pandas
• Seaborn
• Matplotlib.
NumPy
• Numerical Python
• Used for working with arrays.
• To install
• pip install numpy
• To import
• import numpy
• Usually imported as np alias
• import numpy as np
Applications - numpy
• Data Analysis: NumPy is used extensively in data analysis to perform
operations like mean, median, and standard deviation calculations.
• Machine Learning: Many machine learning libraries, like scikit-learn,
utilize NumPy arrays to process and analyze data.
• Image Processing: NumPy aids in manipulating and processing
images, making it valuable in computer vision tasks.
• Scientific Research: Scientists and researchers use NumPy for
simulations and scientific computing.
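For the data analysis use case, a short sketch of the mean, median, and standard deviation calculations (the scores array is a made-up example):

```python
import numpy as np

# Hypothetical sample of eight scores.
scores = np.array([2, 4, 4, 4, 5, 5, 7, 9])

mean_score = np.mean(scores)      # arithmetic mean
median_score = np.median(scores)  # middle value of the sorted data
std_score = np.std(scores)        # population standard deviation (ddof=0)

print(mean_score, median_score, std_score)
```

np.std divides by N by default; pass ddof=1 for the sample standard deviation.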
import numpy
arr = numpy.array([1, 2, 3, 4, 5])
print(arr)

import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)
Create array-numpy
• The array object in NumPy is called ndarray
• We can create ndarray using array()
Array Indexing
3 X 3 Matrix Breakdown
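The slide's figure is not available here, so the following is only a sketch of how indexing into a 3 x 3 matrix might be illustrated; the matrix values are assumptions.

```python
import numpy as np

# A 3 x 3 matrix; indices start at 0 for both rows and columns.
m = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

first = m[0, 0]      # row 0, column 0
center = m[1, 1]     # row 1, column 1
last = m[-1, -1]     # negative indices count from the end
second_row = m[1]    # an entire row
third_col = m[:, 2]  # an entire column

print(first, center, last)
print(second_row)
print(third_col)
```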
Data Manipulation and Preprocessing
• Indexing and Slicing: NumPy arrays can be easily indexed and sliced
to extract specific elements, rows, columns, or sub-arrays, allowing
for targeted data access and manipulation.
• Reshaping: Arrays can be reshaped to different dimensions without
altering the data, which is essential for preparing data for machine
learning models or for specific analytical needs.
• Filtering (Boolean Indexing): Data can be filtered based on conditions
using boolean arrays, allowing for the selection of data points that
meet specific criteria.
• Handling Missing Values: Missing or undefined entries can be stored as
np.nan, detected with np.isnan(), and skipped by NaN-aware functions such as np.nanmean()
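The four operations listed above can be sketched on one small array. The values are hypothetical, and filling gaps with the mean is just one possible choice.

```python
import numpy as np

# Hypothetical measurements with one missing entry.
data = np.array([3.0, 8.0, np.nan, 1.0, 6.0, 4.0])

# Indexing and slicing: the first three elements.
first_three = data[:3]

# Reshaping: the same six values viewed as 2 rows x 3 columns.
grid = data.reshape(2, 3)

# Handling missing values: locate NaN entries and compute a NaN-aware mean.
is_missing = np.isnan(data)
clean_mean = np.nanmean(data)

# Filtering (Boolean indexing): keep valid values greater than 3.
valid = data[~is_missing]
large = valid[valid > 3]

# Replace each missing entry with the mean of the observed values.
filled = np.where(is_missing, clean_mean, data)

print(grid)
print(large)
print(filled)
```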
Importing Data
Loading Text File - numpy
import numpy as np
filepath='D:/Python/sample.txt'
data=np.loadtxt(filepath)
print(data)
[1. 2. 3.]
Arrays - numpy
Copying, Sorting, Reshaping
1. Create a null vector of size 10
Z = np.zeros(10)
print(Z)
2. Create a vector with values ranging from 10 to 49
Z = np.arange(10, 50)  # the stop value 50 is excluded
print(Z)
3. Reverse a vector (first element becomes last)
Z = np.arange(50)
print(Z)
Z = Z[::-1]
print(Z)
4. Extract all odd numbers from arr
arr = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
odd = arr[arr % 2 == 1]
print(odd)
5. Replace all odd numbers in arr with -1
arr = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
arr[arr % 2 == 1] = -1
print(arr)
6. Convert a 1D array to a 2D array with 2 rows
arr = np.arange(10)
print(arr)
array2 = arr.reshape(2, -1)  # -1 lets NumPy work out the number of columns
print(array2)
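The exercises above cover reshaping; copying and sorting, the other two topics in this slide's title, might be sketched as follows (array values are arbitrary):

```python
import numpy as np

original = np.array([3, 1, 2])

alias = original               # plain assignment: same array, not a copy
independent = original.copy()  # a separate array with its own data

independent[0] = 99            # leaves `original` untouched

sorted_copy = np.sort(original)  # returns a new sorted array
original.sort()                  # sorts the array in place

print(original)
print(independent)
print(sorted_copy)
```

Because alias refers to the same object as original, the in-place sort is visible through both names.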
Slicing – numpy
• A method for extracting a portion of a NumPy array
• The fundamental syntax for slicing in NumPy is array[start:stop:step].
• start: The inclusive starting index of the slice. If omitted, it defaults to
the beginning of the dimension.
• stop: The exclusive ending index of the slice. If omitted, it defaults to
the end of the dimension.
• step: The interval between elements selected in the slice. If omitted,
it defaults to 1. A negative step reverses the order of elements.
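A few slices that exercise start, stop, and step, first on a 1D array and then on a 2D array (both arrays are arbitrary examples):

```python
import numpy as np

a = np.arange(10)        # 0, 1, ..., 9

middle = a[2:7]          # indices 2 to 6; the stop index 7 is excluded
head = a[:4]             # start omitted: begins at index 0
tail = a[6:]             # stop omitted: runs to the end
every_third = a[1:9:3]   # indices 1, 4, 7
reversed_a = a[::-1]     # negative step reverses the order

# In 2D, each dimension takes its own slice, separated by a comma.
grid = np.arange(12).reshape(3, 4)
block = grid[0:2, 1:3]   # rows 0-1, columns 1-2

print(middle, head, tail, every_third, reversed_a)
print(block)
```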
Pandas
Uses
• Import datasets from databases, spreadsheets, comma-separated
values (CSV) files, and more.
• Clean datasets, for example, by dealing with missing values.
• Tidy datasets by reshaping their structure into a suitable format for
analysis.
• Aggregate data by calculating summary statistics such as the mean of
columns, correlation between them, and more.
• Visualize datasets and uncover insights.
• To install
• pip install pandas
• To import
• import pandas as pd
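A minimal sketch of the cleaning and aggregation uses listed above. The column names and values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical study data with one missing value.
df = pd.DataFrame({
    "hours": [1.0, 2.0, np.nan, 4.0],
    "score": [10.0, 20.0, 30.0, 40.0],
})

# Clean: drop rows that contain a missing value.
clean = df.dropna()

# Aggregate: column means and the correlation between two columns.
means = clean.mean()
corr = clean["hours"].corr(clean["score"])

print(clean)
print(means)
print(corr)
```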
Pandas - Series
import pandas as pd
data_list = [10, 20, 30, 40, 50]
my_series = pd.Series(data_list)
print(my_series)
0 10
1 20
2 30
3 40
4 50
dtype: int64
Pandas - Series (NumPy)
import pandas as pd
import numpy as np
data_array = np.array(['a', 'b', 'c', 'd'])
my_series = pd.Series(data_array)
print(my_series)
0 a
1 b
2 c
3 d
dtype: object
Pandas - Series (Custom Index)
import pandas as pd
data = [1, 3, 5]
custom_index = ['x', 'y', 'z']
my_series = pd.Series(data, index=custom_index)
print(my_series)
x 1
y 3
z 5
dtype: int64
Pandas DataFrame
• Two-dimensional: Data is organized in rows and columns.
• Labeled Axes: Both rows and columns have labels (indices for rows,
and column names for columns), allowing for easy access and
manipulation of specific data points.
• Potentially Heterogeneous Data Types: Columns can contain
different data types (e.g., integers, floats, strings, booleans), while
within a single column, the data type is typically uniform.
• Mutable: Data within a DataFrame can be modified after creation.
• Primary Pandas Data Structure: It is the most commonly used object
in pandas for data analysis and manipulation tasks.
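The properties above can be checked on a small DataFrame. The names, values, and row labels below are hypothetical.

```python
import pandas as pd

# Each column holds one data type; different columns can hold different types.
df = pd.DataFrame(
    {
        "name": ["Asha", "Ravi"],
        "age": [30, 25],
        "height_m": [1.62, 1.75],
        "member": [True, False],
    },
    index=["r1", "r2"],  # row labels
)

print(df.shape)   # (rows, columns)
print(df.dtypes)  # one dtype per column

# Mutable: change a single cell using its row and column labels.
df.loc["r2", "age"] = 26
print(df)
```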
Pandas
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
# Load the data into a DataFrame object
df = pd.DataFrame(data)
print(df)
Pandas – Locate Row
Pandas – Named Index
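The content of these two slides is not recoverable, so the following is a sketch of what locating rows and using a named index typically look like, reusing the calories/duration data from above. The labels day1 to day3 are assumptions.

```python
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
}

# Default index: rows are labelled 0, 1, 2.
df = pd.DataFrame(data)
row0 = df.loc[0]         # one row, returned as a Series
rows01 = df.loc[[0, 1]]  # a list of labels returns a DataFrame

# Named index: give each row its own label (hypothetical labels).
named = pd.DataFrame(data, index=["day1", "day2", "day3"])
day2 = named.loc["day2"]

print(row0)
print(rows01)
print(day2)
```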
CSV Files
• Comma Separated Values
• A plain text file format used to store tabular data, where each row
represents a record, and each column is separated by a comma
How To create a csv File
• You can make a CSV in a text editor like Notepad. Separate each data
field with a comma (no spaces), put each row on a separate line, and
save the file as a ".csv".
• To save a spreadsheet as a CSV in Excel, go to File > Save as and select
CSV as the file type.
• In Google Sheets, go to File > Download, then select Comma
Separated Values (.csv).
Creating csv file - pandas
• import pandas
• Prepare Your Data into a suitable structure for a DataFrame
• Create a Pandas DataFrame
• Save the DataFrame to CSV: Use the to_csv() method of the
DataFrame, providing the desired file name and path
df.to_csv('my_data.csv', index=False)
• 'my_data.csv' is the name of the CSV file to be created.
• index=False is typically used to prevent Pandas from writing the DataFrame index as a
column in the CSV file, which is usually not desired.
Opening csv files - pandas

df = pd.read_csv('your_file_name.csv')
• Viewing the first few rows: print(df.head())
• Getting information about the DataFrame:
df.info()
• Accessing specific columns:
print(df['column_name'])
import pandas as pd

# 1. Prepare your data

data = {
'Product': ['Laptop', 'Mouse', 'Keyboard'],
'Price': [1200, 25, 75],
'Quantity': [10, 50, 30]
}

# 2. Create a Pandas DataFrame

df = pd.DataFrame(data)

# 3. Save the DataFrame to CSV

df.to_csv('D:/Python/products.csv', index=False)

print("CSV file 'products.csv' has been created successfully.")

Open csv Files
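A self-contained round trip, sketched with the products data from the previous example. It writes to a temporary folder instead of D:/Python so that it runs on any machine; that substitution is an assumption of this sketch.

```python
import os
import tempfile

import pandas as pd

products = pd.DataFrame({
    "Product": ["Laptop", "Mouse", "Keyboard"],
    "Price": [1200, 25, 75],
    "Quantity": [10, 50, 30],
})

# Write the CSV to a temporary folder, then read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "products.csv")
    products.to_csv(csv_path, index=False)
    loaded = pd.read_csv(csv_path)

print(loaded.head())    # first rows
print(loaded["Price"])  # a single column
```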
Handling Missing Values
• Missing data is defined as values that are not stored (or not
present) for one or more variables in the given dataset
How is a Missing Value Represented in a
Dataset?
• NaN (Not a Number): In many programming languages and data
analysis tools, missing values are represented as NaN. This is the
default for libraries like Pandas in Python.
• NULL or None: In databases and some programming languages,
missing values are often represented as NULL or None. For instance,
in SQL databases, a missing value is typically recorded as NULL.
• Empty Strings: Sometimes, missing values are denoted by empty
strings (""). This is common in text-based data or CSV files where a
field might be left blank.
How is a Missing Value Represented in a
Dataset?
• Special Indicators: Datasets might use specific indicators like -999,
9999, or other unlikely values to signify missing data. This is often
seen in older datasets or specific industries where such conventions
were established.
• Blanks or Spaces: In some cases, particularly in fixed-width text files,
missing values might be represented by spaces or blank fields.
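Several of these representations can be normalised to NaN when the data is loaded. The sketch below uses a made-up in-memory CSV with -999 as the special indicator; pd.read_csv already treats empty fields and the string NULL as missing, and na_values adds the sentinel.

```python
import io

import pandas as pd

# Hypothetical CSV text: a -999 sentinel, an empty field, and a NULL string.
raw = io.StringIO(
    "id,temp,city\n"
    "1,21.5,Pune\n"
    "2,-999,Delhi\n"
    "3,,NULL\n"
)

# na_values lists extra markers that should be read as missing.
df = pd.read_csv(raw, na_values=[-999])

print(df)
print(df.isna().sum())
```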
Why is Data Missing From the Dataset?
• Past data might get corrupted due to improper maintenance.

• Observations were not recorded for certain fields, for example because
of human error when the values were entered.

• The user has not provided the values intentionally

• Item nonresponse: This means the participant refused to respond.

Types of Missing Values
MCAR – Missing Completely At Random
• The missing data happens totally by chance—no pattern at all.
• Example: a temperature sensor fails randomly, dropping readings
unpredictably
• Example: A survey respondent accidentally skips a question because
their phone battery died mid-survey.
• Usually safe to drop the affected rows or apply a simple fill (e.g. mean or median).
MAR – Missing At Random
• The missingness is linked to something else you already know—but
not to the missing value itself.
• Example: Doctor-visit records are more often missing for lower-income patients

• Income, which is observed, predicts whether the visit frequency is missing

• Use observed data (like income or age) to impute

MNAR – Missing Not At Random
• The reason for the missingness is directly related to the value of the
missing data itself
• Example: High earners decline to report their income, so the missingness is tied to the actual value
• Requires complex modeling or external data
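To see why the mechanism matters, the simulation below compares MCAR and MNAR on synthetic incomes. The distribution, the 30% drop rate, and the rule that the top 20% of incomes go missing are all assumptions chosen to make the effect visible.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Synthetic incomes; the true mean is known because nothing is missing yet.
income = rng.normal(loc=50_000, scale=10_000, size=10_000)
true_mean = income.mean()

# MCAR: every value has the same 30% chance of being missing.
mcar = income.copy()
mcar[rng.random(income.size) < 0.3] = np.nan

# MNAR: the highest incomes are exactly the ones that go missing.
mnar = income.copy()
mnar[income > np.quantile(income, 0.8)] = np.nan

mcar_mean = np.nanmean(mcar)
mnar_mean = np.nanmean(mnar)

print(true_mean, mcar_mean, mnar_mean)
```

Under MCAR the mean of the observed values stays close to the true mean, while under MNAR it is pulled downward, which is why dropping rows or simple filling is not enough in that case.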
Why Do We Need to Care About Handling
Missing Data?
• Many machine learning algorithms fail if the dataset contains missing
values. Some algorithms, such as K-nearest neighbours and Naive Bayes, can be
adapted to work with missing values, but whether a given library
implementation accepts them varies.

• You may end up building a biased machine learning model, leading to
incorrect results, if the missing values are not handled properly.

• Missing data can lead to a lack of precision in the statistical analysis.

Handling Missing Value
• Deleting the missing value
• Deleting Rows from DataFrames
1. Removing Rows by Index
2. Removing Rows by Condition
3. Removing Duplicate Rows
4. Removing Rows with Missing Values
5. Removing Rows by Index Range
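The five row-removal options might look like this in pandas. The DataFrame and the score threshold of 50 are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["A", "B", "B", "C", "D", "E"],
    "score": [90.0, 75.0, 75.0, np.nan, 40.0, 88.0],
})

# 1. Remove rows by index label.
by_index = df.drop(index=[0])

# 2. Remove rows by condition: drop rows whose score is below 50.
by_condition = df[~(df["score"] < 50)]

# 3. Remove duplicate rows (keeps the first occurrence).
no_duplicates = df.drop_duplicates()

# 4. Remove rows with missing values.
no_missing = df.dropna()

# 5. Remove rows by index range: positions 1 and 2.
by_range = df.drop(index=df.index[1:3])

print(by_index, by_condition, no_duplicates, no_missing, by_range, sep="\n\n")
```

Each call returns a new DataFrame and leaves df unchanged.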
Handling Missing Value
• Deleting Columns from DataFrame
1. Removing Columns by using Label
2. Removing Columns by using Index
3. Removing Range of Columns by using Indices
4. Removing column using pop() Method
• Flagging
• Imputation
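A sketch of the column-removal options, followed by flagging and imputation. The column names are placeholders, and mean imputation is only one of several possible strategies.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [10, 20, 30],
    "c": ["x", "y", "z"],
    "d": [0.1, 0.2, 0.3],
    "e": [True, False, True],
})

# 1. Remove a column by label.
by_label = df.drop(columns=["c"])

# 2. Remove a column by position (position 1 is column "b").
by_position = df.drop(columns=df.columns[1])

# 3. Remove a range of columns by position (positions 2 and 3: "c", "d").
by_range = df.drop(columns=df.columns[2:4])

# 4. pop() removes the column from df in place and returns it.
popped = df.pop("e")

# Flagging: record where a value was missing before filling it.
df["a_missing"] = df["a"].isna()

# Imputation: fill the gap with the mean of the observed values.
df["a"] = df["a"].fillna(df["a"].mean())

print(df)
```

Keeping the flag column lets a model still see that the value was originally missing after it has been filled.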
