Feature Engineering
(Professional Elective 1- Artificial Intelligence & Machine Learning
Specialization)
Artificial Intelligence - AI
• Intelligence: The capacity to learn and solve problems.
• Artificial Intelligence (AI): Simulation of human intelligence by
machines.
• Computers with the ability to mimic or duplicate the functions of
the human brain
AI
• Artificial Intelligence is the mechanism for incorporating human intelligence into machines
through a set of rules (an algorithm)
• AI is the study of training machines (computers) to mimic the human brain and its
thinking capabilities.
Machine Learning
• A field of study that enables a system (computer) to learn automatically from
experience and improve accordingly without being explicitly programmed
• ML is an application or subset of AI
Eg:
• Recommendation system of Netflix
• "People you may know" suggestions on social media platforms
Deep Learning
• Subset of machine learning that uses multilayered neural
networks, called deep neural networks, to simulate the complex
decision-making power of the human brain.
• Eg:
• Systems that detect suspicious attempts to log into your accounts and notify you
• Natural language processing - Chatbot, language translator
Data Science
• Data science is the study of data to derive
useful insights for business decision
making.
• Combines math, computer science, and
domain expertise to tackle real-world
challenges in a variety of fields.
Feature Engineering
Definition
• The process of creating new features or
transforming existing features to improve the
performance of a machine-learning model
• The process of transforming raw data into a
more effective set of inputs for machine learning
models
Feature Engineering
Data
• Raw facts and figures
• In the world of machine learning, data is the
foundation upon which models are built
• Its potential is hidden in the relationships
between features, the patterns within, and
the transformations that can unlock critical
insights; surfacing these is the job of feature engineering
• Feature engineering involves creating new input variables from the
existing data to improve the performance of machine learning
algorithms
• Consider a housing dataset with features like size, number of rooms,
and price. Feature engineering could involve creating a new feature
such as price_per_sqft (price divided by size), which may have a
stronger correlation with the target variable than price alone.
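The housing example can be sketched in pandas as follows. The column names (size_sqft, rooms, price) and all values are made up for illustration. If price is itself the value being predicted, a feature computed from it would leak the answer into the inputs, so this sketch only shows the mechanics of deriving a new column.

```python
import pandas as pd

# Hypothetical housing data; column names and values are illustrative only.
houses = pd.DataFrame({
    "size_sqft": [1000, 1500, 2000],
    "rooms": [2, 3, 4],
    "price": [200000, 330000, 380000],
})

# New feature derived from two existing columns.
houses["price_per_sqft"] = houses["price"] / houses["size_sqft"]
print(houses)
```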
Feature
• Also known as variable or attribute
• Individual measurable property or characteristic of a data point that is
used as input for a machine learning algorithm
• Eg: in a dataset of housing prices, features could include the number of bedrooms,
the square footage, the location, and the age of the property
• The choice and quality of features are critical in machine learning, as
they can greatly impact the accuracy and performance of the model.
Sources of Data
• Primary Data Source
• Surveys and Questionnaires
• Focus Groups
• Interviews
• Observations
• Experiments
• Secondary Data Source
• Government records
• Research Publications
• Industry Reports
• Academic Databases
• Transactional Data
• Point-of-Sale (POS) Systems
• Banking and Finance Records
• Web and Social Media Data
• Website Analytics
• Social Media Platforms
• Web Scraping
• Sensors and IoT Devices
• Smart Devices
• Things in IoT
• Open Data Sources
• Government Open Data
• NGOs and Non-Profit Organizations
• Business and Organizational Data
• Customer Relationship Management (CRM) Systems
• Enterprise Resource Planning (ERP) Systems
• Big Data Sources
• Streaming Data
• Clickstream Data
Unit 3
Exploratory Data Analysis (EDA) with NumPy, Pandas, Matplotlib and
Seaborn
Exploratory Data Analysis (EDA)
• An essential step where data scientists investigate datasets to
understand their structure, identify patterns, and uncover insights
• EDA helps to answer
• What type of data do we have? Are we working with numbers, text,
or dates?
• Are there outliers? These are unusual values that are very different
from the rest.
• Is anything missing? Are some parts of the dataset empty or
incomplete?
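The three questions above can be answered with a few pandas calls. This is a minimal sketch on a made-up dataset; the column names and values are assumptions, and the 1.5 x IQR rule used for outliers is one common rule of thumb, not the only option.

```python
import pandas as pd

# Small made-up dataset with mixed types, one missing value, and one extreme value.
df = pd.DataFrame({
    "age": [25, 30, 28, 27, 120],
    "city": ["Pune", "Delhi", None, "Chennai", "Mumbai"],
    "joined": pd.to_datetime(
        ["2021-01-05", "2021-03-17", "2022-07-01", "2022-11-30", "2023-02-14"]
    ),
})

# What type of data do we have?
print(df.dtypes)

# Is anything missing? Count missing values per column.
missing_counts = df.isna().sum()
print(missing_counts)

# Are there outliers? Flag values more than 1.5 * IQR outside the quartiles.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)]
print(outliers)
```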
Packages for EDA
• NumPy
• Pandas
• Seaborn
• Matplotlib
NumPy
• Numerical Python
• Used for working with arrays.
• To install
• pip install numpy
• To import
• import numpy
• Usually imported with the alias np
• import numpy as np
Applications - numpy
• Data Analysis: NumPy is used extensively in data analysis to perform
operations like mean, median, and standard deviation calculations.
• Machine Learning: Many machine learning libraries, like scikit-learn,
utilize NumPy arrays to process and analyze data.
• Image Processing: NumPy aids in manipulating and processing
images, making it valuable in computer vision tasks.
• Scientific Research: Scientists and researchers use NumPy for
simulations and scientific computing.
# Import without an alias
import numpy
arr = numpy.array([1, 2, 3, 4, 5])
print(arr)
# Import with the conventional np alias
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)
Create array-numpy
• The array object in NumPy is called ndarray
• We can create an ndarray using the array() function
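A short sketch of creating arrays and checking what was created. The variable names a and b are only for illustration.

```python
import numpy as np

# A 1-D array built from a Python list.
a = np.array([1, 2, 3, 4, 5])
# A 2-D array built from a list of lists.
b = np.array([[1, 2, 3], [4, 5, 6]])

print(type(a))          # the ndarray class
print(a.ndim, a.shape)  # number of dimensions and size along each one
print(b.ndim, b.shape)
```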
Array Indexing
3 X 3 Matrix Breakdown
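As a sketch of indexing into a 3 x 3 matrix, assuming a matrix filled with the values 1 to 9 (the values are chosen only for illustration). Indices start at 0, and the form m[row, column] addresses a single element.

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

print(m[0, 0])    # first row, first column
print(m[1, 2])    # second row, third column
print(m[2])       # entire third row
print(m[:, 1])    # entire second column
print(m[-1, -1])  # negative indices count from the end
```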
Data Manipulation and Preprocessing
• Indexing and Slicing: NumPy arrays can be easily indexed and sliced
to extract specific elements, rows, columns, or sub-arrays, allowing
for targeted data access and manipulation.
• Reshaping: Arrays can be reshaped to different dimensions without
altering the data, which is essential for preparing data for machine
learning models or for specific analytical needs.
• Filtering (Boolean Indexing): Data can be filtered based on conditions
using boolean arrays, allowing for the selection of data points that
meet specific criteria.
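• Handling Missing Values: Missing or undefined numeric data is represented as np.nan; np.isnan() detects it, and functions such as np.nanmean() compute statistics that ignore it.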
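The operations listed above can be sketched on one small array. The values are made up, and np.nan stands in for missing readings.

```python
import numpy as np

data = np.array([4.0, np.nan, 7.0, 1.0, np.nan, 9.0])

# Reshaping: the same six values viewed as 2 rows x 3 columns.
grid = data.reshape(2, 3)

# Handling missing values: build a mask of NaN positions, then drop them.
mask = np.isnan(data)
clean = data[~mask]

# Filtering (boolean indexing): keep values greater than 3.
large = clean[clean > 3]

# A statistic that skips NaN without removing it first.
mean_ignoring_nan = np.nanmean(data)

print(grid)
print(clean)
print(large)
print(mean_ignoring_nan)
```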
Importing Data
Loading Text File - numpy
import numpy as np
filepath='D:/Python/sample.txt'
data=np.loadtxt(filepath)
print(data)
[1. 2. 3.]
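A self-contained variant that runs on any machine: it first writes a small file to a temporary folder and then loads it. The file contents are made up, and the comma delimiter is an assumption about the file layout.

```python
import os
import tempfile
import numpy as np

# Write a small comma-separated file to a temporary location (hypothetical data).
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as f:
    f.write("1,2,3\n4,5,6\n")

# loadtxt splits on whitespace by default; pass delimiter for other separators.
data = np.loadtxt(path, delimiter=",")
print(data)
print(data.shape)
```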
Arrays - numpy
Copying, Sorting, Reshaping
1. Create a null vector of size 10
Z = np.zeros(10)
print(Z)
2. Create a vector with values ranging from 10 to 49
Z = np.arange(10, 50)
print(Z)
3. Reverse a vector (first element becomes last)
Z = np.arange(50)
print(Z)
Z = Z[::-1]
print(Z)
4. Extract all odd numbers from arr
arr = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
odd = arr[arr % 2 == 1]
print(odd)
5. Replace all odd numbers in arr with -1
arr = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
arr[arr % 2 == 1] = -1
print(arr)
6. Convert a 1D array to a 2D array with 2 rows
arr = np.arange(10)
print(arr)
array2 = arr.reshape(2, -1)  # -1 lets NumPy infer the number of columns
print(array2)
Slicing – numpy
• A method for extracting a portion of a NumPy array
• The fundamental syntax for slicing in NumPy is array[start:stop:step].
• start: The inclusive starting index of the slice. If omitted, it defaults to
the beginning of the dimension.
• stop: The exclusive ending index of the slice. If omitted, it defaults to
the end of the dimension.
• step: The interval between elements selected in the slice. If omitted,
it defaults to 1. A negative step reverses the order of elements.
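The start:stop:step rules can be tried out on a small array. This is a sketch; the arrays are built with np.arange only so that each value equals its index.

```python
import numpy as np

arr = np.arange(10)   # values 0 through 9

print(arr[2:7])       # start=2, stop=7 (exclusive)
print(arr[:4])        # start omitted: begins at index 0
print(arr[6:])        # stop omitted: runs to the end
print(arr[1:9:3])     # every third element from index 1 up to index 8
print(arr[::-1])      # negative step reverses the order

# Slicing works per dimension on 2-D arrays.
m = np.arange(12).reshape(3, 4)
print(m[0:2, 1:3])    # rows 0-1, columns 1-2
```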
Pandas
Uses
• Import datasets from databases, spreadsheets, comma-separated
values (CSV) files, and more.
• Clean datasets, for example, by dealing with missing values.
• Tidy datasets by reshaping their structure into a suitable format for
analysis.
• Aggregate data by calculating summary statistics such as the mean of
columns, correlation between them, and more.
• Visualize datasets and uncover insights.
• To install
• pip install pandas
• To import
• import pandas as pd
Pandas - Series
import pandas as pd
data_list = [10, 20, 30, 40, 50]
my_series = pd.Series(data_list)
print(my_series)
0 10
1 20
2 30
3 40
4 50
dtype: int64
Pandas - Series (NumPy)
import pandas as pd
import numpy as np
data_array = np.array(['a', 'b', 'c', 'd'])
my_series = pd.Series(data_array)
print(my_series)
0 a
1 b
2 c
3 d
dtype: object
Pandas - Series (Custom Index)
import pandas as pd
data = [1, 3, 5]
custom_index = ['x', 'y', 'z']
my_series = pd.Series(data, index=custom_index)
print(my_series)
x 1
y 3
z 5
dtype: int64
Pandas DataFrame
• Two-dimensional: Data is organized in rows and columns.
• Labeled Axes: Both rows and columns have labels (indices for rows,
and column names for columns), allowing for easy access and
manipulation of specific data points.
• Potentially Heterogeneous Data Types: Columns can contain
different data types (e.g., integers, floats, strings, booleans), while
within a single column, the data type is typically uniform.
• Mutable: Data within a DataFrame can be modified after creation.
• Primary Pandas Data Structure: It is the most commonly used object
in pandas for data analysis and manipulation tasks.
Pandas
import pandas as pd
data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}
# Load data into a DataFrame object:
df = pd.DataFrame(data)
print(df)
Pandas – Locate Row
Pandas – Named Index
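A sketch of locating rows with loc and of giving rows a named index, reusing the calories/duration data from the DataFrame example. The labels day1, day2 and day3 are illustrative assumptions.

```python
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
}

# Default integer index: loc selects rows by label.
df = pd.DataFrame(data)
first_row = df.loc[0]        # a single row comes back as a Series
first_two = df.loc[[0, 1]]   # a list of labels returns a DataFrame

# Named index: pass your own row labels when creating the DataFrame.
df_named = pd.DataFrame(data, index=["day1", "day2", "day3"])
day2 = df_named.loc["day2"]

print(first_row)
print(first_two)
print(day2)
```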
CSV Files
• Comma Separated Values
• A plain text file format used to store tabular data, where each row
represents a record, and each column is separated by a comma
How to Create a CSV File
• You can make a CSV in a text editor like Notepad. Separate each data
field with a comma (no spaces), put each row on a separate line, and
save the file as a ".csv".
• To save a spreadsheet as a CSV in Excel, go to File > Save as and select
CSV as the file type.
• In Google Sheets, go to File > Download, then select Comma
Separated Values (.csv).
Creating csv file - pandas
• Import pandas
• Prepare your data in a structure suitable for a DataFrame (for example, a dictionary of lists)
• Create a Pandas DataFrame
• Save the DataFrame to CSV: Use the to_csv() method of the
DataFrame, providing the desired file name and path
df.to_csv('my_data.csv', index=False)
• 'my_data.csv' is the name of the CSV file to be created.
• index=False is typically used to prevent Pandas from writing the DataFrame index as a
column in the CSV file, which is usually not desired.
Opening csv files - pandas
df = pd.read_csv('your_file_name.csv')
• Viewing the first few rows: print(df.head())
• Getting information about the DataFrame (info() prints its summary itself):
df.info()
• Accessing specific columns:
print(df['column_name'])
import pandas as pd
# 1. Prepare your data
data = {
'Product': ['Laptop', 'Mouse', 'Keyboard'],
'Price': [1200, 25, 75],
'Quantity': [10, 50, 30]
}
# 2. Create a Pandas DataFrame
df = pd.DataFrame(data)
# 3. Save the DataFrame to CSV
df.to_csv('D:/Python/products.csv', index=False)
print("CSV file 'products.csv' has been created successfully.")
Open csv Files
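A round-trip sketch: write the products data to a CSV file and read it back. A temporary folder is used here instead of the D:/Python path so that the example runs on any machine; that choice is an assumption of this sketch.

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({
    "Product": ["Laptop", "Mouse", "Keyboard"],
    "Price": [1200, 25, 75],
    "Quantity": [10, 50, 30],
})

# Save to a temporary folder so the sketch does not depend on a drive letter.
path = os.path.join(tempfile.mkdtemp(), "products.csv")
df.to_csv(path, index=False)

# Open the file again as a new DataFrame.
loaded = pd.read_csv(path)
print(loaded.head())    # first rows (up to five)
loaded.info()           # column names, dtypes, non-null counts
print(loaded["Price"])  # a single column
```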
Handling Missing Values
• Missing data refers to values that are not stored (or not present)
for one or more variables in a given dataset
How is a Missing Value Represented in a
Dataset?
• NaN (Not a Number): In many programming languages and data
analysis tools, missing values are represented as NaN. This is the
default for libraries like Pandas in Python.
• NULL or None: In databases and some programming languages,
missing values are often represented as NULL or None. For instance,
in SQL databases, a missing value is typically recorded as NULL.
• Empty Strings: Sometimes, missing values are denoted by empty
strings (""). This is common in text-based data or CSV files where a
field might be left blank.
• Special Indicators: Datasets might use specific indicators like -999,
9999, or other unlikely values to signify missing data. This is often
seen in older datasets or specific industries where such conventions
were established.
• Blanks or Spaces: In some cases, particularly in fixed-width text files,
missing values might be represented by spaces or blank fields.
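A sketch of how several of these representations end up as NaN when a CSV is loaded with pandas. The CSV text is made up. read_csv already treats blank fields and strings such as NULL as missing; the sentinel -999 has to be declared through the na_values parameter, here only for the age column.

```python
import io
import pandas as pd

# Made-up CSV text: a blank field, the sentinel -999, and the word NULL
# all stand for "missing" in this example.
raw = io.StringIO(
    "id,age,city\n"
    "1,34,Pune\n"
    "2,,Delhi\n"
    "3,-999,NULL\n"
)

# Declare the extra sentinel for the age column only.
df = pd.read_csv(raw, na_values={"age": ["-999"]})
print(df)
print(df.isna().sum())
```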
Why is Data Missing From the Dataset?
• Past data might get corrupted due to improper maintenance.
• Observations are not recorded for certain fields, for example because
of human error while recording the values.
• The user has intentionally not provided the values
• Item nonresponse: This means the participant refused to respond.
Types of Missing Values
MCAR – Missing Completely At Random
• The missing data happens totally by chance—no pattern at all.
• Example: a temperature sensor fails randomly, dropping readings
unpredictably
• Example: A survey respondent accidentally skips a question because
their phone battery died mid-survey.
• Handling: usually safe to drop the affected rows or fill with a simple statistic such as the mean.
MAR – Missing At Random
• The missingness is linked to something else you already know—but
not to the missing value itself.
• Example: records of doctor visits are missing more often for lower-income patients
• Income is observed and predicts the frequency of visits, so it can explain the gaps
• Handling: use observed data (like age) to impute
MNAR – Missing Not At Random
• The reason for the missingness is directly related to the value of the
missing data itself
• Example: high earners withhold their income, so the missingness is tied to the actual value
• Handling: requires complex modeling or external data
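The three mechanisms can be simulated on synthetic data to see how they differ. Everything here is an assumption made for illustration: the income and visits columns, the distributions, and the missingness rates.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "income": rng.normal(50000, 15000, n),
    "visits": rng.poisson(4, n).astype(float),
})

# MCAR: every 'visits' value has the same 20% chance of going missing.
mcar = df.copy()
mcar.loc[rng.random(n) < 0.2, "visits"] = np.nan

# MAR: 'visits' goes missing only for low-income rows, so the missingness
# depends on an observed column and not on the visits value itself.
mar = df.copy()
low_income = mar["income"] < mar["income"].median()
mar.loc[low_income & (rng.random(n) < 0.4), "visits"] = np.nan

# MNAR: 'income' goes missing mostly when income itself is high, so the
# missingness depends on the very value that is hidden.
mnar = df.copy()
high = mnar["income"] > mnar["income"].quantile(0.8)
mnar.loc[high & (rng.random(n) < 0.7), "income"] = np.nan

print(mcar["visits"].isna().mean())
print(mar.groupby(low_income)["visits"].apply(lambda s: s.isna().mean()))
print(mnar["income"].isna().mean())
```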
Why Do We Need to Care About Handling
Missing Data?
• Many machine learning algorithms fail if the dataset contains missing
values. However, some algorithms, such as K-nearest neighbors and Naive
Bayes, can work with missing values, depending on the implementation.
• You may end up building a biased machine learning model, leading to
incorrect results if the missing values are not handled properly.
• Missing data can lead to a lack of precision in the statistical analysis.
Handling Missing Value
• Deleting the missing value
• Deleting Rows from DataFrames
1. Removing Rows by Index
2. Removing Rows by Condition
3. Removing Duplicate Rows
4. Removing Rows with Missing Values
5. Removing Rows by Index Range
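The five ways of removing rows can be sketched on one small DataFrame. The names and scores are made up, and the passing mark of 40 in the condition is an arbitrary choice.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["Asha", "Ravi", "Ravi", "Meena", "John", "Sara"],
    "score": [88.0, 92.0, 92.0, np.nan, 35.0, 76.0],
})

# 1. By index label.
by_index = df.drop(index=[0])
# 2. By condition: keep rows that pass; a NaN score fails the comparison.
by_condition = df[df["score"] >= 40]
# 3. Duplicate rows (the second "Ravi" row repeats the first).
no_duplicates = df.drop_duplicates()
# 4. Rows with missing values.
no_missing = df.dropna()
# 5. By index range: the rows at positions 1 and 2.
by_range = df.drop(index=df.index[1:3])

print(by_index, by_condition, no_duplicates, no_missing, by_range, sep="\n\n")
```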
Handling Missing Value
• Deleting Columns from DataFrame
1. Removing Columns by using Label
2. Removing Columns by using Index
3. Removing Range of Columns by using Indices
4. Removing column using pop() Method
• Flagging
• Imputation
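A sketch of the column removals, followed by flagging and imputation. The DataFrame is made up, and filling with the column mean is only one simple imputation choice among many.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [4, 5, 6],
    "c": [7, 8, 9],
    "d": [10, 11, 12],
})

# 1. By label.
by_label = df.drop(columns=["b"])
# 2. By index (position 1 is column "b").
by_position = df.drop(columns=df.columns[1])
# 3. Range of positions (columns "b" and "c").
by_range = df.drop(columns=df.columns[1:3])
# 4. pop() removes the column in place and returns it.
rest = df.copy()
popped = rest.pop("d")

# Flagging: record where a value was missing before filling it.
flagged = df.copy()
flagged["a_missing"] = flagged["a"].isna()

# Imputation: fill the gap with the mean of the observed values.
flagged["a"] = flagged["a"].fillna(flagged["a"].mean())
print(flagged)
```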