# [ Feature Engineering ] [ Extended-cheatsheet ]

1. Data Preprocessing

1.1 Handling Missing Values

● Check for missing values: df.isnull().sum()
● Drop rows with missing values: df.dropna()
● Fill missing values with a specific value: df.fillna(value)
● Fill missing values with mean: df.fillna(df.mean(numeric_only=True))
● Fill missing values with median: df.fillna(df.median(numeric_only=True))
● Fill missing values with mode: df.fillna(df.mode().iloc[0])
● Fill missing values with forward fill: df.ffill() (df.fillna(method='ffill') is deprecated in recent pandas)
● Fill missing values with backward fill: df.bfill() (see the sketch below)
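A minimal sketch tying these together, assuming a hypothetical toy frame with one numeric and one categorical column:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in a numeric and a categorical column
df = pd.DataFrame({"age": [25, np.nan, 40, np.nan],
                   "city": ["NY", None, "LA", "NY"]})

df["age"] = df["age"].fillna(df["age"].median())            # numeric: median imputation
df["city"] = df["city"].fillna(df["city"].mode().iloc[0])   # categorical: mode imputation
print(df.isnull().sum())  # confirm nothing is missing anymore
```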

1.2 Encoding Categorical Variables

● One-hot encoding: pd.get_dummies(df, columns=['column_name'])
● Label encoding: from sklearn.preprocessing import LabelEncoder; LabelEncoder().fit_transform(df['column_name'])
● Ordinal encoding: from sklearn.preprocessing import OrdinalEncoder; OrdinalEncoder().fit_transform(df[['column_name']])
● Binary encoding: df['binary_column'] = np.where(df['column_name'] == 'value', 1, 0)
● Frequency encoding: df['freq_encoded'] = df.groupby('column_name')['column_name'].transform('count')
● Mean encoding: df['mean_encoded'] = df.groupby('column_name')['target'].transform('mean')
● Weight of Evidence (WoE) encoding: mean_target = df.groupby('column_name')['target'].transform('mean'); df['woe'] = np.log(mean_target / (1 - mean_target)) (transform aligns the result row-by-row; assigning the groupwise mean directly, as is often shown, misaligns the index and yields NaNs)
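A sketch of three of these encoders on a hypothetical color column; note that target-based encodings (mean, WoE) should be fit on training folds only to avoid leakage:

```python
import pandas as pd

# Hypothetical category and binary target
df = pd.DataFrame({"color": ["red", "blue", "red", "green"],
                   "target": [1, 0, 1, 0]})

onehot = pd.get_dummies(df, columns=["color"])                        # one column per category
df["color_freq"] = df.groupby("color")["color"].transform("count")    # frequency encoding
df["color_mean"] = df.groupby("color")["target"].transform("mean")    # mean (target) encoding
```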

1.3 Scaling and Normalization

● Min-max scaling: from sklearn.preprocessing import MinMaxScaler; MinMaxScaler().fit_transform(df[['column_name']])
● Standard scaling (Z-score normalization): from sklearn.preprocessing import StandardScaler; StandardScaler().fit_transform(df[['column_name']])
● Max-abs scaling: from sklearn.preprocessing import MaxAbsScaler; MaxAbsScaler().fit_transform(df[['column_name']])
● Robust scaling: from sklearn.preprocessing import RobustScaler; RobustScaler().fit_transform(df[['column_name']])
● Normalization (L1, L2, Max): from sklearn.preprocessing import Normalizer; Normalizer(norm='l1').fit_transform(df[['column_name']])

1.4 Handling Outliers

● Identify outliers using IQR: Q1 = df['column_name'].quantile(0.25); Q3 = df['column_name'].quantile(0.75); IQR = Q3 - Q1; df[(df['column_name'] < Q1 - 1.5 * IQR) | (df['column_name'] > Q3 + 1.5 * IQR)]
● Identify outliers using Z-score: from scipy.stats import zscore; df[np.abs(zscore(df['column_name'])) > 3]
● Remove outliers: df = df[(df['column_name'] >= lower_bound) & (df['column_name'] <= upper_bound)]
● Cap outliers (see the sketch below): df['column_name'] = np.where(df['column_name'] > upper_bound, upper_bound, np.where(df['column_name'] < lower_bound, lower_bound, df['column_name']))
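A runnable sketch of IQR-based capping, assuming a hypothetical series with one extreme value; Series.clip is equivalent to the nested np.where above:

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 300])  # 300 is the outlier

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
capped = s.clip(lower=lower, upper=upper)  # winsorize instead of dropping rows
```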

2. Feature Transformation

2.1 Mathematical Transformations

● Logarithmic transformation: df['log_column'] = np.log(df['column_name']) (requires strictly positive values; use np.log1p if zeros are present)
● Square root transformation: df['sqrt_column'] = np.sqrt(df['column_name'])
● Exponential transformation: df['exp_column'] = np.exp(df['column_name'])
● Reciprocal transformation: df['reciprocal_column'] = 1 / df['column_name']
● Box-Cox transformation: from scipy.stats import boxcox; df['boxcox_column'] = boxcox(df['column_name'])[0] (requires strictly positive values)
● Yeo-Johnson transformation: from scipy.stats import yeojohnson; df['yeojohnson_column'] = yeojohnson(df['column_name'])[0] (also handles zero and negative values)
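A quick sketch on synthetic right-skewed data, comparing a fixed log transform with Box-Cox, which estimates its own power parameter:

```python
import numpy as np
import pandas as pd
from scipy.stats import boxcox, skew

rng = np.random.default_rng(0)
x = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=1000))  # synthetic skewed data

log_x = np.log(x)          # fixed transform
bc_x, lam = boxcox(x)      # Box-Cox fits lambda by maximum likelihood
print(skew(x), skew(log_x), skew(bc_x))  # skewness shrinks toward 0
```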

2.2 Binning and Discretization

● Equal-width binning: pd.cut(df['column_name'], bins=n)
● Equal-frequency binning: pd.qcut(df['column_name'], q=n)
● Custom binning: pd.cut(df['column_name'], bins=[0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
● Discretization using KBinsDiscretizer: from sklearn.preprocessing import KBinsDiscretizer; KBinsDiscretizer(n_bins=n, encode='ordinal').fit_transform(df[['column_name']])
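A sketch contrasting equal-width and equal-frequency bins on a hypothetical skewed series:

```python
import pandas as pd

s = pd.Series([1, 2, 2, 3, 4, 5, 50, 100])  # skewed toward small values

width_bins = pd.cut(s, bins=4)   # equal-width: most points crowd the first bin
freq_bins = pd.qcut(s, q=4)      # equal-frequency: two points per bin
print(width_bins.value_counts().sort_index())
print(freq_bins.value_counts().sort_index())
```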

2.3 Interaction Features

● Multiplication: df['interaction'] = df['column_1'] * df['column_2']
● Division: df['interaction'] = df['column_1'] / df['column_2']
● Addition: df['interaction'] = df['column_1'] + df['column_2']
● Subtraction: df['interaction'] = df['column_1'] - df['column_2']
● Polynomial features: from sklearn.preprocessing import PolynomialFeatures; PolynomialFeatures(degree=n).fit_transform(df[['column_1', 'column_2']])
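A sketch of PolynomialFeatures on a hypothetical two-column frame, showing the generated column names:

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})  # hypothetical columns

poly = PolynomialFeatures(degree=2, include_bias=False)
features = poly.fit_transform(df[["a", "b"]])
print(poly.get_feature_names_out())  # ['a' 'b' 'a^2' 'a b' 'b^2']
```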

2.4 Date and Time Features

● Extract year: df['year'] = df['date_column'].dt.year
● Extract month: df['month'] = df['date_column'].dt.month
● Extract day: df['day'] = df['date_column'].dt.day
● Extract hour: df['hour'] = df['datetime_column'].dt.hour
● Extract minute: df['minute'] = df['datetime_column'].dt.minute
● Extract second: df['second'] = df['datetime_column'].dt.second
● Extract day of week: df['day_of_week'] = df['date_column'].dt.dayofweek
● Extract day of year: df['day_of_year'] = df['date_column'].dt.dayofyear
● Extract week of year: df['week_of_year'] = df['date_column'].dt.isocalendar().week (dt.weekofyear is removed in recent pandas)
● Extract quarter: df['quarter'] = df['date_column'].dt.quarter
● Extract is_weekend: df['is_weekend'] = df['date_column'].dt.dayofweek.isin([5, 6])
● Extract is_holiday: holidays = ['2023-01-01', '2023-12-25']; df['is_holiday'] = df['date_column'].isin(pd.to_datetime(holidays)) (convert the strings to datetimes; comparing raw strings against a datetime column matches nothing)
● Time since feature: df['time_since'] = (df['date_column'] - df['reference_date']).dt.days
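A sketch pulling a few of these features from a hypothetical date column:

```python
import pandas as pd

df = pd.DataFrame({"date": pd.date_range("2023-01-01", periods=7, freq="D")})

df["day_of_week"] = df["date"].dt.dayofweek             # Monday=0 ... Sunday=6
df["is_weekend"] = df["date"].dt.dayofweek.isin([5, 6])
df["week_of_year"] = df["date"].dt.isocalendar().week   # replacement for dt.weekofyear
```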

3. Feature Selection

3.1 Univariate Feature Selection

● Select K best features: from sklearn.feature_selection import SelectKBest, f_classif; SelectKBest(score_func=f_classif, k=n).fit_transform(X, y)
● Select percentile of features: from sklearn.feature_selection import SelectPercentile, f_classif; SelectPercentile(score_func=f_classif, percentile=p).fit_transform(X, y)
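A runnable sketch of SelectKBest on the bundled iris data:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=f_classif, k=2)  # keep the 2 highest-scoring features
X_selected = selector.fit_transform(X, y)
print(selector.get_support())  # boolean mask of kept columns
print(X_selected.shape)        # (150, 2)
```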

3.2 Recursive Feature Elimination

● Recursive Feature Elimination (RFE): from sklearn.feature_selection import RFE; from sklearn.linear_model import LinearRegression; RFE(estimator=LinearRegression(), n_features_to_select=n).fit_transform(X, y)
● Recursive Feature Elimination with Cross-Validation (RFECV): from sklearn.feature_selection import RFECV; from sklearn.linear_model import LinearRegression; RFECV(estimator=LinearRegression(), min_features_to_select=n).fit_transform(X, y)

3.3 L1 and L2 Regularization

● Lasso (L1) selection: from sklearn.feature_selection import SelectFromModel; from sklearn.linear_model import Lasso; SelectFromModel(Lasso(alpha=a)).fit_transform(X, y) (Lasso is a regressor with no fit_transform of its own; wrap it in SelectFromModel, which keeps the features whose L1-penalized coefficients stay nonzero)
● Ridge (L2) selection: from sklearn.linear_model import Ridge; SelectFromModel(Ridge(alpha=a), threshold='median').fit_transform(X, y) (Ridge shrinks but never zeroes coefficients, so an explicit threshold is needed)
● Elastic Net selection: from sklearn.linear_model import ElasticNet; SelectFromModel(ElasticNet(alpha=a, l1_ratio=r)).fit_transform(X, y) (see the sketch below)
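A minimal sketch of L1-based selection via SelectFromModel on synthetic regression data; the alpha value here is an arbitrary assumption for the demo:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Synthetic data: 10 features, only 3 carry signal
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

selector = SelectFromModel(Lasso(alpha=1.0))   # alpha chosen arbitrarily for the demo
X_selected = selector.fit_transform(X, y)
print(selector.get_support())  # True where the Lasso coefficient survived
```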

3.4 Feature Importance

● Feature importance using Random Forest: from sklearn.ensemble import RandomForestClassifier; rf = RandomForestClassifier(); rf.fit(X, y); rf.feature_importances_
● Feature importance using Gradient Boosting: from sklearn.ensemble import GradientBoostingClassifier; gb = GradientBoostingClassifier(); gb.fit(X, y); gb.feature_importances_
● Permutation feature importance: from sklearn.inspection import permutation_importance; permutation_importance(model, X, y, n_repeats=n)
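A sketch comparing impurity-based and permutation importances for a random forest on iris:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(random_state=0).fit(X, y)

print(rf.feature_importances_)  # impurity-based, read off the fitted trees
result = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)  # score drop when each feature is shuffled
```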

4. Dimensionality Reduction

4.1 Principal Component Analysis (PCA)

● PCA: from sklearn.decomposition import PCA; PCA(n_components=n).fit_transform(X)
● Incremental PCA: from sklearn.decomposition import IncrementalPCA; IncrementalPCA(n_components=n).fit_transform(X)
● Kernel PCA: from sklearn.decomposition import KernelPCA; KernelPCA(n_components=n, kernel='rbf').fit_transform(X)
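A sketch of PCA on iris; standardizing first matters because PCA is variance-based:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)  # put features on a common scale

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)  # variance captured by each component
```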

4.2 t-SNE (t-Distributed Stochastic Neighbor Embedding)

● t-SNE: from sklearn.manifold import TSNE; TSNE(n_components=n).fit_transform(X)

4.3 UMAP (Uniform Manifold Approximation and Projection)

● UMAP: from umap import UMAP; UMAP(n_components=n).fit_transform(X)

4.4 Autoencoders

● Autoencoder using Keras: from keras.layers import Input, Dense; from keras.models import Model; input_layer = Input(shape=(n,)); encoded = Dense(encoding_dim, activation='relu')(input_layer); decoded = Dense(n, activation='sigmoid')(encoded); autoencoder = Model(input_layer, decoded); encoder = Model(input_layer, encoded) (compile and train first, e.g. autoencoder.compile(optimizer='adam', loss='mse'); autoencoder.fit(X, X), then extract features with encoder.predict(X))

5. Text Feature Engineering

5.1 Text Preprocessing

● Lowercase: df['text'] = df['text'].str.lower()
● Remove punctuation: df['text'] = df['text'].str.replace('[^a-zA-Z]', ' ', regex=True) (recent pandas requires regex=True for pattern replacement)
● Remove stopwords: from nltk.corpus import stopwords; stop_words = set(stopwords.words('english')); df['text'] = df['text'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))
● Stemming: from nltk.stem import PorterStemmer; ps = PorterStemmer(); df['text'] = df['text'].apply(lambda x: ' '.join([ps.stem(word) for word in x.split()]))
● Lemmatization: from nltk.stem import WordNetLemmatizer; lemmatizer = WordNetLemmatizer(); df['text'] = df['text'].apply(lambda x: ' '.join([lemmatizer.lemmatize(word) for word in x.split()]))
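A sketch chaining these steps on a one-row toy frame; it assumes the NLTK stopwords corpus has been downloaded via nltk.download('stopwords'):

```python
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

df = pd.DataFrame({"text": ["The Cats are RUNNING quickly!!"]})

df["text"] = df["text"].str.lower()
df["text"] = df["text"].str.replace("[^a-z ]", " ", regex=True)

stop_words = set(stopwords.words("english"))
ps = PorterStemmer()
df["text"] = df["text"].apply(
    lambda x: " ".join(ps.stem(w) for w in x.split() if w not in stop_words))
print(df["text"].iloc[0])  # e.g. "cat run quickli"
```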

5.2 Text Vectorization

● Bag-of-Words (BoW): from sklearn.feature_extraction.text import CountVectorizer; CountVectorizer().fit_transform(df['text'])
● TF-IDF: from sklearn.feature_extraction.text import TfidfVectorizer; TfidfVectorizer().fit_transform(df['text'])
● Word2Vec: from gensim.models import Word2Vec; Word2Vec(sentences=df['text'].str.split(), vector_size=n, window=w, min_count=m, workers=wrk) (sentences must be lists of tokens, not raw strings)
● GloVe: from gensim.models import KeyedVectors; KeyedVectors.load_word2vec_format('glove.6B.100d.txt', binary=False, no_header=True) (raw GloVe files lack the word2vec header line; gensim 4.x accepts no_header=True)
● FastText: from gensim.models import FastText; FastText(sentences=df['text'].str.split(), vector_size=n, window=w, min_count=m, workers=wrk)
● BERT: from transformers import BertTokenizer, BertModel; tokenizer = BertTokenizer.from_pretrained('bert-base-uncased'); model = BertModel.from_pretrained('bert-base-uncased')
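A runnable TF-IDF sketch on a two-document toy corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the cat sat on the mat", "the dog sat on the log"]  # toy corpus

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)          # sparse matrix, one row per document
print(vectorizer.get_feature_names_out())     # vocabulary in column order
print(X.toarray().round(2))
```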

5.3 Text Feature Extraction

● Named Entity Recognition (NER): from nltk import word_tokenize, pos_tag, ne_chunk; ne_chunk(pos_tag(word_tokenize(text)))
● Part-of-Speech (POS) tagging: from nltk import word_tokenize, pos_tag; pos_tag(word_tokenize(text))
● Sentiment Analysis: from textblob import TextBlob; TextBlob(text).sentiment

6. Image Feature Engineering

6.1 Image Preprocessing

● Resize: from PIL import Image; img = Image.open('image.jpg'); img = img.resize((width, height))
● Convert to grayscale: from PIL import Image; img = Image.open('image.jpg'); img = img.convert('L')
● Normalize pixel values: import numpy as np; from PIL import Image; arr = np.array(Image.open('image.jpg')) / 255.0 (a PIL Image cannot be divided directly; convert it to a NumPy array first)
● Data augmentation (flip, rotate, etc.): from keras.preprocessing.image import ImageDataGenerator; datagen = ImageDataGenerator(rotation_range=r, width_shift_range=ws, height_shift_range=hs, shear_range=s, zoom_range=z, horizontal_flip=True)

6.2 Image Feature Extraction

● HOG (Histogram of Oriented Gradients): from skimage.feature import hog; hog_features = hog(image, orientations=n, pixels_per_cell=(p, p), cells_per_block=(c, c))
● SIFT (Scale-Invariant Feature Transform): import cv2; sift = cv2.SIFT_create(); keypoints, descriptors = sift.detectAndCompute(image, None) (SIFT moved from xfeatures2d into the main cv2 module in OpenCV 4.4+)
● ORB (Oriented FAST and Rotated BRIEF): import cv2; orb = cv2.ORB_create(); keypoints, descriptors = orb.detectAndCompute(image, None)
● CNN features: from keras.applications.vgg16 import VGG16; model = VGG16(weights='imagenet', include_top=False); features = model.predict(image) (image must be a preprocessed batch, e.g. shape (1, 224, 224, 3))

7. Audio Feature Engineering

7.1 Audio Preprocessing

● Load audio file: import librosa; audio, sample_rate = librosa.load('audio.wav')
● Resampling: import librosa; audio = librosa.resample(audio, orig_sr=sample_rate, target_sr=target_sample_rate)
● Normalize audio: import librosa; audio = librosa.util.normalize(audio)

7.2 Audio Feature Extraction

● MFCC (Mel-Frequency Cepstral Coefficients): import librosa; mfccs = librosa.feature.mfcc(y=audio, sr=sample_rate)
● Chroma features: import librosa; chroma = librosa.feature.chroma_stft(y=audio, sr=sample_rate)
● Spectral contrast: import librosa; contrast = librosa.feature.spectral_contrast(y=audio, sr=sample_rate)
● Tonnetz: import librosa; tonnetz = librosa.feature.tonnetz(y=audio, sr=sample_rate)

8. Time Series Feature Engineering

8.1 Time Series Decomposition

● Decompose time series into trend, seasonality, and residuals: from statsmodels.tsa.seasonal import seasonal_decompose; decomposition = seasonal_decompose(time_series, model='additive', period=p)
● STL decomposition: from statsmodels.tsa.seasonal import STL; stl = STL(time_series, period=p); res = stl.fit()

8.2 Rolling and Expanding Statistics

● Rolling mean: time_series.rolling(window=n).mean()
● Rolling standard deviation: time_series.rolling(window=n).std()
● Expanding mean: time_series.expanding(min_periods=n).mean()
● Expanding standard deviation: time_series.expanding(min_periods=n).std()

8.3 Lag Features

● Shift/lag feature: df['lag_1'] = df['column'].shift(1)
● Difference feature: df['diff_1'] = df['column'].diff(1)
● Percentage change feature: df['pct_change_1'] = df['column'].pct_change(1)
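A sketch building lag-style features on a hypothetical sales series:

```python
import pandas as pd

df = pd.DataFrame({"sales": [100, 110, 105, 120, 130]})  # hypothetical series

df["lag_1"] = df["sales"].shift(1)               # previous value (NaN in the first row)
df["diff_1"] = df["sales"].diff(1)               # change vs. previous period
df["pct_change_1"] = df["sales"].pct_change(1)   # relative change
df["rolling_mean_3"] = df["sales"].rolling(window=3).mean()
```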

8.4 Autocorrelation and Partial Autocorrelation

● Autocorrelation Function (ACF): from statsmodels.tsa.stattools import acf; acf_values = acf(time_series, nlags=n)
● Partial Autocorrelation Function (PACF): from statsmodels.tsa.stattools import pacf; pacf_values = pacf(time_series, nlags=n)

9. Geospatial Feature Engineering

9.1 Geospatial Distance and Proximity

● Haversine distance (great-circle distance in km between two lon/lat points):
  from math import radians, cos, sin, asin, sqrt
  def haversine(lon1, lat1, lon2, lat2):
      lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
      dlon = lon2 - lon1
      dlat = lat2 - lat1
      a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
      c = 2 * asin(sqrt(a))
      r = 6371  # Earth radius in km
      return c * r
● Manhattan distance: from math import fabs; def manhattan(x1, y1, x2, y2): return fabs(x1 - x2) + fabs(y1 - y2)
● Euclidean distance: from math import sqrt; def euclidean(x1, y1, x2, y2): return sqrt((x1 - x2)**2 + (y1 - y2)**2)

9.2 Geospatial Aggregation

● Spatial join: import geopandas as gpd; gpd.sjoin(gdf1, gdf2, predicate='intersects') (the op= keyword was renamed to predicate= in recent geopandas)
● Spatial groupby: import geopandas as gpd; gdf.groupby('column').agg({'geometry': 'first', 'other_column': 'mean'})

9.3 Geospatial Binning

● Create grid: import geopandas as gpd; grid = gpd.GeoDataFrame(geometry=gpd.points_from_xy(x, y))
● Spatial binning: import geopandas as gpd; gdf['grid_id'] = gpd.sjoin(gdf, grid, predicate='within')['index_right']

10. Feature Scaling and Normalization

10.1 Scaling

● Min-max scaling: from sklearn.preprocessing import MinMaxScaler; MinMaxScaler().fit_transform(df[['column_name']])
● Standard scaling (Z-score normalization): from sklearn.preprocessing import StandardScaler; StandardScaler().fit_transform(df[['column_name']])
● Max-abs scaling: from sklearn.preprocessing import MaxAbsScaler; MaxAbsScaler().fit_transform(df[['column_name']])
● Robust scaling: from sklearn.preprocessing import RobustScaler; RobustScaler().fit_transform(df[['column_name']])

10.2 Normalization

● L1 normalization: from sklearn.preprocessing import Normalizer; Normalizer(norm='l1').fit_transform(df[['column_name']])
● L2 normalization: from sklearn.preprocessing import Normalizer; Normalizer(norm='l2').fit_transform(df[['column_name']])
● Max normalization: from sklearn.preprocessing import Normalizer; Normalizer(norm='max').fit_transform(df[['column_name']])

By: Waleed Mousa
