Data Analyst
Eng. Ejaz Ahmad
What is Data Analysis?
A process of inspecting, cleaning, transforming and
modeling data with the goal of finding useful
information and supporting decision-making
1. Inspecting
2. Cleaning
3. Transforming
4. Modeling
Skills Required
Programming (Python, R)
SQL
Excel
Tableau
Statistics
Programming
Python libraries
•Pandas
•NumPy
•Matplotlib
•Scikit-learn (sklearn)
•TensorFlow / machine learning
1.Data Extraction
Data extraction draws on the following sources and techniques
1. SQL
2. Web scraping
3. File formats (CSV, XML, JSON)
4. Consulting APIs
5. Buying data
6. Distributed databases
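A minimal sketch of two of the sources above (SQL and file formats), assuming pandas is installed. The table name, column names and values are hypothetical; an in-memory SQLite database and in-memory text stand in for a real database and real files.

```python
import io
import json
import sqlite3

import pandas as pd

# SQL: query an in-memory SQLite database (stand-in for a real database).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("south", 80.5)],
)
conn.commit()
sql_df = pd.read_sql_query("SELECT region, amount FROM sales", conn)
conn.close()

# File formats: parse CSV and JSON from in-memory text (stand-ins for files).
csv_text = "region,amount\nnorth,120.0\nsouth,80.5\n"
csv_df = pd.read_csv(io.StringIO(csv_text))

json_text = '[{"region": "north", "amount": 120.0}, {"region": "south", "amount": 80.5}]'
json_df = pd.DataFrame(json.loads(json_text))

print(sql_df)
print(csv_df)
print(json_df)
```

All three routes end in the same kind of object, a DataFrame, so the later cleaning and wrangling steps do not depend on where the data came from.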
2.Data Cleaning
Data cleaning includes the following steps
1. Missing values and empty data
2. Data imputation
3. Incorrect types
4. Incorrect and invalid values
5. Outliers and irrelevant data
6. Statistical sanitization
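Several of the cleaning steps above can be sketched with pandas. The data, the column names and the plausible age range (0 to 120) are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: ages stored as text, one missing, one impossible.
raw = pd.DataFrame({
    "age": ["25", "31", None, "28", "240"],
    "city": ["Lahore", "Karachi", "Lahore", None, "Karachi"],
})

clean = raw.copy()

# Incorrect types: convert text to numbers; unparseable entries become NaN.
clean["age"] = pd.to_numeric(clean["age"], errors="coerce")

# Invalid values: treat ages outside the assumed plausible range as missing.
clean.loc[~clean["age"].between(0, 120), "age"] = np.nan

# Data imputation: fill numeric gaps with the median, text gaps with a label.
clean["age"] = clean["age"].fillna(clean["age"].median())
clean["city"] = clean["city"].fillna("unknown")

print(clean)
```

Median imputation is only one option; depending on the data, dropping rows or using the mean or a model-based estimate may be more appropriate.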
3.Data Wrangling
1. Hierarchical data
2. Handling Categorical data
3. Reshaping and transforming structure
4. Indexing data for quick access
5. Merging, combining and joining
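A short pandas sketch of merging, categorical encoding, reshaping and indexing from the list above. The two tables and their columns are hypothetical.

```python
import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 2, 1, 3],
    "quarter": ["Q1", "Q1", "Q2", "Q2"],
    "amount": [100, 150, 200, 50],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "segment": ["retail", "wholesale", "retail"],
})

# Merging and joining: attach each customer's segment to their orders.
merged = orders.merge(customers, on="customer_id", how="left")

# Handling categorical data: one-hot encode the segment column.
encoded = pd.get_dummies(merged, columns=["segment"])

# Reshaping: one row per customer, one column per quarter.
wide = merged.pivot_table(
    index="customer_id", columns="quarter", values="amount",
    aggfunc="sum", fill_value=0,
)

# Indexing for quick access: look up a customer by index label.
indexed = customers.set_index("customer_id")
segment_of_2 = indexed.loc[2, "segment"]

print(wide)
print(segment_of_2)
```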
4.Analysis
1. Exploration
2. Building Statistical model
3. Visualization and representation
4. Correlation vs causation analysis
5. Hypothesis testing
6. Statistical analysis
7. Reporting
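Exploration, correlation and hypothesis testing from the list above might look like the sketch below. It assumes pandas and SciPy are installed; all numbers are made up for illustration, and Welch's t-test is just one of many possible tests.

```python
import pandas as pd
from scipy import stats

# Hypothetical data: hours studied and exam score.
study = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5],
    "score": [52, 55, 61, 64, 70],
})

# Exploration: summary statistics per column.
summary = study.describe()

# Correlation: Pearson's r between the two columns.
# A high r shows association only; it does not prove causation.
r = study["hours"].corr(study["score"])

# Hypothesis testing: do two hypothetical groups have different means?
group_a = [12.1, 11.8, 12.4, 12.0, 11.9]
group_b = [14.2, 13.9, 14.5, 14.1, 14.0]
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)

print(summary)
print(f"correlation: {r:.3f}")
print(f"t = {t_stat:.2f}, p = {p_value:.4g}")
```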
5.Action
1. Building Machine Learning Models
2. Feature Engineering
3. Moving ML into Production
4. Building ETL pipelines
5. Live dashboard and reporting
6. Decision making and real life testing
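As a rough sketch of the first three items above, a scikit-learn pipeline can bundle a feature-scaling step with a model, and the fitted pipeline can be serialized and restored. The data is hypothetical, and `pickle` is used here only as a simple stand-in for a real deployment process.

```python
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical training data: two features, two well-separated classes.
X = np.array([
    [1.0, 200.0], [2.0, 180.0], [3.0, 220.0],
    [8.0, 900.0], [9.0, 950.0], [10.0, 870.0],
])
y = np.array([0, 0, 0, 1, 1, 1])

# Feature scaling and the model travel together in one pipeline object.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, y)

# Serialize the fitted pipeline and restore it, as a serving process might.
blob = pickle.dumps(model)
restored = pickle.loads(blob)

new_rows = np.array([[2.5, 210.0], [9.5, 910.0]])
preds = restored.predict(new_rows)
print(preds)
```

Keeping preprocessing inside the pipeline means the same transformations are applied at training time and at prediction time.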
MACHINE LEARNING
Engr. Ejaz Ahmad
Machine learning steps
Frame the problem
Get data
Discover and visualize the data to gain insight
Prepare the data for machine learning
Select a model and train it
Fine tune your model
Present your solution
Launch your system
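The middle steps above (get data, prepare, train, fine-tune) can be sketched end to end with scikit-learn. The iris dataset bundled with scikit-learn, the k-nearest-neighbor model and the small parameter grid are illustrative choices, not part of the original slides.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Get data: a small dataset that ships with scikit-learn.
X, y = load_iris(return_X_y=True)

# Prepare the data: hold out a test set before any tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Select a model, train it, and fine-tune with cross-validated grid search.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 7]},
    cv=5,
)
search.fit(X_train, y_train)

# Evaluate once on the held-out test set.
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, test_accuracy)
```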
1.Frame the problem
What is the business objective of this model?
What was the previous solution to this problem?
Decide which type of machine learning algorithm to
apply
Types of Machine Learning
Supervised
• Learns from labeled datasets, known as training
datasets
Unsupervised
• Learns from unlabeled data; used to find structure
and patterns in big data
Reinforcement
• Learns from experience and rewards
Selecting Algorithm
Classification Algorithm
•Is this A or B?
Anomaly Detection Algorithm
•Is this weird? Analyzes patterns
Regression Algorithm
•How much or how many? An estimator
Clustering Algorithm
•Finds structure in datasets
Reinforcement Algorithm
•Used to make decisions
1.Classification
It gives two or more outputs (classes)
If it gives 2 outputs (yes or no), it is called two-class (binary) classification
If it gives 3 or more outputs (e.g. yes, no or maybe), it is called multiclass
classification
Classification Algorithms
Logistic Regression
Decision Tree
Artificial Neural network
K-Nearest Neighbor
Support Vector Machine
Random Forest
Naïve Bayes
Stochastic Gradient Descent
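A minimal sketch of two-class and multiclass classification using two of the algorithms listed above. The one-feature datasets and the yes/no/maybe labels are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Two-class classification: the label is "no" or "yes".
X_binary = np.array([[1.0], [2.0], [3.0], [7.0], [8.0], [9.0]])
y_binary = np.array(["no", "no", "no", "yes", "yes", "yes"])
binary_model = LogisticRegression().fit(X_binary, y_binary)
binary_preds = binary_model.predict(np.array([[2.0], [8.0]]))

# Multiclass classification: the label is "no", "maybe" or "yes".
X_multi = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0],
                    [20.0], [21.0], [22.0]])
y_multi = np.array(["no"] * 3 + ["maybe"] * 3 + ["yes"] * 3)
multi_model = DecisionTreeClassifier(random_state=0).fit(X_multi, y_multi)
multi_preds = multi_model.predict(np.array([[2.0], [11.0], [21.0]]))

print(binary_preds)
print(multi_preds)
```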
2.Anomaly Detection
It analyzes patterns in the data and alerts you when
a pattern changes
Credit card companies use this kind of algorithm to detect
unusual changes
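One simple way to flag an unusual value is the modified z-score, which is based on the median and the median absolute deviation (MAD). This is a deliberately simple stand-in, not what credit card companies actually run; the transaction amounts are hypothetical and the 3.5 cutoff is a commonly used rule of thumb.

```python
import numpy as np

# Hypothetical card transactions; one amount is far from the usual pattern.
amounts = np.array([20.0, 25.0, 22.0, 23.0, 21.0, 24.0, 500.0])

# Median and MAD are robust: a single extreme value barely moves them.
median = np.median(amounts)
mad = np.median(np.abs(amounts - median))

# Modified z-score; 0.6745 rescales MAD to be comparable with a
# standard deviation for normally distributed data.
modified_z = 0.6745 * (amounts - median) / mad

is_anomaly = np.abs(modified_z) > 3.5
flagged = amounts[is_anomaly]
print(flagged)
```

A plain mean-and-standard-deviation z-score would be a poor fit for such a small sample, because the outlier itself inflates the standard deviation.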
3.Regression
It is an estimator
Predicts numerical values (e.g. a price or a count)
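A minimal regression sketch with scikit-learn's `LinearRegression`, which is one estimator among many. The five data points are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: one input feature and a numeric target.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.0])

# Fit a straight line y = slope * x + intercept by least squares.
model = LinearRegression().fit(X, y)
slope = model.coef_[0]
intercept = model.intercept_

# Estimate the target for an unseen input.
prediction = model.predict(np.array([[6.0]]))[0]
print(slope, intercept, prediction)
```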
4.Clustering Algorithms
Unsupervised learning used to understand the
structure of data
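A small clustering sketch using k-means, one common clustering algorithm. The six points are hypothetical, and the number of clusters (2) is an assumption supplied by hand, since no labels are given.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical unlabeled points forming two visibly separate groups.
points = np.array([
    [1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
    [8.0, 8.0], [8.2, 7.9], [7.8, 8.1],
])

# Ask k-means for two clusters; no labels are provided.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
labels = kmeans.labels_

print(labels)
print(kmeans.cluster_centers_)
```

The cluster numbers themselves (0 or 1) are arbitrary; only the grouping is meaningful.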
5.Reinforcement Algorithms
Used to make a decision
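Learning decisions from rewards can be illustrated with an epsilon-greedy agent on a three-armed bandit. This is a toy sketch: the reward probabilities, the exploration rate and the number of steps are all assumed values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical environment: each action pays 1 with a hidden probability.
true_reward_prob = np.array([0.2, 0.5, 0.8])
n_actions = len(true_reward_prob)

estimates = np.zeros(n_actions)          # agent's running reward estimates
counts = np.zeros(n_actions, dtype=int)  # times each action was tried
epsilon = 0.1                            # fraction of random exploration

for step in range(2000):
    if rng.random() < epsilon:
        action = int(rng.integers(n_actions))   # explore
    else:
        action = int(np.argmax(estimates))      # exploit best estimate
    reward = float(rng.random() < true_reward_prob[action])
    counts[action] += 1
    # Incremental mean update of the estimate for the chosen action.
    estimates[action] += (reward - estimates[action]) / counts[action]

best_action = int(np.argmax(estimates))
print(estimates, best_action)
```

The agent is never told which action is best; it discovers this from the rewards it receives.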
Popular data sources
UC Irvine Machine Learning Repository
Kaggle datasets
Amazon’s AWS datasets
http://dataportals.org/
http://opendatamonitor.eu/
http://quandl.com/
Wikipedia’s list of Machine Learning datasets
Quora.com question
Splitting Test and Train data
from sklearn.model_selection import train_test_split
# housing is assumed to be a pandas DataFrame loaded earlier
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
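A self-contained version of the split, using a small hypothetical `housing` DataFrame (the column names and values are made up) so the resulting sizes can be checked.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the housing data: 10 rows, 2 columns.
housing = pd.DataFrame({
    "rooms": list(range(1, 11)),
    "price": [50 + 10 * r for r in range(1, 11)],
})

# Hold out 20% of the rows; random_state makes the split repeatable.
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

print(len(train_set), len(test_set))
```

With 10 rows and `test_size=0.2`, 8 rows go to training and 2 to testing, and no row appears in both sets.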