Data Engineers

Report on data engineering domain

Uploaded by

ramjikancharla24
Copyright © All Rights Reserved

Data Science

1. Python Statistics for Data Science Course

Module 1: Understanding the Data

Topics:
• Introduction to Data Types
• Numerical parameters to represent data
• Mean
• Mode
• Median
• Sensitivity
• Information Gain
• Entropy
• Statistical parameters to represent data
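
The numerical parameters listed above can be computed with Python's standard library. The following is a minimal sketch, not course material: the sample values, the labels, and the helper names `entropy` and `information_gain` are made up for illustration.

```python
import math
import statistics
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the child groups."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

values = [2, 3, 3, 5, 7]                   # made-up sample
mean_value = statistics.mean(values)       # arithmetic mean
median_value = statistics.median(values)   # middle value of the sorted sample
mode_value = statistics.mode(values)       # most frequent value

labels = ["yes", "yes", "no", "no"]        # made-up class labels
split = [["yes", "yes"], ["no", "no"]]     # a split that separates the classes perfectly
gain = information_gain(labels, split)
print(mean_value, median_value, mode_value, gain)
```

A split that separates the two classes completely removes all uncertainty, so its information gain equals the entropy of the parent set.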

Module 2: Probability and its Uses


Topics:
• Uses of probability
• Need for probability
• Bayesian Inference
• Density Concepts
• Normal Distribution Curve
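
Bayesian inference rests on Bayes' theorem, which updates a prior probability once evidence is observed. The sketch below assumes a binary hypothesis. The function name `posterior` and all the numbers are illustrative assumptions.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(H | E) by Bayes' theorem for a binary hypothesis H and evidence E."""
    # Total probability of seeing the evidence, whether or not H is true.
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

# Made-up numbers: 1% prior, 90% chance of the evidence when H is true,
# 5% chance of the evidence when H is false.
p = posterior(prior=0.01, likelihood=0.90, false_positive_rate=0.05)
print(round(p, 3))  # 0.154
```

Even with fairly reliable evidence, a low prior keeps the posterior modest, which is the usual motivation for working through the calculation rather than guessing.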

Module 3: Statistical Inference


Topics:
• Point Estimation
• Confidence Margin
• Hypothesis Testing
• Levels of Hypothesis Testing

Module 4: Testing the Data


Learning Objectives:
At the end of this module, you should be able to:
• Understand Parametric and Non-parametric Testing
• Learn various types of parametric testing
• Discuss experimental designing
• Explain A/B testing

Topics:
• Parametric Test
• Parametric Test Types
• Non-Parametric Test
• Experimental Designing
• A/B testing

Module 5: Data Clustering


Topics:
• Association and Dependence
• Causation and Correlation
• Covariance
• Simpson’s Paradox
• Clustering Techniques

Module 6: Regression Modelling


Topics:
• Logistic and Regression Techniques
• Problem of Collinearity
• WOE and IV
• Residual Analysis
• Heteroscedasticity
• Homoscedasticity

2. R Statistics for Data Science Course

Module 1: Understanding the Data


Topics:

• Introduction to Data Types
• Numerical parameters to represent data
• Mean
• Mode
• Median
• Sensitivity
• Information Gain
• Entropy
• Statistical parameters to represent data

Module 2: Probability and its Uses


Topics
• Uses of probability
• Need for probability
• Bayesian Inference
• Density Concepts
• Normal Distribution Curve

Module 3: Statistical Inference


Topics
• Point Estimation
• Confidence Margin
• Hypothesis Testing
• Levels of Hypothesis Testing

Module 4: Testing the Data


Topics
• Parametric Test
• Parametric Test Types
• Non-Parametric Test
• A/B testing

Module 5: Data Clustering


Topics
• Association and Dependence
• Causation and Correlation
• Covariance
• Simpson’s Paradox
• Clustering Techniques

Module 6: Regression Modelling


Topics
• Logistic and Regression Techniques
• Problem of Collinearity
• WOE and IV
• Residual Analysis
• Heteroscedasticity
• Homoscedasticity

3. Data Science

Module 1: Introduction to Data Science


Topics
• What is Data Science?
• What does Data Science involve?
• Era of Data Science
• Business Intelligence vs Data Science
• Life cycle of Data Science
• Tools of Data Science
• Introduction to Big Data and Hadoop
• Introduction to R
• Introduction to Spark
• Introduction to Machine Learning

Module 2: Statistical Inference


Topics:
• What is Statistical Inference?
• Terminologies of Statistics
• Measures of Center
• Measures of Spread
• Probability
• Normal Distribution
• Binomial Distribution

Module 3: Data Extraction, Wrangling and Exploration


Topics
• Data Analysis Pipeline
• What is Data Extraction
• Types of Data
• Raw and Processed Data
• Data Wrangling
• Exploratory Data Analysis
• Visualization of Data

Module 4: Introduction to Machine Learning


Topics
• What is Machine Learning?
• Machine Learning Use-Cases
• Machine Learning Process Flow
• Machine Learning Categories
• Supervised Learning algorithms: Linear Regression and Logistic Regression

Module 5: Classification Techniques


Topics
• What is classification and what are its use cases?
• What is Decision Tree?
• Algorithm for Decision Tree Induction
• Creating a Perfect Decision Tree
• Confusion Matrix
• What is Random Forest?
• What is Naïve Bayes?
• Support Vector Machine: Classification
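
A confusion matrix tallies how a classifier's predictions line up with the true labels. The sketch below is a plain-Python illustration for a binary problem. The labels, the predictions, and the helper name `confusion_counts` are assumptions made for the example.

```python
from collections import Counter

def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["TP" if a == positive else "FP"] += 1
        else:
            counts["FN" if a == positive else "TN"] += 1
    return counts

actual = [1, 0, 1, 1, 0, 0]      # made-up ground truth
predicted = [1, 0, 0, 1, 1, 0]   # made-up classifier output
c = confusion_counts(actual, predicted)

precision = c["TP"] / (c["TP"] + c["FP"])  # share of predicted positives that are correct
recall = c["TP"] / (c["TP"] + c["FN"])     # share of actual positives found (sensitivity)
print(dict(c), precision, recall)
```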

Module 6: Unsupervised Learning


Topics
• What is Clustering & its use cases
• What is K-means Clustering?
• What is C-means Clustering?
• What is Canopy Clustering
• What is Hierarchical Clustering?

Module 7: Recommender Engines


Topics
• What are Association Rules & their Use Cases?
• What is Recommendation Engine & its Workings?
• Types of Recommendations
• User-Based Recommendation
• Item-Based Recommendation
• Difference: User-Based and Item-Based Recommendation
• Recommendation Use Cases

Module 8: Text Mining


Topics
• The concepts of text-mining
• Use cases
• Text Mining Algorithms
• Quantifying text
• TF-IDF
• Beyond TF-IDF
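
TF-IDF weights a term by how often it occurs in one document and how rare it is across the collection. The sketch below uses the unsmoothed textbook form, tf × log(N / df). Libraries often apply smoothed variants, and the two documents are made up.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores.append(
            {t: (tf[t] / len(tokens)) * math.log(n_docs / df[t]) for t in tf}
        )
    return scores

docs = ["spark processes big data", "pandas processes small data"]  # made-up corpus
scores = tf_idf(docs)
print(scores[0])
```

Terms shared by every document ("processes", "data") score zero here, while terms unique to one document keep a positive weight.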

Module 9: Time Series


Topics
• What is Time Series data?
• Time Series variables
• Different components of Time Series data
• Visualize the data to identify Time Series Components
• Implement ARIMA model for forecasting
• Exponential smoothing models
• Identifying the time series scenarios in which each Exponential Smoothing model can be applied
• Implement respective ETS model for forecasting

Module 10: Deep Learning


Topics
• Reinforcement Learning
• Reinforcement Learning Process Flow
• Reinforcement Learning Use Cases
• Deep Learning
• Biological Neural Networks
• Understand Artificial Neural Networks
• Building an Artificial Neural Network
• How ANN works
• Important Terminologies of ANNs

4. Python for Data Science


Module 1: Introduction to Python
Topics
• Overview of Python
• The Companies using Python
• Different Applications where Python is used
• Discuss Python Scripts on UNIX/Windows
• Values, Types, Variables
• Operands and Expressions
• Conditional Statements
• Loops
• Command Line Arguments
• Writing to the screen

Module 2: Sequences and File Operations


Topics
• Python files I/O Functions
• Numbers
• Strings and related operations
• Tuples and related operations
• Lists and related operations
• Dictionaries and related operations
• Sets and related operations

Module 3: Deep Dive – Functions, OOPs, Modules, Errors and Exceptions


Topics
• Functions
• Function Parameters
• Global Variables
• Variable Scope and Returning Values
• Lambda Functions
• Object-Oriented Concepts
• Standard Libraries
• The Import Statements
• Module Search Path
• Package Installation Ways
• Errors and Exception Handling
• Handling Multiple Exceptions

Module 4: Introduction to NumPy, Pandas and Matplotlib


Topics
• NumPy - arrays
• Operations on arrays
• Indexing, slicing and iterating
• Reading and writing arrays on files
• Pandas - data structures & index operations
• Reading and Writing data from Excel/CSV formats into Pandas
• matplotlib library
• Grids, axes, plots
• Markers, colors, fonts and styling
• Types of plots - bar graphs, pie charts, histograms
• Contour plots

Module 5: Data Manipulation


Topics
• Basic Functionalities of a data object
• Merging of Data objects
• Concatenation of data objects
• Types of Joins on data objects
• Exploring a Dataset
• Analyzing a dataset

Module 6: Introduction to Machine Learning with Python


Topics
• Python Revision (NumPy, Pandas, scikit learn, matplotlib)
• What is Machine Learning?
• Machine Learning Use-Cases

Module 7: Supervised Learning - I


Topics
• What is Classification and what are its use cases?
• What is Decision Tree?
• Algorithm for Decision Tree Induction
• Creating a Perfect Decision Tree
• Confusion Matrix
• What is Random Forest?

Module 8: Dimensionality Reduction


Topics
• Introduction to Dimensionality
• Why Dimensionality Reduction
• PCA
• Factor Analysis
• Scaling dimensional model
• LDA

Module 9: Supervised Learning - II


Topics
• What is Naïve Bayes?
• How does Naïve Bayes work?
• Implementing Naïve Bayes Classifier
• What is Support Vector Machine?
• How does a Support Vector Machine work?
• Hyperparameter Optimization
• Grid Search vs Random Search
• Implementation of Support Vector Machine for Classification

Module 10: Unsupervised Learning


Topics
• What is Clustering & its Use Cases?
• What is K-means Clustering?
• How does K-means algorithm work?
• How to do optimal clustering
• What is C-means Clustering?
• What is Hierarchical Clustering?
• How does Hierarchical Clustering work?
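
The K-means loop alternates between assigning points to the nearest centroid and moving each centroid to the mean of its cluster. The sketch below is a plain-Python version with a fixed iteration count. The points, the starting centroids, and the function name `kmeans` are illustrative assumptions, and a library such as scikit-learn would normally be used in practice.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm with squared Euclidean distance and fixed iterations."""
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)]  # made-up data
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```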

Module 11: Association Rules Mining and Recommendation Systems


Topics
• What are Association Rules?
• Association Rule Parameters
• Calculating Association Rule Parameters
• Recommendation Engines
• How do Recommendation Engines work?
• Collaborative Filtering
• Content-Based Filtering

Module 12: Reinforcement Learning


Topics
• What is Reinforcement Learning
• Why Reinforcement Learning
• Elements of Reinforcement Learning
• Exploration vs Exploitation dilemma
• Epsilon Greedy Algorithm
• Markov Decision Process (MDP)
• Q values and V values
• Q-Learning
• α values
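
Epsilon-greedy action selection and the Q-learning update can be sketched in a few lines. The two-state Q-table, the reward, and the values chosen for the learning rate α (`alpha`) and the discount factor (`gamma`) are made-up assumptions. The function names are hypothetical.

```python
import random

def epsilon_greedy(q_row, epsilon, rng):
    """Explore with probability epsilon, otherwise pick the highest-valued action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=lambda a: q_row[a])

def q_update(q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))."""
    target = reward + gamma * max(q[next_state])
    q[state][action] += alpha * (target - q[state][action])

q = [[0.0, 0.0], [0.0, 1.0]]   # Q-table: 2 states x 2 actions, made-up values
q_update(q, state=0, action=1, reward=1.0, next_state=1)
action = epsilon_greedy(q[1], epsilon=0.0, rng=random.Random(0))
print(q[0][1], action)
```

With `epsilon=0.0` the policy is purely greedy. Raising it trades exploitation for exploration.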

Module 13: Time Series Analysis


Topics
• What is Time Series Analysis?
• Importance of TSA
• Components of TSA
• White Noise
• AR model
• MA model
• ARMA model
• ARIMA model
• Stationarity
• ACF & PACF

Module 14: Model Selection and Boosting


Topics
• What is Model Selection?
• The need for Model Selection
• Cross-Validation
• What is Boosting?
• How do Boosting Algorithms work?
• Types of Boosting Algorithms
• Adaptive Boosting
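
Cross-validation partitions the samples into k folds and uses each fold once as the test set. The sketch below only generates the index splits, with no shuffling. The helper name `k_fold_indices` is hypothetical, and libraries such as scikit-learn provide fuller implementations.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k contiguous, near-equal folds."""
    # The first n_samples % k folds get one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size

folds = list(k_fold_indices(n_samples=5, k=2))
for train, test in folds:
    print("train:", train, "test:", test)
```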

5. Apache Spark and Scala


Module 1: Introduction to Big Data Hadoop and Spark
Topics
• What is Big Data?
• Big Data Customer Scenarios
• Limitations and Solutions of Existing Data Analytics Architecture with Uber Use Case
• How Hadoop Solves the Big Data Problem?
• What is Hadoop?
• Hadoop’s Key Characteristics
• Hadoop Ecosystem and HDFS
• Hadoop Core Components
• Rack Awareness and Block Replication
• YARN and its Advantage
• Hadoop Cluster and its Architecture
• Hadoop: Different Cluster Modes
• Big Data Analytics with Batch & Real-time Processing
• Why is Spark needed?
• What is Spark?
• How does Spark differ from other frameworks?
• Spark at Yahoo!

Module 2: Introduction to Scala and Apache Spark


Topics
• What is Scala?
• Scala in other Frameworks
• Basic Scala Operations
• Control Structures in Scala
• Collections in Scala: Array
• Why Scala for Spark?
• Introduction to Scala REPL
• Variable Types in Scala
• Foreach loop, Functions and Procedures
• ArrayBuffer, Map, Tuples, Lists, and more

Module 3: Functional Programming and OOPs Concepts in Scala


Topics
• Functional Programming
• Anonymous Functions
• Getters and Setters
• Properties with only Getters
• Singletons
• Overriding Methods
• Higher Order Functions
• Class in Scala
• Custom Getters and Setters
• Auxiliary Constructor and Primary Constructor
• Extending a Class
• Traits as Interfaces and Layered Traits

Module 4: Deep Dive into Apache Spark Framework


Topics
• Spark’s Place in Hadoop Ecosystem
• Spark Components & its Architecture
• Spark Deployment Modes
• Introduction to Spark Shell
• Writing your first Spark Job Using SBT
• Submitting Spark Job
• Spark Web UI
• Data Ingestion using Sqoop

Module 5: Playing with Spark RDDs


Topics
• Challenges in Existing Computing Methods
• Probable Solution & How RDD Solves the Problem
• What is RDD, Its Functions, Transformations & Actions?
• Data Loading and Saving Through RDDs
• Key-Value Pair RDDs
• Other Pair RDDs
• RDD Lineage
• RDD Persistence
• WordCount Program Using RDD Concepts
• RDD Partitioning & How It Helps Achieve Parallelization
• Passing Functions to Spark
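
The WordCount program is usually written as a chain of RDD transformations. The sketch below is a plain-Python stand-in that runs without Spark. The comments name the RDD operation (`flatMap`, `map`, `reduceByKey`) that each step imitates, and the input lines are made up.

```python
from collections import defaultdict
from itertools import chain

lines = ["to be or not to be", "to do is to be"]  # made-up input

# Imitates flatMap: split every line and flatten into one stream of words.
words = chain.from_iterable(line.split() for line in lines)

# Imitates map: turn each word into a (word, 1) key-value pair.
pairs = ((word, 1) for word in words)

# Imitates reduceByKey: sum the values that share the same key.
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

print(dict(counts))
```

In Spark the same steps would be distributed across partitions. Here everything runs in a single process, so the example only shows the shape of the computation.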

Module 6: DataFrames and Spark SQL


Topics
• Need for Spark SQL
• What is Spark SQL?
• Spark SQL Architecture
• SQL Context in Spark SQL
• User Defined Functions
• Data Frames & Datasets
• Interoperating with RDDs
• JSON and Parquet File Formats
• Loading Data through Different Sources
• Spark – Hive Integration

Module 7: Machine Learning using Spark MLlib


Topics
• Why Machine Learning?
• What is Machine Learning?
• Where is Machine Learning Used?
• Face Detection: USE CASE
• Different Types of Machine Learning Techniques
• Introduction to MLlib
• Features of MLlib and MLlib Tools
• Various ML algorithms supported by MLlib

Module 8: Deep Dive into Spark MLlib


Topics
• Supervised Learning - Linear Regression, Logistic Regression, Decision Tree, Random Forest
• Unsupervised Learning - K-Means Clustering & How It Works with MLlib
• Analysis on US Election Data using MLlib (K-Means)

Module 9: Understanding Apache Kafka & Apache Flume


Topics
• Need for Kafka
• Core Concepts of Kafka
• Where is Kafka Used?
• What is Kafka?
• Kafka Architecture
• Understanding the Components of Kafka Cluster
• Configuring Kafka Cluster
• Need for Apache Flume
• What is Apache Flume?
• Flume Sources
• Flume Channels
• Integrating Apache Flume and Apache Kafka
• Basic Flume Architecture
• Flume Sinks
• Flume Configuration

Module 10: Apache Spark Streaming - Processing Multiple Batches


Topics
• Drawbacks in Existing Computing Methods
• Why is Streaming Necessary?
• What is Spark Streaming?
• Spark Streaming Features
• Spark Streaming Workflow
• How Uber Uses Streaming Data
• Streaming Context & DStreams
• Transformations on DStreams
• Windowed Operators and Why They Are Useful
• Important Windowed Operators
• Slice, Window and ReduceByWindow Operators
• Stateful Operators
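
A windowed operator aggregates the most recent batches of a stream and emits a result at a fixed slide interval. The sketch below imitates that behaviour in plain Python, measuring the window length and slide interval in batches rather than in time. The function name `windowed_sums` and the batch contents are illustrative assumptions and are not the Spark Streaming API.

```python
from collections import deque

def windowed_sums(batches, window_length, slide_interval):
    """Sum the last `window_length` batches every `slide_interval` batches."""
    window = deque(maxlen=window_length)  # old batches fall out automatically
    results = []
    for i, batch in enumerate(batches, start=1):
        window.append(batch)
        if i % slide_interval == 0:
            results.append(sum(sum(b) for b in window))
    return results

batches = [[1, 2], [3], [4, 5], [6]]  # made-up micro-batches
result = windowed_sums(batches, window_length=3, slide_interval=2)
print(result)  # [6, 18]
```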

Module 11: Apache Spark Streaming - Data Sources


Topics
• Apache Spark Streaming: Data Sources
• Streaming Data Source Overview
• Apache Flume and Apache Kafka Data Sources
• Example: Using a Kafka Direct Data Source
• Perform Twitter Sentiment Analysis Using Spark Streaming

Module 12: In Class Project


Learning Objectives
Work on an end-to-end Financial domain project covering all the major concepts of Spark taught during the
course.

Module 13: Spark GraphX (Self-Paced)


6. Deep Learning with TensorFlow 2.0

Module 1: Introduction to Deep Learning


Topics
• What is Deep Learning?
• Curse of Dimensionality
• Machine Learning vs. Deep Learning
• Use cases of Deep Learning
• Human Brain vs. Neural Network
• What is Perceptron?
• Learning Rate
• Epoch
• Batch Size
• Activation Function
• Single Layer Perceptron

Module 2: Getting Started with TensorFlow 2.0


Topics
• Introduction to TensorFlow 2.x
• Installing TensorFlow 2.x
• Defining Sequential model layers
• Activation Function
• Layer Types
• Model Compilation
• Model Optimizer
• Model Loss Function
• Model Training
• Digit Classification using Simple Neural Network in TensorFlow 2.x
• Improving the model
• Adding Hidden Layer
• Adding Dropout
• Using Adam Optimizer

Module 3: Convolutional Neural Network


Topics
• Image Classification Example
• What is Convolution
• Convolutional Layer Network
• Convolutional Layer
• Filtering
• ReLU Layer
• Pooling
• Data Flattening
• Fully Connected Layer
• Predicting a cat or a dog
• Saving and Loading a Model
• Face Detection using OpenCV

Module 4: Regional CNN


Topics
• Regional-CNN
• Selective Search Algorithm
• Bounding Box Regression
• SVM in RCNN
• Pre-trained Model
• Model Accuracy
• Model Inference Time
• Model Size Comparison
• Transfer Learning
• Object Detection – Evaluation
• mAP
• IoU
• RCNN – Speed Bottleneck
• Fast R-CNN
• RoI Pooling
• Fast R-CNN – Speed Bottleneck
• Faster R-CNN
• Feature Pyramid Network (FPN)
• Region Proposal Network (RPN)
• Mask R-CNN
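
IoU (Intersection over Union) measures how well a predicted bounding box overlaps a ground-truth box and underlies the mAP evaluation metric. The sketch below assumes boxes in `(x1, y1, x2, y2)` corner format. The function name `iou` and the coordinates are made up.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned (x1, y1, x2, y2) boxes."""
    # Corners of the overlapping rectangle, if any.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))   # partial overlap: 1 / 7
disjoint = iou((0, 0, 1, 1), (2, 2, 3, 3))  # no overlap: 0.0
print(overlap, disjoint)
```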

Module 5: Boltzmann Machine & Autoencoder


Topics
• What is Boltzmann Machine (BM)?
• Identify the issues with BM
• Why did RBM come into picture?
• Step by step implementation of RBM
• Distribution of Boltzmann Machine
• Understanding Autoencoders
• Architecture of Autoencoders
• Brief on types of Autoencoders
• Applications of Autoencoders

Module 6: Generative Adversarial Network (GAN)



Module 7: Emotion and Gender Detection



Module 8: Introduction to RNN and GRU



Module 9: LSTM

Module 10: Auto Image Captioning Using CNN LSTM


Topics
• Auto Image Captioning
• COCO dataset
• Pre-trained model
• Inception V3 model
• Architecture of Inception V3
• Modify last layer of pre-trained model
• Freeze model
• CNN for image processing
• LSTM for text processing

7. Tableau Training

Module 1: Data Preparation using Tableau Prep


Topics:
• Data Visualization
• Business Intelligence tools
• Introduction to Tableau
• Tableau Architecture
• Tableau Server Architecture
• VizQL
• Introduction to Tableau Prep
• Tableau Prep Builder User Interface
• Data Preparation techniques using Tableau Prep Builder tool

Module 2: Data Connection with Tableau Desktop


Topics:
• Features of Tableau Desktop
• Connect to data from File and Database
• Types of Connections
• Joins and Unions
• Data Blending
• Tableau Desktop User Interface
• Basic project: Create a workbook and publish it on Tableau Online

Module 3: Basic Visual Analytics


Topics:
• Visual Analytics
• Basic Charts: Bar Chart, Line Chart, and Pie Chart
• Hierarchies
• Data Granularity
• Highlighting
• Sorting
• Filtering
• Grouping
• Sets

Module 4: Calculations in Tableau


Topics:
• Types of Calculations
• Built-in Functions (Number, String, Date, Logical and Aggregate)
• Operators and Syntax Conventions
• Table Calculations
• Level Of Detail (LOD) Calculations
• Using R within Tableau for Calculations

Module 5: Advanced Visual Analytics


Topics:
• Parameters
• Tooltips
• Trend lines
• Reference lines
• Forecasting
• Clustering

Module 6: Level of Detail (LOD) Expressions in Tableau


Topics:
• Use Case I - Count Customer by Order
• Use Case II - Profit per Business Day
• Use Case III - Comparative Sales
• Use Case IV - Profit Vs Target
• Use Case V - Finding the second order date
• Use Case VI - Cohort Analysis

Module 7: Geographic Visualizations in Tableau


Topics:
• Introduction to Geographic Visualizations
• Manually assigning Geographical Locations
• Types of Maps
• Spatial Files
• Custom Geocoding
• Polygon Maps
• Web Map Services
• Background Images

Module 8: Advanced Charts in Tableau


Topics:
• Box-and-Whisker Plot
• Bullet Chart
• Bar in Bar Chart
• Gantt Chart
• Waterfall Chart
• Pareto Chart
• Control Chart
• Funnel Chart
• Bump Chart
• Step and Jump Lines
• Word Cloud
• Donut Chart

Module 9: Dashboards and Stories


Topics:
• Introduction to Dashboards
• The Dashboard Interface
• Dashboard Objects
• Building a Dashboard
• Dashboard Layouts and Formatting
• Interactive Dashboards with actions
• Designing Dashboards for devices
• Story Points

Module 10: Get Industry Ready


Topics:
• Tableau Tips and Tricks
• Choosing the right type of Chart
• Format Style
• Data Visualization best practices
• Prepare for Tableau Interview

Module 11: Exploring Tableau Online


Topics:
• Publishing Workbooks to Tableau Online
• Interacting with Content on Tableau Online
• Data Management through Tableau Catalog
• AI-Powered features in Tableau Online (Ask Data and Explain Data)
• Understand Scheduling
• Managing Permissions on Tableau Online
• Data Security with Filters in Tableau Online
