
Step by Step Guide For Data Engineering

This document provides a step-by-step guide for data engineering that includes 15 steps. It covers topics like programming languages (Python, Scala, Java), data structures and algorithms, database fundamentals, SQL scripting, big data frameworks (Hadoop, Spark), data processing, data warehousing, data exploration libraries (Pandas, NumPy, Matplotlib), data orchestration with Airflow, NoSQL databases, message queues and streaming services, dashboarding tools, and cloud services (AWS). The guide recommends allocating time periods ranging from 1 week to 3 months for learning the various topics through online practice exercises and hands-on projects.

Uploaded by

Shubham Jagdale

Step by Step Guide for Data Engineering

01. Programming Language :


a. Python
i. Basic Syntax
ii. Variables
iii. Data Types
iv. Operators
v. List
vi. Tuples
vii. Sets
viii. Dictionaries
ix. Conditional Statements (If..Else)
x. Loops
xi. Try...Except
xii. Reading Files (CSV, JSON, Text, Excel)
xiii. Writing Files
xiv. Functions
xv. Working with Dates
b. Scala
c. Java
Practice: solve 10-15 easy problems on HackerRank or LeetCode
Time for learning - 2 Weeks
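
As a rough illustration of how several of the Python topics above fit together (functions, dictionaries, loops, Try...Except, reading CSV, working with dates), here is a minimal sketch. The sample data, the column names, and the `parse_orders` helper are all hypothetical and are not part of the original guide.

```python
import csv
import io
from datetime import date, datetime

# Hypothetical sample data; a real script would open a file instead.
SAMPLE_CSV = """order_id,order_date,amount
1,2023-01-15,120.50
2,2023-02-03,not_a_number
3,2023-02-20,80.00
"""


def parse_orders(text):
    """Parse CSV text into typed rows, collecting rows that fail to convert."""
    orders, bad_rows = [], []
    for row in csv.DictReader(io.StringIO(text)):
        try:
            orders.append({
                "order_id": int(row["order_id"]),
                "order_date": datetime.strptime(row["order_date"], "%Y-%m-%d").date(),
                "amount": float(row["amount"]),
            })
        except ValueError:
            # A malformed number or date lands here instead of stopping the loop.
            bad_rows.append(row)
    return orders, bad_rows


orders, bad_rows = parse_orders(SAMPLE_CSV)
total = sum(order["amount"] for order in orders)
```

Keeping rejected rows in a separate list, rather than silently dropping them, is one common choice because it lets a pipeline report or reprocess bad input later.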

02. Data Structures & Algorithms (Basic):


a. Time Complexity and Space Complexity (Big O
notation)
b. Arrays
c. Linked List
d. Stack
e. Queue
f. Tree
g. Graph
h. Searching
i. Linear Search
ii. Binary Search
iii. Interpolation Search
i. Sorting
i. Selection Sort
ii. Insertion Sort
iii. Merge Sort
iv. Quick Sort
v. Heap Sort
Practice: solve 10-12 easy problems on GeeksforGeeks
Time for learning - 1-2 Months (Depending on previous
experience)
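
To make the searching topics above concrete, the following is a minimal sketch of iterative binary search on a sorted list. The function name and return convention (index, or -1 when absent) are illustrative choices, not something the guide prescribes. It runs in O(log n) time and O(1) extra space, compared with O(n) time for linear search.

```python
def binary_search(items, target):
    """Return the index of target in the sorted list items, or -1 if absent."""
    lo, hi = 0, len(items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if items[mid] == target:
            return mid
        if items[mid] < target:
            lo = mid + 1   # target can only be in the right half
        else:
            hi = mid - 1   # target can only be in the left half
    return -1
```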
03. Database Fundamentals :
a. DDL (CREATE, DROP, ALTER, TRUNCATE,
RENAME)
b. DCL (GRANT and REVOKE)
c. DML (INSERT, UPDATE, DELETE)
d. TCL (COMMIT, ROLLBACK)
e. Aggregation (MAX, MIN, FIRST, AVG, COUNT, SUM)
f. Integrity Constraints (Primary Key, Foreign
Key)
g. Data Schema
h. ACID Properties
i. Views
j. Stored Procedures
k. ER and Relational Diagrams
l. Indexing
m. Hashing
n. Normalization forms

04. SQL Scripting :


a. Transactional Databases : MySQL,
PostgreSQL
b. Joins (Left, Inner, Outer, Full, Right)
c. Sub Queries
d. UNION Statement
e. Date Function
f. Nested Queries
g. Group By
h. Having
i. CASE Statements
j. Window Functions
Practice: solve 10-15 easy problems on HackerRank or LeetCode
Time for learning - 3-4 Weeks (Sections 3 and 4)
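
A small way to practise the topics in sections 3 and 4 without installing a database server is Python's built-in `sqlite3` module. The sketch below assumes SQLite as a stand-in for MySQL or PostgreSQL (dialects differ in places), and the `customers` and `orders` tables are hypothetical. It touches DDL, DML, primary and foreign keys, a LEFT JOIN, GROUP BY, HAVING, aggregation, and a CASE expression.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL and DML: two related tables plus a few rows of made-up data.
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id),
    amount REAL NOT NULL
);
INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ben'), (3, 'Chen');
INSERT INTO orders VALUES (1, 1, 50.0), (2, 1, 70.0), (3, 2, 20.0);
""")

# LEFT JOIN keeps customers with no orders; CASE labels each total.
summary = conn.execute("""
SELECT c.name,
       COUNT(o.id) AS order_count,
       COALESCE(SUM(o.amount), 0) AS total,
       CASE WHEN COALESCE(SUM(o.amount), 0) >= 100 THEN 'high' ELSE 'low' END AS tier
FROM customers AS c
LEFT JOIN orders AS o ON o.customer_id = c.id
GROUP BY c.id, c.name
ORDER BY c.name
""").fetchall()

# HAVING filters on the aggregate after grouping.
big_spenders = conn.execute("""
SELECT customer_id, SUM(amount) AS total
FROM orders
GROUP BY customer_id
HAVING SUM(amount) > 100
""").fetchall()
```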

05. BigData Fundamentals :


a. BigData Basics and Characteristics
b. 5 V’s of BigData
c. Vertical vs Horizontal Scaling
d. Scaling Up and Scaling Out
e. ETL Pipelines
f. File formats
i. CSV
ii. JSON
iii. AVRO
iv. Parquet
v. ORC
g. Type of Data
i. Structured
ii. Unstructured
iii. Semi-structured
Time for learning - 1 Week (Only Theory)
06. Cluster Computing
a. Hadoop Ecosystem
i. HDFS
ii. MapReduce
iii. YARN
b. Apache Hive
i. How to load data in different file formats
ii. Internal Tables
iii. External Tables
iv. Querying table data stored in HDFS
v. Partitioning
vi. Bucketing
vii. Map-Side Join
viii. Sorted-Merge Join
ix. UDF in Hive
x. SerDe in Hive
07. Apache Spark
a. Spark Core
b. Spark SQL
c. Spark Streaming
d. Difference Between Hadoop and Spark
Time for learning - 3-4 Weeks (Hands-on and theory)

08. Data Processing


a. Batch Processing
b. Real-Time Processing
c. Hybrid Processing
Time for learning - 1-2 Weeks (Understand basic concept)

09. Data Warehousing Fundamentals:


a. OLAP vs OLTP
b. Dimension Tables
c. Data Cube
d. Extract Transform Load (ETL)
e. E-R Modeling vs Dimensional Modeling
f. Fact Tables
g. Star Schema
h. Snowflake Schema
i. Warehouse Designing Questions
Time for learning - 1-2 Weeks (Theory)
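
To ground the fact table, dimension table, and star schema topics above, here is a minimal sketch of a star schema built with Python's built-in `sqlite3` module. The table names, columns, and rows are hypothetical, and SQLite is only a convenient stand-in for a real warehouse. One fact table holds the measures and references two dimension tables, and the query rolls revenue up by category and month.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two dimension tables describe the "who/what/when"; the fact table holds measures.
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    date_key INTEGER REFERENCES dim_date(date_key),
    quantity INTEGER,
    revenue REAL
);
INSERT INTO dim_product VALUES (1, 'Laptop', 'Electronics'), (2, 'Desk', 'Furniture');
INSERT INTO dim_date VALUES (20230101, 2023, 1), (20230201, 2023, 2);
INSERT INTO fact_sales VALUES
    (1, 20230101, 2, 2000.0),
    (2, 20230101, 1, 300.0),
    (1, 20230201, 1, 1000.0);
""")

# A typical OLAP-style question: revenue by product category and month.
report = conn.execute("""
SELECT p.category, d.month, SUM(f.revenue) AS revenue
FROM fact_sales AS f
JOIN dim_product AS p ON p.product_key = f.product_key
JOIN dim_date AS d ON d.date_key = f.date_key
GROUP BY p.category, d.month
ORDER BY p.category, d.month
""").fetchall()
```

In a snowflake schema, `category` would typically move out of `dim_product` into its own table, trading simpler joins for less redundancy.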
10. Data Exploration Libraries:
a. Pandas
i. Reading and writing CSV & JSON
ii. DataFrames and Series
iii. Head, tail
iv. Info()
v. Dropping columns
vi. Sorting
vii. Apply
viii. Filter
ix. Loc and iloc
x. Shape, Index, Columns
xi. Lambda
xii. Basic Arithmetic Functions
xiii. Join and Merge
b. NumPy
i. Creating Arrays
ii. Indexing and Slicing
iii. Copy vs View
iv. Shape
v. Reshape
vi. Split
vii. Join
viii. Sort, Search, Filter, Split
c. Matplotlib
i. Pyplot
ii. Plotting
iii. Lines
iv. Legends
v. Labels
vi. Grid
vii. Scatter
viii. Bars
ix. Histogram
x. Pie Charts
xi. Seaborn
Time for learning - 1-2 Weeks (Theory and Hands-on)
11. Data Orchestration (Airflow) :
a. Intro to Airflow
b. Implementing Airflow DAGs
c. Maintaining and monitoring Airflow workflows
d. Building production pipelines in Airflow
Time for learning - 1-2 Weeks (Theory and Hands-on)
12. NoSQL:
a. Difference between NoSQL vs SQL
b. Features of NoSQL
c. Types of NoSQL database
d. CAP Theorem
e. Eventual Consistency
f. Tools -
i. HBase
ii. Cassandra
iii. AWS DynamoDB
iv. MongoDB

Time for learning - 2-3 Weeks (Theory and Hands-on)


Learn MongoDB or Cassandra
13. Message Queue or Streaming Services :
a. Apache Kafka
b. Apache Beam
c. AWS Kinesis
Time for learning - 2-3 Weeks (Theory and Hands-on)
Pick one and learn
14. Dashboarding Tools :
a. Tableau
b. QuickSight
c. Data Studio
d. Looker
Time for learning - 2 Weeks (Theory and Hands-on)
Build some dashboards (will tell you about projects in
future videos)
15. Cloud Services (AWS) :
a. On-Demand Machines
i. AWS EC2
b. Access Management
i. AWS IAM
c. Object Storage
i. AWS S3
d. Transactional Database Services
i. AWS RDS
1. MySQL
2. Aurora
3. PostgreSQL
e. Ad Hoc Query
i. AWS Athena
f. Data Warehouse
i. AWS Redshift
g. NoSQL Database Services
i. AWS DynamoDB
h. Serverless
i. AWS Lambda
i. ETL Services
i. AWS Glue
j. For Storing and Accessing Credentials
i. AWS Secrets Manager
k. Log Services
i. AWS CloudWatch
ii. AWS Config
l. Distributed Data Computation
i. AWS EMR
m. Messaging Queue
i. AWS SNS
ii. AWS SQS
n. Real Time Data Processing
i. AWS Kinesis
ii. AWS Firehose
iii. AWS Analytics
o. Networking (Advanced Level)
i. VPC
ii. Subnets
iii. NACL
iv. Security Groups
v. VPC Peering
vi. VPN
p. Security
i. KMS
ii. WAF

Time for learning - 2-3 Months (Theory and Hands-on)


Learning fundamentals, doing hands-on practice with
projects
