
AWS Data Engineering Project Roadmap

1. Core AWS Services (Must-Know for Data Engineers)


• Storage: Amazon S3 → Data Lake storage (raw, processed, curated zones).
• Compute: AWS EC2 (basic compute), AWS Lambda (serverless functions).
• Databases & Warehousing: Amazon RDS (Postgres/MySQL), Amazon Redshift (data warehouse).
• ETL / Data Integration: AWS Glue (ETL with Python/Spark), Glue Data Catalog.
• Querying: Amazon Athena (SQL on S3).
• Streaming: Amazon Kinesis or Amazon MSK (managed Kafka).
• Workflow Orchestration: AWS Step Functions or Apache Airflow (MWAA).
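To see how two of these services meet in practice, here is a minimal sketch of an AWS Lambda handler that reacts to a new object landing in S3. The bucket and key are taken from the standard S3 event shape; everything else (what you do with the object) is an assumption left as a comment.

```python
import json
import urllib.parse

def extract_s3_object(event: dict) -> tuple[str, str]:
    """Pull (bucket, key) out of a standard S3 put-event record."""
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    # S3 event keys are URL-encoded (spaces become '+', '/' becomes '%2F')
    key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
    return bucket, key

def lambda_handler(event, context):
    bucket, key = extract_s3_object(event)
    # In a real pipeline you would trigger a Glue job or catalog update here;
    # this sketch only reports which object arrived.
    print(f"new object: s3://{bucket}/{key}")
    return {"statusCode": 200, "body": json.dumps({"bucket": bucket, "key": key})}
```

The same handler shape works for any S3-triggered step in the pipelines below.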

2. Data Engineering Concepts You’ll Apply


• Data Lake Zones in S3: raw, staging, curated.
• ETL/ELT with Glue + Python/Spark.
• Partitioning and Bucketing for big data efficiency.
• Schema evolution and data cataloging.
• Batch vs. Streaming pipelines.
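Partitioning in a data lake usually means Hive-style key prefixes in S3, which Glue and Athena use to prune scans. A small sketch of building such a key (the zone and table names "curated"/"sales" are illustrative, not from the original):

```python
from datetime import date

def partition_key(zone: str, table: str, d: date, filename: str) -> str:
    """Build a Hive-style S3 key, e.g.
    curated/sales/year=2024/month=01/day=15/part-0.parquet"""
    return (f"{zone}/{table}/"
            f"year={d.year}/month={d.month:02d}/day={d.day:02d}/{filename}")
```

A query filtered on `year` and `month` then only reads the matching prefixes instead of the whole table.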

3. Step-by-Step Project Example (Batch Pipeline)


Project: E-commerce Sales Data Pipeline on AWS
• Ingest Data → Dump raw CSV/JSON files into S3.
• Catalog Data → Use Glue Crawler to create metadata tables.
• Transform Data (ETL) → Use Glue ETL job (Python/Spark) to clean and join data.
• Query Data → Use Athena (SQL) to query processed data in S3.
• Load to Warehouse → Move curated data to Redshift.
• Orchestrate Workflow → Automate using Step Functions or Airflow.
• Visualization → Connect Redshift/Athena to QuickSight or Power BI.
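The "Query Data" step above can be sketched with Athena via boto3. The database/table names ("ecommerce", "curated_sales") and the output bucket are assumptions for illustration; the SQL builder is separated out so the partition-pruning predicate is visible:

```python
import textwrap

def monthly_revenue_sql(database: str, table: str, year: int, month: int) -> str:
    """Build the Athena query; the year/month predicates prune S3 partitions."""
    return textwrap.dedent(f"""\
        SELECT category, SUM(amount) AS revenue
        FROM {database}.{table}
        WHERE year = {year} AND month = {month}
        GROUP BY category
        ORDER BY revenue DESC""")

def run_query(sql: str, output_s3: str):
    import boto3  # requires AWS credentials; shown for context, not executed here
    athena = boto3.client("athena")
    return athena.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": output_s3},
    )
```

Usage would be `run_query(monthly_revenue_sql("ecommerce", "curated_sales", 2024, 1), "s3://my-athena-results/")`, then polling `get_query_execution` until the query completes.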

4. Intermediate Project (Streaming Pipeline)


• Stream clickstream/order events into Kinesis.
• Use Kinesis Data Firehose to land data in S3.
• Transform with Lambda or Glue Streaming Job.
• Query in near real time with Athena.
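The ingest side of this streaming pipeline can be sketched as a small Kinesis producer. The stream name "clickstream" is an assumption; the record-shaping logic is split out so the partition-key choice is explicit:

```python
import json

def make_record(event: dict) -> dict:
    """Shape one clickstream event into the kwargs kinesis.put_record expects."""
    return {
        "StreamName": "clickstream",
        "Data": json.dumps(event).encode("utf-8"),
        # Keying by user spreads traffic across shards while keeping
        # a single user's events ordered within one shard.
        "PartitionKey": str(event["user_id"]),
    }

def send(event: dict):
    import boto3  # requires AWS credentials; shown for context, not executed here
    kinesis = boto3.client("kinesis")
    return kinesis.put_record(**make_record(event))
```

Firehose would then batch these records into S3, where Athena queries them as they land.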

5. What to Learn Next (Priority Order)


• Amazon S3 (data lake basics).
• AWS Glue (ETL with PySpark).
• Amazon Athena (serverless querying).
• Amazon Redshift (warehousing).
• Orchestration (Step Functions or Airflow).
• Streaming (Kinesis).
