Data Engineering Virtual Internship by AWS
NAME : ROHIT KUMAR
ADD NO. : 22SCSE1011273
Data Engineering by AWS: Virtual Internship
Welcome to my virtual internship journey! Explore the world of data
engineering powered by AWS, and learn about the key concepts, services,
and applications that make it possible.
Introduction: What is Data Engineering?
Data Transformation
Data engineering involves extracting, cleaning, transforming, and loading data from various sources to prepare it for analysis and use.
Data Pipelines
Data engineers build robust and scalable data pipelines to ensure smooth and efficient data flow for various applications.
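As a rough illustration of the extract, clean, transform, and load steps described above, here is a minimal sketch in plain Python. The sample CSV, the column names, and the in-memory `warehouse` list are all hypothetical stand-ins; a real pipeline would read from and write to actual data stores.

```python
import csv
import io

# Hypothetical raw input: a small CSV with stray whitespace and a missing amount.
RAW_CSV = """order_id,customer,amount
1, alice ,10.50
2,BOB,
3,carol,7.25
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Clean: strip whitespace and drop rows that have no amount."""
    cleaned = []
    for row in rows:
        row = {key: value.strip() for key, value in row.items()}
        if row["amount"]:
            cleaned.append(row)
    return cleaned


def transform(rows):
    """Transform: normalize customer names and convert amounts to integer cents."""
    return [
        {
            "order_id": int(row["order_id"]),
            "customer": row["customer"].title(),
            "amount_cents": round(float(row["amount"]) * 100),
        }
        for row in rows
    ]


def load(rows, target):
    """Load: append rows to an in-memory list standing in for a real data store."""
    target.extend(rows)
    return len(rows)


warehouse = []
loaded = load(transform(clean(extract(RAW_CSV))), warehouse)
print(loaded, warehouse)
```

Keeping each stage as a separate function makes it possible to test and replace stages independently, which is the same idea a larger pipeline applies at scale.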
The AWS Cloud: Powering Data Engineering
Scalability and Flexibility
AWS offers a wide range of scalable services that allow you to handle massive data volumes and complex workloads.
Cost-Effective Solutions
AWS provides a pay-as-you-go model, ensuring cost-efficient data processing and storage without upfront investments.
Key Data Engineering Concepts and Principles
Data Modeling
Understanding the structure and relationships of data is crucial for effective data management and analysis.
ETL Processes
Extract, Transform, Load (ETL) processes are fundamental to data engineering, ensuring clean and consistent data for analysis.
Data Warehousing
Centralized storage of data for analytical purposes, enabling
comprehensive insights and business intelligence.
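To make data modeling and warehousing more concrete, the sketch below builds a tiny star schema (one fact table, one dimension table) and runs an analytical query against it. It uses Python's built-in SQLite purely as a stand-in for a real warehouse; the table names, columns, and sample rows are assumptions made for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for a data warehouse (assumption).
conn = sqlite3.connect(":memory:")

# One dimension table (descriptive attributes) and one fact table (measurements).
conn.executescript("""
CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY,
    category   TEXT NOT NULL
);
CREATE TABLE fact_sales (
    sale_id    INTEGER PRIMARY KEY,
    product_id INTEGER NOT NULL REFERENCES dim_product(product_id),
    quantity   INTEGER NOT NULL,
    revenue    REAL NOT NULL
);
""")

# Hypothetical sample data.
conn.executemany(
    "INSERT INTO dim_product VALUES (?, ?)",
    [(1, "books"), (2, "games")],
)
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
    [(1, 1, 2, 20.0), (2, 1, 1, 10.0), (3, 2, 1, 50.0)],
)

# Analytical query: join facts to the dimension and aggregate by category.
revenue_by_category = conn.execute("""
SELECT p.category, SUM(s.revenue)
FROM fact_sales AS s
JOIN dim_product AS p ON p.product_id = s.product_id
GROUP BY p.category
ORDER BY p.category
""").fetchall()
print(revenue_by_category)
```

Separating facts from dimensions keeps the measurements compact and lets the same dimension be reused across many analytical queries.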
Building Data Pipelines with AWS Services
Data Ingestion
AWS services like Kinesis and SQS enable real-time data ingestion and event streaming.
Data Transformation
AWS Glue provides a serverless ETL service for transforming data into a usable format.
Data Storage
Amazon S3 offers scalable and durable object storage for raw and processed data.
Data Analytics
Services like Redshift and Athena allow for fast and efficient data analysis and reporting.
Exploring AWS Glue and Amazon Athena
AWS Glue
A serverless ETL service for transforming, cleansing, and enriching data, simplifying complex data processing.
Amazon Athena
A serverless query service that enables interactive analysis of data stored in S3 using SQL, eliminating the need to manage complex infrastructure.
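The sketch below shows one way an SQL query over data in S3 might be submitted to Athena from Python. The database name, table name, and results bucket are hypothetical. `run_query` assumes the `boto3` SDK is installed and AWS credentials are configured, so it is defined but not called here; only the request-building step runs.

```python
def build_athena_request(sql, database, output_location):
    """Assemble the keyword arguments for Athena's StartQueryExecution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(request):
    """Submit the query to Athena and return its execution ID.

    Assumes boto3 is installed and credentials are configured; not called below.
    """
    import boto3  # imported lazily so the rest of the sketch runs without it

    client = boto3.client("athena")
    response = client.start_query_execution(**request)
    return response["QueryExecutionId"]


# Hypothetical database, table, and results bucket, for illustration only.
request = build_athena_request(
    sql="SELECT status, COUNT(*) AS hits FROM web_logs GROUP BY status",
    database="internship_db",
    output_location="s3://example-athena-results/",
)
print(request)
```

Athena runs queries asynchronously, so a caller would normally poll for completion using the returned execution ID before reading results from the output location.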
Leveraging Amazon S3 for Data Storage
Object Storage
S3 offers secure and scalable object storage for a wide range of
data, from raw logs to processed files.
Data Durability
S3 is designed for high durability and availability: S3 Standard stores redundant copies of each object across multiple Availability Zones.
Data Access
S3 provides flexible data access through APIs and SDKs, allowing
seamless integration with various applications.
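As an example of SDK-based access, the sketch below builds a date-partitioned object key and shows how the object could be uploaded with `boto3`. The zone, dataset, and file names are hypothetical, and the `year=/month=/day=` layout is one common convention rather than a requirement. `upload_bytes` assumes `boto3` and AWS credentials are available, so it is defined but not called here.

```python
from datetime import date


def partitioned_key(zone, dataset, day, filename):
    """Build an object key partitioned by date, e.g. for raw or processed zones."""
    return (
        f"{zone}/{dataset}/"
        f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"
        f"{filename}"
    )


def upload_bytes(bucket, key, body):
    """Upload bytes to S3 with the SDK.

    Assumes boto3 is installed and credentials are configured; not called below.
    """
    import boto3  # imported lazily so the key-building part runs without it

    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=body)


# Hypothetical zone, dataset, and file name.
key = partitioned_key("raw", "clickstream", date(2024, 1, 15), "events.json")
print(key)
```

Partitioned keys like this let query engines skip objects outside the requested date range instead of scanning the whole dataset.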
Scaling with Amazon EMR and Apache Spark
Scalability and Performance
EMR provides managed Hadoop and Spark clusters for large-scale data processing, offering high performance and scalability.
Distributed Processing
Apache Spark enables distributed data processing, allowing parallel execution of tasks for faster insights.
Data Analytics
Spark provides a powerful engine for data analytics, supporting various data processing and machine learning tasks.
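The idea behind distributed processing can be sketched without a cluster: split the data into partitions, process each partition independently, then merge the partial results. The example below is plain Python, not Spark, and uses a thread pool as a stand-in for cluster workers; the function names and sample lines are made up for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Distribute records round-robin across a fixed number of partitions."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def count_words(partition):
    """Map step: count words within a single partition."""
    counts = Counter()
    for line in partition:
        counts.update(line.lower().split())
    return counts


def distributed_word_count(records, num_partitions=3):
    """Process partitions independently, then reduce the partial counts."""
    partitions = split_into_partitions(records, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_counts = list(pool.map(count_words, partitions))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)  # Counter.update adds counts together
    return total


# Hypothetical input lines.
lines = ["spark on emr", "emr runs spark", "spark scales out"]
word_counts = distributed_word_count(lines)
print(word_counts["spark"], word_counts["emr"])
```

Because each partition is counted without reference to the others, the same split, map, and merge pattern can be spread across many machines, which is what a Spark cluster on EMR does at much larger scale.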
Securing and Monitoring Data Workloads
Data Encryption
AWS offers encryption options for data at rest and in transit, ensuring data confidentiality and integrity.
Access Control
IAM policies and security groups restrict access to sensitive data, ensuring only authorized users can access it.
Monitoring and Auditing
AWS CloudTrail and CloudWatch provide logging and monitoring capabilities, enabling insights into data access patterns and potential security breaches.
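As an illustration of access control, the sketch below generates a small IAM policy document that grants read-only access to objects under one S3 prefix. The bucket name, prefix, and statement ID are hypothetical, and a real policy would likely need further statements (for example, permission to list the bucket) depending on the workload.

```python
import json


def read_only_policy(bucket, prefix):
    """Build an IAM policy document allowing reads under a single S3 prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadCuratedObjects",  # hypothetical statement ID
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            }
        ],
    }


# Hypothetical bucket and prefix.
policy = read_only_policy("example-data-lake", "curated")
print(json.dumps(policy, indent=2))
```

Scoping the resource to a single prefix follows the least-privilege principle: the policy grants only the access the consumer needs and nothing else in the bucket.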
Hands-on Exercises and Case Studies
Practical Application
Implement data pipelines and analyze data using real-world case studies, applying the knowledge gained during the internship.
Problem-Solving
Develop problem-solving skills by tackling real-world data challenges and identifying solutions using AWS services.