Data Engineering

The document outlines a comprehensive roadmap for aspiring data engineers, detailing essential skills and technologies such as SQL, Python, PySpark, data warehousing, and cloud services. It emphasizes the importance of hands-on experience with various tools and frameworks, as well as the need for strong programming and problem-solving skills. Additionally, it highlights the significance of machine learning knowledge and experience in building scalable data solutions in cloud environments for career advancement in data engineering.

Uploaded by

Abdullah

Data Engineering & AI

A corporate data engineering career can make you a millionaire in 10 years, and this is how
I think today's computer science students could achieve the same.

𝗦𝘁𝗲𝗽 𝟭: 𝗦𝗤𝗟
- Basic SQL Syntax
- DDL, DML, DCL
- Joins & Subqueries
- Views & Indexes
- CTEs & Window Functions

𝗦𝘁𝗲𝗽 𝟮: 𝗣𝘆𝘁𝗵𝗼𝗻
- Fundamentals
- Numpy
- Pandas

𝗦𝘁𝗲𝗽 𝟯: 𝗣𝘆𝘀𝗽𝗮𝗿𝗸
- RDD
- Dataframe
- Datasets (a typed API in Scala/Java; PySpark exposes DataFrames only)
- Spark Streaming
- Optimization techniques

𝗦𝘁𝗲𝗽 𝟰: 𝗗𝗮𝘁𝗮 𝗪𝗮𝗿𝗲𝗵𝗼𝘂𝘀𝗶𝗻𝗴/𝗗𝗮𝘁𝗮 𝗠𝗼𝗱𝗲𝗹𝗶𝗻𝗴
- OLAP vs OLTP
- Star & Snowflake Schema
- Fact & Dimension Tables
- Slowly Changing Dimensions (SCD)

𝗦𝘁𝗲𝗽 𝟱: 𝗖𝗹𝗼𝘂𝗱 𝗦𝗲𝗿𝘃𝗶𝗰𝗲𝘀
- NoSQL DB
- Relational DB
- Data warehousing
- Scheduling & Orchestration
- Messaging
- ETL Services
- Storage Services
- Data Processing Services

𝗛𝗲𝗿𝗲 𝗮𝗿𝗲 𝘀𝗼𝗺𝗲 𝘃𝗮𝗹𝘂𝗮𝗯𝗹𝗲 𝗿𝗲𝘀𝗼𝘂𝗿𝗰𝗲𝘀 𝘁𝗼 𝗵𝗲𝗹𝗽 𝘆𝗼𝘂 𝗴𝗲𝘁 𝘀𝘁𝗮𝗿𝘁𝗲𝗱:

• SQL - https://lnkd.in/gV_5EFtE
• Python - https://lnkd.in/dt_-2-Uj
• Pyspark - https://lnkd.in/gtCdub-V
• Airflow - https://lnkd.in/guebuHJ7
• Kafka - https://lnkd.in/gVZUT52s
• Azure Cloud - https://lnkd.in/gwc3By9h
• Google Cloud - https://lnkd.in/gV_5EFtE
• AWS - https://lnkd.in/gJeUGfjS
• Projects - https://lnkd.in/gcpsNtnw
5+ years of experience in software engineering, with a focus on platform engineering or
cloud-native application development.

Strong programming skills in Python, Go, or Java, with experience in building
distributed systems.
Extensive hands-on experience with GCP services, including Compute Engine,
Kubernetes Engine (GKE), Cloud Storage, BigQuery, Pub/Sub, and Cloud Functions.
Proficiency in designing and managing cloud-based architectures with a deep
understanding of GCP IAM, VPCs, and networking.
Expertise in containerization and orchestration technologies, including Docker and
Kubernetes.
Hands-on experience with Infrastructure as Code tools like Terraform, Pulumi, or Cloud
Deployment Manager.
Familiarity with monitoring and observability tools such as Cloud Monitoring,
Prometheus, or Grafana.
Strong understanding of network protocols (e.g., TCP/IP, HTTP/HTTPS), load balancing,
and security best practices in a cloud environment.
Knowledge of CI/CD pipelines and related tools (e.g., Jenkins, GitLab CI/CD, or
CircleCI).
Experience with data processing and storage solutions like BigQuery, Firestore, or
Cloud Spanner is a plus.
Excellent problem-solving skills with the ability to troubleshoot and resolve complex
system issues.
Bachelor's or Master’s degree in Computer Science, Engineering, or a related field or
equivalent military experience required

Advanced degree in a technical field such as computer science

•Experience with building and managing scalable ML infrastructure on cloud
platforms
•Ability to promote product excellence and collaboration, driving a portfolio of
concurrent engineering projects, from short-term critical feature launches to long-
term research initiatives.
•Ability to create a compelling vision for the future, communicate clearly, and
have a collaborative leadership approach.
•Experience with machine learning frameworks such as TensorFlow, Caffe2,
PyTorch, Spark ML, scikit-learn, or related frameworks
AI experience is a must for this position: knowledge of LLMs (GPT, Mistral, Llama, Claude,
etc.) and experience developing with them (usage, prompting, etc.), AI frameworks
(LangChain, LlamaIndex, AutoGen, etc.), and AI architectures (RAG, reranking, etc.)

You have 5+ years of experience in back-end development with Python,
TypeScript/Node.js, or Java, with a focus on delivering for security, scalability,
availability, and performance
Security is at the forefront of your mind in everything that you do
Nice to have: Ideally you will have worked with AWS in a production environment and
understand how to design for, deploy on and get the best out of, the environment and
services provided by Amazon
Deep knowledge about large-scale recommendation and ML/Ranking systems

•Proven track record of operating highly-available systems at significant scale
•Experience in at least one of the following areas:
•Retrieval Systems (Indexing, Retrieval, ANN Embedding Based Retrieval
System)
•Large-scale ML Feature Management & Serving
•Online Ranking Serving System (ML Inference, Ranking, Marketplace)
•Full-stack Ranking Backend Generalists with Prior Experience in Large-scale
Consumer Products
•Experience with NoSQL systems, e.g., Bigtable, HBase
3+ years experience working with distributed Ranking / Recommendation applications in
production

Deep knowledge and experience in recommendation system development life cycles

•Deep expertise in one or more of the following areas: Retrieval Systems, ML
Feature Management and Serving Systems, Vector Database (ANN), Online
Ranking and Marketplace Systems, etc.

Core Requirements

•Expertise in Large-Scale Storage Systems
•Deep knowledge of AWS storage services (S3, DynamoDB, EBS, EFS, FSx, Glacier) and their
performance characteristics.
•Experience designing and optimizing object storage, block storage, and file storage
solutions at scale.
•Strong understanding of storage durability, consistency models, replication, and erasure
coding for fault tolerance.
•Experience implementing tiered storage solutions and cost-optimized data retention
strategies.
•Distributed Systems & Scalability
•Deep understanding of distributed storage architectures, CAP theorem, and consistency
models.
•Expertise in partitioning, sharding, and replication strategies for low-latency, high-
throughput storage.
•Experience designing and implementing highly available, fault-tolerant distributed
systems using consensus algorithms (Raft, Paxos, Gossip Protocol).
•Hands-on experience with high-performance NoSQL databases (DynamoDB, Cassandra,
RocksDB).
•High-Performance Backend Engineering
•Strong programming skills in Kotlin, Java, Go, Rust, or Python for backend storage
development.
•Experience building event-driven, microservices-based architectures using gRPC, REST, or
WebSockets.
•Expertise in data serialization formats (Parquet, Avro, ORC) for optimized storage access.
•Experience implementing data compression, deduplication, and indexing strategies to
improve storage efficiency.
•Cloud-Native & Infrastructure Automation
•Strong hands-on experience with AWS Well-Architected Framework and cloud storage best
practices.
•Proficiency in Infrastructure as Code (IaC) using Terraform, AWS CDK, or CloudFormation.
•Experience with Kubernetes (EKS), serverless architectures (Lambda, Fargate), and
containerized storage workloads.
•Expertise in CI/CD automation for storage services, leveraging GitHub Actions,
CodePipeline, Jenkins, or ArgoCD.
•Performance Optimization & Observability
•Experience with benchmarking, profiling, and optimizing storage workloads.
•Proficiency in performance monitoring tools (CloudWatch, Prometheus, OpenTelemetry,
Grafana) for storage systems.
•Strong debugging and troubleshooting skills for latency bottlenecks, memory leaks, and
concurrency issues.
•Experience designing observability strategies (tracing, metrics, structured logging) for
large-scale storage systems.
•Security, Compliance, and Data Protection
•Deep knowledge of data security, encryption at rest/in transit, and IAM policies in AWS.
•Experience implementing fine-grained access controls (IAM, KMS, STS, VPC Security
Groups) for multi-tenant storage solutions.
•Familiarity with compliance frameworks (SOC2, GDPR, HIPAA, FedRAMP) and best
practices for secure data storage.
•Expertise in disaster recovery, backup strategies, and multi-region failover solutions.
•Leadership & Architectural Strategy
•Proven ability to design, document, and drive large-scale storage architectures from
concept to production.
•Experience leading technical design reviews, architecture discussions, and engineering
best practices.
•Strong ability to mentor senior and mid-level engineers, fostering growth in distributed
storage expertise.
•Ability to influence technical roadmaps, long-term vision, and cost optimization strategies
for backend storage.
Proficient in Scala, Java, or Python, as well as Spark, HQL, and SQL.

• Deep knowledge in Hadoop ecosystem, like HDFS, Hive, MapReduce, Presto etc.
• Advanced knowledge of complex software design, distributed system design, design patterns,
data structures and algorithms.
• Excellent data analytics skills and ability to explore and identify data issues.
• Ability to explain complex subjects in layman’s terms.
• Experience with distributed version control like Git or similar
• Familiarity with continuous integration/deployment processes and tools such as Jenkins and
Maven.
• Familiar with public cloud technologies in Google Cloud Platform, especially BigQuery, GCS
and Dataproc.
• Experience with ETL pipelines.
• Experience in advertising domain.
• Familiar with workflow management systems like Airflow or Oozie.
• Experience with enterprise monitoring and alerting solutions like Prometheus, Graphite,
Alertmanager and Splunk.
𝗦𝗤𝗟
- How would you write a query to calculate a cumulative sum or running total within a specific
partition in SQL?
- How do window functions differ from aggregate functions, and when would you use them?
- How do you identify and remove duplicate records in SQL without using temporary tables?
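
One possible answer to the first and third questions, sketched with Python's built-in sqlite3 module so it runs without a database server. The `sales` table and its rows are made up for illustration. The dedup step relies on SQLite's `rowid`, which other engines do not have, and window functions need SQLite 3.25 or newer.

```python
import sqlite3

# In-memory database with a hypothetical "sales" table (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, sale_day INTEGER, amount INTEGER);
    INSERT INTO sales VALUES
        ('east', 1, 10), ('east', 2, 20), ('east', 3, 5),
        ('west', 1, 7),  ('west', 2, 3),
        ('west', 2, 3);
""")

# Remove duplicates without a temporary table: keep the lowest rowid in each
# group of identical rows and delete the rest (rowid is SQLite-specific).
conn.execute("""
    DELETE FROM sales
    WHERE rowid NOT IN (
        SELECT MIN(rowid) FROM sales GROUP BY region, sale_day, amount
    )
""")

# Running total per region: SUM() used as a window function. Unlike a plain
# GROUP BY aggregate, every input row is kept and gets its own total.
running_totals = conn.execute("""
    SELECT region, sale_day, amount,
           SUM(amount) OVER (
               PARTITION BY region
               ORDER BY sale_day
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS running_total
    FROM sales
    ORDER BY region, sale_day
""").fetchall()

for row in running_totals:
    print(row)
```

In engines without `rowid`, a common alternative is `ROW_NUMBER() OVER (PARTITION BY ...)` in a CTE, deleting rows numbered above 1.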

𝗣𝘆𝘁𝗵𝗼𝗻
- How do you manage memory efficiently when processing large files in Python?
- What are Python decorators, and how would you use them to optimize reusable code in ETL
processes?
- How do you use Python’s built-in logging module to capture detailed error and audit logs?
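
The three Python questions can be touched on in one small sketch: a generator that streams a file line by line, a decorator that wraps an ETL step, and the standard logging module for error and audit messages. The names `log_step`, `read_amounts`, `total_amount`, and the `etl` logger are hypothetical, and the temporary file only stands in for a large input.

```python
import functools
import logging
import os
import tempfile
import time

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(name)s %(message)s")
logger = logging.getLogger("etl")  # hypothetical logger name


def log_step(func):
    """Decorator that logs the duration and any exception of an ETL step."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            result = func(*args, **kwargs)
        except Exception:
            # logger.exception logs at ERROR level and includes the traceback.
            logger.exception("step %s failed", func.__name__)
            raise
        logger.info("step %s finished in %.3fs", func.__name__,
                    time.perf_counter() - start)
        return result
    return wrapper


def read_amounts(path):
    """Yield one parsed value at a time instead of loading the whole file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:  # file objects iterate lazily, line by line
            yield int(line)


@log_step
def total_amount(path):
    # sum() consumes the generator, so only one line is held in memory at once.
    return sum(read_amounts(path))


# Usage with a small temporary file standing in for a large input.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("\n".join(str(n) for n in range(1, 101)))
total = total_amount(tmp.name)
os.remove(tmp.name)
print(total)
```

The same decorator can be reused on every step of a pipeline, which keeps timing and error logging out of the business logic.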

𝗣𝘆𝘀𝗽𝗮𝗿𝗸
- How would you handle skewed data in a Spark job to prevent performance issues?
- What is the difference between SparkSession and SparkContext? When should each be
used?
- How do you handle backpressure in Spark Streaming applications to manage load
effectively?
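
One common answer to the skew question is key salting with a two-stage aggregation. The sketch below simulates the idea in plain Python rather than Spark, so it runs without a cluster. The record set, `NUM_SALTS`, and the random seed are assumptions chosen for illustration.

```python
import random
from collections import Counter, defaultdict

NUM_SALTS = 4  # hypothetical number of salt buckets

# Skewed input: one "hot" key dominates the data set.
records = [("hot", 1)] * 1000 + [("cold", 1)] * 10

rng = random.Random(0)

# Stage 1: attach a random salt so the hot key spreads over several groups,
# then aggregate per (key, salt). In Spark this would be the first grouping.
partial = defaultdict(int)
for key, value in records:
    partial[(key, rng.randrange(NUM_SALTS))] += value

# Stage 2: drop the salt and combine the partial sums per original key.
final = Counter()
for (key, _salt), subtotal in partial.items():
    final[key] += subtotal

# Every value is 1, so each subtotal equals the number of records in its group.
largest_group = max(partial.values())
print(dict(final), largest_group)
```

The final totals match an unsalted aggregation, but no single group has to process all of the hot key's records.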

𝗔𝘇𝘂𝗿𝗲 𝗗𝗮𝘁𝗮𝗯𝗿𝗶𝗰𝗸𝘀
- How do you configure cluster autoscaling in Databricks, and when should it be used?
- How do you implement data versioning in Delta Lake tables within Databricks?
- How would you monitor and optimize Databricks job performance metrics?

𝗔𝘇𝘂𝗿𝗲 𝗗𝗮𝘁𝗮 𝗙𝗮𝗰𝘁𝗼𝗿𝘆
- What are tumbling window triggers in Azure Data Factory, and how do you configure them?
- How would you enable managed identity-based authentication for linked services in ADF?
- How do you create custom activity logs in ADF for monitoring data pipeline execution?

𝗖𝗜/𝗖𝗗
- What are blue-green deployments, and how would you use them for ETL jobs?
- How do you implement rollback mechanisms in CI/CD pipelines for data integration
processes?
- What strategies do you use to handle schema evolution in data pipelines as part of CI/CD?
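
For the schema evolution question, one strategy is a compatibility check that runs in the CI pipeline before deployment. The sketch below is a minimal version under an assumed rule set: removing a column, changing a type, or adding a non-nullable column counts as breaking. The function name and the dict-based schema format are hypothetical.

```python
def breaking_changes(old_schema, new_schema):
    """Return a list of changes that would break existing readers.

    Schemas are plain dicts: column name -> (type name, nullable flag).
    """
    problems = []
    for column, (old_type, _old_nullable) in old_schema.items():
        if column not in new_schema:
            problems.append(f"column removed: {column}")
            continue
        new_type, _new_nullable = new_schema[column]
        if new_type != old_type:
            problems.append(f"type changed: {column} {old_type} -> {new_type}")
    for column, (_type, nullable) in new_schema.items():
        if column not in old_schema and not nullable:
            problems.append(f"new column is not nullable: {column}")
    return problems


# Hypothetical schemas for illustration.
v1 = {"id": ("int", False), "email": ("string", True)}
v2_ok = {"id": ("int", False), "email": ("string", True), "phone": ("string", True)}
v2_bad = {"id": ("string", False), "phone": ("string", False)}

print(breaking_changes(v1, v2_ok))
print(breaking_changes(v1, v2_bad))
```

A CI job could fail the build whenever the returned list is non-empty, which forces breaking changes to go through an explicit migration.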

𝗗𝗮𝘁𝗮 𝗪𝗮𝗿𝗲𝗵𝗼𝘂𝘀𝗶𝗻𝗴
- How do you optimize join operations in a data warehouse to improve query performance?
- What is a slowly changing dimension (SCD), and what are different ways to implement it in a
data warehouse?
- How do surrogate keys benefit data warehouse design over natural keys?
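
As an illustration of the SCD question, the sketch below applies a Type 2 change to an in-memory dimension: the current row is expired and a new version is appended under a fresh surrogate key. The column names (`customer_sk`, `customer_id`, `valid_from`, `valid_to`, `is_current`) and the sample data are assumptions, and a real warehouse would do this with SQL or a merge statement.

```python
from datetime import date


def apply_scd2(dimension, natural_key, new_attrs, effective):
    """Apply a Type 2 change: expire the current row, append a new version."""
    current = next(
        (row for row in dimension
         if row["customer_id"] == natural_key and row["is_current"]),
        None,
    )
    if current is not None:
        if all(current[name] == value for name, value in new_attrs.items()):
            return dimension  # attributes unchanged, keep history as it is
        current["valid_to"] = effective
        current["is_current"] = False
    dimension.append({
        "customer_sk": len(dimension) + 1,  # surrogate key, independent of source ids
        "customer_id": natural_key,         # natural (business) key
        **new_attrs,
        "valid_from": effective,
        "valid_to": None,
        "is_current": True,
    })
    return dimension


# Hypothetical customer who moves once; the third call repeats the same data.
dim = []
apply_scd2(dim, "C1", {"city": "Berlin"}, date(2024, 1, 1))
apply_scd2(dim, "C1", {"city": "Munich"}, date(2024, 6, 1))
apply_scd2(dim, "C1", {"city": "Munich"}, date(2024, 7, 1))

for row in dim:
    print(row)
```

The surrogate key is what lets two versions of the same customer coexist, which is one reason it is preferred over the natural key in the dimension.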

𝗗𝗮𝘁𝗮 𝗠𝗼𝗱𝗲𝗹𝗶𝗻𝗴
- How do you decide between a star schema and a snowflake schema for a data warehouse?
Provide examples of scenarios where each is ideal.
- What is dimensional modeling, and how does it differ from entity-relationship modeling in
terms of use cases?
- How do you handle one-to-many relationships in a dimensional model to ensure efficient
querying?
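
To make the star schema questions concrete, here is a minimal sketch using Python's built-in sqlite3 module: one fact table joined directly to two dimension tables. All table names, columns, and rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimensions hold descriptive attributes, one row per member.
    CREATE TABLE dim_product (product_sk INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_date (date_sk INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
    -- The fact table holds measures plus one foreign key per dimension.
    CREATE TABLE fact_sales (
        product_sk INTEGER REFERENCES dim_product(product_sk),
        date_sk INTEGER REFERENCES dim_date(date_sk),
        quantity INTEGER,
        revenue INTEGER
    );
    INSERT INTO dim_product VALUES (1, 'pen', 'stationery'), (2, 'mug', 'kitchen');
    INSERT INTO dim_date VALUES (20240101, 2024, 1), (20240201, 2024, 2);
    INSERT INTO fact_sales VALUES
        (1, 20240101, 3, 30), (1, 20240201, 1, 10), (2, 20240101, 2, 40);
""")

# A typical star query: every dimension is a single join away from the fact.
revenue_by_category = conn.execute("""
    SELECT p.category, d.year, SUM(f.revenue) AS revenue
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_sk = f.product_sk
    JOIN dim_date AS d ON d.date_sk = f.date_sk
    GROUP BY p.category, d.year
    ORDER BY p.category
""").fetchall()
print(revenue_by_category)
```

In a snowflake variant, `category` would move into its own table referenced from `dim_product`, trading an extra join for less repeated data.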

Experience creating solutions incorporating Machine Learning algorithms and models using
Python with Data Engineering libraries and tools

• You have developed server-side Java and Python applications using mainstream
libraries and frameworks, including the Spring framework, Pandas, SciPy, PySpark, and
Pydantic
• Current cloud technology experience with AWS
• Experience integrating with async messaging, logging, or queues, such as Kafka,
RabbitMQ, SQS, NATS
• You collaborate as a hands-on team member developing a significant commercial
software project in Java and Python
• Software development experience building and testing applications following secure
coding practices. Additional preferred experience includes building systems for
financial services or tightly regulated businesses, and security and privacy compliance
(GDPR, CCPA, ISO 27001, PCI, HIPAA, etc.) experience

Responsibilities

• You are an active collaborator as a primary member of a software engineering team
focused on building event-driven services that provide secure, efficient solutions in a
determined timeframe
• You will work with the Data Science teams, creating solutions incorporating Machine
Learning algorithms and models using Python with Data Engineering libraries and tools
• You can work on a scalable data streaming application functionality on an AWS cloud-
based platform
• Diligently observe and maintain Standards for Regulatory Compliance and Information
Security, plus deliver and maintain accurate, complete, and current documentation
• Participate in full Agile cycle engagements, including meetings, iterative development,
estimations, code reviews, and design sessions
• You will work with the service quality engineering team to ensure that only thoroughly
tested code makes it to production, then own deliverables from design through
production operationalization

Qualifications

• You have 8+ years of software development experience building and testing
applications following secure coding practices
• You are a hands-on team member working on a significant commercial software project
in Java and Python
• Your recent experience is hands-on building and supporting commercial systems
managing data and transactions, including server-side development of Data Flow
processes, incorporating Machine Learning models, and performing Data Enrichment
and ETL processes.
• Current cloud technology experience with AWS (Kubernetes, Fargate, EC2, S3, RDS
PostgreSQL, Lambda, OpenSearch/Elasticsearch). Familiarity with creating and using
Docker/Kubernetes applications
• Experience with Continuous Integration/Continuous Delivery (CI/CD) processes and
practices (CodeCommit, CodeDeploy, CodePipeline/Harness/Jenkins/GitHub Actions,
CLI, BitBucket/Git).
• Knowledgeable and experienced with software and system patterns and their
application in prior works. Experience gathering and assessing specifications and
requirements. Experience supporting data science efforts.

Design business-critical data models that would be used to power business decisions. Ensure
data quality, consistency, and accuracy.

• Design, build, and maintain scalable, robust and reliable data pipelines for internal
stakeholders and customers.
• Deliver data products that our customers can use, including data warehouse sharing
and embedded analytics.
• Help develop a mature product analytics capability within the company and empower
data-driven decisions.
• Contribute to the broader Data Analytics community at Zip to influence tooling and
standards to improve culture and productivity.

Qualifications

• 5+ years of industry software development experience within the Data domain
• Experience with data processing technologies and frameworks, such as Airflow, Luigi,
dbt, etc.
• Experience with data warehousing technologies such as Snowflake, ClickHouse,
Redshift, etc.
• Ability to effectively communicate complex projects to non-technical stakeholders.
• Bachelor’s and/or Master’s degree, preferably in CS or engineering fields, or equivalent
experience

Nice to Haves

• Experience with analytics tools such as Superset and Looker.
• Experience with deploying data systems on AWS and Kubernetes.

15+ years of experience in software development, focusing on big data processing, real-time
serving and distributed low latency systems

Expert in multiple distributed technologies (e.g. Spark/Storm, Kafka, Key Value Stores,
Caching, Solr, Druid, etc.)
Proficient in Scala or Java and Full Stack application development.
Deep knowledge of the Hadoop ecosystem, like HDFS, Hive, MapReduce, Presto, etc.
Advanced knowledge of complex software design, distributed system design, design
patterns, data structures and algorithms.
Experience working as a Machine learning engineer closely collaborating with data
scientists.
Experience working with ML frameworks like TensorFlow and ML feature engineering.
Experience in one or more public cloud technologies like GCP, Azure, etc.
Excellent debugging and problem-solving capability.
Experience in working in large teams using CI/CD and agile methodologies.
Domain expertise in Ad Tech systems is a plus.
Experience working with financial applications is a plus.
