Data Engineering System Design

This document includes 12 detailed system design scenarios frequently asked in Data
Engineering interviews. Each scenario explains:

●​ The problem and real-world context​

●​ Clarifying questions to ask in an interview​

●​ Functional and non-functional requirements​

●​ A detailed architecture proposal​

●​ Trade-offs and technology choices​

●​ Security and compliance considerations​

●​ Interview talking points​

Use this to prepare thorough answers that showcase both technical knowledge and critical
thinking.
1. Real-Time User Activity Analytics (Clickstream
Processing)
Prompt:​

Design a system to process website or app clickstream data in near real-time to compute
metrics like active users, session durations, funnels, and trends.

Real-World Context:​

E-commerce companies, news portals, and streaming services all rely on clickstream data to
understand user behaviour. These insights help drive recommendations, detect anomalies, and
improve user engagement.

Clarifying Questions

●​ What’s the peak event rate (e.g. 10,000 events/sec or 1 million)?​

●​ Is latency critical (e.g. dashboards updated every 10 sec vs. every 5 min)?​

●​ Do we need to store raw events for historical analysis or only aggregates?​

●​ Are there privacy regulations like GDPR requiring data masking?​

●​ Which BI tools will the analysts use?​

Functional Requirements

●​ Handle high-volume event ingestion​

●​ Compute sliding window metrics (e.g. active users every 5 minutes)​

●​ Sessionise user events based on timeouts​

●​ Store both raw logs and aggregated metrics​

●​ Provide API or warehouse for dashboard queries​


Non-Functional Requirements

●​ Scalability for traffic spikes (e.g. during sales)​

●​ Fault tolerance so data isn’t lost during outages​

●​ Cost efficiency for cloud services​

●​ Privacy compliance (GDPR, CCPA)​

Architecture Overview

●​ Ingestion:​
Use Kafka or Pub/Sub to capture events from web/mobile clients. Kafka can handle
large event volumes and supports message durability.​

●​ Processing:​
Use Flink or Spark Structured Streaming for real-time processing. These frameworks
support windowing functions and can join streams for enrichment.​

●​ Storage:​

○​ Store raw logs in S3 or GCS in Parquet format, partitioned by date and hour.​

○​ Store aggregates (like active users, bounce rate) in a data warehouse like
BigQuery, Snowflake, or Redshift for fast querying.​

●​ Visualization:​
Connect BI tools like Looker, Tableau, or Superset to the warehouse for dashboards.​

Low-Level Design

●​ Implement sessionization logic that groups events by user_id. A typical session timeout
is 30 minutes of inactivity.​

●​ Use sliding window aggregations to track metrics like active users in the last 5 minutes.​

●​ Enrich data with geo-IP information using lookups or joins in the stream.​
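
As an illustration of the windowing described above, here is a minimal PySpark Structured Streaming sketch that computes active users over a 5-minute sliding window. The Kafka broker address, topic name, event schema, and checkpoint path are assumptions, not prescriptions.

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col, window, approx_count_distinct
from pyspark.sql.types import StructType, StringType, TimestampType

spark = SparkSession.builder.appName("clickstream-metrics").getOrCreate()

event_schema = (StructType()
    .add("user_id", StringType())
    .add("event_type", StringType())
    .add("event_time", TimestampType()))

events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")      # assumed broker
    .option("subscribe", "clickstream-events")            # assumed topic
    .load()
    .select(from_json(col("value").cast("string"), event_schema).alias("e"))
    .select("e.*"))

# Active users per 5-minute window, sliding every minute; events arriving more
# than 10 minutes late are dropped via the watermark.
active_users = (events
    .withWatermark("event_time", "10 minutes")
    .groupBy(window(col("event_time"), "5 minutes", "1 minute"))
    .agg(approx_count_distinct("user_id").alias("active_users")))

(active_users.writeStream
    .outputMode("update")
    .format("console")                                    # replace with a warehouse sink
    .option("checkpointLocation", "s3://bucket/checkpoints/active-users")
    .start())
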
Example Diagram

Client → Kafka → Flink/Spark → S3 (Raw Logs)
                     ↓
       BigQuery / Snowflake → BI Tools

Trade-offs & Reasoning

●​ Kafka vs Pub/Sub: Kafka offers finer control over performance tuning and durability but
requires cluster management, whereas Pub/Sub is fully managed.​

●​ Flink has lower latency and better state management than Spark, but Spark is more
widely adopted and easier to deploy.​

●​ Using Parquet reduces storage costs and improves scan speeds in warehouses.​

Security & Compliance

●​ Encrypt events in transit.​

●​ Mask PII fields like email or IP before writing to warehouses.​

●​ Implement fine-grained permissions for warehouse queries.​

Interview Talking Points:

●​ Describe windowing and sessionization logic in detail.​

●​ Explain how you’d handle spikes in traffic.​

●​ Discuss why raw data is retained for backfills or new analytics.​


2. Batch ETL Pipeline for E-commerce
Prompt:​

Design a daily ETL pipeline that processes CSV files uploaded to S3, transforms them, and
loads them into Snowflake for analytics.

Real-World Context:​

E-commerce platforms often exchange data with partners in CSV format. The data must be
cleaned, validated, and integrated into internal warehouses for reporting and fraud analysis.

Clarifying Questions

●​ Daily data size (e.g. 10 GB or 1 TB)?​

●​ Complexity of transformations — simple cleaning or multi-table joins?​

●​ Required SLA for pipeline completion?​

●​ Compliance requirements for data audits?​

Functional Requirements

●​ Automatically detect and ingest new CSV files from S3​

●​ Transform data (clean nulls, cast data types, join reference tables)​

●​ Run data quality checks​

●​ Load data into Snowflake​

●​ Alert on failures​

Non-Functional Requirements

●​ Handle increasing file sizes without performance degradation​

●​ Maintain data quality and consistency​

●​ Provide observability through logging and metrics​


Architecture Overview

●​ Ingestion:​
External systems drop CSV files into an S3 bucket.​

●​ Orchestration:​
Airflow schedules and manages the ETL pipeline.​

●​ Processing:​
Use PySpark to transform data. Spark is highly efficient for large CSV parsing and can
handle distributed joins.​

●​ Data Quality:​
Implement Great Expectations or Amazon Deequ for data validation (e.g. null checks,
uniqueness constraints).​

●​ Warehouse Load:​
Load transformed data into Snowflake using bulk load commands like COPY INTO.​

Low-Level Design

●​ Design Spark jobs to handle:​

○​ Parsing large CSV files with defined schemas​

○​ Joins with lookup datasets (e.g. SKUs, pricing tables)​

○​ Null removal and data type casting​

●​ Implement Airflow alerting with email or Slack on job failures.​
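
A hedged PySpark sketch of the transform step above: parsing the CSV with an explicit schema, dropping rows with null keys, casting types, and joining a SKU lookup table. The file paths, column names, and schema are illustrative assumptions.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

spark = SparkSession.builder.appName("daily-orders-etl").getOrCreate()

order_schema = "order_id STRING, sku STRING, quantity INT, amount STRING, order_date STRING"

orders = (spark.read
    .option("header", "true")
    .schema(order_schema)
    .csv("s3://partner-drop/orders/dt=2024-01-01/"))      # assumed input path

skus = spark.read.parquet("s3://reference/skus/")          # assumed lookup dataset

cleaned = (orders
    .dropna(subset=["order_id", "sku"])                    # remove rows missing keys
    .withColumn("amount", col("amount").cast("double"))
    .withColumn("order_date", to_date(col("order_date"), "yyyy-MM-dd"))
    .join(skus, on="sku", how="left"))                     # enrich with SKU attributes

cleaned.write.mode("overwrite").parquet("s3://staging/orders/dt=2024-01-01/")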


Example Diagram

S3 → Airflow → Spark → Data Quality Checks → Snowflake

Trade-offs & Reasoning

●​ Spark scales better than Pandas for large datasets.​

●​ Snowflake handles semi-structured data and scales compute resources on demand.​

●​ Airflow allows complex DAGs with retries and SLA monitoring.​

Security & Compliance

●​ Use S3 bucket policies to restrict file uploads.​

●​ Implement Snowflake’s row-level security for sensitive customer data.​

Interview Talking Points:

●​ Explain why Spark is chosen over Pandas.​

●​ Discuss partitioning strategies in Snowflake.​

●​ Describe how you’d debug data pipeline failures.​


3. Change Data Capture (CDC) Pipeline
Prompt:​

Design a pipeline that captures changes from a transactional database and replicates them to
an analytics warehouse in near real-time.

Real-World Context:​

Many analytics systems rely on data from OLTP systems. Instead of bulk daily extracts, modern
pipelines use CDC to reflect changes quickly, enabling fresher insights.

Clarifying Questions

●​ Source database type (MySQL, Postgres, Oracle)?​

●​ How much data changes per hour?​

●​ How fresh must data in the warehouse be (seconds, minutes, hours)?​

●​ Any GDPR concerns regarding replicating sensitive data?​

Functional Requirements

●​ Detect inserts, updates, deletes in source DB​

●​ Replicate changes into target warehouse​

●​ Maintain consistency even during outages​

●​ Support schema changes gracefully​

Non-Functional Requirements

●​ Low replication lag​

●​ Scalability to high transaction volumes​

●​ Fault tolerance and recovery mechanisms​

●​ Auditing for compliance​


Architecture Overview

●​ CDC Capture:​
Debezium reads database logs and streams changes to Kafka.​

●​ Processing:​
Spark Structured Streaming or Flink reads Kafka topics, applies business logic, and
writes changes to the warehouse.​

●​ Warehouse Load:​
Load data into Snowflake, Redshift, or BigQuery using idempotent upserts.​

Low-Level Design

●​ Create separate Kafka topics per table.​

●​ Handle tombstone records (deletes) in downstream jobs.​

●​ Implement error handling for schema drift.​

●​ Use checkpointing in Spark to resume from failures.​
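
Below is a hedged Spark Structured Streaming sketch of the consumer side: it reads a Debezium topic, keeps only the latest change per key in each micro-batch, and separates deletes (tombstones) from upserts inside foreachBatch. The topic name, the simplified change schema, and the placeholder sink are assumptions; a real pipeline would issue MERGE/DELETE statements against the warehouse.

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col, row_number
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("cdc-orders").getOrCreate()

# Simplified Debezium envelope: op = c/u/d, ts_ms = change timestamp (assumed).
change_schema = "op STRING, ts_ms BIGINT, after STRUCT<order_id: STRING, status: STRING>"

changes = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "dbserver1.public.orders")        # assumed Debezium topic
    .load()
    .select(from_json(col("value").cast("string"), change_schema).alias("c"))
    .select("c.*"))

def upsert_batch(batch_df, batch_id):
    # Keep only the most recent change per order_id within this micro-batch.
    latest = (batch_df
        .withColumn("rn", row_number().over(
            Window.partitionBy(col("after.order_id")).orderBy(col("ts_ms").desc())))
        .filter(col("rn") == 1))
    upserts = latest.filter(col("op") != "d")
    deletes = latest.filter(col("op") == "d")
    # In a real pipeline: MERGE the upserts and DELETE the tombstoned keys in the
    # warehouse; a Parquet staging write stands in here.
    upserts.write.mode("append").parquet("s3://staging/orders_changes/")

(changes.writeStream
    .foreachBatch(upsert_batch)
    .option("checkpointLocation", "s3://bucket/checkpoints/cdc-orders")
    .start())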

Example Diagram

      +----------+    +-------+    +-------+    +-----------+
DB →  | Debezium | →  | Kafka | →  | Spark | →  | Snowflake |
      +----------+    +-------+    +-------+    +-----------+

Trade-offs & Reasoning

●​ Debezium simplifies log parsing but requires access to database logs.​

●​ Streaming CDC reduces warehouse latency but adds operational complexity.​

●​ Idempotent writes are critical to avoid duplicate rows.​


Security & Compliance

●​ Encrypt CDC streams.​

●​ Mask PII in data flows before warehouse ingestion.​

●​ Implement warehouse access controls.​

Interview Talking Points:

●​ Explain how Debezium handles schema changes.​

●​ Discuss exactly-once vs at-least-once semantics.​

●​ Describe how you’d backfill historical data if needed.​


4. Data Quality Monitoring Architecture
Prompt:​

Design a system to monitor and enforce data quality across data pipelines, catching issues like
null rates, schema drift, and outliers.

Real-World Context:​

Without data quality checks, pipelines silently propagate bad data, leading to broken reports and
machine learning failures. Companies increasingly require automated monitoring and alerts for
data quality.

Clarifying Questions

●​ Which types of data checks are required (e.g. nulls, uniqueness, statistical tests)?​

●​ Batch vs streaming pipelines to monitor?​

●​ How many datasets need monitoring?​

●​ Preferred alerting channels (Slack, PagerDuty)?​

Functional Requirements

●​ Validate data with configurable rules​

●​ Visualize historical quality trends​

●​ Alert engineers on failures​

●​ Maintain audit logs for compliance​

Non-Functional Requirements

●​ Scalable to thousands of datasets​

●​ Minimal overhead for streaming checks​

●​ Centralized monitoring dashboards​


Architecture Overview

●​ Validation Tools:​
Use Great Expectations, Deequ, or custom checks.​

●​ Orchestration:​
Integrate checks into Airflow or pipeline frameworks.​

●​ Storage:​
Time-series DB (e.g. Prometheus, InfluxDB) stores metrics over time.​

●​ Alerts:​
Notify teams via Slack, email, or incident tools.​

Low-Level Design

●​ Define validation rules as YAML or JSON config files.​

●​ Store validation results and metrics per table.​

●​ Track trends over time to detect gradual data drift.​
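
A minimal sketch of config-driven checks of the kind listed above, implemented with plain PySpark rather than a specific validation library; the rule format, dataset path, and thresholds are assumptions.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("dq-checks").getOrCreate()
df = spark.read.parquet("s3://staging/orders/dt=2024-01-01/")   # assumed dataset

# Rules would normally live in a YAML/JSON file under version control.
rules = [
    {"column": "order_id", "max_null_ratio": 0.0},
    {"column": "amount",   "max_null_ratio": 0.01},
]

total = df.count()
results = []
for rule in rules:
    nulls = df.filter(col(rule["column"]).isNull()).count()
    ratio = nulls / total if total else 0.0
    results.append({"column": rule["column"],
                    "null_ratio": ratio,
                    "passed": ratio <= rule["max_null_ratio"]})

failed = [r for r in results if not r["passed"]]
if failed:
    # A real pipeline would also push metrics to the metrics store and alert.
    raise ValueError(f"Data quality checks failed: {failed}")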

Example Diagram

Data Pipelines → Validation Engine → Metrics Store → Dashboards
                        ↓
                     Alerts

Trade-offs & Reasoning

●​ Deep statistical checks catch subtle errors but add latency.​

●​ Validation on streaming data is more complex than batch.​

●​ Central dashboards help prioritize which pipelines need attention.​


Security & Compliance

●​ Mask sensitive fields in logs.​

●​ Enforce access controls on dashboards.​

●​ Maintain validation run history for audits.​

Interview Talking Points:

●​ Explain differences between batch and streaming data quality checks.​

●​ Discuss how to measure and track data drift.​

●​ Describe error-handling strategies for quality failures.​


5. Data Pipeline Cost Optimization
Prompt:​

Design a system or strategy to reduce costs in an existing data pipeline that uses cloud services
like Spark, Snowflake, and S3.

Real-World Context:​

As data volumes grow, cloud costs can skyrocket. Companies often seek ways to balance
performance with cost-efficiency.

Clarifying Questions

●​ Which parts of the pipeline are the costliest?​

●​ Any hard SLAs for data freshness?​

●​ Batch or streaming workloads?​

●​ Budget constraints vs performance goals?​

Functional Requirements

●​ Analyze pipeline costs​

●​ Recommend cost-saving changes​

●​ Implement optimizations without breaking SLAs​

Non-Functional Requirements

●​ Scalability to large workloads​

●​ Transparent reporting to stakeholders​

●​ Minimal performance degradation​


Architecture Overview

●​ Cost Monitoring:​
Integrate cloud billing APIs to collect cost data.​

●​ Optimization Engine:​
Analyze query patterns, data sizes, and job runtimes.​

●​ Techniques:​

○​ Partition pruning​

○​ Data clustering​

○​ Auto-scaling Spark clusters​

○​ Warehouse query rewriting​

○​ Data retention and tiering policies​

○​ Off-peak scheduling​

●​ Dashboards:​
Visualize cost trends and potential savings.​

Low-Level Design

●​ Airflow tasks extract cost metrics daily.​

●​ Identify long-running queries in Snowflake using query history.​

●​ Set warehouse auto-suspend timeouts to avoid idle compute charges.​
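
As a hedged example of the query-history analysis mentioned above, this snippet pulls the slowest queries of the last day from Snowflake's ACCOUNT_USAGE.QUERY_HISTORY view using the Python connector; the connection parameters are placeholders.

import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",            # placeholder credentials
    user="cost_monitor",
    password="***",
    warehouse="ADMIN_WH",
)

sql = """
    SELECT query_id, warehouse_name, total_elapsed_time / 1000 AS seconds
    FROM snowflake.account_usage.query_history
    WHERE start_time >= DATEADD('day', -1, CURRENT_TIMESTAMP())
    ORDER BY total_elapsed_time DESC
    LIMIT 20
"""

cur = conn.cursor()
try:
    cur.execute(sql)
    for query_id, warehouse, seconds in cur:
        print(f"{warehouse}: query {query_id} ran for {seconds:.0f}s")
finally:
    cur.close()
    conn.close()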


Example Diagram

Data Pipelines → Cost Metrics Collector → Dashboard
                          ↓
                 Optimization Engine

Trade-offs & Reasoning

●​ Aggressive cost cuts might increase job runtimes.​

●​ Tiering older data to cheaper storage saves money but complicates retrieval for infrequent queries.​

●​ Cost monitoring itself introduces some overhead.​

Security & Compliance

●​ Restrict financial data visibility to appropriate teams.​

●​ Log optimization changes for compliance reviews.​

Interview Talking Points:

●​ Discuss specific Spark and Snowflake tuning techniques.​

●​ Describe cost-visibility tools like cloud cost explorers.​

●​ Explain how to balance cost savings with data SLAs.


6. Cross-Region Data Replication
Prompt:​

Design a system to replicate user activity data between US and EU regions, enabling global
analytics while remaining GDPR-compliant.

Real-World Context:​

Many global platforms operate in multiple regions and must keep data available for global
reporting while respecting laws like GDPR, which restricts where personal data can be stored.

Clarifying Questions

●​ How consistent does data need to be across regions (seconds vs minutes delay)?​

●​ Is all data replicated or just specific tables?​

●​ Are there specific GDPR rules about data crossing regions?​

●​ Is there a preference for batch or streaming replication?​

Functional Requirements

●​ Ingest and process data locally in each region​

●​ Replicate selected datasets to a central location​

●​ Aggregate data for global reporting​

●​ Maintain data residency for sensitive user data​

Non-Functional Requirements

●​ Minimal replication lag​

●​ Scalability as traffic grows​

●​ Resilience to regional outages​

●​ Compliance with data privacy laws​


Architecture Overview

●​ Regional Processing:​
Kafka in each region ingests events locally.​

●​ Replication:​
Use MirrorMaker 2.0 (or Confluent Replicator) to copy topics across regions.​

●​ Processing:​
Spark jobs in each region process and store data in S3 buckets.​

●​ Aggregation:​
A centralized batch job merges data from regional buckets into a global warehouse for
analytics.​

Low-Level Design

●​ Tag all records with a region identifier for auditing.​

●​ Mask or anonymize personal data before sending across regions.​

●​ Configure MirrorMaker with topic-level filtering so only necessary data crosses regions.​

●​ Use daily batch jobs to aggregate global metrics.​
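
A sketch of the daily global aggregation job described above, tagging each record with its region and merging the regional buckets; bucket paths and column names are assumptions, and PII is presumed to have been masked upstream.

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, countDistinct

spark = SparkSession.builder.appName("global-activity-rollup").getOrCreate()

us = (spark.read.parquet("s3://activity-us/events/dt=2024-01-01/")   # assumed bucket
      .withColumn("region", lit("US")))
eu = (spark.read.parquet("s3://activity-eu/events/dt=2024-01-01/")   # assumed bucket
      .withColumn("region", lit("EU")))

daily = (us.unionByName(eu)
    .groupBy("region")
    .agg(countDistinct("user_id").alias("daily_active_users")))

# Load into the global warehouse via its Spark connector; Parquet stands in here.
daily.write.mode("overwrite").parquet("s3://global-metrics/dau/dt=2024-01-01/")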

Example Diagram

      +-------+    +-------+    +---------+
US →  | Kafka | →  | Spark | →  | S3 (US) |
      +-------+    +-------+    +---------+
          ⇅ MirrorMaker
      +-------+    +-------+    +---------+
EU →  | Kafka | →  | Spark | →  | S3 (EU) |
      +-------+    +-------+    +---------+

Aggregator Job → Global Data Warehouse
Trade-offs & Reasoning

●​ MirrorMaker offers straightforward replication but adds latency.​

●​ GDPR compliance means some data might stay regional only.​

●​ A centralized warehouse simplifies reporting but can become a bottleneck.​

Security & Compliance

●​ Encrypt all replicated data streams.​

●​ Restrict which topics can cross borders.​

●​ Audit logs for all cross-region transfers.​

Interview Talking Points:

●​ Describe how GDPR impacts architecture decisions.​

●​ Explain how you’d handle schema changes across replicated topics.​

●​ Discuss failover strategies if one region becomes unavailable.​

7. Data Lakehouse Architecture for Diverse Data
Prompt:​

Design a data lakehouse architecture that supports structured, semi-structured, and
unstructured data for analytics and machine learning.

Real-World Context:​

Modern enterprises deal with diverse data types—CSV files, JSON logs, images, videos, and
streaming data. Traditional warehouses struggle to store everything cost-effectively. The
lakehouse approach bridges the gap between flexible storage and reliable analytics.

Clarifying Questions

●​ Estimated data volume (TBs, PBs)?​

●​ Expected latency for analytics (near real-time vs overnight batch)?​

●​ Any regulatory restrictions (e.g. on PII storage)?​

●​ How many concurrent users for BI and ML workloads?​

Functional Requirements

●​ Unified storage for all data types​

●​ Support for ACID transactions for consistency​

●​ Efficient querying for analytics​

●​ Integration with BI tools and ML platforms​

●​ Schema evolution support​


Non-Functional Requirements

●​ Cost-efficient storage for large volumes​

●​ Scalability for both read and write workloads​

●​ Lineage tracking and governance​

●​ High durability and availability​

Architecture Overview

●​ Storage:​
Store data in cloud object storage (e.g. S3, ADLS Gen2).​

●​ Format:​
Use formats like Delta Lake, Iceberg, or Hudi to provide ACID transactions, schema
evolution, and time-travel features.​

●​ Processing:​
Spark or Databricks handles both batch and streaming workloads.​

●​ Catalog:​
Use Hive Metastore, AWS Glue, or Unity Catalog to maintain metadata.​

●​ Visualization:​
Connect tools like Power BI, Looker, or Tableau directly to lakehouse tables for
analytics.​

Low-Level Design

●​ Partition data by business-relevant keys (e.g. date, region, product_id).​

●​ Leverage Delta Lake’s time travel to recover previous versions.​

●​ Optimize file sizes to reduce small-file issues.​
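
A hedged Delta Lake sketch of the partitioning and time-travel points above (requires the delta-spark package); the paths, partition columns, and version number are assumptions.

from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("lakehouse-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate())

events = spark.read.parquet("s3://landing/events/")          # assumed raw input

# Write a Delta table partitioned by business-relevant keys.
(events.write.format("delta")
    .mode("overwrite")
    .partitionBy("event_date", "region")
    .save("s3://lakehouse/events"))

# Time travel: read the table as it looked at an earlier version.
previous = (spark.read.format("delta")
    .option("versionAsOf", 0)
    .load("s3://lakehouse/events"))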


Example Diagram

Raw Data → Data Lake (Delta / Iceberg / Hudi)
                    ↓
             Spark Processing
                    ↓
          BI Tools & ML Models

Trade-offs & Reasoning

●​ Delta Lake makes querying object storage reliable but increases storage overhead due
to transaction logs.​

●​ Iceberg excels at large table scalability with hidden partitioning.​

●​ Lakehouses reduce cost and complexity by avoiding the need to maintain separate data lakes and warehouses.​

Security & Compliance

●​ Encrypt data at rest.​

●​ Apply fine-grained permissions on table access.​

●​ Use data masking for sensitive columns (PII).​

Interview Talking Points:

●​ Explain differences between Delta Lake, Iceberg, and Hudi.​

●​ Discuss how lakehouses handle schema evolution.​

●​ Describe BI query performance optimizations (e.g. Z-Ordering).​


8. Data Mesh Architecture
Prompt:​

Design a data platform using Data Mesh principles for a large enterprise.

Real-World Context:​

Traditional centralized data lakes can become bottlenecks as companies grow. Data Mesh
pushes data ownership to domain teams while maintaining governance and discoverability.

Clarifying Questions

●​ How many domains exist in the organization?​

●​ How is data governance handled centrally?​

●​ Real-time vs batch data products?​

●​ Existing data catalog solutions in place?​

Functional Requirements

●​ Allow domain teams to own and publish data products​

●​ Enable data discovery through a central catalog​

●​ Define standard contracts for data products​

●​ Support both batch and streaming products​

Non-Functional Requirements

●​ Scalability for many domains​

●​ Secure sharing of data across teams​

●​ Cost transparency and tracking​




Architecture Overview

●​ Data Products:​
Each domain publishes data products with well-defined contracts (schemas, SLAs, etc.).​

●​ Catalog:​
Central catalog (DataHub, Amundsen) indexes all data products for discovery.​

●​ APIs:​
Standardized APIs for consuming data products.​

●​ Streaming Backbone:​
Kafka may be used for cross-domain event sharing.​

Low-Level Design

●​ Domains maintain their own pipelines and storage.​

●​ Catalog crawlers ingest metadata from each domain’s systems.​

●​ Usage metrics help identify popular products.​
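
To make the data-product contracts above concrete, here is an illustrative contract expressed as a plain Python dictionary that a catalog crawler could ingest; every field value is hypothetical.

# Hypothetical data-product contract published by the "checkout" domain.
order_events_contract = {
    "product": "checkout.order_events",
    "owner_domain": "checkout",
    "schema": [
        {"name": "order_id",  "type": "string",        "nullable": False},
        {"name": "amount",    "type": "decimal(10,2)", "nullable": False},
        {"name": "placed_at", "type": "timestamp",     "nullable": False},
    ],
    "sla": {"freshness_minutes": 60, "availability": "99.9%"},
    "pii_columns": [],
    "delivery": {"type": "kafka_topic", "name": "checkout.order-events.v1"},
}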

Example Diagram

Domain A Data Product ─┐
                       ├→ Central Catalog → Consumers via APIs
Domain B Data Product ─┘
Trade-offs & Reasoning

●​ Data Mesh improves agility but requires strong governance.​

●​ Catalogs prevent chaos by centralizing discovery.​

●​ Standard APIs reduce integration headaches.​

Security & Compliance

●​ Federated IAM policies across domains.​

●​ Auditable metadata changes.​

●​ Data masking standards enforced across all products.​

Interview Talking Points:

●​ Discuss how contracts help avoid data quality issues.​

●​ Explain governance challenges in a decentralized model.​

●​ Describe how to handle domain-level outages.​


9. Near Real-Time Deduplication and Idempotent Writes
Prompt:​

Design a streaming pipeline that ingests events that may contain duplicates, ensuring
downstream systems receive each event only once.

Real-World Context:​

Many systems send the same event multiple times for reliability. However, duplicate events can
lead to overcounting or corrupted analytics if not handled carefully.

Clarifying Questions

●​ How often do duplicates occur?​

●​ Are duplicate events exactly identical or slightly different?​

●​ How long can duplicates arrive after the original?​

●​ What is the required latency for downstream consumers?​

Functional Requirements

●​ Detect and eliminate duplicate events.​

●​ Maintain consistent state across restarts.​

●​ Handle very high event volumes.​

Non-Functional Requirements

●​ Low-latency deduplication.​

●​ Scalable state storage for large event volumes.​

●​ Robust fault recovery.​


Architecture Overview

●​ Ingestion:​
Kafka ingests events from producers.​

●​ Processing:​
Flink or Spark Streaming uses keyed state to track recently seen event IDs.​

●​ Storage:​
Deduplicated events are stored in a warehouse or operational DB.​

●​ Idempotent Writes:​
Downstream writes use UPSERT semantics to avoid duplication.​

Low-Level Design

●​ Maintain a cache or state table keyed on unique event IDs.​

●​ Watermark logic clears old entries to limit memory usage.​

●​ Writes to downstream systems must support idempotency.​
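
A minimal Structured Streaming sketch of the deduplication step above: a watermark bounds how late a duplicate may arrive, so keyed state on the event ID can be expired. The topic, schema, and sink paths are assumptions; newer Spark releases (3.5+) also offer dropDuplicatesWithinWatermark for this pattern.

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col

spark = SparkSession.builder.appName("dedup-stream").getOrCreate()

schema = "event_id STRING, user_id STRING, event_time TIMESTAMP, payload STRING"

events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "raw-events")                     # assumed topic
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*"))

deduped = (events
    .withWatermark("event_time", "30 minutes")             # how late a duplicate may arrive
    .dropDuplicates(["event_id", "event_time"]))           # keyed state expired past the watermark

(deduped.writeStream
    .format("parquet")                                     # placeholder downstream sink
    .option("path", "s3://clean-events/")
    .option("checkpointLocation", "s3://bucket/checkpoints/dedup")
    .start())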

Example Diagram

Client → Kafka → Flink/Spark → Deduplicated Events → Data Warehouse

Trade-offs & Reasoning

●​ Stateful deduplication requires significant memory.​

●​ Flink provides low-latency deduplication but adds complexity.​

●​ Idempotent writes simplify recovery after failures.​


Security & Compliance

●​ Encrypt data streams.​

●​ Apply RBAC on streaming jobs.​

●​ Audit deduplication logic for correctness.​

Interview Talking Points:

●​ Explain time windows for deduplication.​

●​ Discuss state storage strategies in stream processors.​

●​ Describe how idempotency avoids duplicates in downstream systems.​


10. Scalable Data Catalog and Metadata Store
Prompt:​

Design a data catalog system where engineers and analysts can discover datasets, track
lineage, and manage metadata at scale.

Real-World Context:​

Modern enterprises have thousands of datasets spread across tools like Hive, Snowflake, and
Airflow. Without a data catalog, teams struggle to find, trust, and reuse existing data.

Clarifying Questions

●​ How many datasets and tables are we cataloging (thousands, millions)?​

●​ Do we need to track lineage in real-time?​

●​ Types of metadata stored (schemas, business definitions, data quality, tags)?​

●​ Integration with which tools (Airflow, Spark, Snowflake, etc.)?​

Functional Requirements

●​ Centralized searchable catalog​

●​ Store technical and business metadata​

●​ Visualize data lineage​

●​ Expose APIs for programmatic queries​

●​ Track usage metrics and popularity​

Non-Functional Requirements

●​ Scalability for millions of assets​

●​ Fast search latency​

●​ Fine-grained security and access control​


Architecture Overview

●​ Metadata Store:​
Use graph DBs like Neo4j for lineage, or Elasticsearch for fast text search.​

●​ Ingestion:​
Metadata crawlers periodically scan data sources for schemas, tables, and pipelines.​

●​ APIs and UI:​
Expose RESTful APIs and build a frontend UI for search and lineage exploration.​

Low-Level Design

●​ Store entities as nodes and relationships as edges in a graph DB.​

●​ Load metrics like data usage into a time-series DB for popularity tracking.​

●​ Integrate crawlers with orchestration tools like Airflow for automatic updates.​
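
A hedged sketch of recording lineage as nodes and edges with the Neo4j Python driver; the Dataset/FEEDS graph model, connection details, and dataset names are assumptions.

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "***"))

def record_lineage(tx, upstream, downstream, job):
    # Datasets are nodes; a FEEDS edge records which job produced the dependency.
    tx.run(
        """
        MERGE (u:Dataset {name: $upstream})
        MERGE (d:Dataset {name: $downstream})
        MERGE (u)-[:FEEDS {job: $job}]->(d)
        """,
        upstream=upstream, downstream=downstream, job=job,
    )

with driver.session() as session:
    # execute_write (write_transaction in older driver versions) retries on failure.
    session.execute_write(record_lineage,
                          "raw.orders", "analytics.daily_revenue", "daily_revenue_etl")
driver.close()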

Example Diagram

Data Sources → Crawlers → Metadata Store → API → UI
                              ↓
                        Lineage Graph

Trade-offs & Reasoning

●​ Graph databases are well suited to relationship queries such as lineage traversal.​

●​ Elasticsearch improves free-text search but requires duplicating metadata into its index.​

●​ Crawlers must balance freshness vs load on data sources.​


Security & Compliance

●​ Role-based access to metadata.​

●​ Mask sensitive information in metadata (e.g. table comments with PII).​

●​ Audit logs for metadata changes.​

Interview Talking Points:

●​ Describe graph models for representing data lineage.​

●​ Explain how you’d handle schema changes in catalogs.​

●​ Discuss how to implement usage-based ranking of assets.​


11. Machine Learning Feature Store Design
Prompt:​

Design a feature store to centralise machine learning features for reuse across different models
and teams.

Real-World Context:​

Without a feature store, data scientists often create the same features repeatedly, leading to
inconsistent results between training and production. A feature store improves efficiency and
model reliability.

Clarifying Questions

●​ Real-time or batch feature serving required?​

●​ Expected size and number of features?​

●​ How should features be versioned?​

●​ Integration with which ML platforms?​

Functional Requirements

●​ Central repository of features​

●​ Consistent feature values for training and online inference​

●​ Versioning and lineage tracking​

●​ Ability to share features across teams​

Non-Functional Requirements

●​ Low latency for online serving​

●​ Scalability for hundreds or thousands of features​

●​ Data governance and auditing​


Architecture Overview

●​ Offline Store:​
Data warehouse (e.g. Snowflake, BigQuery) stores large feature tables for training.​

●​ Online Store:​
Low-latency databases like Redis or DynamoDB serve real-time features for production
inference.​

●​ Processing:​
Spark pipelines calculate and store new feature values.​

●​ Serving Layer:​
REST APIs expose features to models in real-time.​

Low-Level Design

●​ Maintain feature definitions in a metadata catalog.​

●​ Batch compute features daily and write to warehouse.​

●​ Real-time compute handled by streaming jobs (e.g. Flink).​

●​ Implement TTL (time-to-live) for online store data.​
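
A minimal sketch of the dual write described above: batch features land in the offline store (Parquet here as a stand-in) and the latest values are pushed to Redis with a TTL for online serving. The key format, feature names, and paths are assumptions.

import json
import redis
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, count

spark = SparkSession.builder.appName("user-features").getOrCreate()
orders = spark.read.parquet("s3://lakehouse/orders/")        # assumed source

features = (orders.groupBy("user_id")
    .agg(count("*").alias("orders_30d"),
         avg("amount").alias("avg_order_value")))

# Offline store: full history for training.
features.write.mode("overwrite").parquet("s3://feature-store/user_features/dt=2024-01-01/")

# Online store: latest values for low-latency inference, with a 24-hour TTL.
r = redis.Redis(host="localhost", port=6379)                 # assumed online store
for row in features.toLocalIterator():
    key = f"user_features:{row['user_id']}"
    value = {"orders_30d": int(row["orders_30d"]),
             "avg_order_value": float(row["avg_order_value"])}
    r.set(key, json.dumps(value), ex=86400)                  # TTL in seconds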

Example Diagram

Raw Data → Spark → Feature Store (Offline & Online)
                            ↓
                        ML Models

Trade-offs & Reasoning

●​ Offline stores are cost-efficient but slow for real-time predictions.​

●​ Online stores cost more but reduce inference latency.​

●​ Versioning avoids training-serving skew.​


Security & Compliance

●​ RBAC on who can access features.​

●​ Lineage tracking for auditability.​

●​ Masking of sensitive feature values.​

Interview Talking Points:

●​ Explain how to avoid feature drift.​

●​ Discuss handling time-travel queries for training consistency.​

●​ Describe the trade-offs between online and offline stores.​


12. IoT Sensor Data Platform
Prompt:​

Design a platform to ingest, store, and analyse temperature sensor data from 100,000 IoT
devices across India, supporting real-time alerts for extreme readings.

Real-World Context:​

IoT applications in manufacturing, agriculture, and smart cities generate vast amounts of sensor
data that must be processed in real time to raise critical alerts.

Clarifying Questions

●​ How frequently does each sensor send data? (e.g. once a second or once a minute?)​

●​ Are alerts needed in real time or can they be delayed slightly?​

●​ How long should raw data be retained?​

●​ Is there location metadata included?​

Functional Requirements

●​ High-frequency data ingestion​

●​ Real-time processing to detect extreme readings​

●​ Store raw data for historical analysis​

●​ Provide dashboards for time-series trends​

Non-Functional Requirements

●​ Low latency for critical alerts​

●​ Cost-effective storage​

●​ Fault tolerance for device outages​


Architecture Overview

●​ Ingestion:​
Use an MQTT broker for lightweight IoT communication. Kafka may be used if messages
are larger or arrive at high frequency.​

●​ Processing:​
Flink processes incoming streams, applies threshold rules, and triggers alerts for
extreme values.​

●​ Storage:​

○​ Raw data → S3 or GCS for cost-effective archival​

○​ Aggregated data → Time-series DB like TimescaleDB or InfluxDB​

●​ Alerting:​
Serverless functions (like AWS Lambda) send notifications (SMS, email, etc.).​

●​ Visualization:​
Grafana dashboards for time-series trends and anomalies.​

Low-Level Design

●​ Flink uses keyed streams to separate processing per device.​

●​ Maintain stateful computations for rolling averages.​

●​ Define alert thresholds dynamically via a config service.​
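
The design above uses Flink for the keyed, stateful processing; as a simplified single-process stand-in, this hedged sketch consumes readings from Kafka with kafka-python and applies a per-device rolling average against a threshold. The topic name, payload format, and threshold are assumptions.

import json
from collections import defaultdict, deque
from kafka import KafkaConsumer

TEMP_THRESHOLD_C = 60.0                                  # assumed alert threshold
recent = defaultdict(lambda: deque(maxlen=10))           # last 10 readings per device

consumer = KafkaConsumer(
    "sensor-temperature",                                # assumed topic
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for msg in consumer:
    reading = msg.value                                  # e.g. {"device_id": ..., "temp_c": ...}
    window = recent[reading["device_id"]]
    window.append(reading["temp_c"])
    rolling_avg = sum(window) / len(window)
    if rolling_avg >= TEMP_THRESHOLD_C:
        # In the real pipeline this would call a notification service (SMS/email).
        print(f"ALERT: device {reading['device_id']} rolling avg {rolling_avg:.1f}°C")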

Example Diagram

       +------------+
IoT →  | MQTT/Kafka | → Flink → Alerts
       +------------+      ↓
                      +-----------+
                      | S3 (Raw)  |
                      +-----------+

                 TimescaleDB → Grafana
Trade-offs & Reasoning

●​ MQTT is efficient for IoT devices with limited bandwidth.​

●​ Flink provides low-latency, stateful stream processing.​

●​ Time-series DBs are ideal for fast queries on sensor data.​

Security & Compliance

●​ TLS for secure communication between devices and brokers.​

●​ Device-level authentication via tokens or certificates.​

●​ Access control on dashboards for sensitive operational data.​

Interview Talking Points:

●​ Explain why MQTT is preferred over HTTP for IoT.​

●​ Discuss how to handle network partitions affecting real-time alerts.​

●​ Describe cost trade-offs for keeping all raw data vs aggregates.​
