Data Engineering System Design

This document includes 12 detailed system design scenarios frequently asked in Data
Engineering interviews. Each scenario explains:

●​ The problem and real-world context​

●​ Clarifying questions to ask in an interview​

●​ Functional and non-functional requirements​

●​ A detailed architecture proposal​

●​ Trade-offs and technology choices​

●​ Security and compliance considerations​

●​ Interview talking points​

Use this to prepare thorough answers that showcase both technical knowledge and critical
thinking.
1. Real-Time User Activity Analytics (Clickstream
Processing)
Prompt:​

Design a system to process website or app clickstream data in near real-time to compute
metrics like active users, session durations, funnels, and trends.

Real-World Context:​

E-commerce companies, news portals, and streaming services all rely on clickstream data to
understand user behaviour. These insights help drive recommendations, detect anomalies, and
improve user engagement.

Clarifying Questions

●​ What’s the peak event rate (e.g. 10,000 events/sec or 1 million)?​

●​ Is latency critical (e.g. dashboards updated every 10 sec vs. every 5 min)?​

●​ Do we need to store raw events for historical analysis or only aggregates?​

●​ Are there privacy regulations like GDPR requiring data masking?​

●​ Which BI tools will the analysts use?​

Functional Requirements

●​ Handle high-volume event ingestion​

●​ Compute sliding window metrics (e.g. active users every 5 minutes)​

●​ Sessionise user events based on timeouts​

●​ Store both raw logs and aggregated metrics​

●​ Provide API or warehouse for dashboard queries​


Non-Functional Requirements

●​ Scalability for traffic spikes (e.g. during sales)​

●​ Fault tolerance so data isn’t lost during outages​

●​ Cost efficiency for cloud services​

●​ Privacy compliance (GDPR, CCPA)​

Architecture Overview

●​ Ingestion:​
Use Kafka or Pub/Sub to capture events from web/mobile clients. Kafka can handle
large event volumes and supports message durability.​

●​ Processing:​
Use Flink or Spark Structured Streaming for real-time processing. These frameworks
support windowing functions and can join streams for enrichment.​

●​ Storage:​

○​ Store raw logs in S3 or GCS in Parquet format, partitioned by date and hour.​

○​ Store aggregates (like active users, bounce rate) in a data warehouse like
BigQuery, Snowflake, or Redshift for fast querying.​

●​ Visualization:​
Connect BI tools like Looker, Tableau, or Superset to the warehouse for dashboards.​

Low-Level Design

●​ Implement sessionization logic that groups events by user_id. A typical session timeout
is 30 minutes of inactivity.​

●​ Use sliding window aggregations to track metrics like active users in the last 5 minutes.​

●​ Enrich data with geo-IP information using lookups or joins in the stream.​
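
As an illustration of the windowing described above, here is a minimal PySpark Structured Streaming sketch that computes active users over a 5-minute sliding window. The Kafka broker address, topic name, event schema, and checkpoint path are assumptions, not prescriptions.

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col, window, approx_count_distinct
from pyspark.sql.types import StructType, StringType, TimestampType

spark = SparkSession.builder.appName("clickstream-metrics").getOrCreate()

event_schema = (StructType()
    .add("user_id", StringType())
    .add("event_type", StringType())
    .add("event_time", TimestampType()))

events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")      # assumed broker
    .option("subscribe", "clickstream-events")            # assumed topic
    .load()
    .select(from_json(col("value").cast("string"), event_schema).alias("e"))
    .select("e.*"))

# Active users per 5-minute window, sliding every minute; events arriving more
# than 10 minutes late are dropped via the watermark.
active_users = (events
    .withWatermark("event_time", "10 minutes")
    .groupBy(window(col("event_time"), "5 minutes", "1 minute"))
    .agg(approx_count_distinct("user_id").alias("active_users")))

(active_users.writeStream
    .outputMode("update")
    .format("console")                                    # replace with a warehouse sink
    .option("checkpointLocation", "s3://bucket/checkpoints/active-users")
    .start())
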
Example Diagram

Client → Kafka → Flink/Spark → S3 (Raw Logs)
                     ↓
       BigQuery / Snowflake → BI Tools

Trade-offs & Reasoning

●​ Kafka vs Pub/Sub: Kafka offers finer control over performance tuning and durability but
requires cluster management, whereas Pub/Sub is fully managed.​

●​ Flink has lower latency and better state management than Spark, but Spark is more
widely adopted and easier to deploy.​

●​ Using Parquet reduces storage costs and improves scan speeds in warehouses.​

Security & Compliance

●​ Encrypt events in transit.​

●​ Mask PII fields like email or IP before writing to warehouses.​

●​ Implement fine-grained permissions for warehouse queries.​

Interview Talking Points:

●​ Describe windowing and sessionization logic in detail.​

●​ Explain how you’d handle spikes in traffic.​

●​ Discuss why raw data is retained for backfills or new analytics.​


2. Batch ETL Pipeline for E-commerce
Prompt:​

Design a daily ETL pipeline that processes CSV files uploaded to S3, transforms them, and
loads them into Snowflake for analytics.

Real-World Context:​

E-commerce platforms often exchange data with partners in CSV format. The data must be
cleaned, validated, and integrated into internal warehouses for reporting and fraud analysis.

Clarifying Questions

●​ Daily data size (e.g. 10 GB or 1 TB)?​

●​ Complexity of transformations — simple cleaning or multi-table joins?​

●​ Required SLA for pipeline completion?​

●​ Compliance requirements for data audits?​

Functional Requirements

●​ Automatically detect and ingest new CSV files from S3​

●​ Transform data (clean nulls, cast data types, join reference tables)​

●​ Run data quality checks​

●​ Load data into Snowflake​

●​ Alert on failures​

Non-Functional Requirements

●​ Handle increasing file sizes without performance degradation​

●​ Maintain data quality and consistency​

●​ Provide observability through logging and metrics​


Architecture Overview

●​ Ingestion:​
External systems drop CSV files into an S3 bucket.​

●​ Orchestration:​
Airflow schedules and manages the ETL pipeline.​

●​ Processing:​
Use PySpark to transform data. Spark is highly efficient for large CSV parsing and can
handle distributed joins.​

●​ Data Quality:​
Implement Great Expectations or Amazon Deequ for data validation (e.g. null checks,
uniqueness constraints).​

●​ Warehouse Load:​
Load transformed data into Snowflake using bulk load commands like COPY INTO.​

Low-Level Design

●​ Design Spark jobs to handle:​

○​ Parsing large CSV files with defined schemas​

○​ Joins with lookup datasets (e.g. SKUs, pricing tables)​

○​ Null removal and data type casting​

●​ Implement Airflow alerting with email or Slack on job failures.​
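
A hedged PySpark sketch of the transform step above: parsing the CSV with an explicit schema, dropping rows with null keys, casting types, and joining a SKU lookup table. The file paths, column names, and schema are illustrative assumptions.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

spark = SparkSession.builder.appName("daily-orders-etl").getOrCreate()

order_schema = "order_id STRING, sku STRING, quantity INT, amount STRING, order_date STRING"

orders = (spark.read
    .option("header", "true")
    .schema(order_schema)
    .csv("s3://partner-drop/orders/dt=2024-01-01/"))      # assumed input path

skus = spark.read.parquet("s3://reference/skus/")          # assumed lookup dataset

cleaned = (orders
    .dropna(subset=["order_id", "sku"])                    # remove rows missing keys
    .withColumn("amount", col("amount").cast("double"))
    .withColumn("order_date", to_date(col("order_date"), "yyyy-MM-dd"))
    .join(skus, on="sku", how="left"))                     # enrich with SKU attributes

cleaned.write.mode("overwrite").parquet("s3://staging/orders/dt=2024-01-01/")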


Example Diagram

S3 → Airflow → Spark → Data Quality Checks → Snowflake

Trade-offs & Reasoning

●​ Spark scales better than Pandas for large datasets.​

●​ Snowflake handles semi-structured data and scales compute resources on demand.​

●​ Airflow allows complex DAGs with retries and SLA monitoring.​

Security & Compliance

●​ Use S3 bucket policies to restrict file uploads.​

●​ Implement Snowflake’s row-level security for sensitive customer data.​

Interview Talking Points:

●​ Explain why Spark is chosen over Pandas.​

●​ Discuss partitioning strategies in Snowflake.​

●​ Describe how you’d debug data pipeline failures.​


3. Change Data Capture (CDC) Pipeline
Prompt:​

Design a pipeline that captures changes from a transactional database and replicates them to
an analytics warehouse in near real-time.

Real-World Context:​

Many analytics systems rely on data from OLTP systems. Instead of bulk daily extracts, modern
pipelines use CDC to reflect changes quickly, enabling fresher insights.

Clarifying Questions

●​ Source database type (MySQL, Postgres, Oracle)?​

●​ How much data changes per hour?​

●​ How fresh must data in the warehouse be (seconds, minutes, hours)?​

●​ Any GDPR concerns regarding replicating sensitive data?​

Functional Requirements

●​ Detect inserts, updates, deletes in source DB​

●​ Replicate changes into target warehouse​

●​ Maintain consistency even during outages​

●​ Support schema changes gracefully​

Non-Functional Requirements

●​ Low replication lag​

●​ Scalability to high transaction volumes​

●​ Fault tolerance and recovery mechanisms​

●​ Auditing for compliance​


Architecture Overview

●​ CDC Capture:​
Debezium reads database logs and streams changes to Kafka.​

●​ Processing:​
Spark Structured Streaming or Flink reads Kafka topics, applies business logic, and
writes changes to the warehouse.​

●​ Warehouse Load:​
Load data into Snowflake, Redshift, or BigQuery using idempotent upserts.​

Low-Level Design

●​ Create separate Kafka topics per table.​

●​ Handle tombstone records (deletes) in downstream jobs.​

●​ Implement error handling for schema drift.​

●​ Use checkpointing in Spark to resume from failures.​
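
Below is a hedged Spark Structured Streaming sketch of the consumer side: it reads a Debezium topic, keeps only the latest change per key in each micro-batch, and separates deletes (tombstones) from upserts inside foreachBatch. The topic name, the simplified change schema, and the placeholder sink are assumptions; a real pipeline would issue MERGE/DELETE statements against the warehouse.

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col, row_number
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("cdc-orders").getOrCreate()

# Simplified Debezium envelope: op = c/u/d, ts_ms = change timestamp (assumed).
change_schema = "op STRING, ts_ms BIGINT, after STRUCT<order_id: STRING, status: STRING>"

changes = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "dbserver1.public.orders")        # assumed Debezium topic
    .load()
    .select(from_json(col("value").cast("string"), change_schema).alias("c"))
    .select("c.*"))

def upsert_batch(batch_df, batch_id):
    # Keep only the most recent change per order_id within this micro-batch.
    latest = (batch_df
        .withColumn("rn", row_number().over(
            Window.partitionBy(col("after.order_id")).orderBy(col("ts_ms").desc())))
        .filter(col("rn") == 1))
    upserts = latest.filter(col("op") != "d")
    deletes = latest.filter(col("op") == "d")
    # In a real pipeline: MERGE the upserts and DELETE the tombstoned keys in the
    # warehouse; a Parquet staging write stands in here.
    upserts.write.mode("append").parquet("s3://staging/orders_changes/")

(changes.writeStream
    .foreachBatch(upsert_batch)
    .option("checkpointLocation", "s3://bucket/checkpoints/cdc-orders")
    .start())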

Example Diagram

      +----------+    +-------+    +-------+    +-----------+
DB →  | Debezium | →  | Kafka | →  | Spark | →  | Snowflake |
      +----------+    +-------+    +-------+    +-----------+

Trade-offs & Reasoning

●​ Debezium simplifies log parsing but requires access to database logs.​

●​ Streaming CDC reduces warehouse latency but adds operational complexity.​

●​ Idempotent writes are critical to avoid duplicate rows.​


Security & Compliance

●​ Encrypt CDC streams.​

●​ Mask PII in data flows before warehouse ingestion.​

●​ Implement warehouse access controls.​

Interview Talking Points:

●​ Explain how Debezium handles schema changes.​

●​ Discuss exactly-once vs at-least-once semantics.​

●​ Describe how you’d backfill historical data if needed.​


4. Data Quality Monitoring Architecture
Prompt:​

Design a system to monitor and enforce data quality across data pipelines, catching issues like
null rates, schema drift, and outliers.

Real-World Context:​

Without data quality checks, pipelines silently propagate bad data, leading to broken reports and
machine learning failures. Companies increasingly require automated monitoring and alerts for
data quality.

Clarifying Questions

●​ Which types of data checks are required (e.g. nulls, uniqueness, statistical tests)?​

●​ Batch vs streaming pipelines to monitor?​

●​ How many datasets need monitoring?​

●​ Preferred alerting channels (Slack, PagerDuty)?​

Functional Requirements

●​ Validate data with configurable rules​

●​ Visualize historical quality trends​

●​ Alert engineers on failures​

●​ Maintain audit logs for compliance​

Non-Functional Requirements

●​ Scalable to thousands of datasets​

●​ Minimal overhead for streaming checks​

●​ Centralized monitoring dashboards​


Architecture Overview

●​ Validation Tools:​
Use Great Expectations, Deequ, or custom checks.​

●​ Orchestration:​
Integrate checks into Airflow or pipeline frameworks.​

●​ Storage:​
Time-series DB (e.g. Prometheus, InfluxDB) stores metrics over time.​

●​ Alerts:​
Notify teams via Slack, email, or incident tools.​

Low-Level Design

●​ Define validation rules as YAML or JSON config files.​

●​ Store validation results and metrics per table.​

●​ Track trends over time to detect gradual data drift.​
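
A minimal sketch of config-driven checks of the kind listed above, implemented with plain PySpark rather than a specific validation library; the rule format, dataset path, and thresholds are assumptions.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("dq-checks").getOrCreate()
df = spark.read.parquet("s3://staging/orders/dt=2024-01-01/")   # assumed dataset

# Rules would normally live in a YAML/JSON file under version control.
rules = [
    {"column": "order_id", "max_null_ratio": 0.0},
    {"column": "amount",   "max_null_ratio": 0.01},
]

total = df.count()
results = []
for rule in rules:
    nulls = df.filter(col(rule["column"]).isNull()).count()
    ratio = nulls / total if total else 0.0
    results.append({"column": rule["column"],
                    "null_ratio": ratio,
                    "passed": ratio <= rule["max_null_ratio"]})

failed = [r for r in results if not r["passed"]]
if failed:
    # A real pipeline would also push metrics to the metrics store and alert.
    raise ValueError(f"Data quality checks failed: {failed}")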

Example Diagram

Data Pipelines → Validation Engine → Metrics Store → Dashboards
                        ↓
                     Alerts

Trade-offs & Reasoning

●​ Deep statistical checks catch subtle errors but add latency.​

●​ Validation on streaming data is more complex than batch.​

●​ Central dashboards help prioritize which pipelines need attention.​


Security & Compliance

●​ Mask sensitive fields in logs.​

●​ Enforce access controls on dashboards.​

●​ Maintain validation run history for audits.​

Interview Talking Points:

●​ Explain differences between batch and streaming data quality checks.​

●​ Discuss how to measure and track data drift.​

●​ Describe error-handling strategies for quality failures.​


5. Data Pipeline Cost Optimization
Prompt:​

Design a system or strategy to reduce costs in an existing data pipeline that uses cloud services
like Spark, Snowflake, and S3.

Real-World Context:​

As data volumes grow, cloud costs can skyrocket. Companies often seek ways to balance
performance with cost-efficiency.

Clarifying Questions

●​ Which parts of the pipeline are the costliest?​

●​ Any hard SLAs for data freshness?​

●​ Batch or streaming workloads?​

●​ Budget constraints vs performance goals?​

Functional Requirements

●​ Analyze pipeline costs​

●​ Recommend cost-saving changes​

●​ Implement optimizations without breaking SLAs​

Non-Functional Requirements

●​ Scalability to large workloads​

●​ Transparent reporting to stakeholders​

●​ Minimal performance degradation​


Architecture Overview

●​ Cost Monitoring:​
Integrate cloud billing APIs to collect cost data.​

●​ Optimization Engine:​
Analyze query patterns, data sizes, and job runtimes.​

●​ Techniques:​

○​ Partition pruning​

○​ Data clustering​

○​ Auto-scaling Spark clusters​

○​ Warehouse query rewriting​

○​ Data retention and tiering policies​

○​ Off-peak scheduling​

●​ Dashboards:​
Visualize cost trends and potential savings.​

Low-Level Design

●​ Airflow tasks extract cost metrics daily.​

●​ Identify long-running queries in Snowflake using query history.​

●​ Set warehouse auto-suspend timeouts to avoid idle compute charges.​
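
As a hedged example of the query-history analysis mentioned above, this snippet pulls the slowest queries of the last day from Snowflake's ACCOUNT_USAGE.QUERY_HISTORY view using the Python connector; the connection parameters are placeholders.

import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",            # placeholder credentials
    user="cost_monitor",
    password="***",
    warehouse="ADMIN_WH",
)

sql = """
    SELECT query_id, warehouse_name, total_elapsed_time / 1000 AS seconds
    FROM snowflake.account_usage.query_history
    WHERE start_time >= DATEADD('day', -1, CURRENT_TIMESTAMP())
    ORDER BY total_elapsed_time DESC
    LIMIT 20
"""

cur = conn.cursor()
try:
    cur.execute(sql)
    for query_id, warehouse, seconds in cur:
        print(f"{warehouse}: query {query_id} ran for {seconds:.0f}s")
finally:
    cur.close()
    conn.close()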


Example Diagram

Data Pipelines → Cost Metrics Collector → Dashboard
                          ↓
                 Optimization Engine

Trade-offs & Reasoning

●​ Aggressive cost cuts might increase job runtimes.​

●​ Tiering older data to cheaper storage saves money but complicates retrieval for infrequent queries.​

●​ Cost monitoring itself introduces some overhead.​

Security & Compliance

●​ Restrict financial data visibility to appropriate teams.​

●​ Log optimization changes for compliance reviews.​

Interview Talking Points:

●​ Discuss specific Spark and Snowflake tuning techniques.​

●​ Describe cost-visibility tools like cloud cost explorers.​

●​ Explain how to balance cost savings with data SLAs.


6. Cross-Region Data Replication
Prompt:​

Design a system to replicate user activity data between US and EU regions, enabling global
analytics while remaining GDPR-compliant.

Real-World Context:​

Many global platforms operate in multiple regions and must keep data available for global
reporting while respecting laws like GDPR, which restricts where personal data can be stored.

Clarifying Questions

●​ How consistent does data need to be across regions (seconds vs minutes delay)?​

●​ Is all data replicated or just specific tables?​

●​ Are there specific GDPR rules about data crossing regions?​

●​ Is there a preference for batch or streaming replication?​

Functional Requirements

●​ Ingest and process data locally in each region​

●​ Replicate selected datasets to a central location​

●​ Aggregate data for global reporting​

●​ Maintain data residency for sensitive user data​

Non-Functional Requirements

●​ Minimal replication lag​

●​ Scalability as traffic grows​

●​ Resilience to regional outages​

●​ Compliance with data privacy laws​


Architecture Overview

●​ Regional Processing:​
Kafka in each region ingests events locally.​

●​ Replication:​
Use MirrorMaker 2.0 (or Confluent Replicator) to copy topics across regions.​

●​ Processing:​
Spark jobs in each region process and store data in S3 buckets.​

●​ Aggregation:​
A centralized batch job merges data from regional buckets into a global warehouse for
analytics.​

Low-Level Design

●​ Tag all records with a region identifier for auditing.​

●​ Mask or anonymize personal data before sending across regions.​

●​ Configure MirrorMaker with topic-level filtering so only necessary data crosses regions.​

●​ Use daily batch jobs to aggregate global metrics.​
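
A sketch of the daily global aggregation job described above, tagging each record with its region and merging the regional buckets; bucket paths and column names are assumptions, and PII is presumed to have been masked upstream.

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, countDistinct

spark = SparkSession.builder.appName("global-activity-rollup").getOrCreate()

us = (spark.read.parquet("s3://activity-us/events/dt=2024-01-01/")   # assumed bucket
      .withColumn("region", lit("US")))
eu = (spark.read.parquet("s3://activity-eu/events/dt=2024-01-01/")   # assumed bucket
      .withColumn("region", lit("EU")))

daily = (us.unionByName(eu)
    .groupBy("region")
    .agg(countDistinct("user_id").alias("daily_active_users")))

# Load into the global warehouse via its Spark connector; Parquet stands in here.
daily.write.mode("overwrite").parquet("s3://global-metrics/dau/dt=2024-01-01/")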

Example Diagram

      +-------+    +-------+    +---------+
US →  | Kafka | →  | Spark | →  | S3 (US) |
      +-------+    +-------+    +---------+
          ⇅ MirrorMaker
      +-------+    +-------+    +---------+
EU →  | Kafka | →  | Spark | →  | S3 (EU) |
      +-------+    +-------+    +---------+

Aggregator Job → Global Data Warehouse
Trade-offs & Reasoning

●​ MirrorMaker offers straightforward replication but adds latency.​

●​ GDPR compliance means some data might stay regional only.​

●​ A centralized warehouse simplifies reporting but can become a bottleneck.​

Security & Compliance

●​ Encrypt all replicated data streams.​

●​ Restrict which topics can cross borders.​

●​ Audit logs for all cross-region transfers.​

Interview Talking Points:

●​ Describe how GDPR impacts architecture decisions.​

●​ Explain how you’d handle schema changes across replicated topics.​

●​ Discuss failover strategies if one region becomes unavailable.​

7. Data Lakehouse Architecture for Diverse Data
Prompt:​

Design a data lakehouse architecture that supports structured, semi-structured, and
unstructured data for analytics and machine learning.

Real-World Context:​

Modern enterprises deal with diverse data types—CSV files, JSON logs, images, videos, and
streaming data. Traditional warehouses struggle to store everything cost-effectively. The
lakehouse approach bridges the gap between flexible storage and reliable analytics.

Clarifying Questions

●​ Estimated data volume (TBs, PBs)?​

●​ Expected latency for analytics (near real-time vs overnight batch)?​

●​ Any regulatory restrictions (e.g. on PII storage)?​

●​ How many concurrent users for BI and ML workloads?​

Functional Requirements

●​ Unified storage for all data types​

●​ Support for ACID transactions for consistency​

●​ Efficient querying for analytics​

●​ Integration with BI tools and ML platforms​

●​ Schema evolution support​


Non-Functional Requirements

●​ Cost-efficient storage for large volumes​

●​ Scalability for both read and write workloads​

●​ Lineage tracking and governance​

●​ High durability and availability​

Architecture Overview

●​ Storage:​
Store data in cloud object storage (e.g. S3, ADLS Gen2).​

●​ Format:​
Use formats like Delta Lake, Iceberg, or Hudi to provide ACID transactions, schema
evolution, and time-travel features.​

●​ Processing:​
Spark or Databricks handles both batch and streaming workloads.​

●​ Catalog:​
Use Hive Metastore, AWS Glue, or Unity Catalog to maintain metadata.​

●​ Visualization:​
Connect tools like Power BI, Looker, or Tableau directly to lakehouse tables for
analytics.​

Low-Level Design

●​ Partition data by business-relevant keys (e.g. date, region, product_id).​

●​ Leverage Delta Lake’s time travel to recover previous versions.​

●​ Optimize file sizes to reduce small-file issues.​
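
A hedged Delta Lake sketch of the partitioning and time-travel points above (requires the delta-spark package); the paths, partition columns, and version number are assumptions.

from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("lakehouse-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate())

events = spark.read.parquet("s3://landing/events/")          # assumed raw input

# Write a Delta table partitioned by business-relevant keys.
(events.write.format("delta")
    .mode("overwrite")
    .partitionBy("event_date", "region")
    .save("s3://lakehouse/events"))

# Time travel: read the table as it looked at an earlier version.
previous = (spark.read.format("delta")
    .option("versionAsOf", 0)
    .load("s3://lakehouse/events"))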


Example Diagram

Raw Data → Data Lake (Delta / Iceberg / Hudi)
                    ↓
             Spark Processing
                    ↓
          BI Tools & ML Models

Trade-offs & Reasoning

●​ Delta Lake makes querying object storage reliable but increases storage overhead due
to transaction logs.​

●​ Iceberg excels at large table scalability with hidden partitioning.​

●​ Lakehouses reduce cost and complexity by avoiding the need to maintain separate data lakes and warehouses.​

Security & Compliance

●​ Encrypt data at rest.​

●​ Apply fine-grained permissions on table access.​

●​ Use data masking for sensitive columns (PII).​

Interview Talking Points:

●​ Explain differences between Delta Lake, Iceberg, and Hudi.​

●​ Discuss how lakehouses handle schema evolution.​

●​ Describe BI query performance optimizations (e.g. Z-Ordering).​


8. Data Mesh Architecture
Prompt:​

Design a data platform using Data Mesh principles for a large enterprise.

Real-World Context:​

Traditional centralized data lakes can become bottlenecks as companies grow. Data Mesh
pushes data ownership to domain teams while maintaining governance and discoverability.

Clarifying Questions

●​ How many domains exist in the organization?​

●​ How is data governance handled centrally?​

●​ Real-time vs batch data products?​

●​ Existing data catalog solutions in place?​

Functional Requirements

●​ Allow domain teams to own and publish data products​

●​ Enable data discovery through a central catalog​

●​ Define standard contracts for data products​

●​ Support both batch and streaming products​

Non-Functional Requirements

●​ Scalability for many domains​

●​ Secure sharing of data across teams​

●​ Cost transparency and tracking​




Architecture Overview

●​ Data Products:​
Each domain publishes data products with well-defined contracts (schemas, SLAs, etc.).​

●​ Catalog:​
Central catalog (DataHub, Amundsen) indexes all data products for discovery.​

●​ APIs:​
Standardized APIs for consuming data products.​

●​ Streaming Backbone:​
Kafka may be used for cross-domain event sharing.​

Low-Level Design

●​ Domains maintain their own pipelines and storage.​

●​ Catalog crawlers ingest metadata from each domain’s systems.​

●​ Usage metrics help identify popular products.​
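
To make the data-product contracts above concrete, here is an illustrative contract expressed as a plain Python dictionary that a catalog crawler could ingest; every field value is hypothetical.

# Hypothetical data-product contract published by the "checkout" domain.
order_events_contract = {
    "product": "checkout.order_events",
    "owner_domain": "checkout",
    "schema": [
        {"name": "order_id",  "type": "string",        "nullable": False},
        {"name": "amount",    "type": "decimal(10,2)", "nullable": False},
        {"name": "placed_at", "type": "timestamp",     "nullable": False},
    ],
    "sla": {"freshness_minutes": 60, "availability": "99.9%"},
    "pii_columns": [],
    "delivery": {"type": "kafka_topic", "name": "checkout.order-events.v1"},
}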

Example Diagram

Domain A Data Product ─┐
                       ├→ Central Catalog → Consumers via APIs
Domain B Data Product ─┘
Trade-offs & Reasoning

●​ Data Mesh improves agility but requires strong governance.​

●​ Catalogs prevent chaos by centralizing discovery.​

●​ Standard APIs reduce integration headaches.​

Security & Compliance

●​ Federated IAM policies across domains.​

●​ Auditable metadata changes.​

●​ Data masking standards enforced across all products.​

Interview Talking Points:

●​ Discuss how contracts help avoid data quality issues.​

●​ Explain governance challenges in a decentralized model.​

●​ Describe how to handle domain-level outages.​


9. Near Real-Time Deduplication and Idempotent Writes
Prompt:​

Design a streaming pipeline that ingests events that may contain duplicates, ensuring
downstream systems receive each event only once.

Real-World Context:​

Many systems send the same event multiple times for reliability. However, duplicate events can
lead to overcounting or corrupted analytics if not handled carefully.

Clarifying Questions

●​ How often do duplicates occur?​

●​ Are duplicate events exactly identical or slightly different?​

●​ How long can duplicates arrive after the original?​

●​ What is the required latency for downstream consumers?​

Functional Requirements

●​ Detect and eliminate duplicate events.​

●​ Maintain consistent state across restarts.​

●​ Handle very high event volumes.​

Non-Functional Requirements

●​ Low-latency deduplication.​

●​ Scalable state storage for large event volumes.​

●​ Robust fault recovery.​


Architecture Overview

●​ Ingestion:​
Kafka ingests events from producers.​

●​ Processing:​
Flink or Spark Streaming uses keyed state to track recently seen event IDs.​

●​ Storage:​
Deduplicated events are stored in a warehouse or operational DB.​

●​ Idempotent Writes:​
Downstream writes use UPSERT semantics to avoid duplication.​

Low-Level Design

●​ Maintain a cache or state table keyed on unique event IDs.​

●​ Watermark logic clears old entries to limit memory usage.​

●​ Writes to downstream systems must support idempotency.​
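
A minimal Structured Streaming sketch of the deduplication step above: a watermark bounds how late a duplicate may arrive, so keyed state on the event ID can be expired. The topic, schema, and sink paths are assumptions; newer Spark releases (3.5+) also offer dropDuplicatesWithinWatermark for this pattern.

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col

spark = SparkSession.builder.appName("dedup-stream").getOrCreate()

schema = "event_id STRING, user_id STRING, event_time TIMESTAMP, payload STRING"

events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "raw-events")                     # assumed topic
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*"))

deduped = (events
    .withWatermark("event_time", "30 minutes")             # how late a duplicate may arrive
    .dropDuplicates(["event_id", "event_time"]))           # keyed state expired past the watermark

(deduped.writeStream
    .format("parquet")                                     # placeholder downstream sink
    .option("path", "s3://clean-events/")
    .option("checkpointLocation", "s3://bucket/checkpoints/dedup")
    .start())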

Example Diagram

Client → Kafka → Flink/Spark → Deduplicated Events → Data Warehouse

Trade-offs & Reasoning

●​ Stateful deduplication requires significant memory.​

●​ Flink provides low-latency deduplication but adds complexity.​

●​ Idempotent writes simplify recovery after failures.​


Security & Compliance

●​ Encrypt data streams.​

●​ Apply RBAC on streaming jobs.​

●​ Audit deduplication logic for correctness.​

Interview Talking Points:

●​ Explain time windows for deduplication.​

●​ Discuss state storage strategies in stream processors.​

●​ Describe how idempotency avoids duplicates in downstream systems.​


10. Scalable Data Catalog and Metadata Store
Prompt:​

Design a data catalog system where engineers and analysts can discover datasets, track
lineage, and manage metadata at scale.

Real-World Context:​

Modern enterprises have thousands of datasets spread across tools like Hive, Snowflake, and
Airflow. Without a data catalog, teams struggle to find, trust, and reuse existing data.

Clarifying Questions

●​ How many datasets and tables are we cataloging (thousands, millions)?​

●​ Do we need to track lineage in real-time?​

●​ Types of metadata stored (schemas, business definitions, data quality, tags)?​

●​ Integration with which tools (Airflow, Spark, Snowflake, etc.)?​

Functional Requirements

●​ Centralized searchable catalog​

●​ Store technical and business metadata​

●​ Visualize data lineage​

●​ Expose APIs for programmatic queries​

●​ Track usage metrics and popularity​

Non-Functional Requirements

●​ Scalability for millions of assets​

●​ Fast search latency​

●​ Fine-grained security and access control​


Architecture Overview

●​ Metadata Store:​
Use graph DBs like Neo4j for lineage, or Elasticsearch for fast text search.​

●​ Ingestion:​
Metadata crawlers periodically scan data sources for schemas, tables, and pipelines.​

●​ APIs and UI:​
Expose RESTful APIs and build a frontend UI for search and lineage exploration.​

Low-Level Design

●​ Store entities as nodes and relationships as edges in a graph DB.​

●​ Load metrics like data usage into a time-series DB for popularity tracking.​

●​ Integrate crawlers with orchestration tools like Airflow for automatic updates.​
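
A hedged sketch of recording lineage as nodes and edges with the Neo4j Python driver; the Dataset/FEEDS graph model, connection details, and dataset names are assumptions.

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "***"))

def record_lineage(tx, upstream, downstream, job):
    # Datasets are nodes; a FEEDS edge records which job produced the dependency.
    tx.run(
        """
        MERGE (u:Dataset {name: $upstream})
        MERGE (d:Dataset {name: $downstream})
        MERGE (u)-[:FEEDS {job: $job}]->(d)
        """,
        upstream=upstream, downstream=downstream, job=job,
    )

with driver.session() as session:
    # execute_write (write_transaction in older driver versions) retries on failure.
    session.execute_write(record_lineage,
                          "raw.orders", "analytics.daily_revenue", "daily_revenue_etl")
driver.close()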

Example Diagram

Data Sources → Crawlers → Metadata Store → API → UI
                              ↓
                        Lineage Graph

Trade-offs & Reasoning

●​ Graph databases are well suited to relationship queries such as lineage traversal.​

●​ Elasticsearch improves free-text search but requires duplicating metadata into its index.​

●​ Crawlers must balance freshness vs load on data sources.​


Security & Compliance

●​ Role-based access to metadata.​

●​ Mask sensitive information in metadata (e.g. table comments with PII).​

●​ Audit logs for metadata changes.​

Interview Talking Points:

●​ Describe graph models for representing data lineage.​

●​ Explain how you’d handle schema changes in catalogs.​

●​ Discuss how to implement usage-based ranking of assets.​


11. Machine Learning Feature Store Design
Prompt:​

Design a feature store to centralise machine learning features for reuse across different models
and teams.

Real-World Context:​

Without a feature store, data scientists often create the same features repeatedly, leading to
inconsistent results between training and production. A feature store improves efficiency and
model reliability.

Clarifying Questions

●​ Real-time or batch feature serving required?​

●​ Expected size and number of features?​

●​ How should features be versioned?​

●​ Integration with which ML platforms?​

Functional Requirements

●​ Central repository of features​

●​ Consistent feature values for training and online inference​

●​ Versioning and lineage tracking​

●​ Ability to share features across teams​

Non-Functional Requirements

●​ Low latency for online serving​

●​ Scalability for hundreds or thousands of features​

●​ Data governance and auditing​


Architecture Overview

●​ Offline Store:​
Data warehouse (e.g. Snowflake, BigQuery) stores large feature tables for training.​

●​ Online Store:​
Low-latency databases like Redis or DynamoDB serve real-time features for production
inference.​

●​ Processing:​
Spark pipelines calculate and store new feature values.​

●​ Serving Layer:​
REST APIs expose features to models in real-time.​

Low-Level Design

●​ Maintain feature definitions in a metadata catalog.​

●​ Batch compute features daily and write to warehouse.​

●​ Real-time compute handled by streaming jobs (e.g. Flink).​

●​ Implement TTL (time-to-live) for online store data.​
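
A minimal sketch of the dual write described above: batch features land in the offline store (Parquet here as a stand-in) and the latest values are pushed to Redis with a TTL for online serving. The key format, feature names, and paths are assumptions.

import json
import redis
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, count

spark = SparkSession.builder.appName("user-features").getOrCreate()
orders = spark.read.parquet("s3://lakehouse/orders/")        # assumed source

features = (orders.groupBy("user_id")
    .agg(count("*").alias("orders_30d"),
         avg("amount").alias("avg_order_value")))

# Offline store: full history for training.
features.write.mode("overwrite").parquet("s3://feature-store/user_features/dt=2024-01-01/")

# Online store: latest values for low-latency inference, with a 24-hour TTL.
r = redis.Redis(host="localhost", port=6379)                 # assumed online store
for row in features.toLocalIterator():
    key = f"user_features:{row['user_id']}"
    value = {"orders_30d": int(row["orders_30d"]),
             "avg_order_value": float(row["avg_order_value"])}
    r.set(key, json.dumps(value), ex=86400)                  # TTL in seconds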

Example Diagram

Raw Data → Spark → Feature Store (Offline & Online)
                            ↓
                        ML Models

Trade-offs & Reasoning

●​ Offline stores are cost-efficient but slow for real-time predictions.​

●​ Online stores cost more but reduce inference latency.​

●​ Versioning avoids training-serving skew.​


Security & Compliance

●​ RBAC on who can access features.​

●​ Lineage tracking for auditability.​

●​ Masking of sensitive feature values.​

Interview Talking Points:

●​ Explain how to avoid feature drift.​

●​ Discuss handling time-travel queries for training consistency.​

●​ Describe the trade-offs between online and offline stores.​


12. IoT Sensor Data Platform
Prompt:​

Design a platform to ingest, store, and analyse temperature sensor data from 100,000 IoT
devices across India, supporting real-time alerts for extreme readings.

Real-World Context:​

IoT applications in manufacturing, agriculture, and smart cities generate vast amounts of sensor
data that must be processed in real time to raise critical alerts.

Clarifying Questions

●​ How frequently does each sensor send data? (e.g. once a second or once a minute?)​

●​ Are alerts needed in real time or can they be delayed slightly?​

●​ How long should raw data be retained?​

●​ Is there location metadata included?​

Functional Requirements

●​ High-frequency data ingestion​

●​ Real-time processing to detect extreme readings​

●​ Store raw data for historical analysis​

●​ Provide dashboards for time-series trends​

Non-Functional Requirements

●​ Low latency for critical alerts​

●​ Cost-effective storage​

●​ Fault tolerance for device outages​


Architecture Overview

●​ Ingestion:​
Use an MQTT broker for lightweight IoT communication. Kafka may be used if messages
are larger or arrive at high frequency.​

●​ Processing:​
Flink processes incoming streams, applies threshold rules, and triggers alerts for
extreme values.​

●​ Storage:​

○​ Raw data → S3 or GCS for cost-effective archival​

○​ Aggregated data → Time-series DB like TimescaleDB or InfluxDB​

●​ Alerting:​
Serverless functions (like AWS Lambda) send notifications (SMS, email, etc.).​

●​ Visualization:​
Grafana dashboards for time-series trends and anomalies.​

Low-Level Design

●​ Flink uses keyed streams to separate processing per device.​

●​ Maintain stateful computations for rolling averages.​

●​ Define alert thresholds dynamically via a config service.​
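
The design above uses Flink for the keyed, stateful processing; as a simplified single-process stand-in, this hedged sketch consumes readings from Kafka with kafka-python and applies a per-device rolling average against a threshold. The topic name, payload format, and threshold are assumptions.

import json
from collections import defaultdict, deque
from kafka import KafkaConsumer

TEMP_THRESHOLD_C = 60.0                                  # assumed alert threshold
recent = defaultdict(lambda: deque(maxlen=10))           # last 10 readings per device

consumer = KafkaConsumer(
    "sensor-temperature",                                # assumed topic
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for msg in consumer:
    reading = msg.value                                  # e.g. {"device_id": ..., "temp_c": ...}
    window = recent[reading["device_id"]]
    window.append(reading["temp_c"])
    rolling_avg = sum(window) / len(window)
    if rolling_avg >= TEMP_THRESHOLD_C:
        # In the real pipeline this would call a notification service (SMS/email).
        print(f"ALERT: device {reading['device_id']} rolling avg {rolling_avg:.1f}°C")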

Example Diagram

       +------------+
IoT →  | MQTT/Kafka | → Flink → Alerts
       +------------+      ↓
                      +-----------+
                      | S3 (Raw)  |
                      +-----------+

                 TimescaleDB → Grafana
Trade-offs & Reasoning

●​ MQTT is efficient for IoT devices with limited bandwidth.​

●​ Flink provides low-latency, stateful stream processing.​

●​ Time-series DBs are ideal for fast queries on sensor data.​

Security & Compliance

●​ TLS for secure communication between devices and brokers.​

●​ Device-level authentication via tokens or certificates.​

●​ Access control on dashboards for sensitive operational data.​

Interview Talking Points:

●​ Explain why MQTT is preferred over HTTP for IoT.​

●​ Discuss how to handle network partitions affecting real-time alerts.​

●​ Describe cost trade-offs for keeping all raw data vs aggregates.​
