Data Engineering System Design
This document includes 12 detailed system design scenarios frequently asked in Data
Engineering interviews. Each scenario explains:
● The problem and real-world context
● Clarifying questions to ask in an interview
● Functional and non-functional requirements
● A detailed architecture proposal
● Trade-offs and technology choices
● Security and compliance considerations
● Interview talking points
Use this to prepare thorough answers that showcase both technical knowledge and critical
thinking.
1. Real-Time User Activity Analytics (Clickstream Processing)
Prompt:
Design a system to process website or app clickstream data in near real-time to compute
metrics like active users, session durations, funnels, and trends.
Real-World Context:
E-commerce companies, news portals, and streaming services all rely on clickstream data to
understand user behaviour. These insights help drive recommendations, detect anomalies, and
improve user engagement.
Clarifying Questions
● What’s the peak event rate (e.g. 10,000 events/sec or 1 million)?
● Is latency critical (e.g. dashboards updated every 10 sec vs. every 5 min)?
● Do we need to store raw events for historical analysis or only aggregates?
● Are there privacy regulations like GDPR requiring data masking?
● Which BI tools will the analysts use?
Functional Requirements
● Handle high-volume event ingestion
● Compute sliding window metrics (e.g. active users every 5 minutes)
● Sessionise user events based on timeouts
● Store both raw logs and aggregated metrics
● Provide API or warehouse for dashboard queries
Non-Functional Requirements
● Scalability for traffic spikes (e.g. during sales)
● Fault tolerance so data isn’t lost during outages
● Cost efficiency for cloud services
● Privacy compliance (GDPR, CCPA)
Architecture Overview
● Ingestion:
Use Kafka or Pub/Sub to capture events from web/mobile clients. Kafka can handle
large event volumes and supports message durability.
● Processing:
Use Flink or Spark Structured Streaming for real-time processing. These frameworks
support windowing functions and can join streams for enrichment.
● Storage:
○ Store raw logs in S3 or GCS in Parquet format, partitioned by date and hour.
○ Store aggregates (like active users, bounce rate) in a data warehouse like
BigQuery, Snowflake, or Redshift for fast querying.
● Visualization:
Connect BI tools like Looker, Tableau, or Superset to the warehouse for dashboards.
Low-Level Design
● Implement sessionization logic that groups events by user_id. A typical session timeout
is 30 minutes of inactivity.
● Use sliding window aggregations to track metrics like active users in the last 5 minutes.
● Enrich data with geo-IP information using lookups or joins in the stream.
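Code Sketch
A minimal PySpark Structured Streaming sketch of the sliding-window active-user metric described above; the Kafka topic, broker address, and event schema are illustrative assumptions rather than fixed choices.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StringType, TimestampType

spark = SparkSession.builder.appName("clickstream-metrics").getOrCreate()

# Assumed event schema; real payloads would carry more fields.
schema = (StructType()
          .add("user_id", StringType())
          .add("event_type", StringType())
          .add("event_time", TimestampType()))

events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")   # assumed broker
          .option("subscribe", "clickstream-events")          # assumed topic
          .load()
          .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Active users over a 5-minute window sliding every minute; the watermark
# bounds state kept for late events.
active_users = (events
                .withWatermark("event_time", "10 minutes")
                .groupBy(F.window("event_time", "5 minutes", "1 minute"))
                .agg(F.approx_count_distinct("user_id").alias("active_users")))

(active_users.writeStream
 .outputMode("update")
 .format("console")      # swap for a warehouse or Parquet sink in practice
 .start())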
Example Diagram
Client → Kafka → Flink/Spark → S3 (Raw Logs)
↓
BigQuery / Snowflake → BI Tools
Trade-offs & Reasoning
● Kafka vs Pub/Sub: Kafka offers finer-grained control over performance tuning and durability but requires cluster management; Pub/Sub is fully managed.
● Flink has lower latency and better state management than Spark, but Spark is more
widely adopted and easier to deploy.
● Using Parquet reduces storage costs and improves scan speeds in warehouses.
Security & Compliance
● Encrypt events in transit.
● Mask PII fields like email or IP before writing to warehouses.
● Implement fine-grained permissions for warehouse queries.
Interview Talking Points:
● Describe windowing and sessionization logic in detail.
● Explain how you’d handle spikes in traffic.
● Discuss why raw data is retained for backfills or new analytics.
2. Batch ETL Pipeline for E-commerce
Prompt:
Design a daily ETL pipeline that processes CSV files uploaded to S3, transforms them, and
loads them into Snowflake for analytics.
Real-World Context:
E-commerce platforms often exchange data with partners in CSV format. The data must be
cleaned, validated, and integrated into internal warehouses for reporting and fraud analysis.
Clarifying Questions
● Daily data size (e.g. 10 GB or 1 TB)?
● Complexity of transformations — simple cleaning or multi-table joins?
● Required SLA for pipeline completion?
● Compliance requirements for data audits?
Functional Requirements
● Automatically detect and ingest new CSV files from S3
● Transform data (clean nulls, cast data types, join reference tables)
● Run data quality checks
● Load data into Snowflake
● Alert on failures
Non-Functional Requirements
● Handle increasing file sizes without performance degradation
● Maintain data quality and consistency
● Provide observability through logging and metrics
Architecture Overview
● Ingestion:
External systems drop CSV files into an S3 bucket.
● Orchestration:
Airflow schedules and manages the ETL pipeline.
● Processing:
Use PySpark to transform data. Spark is highly efficient for large CSV parsing and can
handle distributed joins.
● Data Quality:
Implement Great Expectations or Amazon Deequ for data validation (e.g. null checks,
uniqueness constraints).
● Warehouse Load:
Load transformed data into Snowflake using bulk load commands like COPY INTO.
Low-Level Design
● Design Spark jobs to handle:
○ Parsing large CSV files with defined schemas
○ Joins with lookup datasets (e.g. SKUs, pricing tables)
○ Null removal and data type casting
● Implement Airflow alerting with email or Slack on job failures.
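Code Sketch
A minimal PySpark sketch of the transformation step above, assuming illustrative S3 paths, a simple orders schema, and a SKU lookup table; in practice the job would be parameterised and triggered by Airflow.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StringType, DoubleType

spark = SparkSession.builder.appName("daily-etl").getOrCreate()

# Assumed partner file schema; amounts arrive as text and are cast below.
orders_schema = (StructType()
                 .add("order_id", StringType())
                 .add("sku", StringType())
                 .add("amount", StringType())
                 .add("order_date", StringType()))

orders = (spark.read
          .option("header", "true")
          .schema(orders_schema)
          .csv("s3a://partner-drop/orders/dt=2024-01-01/"))   # assumed path

skus = spark.read.parquet("s3a://reference/skus/")            # assumed lookup table

clean = (orders
         .dropna(subset=["order_id", "sku"])
         .withColumn("amount", F.col("amount").cast(DoubleType()))
         .withColumn("order_date", F.to_date("order_date", "yyyy-MM-dd"))
         .join(skus, on="sku", how="left"))

# Staged as Parquet; Snowflake then ingests it with COPY INTO.
clean.write.mode("overwrite").parquet("s3a://staging/orders_clean/dt=2024-01-01/")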
Example Diagram
S3 → Airflow → Spark → Data Quality Checks → Snowflake
Trade-offs & Reasoning
● Spark scales better than Pandas for large datasets.
● Snowflake handles semi-structured data and scales compute resources on demand.
● Airflow allows complex DAGs with retries and SLA monitoring.
Security & Compliance
● Use S3 bucket policies to restrict file uploads.
● Implement Snowflake’s row-level security for sensitive customer data.
Interview Talking Points:
● Explain why Spark is chosen over Pandas.
● Discuss partitioning strategies in Snowflake.
● Describe how you’d debug data pipeline failures.
3. Change Data Capture (CDC) Pipeline
Prompt:
Design a pipeline that captures changes from a transactional database and replicates them to
an analytics warehouse in near real-time.
Real-World Context:
Many analytics systems rely on data from OLTP systems. Instead of bulk daily extracts, modern
pipelines use CDC to reflect changes quickly, enabling fresher insights.
Clarifying Questions
● Source database type (MySQL, Postgres, Oracle)?
● How much data changes per hour?
● How fresh must data in the warehouse be (seconds, minutes, hours)?
● Any GDPR concerns regarding replicating sensitive data?
Functional Requirements
● Detect inserts, updates, deletes in source DB
● Replicate changes into target warehouse
● Maintain consistency even during outages
● Support schema changes gracefully
Non-Functional Requirements
● Low replication lag
● Scalability to high transaction volumes
● Fault tolerance and recovery mechanisms
● Auditing for compliance
Architecture Overview
● CDC Capture:
Debezium reads database logs and streams changes to Kafka.
● Processing:
Spark Structured Streaming or Flink reads Kafka topics, applies business logic, and
writes changes to the warehouse.
● Warehouse Load:
Load data into Snowflake, Redshift, or BigQuery using idempotent upserts.
Low-Level Design
● Create separate Kafka topics per table.
● Handle tombstone records (deletes) in downstream jobs.
● Implement error handling for schema drift.
● Use checkpointing in Spark to resume from failures.
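Code Sketch
A sketch of consuming Debezium change events with Spark Structured Streaming and collapsing each micro-batch to the latest change per key before staging it for an idempotent MERGE; the topic name, the simplified envelope schema, and the order_id key are illustrative assumptions.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StringType, LongType
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("cdc-apply").getOrCreate()

# Simplified Debezium envelope: op = c/u/d, before/after row images as JSON strings.
envelope = (StructType()
            .add("op", StringType())
            .add("before", StringType())
            .add("after", StringType())
            .add("ts_ms", LongType()))

changes = (spark.readStream.format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")     # assumed broker
           .option("subscribe", "dbserver1.public.orders")       # assumed topic (one per table)
           .load()
           .select(F.from_json(F.col("value").cast("string"), envelope).alias("c"))
           .select("c.*"))

def apply_batch(batch_df, batch_id):
    # Deletes carry the key only in "before"; keep the latest change per key
    # so that retries and replays stay idempotent.
    key = F.coalesce(F.get_json_object("after", "$.order_id"),
                     F.get_json_object("before", "$.order_id"))
    latest = (batch_df
              .withColumn("key", key)
              .withColumn("rn", F.row_number().over(
                  Window.partitionBy("key").orderBy(F.col("ts_ms").desc())))
              .filter("rn = 1"))
    # Stage the batch; the warehouse applies it with an UPSERT/MERGE keyed on order_id.
    latest.write.mode("append").parquet("s3a://cdc-staging/orders/")

(changes.writeStream
 .option("checkpointLocation", "s3a://checkpoints/cdc-orders/")   # enables recovery after failure
 .foreachBatch(apply_batch)
 .start())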
Example Diagram
+---------+ +----------+ +--------+ +-----------+
DB →| Debezium|→ | Kafka |→ | Spark |→ | Snowflake |
+---------+ +----------+ +--------+ +-----------+
Trade-offs & Reasoning
● Debezium simplifies log parsing but requires access to database logs.
● Streaming CDC reduces warehouse latency but adds operational complexity.
● Idempotent writes are critical to avoid duplicate rows.
Security & Compliance
● Encrypt CDC streams.
● Mask PII in data flows before warehouse ingestion.
● Implement warehouse access controls.
Interview Talking Points:
● Explain how Debezium handles schema changes.
● Discuss exactly-once vs at-least-once semantics.
● Describe how you’d backfill historical data if needed.
4. Data Quality Monitoring Architecture
Prompt:
Design a system to monitor and enforce data quality across data pipelines, catching issues like
null rates, schema drift, and outliers.
Real-World Context:
Without data quality checks, pipelines silently propagate bad data, leading to broken reports and
machine learning failures. Companies increasingly require automated monitoring and alerts for
data quality.
Clarifying Questions
● Which types of data checks are required (e.g. nulls, uniqueness, statistical tests)?
● Batch vs streaming pipelines to monitor?
● How many datasets need monitoring?
● Preferred alerting channels (Slack, PagerDuty)?
Functional Requirements
● Validate data with configurable rules
● Visualize historical quality trends
● Alert engineers on failures
● Maintain audit logs for compliance
Non-Functional Requirements
● Scalable to thousands of datasets
● Minimal overhead for streaming checks
● Centralized monitoring dashboards
Architecture Overview
● Validation Tools:
Use Great Expectations, Deequ, or custom checks.
● Orchestration:
Integrate checks into Airflow or pipeline frameworks.
● Storage:
Time-series DB (e.g. Prometheus, InfluxDB) stores metrics over time.
● Alerts:
Notify teams via Slack, email, or incident tools.
Low-Level Design
● Define validation rules as YAML or JSON config files.
● Store validation results and metrics per table.
● Track trends over time to detect gradual data drift.
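Code Sketch
A minimal sketch of a config-driven validation runner in PySpark, assuming an illustrative YAML rule format and dataset path; in practice tools like Great Expectations or Deequ provide richer rule libraries, profiling, and reporting.

import yaml
from pyspark.sql import SparkSession, functions as F

# Illustrative rule config; real rules would live in version-controlled YAML files.
RULES_YAML = """
table: orders_clean
checks:
  - {type: not_null, column: order_id}
  - {type: unique, column: order_id}
  - {type: max_null_fraction, column: amount, threshold: 0.01}
"""

spark = SparkSession.builder.appName("dq-runner").getOrCreate()
df = spark.read.parquet("s3a://staging/orders_clean/")        # assumed dataset
rules = yaml.safe_load(RULES_YAML)

results = []
total = df.count()
for check in rules["checks"]:
    col = check["column"]
    if check["type"] == "not_null":
        passed = df.filter(F.col(col).isNull()).count() == 0
    elif check["type"] == "unique":
        passed = df.groupBy(col).count().filter("count > 1").count() == 0
    elif check["type"] == "max_null_fraction":
        null_fraction = df.filter(F.col(col).isNull()).count() / max(total, 1)
        passed = null_fraction <= check["threshold"]
    results.append({"table": rules["table"], "check": check["type"],
                    "column": col, "passed": passed})

# In a real pipeline, push results to a metrics store and alert on failures.
print(results)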
Example Diagram
Data Pipelines → Validation Engine → Metrics Store → Dashboards
↓
Alerts
Trade-offs & Reasoning
● Deep statistical checks catch subtle errors but add latency.
● Validation on streaming data is more complex than batch.
● Central dashboards help prioritize which pipelines need attention.
Security & Compliance
● Mask sensitive fields in logs.
● Enforce access controls on dashboards.
● Maintain validation run history for audits.
Interview Talking Points:
● Explain differences between batch and streaming data quality checks.
● Discuss how to measure and track data drift.
● Describe error-handling strategies for quality failures.
5. Data Pipeline Cost Optimization
Prompt:
Design a system or strategy to reduce costs in an existing data pipeline that uses cloud services
like Spark, Snowflake, and S3.
Real-World Context:
As data volumes grow, cloud costs can skyrocket. Companies often seek ways to balance
performance with cost-efficiency.
Clarifying Questions
● Which parts of the pipeline are the costliest?
● Any hard SLAs for data freshness?
● Batch or streaming workloads?
● Budget constraints vs performance goals?
Functional Requirements
● Analyze pipeline costs
● Recommend cost-saving changes
● Implement optimizations without breaking SLAs
Non-Functional Requirements
● Scalability to large workloads
● Transparent reporting to stakeholders
● Minimal performance degradation
Architecture Overview
● Cost Monitoring:
Integrate cloud billing APIs to collect cost data.
● Optimization Engine:
Analyze query patterns, data sizes, and job runtimes.
● Techniques:
○ Partition pruning
○ Data clustering
○ Auto-scaling Spark clusters
○ Warehouse query rewriting
○ Data retention and tiering policies
○ Off-peak scheduling
● Dashboards:
Visualize cost trends and potential savings.
Low-Level Design
● Airflow tasks extract cost metrics daily.
● Identify long-running queries in Snowflake using query history.
● Set warehouse auto-suspend timeouts to avoid idle compute charges.
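Code Sketch
A sketch of the two Snowflake-side tactics above using the snowflake-connector-python client; the connection parameters and warehouse name are illustrative assumptions, and querying ACCOUNT_USAGE.QUERY_HISTORY requires the appropriate role.

import snowflake.connector

# Assumed connection parameters; use a secrets manager in practice.
conn = snowflake.connector.connect(
    account="my_account", user="cost_bot", password="***",
    warehouse="ANALYTICS_WH", role="COST_ADMIN")

cur = conn.cursor()

# 1. Find the longest-running queries of the past day as optimisation candidates.
cur.execute("""
    SELECT query_id, warehouse_name, total_elapsed_time / 1000 AS seconds, query_text
    FROM snowflake.account_usage.query_history
    WHERE start_time >= DATEADD('day', -1, CURRENT_TIMESTAMP())
    ORDER BY total_elapsed_time DESC
    LIMIT 20
""")
for query_id, wh, seconds, text in cur.fetchall():
    print(f"{wh} {seconds:.0f}s {text[:80]}")

# 2. Reduce idle compute charges by shortening the auto-suspend timeout.
cur.execute("ALTER WAREHOUSE ANALYTICS_WH SET AUTO_SUSPEND = 60")

cur.close()
conn.close()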
Example Diagram
Data Pipelines → Cost Metrics Collector → Dashboard
↓
Optimization Engine
Trade-offs & Reasoning
● Aggressive cost cuts might increase job runtimes.
● Tiering older data to cheaper storage saves money but complicates retrieval for infrequent queries.
● Cost monitoring itself introduces some overhead.
Security & Compliance
● Restrict financial data visibility to appropriate teams.
● Log optimization changes for compliance reviews.
Interview Talking Points:
● Discuss specific Spark and Snowflake tuning techniques.
● Describe cost-visibility tools like cloud cost explorers.
● Explain how to balance cost savings with data SLAs.
6. Cross-Region Data Replication
Prompt:
Design a system to replicate user activity data between US and EU regions, enabling global
analytics while remaining GDPR-compliant.
Real-World Context:
Many global platforms operate in multiple regions and must keep data available for global
reporting while respecting laws like GDPR, which restricts where personal data can be stored.
Clarifying Questions
● How consistent does data need to be across regions (seconds vs minutes delay)?
● Is all data replicated or just specific tables?
● Are there specific GDPR rules about data crossing regions?
● Is there a preference for batch or streaming replication?
Functional Requirements
● Ingest and process data locally in each region
● Replicate selected datasets to a central location
● Aggregate data for global reporting
● Maintain data residency for sensitive user data
Non-Functional Requirements
● Minimal replication lag
● Scalability as traffic grows
● Resilience to regional outages
● Compliance with data privacy laws
Architecture Overview
● Regional Processing:
Kafka in each region ingests events locally.
● Replication:
Use MirrorMaker 2.0 (or Confluent Replicator) to copy topics across regions.
● Processing:
Spark jobs in each region process and store data in S3 buckets.
● Aggregation:
A centralized batch job merges data from regional buckets into a global warehouse for
analytics.
Low-Level Design
● Tag all records with a region identifier for auditing.
● Mask or anonymize personal data before sending across regions.
● Configure MirrorMaker with topic-level filtering so only necessary data crosses regions.
● Use daily batch jobs to aggregate global metrics.
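Code Sketch
A minimal PySpark sketch of masking personal fields and tagging the region before events are published to a topic that MirrorMaker is allowed to replicate; the column names, salt, paths, and topic are illustrative assumptions.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("mask-before-replication").getOrCreate()

events = spark.read.parquet("s3a://us-events/activity/")      # assumed regional data

# Hash direct identifiers and truncate IPs so only pseudonymous data crosses regions.
masked = (events
          .withColumn("user_id", F.sha2(F.concat(F.col("user_id"), F.lit("region-salt")), 256))
          .withColumn("email", F.lit(None).cast("string"))                 # dropped entirely
          .withColumn("ip_prefix", F.regexp_extract("ip", r"^(\d+\.\d+)", 1))
          .drop("ip")
          .withColumn("source_region", F.lit("us")))          # region tag for auditing

# Publish to the whitelisted Kafka topic that crosses regions via MirrorMaker.
(masked.selectExpr("CAST(user_id AS STRING) AS key", "to_json(struct(*)) AS value")
 .write.format("kafka")
 .option("kafka.bootstrap.servers", "broker:9092")            # assumed broker
 .option("topic", "activity.replicated")                      # assumed whitelisted topic
 .save())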
Example Diagram
+----------+ +--------+ +--------+
US → | Kafka |→ | Spark |→ | S3 (US) |
+----------+ +--------+ +--------+
↓
MirrorMaker
↓
+----------+ +--------+ +--------+
EU → | Kafka |→ | Spark |→ | S3 (EU) |
+----------+ +--------+ +--------+
↓
Aggregator Job → Global Data Warehouse
Trade-offs & Reasoning
● MirrorMaker offers straightforward replication but adds latency.
● GDPR compliance means some data might stay regional only.
● A centralized warehouse simplifies reporting but can become a bottleneck.
Security & Compliance
● Encrypt all replicated data streams.
● Restrict which topics can cross borders.
● Audit logs for all cross-region transfers.
Interview Talking Points:
● Describe how GDPR impacts architecture decisions.
● Explain how you’d handle schema changes across replicated topics.
● Discuss failover strategies if one region becomes unavailable.
7. Data Lakehouse Architecture for Diverse Data
Prompt:
Design a data lakehouse architecture that supports structured, semi-structured, and
unstructured data for analytics and machine learning.
Real-World Context:
Modern enterprises deal with diverse data types—CSV files, JSON logs, images, videos, and
streaming data. Traditional warehouses struggle to store everything cost-effectively. The
lakehouse approach bridges the gap between flexible storage and reliable analytics.
Clarifying Questions
● Estimated data volume (TBs, PBs)?
● Expected latency for analytics (near real-time vs overnight batch)?
● Any regulatory restrictions (e.g. on PII storage)?
● How many concurrent users for BI and ML workloads?
Functional Requirements
● Unified storage for all data types
● Support for ACID transactions for consistency
● Efficient querying for analytics
● Integration with BI tools and ML platforms
● Schema evolution support
Non-Functional Requirements
● Cost-efficient storage for large volumes
● Scalability for both read and write workloads
● Lineage tracking and governance
● High durability and availability
Architecture Overview
● Storage:
Store data in cloud object storage (e.g. S3, ADLS Gen2).
● Format:
Use formats like Delta Lake, Iceberg, or Hudi to provide ACID transactions, schema
evolution, and time-travel features.
● Processing:
Spark or Databricks handles both batch and streaming workloads.
● Catalog:
Use Hive Metastore, AWS Glue, or Unity Catalog to maintain metadata.
● Visualization:
Connect tools like Power BI, Looker, or Tableau directly to lakehouse tables for
analytics.
Low-Level Design
● Partition data by business-relevant keys (e.g. date, region, product_id).
● Leverage Delta Lake’s time travel to recover previous versions.
● Optimize file sizes to reduce small-file issues.
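Code Sketch
A sketch of the table layout using Delta Lake, one of the three formats mentioned above; the paths, partition column, and version number are illustrative assumptions, and Iceberg or Hudi would be configured differently. Assumes the delta-spark package is on the classpath.

from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("lakehouse")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

raw = spark.read.json("s3a://landing/events/")                # assumed raw input with an event_date column

# Partition by a business-relevant key to enable partition pruning.
(raw.write.format("delta")
 .mode("append")
 .partitionBy("event_date")
 .save("s3a://lakehouse/events/"))

# Time travel: read the table as of an earlier version to recover or audit data.
previous = (spark.read.format("delta")
            .option("versionAsOf", 3)                         # illustrative version number
            .load("s3a://lakehouse/events/"))

# Compact small files to keep scans efficient (Delta's OPTIMIZE command).
spark.sql("OPTIMIZE delta.`s3a://lakehouse/events/`")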
Example Diagram
Raw Data → Data Lake (Delta / Iceberg / Hudi)
↓
Spark Processing
↓
BI Tools & ML Models
Trade-offs & Reasoning
● Delta Lake makes querying object storage reliable but increases storage overhead due
to transaction logs.
● Iceberg excels at large table scalability with hidden partitioning.
● Lakehouses reduce cost and complexity by removing the need to maintain separate data lakes and warehouses.
Security & Compliance
● Encrypt data at rest.
● Apply fine-grained permissions on table access.
● Use data masking for sensitive columns (PII).
Interview Talking Points:
● Explain differences between Delta Lake, Iceberg, and Hudi.
● Discuss how lakehouses handle schema evolution.
● Describe BI query performance optimizations (e.g. Z-Ordering).
8. Data Mesh Architecture
Prompt:
Design a data platform using Data Mesh principles for a large enterprise.
Real-World Context:
Traditional centralized data lakes can become bottlenecks as companies grow. Data Mesh
pushes data ownership to domain teams while maintaining governance and discoverability.
Clarifying Questions
● How many domains exist in the organization?
● How is data governance handled centrally?
● Real-time vs batch data products?
● Existing data catalog solutions in place?
Functional Requirements
● Allow domain teams to own and publish data products
● Enable data discovery through a central catalog
● Define standard contracts for data products
● Support both batch and streaming products
Non-Functional Requirements
● Scalability for many domains
● Secure sharing of data across teams
● Cost transparency and tracking
Architecture Overview
● Data Products:
Each domain publishes data products with well-defined contracts (schemas, SLAs, etc.).
● Catalog:
Central catalog (DataHub, Amundsen) indexes all data products for discovery.
● APIs:
Standardized APIs for consuming data products.
● Streaming Backbone:
Kafka may be used for cross-domain event sharing.
Low-Level Design
● Domains maintain their own pipelines and storage.
● Catalog crawlers ingest metadata from each domain’s systems.
● Usage metrics help identify popular products.
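Code Sketch
A minimal sketch of a data product contract expressed as YAML and checked against the published table's schema before registration in the catalog; the contract format, dataset path, and field names are illustrative assumptions, not a standard.

import yaml
from pyspark.sql import SparkSession

# Illustrative contract a domain team publishes alongside its data product.
CONTRACT_YAML = """
product: orders.daily_summary
owner: commerce-domain
sla_hours: 6
schema:
  order_date: date
  region: string
  total_orders: bigint
  total_revenue: double
"""

spark = SparkSession.builder.appName("contract-check").getOrCreate()
contract = yaml.safe_load(CONTRACT_YAML)

published = spark.read.parquet("s3a://commerce/orders_daily_summary/")   # assumed product location
actual = {f.name: f.dataType.simpleString() for f in published.schema.fields}

# Flag missing columns and type mismatches before the product is registered in the catalog.
violations = []
for column, expected_type in contract["schema"].items():
    if column not in actual:
        violations.append(f"missing column: {column}")
    elif actual[column] != expected_type:
        violations.append(f"type mismatch on {column}: {actual[column]} != {expected_type}")

if violations:
    raise ValueError(f"Contract check failed for {contract['product']}: {violations}")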
Example Diagram
Domain A Data Product
Domain B Data Product
↓
Central Catalog
↓
Consumers via APIs
Trade-offs & Reasoning
● Data Mesh improves agility but requires strong governance.
● Catalogs prevent chaos by centralizing discovery.
● Standard APIs reduce integration headaches.
Security & Compliance
● Federated IAM policies across domains.
● Auditable metadata changes.
● Data masking standards enforced across all products.
Interview Talking Points:
● Discuss how contracts help avoid data quality issues.
● Explain governance challenges in a decentralized model.
● Describe how to handle domain-level outages.
9. Near Real-Time Deduplication and Idempotent Writes
Prompt:
Design a streaming pipeline that ingests events which may contain duplicates, ensuring downstream systems receive each event only once.
Real-World Context:
Many systems send the same event multiple times for reliability. However, duplicate events can
lead to overcounting or corrupted analytics if not handled carefully.
Clarifying Questions
● How often do duplicates occur?
● Are duplicate events exactly identical or slightly different?
● How long can duplicates arrive after the original?
● What is the required latency for downstream consumers?
Functional Requirements
● Detect and eliminate duplicate events.
● Maintain consistent state across restarts.
● Handle very high event volumes.
Non-Functional Requirements
● Low-latency deduplication.
● Scalable state storage for large event volumes.
● Robust fault recovery.
Architecture Overview
● Ingestion:
Kafka ingests events from producers.
● Processing:
Flink or Spark Streaming uses keyed state to track recently seen event IDs.
● Storage:
Deduplicated events are stored in a warehouse or operational DB.
● Idempotent Writes:
Downstream writes use UPSERT semantics to avoid duplication.
Low-Level Design
● Maintain a cache or state table keyed on unique event IDs.
● Watermark logic clears old entries to limit memory usage.
● Writes to downstream systems must support idempotency.
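Code Sketch
A minimal PySpark Structured Streaming sketch of watermark-bounded deduplication on an event ID; the topic, schema, and one-hour lateness bound are illustrative assumptions, and retried events are assumed to carry the same event_id and event_time (Flink's keyed state expresses the same idea).

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StringType, TimestampType

spark = SparkSession.builder.appName("dedup").getOrCreate()

schema = (StructType()
          .add("event_id", StringType())      # unique ID attached by the producer
          .add("payload", StringType())
          .add("event_time", TimestampType()))

events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")   # assumed broker
          .option("subscribe", "raw-events")                  # assumed topic
          .load()
          .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Retries carry the same event_time, so including it in the key lets the watermark
# purge old state; duplicates arriving more than an hour late are not caught.
deduplicated = (events
                .withWatermark("event_time", "1 hour")
                .dropDuplicates(["event_id", "event_time"]))

(deduplicated.writeStream
 .option("checkpointLocation", "s3a://checkpoints/dedup/")    # state survives restarts
 .format("parquet")
 .option("path", "s3a://clean-events/")
 .outputMode("append")
 .start())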
Example Diagram
Client → Kafka → Flink/Spark → Deduplicated Events → Data Warehouse
Trade-offs & Reasoning
● Stateful deduplication requires significant memory.
● Flink provides low-latency deduplication but adds complexity.
● Idempotent writes simplify recovery after failures.
Security & Compliance
● Encrypt data streams.
● Apply RBAC on streaming jobs.
● Audit deduplication logic for correctness.
Interview Talking Points:
● Explain time windows for deduplication.
● Discuss state storage strategies in stream processors.
● Describe how idempotency avoids duplicates in downstream systems.
10. Scalable Data Catalog and Metadata Store
Prompt:
Design a data catalog system where engineers and analysts can discover datasets, track
lineage, and manage metadata at scale.
Real-World Context:
Modern enterprises have thousands of datasets spread across tools like Hive, Snowflake, and
Airflow. Without a data catalog, it’s impossible for teams to find, trust, and reuse existing data.
Clarifying Questions
● How many datasets and tables are we cataloging (thousands, millions)?
● Do we need to track lineage in real-time?
● Types of metadata stored (schemas, business definitions, data quality, tags)?
● Integration with which tools (Airflow, Spark, Snowflake, etc.)?
Functional Requirements
● Centralized searchable catalog
● Store technical and business metadata
● Visualize data lineage
● Expose APIs for programmatic queries
● Track usage metrics and popularity
Non-Functional Requirements
● Scalability for millions of assets
● Fast search latency
● Fine-grained security and access control
Architecture Overview
● Metadata Store:
Use graph DBs like Neo4j for lineage, or Elasticsearch for fast text search.
● Ingestion:
Metadata crawlers periodically scan data sources for schemas, tables, and pipelines.
● APIs and UI:
Expose RESTful APIs and build a frontend UI for search and lineage exploration.
Low-Level Design
● Store entities as nodes and relationships as edges in a graph DB.
● Load metrics like data usage into a time-series DB for popularity tracking.
● Integrate crawlers with orchestration tools like Airflow for automatic updates.
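Code Sketch
A small sketch of the lineage model using networkx as a stand-in for a graph database; node names and edge attributes are illustrative assumptions, and a production catalog would run equivalent traversals in Neo4j or a similar store.

import networkx as nx

# Nodes are datasets; directed edges point from upstream to downstream, annotated with the producing job.
lineage = nx.DiGraph()
lineage.add_edge("raw.orders_csv", "staging.orders", job="spark_clean_orders")
lineage.add_edge("staging.orders", "warehouse.orders_fact", job="load_orders_fact")
lineage.add_edge("warehouse.orders_fact", "dashboard.daily_sales", job="bi_refresh")

def upstream_of(dataset):
    """Everything that feeds into the given dataset (root-cause analysis for incidents)."""
    return nx.ancestors(lineage, dataset)

def downstream_of(dataset):
    """Everything affected if the given dataset breaks or changes schema."""
    return nx.descendants(lineage, dataset)

print(upstream_of("dashboard.daily_sales"))
# {'raw.orders_csv', 'staging.orders', 'warehouse.orders_fact'}
print(downstream_of("staging.orders"))
# {'warehouse.orders_fact', 'dashboard.daily_sales'}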
Example Diagram
Data Sources → Crawlers → Metadata Store → API → UI
↓
Lineage Graph
Trade-offs & Reasoning
● Graph databases are well suited to relationship queries such as lineage.
● Elasticsearch improves free-text search but requires duplicating metadata into its index.
● Crawlers must balance metadata freshness against the load they place on data sources.
Security & Compliance
● Role-based access to metadata.
● Mask sensitive information in metadata (e.g. table comments with PII).
● Audit logs for metadata changes.
Interview Talking Points:
● Describe graph models for representing data lineage.
● Explain how you’d handle schema changes in catalogs.
● Discuss how to implement usage-based ranking of assets.
11. Machine Learning Feature Store Design
Prompt:
Design a feature store to centralise machine learning features for reuse across different models
and teams.
Real-World Context:
Without a feature store, data scientists often create the same features repeatedly, leading to
inconsistent results between training and production. A feature store improves efficiency and
model reliability.
Clarifying Questions
● Real-time or batch feature serving required?
● Expected size and number of features?
● How should features be versioned?
● Integration with which ML platforms?
Functional Requirements
● Central repository of features
● Consistent feature values for training and online inference
● Versioning and lineage tracking
● Ability to share features across teams
Non-Functional Requirements
● Low latency for online serving
● Scalability for hundreds or thousands of features
● Data governance and auditing
Architecture Overview
● Offline Store:
Data warehouse (e.g. Snowflake, BigQuery) stores large feature tables for training.
● Online Store:
Low-latency databases like Redis or DynamoDB serve real-time features for production
inference.
● Processing:
Spark pipelines calculate and store new feature values.
● Serving Layer:
REST APIs expose features to models in real-time.
Low-Level Design
● Maintain feature definitions in a metadata catalog.
● Batch compute features daily and write to warehouse.
● Real-time compute handled by streaming jobs (e.g. Flink).
● Implement TTL (time-to-live) for online store data.
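Code Sketch
A minimal sketch of the batch-to-online path: features computed in Spark are pushed into Redis with a TTL for low-latency serving; the key format, feature names, and connection details are illustrative assumptions.

import json
import redis
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("feature-publish").getOrCreate()

orders = spark.read.parquet("s3a://warehouse/orders/")        # assumed offline source

# Daily batch features per user; the same table is kept in the offline store for training.
features = (orders.groupBy("user_id")
            .agg(F.count("*").alias("orders_30d"),
                 F.avg("amount").alias("avg_order_value")))

def push_partition(rows):
    # One connection per partition; keys expire after 48h so stale features age out.
    r = redis.Redis(host="feature-redis", port=6379)          # assumed online store
    for row in rows:
        key = f"features:user:{row['user_id']}"
        value = json.dumps({"orders_30d": row["orders_30d"],
                            "avg_order_value": row["avg_order_value"]})
        r.set(key, value, ex=48 * 3600)

features.foreachPartition(push_partition)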
Example Diagram
Raw Data → Spark → Feature Store (Offline & Online)
↓
ML Models
Trade-offs & Reasoning
● Offline stores are cost-efficient but slow for real-time predictions.
● Online stores cost more but reduce inference latency.
● Versioning avoids training-serving skew.
Security & Compliance
● RBAC on who can access features.
● Lineage tracking for auditability.
● Masking of sensitive feature values.
Interview Talking Points:
● Explain how to avoid feature drift.
● Discuss handling time-travel queries for training consistency.
● Describe the trade-offs between online and offline stores.
12. IoT Sensor Data Platform
Prompt:
Design a platform to ingest, store, and analyse temperature sensor data from 100,000 IoT
devices across India, supporting real-time alerts for extreme readings.
Real-World Context:
IoT applications in manufacturing, agriculture, and smart cities generate vast amounts of sensor data that must be processed in real time for critical alerts.
Clarifying Questions
● How frequently does each sensor send data? (e.g. once a second or once a minute?)
● Are alerts needed in real time or can they be delayed slightly?
● How long should raw data be retained?
● Is there location metadata included?
Functional Requirements
● High-frequency data ingestion
● Real-time processing to detect extreme readings
● Store raw data for historical analysis
● Provide dashboards for time-series trends
Non-Functional Requirements
● Low latency for critical alerts
● Cost-effective storage
● Fault tolerance for device outages
Architecture Overview
● Ingestion:
Use an MQTT broker for lightweight IoT communication; Kafka may be used if messages are larger or arrive at high frequency.
● Processing:
Flink processes incoming streams, applies threshold rules, and triggers alerts for
extreme values.
● Storage:
○ Raw data → S3 or GCS for cost-effective archival
○ Aggregated data → Time-series DB like TimescaleDB or InfluxDB
● Alerting:
Serverless functions (like AWS Lambda) send notifications (SMS, email, etc.).
● Visualization:
Grafana dashboards for time-series trends and anomalies.
Low-Level Design
● Flink uses keyed streams to separate processing per device.
● Maintain stateful computations for rolling averages.
● Define alert thresholds dynamically via a config service.
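Code Sketch
A PySpark Structured Streaming stand-in for the Flink job described above: keyed per-device rolling averages with a simple threshold filter; the topic, schema, and 45 °C threshold are illustrative assumptions.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("iot-alerts").getOrCreate()

schema = (StructType()
          .add("device_id", StringType())
          .add("temperature_c", DoubleType())
          .add("reading_time", TimestampType()))

readings = (spark.readStream.format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")   # assumed broker
            .option("subscribe", "sensor-readings")             # assumed topic (e.g. bridged from MQTT)
            .load()
            .select(F.from_json(F.col("value").cast("string"), schema).alias("r"))
            .select("r.*"))

# Per-device rolling average over 5-minute windows; readings above the threshold raise alerts.
rolling = (readings
           .withWatermark("reading_time", "2 minutes")
           .groupBy("device_id", F.window("reading_time", "5 minutes", "1 minute"))
           .agg(F.avg("temperature_c").alias("avg_temp")))

alerts = rolling.filter(F.col("avg_temp") > 45.0)   # illustrative extreme-heat threshold

(alerts.writeStream
 .outputMode("update")
 .format("console")       # swap for a Lambda/notification sink in production
 .start())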
Example Diagram
+------------+
IoT → | MQTT/Kafka | → Flink → Alerts
+------------+ ↓
+-----------+
| S3 (Raw) |
+-----------+
↓
TimescaleDB → Grafana
Trade-offs & Reasoning
● MQTT is efficient for IoT devices with limited bandwidth.
● Flink provides low-latency, stateful stream processing.
● Time-series DBs are ideal for fast queries on sensor data.
Security & Compliance
● TLS for secure communication between devices and brokers.
● Device-level authentication via tokens or certificates.
● Access control on dashboards for sensitive operational data.
Interview Talking Points:
● Explain why MQTT is preferred over HTTP for IoT.
● Discuss how to handle network partitions affecting real-time alerts.
● Describe cost trade-offs for keeping all raw data vs aggregates.