Data Engineer Interview Q&A Cheat Sheet - Shaik
Q: What is ETL and how have you implemented it?
A: ETL stands for Extract, Transform, Load. I implemented ETL using SSIS to extract data from flat files and SQL sources, applied transformations (e.g., lookups, derived columns), and loaded the results into SQL Server. For cloud ETL, I used Azure Data Factory to load data from Blob Storage into Azure SQL DB.
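Outside SSIS/ADF, the same extract-transform-load flow can be sketched in Python; the input file, connection string, and target table below are hypothetical placeholders, not the actual implementation:

import csv
import pyodbc

# Hypothetical connection string and source file for illustration.
CONN_STR = "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=staging;Trusted_Connection=yes"
SOURCE_FILE = "customers.csv"

def run_etl():
    conn = pyodbc.connect(CONN_STR)
    cursor = conn.cursor()
    with open(SOURCE_FILE, newline="") as f:
        for row in csv.DictReader(f):
            # Transform: derive a full-name column, normalize email case.
            full_name = f"{row['first_name']} {row['last_name']}".strip()
            email = row["email"].lower()
            # Load into the target table.
            cursor.execute(
                "INSERT INTO dbo.Customers (FullName, Email) VALUES (?, ?)",
                full_name, email,
            )
    conn.commit()
    conn.close()

if __name__ == "__main__":
    run_etl()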
Q: What is the difference between ADF and SSIS?
A: SSIS is an on-premises ETL tool, while ADF is a cloud-native data integration service. ADF scales on demand and integrates with a broad range of Azure and external services; SSIS works best within SQL Server environments.
Q: How do you debug failed pipelines in Azure Data Factory?
A: Check the pipeline run in the Monitor tab, review each activity's output and error message, inspect Linked Service connections and dataset schema mismatches, and rerun the pipeline in Debug mode to trace errors.
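The same run history shown in the Monitor tab can also be pulled programmatically. A minimal sketch using the azure-mgmt-datafactory SDK, where the subscription ID, resource group, and factory name are placeholders:

from datetime import datetime, timedelta
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters

# Placeholder identifiers for illustration.
client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
filters = RunFilterParameters(
    last_updated_after=datetime.utcnow() - timedelta(days=1),
    last_updated_before=datetime.utcnow(),
)

# Find failed pipeline runs, then drill into each activity's error output.
runs = client.pipeline_runs.query_by_factory("<resource-group>", "<factory-name>", filters)
for run in runs.value:
    if run.status == "Failed":
        acts = client.activity_runs.query_by_pipeline_run(
            "<resource-group>", "<factory-name>", run.run_id, filters
        )
        for act in acts.value:
            print(act.activity_name, act.status, act.error)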
Q: How do you design a pipeline to load JSON from Blob to Azure SQL DB?
A: Use a Copy Activity in ADF. Define the source dataset as JSON (linked to Blob Storage) and the sink as Azure SQL DB. Map the schema, and use parameterized file paths if needed.
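The Copy Activity itself is configured in the ADF UI, but as an illustration of the same movement scripted by hand, here is a Python sketch using azure-storage-blob and pyodbc; the connection strings, container, blob path, and table are all hypothetical:

import json
import pyodbc
from azure.storage.blob import BlobServiceClient

# Placeholder connection strings for illustration.
BLOB_CONN = "<blob-connection-string>"
SQL_CONN = "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=sales;Trusted_Connection=yes"

# Extract: read the JSON file from Blob Storage.
blob = BlobServiceClient.from_connection_string(BLOB_CONN).get_blob_client(
    container="landing", blob="orders/orders.json"
)
records = json.loads(blob.download_blob().readall())

# Load: map JSON keys to table columns and insert into Azure SQL DB.
conn = pyodbc.connect(SQL_CONN)
cursor = conn.cursor()
for rec in records:
    cursor.execute(
        "INSERT INTO dbo.Orders (OrderId, Amount) VALUES (?, ?)",
        rec["order_id"], rec["amount"],
    )
conn.commit()
conn.close()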
Q: How do you create APIs using Flask?
A: Use Flask to define routes for CRUD operations (GET, POST, PUT, DELETE). Accept JSON payloads, use SQLAlchemy or pyodbc to interact with the database, and test the endpoints with Postman.
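A minimal runnable sketch of such an API; it keeps data in an in-memory dict so the example stays self-contained, where a real service would call SQLAlchemy or pyodbc instead:

from flask import Flask, jsonify, request

app = Flask(__name__)

# In-memory store standing in for a real database table.
items = {}
next_id = 1

@app.route("/items", methods=["POST"])
def create_item():
    global next_id
    items[next_id] = request.get_json()
    next_id += 1
    return jsonify({"id": next_id - 1}), 201

@app.route("/items/<int:item_id>", methods=["GET"])
def get_item(item_id):
    if item_id not in items:
        return jsonify({"error": "not found"}), 404
    return jsonify(items[item_id])

@app.route("/items/<int:item_id>", methods=["PUT"])
def update_item(item_id):
    items[item_id] = request.get_json()
    return jsonify(items[item_id])

@app.route("/items/<int:item_id>", methods=["DELETE"])
def delete_item(item_id):
    items.pop(item_id, None)
    return "", 204

if __name__ == "__main__":
    app.run(debug=True)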
Q: How do you optimize SQL queries?
A: Avoid SELECT *, filter early with WHERE clauses, create appropriate indexes, prefer JOINs over correlated subqueries, and write set-based logic instead of row-by-row processing.
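For example, a SELECT * plus subquery pattern rewritten set-based, shown here as SQL executed through pyodbc; the connection string, tables, and indexes are hypothetical:

import pyodbc

# Placeholder connection string; table and column names are made up.
conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=sales;Trusted_Connection=yes")

# Slower pattern: SELECT * with a subquery evaluated against every row.
#   SELECT * FROM dbo.Orders
#   WHERE CustomerId IN (SELECT CustomerId FROM dbo.Customers WHERE Region = 'West');

# Set-based rewrite: explicit columns, a JOIN, and a parameterized filter
# that lets the optimizer use indexes on Customers(Region) and Orders(CustomerId).
query = """
SELECT o.OrderId, o.OrderDate, o.Amount
FROM dbo.Orders AS o
JOIN dbo.Customers AS c ON c.CustomerId = o.CustomerId
WHERE c.Region = ?;
"""
rows = conn.cursor().execute(query, "West").fetchall()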
Q: How do you handle version control in a project?
A: Use GitHub for repository management. Create feature branches, commit with descriptive messages, open pull requests, and
resolve merge conflicts collaboratively.
Q: How do you handle bad data in ETL?
A: Use validation checks, route invalid rows to error tables, and alert on data-quality issues using conditional splits and logging.
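A minimal Python sketch of the same conditional-split idea, with a hypothetical input file and made-up validation rules:

import csv

# Hypothetical validation rule standing in for an SSIS conditional split:
# a row is valid only if it has an email and a numeric amount.
def is_valid(row):
    return bool(row["email"]) and row["amount"].replace(".", "", 1).isdigit()

good_rows, bad_rows = [], []
with open("orders.csv", newline="") as f:  # hypothetical input file
    reader = csv.DictReader(f)
    columns = reader.fieldnames
    for row in reader:
        (good_rows if is_valid(row) else bad_rows).append(row)

# Invalid rows are routed to an error file (an error table in a real pipeline)
# so data-quality issues can be reviewed and alerted on; valid rows load on.
with open("orders_errors.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=columns)
    writer.writeheader()
    writer.writerows(bad_rows)

print(f"loaded={len(good_rows)} rejected={len(bad_rows)}")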