Introduction to Data Engineering
Speaker – Baitanik Talukder
| 2024-03-12 | Page 1 Ericsson
Course Content
Module 1: Introduction to Data Engineering
● Overview of Data Engineering
● Role of Data Engineers
● Data Science vs Data Engineering
Module 2: Python Fundamentals
● Basic Data Types and Data Structures in Python
● Control Structures and Functions in Python
● Handling Exceptions and Errors
● Working with Files and File I/O Operations
● Assignment
Module 3: Data Serialization and Different Connectors
● Overview of Data Serialization
● Working with JSON and CSV Data Formats
● Working with databases
  • RDBMS - Postgres
  • NoSQL - Elasticsearch
● Connecting to a Message Bus (Kafka)
● Assignment (Data exchange between SQL and NoSQL)
Module 4: Introduction to DataFrame
● Using the Pandas module
● DataFrame Operations
● Aggregation and Transformations
● Additional tool (PySpark)
● Assignment (Handle large-scale data processing)
Module 5: Packaging & Project Setup
● Using setup.py to build a wheel file
● Running in a containerized environment
● Introduction to workflow management (Argo/Airflow)
● Assignment (Application packaging & deployment)
Module 6: Capstone Project
● Real-world Data Engineering Project
● Presenting and Demonstrating the Capstone Project
What is Data Engineering
● Data engineering is a field in computer science that focuses on designing, building, and maintaining systems and infrastructure for managing large volumes of data. Data engineers are responsible for the development and operation of data pipelines, data warehouses, and other data infrastructure components that enable organizations to collect, store, process, and analyze data efficiently and reliably.

Aspects | Description
Data Acquisition | Data engineers are involved in sourcing data from various internal and external sources, such as databases, APIs, streaming platforms, logs, sensors, and other data repositories.
Data Storage | Data engineers design and implement storage solutions that are optimized for the organization's data requirements. This includes selecting appropriate data storage technologies such as relational databases, NoSQL databases, data lakes, distributed file systems, and cloud storage services.
Data Integration | Data engineers integrate data from disparate sources and formats to create unified and consistent views of the data. This involves resolving data schema inconsistencies, managing data quality issues, and ensuring data integrity across the organization.
Data Transformation | Data engineers develop and maintain ETL (Extract, Transform, Load) processes to move data between different systems and formats. They may use batch processing or real-time streaming techniques depending on the requirements of the use case.
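The ETL (Extract, Transform, Load) idea from the Data Transformation row can be sketched as a small batch pipeline. This is a minimal illustration using only the Python standard library, not a production design: the sample data, the `orders` table, and the function names `extract`, `transform`, `load`, and `run_pipeline` are all assumptions made for this example.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; in practice this could come from a file, an API, or a database.
RAW_CSV = """order_id,customer,amount
1,alice,10.50
2,bob,
3,alice,4.25
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with a missing amount and convert field types."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # skip incomplete records
        cleaned.append(
            (int(row["order_id"]), row["customer"].title(), float(row["amount"]))
        )
    return cleaned


def load(records, conn):
    """Load: write the cleaned records into a SQLite table (assumed schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()


def run_pipeline(text, conn):
    """Run extract -> transform -> load, then return total amount per customer."""
    load(transform(extract(text)), conn)
    query = (
        "SELECT customer, SUM(amount) FROM orders "
        "GROUP BY customer ORDER BY customer"
    )
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    print(run_pipeline(RAW_CSV, connection))
```

Keeping the three stages as separate functions makes each one testable on its own. A real-time streaming variant would apply the same transform step to records as they arrive instead of to a whole batch.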
What is Data Engineering (Cont.)
Overall, data engineering plays a critical role in enabling organizations to derive actionable insights, make data-driven decisions, and drive innovation by providing reliable, scalable, and efficient data infrastructure and processes. Data engineers collaborate closely with data scientists, analysts, and other stakeholders to ensure that data solutions meet the organization's business objectives and data requirements.

Aspects | Description
Data Quality and Governance | Data engineers implement data quality checks, monitoring, and governance mechanisms to ensure the accuracy, completeness, and reliability of the data. This includes establishing data quality metrics, implementing data validation rules, and enforcing data governance policies.
Scalability and Performance | Data engineers design data systems that can scale horizontally and vertically to accommodate growing data volumes and user demands. They optimize data pipelines and infrastructure for performance, reliability, and cost-effectiveness.
Infrastructure Automation | Data engineers leverage automation tools and frameworks to provision, configure, and manage data infrastructure resources efficiently. This may include using infrastructure as code (IaC) tools, containerization technologies, and cloud services for deployment and orchestration.
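The data validation rules and data quality metrics mentioned under Data Quality and Governance can be illustrated with a short sketch. It assumes records are plain Python dictionaries. The `Rule` class, the two example rules, and the `validate` and `completeness` helpers are hypothetical names chosen for this example, not part of any specific tool.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    """A named validation rule; check returns True when a record passes."""
    name: str
    check: Callable[[dict], bool]


# Hypothetical rules for illustration only.
RULES = [
    Rule("customer_present", lambda r: bool(r.get("customer"))),
    Rule(
        "amount_non_negative",
        lambda r: isinstance(r.get("amount"), (int, float)) and r["amount"] >= 0,
    ),
]


def validate(records, rules=RULES):
    """Split records into valid ones and a list of (index, failed rule name)."""
    valid, failures = [], []
    for index, record in enumerate(records):
        broken = [rule.name for rule in rules if not rule.check(record)]
        if broken:
            failures.extend((index, name) for name in broken)
        else:
            valid.append(record)
    return valid, failures


def completeness(records, field):
    """Quality metric: fraction of records with a non-empty value for field."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)


if __name__ == "__main__":
    sample = [
        {"customer": "alice", "amount": 10.5},
        {"customer": "", "amount": 3.0},
        {"customer": "bob", "amount": -1.0},
    ]
    good, bad = validate(sample)
    print(len(good), bad, completeness(sample, "customer"))
```

In a pipeline, checks like these would typically run before the load step, with the failure list feeding monitoring or alerting.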
The Evolving Role of the Data Engineer
Data engineers work in various settings to build
systems that collect, manage, and convert raw data
into usable information for data scientists and
business analysts to interpret. Their ultimate goal is to
make data accessible so that organizations can use it
to evaluate and optimize their performance.
Data Engineering vs Data Science
Area | Data Engineering | Data Science
Focus | Primarily concerned with the design, development, and maintenance of data pipelines and infrastructure. Data engineers focus on the collection, storage, and processing of data at scale, ensuring its accessibility, reliability, and efficiency for downstream analytics and applications. | Focuses on extracting insights and knowledge from data through advanced analytics, statistical modeling, machine learning, and data visualization techniques. Data scientists leverage data to solve complex problems, make predictions, and drive decision-making processes.
Skills | Requires strong programming skills, particularly in languages like Python, Java, or Scala, along with expertise in data storage technologies (e.g., databases, data lakes, distributed file systems), data processing frameworks (e.g., Apache Spark, Hadoop), and proficiency in ETL (Extract, Transform, Load) processes. | Requires a combination of skills in statistics, mathematics, programming (often in Python or R), machine learning, data visualization, and domain expertise. Data scientists must be adept at exploratory data analysis, predictive modeling, and communicating insights effectively.
Responsibilities | Responsibilities include designing and building data pipelines, integrating data from various sources, maintaining data infrastructure, optimizing data storage and retrieval, ensuring data quality and reliability, and collaborating with other teams (e.g., data science, software engineering) to support analytical and operational needs. | Responsibilities include identifying business problems that can be addressed with data analysis, collecting and exploring relevant data, preprocessing and transforming data for analysis, developing and validating predictive models, interpreting results, and communicating findings to stakeholders.
Tools and Technologies | Utilizes tools and technologies for data storage (e.g., relational databases, NoSQL databases, data lakes), data processing (e.g., Apache Spark, Apache Hadoop), workflow management (e.g., Apache Airflow, Luigi), and infrastructure automation (e.g., Kubernetes, Docker). | Relies on tools and technologies for data manipulation and analysis (e.g., Pandas, NumPy), statistical modeling and machine learning (e.g., scikit-learn, TensorFlow, PyTorch), and data visualization (e.g., Matplotlib, Seaborn, Plotly).
End Goals | Aims to ensure efficient, reliable, and scalable data infrastructure to support various data-driven applications and analytical needs within an organization. | Aims to extract actionable insights, patterns, and predictions from data to inform decision-making, optimize processes, drive innovation, and create value for businesses and organizations.
While there is overlap between data engineering and data science, particularly in areas such as data
preprocessing and feature engineering, they represent distinct skill sets and roles within the broader
domain of data analytics. Effective collaboration between data engineers and data scientists is crucial
for successful data-driven initiatives, as they complement each other's expertise in building end-to-end
data solutions and extracting meaningful insights from data.