Week 3 - Data Engineering Lifecycle
Data Platforms, Data Stores, and Security
Summary and Highlights
The architecture of a data platform can be seen as a set of layers, or functional components, each performing a set of specific tasks (a minimal code sketch follows the list). These layers include:
Data Ingestion or Data Collection Layer, responsible for bringing data from source systems into the data platform.
Data Storage and Integration Layer, responsible for storing and merging extracted data.
Data Processing Layer, responsible for validating, transforming, and applying business rules to data.
Analysis and User Interface Layer, responsible for delivering processed data to data consumers.
Data Pipeline Layer, responsible for implementing and maintaining a continuously flowing data pipeline.
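To make the layering concrete, here is a minimal Python sketch of how these components might hand data to one another. The function names, the in-memory store, and the validation rule are all illustrative assumptions, not part of any specific platform.

```python
# Minimal sketch of the data platform layers as plain Python functions.
# All names, the in-memory store, and the validation rule are illustrative.

def ingest(source_records):
    """Ingestion/Collection layer: bring raw data in from a source system."""
    return list(source_records)

def store_and_integrate(raw, storage):
    """Storage and Integration layer: persist and merge extracted data."""
    storage.extend(raw)
    return storage

def process(storage):
    """Processing layer: validate, transform, and apply business rules."""
    return [r for r in storage if r.get("amount", 0) > 0]  # toy validation rule

def serve(processed):
    """Analysis and UI layer: deliver processed data to consumers."""
    for record in processed:
        print(record)

# The pipeline layer wires the other layers into a continuous flow.
storage = []
raw = ingest([{"amount": 120}, {"amount": -5}, {"amount": 40}])
serve(process(store_and_integrate(raw, storage)))
```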
A well-designed data repository is essential for building a system that is scalable and capable of performing during high
workloads.
The choice or design of a data store is influenced by the type and volume of data that needs to be stored, the intended
use of data, and storage considerations. The privacy, security, and governance needs of your organization also influence
this choice.
Confidentiality, Integrity, and Availability, known as the CIA triad, are the three key components of an effective information security strategy. The CIA triad applies to all facets of security, be it infrastructure, network, application, or data security.
Practice Quiz
Question 1
Which one of these steps is an intrinsic part of the “Data Storage and Integration Layer” of a data platform?
Read data in batch or streaming modes from storage and apply transformations
Transform and merge extracted data, either logically or physically
Transfer data from data sources to the data platform in streaming, batch, or both modes
Deliver processed data to data consumers
The Storage and Integration layer in a data platform stores, transforms, and merges extracted data to make it available for
data processing.
Question 2
Systems that are used for capturing high-volume transactional data need to be designed for faster response times to
complex queries.
True
False
Systems that are used for capturing high-volume transactional data need to be designed for high-speed read, write, and
update operations.
Question 3
What is the role of “Intrusion Detection” and “Intrusion Prevention” in the area of network security?
Ensure endpoint security by allowing only authorized devices to connect to the network
Inspect incoming network traffic for intrusion attempts and vulnerabilities
Create silos, or virtual local area networks, within a network so that you can segregate your assets
Ensure attackers cannot tap into data while it is in transit
Intrusion Detection and Intrusion Prevention systems inspect incoming network traffic for vulnerabilities and intrusion attempts and prevent them from happening.
Graded Quiz
Question 1
Which one of these steps is an intrinsic part of the “Data Processing Layer” of a data platform?
Deliver processed data to data consumers
Transfer data from data sources to the data platform in streaming, batch, or both modes
Transform and merge extracted data, either logically or physically
Read data in batch or streaming modes from storage and apply transformations
Question 2
Systems that are used for capturing high-volume transactional data need to be designed for high-speed read, write, and
update operations.
True
False
High-speed read, write, and update operations are essential for systems that need to capture large volumes of
transactional data.
Question 3
What is the role of “Network Access Control” systems in the area of network security?
To ensure endpoint security by allowing only authorized devices to connect to the network
To ensure attackers cannot tap into data while it is in transit
To create silos, or virtual local area networks, within a network so that you can segregate your assets
To inspect incoming network traffic for intrusion attempts and vulnerabilities
Ensuring that only authorized devices connect to the network is achieved with the help of Network Access Control systems.
Question 4
____________ ensures that users access information based on their roles and the privileges assigned to their roles.
Authentication
Authorization
Firewalls
Security Monitoring
One of the primary controls for data security is to enable access to data through a system of Authorization. It allows
access to information based on a user’s role and role-based privileges.
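As a toy illustration of role-based authorization in Python (the roles and privileges here are invented for the example, not taken from any particular product):

```python
# Toy role-based authorization check; the roles and privileges are invented
# for illustration and not tied to any particular product.
ROLE_PRIVILEGES = {
    "analyst": {"read"},
    "engineer": {"read", "write"},
    "admin": {"read", "write", "delete"},
}

def is_authorized(role, action):
    """Allow an action only if the user's role grants that privilege."""
    return action in ROLE_PRIVILEGES.get(role, set())

print(is_authorized("analyst", "read"))    # True
print(is_authorized("analyst", "delete"))  # False
```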
Question 5
Security Monitoring and Intelligence systems:
Create virtual local area networks within a network so that you can segregate your assets
Create an audit history for triage and compliance purposes
Ensure users access information based on their role and privileges
Ensure only authorized devices can connect to a network
Security Monitoring and Intelligence systems create an audit trail and provide reports and alerts that help enterprises react
to security violations in time.
Data Collection and Data Wrangling
Summary and Highlights
Depending on where the data must be sourced from, there are a number of methods and tools available for gathering
data. These include query languages for extracting data from databases, APIs, Web Scraping, Data Streams, RSS Feeds,
and Data Exchanges.
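As one example among these methods, pulling JSON from a REST API with Python's requests library might look like the sketch below; the URL and query parameters are placeholders, and a real API may also require authentication.

```python
import requests

# Hypothetical REST endpoint and parameters; substitute a real API you have
# access to, along with any authentication it requires.
url = "https://api.example.com/v1/observations"
params = {"city": "Toronto", "limit": 100}

response = requests.get(url, params=params, timeout=10)
response.raise_for_status()   # fail loudly on HTTP errors
records = response.json()     # many APIs return JSON payloads
print(len(records), "records fetched")
```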
Once the data you need has been gathered and imported, your next step is to make it analytics-ready. This is where the
process of Data Wrangling, or Data Munging, comes in.
Data Wrangling involves a whole range of transformations and cleansing activities performed on the data. Transformation of raw data includes the tasks you undertake to (see the sketch after this list):
Structurally manipulate and combine data using Joins and Unions.
Normalize data, that is, clean the database of unused and redundant data.
Denormalize data, that is, combine data from multiple tables into a single table so that it can be queried faster.
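Here is a minimal pandas sketch of these structural transformations; the tables, column names, and values are made up for the example.

```python
import pandas as pd

# Illustrative tables; the column names and values are invented.
orders = pd.DataFrame({"order_id": [1, 2, 3], "cust_id": [10, 10, 20],
                       "amount": [50, 75, 20]})
more_orders = pd.DataFrame({"order_id": [4], "cust_id": [20], "amount": [90]})
customers = pd.DataFrame({"cust_id": [10, 20], "name": ["Ana", "Ben"]})

# Join: combine columns from two tables on a shared key.
joined = orders.merge(customers, on="cust_id", how="inner")

# Union: stack rows from tables that share the same schema.
all_orders = pd.concat([orders, more_orders], ignore_index=True)

# Denormalization: keep the joined, single-table form so reads avoid
# repeating the join at query time.
denormalized = all_orders.merge(customers, on="cust_id")
print(denormalized)
```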
Cleansing activities (see the sketch after this list) include:
Profiling data to uncover anomalies and quality issues.
Visualizing data using statistical methods in order to spot outliers.
Fixing issues such as missing values, duplicate data, irrelevant data, inconsistent formats, syntax errors, and outliers.
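And a companion sketch of the cleansing side, again on made-up data: profiling with summary statistics, fixing inconsistent formats, duplicates, and missing values, and flagging outliers with a simple interquartile-range rule.

```python
import pandas as pd

# Toy dataset with typical quality issues; all values are invented.
df = pd.DataFrame({"sale_amount": [100.0, None, 100.0, 98.5, 5000.0],
                   "region": ["east", "east", "east", "West", "west"]})

# Profiling: summary statistics surface anomalies and quality issues.
print(df.describe(include="all"))

# Fix inconsistent formats, then remove duplicate rows.
df["region"] = df["region"].str.lower()
df = df.drop_duplicates()

# Fill missing values (here, with the column median).
df["sale_amount"] = df["sale_amount"].fillna(df["sale_amount"].median())

# Flag outliers with a simple interquartile-range rule.
q1, q3 = df["sale_amount"].quantile([0.25, 0.75])
iqr = q3 - q1
print(df[(df["sale_amount"] < q1 - 1.5 * iqr) |
         (df["sale_amount"] > q3 + 1.5 * iqr)])
```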
A variety of software and tools are available for the data wrangling process. Some of the popular ones include Excel Power Query, Spreadsheets, OpenRefine, Google DataPrep, Watson Studio Refinery, Trifacta Wrangler, Python, and R, each with its own set of features, strengths, limitations, and applications.
Practice Quiz
Question 1
How is data gathered using Application Programming Interfaces, or APIs?
APIs are used for aggregating constant streams of data flowing from instruments, IoT devices and applications, and
GPS data from cars
APIs are used for downloading specific data from web pages based on defined parameters
APIs are used for capturing updated data from online forums and news sites where data is refreshed on an ongoing
basis
APIs are invoked from applications to access databases, web services, data marketplaces and other such
data endpoints for gathering data
Question 2
What is one of the common structural transformations used for combining data from one or more tables?
Joins
Cleaning
Denormalization
Normalization
Question 3
What tool allows you to discover, cleanse, and transform data with built-in operations?
Watson Studio Refinery
OpenRefine
Trifacta Wrangler
Google DataPrep
Watson Studio Refinery has built-in features that allow you to discover, cleanse, and transform data.
Graded Quiz
Question 1
Web scraping is used to extract what type of data?
Text, videos, and data from relational databases
Text, videos, and images
Images, videos, and data from NoSQL databases
Data from news sites and NoSQL databases
Question 2
___________ focuses on cleaning the database of unused data and reducing redundancy and inconsistency.
Denormalization
Data Visualization
Data Profiling
Normalization
Normalization cleanses the database of unused data and of inconsistencies in data coming from multiple sources.
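One way to picture normalization: a redundant single table is split so that each fact is stored only once. The toy tables below are invented for the illustration.

```python
import pandas as pd

# A redundant, denormalized table: customer details repeat on every order.
orders_flat = pd.DataFrame({
    "order_id": [1, 2, 3],
    "cust_id": [10, 10, 20],
    "cust_name": ["Ana", "Ana", "Ben"],
})

# Normalization splits it so that each customer is stored exactly once.
customers = orders_flat[["cust_id", "cust_name"]].drop_duplicates()
orders = orders_flat[["order_id", "cust_id"]]
print(customers)
print(orders)
```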
Question 3
OpenRefine is an open-source tool that allows you to:
Transform data into a variety of formats such as TSV, CSV, XLS, XML, and JSON
Automatically detect schemas, data types, and anomalies
Enforce applicable data governance policies automatically
Use add-ins such as Microsoft Power Query to identify issues and clean data
Question 4
When you’re combining rows of data from multiple source tables into a single table, what kind of data transformation are
you performing?
Denormalization
Joins
Unions
Normalization
Unions are a common structural transformation used for combining rows of data from multiple source tables.
Question 5
When you detect a value in your data set that is vastly different from other observations in the same data set, what would
you report that as?
Missing value
Irrelevant data
Outlier
Syntax error
Outliers are values in your data set that may be vastly different from other values in the same data field.
Querying Data, Performance Tuning, and Troubleshooting
Summary and Highlights
For raw data to become analytics-ready, a number of transformation and cleansing tasks need to be performed on it, and that requires you to understand your dataset from multiple perspectives. One way to explore your dataset is to query it.
Basic querying techniques can help you explore your data: counting and aggregating a dataset, identifying extreme values, slicing data, sorting data, filtering patterns, and grouping data.
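As a quick illustration, each of these techniques maps to a one-liner in pandas (SQL equivalents such as COUNT, MAX, ORDER BY, WHERE, and GROUP BY do the same job); the table below is made up.

```python
import pandas as pd

# Made-up sales table for illustration.
sales = pd.DataFrame({"region": ["east", "west", "east", "west"],
                      "sale_amount": [120.0, 80.0, 300.0, 95.0]})

print(len(sales))                                     # counting
print(sales["sale_amount"].sum())                     # aggregating
print(sales["sale_amount"].max())                     # extreme values
print(sales.iloc[:2])                                 # slicing
print(sales.sort_values("sale_amount"))               # sorting
print(sales[sales["sale_amount"] > 100])              # filtering
print(sales.groupby("region")["sale_amount"].mean())  # grouping
```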
In a data engineering lifecycle, data pipelines, platforms, databases, applications, tools, queries, and scheduled jobs need to be constantly monitored for performance and availability.
The performance of a data pipeline can be impacted if the workload increases significantly, applications fail, a scheduled job does not work as expected, or some of the tools in the pipeline run into compatibility issues.
Databases are susceptible to outages, capacity overutilization, application slowdown, and conflicting activities and
queries being executed simultaneously.
Monitoring and alerting systems collect quantitative data in real time to give visibility into the performance of data
pipelines, platforms, databases, applications, tools, queries, scheduled jobs, and more.
Time-based and condition-based maintenance schedules generate data that helps identify systems and procedures
responsible for faults and low availability.
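A minimal sketch of what a threshold-based alert on one pipeline metric might look like; the metric source, threshold, and alert channel are all placeholder stubs.

```python
import time

# Placeholder threshold; a real system would read metrics from a monitoring
# backend and route alerts to email, Slack, a pager, and so on.
LATENCY_THRESHOLD_SECONDS = 30.0

def read_pipeline_latency():
    """Stub: pretend to fetch the latest job latency metric."""
    return 42.0

def alert(message):
    """Stub alert channel: just print with a timestamp."""
    print(f"[ALERT {time.strftime('%H:%M:%S')}] {message}")

latency = read_pipeline_latency()
if latency > LATENCY_THRESHOLD_SECONDS:
    alert(f"Pipeline latency {latency:.1f}s exceeds {LATENCY_THRESHOLD_SECONDS}s")
```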
Practice Quiz
Question 1
In the video, we used a query function to see how spread out the values in the “Sale Amount” field are. What function did
we use?
Average
Count
Maximum Value
Standard Deviation
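Of the options above, standard deviation is the one that measures spread. Computed on made-up Sale Amount values:

```python
import statistics

# Made-up Sale Amount values; the standard deviation measures how spread
# out they are around the mean.
sale_amounts = [120.0, 80.0, 300.0, 95.0]
print(statistics.stdev(sale_amounts))  # sample standard deviation
```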
Question 2
______________ helps you assess if the size of a workload is slowing down the system.
Monitoring the performance of queries
Job-level Runtime Monitoring
Monitoring the amount of data being processed through a data pipeline
Database Monitoring
Governance and Compliance
Summary and Highlights
Data Governance is a collection of principles, practices, and processes that help maintain the security, privacy, and
integrity of data through its lifecycle.
Personal Information and Sensitive Personal Information, that is, data that can be traced back to an individual or can be
used to identify or cause harm to an individual, needs to be protected through governance regulations.
General Data Protection Regulation, or GDPR, is one such regulation that protects the personal data and privacy of EU
citizens for transactions that occur within EU member states.
Regulations such as HIPAA (Health Insurance Portability and Accountability Act) for healthcare, PCI DSS (Payment Card Industry Data Security Standard) for retail, and SOX (Sarbanes-Oxley) for financial data are some of the industry-specific regulations.
Compliance covers the processes and procedures through which an organization adheres to regulations and conducts its
operations in a legal and ethical manner.
Compliance requires organizations to maintain an auditable trail of personal data through its lifecycle, which includes
acquisition, processing, storage, sharing, retention, and disposal of data.
Tools and technologies play a critical role in the implementation of a governance framework, offering features such as the following (a pseudonymization sketch follows the list):
Authentication and Access Control.
Encryption and Data Masking.
Hosting options that comply with requirements and restrictions for international data transfers.
Monitoring and Alerting functionalities.
Data erasure tools that ensure deleted data cannot be retrieved.
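As one concrete example of masking and pseudonymization, a personal identifier can be replaced with a salted hash; the salt value and the helper below are illustrative, not a prescribed implementation.

```python
import hashlib

# Illustrative pseudonymization: replace a personal identifier with a
# salted hash. The salt is a placeholder and must be kept secret and
# managed according to your governance policy.
SALT = b"replace-with-a-secret-salt"

def pseudonymize(value):
    """Return a stable, hard-to-reverse token for a personal identifier."""
    return hashlib.sha256(SALT + value.encode("utf-8")).hexdigest()[:16]

print(pseudonymize("jane.doe@example.com"))
```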
Practice Quiz
Question 1
At what stage of the data lifecycle would you establish which third-party vendors in your supply chain will have access to
the data you are collecting?
Data Sharing
Data Acquisition
Data Processing
Data Storage
It is in the Data Sharing phase of the data lifecycle that you establish which third-party vendors will have access to your
data, and how they will be held accountable to the same regulations you are liable for.
Graded Quiz
Question 1
In which phase of the data lifecycle do you establish the data you need, the amount of data you need, and how you intend to use the data you are collecting?
Data Processing
Data Acquisition
Data Sharing
Data Retention
In the Data Acquisition phase, you establish the data you need to collect, the amount of data you need, and its intended
use.
Question 2
The process of _____________ abstracts the presentation layer without changing the data in the database physically.
Encryption
Data Profiling
Anonymization
Pseudonymization
Anonymization abstracts the presentation layer without changing the data in the database itself.