Data Science – Hierarchy of Needs:
Data warehouses:
Data warehouses are both the engine and the fuel that enable higher-level analytics, such as
business intelligence, online experimentation, machine learning, or IR.
A data warehouse is designed for OLAP (online analytical processing).
A transactional database is designed for OLTP (online transaction processing).
ETL: Extract, Transform, and Load:
The following three conceptual steps are how most data pipelines are designed and structured. They serve as a
blueprint for how raw data is transformed to analysis-ready data. To understand this flow more concretely, I
found the following picture from Robinhood’s engineering blog very useful:
Extract: This is the step where sensors wait for upstream data sources to land (e.g. an upstream
source could be machine or user-generated logs, a relational database copy, an external dataset,
etc.). Once the data is available, we transport it from its source location for further
transformation.
Transform: This is the heart of any ETL job, where we apply business logic and perform
actions such as filtering, grouping, and aggregation to translate raw data into analysis-ready
datasets. This step requires a great deal of business understanding and domain knowledge.
Load: Finally, we load the processed data and transport it to a final destination. Often, this
dataset can either be consumed directly by end-users or be treated as yet another
upstream dependency of another ETL job, forming the so-called data lineage.
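To make the three steps concrete for myself, here is a minimal Python sketch. It is only an illustration under my own assumptions: the log records, the field names (`user`, `event`, `amount`), and the in-memory `warehouse` dictionary are made up and stand in for a real source and destination.

```python
from collections import defaultdict

# Hypothetical raw records standing in for an upstream source (e.g. user logs).
RAW_LOGS = [
    {"user": "alice", "event": "purchase", "amount": 30.0},
    {"user": "bob", "event": "page_view", "amount": 0.0},
    {"user": "alice", "event": "purchase", "amount": 12.5},
    {"user": "bob", "event": "purchase", "amount": 7.0},
]


def extract(source):
    """Extract: read the raw records once the upstream source has landed."""
    return list(source)


def transform(records):
    """Transform: filter to purchases, then group by user and sum the amounts."""
    totals = defaultdict(float)
    for record in records:
        if record["event"] == "purchase":  # filtering
            totals[record["user"]] += record["amount"]  # grouping + aggregation
    return dict(totals)


def load(dataset, destination):
    """Load: write the analysis-ready dataset to its final destination."""
    destination["purchase_totals_by_user"] = dataset
    return destination


warehouse = {}  # stand-in for a real warehouse table
load(transform(extract(RAW_LOGS)), warehouse)
print(warehouse)
```

The output of `load` could itself be the input of another `extract`, which is how one job becomes an upstream dependency of the next.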
What is Data Engineering?
Data Engineering steps:
Data Engineering Viewpoints:
The role of a Data Engineer includes:
Gathering data from disparate sources.
Integrating data into a unified view for data consumers.
Preparing data for analytics and reporting.
Managing data pipelines for a continuous flow of data from source to destination systems.
Managing the complete infrastructure for the collection, processing, and storage of data.
Technical skills of Software Engineer:
Data Pipeline example:
Data Engineering Road Map:
Data Engineer Tools:
Data Repositories: Transactional and Analytical (non-relational databases, DW, DM, DL, and big data
stores)
A data warehouse is relational. It is a central repository for data collected from different sources, and
the data it holds is analysis-ready.
Data integration involves combining data residing in different sources and providing users with a unified
view of them. This process becomes significant in a variety of situations, which include both commercial
and scientific domains.
Transactional means the system is suited to transactions: for example, it allows data to be updated while it is being queried.
While data integration combines disparate data into a unified view of the data, a data pipeline covers the
entire data movement journey from source to destination systems, and ETL is a process within data
integration.
Integration pipeline.
Types of Data:
Extracting Data:
Databases:
Others:
Systems that are used for capturing high-volume transactional data need to be designed for high-speed
read, write, and update operations, not for complex querying.
Unstructured data can be stored in NoSQL databases or data lakes.
Join vs. Union:
Join combines columns from two tables, matching rows on a key.
Union combines rows from two result sets that have the same columns.
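A small sketch of the difference, using Python's built-in `sqlite3` module with an in-memory database. The table names (`customers`, `orders`, `archived_customers`) and their contents are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, total REAL);
    CREATE TABLE archived_customers (id INTEGER, name TEXT);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben');
    INSERT INTO orders VALUES (1, 20.0), (2, 35.5);
    INSERT INTO archived_customers VALUES (3, 'Cy');
    """
)

# JOIN: each result row carries columns from both tables, matched on the key.
joined = conn.execute(
    "SELECT c.name, o.total FROM customers c "
    "JOIN orders o ON o.customer_id = c.id ORDER BY c.id"
).fetchall()

# UNION: the result keeps the same columns and stacks the rows of both queries.
unioned = conn.execute(
    "SELECT id, name FROM customers "
    "UNION SELECT id, name FROM archived_customers ORDER BY id"
).fetchall()

print(joined)
print(unioned)
conn.close()
```

The join result is wider (name plus total), while the union result is longer (three customers instead of two).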
How do you clean data?
Step 1: Remove duplicate or irrelevant observations
Step 2: Fix structural errors
Step 3: Filter unwanted outliers
Step 4: Handle missing data
Step 5: Validate and QA
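The five steps can be sketched in plain Python as follows. This is a toy example under my own assumptions: the observations, the plausible temperature range of -60 to 60 °C, and the choice of median imputation are all illustrative, and real cleaning rules depend on the dataset.

```python
from statistics import median

# Hypothetical raw observations; None marks a missing value.
raw = [
    {"city": "Paris", "temp_c": 21.0},
    {"city": "Paris", "temp_c": 21.0},    # exact duplicate
    {"city": " lyon ", "temp_c": None},   # structural error + missing value
    {"city": "Nice", "temp_c": 24.0},
    {"city": "Lille", "temp_c": 950.0},   # implausible reading
]

# Step 1: remove duplicate observations, keeping the first occurrence.
seen, deduped = set(), []
for row in raw:
    key = (row["city"], row["temp_c"])
    if key not in seen:
        seen.add(key)
        deduped.append(row)

# Step 2: fix structural errors (stray whitespace, inconsistent capitalization).
fixed = [{**row, "city": row["city"].strip().title()} for row in deduped]

# Step 3: filter unwanted outliers (assumed plausible range: -60 to 60 °C).
in_range = [
    row for row in fixed
    if row["temp_c"] is None or -60 <= row["temp_c"] <= 60
]

# Step 4: handle missing data by imputing the median of the known values.
fill = median(row["temp_c"] for row in in_range if row["temp_c"] is not None)
cleaned = [
    {**row, "temp_c": fill if row["temp_c"] is None else row["temp_c"]}
    for row in in_range
]

# Step 5: validate and QA with simple checks on the final dataset.
assert all(row["temp_c"] is not None for row in cleaned)
assert len({row["city"] for row in cleaned}) == len(cleaned)
print(cleaned)
```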
SQL: ORDER BY, GROUP BY, LIKE.
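A short example that uses all three clauses in one query, again with Python's built-in `sqlite3` module. The `sales` table and its rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sales (product TEXT, region TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('laptop', 'north', 1200.0),
        ('laptop', 'south', 900.0),
        ('lamp', 'north', 40.0),
        ('desk', 'south', 300.0);
    """
)

# LIKE 'la%' keeps only products whose name starts with "la";
# GROUP BY collapses the remaining rows to one row per product;
# ORDER BY sorts the grouped result by the summed amount.
rows = conn.execute(
    """
    SELECT product, SUM(amount) AS total
    FROM sales
    WHERE product LIKE 'la%'
    GROUP BY product
    ORDER BY total DESC
    """
).fetchall()

print(rows)
conn.close()
```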
Data Processing
Data Retention
Data Sharing
Data Acquisition
Anonymization
Pseudonymization
Data Profiling
Encryption
Data Platforms:
Data Optimization for performance:
Data Figures:
DynamoDB:
SQL or relational: less storage, higher CPU
NoSQL: more storage, less CPU
SQL: data is consistent and up to date
NoSQL: data isn't necessarily the same version on all nodes at a given moment, but it is considered eventually consistent
There are a number of reasons why you might want to start using a NoSQL database system like
DynamoDB.
You may be looking for something that provides an easier method for horizontal scalability.
Maybe you're looking for a low latency way to handle a high amount of transactional data.
Starting with the relational database system, these were a big leap forward for their time.
And the fact that they're still heavily utilized speaks to their strong functionality.
A major issue that occurred during the earlier days of data storage was the high expense of storage.
The unstructured data required large amounts of storage and as the need for quick access to the data
grew, so did the need to better manage the cost of handling that data, while also continuing to keep up
with the performance needs.
The relational database management system provided a way to normalize the data, in order to provide a
methodology for analytical processing.
The relational database application gave a strong analytical approach for business intelligence.
It provided better access and organization for the data being created, and provided a system that could
be both redundant and scalable.
Another thing the relational system provided was a trade-off of storage costs for CPU costs.
While the data was being stored in a much more efficient manner, the ability to consume and normalize
that data required more compute capabilities than was originally needed.
Additionally, it took much more CPU power in order to take the normalized data across multiple datasets
and tables, and display it based on what was being requested.
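A rough way to see this trade-off in code, as a sketch only: the same three orders are stored once in a denormalized form (customer details repeated on every row) and once in a normalized form (customer details stored once and looked up at read time). The data and the helper name `read_orders` are my own invention.

```python
# Denormalized: every order row repeats the customer's details (more storage).
denormalized = [
    {"order_id": 1, "customer": "Ana", "city": "Lisbon", "total": 20.0},
    {"order_id": 2, "customer": "Ana", "city": "Lisbon", "total": 35.0},
    {"order_id": 3, "customer": "Ben", "city": "Porto", "total": 12.0},
]

# Normalized: customer details are stored once and referenced by key.
customers = {
    1: {"customer": "Ana", "city": "Lisbon"},
    2: {"customer": "Ben", "city": "Porto"},
}
orders = [
    {"order_id": 1, "customer_id": 1, "total": 20.0},
    {"order_id": 2, "customer_id": 1, "total": 35.0},
    {"order_id": 3, "customer_id": 2, "total": 12.0},
]


def read_orders(order_rows, customer_rows):
    """Rebuild the wide view; the per-row lookup is the extra compute."""
    return [
        {
            "order_id": order["order_id"],
            **customer_rows[order["customer_id"]],
            "total": order["total"],
        }
        for order in order_rows
    ]


rebuilt = read_orders(orders, customers)
city_copies_denormalized = sum(1 for row in denormalized if "city" in row)
city_copies_normalized = len(customers)
print(rebuilt == denormalized, city_copies_denormalized, city_copies_normalized)
```

The normalized form stores each city once instead of once per order, but every read has to do the lookup that the denormalized form avoids.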
When looking at NoSQL, it is important to remember that this does not mean anti-SQL, but instead, not
only SQL.
This should be seen as another option to keep in your tool bag, in addition to the relational systems.
So before we dive into how NoSQL systems work, let's quickly introduce the CAP Theorem.
CAP, or C-A-P, stands for consistency, availability, and partition tolerance.
Partition tolerance is a requirement for our distributed systems regardless of the use case.
Because of this, our operations have to choose between consistency and availability whenever a partition occurs.
If we choose strong consistency, then we are choosing consistency and partition tolerance, and our
availability will suffer, which shows up in the response times.
If we choose the highest levels of availability, then we are choosing availability and partition tolerance,
and our consistency will be eventual instead of strong.
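The choice can be illustrated with a toy two-replica store. This is a deliberately simplified sketch, not how any real database implements replication: the class names, the `"CP"` and `"AP"` mode labels, and the `heal` method are all hypothetical.

```python
class Replica:
    """One copy of a single stored value."""

    def __init__(self):
        self.value = None


class TwoNodeStore:
    """Toy store with replicas A and B; `mode` is "CP" or "AP"."""

    def __init__(self, mode):
        self.mode = mode
        self.a, self.b = Replica(), Replica()
        self.partitioned = False

    def write(self, value):
        """Write through replica A."""
        if self.partitioned:
            if self.mode == "CP":
                # Refuse the write rather than let the replicas diverge.
                raise RuntimeError("unavailable during partition")
            self.a.value = value  # AP: accept locally, replicate later
            return
        self.a.value = self.b.value = value

    def read_from_b(self):
        """Read from replica B, which may lag behind A in AP mode."""
        return self.b.value

    def heal(self):
        """End the partition and bring replica B up to date."""
        self.partitioned = False
        self.b.value = self.a.value


ap = TwoNodeStore("AP")
ap.write("v1")
ap.partitioned = True
ap.write("v2")              # accepted: the store stays available
stale = ap.read_from_b()    # B has not seen the new write yet
ap.heal()
fresh = ap.read_from_b()    # consistent again after the partition heals

cp = TwoNodeStore("CP")
cp.write("v1")
cp.partitioned = True
try:
    cp.write("v2")
    cp_write_ok = True
except RuntimeError:
    cp_write_ok = False     # rejected: consistency kept, availability lost

print(stale, fresh, cp_write_ok)
```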
SQL: row-based.
Cassandra and HBase: column-based (wide-column stores).
Partition and sort keys:
Maximum item size is 400 KB.
Maximum partition key size is 2048 bytes.
Maximum sort key size is 1024 bytes.
With a simple primary key, the partition key alone is the primary key.
With a composite primary key, the partition key and the sort key together form the primary key.
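To check my understanding of composite keys and the limits above, here is an in-memory toy model in Python. It does not use the AWS SDK and is not how DynamoDB works internally: `ToyTable`, its methods, and the key values are hypothetical, the item size is only a rough JSON-based estimate, and I am assuming 1 KB means 1024 bytes.

```python
import json

MAX_ITEM_BYTES = 400 * 1024       # 400 KB limit (assuming 1 KB = 1024 bytes)
MAX_PARTITION_KEY_BYTES = 2048
MAX_SORT_KEY_BYTES = 1024


class ToyTable:
    """In-memory stand-in for a table with a composite primary key."""

    def __init__(self):
        self._items = {}  # (partition_key, sort_key) -> item

    def put(self, partition_key, sort_key, attributes):
        if len(partition_key.encode("utf-8")) > MAX_PARTITION_KEY_BYTES:
            raise ValueError("partition key too large")
        if len(sort_key.encode("utf-8")) > MAX_SORT_KEY_BYTES:
            raise ValueError("sort key too large")
        item = {"pk": partition_key, "sk": sort_key, **attributes}
        # Rough estimate only; the real service computes item size differently.
        if len(json.dumps(item).encode("utf-8")) > MAX_ITEM_BYTES:
            raise ValueError("item too large")
        # The (pk, sk) pair is the primary key, so the same pair overwrites.
        self._items[(partition_key, sort_key)] = item

    def query(self, partition_key):
        """Return all items in one partition, ordered by sort key."""
        keys = sorted(k for k in self._items if k[0] == partition_key)
        return [self._items[k] for k in keys]


table = ToyTable()
table.put("user#1", "2024-02-01", {"total": 35})
table.put("user#1", "2024-01-15", {"total": 20})
table.put("user#2", "2024-01-20", {"total": 12})
result = [item["sk"] for item in table.query("user#1")]
print(result)
```

Items that share a partition key come back together, ordered by sort key, which is the main reason to choose a composite key over a partition key alone.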