Introduction to Big Data
Ref book: Hadoop Essentials by Shiva Achari, ISBN 978-1-78439-668-8
Available on ProQuest Ebook Central
Outcome #1
Describe the concepts of Big Data, its characteristics, and big data domains.
Understand the need for Big Data.
What does Big Data mean?
Big Data is all about finding the needle of value in a haystack of structured, semi-structured and
unstructured information.
Big Data capabilities over traditional systems:
Big data systems can run analytics not only faster but also more efficiently on large data sets; they
widen the scope of research and development analysis and can produce more meaningful insights,
faster, than other analytics or BI systems.
Big data systems emerged because of issues and limitations in traditional systems. Traditional
systems are good for Online Transaction Processing (OLTP) and Business Intelligence (BI), but they
are not easily scalable in terms of cost, effort, and manageability. Heavy computations are difficult
to process, prone to memory issues, or very slow. Traditional systems also fall short in data science
analysis, which is what makes big data systems powerful.
Some examples of big data use cases:
Predictive analytics
Fraud analytics
Machine learning
Identifying patterns
Data analytics
Semi-structured and unstructured data processing and analysis.
V's of big data
Image from Achari, Shiva, Hadoop Essentials, Packt Publishing Ltd, 2015, ProQuest Ebook Central.
Volume
Big data systems are designed to store petabytes, or even zettabytes, of data.
The cost per terabyte of storage in big data systems is much lower than in other systems.
Data is distributed and replicated across multiple nodes (see the sketch after this list).
Storage is easily scalable, and nodes can be added without much maintenance effort.
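To make the distribution and replication idea concrete, here is a minimal Python sketch; it is not how HDFS or any particular system actually places blocks. The node names, the replication factor of 3, and the hash-based placement are illustrative assumptions only.

```python
import hashlib

NODES = ["node1", "node2", "node3", "node4"]   # illustrative cluster nodes
REPLICATION_FACTOR = 3                          # each block stored on 3 nodes

def place_block(block_id):
    """Pick REPLICATION_FACTOR distinct nodes for a block using a hash of its id.

    A toy placement scheme to show distribution and replication; real systems
    (e.g. HDFS) use rack-aware placement policies instead.
    """
    start = int(hashlib.md5(block_id.encode()).hexdigest(), 16) % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(REPLICATION_FACTOR)]

# A file is split into blocks; each block lands on several nodes.
for block in ["file1-block0", "file1-block1", "file1-block2"]:
    print(block, "->", place_block(block))
```

Because each block lives on several nodes, losing one node does not lose data, and adding a node simply gives the placement function more targets.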
Velocity
The rate at which data is processed should be equal to the rate at which data is
generated.
Big data systems can run complex algorithms on huge data sets much more quickly because they
leverage parallel processing across a distributed environment (a small sketch follows).
Big data systems execute multiple processes in parallel, so a job can be completed much faster.
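As a rough illustration of the parallelism point, the following Python sketch splits a heavy computation across worker processes. The data, the chunk count of 4, and the squared-sum "computation" are made up for illustration; it only mimics, on one machine, what a cluster does across nodes.

```python
from multiprocessing import Pool

def heavy_computation(chunk):
    """Stand-in for a complex per-record computation."""
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(10_000_000))
    # Split the data into chunks and process them in parallel,
    # mirroring how big data systems spread work across nodes.
    chunks = [data[i::4] for i in range(4)]
    with Pool(processes=4) as pool:
        partials = pool.map(heavy_computation, chunks)
    # Combine the partial results into the final answer.
    print(sum(partials))
```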
Variety
Another big challenge for traditional systems is handling a variety of semi-structured and
unstructured data such as e-mails, audio and video, images, social media, gene, geospatial,
and 3D data.
Big data systems can not only store such data, but also utilize and process it with
algorithms much more quickly and efficiently.
Big data systems can efficiently handle complex semi-structured and unstructured data
processing with minimal or no preprocessing, unlike other systems (a small example follows).
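A small, hypothetical example of working with semi-structured data: the JSON records below have varying fields with no fixed schema, and the code handles each record as it comes instead of forcing a relational schema up front. The record contents are invented for illustration.

```python
import json

# Semi-structured records: the fields vary from record to record (no fixed schema).
raw_records = [
    '{"user": "alice", "likes": 12, "tags": ["travel", "food"]}',
    '{"user": "bob", "comment": "great post!"}',
    '{"user": "carol", "likes": 3, "video_length_sec": 95}',
]

for line in raw_records:
    record = json.loads(line)
    # Missing fields are handled per record instead of being forced into one schema.
    user = record.get("user", "unknown")
    likes = record.get("likes", 0)
    extra_fields = set(record) - {"user", "likes"}
    print(user, likes, "extra fields:", extra_fields)
```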
Who is creating big data?
Some of the sources creating big data are listed as follows:
Monitoring sensors: Climate or ocean wave monitoring sensors generate data continuously and
in large volumes, and there are millions of such sensors capturing data.
Posts to social media sites: Social media websites such as Facebook, Twitter, and
others have a huge amount of data in petabytes.
Digital pictures and videos posted online: Websites such as YouTube, Netflix, and
others process huge amounts of digital video and data that can run into petabytes.
Who is creating big data? (continued)
Transaction records of online purchases: E-commerce sites such as eBay, Amazon,
Flipkart, and others process thousands of transactions at a time.
Server/application logs: Applications generate log data that grows consistently, and
analyzing this data becomes difficult.
CDR (call data records): Roaming data and cell phone GPS signals to name a few.
Science, genomics, biogeochemical, biological, and other complex and/or
interdisciplinary scientific research.
Understanding big data
Big data is a term that refers to the challenges we face due to the exponential growth of data,
expressed in terms of the V problems. The challenges can be subdivided into the following phases:
• Capture
• Storage
• Search
• Sharing
• Analytics
• Visualization
Strategy to solve Big Data Problems
Big data systems also refer to technologies that can process and analyze data exhibiting the
volume, velocity, and variety problems discussed earlier. Technologies that solve big data
problems should use the following architectural strategies (a small map-shuffle-reduce sketch
follows the list):
• Distributed computing system
• Massively parallel processing (MPP)
• NoSQL (Not only SQL)
• Analytical database
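To connect the distributed computing and MPP strategies to something concrete, here is a toy single-machine simulation of the map-shuffle-reduce flow (a word count). It is only a sketch of the idea, not a real distributed framework, and the input lines are made up.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit (word, 1) pairs for a line, as a mapper on one node would."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between nodes."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: aggregate each key's values (here, a word count)."""
    return {key: sum(values) for key, values in grouped.items()}

lines = ["big data systems scale out", "big data systems process data in parallel"]
pairs = [pair for line in lines for pair in map_phase(line)]
print(reduce_phase(shuffle(pairs)))
```

In a real cluster, the map and reduce functions run on many nodes in parallel and the shuffle moves data between them; the flow of the data, however, is the same.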
Why NoSQL
A NoSQL database is a widely adopted technology because of its schema-less
design and because it can scale both vertically and horizontally fairly
simply and with much less effort. SQL and RDBMS have ruled for more
than three decades and perform well within the limits of their
processing environment; beyond that, RDBMS performance degrades,
cost increases, and manageability decreases. NoSQL provides an edge
over RDBMS in these scenarios.
Types of NoSQL databases
As NoSQL databases are non-relational, they have different sets of
possible architectures and designs. There are four general types of
NoSQL databases, based on how the data is stored, plus analytical
databases, covered here as a fifth category:
1. Key-value store
2. Column store
3. Document database
4. Graph database
5. Analytical database
1. Key-value store
• These databases are designed for storing data as key-value pairs.
Some popular key-value databases are DynamoDB, Azure Table
Storage (ATS), Riak, and BerkeleyDB.
• Information is stored as matched pairs with only two columns
permitted: the key (a hashed key) and the value.
• The values can be simple text or complex data types such as sets of
data.
• Data must be retrieved via an exact match on the key.
• The advantage of this type of NoSQL database is that new types of
data can easily be added as new key-value pairs (a minimal sketch follows).
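A minimal in-memory sketch of the key-value idea, not the API of DynamoDB, Riak, or any real product: two "columns" (key and value), exact-match lookup by key, and values that can be simple or complex. The keys and values shown are invented.

```python
class KeyValueStore:
    """A toy key-value store: two 'columns' only, the key and the value."""

    def __init__(self):
        self._data = {}          # a Python dict is itself a hash-keyed store

    def put(self, key, value):
        self._data[key] = value  # value can be simple text or a complex structure

    def get(self, key):
        # Retrieval works only by exact match on the key.
        return self._data.get(key)

store = KeyValueStore()
store.put("user:42", {"name": "Alice", "interests": ["hadoop", "nosql"]})
store.put("page:home:hits", 1024)
print(store.get("user:42"))
```

New kinds of data need no schema change: they are simply added as new key-value pairs.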
2. Column store
These databases are designed for storing data as groups
of column families. Read/write operations are done on
columns rather than rows.
They deliver high performance on aggregation queries
such as SUM, COUNT, AVG, and MIN, because the data is readily
available in a column (see the sketch below).
Column-based NoSQL databases are widely used to
manage data warehouses, business intelligence, CRM,
and library card catalogs.
Some popular column store databases are HBase,
BigTable, Cassandra, Vertica, and Hypertable.
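The following toy comparison illustrates why a column layout helps aggregation: the same records are shown row-oriented and column-oriented, and the SUM/AVG touch only the amount column. The table and values are illustrative, not taken from any real column store.

```python
# Row-oriented layout: one record per row.
rows = [
    {"order_id": 1, "region": "east", "amount": 120.0},
    {"order_id": 2, "region": "west", "amount": 75.5},
    {"order_id": 3, "region": "east", "amount": 230.0},
]

# Column-oriented layout: each column's values are stored together.
columns = {
    "order_id": [1, 2, 3],
    "region":   ["east", "west", "east"],
    "amount":   [120.0, 75.5, 230.0],
}

# Aggregations such as SUM and AVG read a single contiguous column,
# instead of scanning every field of every row.
amounts = columns["amount"]
print("SUM:", sum(amounts), "AVG:", sum(amounts) / len(amounts))
```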
3. Document database
These databases are designed for storing, retrieving, and managing document-oriented
information. A document database expands on the idea of a key-value store: the values, or
documents, are stored with some structure and encoded in formats such as XML.
Each key is paired with a complex data structure known as a document.
A document store is designed for storing, retrieving, and managing document-oriented
information, also known as semi-structured data.
Document-oriented databases are inherently a subclass of the key-value store.
Documents can contain many different key-value pairs, key-array pairs, or even
nested documents (a small sketch follows).
Some popular document databases are MongoDB and CouchDB.
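A toy sketch of the document idea: each key maps to a nested document, and fields inside the document can be queried. The documents and the find helper are invented for illustration; real document databases such as MongoDB have their own query languages.

```python
# A toy document store: key -> document (nested key-value / key-array pairs).
documents = {
    "doc1": {"type": "post", "author": "alice",
             "tags": ["hadoop", "nosql"], "stats": {"likes": 12}},
    "doc2": {"type": "comment", "author": "bob", "text": "Nice write-up"},
}

def find(store, field, value):
    """Return documents whose top-level field matches a value."""
    return [doc for doc in store.values() if doc.get(field) == value]

print(find(documents, "author", "alice"))
# Unlike a plain key-value store, the structure inside the value is queryable.
print(documents["doc1"]["stats"]["likes"])
```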
4. Graph database
These databases are designed for data whose relationships are well represented as a tree or a
graph, with interconnected elements, usually nodes and edges. Relational databases are not well
suited to graph-based queries, as these require many complex joins.
Graph databases are used to store information about networks of data, such as social
connections.
A graph database uses graph structures for semantic queries, with nodes, edges, and
properties to represent and store data.
A key concept of the system is the graph (or edge, or relationship), which directly relates
data items in the store (a small traversal sketch follows).
Some popular graph databases are Neo4j and Polyglot.
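A small sketch of nodes, edges, and properties, with a two-hop traversal (friends of friends) that a relational database would need self-joins to answer. The people and relationships are made up, and a real graph database such as Neo4j would use a query language like Cypher rather than Python.

```python
# Nodes with properties, and edges ("knows" relationships) between them.
nodes = {"alice": {"city": "Oslo"}, "bob": {"city": "Pune"}, "carol": {"city": "Lima"}}
edges = {"alice": ["bob"], "bob": ["carol"], "carol": []}

def friends_of_friends(person):
    """Two-hop traversal: neighbours' neighbours, excluding the start node."""
    result = set()
    for friend in edges.get(person, []):
        for fof in edges.get(friend, []):
            if fof != person:
                result.add(fof)
    return result

# In an RDBMS this needs self-joins on a relationships table;
# here it is a direct walk over the edge lists.
print(friends_of_friends("alice"))   # {'carol'}
```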
5. Analytical database
An analytical database is a type of database built to store,
manage, and consume big data. Analytical databases are vendor-managed
DBMSs optimized for advanced analytics involving highly complex
queries on terabytes of data, complex statistical processing, data
mining, and NLP (natural language processing).
Examples of analytical databases are Vertica (acquired by HP),
Aster Data (acquired by Teradata), Greenplum (acquired by EMC),
and so on.
Big data use case patterns
There are many technological scenarios, and some of them are
similar in pattern. It is a good idea to map scenarios to
architectural patterns.
Once these patterns are understood, they become the fundamental
building blocks of solutions. We will discuss five types of patterns.
Big data use case patterns
1. Big data as a storage pattern
Big data systems can be used as a storage pattern or as a data warehouse, where
data from multiple sources, even of different types, can be stored and
utilized later.
2. Big data as a data transformation pattern
Big data systems can be designed to perform transformation as part of data loading
and cleansing, and many transformations can be done faster than in traditional
systems because of parallelism. Transformation is the middle phase of the Extract-Transform-Load
(ETL) flow for data ingestion and cleansing (a small sketch of the transform step follows).
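As a hypothetical illustration of the transform step, the sketch below cleanses a couple of raw extracted records (trimming text, normalising case, coercing types). The field names and cleansing rules are assumptions, not a prescribed ETL recipe; in a big data system each record would be transformed in parallel.

```python
raw_records = [
    {"name": " Alice ", "amount": "120.50", "country": "no"},
    {"name": "BOB", "amount": "n/a", "country": "IN"},
]

def transform(record):
    """Cleanse one extracted record: trim text, normalise case, coerce types."""
    try:
        amount = float(record["amount"])
    except ValueError:
        amount = 0.0  # simple cleansing rule for bad values
    return {
        "name": record["name"].strip().title(),
        "amount": amount,
        "country": record["country"].upper(),
    }

cleaned = [transform(r) for r in raw_records]   # this step parallelises per record
print(cleaned)
```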
Big data use case patterns
3. Big data for a data analysis pattern
Data analytics is of wide interest in big data systems, where a huge amount of data can be
analyzed to generate statistical reports and insights about the data, which can be useful in
business and in understanding patterns.
4. Big data for a real-time data pattern
Big data systems integrating with streaming libraries and systems are capable of handling
large-scale real-time data processing. Real-time processing for large and complex requirements
poses many challenges, such as performance, scalability, availability, resource management,
and low latency. Streaming technologies such as Storm and Spark Streaming can be
integrated with YARN.
5. Big data for a low-latency caching pattern
Big data systems can be tuned as a special case for a low-latency workload where reads are much
more frequent than updates: frequently read data can be kept in memory, which fetches it faster,
further improves performance, and avoids overheads (a minimal cache sketch follows).
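For the low-latency caching pattern, here is a minimal read-through cache sketch for a read-heavy, rarely-updated workload. The "slow store" is simulated with a sleep, and the keys and values are invented; it only illustrates why keeping hot data in memory avoids repeated trips to a slower backend.

```python
import time

slow_store = {"product:1": {"name": "widget", "price": 9.99}}   # stands in for a slow backend

cache = {}

def get(key):
    """Read-through cache: serve hot keys from memory, fall back to the slow store."""
    if key in cache:
        return cache[key]            # fast path: in-memory read
    time.sleep(0.1)                  # simulate the slower backing lookup
    value = slow_store.get(key)
    cache[key] = value               # keep it in memory for later reads
    return value

print(get("product:1"))   # slow first read
print(get("product:1"))   # served from memory
```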