Unit 1

Uploaded by

nosopa5904

Big Data Analytics(BDA)

GTU #3170722

Unit-1

Introduction to Big Data


 Outline
• Introduction to Big Data
• Classification of Digital Data
• Big Data Characteristics
• Evolution of Big Data
• Definition of Big Data
• Challenges of Conventional System
• Intelligent Data Analysis
• Traditional vs. Big Data business Approach
• Introduction to Big Data Analytics.
• Case Study or Examples
Introduction
 First, we need to know what data is.
 Data are the quantities, characters, or symbols on which operations are performed by a computer, which
may be stored and transmitted in the form of electrical signals and recorded on magnetic,
optical, or mechanical recording media.
[Figure: where data comes from; types of data]
Computer Data as Information
 Computer data is information processed or stored by a computer.
 This information may be in the form of text documents, images, audio clips, software programs,
or other types of data.
 Computer data may be processed by the computer's CPU and is stored in files and folders on
the computer's hard disk.
Definition – Big Data
 Big Data is a massive collection of data that continues to grow dramatically over time.
 It is a data set so huge and complicated that typical data management technologies cannot store or process it effectively.
 Big Data is like regular data, only much larger.
 Normally we work with data measured in megabytes (Word documents, Excel sheets) or at most gigabytes (movies, code), but data measured in petabytes, i.e. 10^15 bytes, is called Big Data.
 It is often stated that almost 90% of today's data has been generated in the past 3 years.
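To put these sizes in perspective, here is a small sketch of the arithmetic. It assumes decimal (SI) units, where 1 PB = 10^15 bytes as stated above; the helper names and the 5 MB document size are illustrative assumptions, not figures from the slides.

```python
# Decimal (SI) storage units, expressed in bytes.
UNITS = {"KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12, "PB": 10**15}


def to_bytes(value, unit):
    """Convert a size such as (2, 'GB') into a number of bytes."""
    return value * UNITS[unit]


def how_many_fit(item_size, item_unit, total_size, total_unit):
    """Return how many items of one size fit into a total capacity."""
    return to_bytes(total_size, total_unit) // to_bytes(item_size, item_unit)


if __name__ == "__main__":
    # Hypothetical sizes: a 5 MB document and a 2 GB movie.
    print("Bytes in 1 PB:", to_bytes(1, "PB"))
    print("5 MB documents per PB:", how_many_fit(5, "MB", 1, "PB"))
    print("2 GB movies per PB:", how_many_fit(2, "GB", 1, "PB"))
```

The point of the comparison is the gap in scale: a single petabyte holds hundreds of millions of office-sized documents.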
Sources of Big Data
 Posts, photos, videos, likes and comments on social media
 Traffic data and GPS signals
 Emails, blogs and e-news
 Software logs, cameras and microphones
 Large volumes of data from weather stations and satellites, stored and processed for forecasting
 Digital pictures and videos
Classification of Digital Data
1. Unstructured
2. Semi-structured
3. Structured
Unstructured
 Any data whose form or structure is unknown is classified as unstructured data.
 In addition to its huge size, unstructured data poses multiple challenges in terms of
processing it to derive value from it.
 A typical example of unstructured data is a heterogeneous data source containing a
combination of simple text files, images and videos, such as the results of a Google search.
 Nowadays organizations have a wealth of data available to them, but unfortunately they do not
know how to derive value from it because the data is in its raw, unstructured form.
[Figure: human-generated data and machine-generated data]
Unstructured - Example
 The output returned by 'Google Search'
Structured
 Any data that can be stored, accessed and processed in a fixed format is termed
"structured" data.
 Over time, computer science has developed mature techniques for working with this kind of
data (where the format is well known in advance) and for deriving value from it.
 Challenges arise when the size of such data grows to a huge extent; typical sizes are now in
the range of multiple zettabytes.
 Data stored in a relational database management system is one example of structured data.
Structured - Example
 Employee_Table
Employee_ID  Employee_Name  Gender  Department  Salary_In_lacs
1            XYX            MALE    FINANCE     850000
2            ABC            MALE    ADMIN       250000
3            PQR            FEMALE  SALES       350000
4            MNR            FEMALE  FINANCE     600000
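As a minimal sketch of what "fixed format" means in practice, the Employee_Table above can be loaded into a relational database whose schema is declared before any row is stored. The example uses Python's built-in sqlite3 module with an in-memory database; the choice of SQLite and the column types are assumptions made for illustration.

```python
import sqlite3

# In-memory database; the schema is fixed before any row is stored.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE Employee_Table (
        Employee_ID    INTEGER PRIMARY KEY,
        Employee_Name  TEXT    NOT NULL,
        Gender         TEXT    NOT NULL,
        Department     TEXT    NOT NULL,
        Salary_In_lacs INTEGER NOT NULL
    )
    """
)

# The four rows from the slide.
rows = [
    (1, "XYX", "MALE", "FINANCE", 850000),
    (2, "ABC", "MALE", "ADMIN", 250000),
    (3, "PQR", "FEMALE", "SALES", 350000),
    (4, "MNR", "FEMALE", "FINANCE", 600000),
]
conn.executemany("INSERT INTO Employee_Table VALUES (?, ?, ?, ?, ?)", rows)


def total_salary_by_department(connection):
    """Because the format is known in advance, a declarative query suffices."""
    cursor = connection.execute(
        "SELECT Department, SUM(Salary_In_lacs) "
        "FROM Employee_Table GROUP BY Department ORDER BY Department"
    )
    return cursor.fetchall()


if __name__ == "__main__":
    for department, total in total_salary_by_department(conn):
        print(department, total)
```

Every row must match the declared columns, which is what makes structured data easy to query and aggregate.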
Semi-structured
 Semi-structured data is the third type of big data.
 Semi-structured data pertains to data containing both of the forms mentioned above, that is,
structured and unstructured data.
 To be precise, it refers to data that has not been classified under a particular
repository (database), yet contains vital information or tags that segregate individual elements
within the data.
 Web application data, which is unstructured, consists of log files, transaction history files, etc.
 Online transaction processing systems are built to work with structured data, wherein data is
stored in relations (tables).
Semi-structured - Example
 Semi-structured data looks structured in form, but it is not actually defined by, for example,
a table definition in a relational DBMS.
 Personal data stored in an XML file:
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
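The tags in the records above are what make the data semi-structured: a program can pick out individual elements without any table definition. The following is a minimal sketch using Python's standard xml.etree.ElementTree module. The wrapping <people> root element is a hypothetical addition, needed only because a well-formed XML document must have exactly one root.

```python
import xml.etree.ElementTree as ET

# The three <rec> elements from the slide, wrapped in a hypothetical
# <people> root element so that the text is a well-formed XML document.
XML_TEXT = """
<people>
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
</people>
"""


def parse_records(xml_text):
    """Turn each <rec> element into a dict keyed by its child tag names."""
    root = ET.fromstring(xml_text.strip())
    return [{child.tag: child.text for child in rec} for rec in root.findall("rec")]


if __name__ == "__main__":
    for record in parse_records(XML_TEXT):
        print(record["name"], record["sex"], record["age"])
```

Note that nothing forces every record to carry the same tags; a record with an extra or missing element would still parse, which is exactly the flexibility a relational table does not offer.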
Difference
Factors: Structured data / Semi-structured data / Unstructured data
 Flexibility
 Structured data: It is dependent on a schema and less flexible.
 Semi-structured data: It is more flexible than structured data but less flexible than unstructured data.
 Unstructured data: It is flexible in nature and there is an absence of a schema.
 Transaction Management
 Structured data: Matured transaction management and various concurrency techniques.
 Semi-structured data: Transaction management is adapted from the DBMS and is not matured.
 Unstructured data: No transaction management and no concurrency.
 Query performance
 Structured data: Structured queries allow complex joins.
 Semi-structured data: Queries over anonymous nodes are possible.
 Unstructured data: Only textual queries are possible.
 Technology
 Structured data: It is based on the relational database table.
 Semi-structured data: It is based on RDF and XML.
 Unstructured data: It is based on character and binary data.
Big Data Characteristics
 Volume refers to the amount of data, which is growing at a high rate, i.e. data volume in petabytes.
 Value refers to turning data into value. By turning accessed big data into value, businesses may generate revenue.
 Veracity refers to the uncertainty of available data. Veracity arises due to the high volume of data that brings incompleteness and inconsistency.
 Visualization is the process of displaying data in charts, graphs, maps, and other visual forms.
 Variety refers to the different data types, i.e. various data formats like text, audio, video, etc.
 Velocity is the rate at which data grows. Social media plays a major role in the velocity of growing data.
 Virality describes how quickly information spreads across people-to-people (P2P) networks.
Volume
[ Data at Rest ]
 As the name suggests, big data refers to enormous amounts of information.
 We are talking about not gigabytes but terabytes and petabytes of data.
 The IoT (Internet of Things) is creating exponential growth in data.
 The volume of data is projected to change significantly in the coming years.
 Hence, volume is one characteristic that needs to be considered when dealing with Big Data.
• Terabytes, Petabytes
• Records/Arch
• Table/Files
• Distributed
Variety
[ Data in many Forms ]
 Variety refers to heterogeneous sources and the nature of data, both structured and unstructured.
 Data comes in different formats, from structured, numeric data in traditional databases to unstructured text documents, emails, videos, audio, stock ticker data and financial transactions.
 This variety of unstructured data poses certain issues for storing, mining and analysing data.
 Organizing the data in a meaningful way is no simple task, especially when the data itself changes rapidly.
 The challenge of Big Data processing goes beyond the massive volumes and increasing velocities of data; it also lies in manipulating the enormous variety of these data.
• Structured
• Unstructured
• Text
• Multimedia
Veracity
[ Data in Doubt ]
 Veracity describes whether the data can be trusted.
 Veracity refers to the uncertainty of available data.
 Veracity issues arise because the high volume of data brings incompleteness and inconsistency.
 Hygiene of data in analytics is important because otherwise you cannot guarantee the accuracy of your results.
 Because data comes from so many different sources, it is difficult to link, match, cleanse and transform data across systems.
 Analysis is useless if the data being analysed are inaccurate or incomplete.
 Veracity is all about making sure the data is accurate, which requires processes to keep bad data from accumulating in your systems.
• Trustworthiness
• Authenticity
• Accurate
• Availability
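One way to picture a process that keeps bad data from accumulating is a validation step that every incoming record must pass before it is stored. The sketch below is illustrative only: the field names (id, age), the 0 to 120 age range and the function name are assumptions, not rules taken from the slides.

```python
def split_clean_and_rejected(records, required=("id", "age")):
    """Separate trustworthy records from incomplete, duplicate or implausible ones."""
    clean, rejected = [], []
    seen_ids = set()
    for rec in records:
        # Incompleteness: a required field is absent or empty.
        missing = [field for field in required if rec.get(field) in (None, "")]
        if missing:
            rejected.append((rec, "missing: " + ", ".join(missing)))
        # Inconsistency: the same id arrives more than once.
        elif rec["id"] in seen_ids:
            rejected.append((rec, "duplicate id"))
        # Inaccuracy: the value is outside a plausible range (assumed 0 to 120).
        elif not 0 <= rec["age"] <= 120:
            rejected.append((rec, "age out of range"))
        else:
            seen_ids.add(rec["id"])
            clean.append(rec)
    return clean, rejected


if __name__ == "__main__":
    incoming = [
        {"id": 1, "age": 35},
        {"id": 2, "age": None},
        {"id": 1, "age": 35},
        {"id": 3, "age": 412},
        {"id": 4, "age": 29},
    ]
    clean, rejected = split_clean_and_rejected(incoming)
    print("clean:", clean)
    for rec, reason in rejected:
        print("rejected:", rec, "because", reason)
```

Keeping the rejected records together with a reason, rather than silently dropping them, makes it possible to trace quality problems back to their source.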
Velocity
[ Data in Motion ]
 Velocity is the speed at which data grows, is processed and becomes accessible.
 Data flows in from sources like business processes, application logs, networks, social media sites, sensors, mobile devices, etc.
 The flow of data is massive and continuous.
 Although most data are warehoused before analysis, there is an increasing need for real-time processing of these enormous volumes.
 Real-time processing reduces storage requirements while providing more responsive, accurate and profitable responses.
 Data should be processed quickly, in batches or in a stream-like manner, because it keeps growing every year.
• Streaming
• Batch
• Real / Near Time
• Processes
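The difference between batch and stream-like processing can be sketched with a simple average over sensor readings. This is an illustrative single-machine example, not a real streaming framework; the function names and the sample readings are assumptions.

```python
def batch_average(readings):
    """Batch style: store every reading first, then compute once at the end."""
    stored = list(readings)
    return sum(stored) / len(stored)


def streaming_averages(readings):
    """Stream style: update a running mean per reading, keeping only two numbers."""
    count = 0
    mean = 0.0
    for value in readings:
        count += 1
        mean += (value - mean) / count  # incremental mean update
        yield mean


if __name__ == "__main__":
    sensor_readings = [10, 20, 30, 40]  # hypothetical values
    print("batch result:", batch_average(sensor_readings))
    for running in streaming_averages(sensor_readings):
        print("running average so far:", running)
```

The streaming version gives an up-to-date answer after every reading and never holds the full history in memory, which is the sense in which real-time processing reduces storage requirements.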
Value
[ Data into Money ]
 Value refers to turning data into value. By turning accessed big data into value, businesses may generate revenue.
 Value is the end game. After addressing volume, velocity, variety, variability, veracity, and visualization, which takes a lot of time, effort and resources, you want to be sure your organization is getting value from the data.
 For example, data that can be used to analyze consumer behavior is valuable for your company because you can use the research results to make individualized offers.
• Statistical
• Events
• Correlations
Visualization
[ Data Readable ]
 Big data visualization is the process of displaying data in charts, graphs, maps, and other visual forms.
 It is used to help people easily understand and interpret their data at a glance, and to clearly show trends and patterns that arise from this data.
 Raw data comes in different formats, so creating data visualizations is a process of gathering, managing, and transforming data into a format that is most usable and meaningful.
 Big Data visualization makes your data as accessible as possible to everyone within your organization, whether they have technical data skills or not.
• Readable
• Accessible
• Presentation
• Visual Forms
Virality
[ Data Spread ]
 Virality describes how quickly information spreads across people-to-people (P2P) networks.
 It measures how quickly data is spread and shared to each unique node.
 Time is a determining factor, along with the rate of spread.
• P2P
• Shared
• Rate of Spread
Evolution of Big Data
 1940s to 1989: Data Warehousing and Personal Desktop Computers
 1989 to 1999: Emergence of the World Wide Web
 2000s to 2010s: Controlling Data Volume, Social Media and Cloud Computing
 2010s to now: Optimization Techniques, Mobile Devices and IoT
Definition of Big Data
 Big Data is a collection of data that is huge in volume and still growing exponentially with time. Its
size and complexity are so great that no traditional data management tool can
store or process it efficiently. Big data is still data, but of huge size.
Challenges of Conventional System
 There are three main challenges of conventional systems, which are as follows:
1. Volume of Data
2. Processing and Analyzing
3. Management of Data
Volume of Data
 The volume of data is increasing day by day, especially data generated by machines,
telecommunication services, airline services, sensors, etc.
 The rapid growth in data every year comes with new, emerging sources of data.
 According to a survey, the volume of data is growing so rapidly that IBM expected around
35 zettabytes of data to be stored in the world by 2020.
Processing & Analyzing
 Processing such a large volume of data is a major challenge and is very difficult.
 Organizations make use of such large volumes of data by analyzing them in order to achieve their
business goals.
 Extracting insights from such a large amount of data is time consuming and takes a lot of
effort.
 Processing and analyzing the data is also costly, since the data is complex and comes in
different formats.
Management of Data
 As the data gathered comes in different formats (structured, semi-structured and unstructured), it
is very challenging to manage such a variety of data.
Intelligent Data Analysis
 Intelligent Data Analysis (IDA) is one of the major topics in the field of artificial intelligence and
information.
 Intelligent data analysis reveals implicit, previously unknown and potentially valuable
information or knowledge from large amounts of data.
 It also helps in making decisions.
 It covers all areas of data visualization, data pre-processing (fusion, editing, transformation, filtering,
sampling), data engineering, database mining techniques, tools and applications, use of
domain knowledge in data analysis, big data applications, evolutionary algorithms, etc.
 It includes three major steps:
1. Data Preparation
2. Rules finding or data mining
3. Result validation and explanation
Intelligent Data Analysis – Cont.
 Data Preparation:
 It includes extracting or collecting relevant data from the source and then creating a data set.
 Rules finding or Data mining:
 It is working out the rules contained in the data set by means of certain methods or algorithms.
 Result Validation and Explanation:
 Result validation means examining these rules.
 Result explanation means giving an intuitive, reasonable and understandable description using logical
reasoning.
 Because IDA aims to extract useful knowledge, the process demands a combination of extraction, analysis,
conversion, classification, organization, reasoning, and so on.
 We can apply machine learning and deep learning concepts for IDA.
 It helps in many areas:
 Banking & Securities, Communications, Media, & Entertainment
 Healthcare Providers
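The three IDA steps can be sketched end to end on a toy data set. This is a hedged illustration, not a prescribed IDA method: the slides only say "certain methods or algorithms", so a simple one-feature threshold rule is swapped in for the rule-finding step, and all data values and function names are hypothetical.

```python
def prepare(raw_rows):
    """Step 1, data preparation: keep complete rows as (value, label) pairs."""
    return [
        (float(value), bool(label))
        for value, label in raw_rows
        if value is not None and label is not None
    ]


def find_rule(dataset):
    """Step 2, rule finding: pick the threshold t for which the rule
    'value >= t implies label is True' makes the fewest errors."""
    best_threshold, best_errors = None, None
    for threshold in sorted({value for value, _ in dataset}):
        errors = sum((value >= threshold) != label for value, label in dataset)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def validate(threshold, dataset):
    """Step 3a, result validation: accuracy of the rule on unseen data."""
    correct = sum((value >= threshold) == label for value, label in dataset)
    return correct / len(dataset)


def explain(threshold, accuracy):
    """Step 3b, result explanation: state the rule in plain language."""
    return (
        f"Rows with a value of at least {threshold} are labelled True; "
        f"the rule was right on {accuracy:.0%} of the held-out rows."
    )


if __name__ == "__main__":
    # Hypothetical raw rows: (measurement, label), some incomplete.
    raw_training = [(120, 0), (150, 0), (None, 1), (300, 1), (280, 1), (90, 0), (310, None)]
    raw_holdout = [(100, 0), (290, 1), (260, 1), (320, 1)]

    training = prepare(raw_training)
    rule = find_rule(training)
    accuracy = validate(rule, prepare(raw_holdout))
    print(explain(rule, accuracy))
```

Validating on rows that were not used to find the rule is what gives the examination step its meaning; a rule checked only against its own training data would always look better than it is.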
Importance of Big Data
 Big data refers to complex or massive data sets that are impractical to manage using traditional
database systems and software tools.
 Big data is utilized by organizations in one way or another. It is the technology that makes it
possible to realize big data's value.
 It is the voluminous amount of both multi-structured and unstructured data.
Traditional vs. Big Data
 Confidentiality & Data Accuracy
 Data Relationship
 Data Storage Size
 Different types of data
 Flexibility
 Real-time Analytics
 Distributed Architecture
Major Differences between Traditional Data & Big Data
 Source
 Traditional data: It is generated at the enterprise level.
 Big data: It is generated both outside and at the enterprise level.
 Volume
 Traditional data: Its volume ranges from gigabytes to terabytes.
 Big data: Its volume ranges from petabytes to exabytes or zettabytes.
 Data types
 Traditional data: Traditional database systems deal with structured data.
 Big data: Big data systems deal with structured, semi-structured and unstructured data.
 Frequency of generation
 Traditional data: It is generated per hour, per day or less often.
 Big data: It is generated much more frequently, mainly per second.
 Architecture
 Traditional data: The data source is centralized and it is managed in centralized form.
 Big data: The data source is distributed and it is managed in distributed form.
 Data integration
 Traditional data: Data integration is very easy.
 Big data: Data integration is very difficult.
 System configuration
 Traditional data: A normal system configuration is capable of processing traditional data.
 Big data: A high system configuration is required to process big data.
 Size
 Traditional data: The size of the data is very small.
 Big data: The size is larger than the traditional data size.
 Database tools
 Traditional data: Traditional database tools are sufficient to perform any database operation.
 Big data: Special kinds of database tools are required to perform any database operation.
 Functions
 Traditional data: Normal functions can manipulate the data.
 Big data: Special kinds of functions are needed to manipulate the data.
 Data model
 Traditional data: Its data model is strict-schema based and it is static.
 Big data: Its data model is flat-schema based and it is dynamic.
 Relationships
 Traditional data: The data is stable and its inter-relationships are known.
 Big data: The data is not stable and its relationships are unknown.
 Manageability of volume
 Traditional data: It is in a manageable volume.
 Big data: It is in a huge volume, which becomes unmanageable.
 Ease of manipulation
 Traditional data: It is easy to manage and manipulate the data.
 Big data: It is difficult to manage and manipulate the data.
 Data sources
 Traditional data: Its data sources include ERP transaction data, CRM transaction data, financial data, organizational data, web transaction data, etc.
 Big data: Its data sources include social media, device data, sensor data, video, images, audio, etc.
Case Study of Big Data Solution
 Undoubtedly, Big Data has become a major game changer in most cutting-edge
industries over the last few years.
 As Big Data keeps growing day by day, the number of organizations adopting
Big Data keeps expanding.
 Let's discuss an example:
 An e-commerce site XYZ (with 100 million users) wants to offer a $100 gift voucher to its top 10
customers who spent the most in the previous year.
 Moreover, it wants to find the buying trends of these customers so that the company can suggest more items
related to them.
 Issue: A huge amount of unstructured data needs to be stored, processed and analyzed.
 Solution:
 Storage: To store this huge amount of data, Hadoop uses HDFS (Hadoop Distributed File System), which uses commodity
hardware to form clusters and store data in a distributed fashion. It works on the write once, read many times principle.
 Processing: The MapReduce paradigm is applied to the data distributed over the network to find the required output.
 Analysis: Pig and Hive can be used to analyze the data.
 Cost: Hadoop is open source, so cost is no longer an issue.
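The processing step of the case study can be sketched in the MapReduce style: map each order to a (customer, amount) pair, group the pairs by customer, reduce each group to a total, and then pick the top spenders. This is a plain-Python, single-machine illustration of the paradigm and does not use the Hadoop API; the record layout, the chunking and the function names are assumptions.

```python
from collections import defaultdict
import heapq


def map_orders(chunk):
    """Map: emit one (customer_id, amount) pair per order record."""
    for order in chunk:
        yield order["customer"], order["amount"]


def shuffle(pairs):
    """Shuffle: group all emitted amounts by customer id."""
    grouped = defaultdict(list)
    for customer, amount in pairs:
        grouped[customer].append(amount)
    return grouped


def reduce_spend(grouped):
    """Reduce: collapse each customer's amounts into one yearly total."""
    return {customer: sum(amounts) for customer, amounts in grouped.items()}


def top_customers(chunks, n=10):
    """Run map, shuffle and reduce over all chunks, then keep the n biggest spenders."""
    pairs = (pair for chunk in chunks for pair in map_orders(chunk))
    totals = reduce_spend(shuffle(pairs))
    return heapq.nlargest(n, totals.items(), key=lambda item: item[1])


if __name__ == "__main__":
    # Hypothetical order data, split into chunks as if stored on different nodes.
    chunks = [
        [{"customer": "c1", "amount": 50}, {"customer": "c2", "amount": 20}],
        [{"customer": "c1", "amount": 30}, {"customer": "c3", "amount": 70}],
    ]
    for customer, total in top_customers(chunks, n=2):
        print(customer, total)
```

In a real cluster the map calls would run in parallel on the nodes that hold each chunk, and only the much smaller (customer, amount) pairs would travel over the network.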
Where are businesses finding uses for Big Data ?
Walmart
 The biggest retailer in the world and the world's biggest organization by revenue.
 Approx. 2 million workers and 20,000 stores in 28+ nations.
 It started to use Big Data concepts at an early stage.
 It used data mining to find patterns that can be used to give product suggestions to
customers, depending on which products were bought together.
 Based on the data mining results, it has increased its customer conversion rate.
 The main target of Walmart is to retain customers and enhance their experience.
 Hadoop and NoSQL technologies are used to provide access to real-time data
gathered from various sources so that it can be used effectively.
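The idea of suggesting products based on which products were bought together can be sketched by counting how often each pair of products appears in the same basket. This is an illustrative toy, not Walmart's actual method; the baskets and function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of products."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) gives each pair one canonical order and ignores repeats.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


def suggest(product, baskets, top_n=2):
    """Suggest the products most often bought together with the given one."""
    scores = Counter()
    for (first, second), count in pair_counts(baskets).items():
        if first == product:
            scores[second] += count
        elif second == product:
            scores[first] += count
    return [item for item, _ in scores.most_common(top_n)]


if __name__ == "__main__":
    baskets = [
        ["bread", "butter", "milk"],
        ["bread", "butter"],
        ["bread", "jam"],
        ["milk", "eggs"],
    ]
    print("Customers who buy bread also buy:", suggest("bread", baskets, top_n=1))
```

At retail scale the same counting would be distributed across a cluster, but the logic of the suggestion stays the same.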
Uber
 It is the go-to option for people around the globe for moving people and making
deliveries.
 It uses users' personal data to closely monitor which features of the service are
used most.
 It analyzes usage patterns to figure out where the services should be more focused.
 It focuses on the supply and demand of the services, because of which the prices of the
services provided change.
 One use of this data is surge pricing, which influences the rate of demand.
Netflix
 It is a very popular entertainment company that works in online, on-demand video streaming
for its customers.
 It has been determined to use Big Data to predict exactly what its customers will enjoy
watching.
 Recently, Netflix began positioning itself as a content creator, not simply a distribution medium,
a strategy that is firmly based on data analytics.
 Data points such as what customers watch, how often playback is halted, ratings and so on
feed its recommendation engines.
 It has incorporated Hadoop, Hive and Pig along with other traditional business intelligence tools.
Transportation
 Congestion management and traffic control
Thanks to Big Data analytics, Google Maps can now tell you the least traffic-prone route to any
destination.
 Route planning
Different itineraries can be compared in terms of user needs, fuel consumption, and other
factors to plan for maximum efficiency.
 Traffic safety
Real-time processing and predictive analytics are used to pinpoint accident-prone areas.
More Case Studies of Big Data
 https://www.scnsoft.com/blog/big-data-use-cases-stats-and-examples
 https://www.tableau.com/learn/articles/big-data-examples-use-cases
