Introduction to Big Data
By: Faizan Irshad
What is Big Data?
Big Data refers to a volume of structured and unstructured
data so large that it is difficult to process using traditional
database and software techniques.
In most enterprise scenarios the volume of data is too big, it
moves too fast, or it exceeds current processing capacity.
The Information Continuum
Types of Data
Quantitative Data
• Measurable
• Collected through measuring things that have a fixed reality
• Close ended
Qualitative Data
• Descriptive
• Collected through observation, field work, focus groups, interviews, recording or filming conversations
• Open ended
The 4 V’s of Big Data
• VOLUME: The amount of data
• VARIETY: The types of data
• VELOCITY: The frequency of data
• VERACITY: The quality of data
Volume: scale of data
90% of today’s data has been created in just the last 2 years
Every day we create 2.5 quintillion bytes of data
Most companies in the US have over 100 terabytes (100,000 gigabytes) of
data stored
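The figures above mix several storage units, so a quick conversion can help put them side by side. The following is a minimal sketch that assumes decimal (SI) units, where 1 terabyte is 1,000 gigabytes, as the slide's own "100 terabytes (100,000 gigabytes)" implies. The helper name `convert` is hypothetical.

```python
# Decimal (SI) storage units, expressed in bytes (an assumption; binary
# units such as GiB/TiB would give slightly different numbers).
GB = 10**9
TB = 10**12
EB = 10**18

def convert(n_bytes, unit):
    """Express a byte count in the given unit (hypothetical helper)."""
    return n_bytes / unit

company_bytes = 100 * TB      # "over 100 terabytes" per company
daily_bytes = 25 * 10**17     # 2.5 quintillion bytes created per day

print(convert(company_bytes, GB))  # 100000.0 gigabytes
print(convert(daily_bytes, EB))    # 2.5 exabytes
```

Under these assumptions, 2.5 quintillion bytes per day is the same quantity as 2.5 exabytes per day.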
Variety: different forms of data
Velocity: analysis of streaming data
Veracity: trustworthiness of data
Origin
Authenticity
Trustworthiness
Completeness
Integrity
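Of the veracity aspects listed above, completeness is the easiest to measure directly. The sketch below is one possible way to score it, assuming records arrive as Python dictionaries and that a missing field shows up as `None` or an empty string. The function name, field names, and sample readings are all made up for illustration.

```python
def completeness(records, required):
    """Fraction of records in which every required field is present and non-empty."""
    if not records:
        return 0.0
    complete = sum(
        all(r.get(field) not in (None, "") for field in required)
        for r in records
    )
    return complete / len(records)

# Hypothetical sensor readings; two of the four have a missing value.
readings = [
    {"sensor": "s1", "temp": 21.5, "source": "station-a"},
    {"sensor": "s2", "temp": None, "source": "station-a"},
    {"sensor": "s3", "temp": 19.0, "source": ""},
    {"sensor": "s4", "temp": 20.1, "source": "station-b"},
]

print(completeness(readings, ["sensor", "temp", "source"]))  # 0.5
```

Origin, authenticity, and integrity usually need external evidence (signatures, audit trails, checksums) and are not covered by this sketch.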
The Structure of Big Data
❖ Structured
• Most traditional data sources
❖ Semi-structured
• Many sources of big data
❖ Unstructured
• Video data, audio data
What is Unstructured Data?
Typical human-generated unstructured data includes:
•Text files: Word processing, spreadsheets, presentations, email, logs.
•Email: Email has some internal structure thanks to its metadata, and we sometimes refer to it
as semi-structured. However, its message field is unstructured and traditional analytics
tools cannot parse it.
•Social Media: Data from Facebook, Twitter, LinkedIn.
•Website: YouTube, Instagram, photo sharing sites.
•Mobile data: Text messages, locations, phone recordings.
•Media: MP3, digital photos, audio and video files.
Typical machine-generated unstructured data includes:
•Satellite imagery: Weather data, land forms, military movements.
•Scientific data: Oil and gas exploration, space exploration, seismic imagery, atmospheric data.
•Digital surveillance: Surveillance photos and video.
•Sensor data: Traffic, weather, oceanographic sensors.
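The Email item above notes that a message is semi-structured: its metadata has fields, while the message body is free text. The sketch below illustrates that split using Python's standard `email` package. The sample message and addresses are invented for the example.

```python
from email import message_from_string

# A made-up message: header lines, a blank line, then the free-text body.
raw = """From: alice@example.com
To: bob@example.com
Subject: Quarterly numbers
Date: Mon, 01 Jan 2024 09:00:00 +0000

Hi Bob, sales looked strong last quarter but returns were up. Can we talk?
"""

msg = message_from_string(raw)

# Structured part: named header fields that can be queried like columns.
headers = {name: msg[name] for name in ("From", "To", "Subject", "Date")}

# Unstructured part: the body is a single string with no fixed fields.
body = msg.get_payload()

print(headers["Subject"])
print(body)
```

The headers can be filtered or sorted directly, whereas extracting meaning from `body` needs text analysis of some kind.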
Why Big Data
• The growth of Big Data is driven by:
– Increase of storage capacities
– Increase of processing power
– Availability of data (different data types)
Big Data sources
• Users
• Application
• Systems
• Sensors
These produce large and growing files (Big data files).
Data generation points: Examples
Mobile Devices
Microphones
Readers/Scanners
Science facilities
Programs/ Software
Social Media
Cameras
Sensing devices
Smartwatches
Smart jewelry
Fitness trackers
Sport watches
Smart glasses
Smart clothing…
Technologies of Big Data
Traditionally, data is stored in relational databases and
extracted periodically from operational databases according to
the needs of the organization. Traditional processing systems
and tool sets fall short when it comes to dealing with big data,
so new processes and technologies are required.
Additional technologies applied to big data include massively
parallel processing (MPP) databases, Hadoop and MapReduce,
data mining, search-based applications, and distributed databases
and file systems.
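To give a feel for the MapReduce programming model named above, here is a minimal single-process sketch of a word count. It is plain Python and does not use Hadoop or any of its APIs. The function names and sample documents are illustrative assumptions, and a real cluster would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

# Made-up input; each string stands in for one input split.
documents = ["big data is big", "data moves fast"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

print(counts)
```

Because each document is mapped independently and each key is reduced independently, both phases can be spread over many workers, which is what makes the model suitable for large data sets.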
RDBMS vs. Hadoop
Big Data Analytics
Benefits of Big Data
•Real-time big data isn’t just a process for storing petabytes or
exabytes of data in a data warehouse. It’s about the ability to
make better decisions and take meaningful actions at the right
time.
•Technologies like Hadoop give you the scale and flexibility to
store data before you know how you are going to process it.
•Technologies such as MapReduce, Hive and Impala enable you
to run queries without changing the data structures underneath.
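The idea of storing data first and deciding how to process it later is often called schema-on-read. The sketch below illustrates it in plain Python, assuming raw events are kept as JSON lines. It does not use Hadoop, Hive, or Impala, and the event fields and the helper `read_with_schema` are hypothetical.

```python
import json

# Raw events stored as-is; records do not all share the same fields.
raw_lines = [
    '{"user": "a", "event": "click", "ms": 120}',
    '{"user": "b", "event": "view"}',
    '{"user": "a", "event": "click", "ms": 80, "page": "/home"}',
]

def read_with_schema(lines, fields):
    """Apply a schema at read time: keep only the requested fields."""
    for line in lines:
        record = json.loads(line)
        yield {field: record.get(field) for field in fields}

# Query 1: clicks per user, using only the "user" and "event" fields.
clicks_by_user = {}
for row in read_with_schema(raw_lines, ["user", "event"]):
    if row["event"] == "click":
        clicks_by_user[row["user"]] = clicks_by_user.get(row["user"], 0) + 1

# Query 2: a different view of the same stored data, with no rewrite needed.
latencies = [
    row["ms"] for row in read_with_schema(raw_lines, ["ms"])
    if row["ms"] is not None
]

print(clicks_by_user)  # {'a': 2}
print(latencies)       # [120, 80]
```

Both queries read the same untouched `raw_lines`; only the fields requested at read time differ.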
Applications of Big Data Analytics
• Smarter Healthcare
• Multi-channel sales
• Homeland Security
• Telecom
• Trading Analytics
• Traffic Control
• Search Quality
• Manufacturing
Hurdles and Risks
• Unstructured data (~75% of data in the healthcare environment)
• Data privacy/security
• Inconsistent, incomplete, unavailable, poor quality or invalid data
• Poor analysis/analytics leading to erroneous correlations/conclusions
Thank You