Chapter Two
Overview Of Data Science
Habtamu Abune
habtamu.abune@aau.edu.et
Learning outcomes
After completing this lesson you should be able to
❑ Describe what data science is and the role of data scientists.
❑ Differentiate data, information and knowledge.
❑ Describe the data processing life cycle
❑ Understand different data types from diverse perspectives
❑ Describe the data value chain in the emerging era of big data
❑ Explain the basic concepts of Big Data
What is Data Science? Why do we need it?
More data usually beats better algorithms, techniques and tools
An Overview of Data Science
➢ Data science is a multidisciplinary field.
➢ It uses scientific methods, processes, algorithms and systems to extract
knowledge and insights from structured, semi-structured and unstructured data
➢ Data science is much more than simply analyzing data.
➢ It offers a range of roles and requires a range of skills.
➢ As an academic discipline and profession:
➢ Data science continues to evolve as one of the most promising and in-demand
career paths for skilled professionals.
➢ Today, successful data professionals must advance past the traditional skills of
analysing large amounts of data, data mining, data warehousing, programming
and modelling to build and analyze algorithms.
➢ Data scientists need to be curious and result-oriented, with
exceptional industry-specific knowledge and communication skills
Need for data science
● Data science is the ability
○ To store large amounts of data
○ To understand, process and extract value from it
○ To visualize and communicate it for decision making and problem
solving in such a dynamic world
● Data Science supports “Business Intelligence”
○ For smart decision-making and problem solving
○ For predicting potential markets, potential products, and
potential customers
○ For identifying risks and opportunities and conducting
“what-if” analyses
Components of Data Science
What is data? How is it created? What makes it different from
information and knowledge?
What is data?
❑ Data is a representation of facts, concepts, or instructions in a formalized
manner, which should be suitable for communication, interpretation, or
processing by humans or electronic machines.
➢ No meaning is attached to it, so the same data may have multiple
meanings
● Example: what does “Habtamu” mean?
❑ Data can be described as unprocessed facts and figures
❑ It is represented with the help of characters such as alphabets (A-Z, a-z), digits
(0-9) or special characters (+, -, /, *, <,>, =, etc.).
❑ It can also be defined as groups of non-random symbols in the form of
text, images, and voice representing quantities, actions and objects
What is Information?
❑ Aggregation of data as per the context that makes decision
making easier.
❑ Meaning is attached and contextualized
❑ Answers the questions "who", "what", "where", and "when"
❑ Organized or classified data, which has some meaningful values for the
receiver
❑ Processed data on which decisions and actions are based.
❑ Plain collected data as raw facts cannot help much in decision-making
❑ Interpreted data created from organized, structured, and processed data in
a particular context.
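The step from data to information can be illustrated with a minimal sketch. The records below are hypothetical values invented for illustration; taken alone they are raw facts, and aggregating them by city attaches context that answers a "what" and "where" question.

```python
from collections import defaultdict

# Raw data: unprocessed facts (hypothetical values, for illustration only).
# Each tuple is (city, date, sales amount).
raw_records = [
    ("Addis Ababa", "2024-01-05", 120),
    ("Adama", "2024-01-05", 80),
    ("Addis Ababa", "2024-01-06", 150),
    ("Adama", "2024-01-06", 70),
]


def summarize_by_city(records):
    """Aggregate raw records into total sales per city (information)."""
    totals = defaultdict(int)
    for city, _date, amount in records:
        totals[city] += amount
    return dict(totals)


info = summarize_by_city(raw_records)
for city, total in info.items():
    print(f"Total sales in {city}: {total}")
```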
What is knowledge?
➢ Includes facts about real-world entities and the relationships
between them.
➢ It is an understanding gained through experience
➢ Knowledge is the appropriate collection of information, the intent of which
is usefulness.
➢ Answers the question "how"
What is Wisdom?
● Wisdom embodies an understanding of fundamental principles, insight, and an ethical
and moral code, gained by integrating knowledge
○ Answers the ‘why’ question
● It covers the principles that are essentially the basis for the knowledge being what it
is.
Data→Information→Knowledge→Wisdom
Data Processing Cycle
❑ Data processing is the re-structuring or re-ordering of data by people or
machine to increase their usefulness and add values for a particular
purpose.
❑ The data processing cycle contains three steps: take input, process it,
and generate output.
Input
➢ The input data is prepared in some convenient form for processing
➢ The form will depend on the processing machine
➢ For example, when electronic computers are used, the input data can
be recorded on any one of several types of input media, such
as flash disks and hard disks
Processing
➢ In this step, the input data is changed to produce data in a more
useful form
➢ For example, interest can be calculated on deposit to a bank, or a
summary of sales for a month can be calculated from the sales orders
data
Output
➢ At this stage, the result of the preceding processing step is
collected
➢ The particular form of the output data depends on the use of the
data
➢ For example, output data can be the total sales in a month or the
payroll for employees.
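The three steps can be sketched in a few lines of code. This is a minimal illustration that reuses the two examples from the Processing step (a monthly sales summary and interest on a deposit). The order records, the interest rate, and the use of simple rather than compound interest are assumptions made for the sketch.

```python
# Input: sales orders prepared in a convenient form (hypothetical data).
sales_orders = [
    {"order_id": 1, "month": "2024-01", "amount": 250.0},
    {"order_id": 2, "month": "2024-01", "amount": 100.0},
    {"order_id": 3, "month": "2024-02", "amount": 300.0},
]


def monthly_sales(orders, month):
    """Processing: reduce the order list to one total for the given month."""
    return sum(order["amount"] for order in orders if order["month"] == month)


def simple_interest(deposit, annual_rate, years):
    """Processing: simple interest earned on a bank deposit."""
    return deposit * annual_rate * years


# Output: the results, collected in the form the user needs.
january_total = monthly_sales(sales_orders, "2024-01")
interest = simple_interest(1000.0, 0.05, 2)
print(f"Total sales for 2024-01: {january_total:.2f}")
print(f"Interest on deposit: {interest:.2f}")
```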
Data types and their representation
Data types from Computer programming perspective
➢ A data type is simply an attribute that tells the compiler or interpreter how the
programmer intends to use the data
➢ Common data types include
➢ Integers(int)- to store whole numbers
➢ Booleans(bool)- true or false
➢ Characters(char)- to store a single character
➢ Floating-point numbers(float)- to store real numbers
➢ Alphanumeric strings(string)- to store a combination of characters and
numbers
➢ A data type defines:
➢ The operations that can be done on the data
➢ The meaning of the data and
➢ The way values of that type can be stored
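As a rough illustration in Python, the common types listed above look as follows. The variable names and values are hypothetical. Python has no separate character type, so a one-character string stands in for `char`.

```python
# Each value has a type that determines how it is stored
# and which operations are valid on it.
age = 21                 # int: whole number
is_enrolled = True       # bool: true or false
grade = "A"              # one-character str (Python has no char type)
gpa = 3.75               # float: real number
student_id = "UGR1234"   # str: alphanumeric string (hypothetical ID)

for name, value in [("age", age), ("is_enrolled", is_enrolled),
                    ("grade", grade), ("gpa", gpa),
                    ("student_id", student_id)]:
    print(f"{name}: {value!r} -> {type(value).__name__}")

# The type defines the permitted operations: "+" adds numbers
# but concatenates strings, and mixing the two raises TypeError.
total = age + 1
label = student_id + "-" + grade
try:
    student_id + age
    mixed_ok = True
except TypeError:
    mixed_ok = False
```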
Data types from Data Analytics perspective
● There are three common data types or structures:
I. Structured
II. Semi-structured
III. Unstructured
Structured Data
➢ Adheres to a predefined data model and is therefore straightforward to analyze
➢ It conforms to a tabular format with relationships between the different
rows and columns
➢ Common examples are Excel files or SQL databases.
[Figure: examples of structured data: SQL data and an Excel file]
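A minimal sketch of why a predefined data model makes analysis straightforward, using Python's built-in `sqlite3` module with an in-memory database. The `students` table and its rows are hypothetical.

```python
import sqlite3

# An in-memory SQL database: the schema is fixed before any data is stored.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, dept TEXT, gpa REAL)"
)
conn.executemany(
    "INSERT INTO students (id, name, dept, gpa) VALUES (?, ?, ?, ?)",
    [(1, "Abebe", "CS", 3.6), (2, "Sara", "IS", 3.9), (3, "Kebede", "CS", 3.2)],
)

# Because every row has the same columns, one declarative query summarizes it.
rows = conn.execute(
    "SELECT dept, COUNT(*), ROUND(AVG(gpa), 2) "
    "FROM students GROUP BY dept ORDER BY dept"
).fetchall()
for dept, count, avg_gpa in rows:
    print(dept, count, avg_gpa)
conn.close()
```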
Semi-structured Data
❑ A form of structured data that does not conform with the formal
structure of data models associated with relational databases or other
forms of data tables
❑ It nonetheless contains tags or other markers to separate semantic elements and enforce
hierarchies of records and fields within the data
❑ Therefore, it is also known as a self-describing structure
❑ Common examples are XML, JSON, and sensor data
[Figure: examples of semi-structured data: JSON and XML]
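The sketch below shows the same hypothetical student record in JSON and in XML, parsed with Python's standard `json` and `xml.etree.ElementTree` modules. The keys and tags describe the data themselves, and the second course simply omits a grade, which a fixed table schema would not allow.

```python
import json
import xml.etree.ElementTree as ET

# The same hypothetical record in two semi-structured formats.
json_text = (
    '{"name": "Abebe", "courses": '
    '[{"code": "CS101", "grade": "A"}, {"code": "MA102"}]}'
)
xml_text = """
<student>
  <name>Abebe</name>
  <courses>
    <course code="CS101"><grade>A</grade></course>
    <course code="MA102"/>
  </courses>
</student>
""".strip()

# JSON: keys name the fields; nesting expresses the hierarchy.
record = json.loads(json_text)
json_codes = [course["code"] for course in record["courses"]]
json_grades = [course.get("grade") for course in record["courses"]]

# XML: tags and attributes play the same role.
root = ET.fromstring(xml_text)
xml_codes = [course.get("code") for course in root.iter("course")]
xml_grades = [course.findtext("grade") for course in root.iter("course")]

print(json_codes, json_grades)
print(xml_codes, xml_grades)
```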
Unstructured Data
➢ Data that either does not have a predefined data model or is not
organized in a predefined manner
➢ It is typically text-heavy but may contain data such as dates, numbers,
and facts as well.
➢ This results in irregularities and ambiguities that make it difficult to process or
understand using traditional programs (databases), unlike structured data.
➢ Common examples are audio and video files, NoSQL databases, pictures, PDFs, and Word
documents.
[Figure: examples of unstructured data: PDF files and images]
Metadata
➢ Data about Data
➢ It provides additional information about a specific set of data
➢ It is one of the most important elements for big data analysis and solutions.
➢ It is the last category of data type
For example
➢ Metadata of a photo could describe when and where the photos
were taken
➢ The metadata then provides fields for dates and locations which, by
themselves, can be considered structured data
[Figure: metadata about an image]
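Reading photo metadata normally needs an image library, so this sketch uses file-system metadata instead, which the Python standard library exposes through `os.stat`. The temporary file and its content are assumptions made for the example. The point is that the metadata fields (name, size, modification time) are structured even when the file content is not.

```python
import os
import tempfile
from datetime import datetime, timezone

# Create a small throwaway file so the example is self-contained.
content = b"hello, data science"
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(content)
    path = tmp.name

# Metadata: data about the file, not the file's content.
stat = os.stat(path)
metadata = {
    "file_name": os.path.basename(path),
    "size_bytes": stat.st_size,
    "modified_utc": datetime.fromtimestamp(
        stat.st_mtime, tz=timezone.utc
    ).isoformat(),
}
os.remove(path)  # clean up the temporary file
print(metadata)
```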
Data Value Chain
➢ Describes the information flow within a big data system as a series of
steps needed to generate value and useful insights from data
➢ The Big Data Value Chain identifies the following key high-level activities
1. Data Acquisition
2. Data Analysis
3. Data Curation
4. Data Storage and
5. Data Usage
Data Acquisition
➢ It is the process of gathering, filtering, and cleaning data before it is put
in a data warehouse or any other storage solution on which data analysis
can be carried out.
➢ Data cleaning tasks
➢ Remove redundant data
➢ Fill in missing values
➢ Correct inconsistencies
➢ Data acquisition is one of the major big data challenges in terms of
infrastructure requirements
● The infrastructure required for data acquisition must
➢ Deliver low, predictable latency in both capturing data and in
executing queries.
➢ Be able to handle very high transaction volumes, often in a
distributed environment
➢ Support flexible and dynamic data structures
● To extract value from the data, the data needs to be cleaned to remove
noise.
○ Cleansing of data is important so that incorrect and faulty data can be filtered
out.
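The three cleaning tasks listed for data acquisition can be sketched on a small hypothetical dataset. Keeping the first record per `id`, normalizing city names with `strip().title()`, and filling a missing age with the mean of the known ages are assumptions chosen for illustration; real pipelines pick rules that suit the data.

```python
# Hypothetical raw records with the three typical defects.
raw = [
    {"id": 1, "city": "Addis Ababa", "age": 23},
    {"id": 1, "city": "Addis Ababa", "age": 23},      # redundant duplicate
    {"id": 2, "city": " addis ababa ", "age": None},  # inconsistent, missing age
    {"id": 3, "city": "Adama", "age": 31},
]


def clean(records):
    """Return a cleaned copy of the records; the input list is not modified."""
    # 1. Remove redundant data: keep the first record seen for each id.
    seen, unique = set(), []
    for record in records:
        if record["id"] not in seen:
            seen.add(record["id"])
            unique.append(dict(record))

    # 2. Correct inconsistencies: normalize whitespace and capitalization.
    for record in unique:
        record["city"] = record["city"].strip().title()

    # 3. Fill in missing values: use the mean of the known ages.
    known = [r["age"] for r in unique if r["age"] is not None]
    mean_age = sum(known) / len(known)
    for record in unique:
        if record["age"] is None:
            record["age"] = mean_age
    return unique


cleaned = clean(raw)
for record in cleaned:
    print(record)
```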
Data Analysis
● Data analysis is a process of preparing, exploring and modeling data
with the goal of discovering useful information and knowledge towards
informed decision-making.
○ Big data analysis is a special kind of data analysis with more massive volumes of
data
○ Data analysis has multiple facets and approaches, encompassing diverse
techniques under a variety of names, such as statistics, data mining, business
intelligence and data analytics.
● Related areas include data mining, business intelligence, and machine
learning
● Data mining and data analytics focus on knowledge discovery for
predictive and descriptive purposes.
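The descriptive and predictive purposes mentioned above can be contrasted in a short sketch. The monthly sales figures are hypothetical, and the straight-line least-squares fit is only one simple choice of predictive model, used here as an assumption.

```python
from statistics import mean, median

# Hypothetical monthly sales figures (month number, sales).
months = [1, 2, 3, 4, 5]
sales = [10.0, 12.0, 14.0, 16.0, 18.0]

# Descriptive: summarize what has already happened.
summary = {"mean": mean(sales), "median": median(sales), "max": max(sales)}


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    slope = numerator / denominator
    return slope, y_bar - slope * x_bar


# Predictive: extrapolate the fitted trend to month 6.
slope, intercept = fit_line(months, sales)
forecast = slope * 6 + intercept
print(summary)
print(f"Forecast for month 6: {forecast:.1f}")
```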
Four types of data analytics
Data Curation
➢ It is an active management of data over its life cycle to ensure it meets the
necessary data quality requirements for its effective usage
➢ Data curation processes can be categorized into different
activities such as
➢ content creation, selection, classification, transformation, validation, and
preservation.
➢ Data curation is performed by expert curators that are responsible for
improving the accessibility and quality of data.
➢ Data curators (also known as scientific curators, or data annotators) hold the
responsibility of ensuring that data are trustworthy, discoverable, accessible,
reusable, and fit for purpose
➢ A key trend for the curation of big data utilizes community and crowd
sourcing approaches.
Data Storage
➢ It is the persistence and management of data in a scalable way that
satisfies the needs of applications that require fast access to the data
➢ Relational Database Management Systems (RDBMS) have been the main,
and almost unique, solution to the storage paradigm for nearly 40 years.
➢ Relational databases guarantee database transactions through the ACID
properties (Atomicity, Consistency, Isolation, and Durability), but they lack
flexibility with regard to schema changes, and their performance and fault
tolerance suffer when data volumes and complexity grow, making them
unsuitable for big data scenarios.
➢ NoSQL technologies have been designed with the scalability goal in mind
and present a wide range of solutions based on alternative data models
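To make the transaction guarantee concrete, here is a minimal sketch of atomicity using Python's built-in `sqlite3` module. The `accounts` table, the names, and the balances are hypothetical. The second transfer violates a `CHECK` constraint halfway through, so the whole transaction is rolled back and neither update survives.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(connection, src, dst, amount):
    """Move amount from src to dst; both updates apply or neither does."""
    try:
        # Used as a context manager, the connection commits on success
        # and rolls back if an exception is raised inside the block.
        with connection:
            connection.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            connection.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "alice", "bob", 30)       # valid transfer
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```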
Data Usage
➢ It covers the data-driven business activities that need access to the
curated data, its analysis, and the tools needed to integrate the data
analysis within the business activity
➢ In business decision-making, it can enhance competitiveness through
reduction of costs, increased added value, or any other parameter that
can be measured against existing performance criteria
What Is Big Data?
➢ Big data refers to the large, diverse sets of information that grow at ever-
increasing rates.
➢ The term big data is used for massive scale data that is difficult to store,
manage and process using traditional databases and data processing
architectures.
➢ It is so difficult to process with on-hand database management tools that it requires
special tools.
➢ Big data can be structured (often numeric, easily formatted and stored) or
unstructured (more free-form, less quantifiable).
➢ Nearly every department in a company can utilize findings from big data
analysis, but handling its clutter and noise can pose problems.
● Nowadays systems and services generate huge amounts of data, from terabytes (TB) to
petabytes (PB) and zettabytes (ZB) of information
● Examples:
○ Google (processes 20 PB a day), Facebook (15 TB/day), eBay (50
TB/day), Walmart, Twitter (500M tweets/day), traffic surveillance
cameras, detecting fraud, identity theft...
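To compare figures quoted in different units, a small conversion helper is enough. The sketch assumes decimal (SI) units, where each step is a factor of 1000; binary units (factors of 1024) would give slightly different numbers.

```python
# Decimal (SI) storage units, expressed in bytes.
UNITS = {
    "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12,
    "PB": 10**15, "EB": 10**18, "ZB": 10**21, "YB": 10**24,
}


def convert(value, from_unit, to_unit):
    """Convert a data size between two decimal storage units."""
    return value * UNITS[from_unit] / UNITS[to_unit]


pb_as_tb = convert(20, "PB", "TB")  # 20 PB per day expressed in TB
zb_as_pb = convert(1, "ZB", "PB")   # one zettabyte expressed in PB
print(f"20 PB = {pb_as_tb:,.0f} TB")
print(f"1 ZB = {zb_as_pb:,.0f} PB")
```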
• Some examples of big data are listed as follows:
o Data generated by social networks including text, images, audio and video
data
o Click-stream data generated by web applications such as e-Commerce to
analyze user behavior
o Machine sensor data collected from sensors embedded in industrial and
energy systems for monitoring their health and detecting failures
o Healthcare data collected in electronic health record (EHR) systems
o Logs generated by web applications
o Stock markets data
o Transactional data generated by banking and financial applications
The 4 V’s Characterizing Big Data.
Volume: Massive scale of data
○ Large amounts of data, in zettabytes or yottabytes; massive datasets
Velocity: How fast the data is generated
○ Data generated by certain sources can arrive at very high velocities, for example,
social media data or sensor data.
Variety: Different forms of the data
○ Data comes in many different forms, from diverse sources and in diverse formats
Veracity: How accurate the data is
○ Can we trust the data? How accurate is it? Is there doubt about the data?
Clustered Computing and Hadoop Ecosystem
Clustered Computing
● Because of the qualities of big data, individual computers are often
inadequate for handling the data at most stages.
● To better address the high storage and computational needs of
big data, computer clusters are a better fit.
● Big data clustering software provides a number of benefits:
○ Resource Pooling
○ High Availability
○ Easy Scalability
● Using clusters requires a solution for managing cluster membership,
coordinating resource sharing, and scheduling actual work on individual
nodes
● Cluster membership and resource allocation can be handled by
software like Hadoop’s YARN (which stands for Yet Another Resource
Negotiator).
Hadoop and its Ecosystem
● Hadoop is an open-source framework intended to make interaction with
big data easier.
○ It is a framework that allows for the distributed processing of large datasets
across clusters of computers using simple programming models.
● The four key characteristics of Hadoop are:
○ Economical: Its systems are highly economical as ordinary computers can be
used for data processing.
○ Reliable: It is reliable as it stores copies of the data on different machines and is
resistant to hardware failure.
○ Scalable: It is easily scalable, both horizontally and vertically.
■ Adding a few extra nodes helps in scaling up the framework.
○ Flexible: It is flexible and you can store as much structured and unstructured
data as you need to and decide to use them later.
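The reliability property can be illustrated with a toy simulation of block replication. This is not HDFS's actual placement policy: the round-robin placement, the node names, and the replication factor of 3 are assumptions made for the sketch. It only shows why keeping copies on different machines tolerates the failure of one of them.

```python
def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            nodes[(block + offset) % len(nodes)]
            for offset in range(replication)
        ]
    return placement


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have a replica on a live node."""
    return [
        block for block, holders in placement.items()
        if any(node not in failed_nodes for node in holders)
    ]


nodes = ["node1", "node2", "node3", "node4"]  # hypothetical cluster
placement = place_blocks(6, nodes)
survivors = readable_blocks(placement, failed_nodes={"node2"})
print(f"{len(survivors)} of {len(placement)} blocks readable after one failure")
```

With three replicas per block, losing any single node leaves every block readable; data is lost only when all nodes holding a given block fail together.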
● The Hadoop ecosystem has four core components: data management, data
access, data processing, and data storage.
● It is continuously growing to meet the needs of Big Data.
● It comprises the following components and many others:
○ HDFS: Hadoop Distributed File System
○ YARN: Yet Another Resource Negotiator
○ MapReduce: Programming based Data Processing
○ Spark: In-Memory data processing
○ PIG, HIVE: Query-based processing of data services
○ HBase: NoSQL Database
○ Mahout, Spark MLLib: Machine Learning algorithm libraries
○ Solr, Lucene: Searching and Indexing
○ Zookeeper: Managing cluster
○ Oozie: Job Scheduling
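The MapReduce programming model named in the list above can be sketched in plain Python with the classic word-count example. This runs on one machine and does not use the Hadoop API; the function names and input lines are hypothetical. In a real cluster the map and reduce calls would run in parallel on different nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


lines = ["big data needs big clusters", "data beats algorithms"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```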
Big Data Life Cycle with Hadoop
❑ Ingesting data into the system: The first stage of Big Data processing is
Ingest.
❑ The data is ingested or transferred to Hadoop from various sources such as
relational databases, systems, or local files.
❑ Sqoop transfers data from RDBMS to HDFS, whereas Flume transfers event
data.
❑ Processing the data in storage: The second stage is Processing.
● In this stage, the data is stored and processed.
❑ The data is stored in the distributed file system, HDFS, and the NoSQL
distributed database, HBase. Spark and MapReduce perform data processing.
❑ Computing and analyzing data: The third stage is Analyze.
○ Here, the data is analyzed by processing frameworks such as Pig, Hive, and
Impala.
○ Pig converts the data using map and reduce and then analyzes it.
○ Hive is also based on map and reduce programming and is most suitable for
structured data.
❑ Visualizing the results: The fourth stage is Access, which is performed by
tools such as Hue and Cloudera Search.
○ In this stage, the analyzed data can be accessed and communicated by users.
THANK YOU!