Course Name: Introduction to Emerging Technologies Course Module (EMTE1011/1012)
Prepared by Fisseha W. (M-TECH)
CHAPTER 2: Data Science
Topics Covered
1. An Overview of Data Science
2. Data and Information
3. Data Types and Representation
4. Data Processing Cycle
5. Data Value Chain (Acquisition, Analysis, Curation, Storage, Usage)
6. Basic Concepts of Big Data
2.1) An Overview of Data Science
To recall from the first chapter:
➢ Can you describe the role of data in emerging technology?
➢ What are data and information?
➢ What is big data?
Cont’d
What is data science?
 Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured, semi-structured and unstructured data.
 It offers a range of roles and requires a range of skills.
 It is much more than simply analyzing data.
Cont’d
Today, successful data professionals understand that they must advance past the traditional skills of:
 Analysing large amounts of data,
 Data mining, and
 Programming.
    Cont’d
       In order to uncover useful intelligence for their
      organizations,
    a) Data scientists must master the full spectrum of the
         data science life cycle and,
    b) Data scientists must possess a level of flexibility
         and understanding to maximize returns at each
6
         phase of the process.
Cont’d
Data scientists:
 They need to be curious and result-oriented,
 with exceptional industry-specific knowledge and communication skills that allow them to explain highly technical results to their non-technical counterparts.
 Data scientists are analytical experts who utilize their skills in both technology and social science to find trends and manage data.
Cont’d
They possess a strong quantitative background in:
 Statistics and linear algebra, and
 Programming, with a focus on data warehousing, mining, and modeling to build and analyze algorithms.
Cont’d
Data Science vs Data Mining
 Data science is often confused with data mining.
 However, data mining is a subset of data science.
 Data mining involves analyzing large amounts of data (such as big data) in order to discover patterns and other useful information.
 Data science covers the entire scope of data collection and processing.
What are data and information?
i) Data:
 Data can be defined as a representation of facts, concepts, or instructions in a formalized manner, which should be suitable for communication, interpretation, or processing by humans or electronic machines.
 It can be described as unprocessed facts and figures.
 It is represented with the help of characters such as:
   Alphabets (A-Z, a-z),
   Digits (0-9), or
   Special characters (+, -, /, *, <, >, =, etc.).
Cont’d
ii) Information:
 It is the processed data on which decisions and actions are based.
 It is data that has been processed into a form that is meaningful to the recipient.
 Information has real or perceived value in the current or prospective actions or decisions of the recipient.
 Furthermore, information is interpreted data, created from organized, structured, and processed data in a particular context.
Data Processing Cycle
 Data processing is the re-structuring or re-ordering of data by people or machines to increase its usefulness and add value for a particular purpose.
 Data processing consists of the following basic steps:
   Input,
   Processing, and
   Output.
 It is the set of operations used to transform data into useful information.
    Cont’d
1) Input
 The input data is prepared in some convenient form for processing.
 The form will depend on the processing machine.
Fig. Data Processing Cycle: Input → Processing → Output
 For example, when electronic computers are used, the input data can be recorded on any one of several types of storage media, such as a hard disk, CD, flash disk, and so on.
Cont’d
2) Processing
 The input data is changed to produce data in a more useful form.
 For example, interest can be calculated on a deposit to a bank, or a summary of sales for the month can be calculated from the sales orders.
3) Output
 The result of the preceding processing step is collected.
 The particular form of the output data depends on the use of the data.
 For example, output data may be payroll for employees (a small sketch of the full cycle follows).
Examples
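As a concrete illustration, here is a minimal Python sketch of the Input → Processing → Output cycle using the bank-interest example above; the account names, amounts, and interest rate are invented purely for illustration.

```python
def process_interest(deposits, annual_rate):
    """Processing step: compute the interest earned on each deposit."""
    return {account: round(amount * annual_rate, 2)
            for account, amount in deposits.items()}

# Input: raw data prepared in a convenient form (here, a dictionary).
deposits = {"acct-001": 5000.00, "acct-002": 12000.00}

# Processing: transform the input into a more useful form.
interest = process_interest(deposits, annual_rate=0.07)

# Output: collect the result in the form the user needs (a simple report).
for account, amount in interest.items():
    print(f"{account}: interest earned = {amount}")
```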
2.2) Data types and their representation
 Data types can be described from diverse perspectives.
 A data type is simply an attribute of data that tells the compiler or interpreter how the programmer intends to use the data.
Data types from a Computer Programming perspective
 Almost all programming languages explicitly include the notion of data type, though different languages may use different terminology.
Common data types include (illustrated in the sketch after this list):
 Integers (int): used to store whole numbers, mathematically known as integers
 Booleans (bool): used to represent values restricted to one of two options: true or false
 Characters (char): used to store a single character
 Floating-point numbers (float): used to store real numbers
 Alphanumeric strings (string): used to store a combination of characters and numbers
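As a minimal illustration of these types, the sketch below declares one value of each in Python. The variable names and values are invented; note that Python has no separate char type, so a single character is simply a string of length 1.

```python
age = 25             # integer (int): a whole number
is_active = True     # Boolean (bool): restricted to True or False
grade = "A"          # character: in Python, a string of length 1
price = 19.99        # floating-point number (float): a real number
user_id = "user42"   # alphanumeric string (str): letters and digits

for value in (age, is_active, grade, price, user_id):
    print(value, type(value).__name__)
```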
Data types from a Data Analytics perspective
 From a data analytics point of view, it is important to understand that there are three common data types: Structured, Semi-structured, and Unstructured data.
Fig. Data types from a data analytics perspective (structured, semi-structured, and unstructured)
      Cont’d
Structured Data
 Structured data is data that adheres to a pre-defined data model and is
  therefore straightforward to analyse.
 Conforms to a tabular format with a relationship between the different
  rows and columns.
      Examples: Excel files or SQL databases
o Each of these has structured rows and columns that can be sorted.
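A minimal sketch of structured (tabular) data, assuming pandas is available; the table, its column names, and its values are invented for illustration only.

```python
import pandas as pd

# A small table that conforms to a fixed schema of rows and columns,
# just like an Excel sheet or a SQL table. Values are invented.
sales = pd.DataFrame({
    "order_id": [101, 102, 103],
    "product":  ["pen", "book", "bag"],
    "amount":   [3.50, 12.00, 25.75],
})

# Because the structure is pre-defined, rows can be sorted and queried directly.
print(sales.sort_values("amount", ascending=False))
```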
         Cont’d
Semi-Structured Data
 A form of structured data that does not conform with the formal structure
  of data models associated with relational databases or other forms of data
  tables.
 Contains tags or other markers to separate semantic elements and enforce
  hierarchies of records and fields within the data.
 Therefore, it is also known as a self-describing structure.
 Examples:
   o JavaScript Object Notation (JSON) and Extensible Markup Language (XML)
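A minimal sketch of a semi-structured record in JSON, using Python's standard json module; the field names and values are invented for illustration.

```python
import json

# A semi-structured record: keys (tags) describe each value and nesting
# encodes hierarchy, but there is no fixed table schema.
record = {
    "name": "Abebe",
    "skills": ["Python", "SQL"],          # variable-length list
    "address": {"city": "Addis Ababa"},   # nested (hierarchical) field
}

text = json.dumps(record, indent=2)   # serialize to JSON text
parsed = json.loads(text)             # parse it back into Python objects
print(text)
print(parsed["address"]["city"])
```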
      Cont’d
     Unstructured Data
      Information that either does not have a predefined data
       model or is not organized in a pre-defined manner.
      Unstructured information is typically text-heavy but may
       contain data such as dates, numbers, and facts as well.
Cont’d
 This results in irregularities and ambiguities that make it difficult to understand using traditional programs, compared to data stored in structured databases.
 Examples:
   Audio files,
   Video files, or
   NoSQL databases.
     Metadata
     Metadata is data about data.
     It provides additional information about a specific set of data.
     In a set of photographs, for example, metadata could describe
     when and where the photos were taken.
Advantage: metadata is useful when analyzing Big Data and building Big Data solutions (a small example follows).
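A small illustrative example of metadata as "data about data": a plain Python dictionary describing a photo file. All field names and values are invented.

```python
# The photo itself is the data; this dictionary only *describes* it,
# which is what makes it metadata.
photo_metadata = {
    "file_name": "IMG_0042.jpg",
    "taken_on": "2020-01-15 09:30",
    "location": "Addis Ababa",
    "camera_model": "Example Phone 12",
    "resolution": "4032x3024",
}

for key, value in photo_metadata.items():
    print(f"{key}: {value}")
```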
2.3) Data Value Chain
 The Data Value Chain is introduced to describe the information flow within a big data system as a series of steps needed to generate value and useful insights from data.
 The Big Data Value Chain identifies the following key high-level activities:
   Data Acquisition
   Data Analysis
   Data Curation
   Data Storage
   Data Usage
       Cont’d
 Data Acquisition: is the process of gathering, filtering, and cleaning data before it is put in a data warehouse or any other storage solution on which data analysis can be carried out.
 Data acquisition is one of the major big data challenges in terms of infrastructure requirements.
 A data acquisition infrastructure must be able to:
   handle very high transaction volumes,
   often in a distributed environment, and
   support flexible and dynamic data structures.
     Cont’d
 Data Analysis: Involves exploring, transforming, and modeling
 data with the goal of highlighting relevant data, synthesizing and
 extracting useful hidden information with high potential from a
 business point of view.
  Related areas include:
     Data mining
     Business intelligence, and
     Machine learning.
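A minimal data-analysis sketch, assuming pandas is available: it explores and transforms a few invented sales records to highlight a summary that is not obvious from the raw rows. The column names and values are made up for illustration.

```python
import pandas as pd

# Invented raw records: individual sales by region.
orders = pd.DataFrame({
    "region": ["North", "South", "North", "South", "North"],
    "sales":  [200, 150, 300, 120, 250],
})

# Explore/transform: total and average sales per region reveal a pattern
# that is hard to see in the raw rows alone.
summary = orders.groupby("region")["sales"].agg(["sum", "mean"])
print(summary)
```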
      Cont’d
Data Curation: It is the active management of data over its life cycle to
ensure it meets the necessary data quality requirements for its effective usage.
Data curation processes can be categorized into different activities such as
      Content creation
      Selection
      Classification
      Transformation
      Validation
      Preservation/protection
      Cont’d
 Data curation is performed by expert curators who are responsible for improving the accessibility and quality of data.
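As a small illustration of the curation activities listed above (selection, transformation, validation), the sketch below checks a tiny invented dataset against one basic quality rule and keeps only the rows that pass; it assumes pandas is available, and the rule itself is only an example.

```python
import pandas as pd

# A tiny invented dataset with one deliberately broken row.
records = pd.DataFrame({
    "id":    [1, 2, 3, None],
    "email": ["a@example.com", "b@example.com", None, "d@example.com"],
})

# Validation: flag rows where a mandatory field is missing.
invalid = records[records["id"].isna() | records["email"].isna()]
print(f"{len(invalid)} of {len(records)} rows fail the quality check")

# Selection/transformation: keep only the rows that pass, ready for reuse.
curated = records.dropna(subset=["id", "email"])
print(curated)
```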
 Data curators (also known as scientific curators or data
  annotators) hold the responsibility of ensuring that data are:
     Trustworthy
      Discoverable
      Accessible
      Reusable and fit their purpose
       Cont’d
Data Storage: is the persistence and management of data in a scalable way that satisfies the needs of applications that require fast access to the data.
A good example of data storage is the Relational Database Management System (RDBMS), which has been the main, and almost the only, solution to the storage paradigm for nearly 40 years.
     Cont’d
 Data Usage: It covers the data-driven business
       activities that need access to data, its analysis, and
       the tools needed to integrate the data analysis within
       the business activity.
      Data usage in business decision-making can enhance
       competitiveness through the reduction of costs,
       increased added value, or any other parameter that
       can be measured against existing performance
       criteria.
2.4) Basic Concepts of Big Data
     What is Big Data?
      Big data is a blanket term for the non-traditional
       strategies and technologies needed to gather,
       organize, process, and gather insights from large
       datasets.
      Big data is the term for a collection of data sets so
       large and complex that it becomes difficult to process
       using on-hand database management tools or
       traditional data processing applications.
What do we mean by a "large dataset" or "data sets so large"?
Cont’d
      A “large dataset” means a dataset too large to reasonably process
       or store with traditional tooling or on a single computer.
      This means that the common scale of big datasets is constantly
       shifting and may vary significantly from organization to
       organization.
             Cont’d
Big data is characterized by the 3 Vs and more:
 Volume: large amounts of data (zettabytes / massive datasets)
 Velocity: data is live, streaming, or in motion
 Variety: data comes in many different forms from diverse sources
 Veracity: can we trust the data? How accurate is it? etc.
Fig. Characteristics of Big Data
        Cont’d
     Problems with Big Data
What do you think is a solution to such problems?
     2.5.2. Clustered Computing and Hadoop Ecosystem
1. Clustered Computing: To better address the high
   storage and computational needs of big data, computer clusters
   are a better fit.
 Big data clustering software combines the resources of many
 smaller machines, seeking to provide a number of benefits:
        Cont’d
     Resource Pooling: Combining the available storage space to hold data
     is a clear benefit, but CPU and memory pooling are also extremely
     important.
     Processing large datasets requires large amounts of all three of these
     resources.
     Cont’d
 High Availability: Clusters can provide varying levels of fault tolerance
 and availability guarantees to prevent hardware or software failures from
 affecting access to data and processing.
 This becomes increasingly important as we continue to emphasize the
 importance of real-time analytics.
     Cont’d
     Easy Scalability: Clusters make it easy to scale horizontally by
     adding additional machines to the group. This means the system can
     react to changes in resource requirements without expanding the
     physical resources on a machine.
     Cont’d
      Using clusters requires a solution for managing
          Cluster membership,
           Coordinating resource sharing, and
           Scheduling actual work on individual nodes.
        Cluster membership and resource allocation can be handled by
        software like Hadoop’s YARN (which stands for Yet Another
        Resource Negotiator).
2. Hadoop and its Ecosystem
 Hadoop is an open-source framework intended to make interaction with big data
  easier.
 It is a framework that allows for the distributed processing of large datasets across
  clusters of computers using simple programming models.
The four key characteristics of Hadoop are:
 Economical: Its systems are highly economical, as ordinary computers can be used for data processing.
 Reliable: It is reliable, as it stores copies of the data on different machines and is resistant to hardware failure.
 Scalable: It is easily scalable, both horizontally and vertically. A few extra nodes help in scaling up the framework.
 Flexible: It is flexible, and you can store as much structured and unstructured data as you need and decide how to use it later.
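Hadoop's best-known "simple programming model" is MapReduce (listed among the ecosystem components on the next slides). The pure-Python sketch below only simulates the map, shuffle, and reduce phases of a word count on a single machine; real Hadoop jobs are typically written in Java or submitted through Hadoop Streaming, so this is a conceptual illustration only.

```python
from collections import defaultdict

documents = [
    "big data needs big clusters",
    "hadoop processes big data",
]

# Map phase: each document independently emits (word, 1) pairs,
# so this work could be spread across many cluster nodes.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group the intermediate pairs by key (the word).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: combine the grouped values into a final count per word.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)
```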
The 4 core components of Hadoop and its Ecosystem
The 4 core components of Hadoop include:
✓ Data Management,
✓ Data Access,
✓ Data Processing, and
✓ Data Storage.
Cont…
The Hadoop ecosystem includes the following components:
 HDFS: Hadoop Distributed File System
 YARN: Yet Another Resource Negotiator
 MapReduce: programming-based data processing
 Spark: in-memory data processing
 PIG, HIVE: query-based processing of data services
 HBase: NoSQL database
 Mahout, Spark MLlib: machine learning algorithm libraries
 Solr, Lucene: searching and indexing
 Zookeeper: cluster management
 Oozie: job scheduling
      Cont’d
How does HDFS work?
Cont’d
NameNode:
Cont’d
YARN:
3. Big Data Life Cycle with Hadoop (stages)
Stage 1 - Ingesting data into the system
 The data is ingested or transferred to Hadoop from various sources such as relational databases, systems, or local files.
Stage 2 - Processing the data in storage (stored and processed)
 The data is stored in the distributed file system, HDFS, and in the NoSQL distributed database, HBase. Spark and MapReduce perform data processing.
Stage 3 - Computing and analyzing data
 The data is analyzed by processing frameworks such as Pig, Hive, and Impala.
Stage 4 - Visualizing the results (Access)
     Cont…
 A. Ingesting data into the system
 The first stage of Big Data processing is Ingest.
 The data is ingested or transferred to Hadoop from various sources such as
 relational databases, systems, or local files.
 Sqoop transfers data from RDBMS to HDFS, whereas Flume transfers
event data.
        Cont’d
     B. Processing the data in storage
     In this stage, the data is stored and processed. The data is
stored in the distributed file system, HDFS, and in the
NoSQL distributed database, HBase. Spark and MapReduce
     perform data processing.
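As a hedged sketch of this stage, the PySpark snippet below reads a file from HDFS and runs a distributed aggregation. The HDFS path and column names are placeholders, and running it requires a working Spark installation and an HDFS cluster, which the slide assumes rather than shows.

```python
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session; the application name is arbitrary.
spark = SparkSession.builder.appName("stage2-processing").getOrCreate()

# Read structured data that has already been stored in HDFS.
# The path and schema are placeholders for illustration.
orders = spark.read.csv("hdfs:///data/orders.csv", header=True, inferSchema=True)

# Spark distributes this aggregation across the cluster's worker nodes.
totals = orders.groupBy("region").sum("amount")
totals.show()

spark.stop()
```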
      Cont’d
C. Computing and analyzing data
 Here, the data is analyzed by processing frameworks such as Pig, Hive, and Impala.
 Pig converts the data using map and reduce and then analyzes it.
 Hive is also based on map and reduce programming and is most suitable for structured data.
      Cont’d
     D. Visualizing the results
      The fourth stage is access, which is performed by tools such
     as Hue and Cloudera Search. In this stage, the analyzed data
     can be accessed by users.
     Chapter Two Review Questions
1. Define data science; what are the roles of a data scientist?
2. Discuss data and its types from the computer programming and data analytics perspectives.
3. Discuss the series of steps needed to generate value and useful insights from data.
4. What is the principal goal of data science?
5. List out and discuss the characteristics of Big Data.
6. How do we ingest streaming data into a Hadoop cluster?
End of Chapter 2