Lecture 11: Unstructured Data and the Data Warehouse
Building the Data Warehouse, by Inmon

Chapter 11: Unstructured Data and the Data Warehouse

http://it-slideshares.blogspot.com/
Contents
Overview
Integrating the Two Worlds
A Themed Match
A Two-Tiered Data Warehouse
A Self-Organizing Map (SOM)
Fitting the Two Environments Together
Summary
Overview
Unstructured data
 ◦ Casual, informal activities such as those found
   on the personal computer and the Internet
 ◦ Ex: Emails, Spreadsheets, Text files,
   Documents, Portable Document Format
   (.PDF) files, Microsoft PowerPoint (.PPT) files
Structured data
 ◦ Standard DBMSs, reports, indexes, databases,
   fields, records, and the like
Overview (cont'd)
The primary differences between
 structured data and unstructured data
Integrating the Two Worlds
Text: The Common Link
Plenty of problems arise:
 • Misspelling
 • Context
 • Same name
 • Nicknames
 • Diminutives
 • Incomplete names
 • Word stems
Integrating the Two Worlds (cont'd)
A Fundamental Mismatch
 ◦ The unstructured environment represents
   documents and communications.
 ◦ The structured environment represents
   transactions.
Matching Text across the Environments
 ◦ Remove extraneous stop words
 ◦ Reduce words back to their stems
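The two edits above can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the book: the stop-word list is a small sample, and `naive_stem` is a hypothetical suffix stripper (not a standard stemming algorithm) that reduces words to a shared token such as "mov" rather than the dictionary form "move". It does not handle prefixes, so "removing" is not reduced to the same stem.

```python
import re

# Small illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "for", "to", "by", "from", "when", "which"}

# Suffixes tried in this order by the naive stemmer (an assumption for illustration).
SUFFIXES = ("ing", "ed", "es", "er", "s", "e")


def naive_stem(word):
    """Strip one common suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(text):
    """Lowercase, tokenize, drop stop words, and stem what is left."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


if __name__ == "__main__":
    print(normalize("The mover moved the boxes by truck"))
```

After this normalization, "moving", "moved", "moves", and "mover" all compare equal, which is the property the matching step needs.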
Integrating the Two Worlds (cont'd)
A Probabilistic Match
Integrating the Two Worlds (cont'd)
Matching All the Information
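A probabilistic match can be sketched as a score over the attributes that two records have in common. The scoring rule below (fraction of shared attributes that agree) and the 0.75 threshold are assumptions chosen for illustration, and all records are made-up sample data.

```python
def match_score(candidate, reference):
    """Fraction of attributes present in both records whose values agree."""
    shared = [key for key in candidate if key in reference]
    if not shared:
        return 0.0
    agreeing = sum(
        1
        for key in shared
        if str(candidate[key]).strip().lower() == str(reference[key]).strip().lower()
    )
    return agreeing / len(shared)


def is_probable_match(candidate, reference, threshold=0.75):
    """Treat the records as the same person when enough shared data intersects."""
    return match_score(candidate, reference) >= threshold


# Hypothetical data gathered about the "Bob Smith" found in an email.
email_sender = {"name": "Bob Smith", "city": "Denver", "employer": "Acme", "phone": "555-0100"}

# Hypothetical structured records for two different "Bob Smiths".
record_a = {"id": 17, "name": "bob smith", "city": "Denver", "employer": "Acme", "phone": "555-0199"}
record_b = {"id": 42, "name": "Bob Smith", "city": "Boston", "employer": "Initech", "phone": "555-0142"}

if __name__ == "__main__":
    for record in (record_a, record_b):
        print(record["id"], match_score(email_sender, record), is_probable_match(email_sender, record))
```

The name alone matches both records; it is the additional intersecting data (city, employer) that separates the likely match from the unlikely one.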
A Themed Match
Industrially Recognized Themes
 ◦ The unstructured data is analyzed according
   to the existence of words that relate to
   industrially recognized themes.
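As a rough sketch of this kind of analysis, the snippet below counts how many words or phrases from each theme occur in a piece of text. The word lists come from the editor's notes for this slide; the function names and the sample sentence are hypothetical, and whole-word matching is a simplifying assumption.

```python
import re

# Theme word lists taken from the editor's notes; a real system would use larger lists.
THEMES = {
    "accounting": ["receivable", "payable", "cash on hand", "asset", "debit", "due date", "account"],
    "finance": ["price", "margin", "discount", "gross sale", "net sale", "interest rate",
                "carrying loan", "balance due"],
}


def theme_hits(text, themes=THEMES):
    """Count, per theme, how many of its words or phrases occur in the text."""
    lowered = text.lower()
    return {
        name: sum(1 for term in terms if re.search(r"\b" + re.escape(term) + r"\b", lowered))
        for name, terms in themes.items()
    }


def best_theme(text, themes=THEMES):
    """Return the theme with the most hits, or None when nothing matches."""
    hits = theme_hits(text, themes)
    name = max(hits, key=hits.get)
    return name if hits[name] > 0 else None


if __name__ == "__main__":
    sample = "The due date for the receivable is Friday; the asset was sold at a discount."
    print(theme_hits(sample), best_theme(sample))
```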
A Themed Match
Naturally Occurring Themes
 • fire: 296 occurrences
 • fireman: 285 occurrences
 • hose: 277 occurrences
 • firetruck: 201 occurrences
 • alarm: 199 occurrences
 • smoke: 175 occurrences
 • heat: 128 occurrences

 • fire: 296 occurrences
 • Rock Springs, WY: 2
 • alabaster: 1
 • angel: 2
 • Rio Grande river: 1
 • beaver dam: 1
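A ranking like the one above can be produced by counting word occurrences in a document and keeping the most frequent ones. The sketch below uses Python's `collections.Counter`; the function name, the stop-word list, and the sample text are assumptions for illustration.

```python
import re
from collections import Counter

# Illustrative stop words, removed before counting.
STOP_WORDS = frozenset({"a", "an", "the", "and", "with", "while"})


def document_theme(text, top_n=3, stop_words=STOP_WORDS):
    """Rank the words in one document by number of occurrences."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in stop_words]
    return Counter(words).most_common(top_n)


sample = (
    "The fire alarm rang. Smoke and fire spread; "
    "the fireman fought the fire with a hose while smoke rose."
)

if __name__ == "__main__":
    print(document_theme(sample, top_n=2))
```

Words that occur only once or twice fall to the bottom of the ranking and contribute little to the theme.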
A Themed Match
Linkage through Themes and Themed Words
A Themed Match
Linkage through Abstraction and Metadata
 ◦ Another way to link the two environments.
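One way to picture this linkage: the structured environment holds data at an abstract (metadata) level, such as a data element called "Name", and at an occurrence level, such as the individual names. The sketch below links a piece of text to the metadata element whose occurrences it mentions. The names come from the editor's notes; the dictionary layout and function name are assumptions for illustration.

```python
# Structured side: metadata element (abstract level) -> occurrences (actual values).
STRUCTURED = {
    "Name": {"Bill Jones", "Mary Adams", "Wayne Folmer", "Susan Young"},
}


def link_by_metadata(text, structured=STRUCTURED):
    """Return {data element: sorted occurrences found in the text}."""
    lowered = text.lower()
    links = {}
    for element, occurrences in structured.items():
        found = sorted(o for o in occurrences if o.lower() in lowered)
        if found:
            links[element] = found
    return links


if __name__ == "__main__":
    email = "Mary Adams forwarded the contract to Bill Jones."
    print(link_by_metadata(email))
```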
A Two-Tiered Data Warehouse
Two-Tiered Data Warehouse
 ◦ One tier of the data warehouse is for
   unstructured data and another tier of the data
   warehouse is for structured data.
A Two-Tiered Data Warehouse
Dividing the Unstructured Data Warehouse
 ◦ Unstructured communications
 ◦ Documents and libraries
A Two-Tiered Data Warehouse
Documents in the Unstructured Data Warehouse
 ◦ Several factors determine whether or not the actual
   document is stored in the data warehouse:
   • How many documents are there?
   • What is the size of the documents?
   • How critical is the information in the document?
   • Can the document be easily reached if it is not
     stored in the warehouse?
   • Can subsections of the document be captured?
A Two-Tiered Data Warehouse
Visualizing Unstructured Data
 ◦ Unstructured visualization is the counterpart
   to structured visualization.
 ◦ Structured visualization is known as Business
   Intelligence.
 ◦ The essence of structured visualization is the
   display of numbers.
A Two-Tiered Data Warehouse
A Self-Organizing Map (SOM)
 ◦ Produces a display that appears to be a
   topographical map
 ◦ Shows how different words and documents
   are clustered and displayed according to themes
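To give a feel for how such a map groups documents, here is a very small Kohonen-style self-organizing map in plain Python. It is a one-dimensional toy sketch, not the visualization tool the slides refer to: the function names, the parameters, and the two-number document vectors (share of fire-related and finance-related words) are all assumptions for illustration.

```python
import math
import random


def best_matching_unit(weights, vector):
    """Index of the map unit whose weight vector is closest to `vector`."""
    return min(range(len(weights)), key=lambda i: math.dist(weights[i], vector))


def train_som(vectors, n_units=4, epochs=50, lr0=0.5, radius0=1.0, seed=0):
    """Train a 1-D map; learning rate and neighborhood radius shrink each epoch."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_units)]
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs
        lr, radius = lr0 * decay, radius0 * decay
        for v in vectors:
            bmu = best_matching_unit(weights, v)
            for i, w in enumerate(weights):
                # Units near the winner on the map are pulled toward the input too.
                influence = math.exp(-((i - bmu) ** 2) / (2 * radius ** 2))
                for d in range(dim):
                    w[d] += lr * influence * (v[d] - w[d])
    return weights


if __name__ == "__main__":
    # Hypothetical documents: [share of fire words, share of finance words].
    docs = {"report_1": [0.9, 0.1], "report_2": [0.8, 0.2],
            "memo_1": [0.1, 0.9], "memo_2": [0.2, 0.8]}
    som = train_som(list(docs.values()))
    for name, vec in docs.items():
        print(name, "-> unit", best_matching_unit(som, vec))
```

Documents that land on the same or neighboring units are the ones a SOM display would draw close together.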
A Themed Match

The Unstructured Data Warehouse
 ◦ Is divided into two basic organizations: one part
   for documents and another part for
   communications
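The editor's notes list what can be stored in each part: the first n bytes, the item itself (optional), context information, and keyword information. A record with those fields might be sketched as follows; the class, field, and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class UnstructuredEntry:
    section: str                                   # "document" or "communication"
    first_bytes: bytes                             # the first n bytes of the item
    context: dict = field(default_factory=dict)    # context information
    keywords: list = field(default_factory=list)   # keyword information
    body: Optional[bytes] = None                   # the item itself (optional)


def make_entry(section, content, n=64, store_body=False, context=None, keywords=None):
    """Build an entry, keeping the full content only when store_body is True."""
    if section not in ("document", "communication"):
        raise ValueError("section must be 'document' or 'communication'")
    return UnstructuredEntry(
        section=section,
        first_bytes=content[:n],
        context=dict(context or {}),
        keywords=list(keywords or []),
        body=content if store_body else None,
    )


if __name__ == "__main__":
    entry = make_entry("document", b"Quarterly fire safety report ...", n=16, keywords=["fire"])
    print(entry)
```

Leaving `body` empty by default reflects the point that storing the full document is a decision, which also helps keep volumes down.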
A Themed Match

Volumes of Data and the Unstructured Data Warehouse
 ◦ Volumes of data are an issue
 ◦ Mitigate the volumes of data that can collect in the
   unstructured data warehouse
Fitting the Two Environments Together
 ◦ The unstructured environment may contain
   data that is incompatible with data from the
   structured environment
 ◦ However, there are ways that the two
   environments can be related
Fitting the Two Environments Together
Summary
The world of information technology is really
 divided into two worlds: structured data and
 unstructured data.
The common bond between the two worlds is
 text.
The structured environment and the
 unstructured environment can be matched at:
 ◦ the identifier level
 ◦ the close identifier level using a probabilistic
   match
 ◦ the keyword to metadata or repository level

Editor's Notes

  • #6 Integrating the two worlds is like matching different formats of electricity: alternating current (AC) and direct current (DC). The unstructured world operates on AC and the structured world operates on DC. Problems in integrating by text:
      ◦ Misspelling: What if two words are found in the two environments, Chernobyl and Chernobile? Should a match be made between these two words? Do they refer to the same thing or something different?
      ◦ Context: The term “bill” is found in the two worlds. Should they be matched? In one case, the reference is to a bird’s beak and in the other case, the reference is to how much money a person is owed.
      ◦ Same name: The same name, “Bob Smith,” appears in both worlds. Are they the same thing? Do they refer to the same person? Or do they refer to entirely different people who happen to have matching names?
      ◦ Nicknames: In one world, there appears the name “Bill Inmon.” In another world there appears the name “William Inmon.” Should a match be made? Do they refer to the same person?
      ◦ Diminutives: Is 1245 Sharps Ct the same as 1245 Sharps Court? Is NY, NY, the same as New York, New York?
      ◦ Incomplete names: Is Mrs. Inmon the same as Lynn Inmon?
      ◦ Word stems: Should the word “moving” be connected and matched with the word “moved”?
  • #7 A stop word is a word that occurs so frequently as to be meaningless to the document. Typical stop words include the following: a, an, the, for, to, by, from, when, which… The second basic edit that must be done is the reduction of words back to their stem. For example, the following words all have the same grammatical stem: moving, moved, moves, mover, removing → “move”
  • #8, #9 In a probabilistic match, as much data as possible that might indicate the “Bob Smith” you are looking for is gathered and used as a basis for a match against similar data found where other “Bob Smiths” are located. Then, all the data that intersects is used to determine whether a match on the name is valid.
  • #10 The accounting theme would contain words and phrases such as the following: receivable, payable, cash on hand, asset, debit, due date, account… The finance theme would contain such information as the following: price, margin, discount, gross sale, net sale, interest rate, carrying loan, balance due. There can be many industrially recognized themes for word collections. Some of the word themes might be the following: sales, marketing, finance, human resources, engineering, accounting, distribution…
  • #11 In an organization by “natural” themes, the unstructured data is collected on a document-by-document basis. Once the data is collected, the words and phrases are ranked by number of occurrences. Then, a theme to the document is formed by ranking the words and phrases inside the document based on the number of occurrences.
  • #12 Raw match of data: if a word is found anywhere in the structured environment and the word is part of the theme of a document, the unstructured document is linked to the structured record. But such a matching is not very meaningful and may actually be misleading.
  • #13 In Figure 11-11, data in the unstructured environment includes such people as Bill Jones, Mary Adams, Wayne Folmer, and Susan Young. All of these people exist in records of data that have a data element called “Name.” Put another way, data exists at two levels in the structured environment—the abstract level and the actual occurrence level. Figure 11-12 shows this relationship of data. In Figure 11-12, data exists at an abstract level—the metadata level. In addition, data exists at the occurrence level—where the actual occurrences of data reside.
  • #14 The data found in the unstructured data warehouse is in many ways similar to the data found in the structured data warehouse. Consider the following when looking at data in the unstructured environment: It exists at a low level of granularity. It has an element of time attached to the data. It is typically organized by subject area or “theme.”
  • #19 The data that can be stored in each section includes the following:
      ◦ The first n bytes of the document
      ◦ The document itself (optional)
      ◦ The communication itself (optional)
      ◦ Context information
      ◦ Keyword information
  • #21 An identifier is an occurrence of data that serves to specifically identify a record. Close identifiers are identifiers where there is a good probability that a solid identification has been made.