
Data analysis

From Wikipedia, the free encyclopedia



Analysis of data is a process of inspecting, cleaning, transforming, and modeling data with the goal of highlighting
useful information, suggesting conclusions, and supporting decision making. Data analysis has multiple facets and approaches,
encompassing diverse techniques under a variety of names, in different business, science, and social science domains.

Data mining is a particular data analysis technique that focuses on modeling and knowledge discovery for predictive rather than
purely descriptive purposes. Business intelligence covers data analysis that relies heavily on aggregation, focusing on business
information. In statistical applications, some people divide data analysis into descriptive statistics, exploratory data analysis (EDA),
and confirmatory data analysis (CDA). EDA focuses on discovering new features in the data and CDA on confirming or falsifying
existing hypotheses. Predictive analytics focuses on application of statistical or structural models for predictive forecasting or
classification, while text analytics applies statistical, linguistic, and structural techniques to extract and classify information from textual
sources, a species of unstructured data. All are varieties of data analysis.

Data integration is a precursor to data analysis, and data analysis is closely linked to data visualization and data dissemination. The
term data analysis is sometimes used as a synonym for data modeling.

Contents

1 Type of data

2 The process of data analysis

3 Data cleaning

4 Initial data analysis

o 4.1 Quality of data

o 4.2 Quality of measurements

o 4.3 Initial transformations

o 4.4 Did the implementation of the study fulfill the intentions of the research design?

o 4.5 Characteristics of data sample

o 4.6 Final stage of the initial data analysis

o 4.7 Analyses

5 Main data analysis

o 5.1 Exploratory and confirmatory approaches

o 5.2 Stability of results


o 5.3 Statistical methods

6 Free software for data analysis

7 Nuclear and particle physics

8 See also

9 References

10 Further reading

Type of data

Data can be of several types:

 Quantitative data: the datum is a number

 Often this is a continuous decimal number to a specified number of significant digits

 Sometimes it is a whole counting number

 Categorical data: the datum is one of several categories

 Qualitative data: the datum is a pass/fail result or the presence or absence of a characteristic


The process of data analysis

Data analysis is a process, within which several phases can be distinguished:[1]

Data cleaning

Data cleaning is an important procedure during which the data are inspected, and erroneous data are—if necessary, preferable, and
possible—corrected. Data cleaning can be done during the stage of data entry. If this is done, it is important that no subjective
decisions are made. The guiding principle provided by Adèr (ref) is: during subsequent manipulations of the data, information should
always be cumulatively retrievable. In other words, it should always be possible to undo any data set alterations. Therefore, it is
important not to throw information away at any stage in the data cleaning phase. All information should be saved (i.e., when altering
variables, both the original values and the new values should be kept, either in a duplicate data set or under a different variable
name), and all alterations to the data set should be carefully and clearly documented, for instance in a syntax file or a log.[2]
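
As a minimal illustration of this principle (a sketch, not part of the original text; the variable names and values are hypothetical), one can keep the original values, derive cleaned values under a new name, and keep a log of every alteration, for example in Python with pandas:

import pandas as pd

# Hypothetical raw data; -1 marks an obviously erroneous age entry
raw = pd.DataFrame({"age": [34, 29, -1, 41]})

cleaned = raw.copy()
# Keep the original 'age' column intact and store corrected values under a new name
cleaned["age_clean"] = cleaned["age"].where(cleaned["age"] >= 0)

# Simple audit trail, analogous to keeping a syntax file or log of all alterations
log = ["age_clean: negative ages set to missing (original 'age' retained)"]
print(cleaned)
print("\n".join(log))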

Initial data analysis

The most important distinction between the initial data analysis phase and the main analysis phase is that during initial data analysis one refrains from any analysis aimed at answering the original research question. The initial data analysis phase is guided by the following four questions:[3]

Quality of data
The quality of the data should be checked as early as possible. Data quality can be assessed in several ways, using different types of
analyses: frequency counts, descriptive statistics (mean, standard deviation, median), normality (skewness, kurtosis, frequency
histograms, normal probability plots), associations (correlations, scatter plots).
Other initial data quality checks are:

 Checks on data cleaning: have decisions influenced the distribution of the variables? The distribution of the variables before data
cleaning is compared to the distribution of the variables after data cleaning to see whether data cleaning has had unwanted
effects on the data.

 Analysis of missing observations: are there many missing values, and are the values missing at random? The missing
observations in the data are analyzed to see whether more than 25% of the values are missing, whether they are missing at
random (MAR), and whether some form of imputation is needed.

 Analysis of extreme observations: outlying observations in the data are analyzed to see if they seem to disturb the distribution.

 Comparison and correction of differences in coding schemes: variables are compared with coding schemes of variables external
to the data set, and possibly corrected if coding schemes are not comparable.

 Test for common-method variance.

The choice of analyses to assess the data quality during the initial data analysis phase depends on the analyses that will be
conducted in the main analysis phase.[4]
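
For illustration, a minimal Python/pandas sketch of such checks (the file name study_data.csv and its columns are hypothetical):

import pandas as pd

df = pd.read_csv("study_data.csv")        # hypothetical data file

print(df.describe())                      # descriptive statistics (mean, standard deviation, quartiles)
print(df.skew(numeric_only=True))         # skewness, as a check of normality
print(df.kurtosis(numeric_only=True))     # kurtosis
print(df.isna().mean())                   # proportion of missing values per variable
print(df.corr(numeric_only=True))         # pairwise associations (correlations)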

Quality of measurements
The quality of the measurement instruments should only be checked during the initial data analysis phase when this is not the focus or research question of the study. One should check whether the structure of the measurement instruments corresponds to the structure reported in the literature.
There are two ways to assess measurement quality:

 Confirmatory factor analysis

 Analysis of homogeneity (internal consistency), which gives an indication of the reliability of a measurement instrument. During this analysis, one inspects the variances of the items and the scales, the Cronbach's α of the scales, and the change in Cronbach's α when an item is deleted from a scale.[5]
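
A minimal sketch of the homogeneity check, computing Cronbach's α directly from its definition (the item matrix below is hypothetical):

import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: 2-D array, rows = respondents, columns = scale items."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances.sum() / total_variance)

scale = np.array([[3, 4, 3], [2, 2, 3], [5, 4, 5], [4, 4, 4]])  # hypothetical item responses
print(cronbach_alpha(scale))
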
Initial transformations
After assessing the quality of the data and of the measurements, one might decide to impute missing data, or to perform initial transformations of one or more variables, although this can also be done during the main analysis phase.[6]
Possible transformations of variables are:[7]

 Square root transformation (if the distribution differs moderately from normal)

 Log-transformation (if the distribution differs substantially from normal)

 Inverse transformation (if the distribution differs severely from normal)

 Make categorical (ordinal / dichotomous) (if the distribution differs severely from normal, and no transformations help)
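
A minimal sketch of trying these transformations on a skewed variable and inspecting how the skewness changes (the simulated variable is only illustrative):

import numpy as np
from scipy import stats

x = np.random.default_rng(0).lognormal(size=200)   # toy right-skewed, strictly positive variable

print("skewness before:", stats.skew(x))
for name, transform in {"sqrt": np.sqrt, "log": np.log, "inverse": lambda v: 1.0 / v}.items():
    print(name, "skewness after:", stats.skew(transform(x)))
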
Did the implementation of the study fulfill the intentions of the research design?
One should check the success of the randomization procedure, for instance by checking whether background and substantive variables are equally distributed within and across groups.
If the study did not need or use a randomization procedure, one should check the success of the non-random sampling, for instance by checking whether all subgroups of the population of interest are represented in the sample.
Other possible data distortions that should be checked are:

 dropout (this should be identified during the initial data analysis phase)

 Item nonresponse (whether this is random or not should be assessed during the initial data analysis phase)

 Treatment quality (using manipulation checks).[8]
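
As a sketch of such a check (the column names group, age and sex are hypothetical), one can compare a continuous background variable across groups with a t-test and a categorical one with a chi-squared test:

import pandas as pd
from scipy import stats

df = pd.read_csv("study_data.csv")                      # hypothetical data file

treated = df.loc[df["group"] == "treatment", "age"]
control = df.loc[df["group"] == "control", "age"]
print(stats.ttest_ind(treated, control, equal_var=False))           # balance of a continuous variable

print(stats.chi2_contingency(pd.crosstab(df["group"], df["sex"])))  # balance of a categorical variable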


Characteristics of data sample
In any report or article, the structure of the sample must be accurately described. It is especially important to exactly determine the
structure of the sample (and specifically the size of the subgroups) when subgroup analyses will be performed during the main
analysis phase.
The characteristics of the data sample can be assessed by looking at:

 Basic statistics of important variables

 Scatter plots

 Correlations

 Cross-tabulations[9]
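
A short sketch of these descriptions with pandas (again assuming hypothetical column names):

import pandas as pd

df = pd.read_csv("study_data.csv")
print(df[["age", "income"]].describe())                    # basic statistics of important variables
print(df[["age", "income"]].corr())                        # correlations
print(pd.crosstab(df["sex"], df["region"], margins=True))  # cross-tabulation with subgroup sizes
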
Final stage of the initial data analysis
During the final stage, the findings of the initial data analysis are documented, and necessary, preferable, and possible corrective
actions are taken.
Also, the original plan for the main data analyses can and should be specified in more detail and/or rewritten.
In order to do this, several decisions about the main data analyses can and should be made:

 In the case of non-normal distributions: should one transform variables, make variables categorical (ordinal/dichotomous), or adapt the analysis method?

 In the case of missing data: should one neglect or impute the missing data; which imputation technique should be used?

 In the case of outliers: should one use robust analysis techniques?

 In case items do not fit the scale: should one adapt the measurement instrument by omitting items, or rather ensure comparability
with other (uses of the) measurement instrument(s)?

 In the case of (too) small subgroups: should one drop the hypothesis about inter-group differences, or use small sample
techniques, like exact tests or bootstrapping?
 In case the randomization procedure seems to be defective: can and should one calculate propensity scores and include them as
covariates in the main analyses?[10]
Analyses

Several analyses can be used during the initial data analysis phase:[11]

 Univariate statistics

 Bivariate associations (correlations)

 Graphical techniques (scatter plots)

It is important to take the measurement levels of the variables into account for the analyses, as special statistical techniques are
available for each level:[12]

 Nominal and ordinal variables

 Frequency counts (numbers and percentages)

 Associations

 cross-tabulations

 hierarchical loglinear analysis (restricted to a maximum of 8 variables)

 loglinear analysis (to identify relevant/important variables and possible confounders)

 Exact tests or bootstrapping (in case subgroups are small)

 Computation of new variables

 Continuous variables

 Distribution

 Statistics (M, SD, variance, skewness, kurtosis)

 Stem-and-leaf displays

 Box plots
Main data analysis

In the main analysis phase, analyses aimed at answering the research question are performed, as well as any other relevant analyses needed to write the first draft of the research report.[13]

Exploratory and confirmatory approaches


In the main analysis phase either an exploratory or confirmatory approach can be adopted. Usually the approach is decided before
data is collected. In an exploratory analysis no clear hypothesis is stated before analysing the data, and the data is searched for
models that describe the data well. In a confirmatory analysis clear hypotheses about the data are tested.
Exploratory data analysis should be interpreted carefully. When testing multiple models at once there is a high chance of finding at least one of them to be significant, but this can be due to a type I error. It is important to always adjust the significance level when testing multiple models with, for example, a Bonferroni correction. Also, one should not follow up an exploratory analysis with a confirmatory analysis in the same dataset. An exploratory analysis is used to find ideas for a theory, but not to test that theory as well. When a model is found exploratorily in a dataset, then following up that analysis with a confirmatory analysis in the same dataset could simply mean that the results of the confirmatory analysis are due to the same type I error that resulted in the exploratory model in the first place. The confirmatory analysis therefore will not be more informative than the original exploratory analysis.[14]
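
For illustration, a Bonferroni adjustment of a set of hypothetical p-values using statsmodels (the p-values themselves are made up):

import numpy as np
from statsmodels.stats.multitest import multipletests

pvals = np.array([0.003, 0.02, 0.04, 0.30])    # hypothetical p-values from four exploratory models
reject, p_adjusted, _, _ = multipletests(pvals, alpha=0.05, method="bonferroni")
# Bonferroni multiplies each p-value by the number of tests (capped at 1)
print(reject)
print(p_adjusted)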

Stability of results
It is important to obtain some indication of how generalizable the results are.[15] While this is hard to check, one can look at the stability of the results. Are the results reliable and reproducible? There are two main ways of doing this:

 Cross-validation: By splitting the data into multiple parts, we can check whether an analysis (like a fitted model) based on one part of the data generalizes to another part of the data as well.

 Sensitivity analysis: A procedure to study the behavior of a system or model when global parameters are (systematically) varied.
One way to do this is with bootstrapping.
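
A minimal sketch of both checks on simulated data with scikit-learn and NumPy (the data and model are purely illustrative):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Cross-validation: does a model fitted on one part of the data generalize to another part?
print(cross_val_score(LinearRegression(), X, y, cv=5))

# Bootstrapping: how stable is a coefficient estimate across resampled data sets?
rng = np.random.default_rng(0)
coefs = []
for _ in range(500):
    idx = rng.integers(0, len(y), len(y))
    coefs.append(LinearRegression().fit(X[idx], y[idx]).coef_[0])
print(np.percentile(coefs, [2.5, 97.5]))       # rough 95% interval for the first coefficient
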
Statistical methods
Many statistical methods have been used for statistical analyses. A very brief list of four of the more popular methods is:

 General linear model: A widely used model on which various statistical methods are based (e.g. t-test, ANOVA, ANCOVA, MANOVA). Usable for assessing the effect of several predictors on one or more continuous dependent variables.

 Generalized linear model: An extension of the general linear model for discrete dependent variables.

 Structural equation modelling: Usable for assessing latent structures from measured manifest variables.

 Item response theory: Models for (mostly) assessing one latent variable from several binary measured variables (e.g. an exam).
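
For illustration, a sketch of fitting a general linear model and a generalized linear model with statsmodels formulas (the file and column names outcome, passed, age and group are hypothetical):

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("study_data.csv")

# General linear model: a continuous outcome regressed on a continuous and a categorical predictor
ols_fit = smf.ols("outcome ~ age + C(group)", data=df).fit()
print(ols_fit.summary())

# Generalized linear model for a binary outcome (logistic regression)
logit_fit = smf.logit("passed ~ age + C(group)", data=df).fit()
print(logit_fit.summary())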

Data integrity
From Wikipedia, the free encyclopedia

Data integrity in its broadest meaning refers to the trustworthiness of information over its entire life cycle. In more analytic terms, it is "the representational faithfulness of information to the true state of the object that the information represents, where representational faithfulness is composed of four essential qualities or core attributes: completeness, currency/timeliness, accuracy/correctness and validity/authorization."[1] The concept of business rules is already widely used and is subdivided into six categories, one of which is data rules. Data rules are further subdivided into data integrity rules, data sourcing rules, data extraction rules, data transformation rules, and data deployment rules.

Data integrity is very important in database operations in particular, and in data warehousing and business intelligence in general. Because data integrity ensures that data is of high quality, correct, consistent, and accessible, it is important to follow the rules governing data integrity.

A data value rule or conditional data value rule specifies data domains. The difference between the two is that the former specifies the domain of allowable values for a data attribute and applies in all situations, while the latter applies only when certain exceptions or conditions hold.

A data structure rule defines the cardinality of data for a data relation in cases where no conditions or exceptions apply. This rule makes the data structure very easy to understand. A conditional data structure rule is slightly different in that it governs the data cardinality for a data relation when conditions or exceptions do apply.

A data derivation rule specifies how a data value is derived, based on an algorithm, contributors, and conditions. It also specifies the conditions under which the data value may be re-derived.

A data retention rule specifies the length of time data values can or should be retained in a particular database, and what can be done with data values when their validity or usefulness for the database expires. A data occurrence retention rule specifies the length of time a data occurrence is to be retained and what can be done with the data when it is no longer considered useful. A data attribute retention rule is similar, but applies only to specific data values rather than the entire data occurrence.

These data integrity rules, like any other rules, are meaningless unless they are implemented and enforced.

In order to achieve data integrity, these rules should be consistently and routinely applied to all data entering the data warehouse, or any other data resource for that matter. There should be no waivers or exceptions in enforcing these rules, because even a slight relaxation of enforcement could lead to substantial errors.

As much as possible, these data integrity rules should be implemented as close to the initial capture of the data as possible, so that potential breaches of integrity can be detected and corrected early. This greatly helps prevent errors and inconsistencies from entering the database.

With strict implementation and enforcement of these data integrity rules, data error rates are much lower, so less time is spent troubleshooting and tracing faulty computing results, which translates into savings in staffing costs. A low error rate also yields high-quality data that better supports a company's statistical analysis, trend and pattern spotting, and decision-making tasks. In today's digital age, information is one major key to success, and having the right information gives an organization an edge over its competitors.

Most narrowly, data with integrity has a complete or whole structure. All characteristics of the data, including business rules, rules for how pieces of data relate, dates, definitions, and lineage, must be correct for data to be complete.

Per the discipline of data architecture, when functions are performed on the data, the functions must ensure integrity. Examples of such functions are transforming the data, storing the history, storing the definitions (metadata), and storing the lineage of the data as it moves from one place to another. The most important aspect of data integrity per the data architecture discipline is to expose the data, the functions, and the data's characteristics.

Data that has integrity is identically maintained during any operation (such as transfer, storage or retrieval). Put simply in business terms, data integrity is the assurance that data is consistent, certified and can be reconciled.

In terms of a database, data integrity refers to the process of ensuring that a database remains an accurate reflection of the universe of discourse it is modelling or representing. In other words, there is a close correspondence between the facts stored in the database and the real world it models.[2]

Types of integrity constraints
Data integrity is normally enforced in a database system by a series of integrity constraints or rules. Three types of integrity constraints are an inherent part of the relational data model: entity integrity, referential integrity and domain integrity.

Entity integrity concerns the concept of a primary key. Entity integrity is an integrity rule which states that every table must have a primary key and that the column or columns chosen to be the primary key should be unique and not null.

Referential integrity concerns the concept of a foreign key. The referential integrity rule states that any foreign key value can only be in one of two states. The usual state of affairs is that the foreign key value refers to a primary key value of some table in the database. Occasionally, and this will depend on the rules of the business, a foreign key value can be null. In this case we are explicitly saying that either there is no relationship between the objects represented in the database or that this relationship is unknown.

Domain integrity specifies that all columns in a relational database must be declared over a defined domain. The primary unit of data in the relational data model is the data item. Such data items are said to be non-decomposable or atomic. A domain is a set of values of the same type. Domains are therefore pools of values from which the actual values appearing in the columns of a table are drawn.

If a database supports these features, it is the responsibility of the database to ensure data integrity as well as the consistency model for data storage and retrieval. If a database does not support these features, it is the responsibility of the application to ensure data integrity while the database supports the consistency model for data storage and retrieval.

Having a single, well-controlled, and well-defined data integrity system increases stability (one centralized system performs all data integrity operations), performance (all data integrity operations are performed in the same tier as the consistency model), re-usability (all applications benefit from a single centralized data integrity system), and maintainability (one centralized system for all data integrity administration).

Today, since all modern databases support these features (see Comparison of relational database management systems), it has become the de facto responsibility of the database to ensure data integrity. Outdated and legacy systems that use file systems (text, spreadsheets, ISAM, flat files, etc.) for their consistency model lack any kind of data integrity model. This requires companies to invest a large amount of time, money, and personnel in the creation of data integrity systems on a per-application basis that effectively just duplicate the existing data integrity systems found in modern databases. Many companies, and indeed many database systems themselves, offer products and services to migrate outdated and legacy systems to modern databases to provide these data integrity features. This offers companies substantial savings in time, money, and resources because they do not have to develop per-application data integrity systems that must be re-factored each time business requirements change.

Examples

An example of a data integrity mechanism is the parent-and-child relationship of related records. If a parent record owns one or more related child records, all of the referential integrity processes are handled by the database itself, which automatically ensures the accuracy and integrity of the data, so that no child record can exist without a parent (also called being orphaned) and that no parent loses its child records. It also ensures that no parent record can be deleted while the parent record owns any child records. All of this is handled at the database level and does not require coding integrity checks into each application.
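
For illustration, a minimal sketch of entity, referential and domain integrity using Python's built-in sqlite3 module (the tables and values are hypothetical; SQLite enforces foreign keys only when the pragma below is enabled):

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")   # SQLite enforces foreign keys only when this pragma is on

con.execute("""
    CREATE TABLE parent (
        id   INTEGER PRIMARY KEY,   -- entity integrity: unique, non-null key
        name TEXT NOT NULL
    )""")
con.execute("""
    CREATE TABLE child (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER NOT NULL REFERENCES parent(id),   -- referential integrity
        age       INTEGER CHECK (age >= 0)                  -- domain integrity
    )""")

con.execute("INSERT INTO parent VALUES (1, 'Alice')")
con.execute("INSERT INTO child VALUES (1, 1, 7)")        # accepted: parent 1 exists

try:
    con.execute("INSERT INTO child VALUES (2, 99, 3)")   # rejected: would create an orphaned child
except sqlite3.IntegrityError as err:
    print("rejected:", err)

try:
    con.execute("DELETE FROM parent WHERE id = 1")       # rejected: the parent still owns a child
except sqlite3.IntegrityError as err:
    print("rejected:", err)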

Data quality
From Wikipedia, the free encyclopedia

Data are of high quality "if they are fit for their intended uses in operations, decision making and planning" (J. M. Juran). Alternatively, data are deemed of high quality if they correctly represent the real-world construct to which they refer. Furthermore, apart from these definitions, as data volume increases, the question of internal consistency within data becomes paramount, regardless of fitness for use for any external purpose; for example, a person's age and birth date may conflict within different parts of a database. These views can often be in disagreement, even about the same set of data used for the same purpose. This article discusses the concept as it relates to business data processing, although of course other data have various quality issues as well.

Contents

1 Definitions

2 History

3 Overview

4 Criticism of existing tools and processes

5 Professional associations

6 See also

7 References

8 Further reading

Definitions

1. The quality exhibited by the data in relation to its portrayal of the actual scenario.

2. The state of completeness, validity, consistency, timeliness and accuracy that makes data appropriate for a specific use. (Government of British Columbia)

3. The totality of features and characteristics of data that bears on their ability to satisfy a given purpose; the sum of the degrees of excellence for factors related to data. (Glossary of Quality Assurance Terms)

4. See also the glossary of data quality terms published by the IAIDQ.

5. The processes and technologies involved in ensuring the conformance of data values to business requirements and acceptance criteria.

6. Complete, standards-based, consistent, accurate and time-stamped. (GS1, http://www.gs1.org/gdsn/dqf)

History

Before the rise of the inexpensive server, massive mainframe computers were used to maintain name and address data so that the
mail could be properly routed to its destination. The mainframes used business rules to correct common misspellings and
typographical errors in name and address data, as well as to track customers who had moved, died, gone to prison, married,
divorced, or experienced other life-changing events. Government agencies began to make postal data available to a few service
companies to cross-reference customer data with the National Change of Address registry (NCOA). This technology saved large
companies millions of dollars compared to manually correcting customer data. Large companies saved on postage, as bills and direct
marketing materials made their way to the intended customer more accurately. Initially sold as a service, data quality moved inside
the walls of corporations, as low-cost and powerful server technology became available.

Companies with an emphasis on marketing often focus their quality efforts on name and address information, but data quality is
recognized as an important property of all types of data. Principles of data quality can be applied to supply chain data, transactional
data, and nearly every other category of data found in the enterprise. For example, making supply chain data conform to a certain
standard has value to an organization by: 1) avoiding overstocking of similar but slightly different stock; 2) improving the
understanding of vendor purchases to negotiate volume discounts; and 3) avoiding logistics costs in stocking and shipping parts
across a large organization.

While name and address data has a clear standard as defined by local postal authorities, other types of data have few recognized
standards. There is a movement in the industry today to standardize certain non-address data. The non-profit group GS1 is among
the groups spearheading this movement.

For companies with significant research efforts, data quality can include developing protocols for research methods,
reducing measurement error, bounds checking of the data, cross tabulation, modeling and outlier detection, verifying data integrity,
etc.

Overview

There are a number of theoretical frameworks for understanding data quality. A systems-theoretical approach influenced by American
pragmatism expands the definition of data quality to include information quality, and emphasizes the inclusiveness of the fundamental
dimensions of accuracy and precision on the basis of the theory of science (Ivanov, 1972). One framework seeks to integrate the
product perspective (conformance to specifications) and the service perspective (meeting consumers' expectations) (Kahn et al.
2002). Another framework is based in semiotics to evaluate the quality of the form, meaning and use of the data (Price and Shanks,
2004). One highly theoretical approach analyzes the ontological nature of information systems to define data quality rigorously (Wand
and Wang, 1996).

A considerable amount of data quality research involves investigating and describing various categories of desirable attributes (or
dimensions) of data. These lists commonly include accuracy, correctness, currency, completeness and relevance. Nearly 200 such
terms have been identified and there is little agreement in their nature (are these concepts, goals or criteria?), their definitions or
measures (Wang et al., 1993). Software engineers may recognise this as a similar problem to "ilities".

MIT has a Total Data Quality Management program, led by Professor Richard Wang, which produces a large number of publications
and hosts a significant international conference in this field (International Conference on Information Quality, ICIQ).

In practice, data quality is a concern for professionals involved with a wide range of information systems, ranging from data
warehousing and business intelligence to customer relationship management and supply chain management. One industry study
estimated the total cost to the US economy of data quality problems at over US$600 billion per annum (Eckerson, 2002). Incorrect
data – which includes invalid and outdated information – can originate from different data sources – through data entry, or data
migration and conversion projects.[1]

In 2002, the USPS and PricewaterhouseCoopers released a report stating that 23.6 percent of all U.S. mail sent is incorrectly
addressed.[2]

One reason contact data becomes stale very quickly in the average database is that more than 45 million Americans change their address every year.[3]

In fact, the problem is such a concern that companies are beginning to set up a data governance team whose sole role in the
corporation is to be responsible for data quality. In some organizations, this data governance function has been established as part of
a larger Regulatory Compliance function - a recognition of the importance of Data/Information Quality to organizations.

Problems with data quality don't only arise from incorrect data. Inconsistent data is a problem as well. Eliminating data shadow
systems and centralizing data in a warehouse is one of the initiatives a company can take to ensure data consistency.

Enterprises, scientists, and researchers are starting to participate within data curation communities to improve the quality of their
common data.[4]

The market is going some way to providing data quality assurance. A number of vendors make tools for analysing and repairing poor
quality data in situ, service providers can clean the data on a contract basis and consultants can advise on fixing processes or
systems to avoid data quality problems in the first place. Most data quality tools offer a series of tools for improving data, which may
include some or all of the following:

1. Data profiling - initially assessing the data to understand its quality challenges

2. Data standardization - a business rules engine that ensures that data conforms to quality rules

3. Geocoding - for name and address data. Corrects data to US and Worldwide postal standards
4. Matching or Linking - a way to compare data so that similar, but slightly different records can be aligned. Matching may use
"fuzzy logic" to find duplicates in the data. It often recognizes that 'Bob' and 'Robert' may be the same individual. It might be
able to manage 'householding', or finding links between husband and wife at the same address, for example. Finally, it often
can build a 'best of breed' record, taking the best components from multiple data sources and building a single super-record.

5. Monitoring - keeping track of data quality over time and reporting variations in the quality of data. Software can also auto-
correct the variations based on pre-defined business rules.

6. Batch and Real time - Once the data is initially cleansed (batch), companies often want to build the processes into enterprise
applications to keep it clean.
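
As a toy sketch of the matching step (the records, the nickname table and the 0.8 threshold are all hypothetical), fuzzy comparison of normalized records can flag likely duplicates:

from difflib import SequenceMatcher

records = [
    ("Robert Smith", "12 Oak St"),
    ("Bob Smith", "12 Oak Street"),
    ("Alice Jones", "98 Elm Ave"),
]

NICKNAMES = {"bob": "robert"}                 # tiny nickname table; real tools use far larger ones

def normalize(name):
    first, *rest = name.lower().split()
    return " ".join([NICKNAMES.get(first, first), *rest])

for i in range(len(records)):
    for j in range(i + 1, len(records)):
        left = normalize(records[i][0]) + " " + records[i][1].lower()
        right = normalize(records[j][0]) + " " + records[j][1].lower()
        score = SequenceMatcher(None, left, right).ratio()
        if score > 0.8:
            print("possible duplicate:", records[i], records[j], round(score, 2))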

There are several well-known authors and self-styled experts, with Larry English perhaps the most popular guru. In addition,
the International Association for Information and Data Quality (IAIDQ) was established in 2004 to provide a focal point for
professionals and researchers in this field.

ISO 8000 is the international standard for data quality.

Criticism of existing tools and processes

The value and current approaches to Data Cleansing have come under criticism[who?] due to some parties claiming large costs and low
return on investment from major data cleansing initiatives.

The main reasons cited[citation needed] are:

 Project costs: costs typically in the hundreds of thousands of dollars

 Time: lack of enough time to deal with large-scale data-cleansing software

 Security: concerns over sharing information, giving an application access across systems, and effects on legacy systems

Data integration
From Wikipedia, the free encyclopedia

Data integration involves combining data residing in different sources and providing users with a unified view of these data.[1] This
process becomes significant in a variety of situations, which include both commercial (when two similar companies need to merge
their databases) and scientific (combining research results from different bioinformatics repositories, for example) domains. Data
integration appears with increasing frequency as the volume and the need to share existing data explodes.[2] It has become the focus
of extensive theoretical work, and numerous open problems remain unsolved. In management circles, people frequently refer to data
integration as "Enterprise Information Integration" (EII).

Contents

1 History

2 Example
3 Theory of data integration

o 3.1 Definitions

o 3.2 Query processing

4 Data Integration in the Life Sciences

5 See also

6 References

7 Further reading

History

Figure 1: Simple schematic for a data warehouse. The ETL process extracts information from the source databases, transforms it and then loads it into the data warehouse.

Figure 2: Simple schematic for a data-integration solution. A system designer constructs a mediated schema against which users can run queries. The virtual database interfaces with the source databases via wrapper code if required.

Issues with combining heterogeneous data sources under a single query interface have existed for some time. The rapid adoption of databases after the 1960s naturally led to the need to share or to merge existing repositories. This merging can take place at several levels in the database architecture. One popular solution is implemented based on data warehousing (see figure 1). The warehouse system extracts, transforms, and loads data from heterogeneous sources into a single common queryable schema so the data become compatible with each other. This approach offers a tightly coupled architecture because the data is already physically reconciled in a single repository at query-time, so it usually takes little time to resolve queries. However, problems arise with the "freshness" of data, meaning the information in the warehouse is not always up-to-date. Therefore, when an original data source gets updated, the warehouse still retains outdated data and the ETL process needs re-execution for synchronization. Difficulties also arise in constructing data warehouses when one has only a query interface to summary data sources and no access to the full data. This problem frequently emerges when integrating several commercial query services like travel or classified advertisement web applications.

As of 2009 the trend in data integration has favored loosening the coupling between data[citation needed] and providing a unified query-interface to access real time data over a mediated schema (see figure 2), which allows information to be retrieved directly from the original databases. This approach may require specifying mappings between the mediated schema and the schema of the original sources, and transforming a query into specialized queries to match the schema of the original databases. Therefore, this middleware architecture is also termed "view-based query answering" because each data source is represented as a view over the (nonexistent) mediated schema. Formally, computer scientists term such an approach "Local As View" (LAV), where "Local" refers to the local sources/databases. An alternate model of integration has the mediated schema functioning as a view over the sources. This approach, called "Global As View" (GAV), where "Global" refers to the global (mediated) schema, has attractions owing to the simplicity of answering queries by means of the mediated schema. However, it is necessary to reconstitute the view for the mediated schema whenever a new source gets integrated and/or an already integrated source modifies its schema.

As of 2010 some of the work in data integration research concerns the semantic integration problem. This problem addresses not the
structuring of the architecture of the integration, but how to resolve semantic conflicts between heterogeneous data sources. For
example if two companies merge their databases, certain concepts and definitions in their respective schemas like "earnings"
inevitably have different meanings. In one database it may mean profits in dollars (a floating-point number), while in the other it might
represent the number of sales (an integer). A common strategy for the resolution of such problems involves the use
of ontologies which explicitly define schema terms and thus help to resolve semantic conflicts. This approach represents ontology-
based data integration. On the other hand, the problem of combining research results from different bioinformatics repositories
requires bench-marking of the similarities, computed from different data sources, on a single criterion such as, positive predictive
value. This enables the data sources to be directly comparable and can be integrated even when the natures of experiments are
distinct.[3]

As of 2011 it was determined that current data modeling methods were imparting data isolation into every data architecture in the
form of islands of disparate data and information silos. This data isolation is an unintended artifact of the data modeling methodology
that results in the development of disparate data models. Disparate data models, when instantiated as databases, form disparate
databases. An enhanced data model methodology was developed to eliminate the data isolation artifact and to promote the
development of integrated data models. This enhanced data modeling method recasts data models by augmenting them with
structural metadata in the form of standardized data entities. As a result of recasting multiple data models, the set of recast data
models will now share one or more commonality relationships that relate the structural metadata now common to these data models.
Commonality relationships are a peer-to-peer type of entity relationships that relate the standardized data entities of multiple data
models. Multiple data models that contain the same standard data entity may participate in the same commonality relationship. When
integrated data models are instantiated as databases and are properly populated from a common set of master data, then these
databases are integrated.

In 2012, high-throughput biological data, such as phenotypic profiles, gene expression microarrays, protein sequences, Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways, and protein–protein interaction data, were integrated through a weighted power scoring framework, called weighted power biological score (WPBS), for predicting the function of some of the unclassified yeast Saccharomyces cerevisiae genes.[4] The relative power and weight coefficients of different data sources are estimated systematically by utilizing functional annotations [yeast Gene Ontology (GO)-Slim: Process] of classified genes, available from the Saccharomyces Genome Database. Genes are then clustered by applying the k-medoids algorithm on WPBS, and functional categories of 334 unclassified genes are predicted using a P-value cutoff of 1 × 10^{−5}.

Example

Consider a web application where a user can query a variety of information about cities (such as crime statistics, weather, hotels,
demographics, etc.). Traditionally, the information must be stored in a single database with a single schema. But any single enterprise
would find information of this breadth somewhat difficult and expensive to collect. Even if the resources exist to gather the data, it
would likely duplicate data in existing crime databases, weather websites, and census data.

A data-integration solution may address this problem by considering these external resources as materialized views over a virtual
mediated schema, resulting in "virtual data integration". This means application-developers construct a virtual schema —
the mediated schema — to best model the kinds of answers their users want. Next, they design "wrappers" or adapters for each data
source, such as the crime database and weather website. These adapters simply transform the local query results (those returned by
the respective websites or databases) into an easily processed form for the data integration solution (see figure 2). When an
application-user queries the mediated schema, the data-integration solution transforms this query into appropriate queries over the
respective data sources. Finally, the virtual database combines the results of these queries into the answer to the user's query.

This solution offers the convenience of adding new sources by simply constructing an adapter or an application software blade for
them. It contrasts with ETL systems or with a single database solution, which require manual integration of an entire new dataset into the
system. The virtual ETL solutions leverage virtual mediated schema to implement data harmonization; whereby the data is copied
from the designated "master" source to the defined targets, field by field. Advanced Data virtualization is also built on the concept of
object-oriented modeling in order to construct virtual mediated schema or virtual metadata repository, using hub and
spoke architecture.
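
A toy sketch of this wrapper-and-mediated-schema idea (the sources, fields and values are entirely made up):

def crime_wrapper(city):
    # Stand-in for a query against a crime database
    return {"city": city, "crime_rate": 3.2}

def weather_wrapper(city):
    # Stand-in for a call to a weather website
    return {"city": city, "temperature_c": 18}

def query_mediated_schema(city):
    """Combine the per-source results into one answer over the virtual mediated schema."""
    answer = {}
    for wrapper in (crime_wrapper, weather_wrapper):
        answer.update(wrapper(city))
    return answer

print(query_mediated_schema("Springfield"))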

Theory of data integration

The theory of data integration[1] forms a subset of database theory and formalizes the underlying concepts of the problem in first-order
logic. Applying the theories gives indications as to the feasibility and difficulty of data integration. While its definitions may appear
abstract, they have sufficient generality to accommodate all manner of integration systems. [citation needed]
Definitions

Data integration systems are formally defined as a triple ⟨G, S, M⟩ where G is the global (or mediated) schema, S is the heterogeneous set of source schemas, and M is the mapping that maps queries between the source and the global schemas. Both G and S are expressed in languages over alphabets composed of symbols for each of their respective relations. The mapping M consists of assertions between queries over G and queries over S. When users pose queries over the data integration system, they pose queries over G and the mapping then asserts connections between the elements in the global schema and the source schemas.

A database over a schema is defined as a set of sets, one for each relation (in a relational database). The database corresponding to the source schema S would comprise the set of sets of tuples for each of the heterogeneous data sources and is called the source database. Note that this single source database may actually represent a collection of disconnected databases. The database corresponding to the virtual mediated schema G is called the global database. The global database must satisfy the mapping M with respect to the source database. The legality of this mapping depends on the nature of the correspondence between G and S. Two popular ways to model this correspondence exist: Global as View or GAV and Local as View or LAV.

Figure 3: Illustration of tuple space of the GAV and LAV mappings.[5] In GAV, the system is constrained to the set of tuples mapped by the mediators
while the set of tuples expressible over the sources may be much larger and richer. In LAV, the system is constrained to the set of tuples in the
sources while the set of tuples expressible over the global schema can be much larger. Therefore LAV systems must often deal with incomplete
answers.

GAV systems model the global database as a set of views over S. In this case M associates to each element of G a query over S. Query processing becomes a straightforward operation due to the well-defined associations between G and S. The burden of complexity falls on implementing mediator code instructing the data integration system exactly how to retrieve elements from the source databases. If any new sources join the system, considerable effort may be necessary to update the mediator, thus the GAV approach appears preferable when the sources seem unlikely to change.

In a GAV approach to the example data integration system above, the system designer would first develop mediators for each of the
city information sources and then design the global schema around these mediators. For example, consider if one of the sources
served a weather website. The designer would likely then add a corresponding element for weather to the global schema. Then the
bulk of effort concentrates on writing the proper mediator code that will transform predicates on weather into a query over the weather
website. This effort can become complex if some other source also relates to weather, because the designer may need to write code
to properly combine the results from the two sources.

On the other hand, in LAV, the source database is modeled as a set of views over G. In this case M associates to each element of S a query over G. Here the exact associations between G and S are no longer well-defined. As is illustrated in the next section, the burden of determining how to retrieve elements from the sources is placed on the query processor. The benefit of an LAV modeling is that new sources can be added with far less work than in a GAV system, thus the LAV approach should be favored in cases where the mediated schema is more stable and unlikely to change.[1]

In an LAV approach to the example data integration system above, the system designer designs the global schema first and then
simply inputs the schemas of the respective city information sources. Consider again if one of the sources serves a weather website.
The designer would add corresponding elements for weather to the global schema only if none existed already. Then programmers
write an adapter or wrapper for the website and add a schema description of the website's results to the source schemas. The
complexity of adding the new source moves from the designer to the query processor.

Query processing
The theory of query processing in data integration systems is commonly expressed using conjunctive queries.[6] One can loosely think of a conjunctive query as a logical function applied to the relations of a database, such as "f(A, B) where A < B". If a tuple or set of tuples is substituted into the rule and satisfies it (makes it true), then we consider that tuple as part of the set of answers in the query. While formal languages like Datalog express these queries concisely and without ambiguity, common SQL queries count as conjunctive queries as well.

In terms of data integration, "query containment" represents an important property of conjunctive queries. A query Q1 contains another query Q2 (denoted Q1 ⊇ Q2) if the results of applying Q2 are a subset of the results of applying Q1 for any database. The two queries are said to be equivalent if the resulting sets are equal for any database. This is important because in both GAV and LAV systems, a user poses conjunctive queries over a virtual schema represented by a set of views, or "materialized" conjunctive queries. Integration seeks to rewrite the queries represented by the views to make their results equivalent or maximally contained by our user's query. This corresponds to the problem of answering queries using views (AQUV).[7]

In GAV systems, a system designer writes mediator code to define the query-rewriting. Each element in the user's query corresponds
to a substitution rule just as each element in the global schema corresponds to a query over the source. Query processing simply
expands the subgoals of the user's query according to the rule specified in the mediator and thus the resulting query is likely to be
equivalent. While the designer does the majority of the work beforehand, some GAV systems such as Tsimmis involve simplifying the
mediator description process.

In LAV systems, queries undergo a more radical process of rewriting because no mediator exists to align the user's query with a
simple expansion strategy. The integration system must execute a search over the space of possible queries in order to find the best
rewrite. The resulting rewrite may not be an equivalent query but maximally contained, and the resulting tuples may be incomplete. As
of 2009 the MiniCon algorithm[7] is the leading query rewriting algorithm for LAV data integration systems.
In general, the complexity of query rewriting is NP-complete.[7] If the space of rewrites is relatively small this does not pose a problem
— even for integration systems with hundreds of sources.

Data Integration in the Life Sciences

Large-scale questions in science, such as global warming, invasive species spread, and resource depletion, are increasingly requiring
the collection of disparate data sets for meta-analysis. This type of data integration is especially challenging for ecological and
environmental data because metadata standards are not agreed upon and there are many different data types produced in these
fields. National Science Foundation initiatives such as Datanet are intended to make data integration easier for scientists by
providing cyberinfrastructure and setting standards. The two funded Datanet initiatives are DataONE and the Data Conservancy.



Data quality is a perception or an assessment of data's fitness to serve its purpose in a given context.

Aspects of data quality include:

 Accuracy
 Completeness
 Update status
 Relevance
 Consistency across data sources
 Reliability
 Appropriate presentation
 Accessibility

Within an organization, acceptable data quality is crucial to operational and transactional processes and
to the reliability of business analytics (BA) / business intelligence (BI) reporting. Data quality is
affected by the way data is entered, stored and managed. Data quality assurance (DQA) is the process
of verifying the reliability and effectiveness of data.

Maintaining data quality requires going through the data periodically and scrubbing it. Typically this
involves updating it, standardizing it, and de-duplicating records to create a single view of the data,
even if it is stored in multiple disparate systems. There are many vendor applications on the
market to make this job easier.
