Integrating Data from Different Sources
Explain Integrating Disparate Data Stores in simple language
Integrating disparate data stores means combining data from different systems, formats, or
databases so that it can be used together, even if it originally comes from separate places.
Imagine you have several boxes of information: one with paper notes, one with digital files, and one
with photos. Each box has useful data, but because it's in different forms, it's hard to see the full
picture. Integration is like organizing all this data into one big, easy-to-read document where
everything fits together, regardless of its original format.
In the business world, companies might store their customer information in one system, financial data
in another, and product details in a third system. Integrating these systems allows the business to
access all this information together in a smooth, consistent way, making it easier to make decisions
and find insights.
To do this, companies use data integration tools that handle different file types, formats, or
databases, and they combine the data so that it can be accessed and analyzed together.
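As a rough illustration, here is a minimal sketch of what that combination can look like in code, assuming one source exports customers as CSV and another exports orders as JSON (the file layouts, column names such as customer_id, and values are all invented for this example):

```python
import io
import json
import pandas as pd

# Two hypothetical sources: a CSV export of customers and a JSON export of orders.
# In practice these would come from separate systems; they are inlined here so the
# sketch runs on its own.
customers_csv = io.StringIO(
    "customer_id,name,email\n"
    "1,John Smith,john@example.com\n"
    "2,Ana Lopez,ana@example.com\n"
)
orders_json = json.loads(
    '[{"customer_id": 1, "total": 120.50}, {"customer_id": 2, "total": 89.99}]'
)

customers = pd.read_csv(customers_csv)   # structured, table-like source
orders = pd.DataFrame(orders_json)       # JSON records turned into a table

# Integrate the two sources on the shared customer_id key.
combined = customers.merge(orders, on="customer_id", how="left")
print(combined)
```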
Explain Data quality issues such as missing values, inconsistencies, and redundancy in Disparate
Data. Tell in simple language
When you are dealing with disparate data—which means data coming from different systems or
sources—you often run into some problems that can affect how useful the data is. Here are three
common data quality issues:
1. Missing Values
This happens when some pieces of data are simply missing. For example, imagine you have a
customer database, but some entries don’t have email addresses or phone numbers. This is a problem
because missing information makes it harder to understand the whole picture or make decisions.
Example:
A table with customer names but no birthdates for some people, leaving blanks in the data.
2. Inconsistencies
Inconsistent data means that the same type of information is recorded in different ways across
systems. This can lead to confusion or errors when trying to combine or compare data. For instance,
one system might record a customer’s name as "John Smith," while another system uses "J. Smith" or
"Smith, John." If the data isn't consistent, it’s hard to know if you're talking about the same person.
Example:
Date formats like "01/10/2024" and "10/01/2024" can mean different things in different systems
(January 10 or October 1).
3. Redundancy
Redundancy happens when the same data is stored in multiple places unnecessarily. This can cause
problems because if one version of the data is updated and the others are not, you end up with
outdated or conflicting information.
Example:
If a customer’s address is stored in two systems, and only one gets updated when they move,
you have conflicting addresses.
In summary, these issues—missing values, inconsistencies, and redundancy—can make it harder
to trust and use data effectively. Fixing them is important to ensure the data is clean, accurate, and
useful when it’s integrated from different sources.
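As a quick illustration, the sketch below shows how these three issues might be spotted programmatically; pandas is used here as one common choice, and the column names and sample records are invented for the example:

```python
import pandas as pd

# Hypothetical customer data illustrating the three issues described above.
df = pd.DataFrame({
    "name":  ["John Smith", "J. Smith", "Ana Lopez", "Ana Lopez"],
    "email": ["john@example.com", None, "ana@example.com", "ana@example.com"],
    "state": ["NY", "New York", "CA", "CA"],
})

# 1. Missing values: count how many entries are blank in each column.
print(df.isna().sum())

# 2. Inconsistencies: the same state written two different ways.
print(df["state"].unique())

# 3. Redundancy / duplicates: the same customer recorded twice.
print(df.duplicated().sum())
```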
Explain this in simple language:
Data Cleaning Process
Handling typos, missing fields, and inconsistencies.
Procedures like error correction, documenting error types, and modifying entry procedures
The Data Cleaning Process is all about fixing messy or incorrect data so that it becomes clean,
accurate, and ready to use. When data comes from different sources or has been entered by different
people, it can contain mistakes or missing information. Here’s how the cleaning process works, broken
down into simple steps:
1. Handling Typos
Typos are small mistakes in the data, like spelling errors or wrong numbers. For example, someone
might type "Jhon" instead of "John." Data cleaning involves finding and correcting these errors so the
data is correct.
What you do:
Use software or manually scan for obvious mistakes and fix them.
2. Fixing Missing Fields
Sometimes, data is incomplete. For example, a customer’s address might be missing, or their phone
number is left blank. The goal is to handle this by either filling in the missing data (if you can find it) or
deciding how to deal with it if the information isn’t available.
What you do:
Find the missing information from other sources or use methods to fill gaps, like estimating
values.
If you can't fill it, you might leave it blank or mark it clearly as missing.
3. Resolving Inconsistencies
Inconsistent data means that the same type of information is written differently in various places. For
instance, one record might show “NY” for New York, while another says “New York.” Cleaning this data
means making sure everything is written in a consistent way.
What you do:
Standardize the data so it’s the same across all records (e.g., converting all "NY" to "New York").
Procedures in the Data Cleaning Process
1. Error Correction: Once you find mistakes, you correct them. For example, if someone typed a
wrong value, you fix it to the correct one.
Example: Changing "Jhon" to "John" in names.
2. Documenting Error Types: This means keeping track of what kinds of errors you are finding. Do
you see a lot of spelling mistakes? Are dates often wrong? Documenting this helps you
understand patterns in the errors so you can fix them more easily in the future.
Example: Writing down every time there’s a typo in names or addresses so you know it’s a
common problem.
3. Modifying Entry Procedures: After cleaning the data, you may want to improve how the data is
entered in the first place to avoid future mistakes. This could involve training staff to enter data
more carefully or changing the system to prevent certain errors.
Example: Setting up a form where people can only choose a state from a dropdown list, so
there’s no chance of typing “NY” or “New York” differently.
In short, data cleaning involves fixing mistakes (typos, missing info, and inconsistencies), and taking
steps to prevent future errors, like documenting problems and improving how data is entered. This
ensures the data is accurate, complete, and ready for analysis or use.
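A minimal cleaning sketch along these lines might look as follows, assuming pandas and a small set of made-up corrections (the mapping from "Jhon" to "John", the "UNKNOWN" marker, and the state values are illustrative choices, not fixed rules):

```python
import pandas as pd

# Hypothetical messy records; the column names and corrections are assumptions
# made only for this sketch.
df = pd.DataFrame({
    "name":  ["Jhon", "Ana", "Maria"],
    "state": ["NY", "New York", "ny"],
    "phone": ["555-0100", None, "555-0199"],
})

# 1. Handle typos: correct known misspellings with an explicit mapping.
df["name"] = df["name"].replace({"Jhon": "John"})

# 2. Handle missing fields: fill what you can, and flag the rest clearly.
df["phone"] = df["phone"].fillna("UNKNOWN")

# 3. Resolve inconsistencies: standardize every state value to one form.
df["state"] = df["state"].str.strip().str.upper().replace({"NEW YORK": "NY"})

# Document the error types you found, so entry procedures can be improved later.
error_log = {"typos_fixed": 1, "phones_missing": 1, "state_variants": 2}
print(df, error_log, sep="\n")
```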
Explain this in simple language:
Handling Missing Data
Categories: Missing Completely at Random (MCAR), Missing at Random (MAR), Not Missing at
Random (NMAR).
Methods: Discarding, using frequent/average values, regression
Handling Missing Data means figuring out what to do when some pieces of information are missing
from your data set. Depending on why the data is missing, there are different categories and ways to
deal with it.
Categories of Missing Data
1. Missing Completely at Random (MCAR): This happens when data is missing for no specific
reason. It’s just random, and the missing values don’t depend on any other data. For example, if
a survey respondent skipped a question just by chance.
Example: A person accidentally forgets to answer one question in a survey, but the rest of
their answers are complete.
2. Missing at Random (MAR): Data is missing, but it’s related to something else in the dataset. For
instance, if older people are more likely to skip a question about technology, their missing
answers aren't random but related to age.
Example: In a health survey, younger people are less likely to answer questions about
retirement savings, so their answers are missing for a specific reason.
3. Not Missing at Random (NMAR): The data is missing for a specific reason that is related to the
value itself. This is the most challenging case. For example, people with low incomes might
choose not to report their income, so the missing data is tied to their income level.
Example: Someone who feels uncomfortable about their low salary might leave that
question blank, so the missing data is related to the fact that they earn less.
Methods to Handle Missing Data
1. Discarding (Dropping the Data): You can simply remove rows or columns that have missing
values. This works if only a small portion of the data is missing, but you lose information by
doing this.
Example: If only a few people didn’t answer a question in a survey, you might remove their
entire response to avoid dealing with the missing answers.
2. Using Frequent or Average Values (Imputation): In this method, you fill in the missing data
with a common value, such as the most frequent value (for categorical data like "yes" or "no") or
the average value (for numerical data like age or income). This helps you keep the data
complete, but it can introduce bias.
Example: If some people didn’t report their age in a survey, you might fill in their missing
ages with the average age of all the other respondents.
3. Regression: Regression is a more advanced method where you predict the missing values based
on other data in the dataset. It’s like using the known information to guess what the missing
data should be.
Example: If someone's income is missing, you could predict it based on their education
level, job title, and other known factors.
Summary
Categories of Missing Data:
MCAR: Missing by chance, no pattern.
MAR: Missing because of another factor in the data.
NMAR: Missing for a reason related to the data itself.
Methods to Handle Missing Data:
Discarding: Remove missing data.
Using Frequent/Average Values: Fill in with common or average numbers.
Regression: Use other data to predict and fill in missing values.
This process helps ensure your data is complete and accurate for analysis or decision-making.
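To make the three methods concrete, here is a small sketch using pandas and scikit-learn on invented survey data; the column names and the choice of age as the predictor are assumptions made only for this example:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical survey data with some missing income values.
df = pd.DataFrame({
    "age":    [25, 32, 47, 51, 38],
    "income": [30000, 42000, None, 68000, None],
})

# Method 1: discarding -- drop any row with a missing value.
dropped = df.dropna()

# Method 2: imputation -- fill missing incomes with the average income.
mean_filled = df.copy()
mean_filled["income"] = mean_filled["income"].fillna(df["income"].mean())

# Method 3: regression -- predict missing incomes from age using the known rows.
known = df.dropna()
model = LinearRegression().fit(known[["age"]], known["income"])
reg_filled = df.copy()
missing = reg_filled["income"].isna()
reg_filled.loc[missing, "income"] = model.predict(reg_filled.loc[missing, ["age"]])

print(dropped, mean_filled, reg_filled, sep="\n\n")
```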
Explain this in simple language:
Duplicates and Redundancy
Issues with duplicate observations.
Use of correlation analysis to detect redundancies
Duplicates and Redundancy refer to problems that occur when the same or similar data appears
multiple times in a dataset, which can cause confusion or errors in analysis.
1. Duplicate Observations
Duplicate observations happen when the same data is entered more than once in a dataset. This is
like having two or more copies of the same file in a folder. In data, duplicates can skew results, make
the dataset look bigger than it really is, and lead to false conclusions.
Example: Imagine you have a customer database, and the same customer appears twice because
their name was accidentally entered two times. If you were analyzing how many unique customers
you have, duplicates would make it seem like you have more customers than you really do.
Problems with duplicates:
Wrong analysis: Counting the same data multiple times can lead to incorrect statistics or
decisions.
Wasted storage: Storing the same data repeatedly uses up unnecessary space.
Confusion: It can be hard to know which version of the duplicate is the correct one, especially if
the duplicates aren’t exactly the same.
2. Redundancy
Redundancy is when similar or unnecessary data is repeated across different parts of a dataset.
Unlike duplicates, which are exact copies, redundancy refers to storing information that could be
simplified or combined. It’s like having the same information stored in multiple places when you only
need it once.
Example: If a company has a customer’s address stored in three different systems, and the address is
updated in only one system, the other systems still have old information. This causes inconsistencies
and confusion.
Detecting Redundancy with Correlation Analysis
Correlation analysis is a technique used to find out if two or more data fields are strongly related,
meaning they contain similar or repetitive information. If the correlation between two fields is very
high, it suggests that one may be redundant and not necessary.
Example: Let’s say you have two columns in your dataset: “Total Purchase” and “Purchase Price x
Quantity.” If these two columns are almost always equal, one of them may be redundant because you
don’t need both.
By detecting redundant data with correlation, you can:
Remove unnecessary data to simplify the dataset.
Avoid storing or processing duplicate information, which improves efficiency.
Summary:
Duplicates: Exact copies of the same data repeated in a dataset, which can lead to incorrect
analysis and wasted space.
Redundancy: Similar or unnecessary data repeated in different places, causing confusion and
inefficiency.
Correlation analysis helps detect redundant data by identifying strong relationships between
fields, allowing you to remove unnecessary information and keep your data clean.
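The sketch below shows both ideas on a tiny invented table: exact duplicates are dropped first, then a correlation matrix is used to spot two columns (total_purchase and amount_charged, both hypothetical names) that carry the same information:

```python
import pandas as pd

# Hypothetical sales records with a duplicate row and two redundant columns
# that store the same value under different names.
df = pd.DataFrame({
    "order_id":       [101, 102, 102, 103],
    "total_purchase": [20.0, 35.0, 35.0, 60.0],
    "amount_charged": [20.0, 35.0, 35.0, 60.0],
    "quantity":       [2, 1, 1, 4],
})

# 1. Duplicates: remove exact repeated rows (order 102 appears twice).
df = df.drop_duplicates()

# 2. Redundancy: a correlation close to 1.0 between two columns suggests
#    they carry the same information and one of them may be unnecessary.
corr = df[["total_purchase", "amount_charged", "quantity"]].corr()
print(corr)

# total_purchase and amount_charged correlate perfectly here, so keep just one.
df = df.drop(columns=["amount_charged"])
```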
Explain this in simple language:
Variety in Disparate Data
Differences between structured, semi-structured, and unstructured data.
Challenges of data integration from multiple sources
When we talk about variety in disparate data, we’re referring to the fact that data comes in many
different forms, and these forms can be quite different from each other. Here’s a simple breakdown:
1. Types of Data
1. Structured Data: This is the most organized type of data. It’s like data in a spreadsheet, where
everything is neatly arranged into rows and columns, making it easy to search, filter, and
analyze. Structured data fits nicely into databases.
Example: A table with customer names, email addresses, and phone numbers.
2. Semi-Structured Data: Semi-structured data has some organization but not as neatly as
structured data. It doesn’t fit perfectly into a table but still has labels to make sense of the
information. It’s like a form where some fields are filled in, but others are free-form.
Example: An email with subject, sender, and date fields (which are structured), but the email
body is unstructured (free text).
3. Unstructured Data: Unstructured data is the messiest and most free-form type. There’s no specific
format, so it can be harder to organize or analyze. This includes things like text, images, or
videos.
Example: A customer review in paragraph form, photos, or social media posts.
2. Challenges of Integrating Data from Multiple Sources
When you try to combine or integrate data from different sources that come in these different forms
(structured, semi-structured, unstructured), you run into several challenges:
1. Different Formats: Combining a neat table with rows and columns (structured data) and a
messy collection of text or images (unstructured data) is difficult. You need tools to bring these
different types together in a way that makes sense.
Example: Trying to merge sales data from a structured database with customer feedback in
unstructured text from social media.
2. Inconsistent Standards: Different systems might store the same type of data in different ways.
For instance, one system might write dates as "2024-10-24" while another writes "10/24/2024."
Integrating this data means converting it into a consistent format.
Example: Combining data from two different companies that store customer names differently
(e.g., "John Smith" vs. "Smith, John").
3. Data Quality Issues: Each source might have its own problems with missing, duplicate, or
inconsistent data. When you integrate these, you might combine those problems too, making
the data even messier if not cleaned properly.
Example: One system might have updated addresses, while another system has old addresses,
leading to confusion about which is correct.
4. Complexity in Analysis: Structured data is easy to analyze with traditional methods (like using
spreadsheets or databases), but when you bring in semi-structured or unstructured data, it
requires more advanced tools, like AI or text analysis, to make sense of it.
Example: You can quickly analyze sales numbers from a structured table, but analyzing
customer emotions from text reviews requires advanced algorithms.
Summary:
Structured Data: Neatly organized in rows and columns (e.g., databases).
Semi-Structured Data: Partially organized with some structure, like emails.
Unstructured Data: No clear organization (e.g., text, photos).
Challenges in integrating disparate data include dealing with different formats, inconsistent
standards, and data quality issues, all of which make it harder to combine and analyze data from
multiple sources.
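As a rough illustration of the three types side by side, the sketch below loads a small structured table, a semi-structured JSON record, and a free-text review, then pulls each into a tabular form so they could be combined; all names and values are invented:

```python
import json
import pandas as pd

# Structured: rows and columns with a fixed layout, easy to query directly.
structured = pd.DataFrame({
    "customer_id": [1, 2],
    "email": ["john@example.com", "ana@example.com"],
})

# Semi-structured: labeled fields, but the shape can vary from record to record.
semi_structured = json.loads(
    '{"subject": "Order question", "from": "ana@example.com",'
    ' "body": "Hi, where is my order?"}'
)

# Unstructured: free text with no fields at all; needs extra processing to analyze.
unstructured = "Loved the product, but delivery took two weeks."

# Integrating them means pulling each into a common tabular form first.
emails = pd.DataFrame([semi_structured])
reviews = pd.DataFrame({"review_text": [unstructured]})
print(structured, emails, reviews, sep="\n\n")
```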
Explain this in simple language:
Data Migration and Integration
Processes of data retrieval, format conversion, and duplicate elimination
Data Migration and Integration refer to the processes of moving data from one place to another
and making sure that data from different systems can work together smoothly.
1. Data Migration
Data migration is like moving your data from one house to another. It involves transferring data from
one system to another, for example, when a company upgrades to a new software or database. The
goal is to move all the important information without losing or messing anything up.
2. Data Integration
Data integration is about combining data from different sources into one system, so you can use it
together. For example, if a company has customer information in one system and sales data in
another, integration pulls it all together so everything is in one place and works seamlessly.
Key Steps in Data Migration and Integration
1. Data Retrieval: Data retrieval means pulling or extracting data from its original source, like
downloading files from an old system. This is the first step when you're migrating or integrating
data, and it ensures you have all the necessary information.
Example: Getting customer data from an old software system to use it in a new one.
2. Format Conversion: Different systems often store data in different formats. Format conversion
means changing the data into a format that the new system can understand. It’s like converting
a Word document into a PDF so another program can open it.
Example: If one system stores dates as "2024-10-24" and another stores them as "24/10/2024,"
format conversion makes them consistent.
3. Duplicate Elimination: During migration or integration, you might find that the same data is
stored in multiple places. Duplicate elimination is the process of removing these extra copies so
that you only keep one version of the data. This prevents confusion and ensures accuracy.
Example: If two different systems have the same customer listed twice, you delete the duplicate
so you only have one record for that customer.
Summary:
Data Migration: Moving data from one system to another.
Data Integration: Combining data from different sources into one system.
The steps involved include:
Data Retrieval: Extracting data from the old system.
Format Conversion: Changing the data into a format that works for the new system.
Duplicate Elimination: Removing extra copies of the same data to avoid confusion.
Together, these steps ensure that data moves smoothly and can be used effectively in its new system.
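Here is one possible sketch of those three steps in code, assuming a small legacy CSV export with day/month/year dates (the file contents, column names, and date formats are assumptions for illustration):

```python
import io
import pandas as pd

# Hypothetical export from the old system, inlined so the sketch runs on its own.
legacy_export = io.StringIO(
    "customer_id,name,signup_date\n"
    "1,John Smith,24/10/2024\n"
    "2,Ana Lopez,03/09/2024\n"
    "2,Ana Lopez,03/09/2024\n"
)

# 1. Data retrieval: pull the records out of the old source.
df = pd.read_csv(legacy_export)

# 2. Format conversion: the old system used day/month/year strings; the new
#    system expects ISO dates, so convert them.
df["signup_date"] = pd.to_datetime(df["signup_date"], format="%d/%m/%Y").dt.date

# 3. Duplicate elimination: keep one record per customer before loading.
df = df.drop_duplicates()

# df is now ready to load into the new system (e.g., via to_sql or to_csv).
print(df)
```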
Explain this in simple language:
ETL Pipeline (Extract-Transform-Load)
Traditional vs. new Big Data approaches.
Schema-on-read vs. schema-on-write methods
An ETL pipeline (Extract-Transform-Load) is a process used to move data from different sources to a
single place (like a database or data warehouse) where it can be analyzed. The pipeline has three main
steps:
1. Extract: Pull the data from various sources (e.g., databases, files, or websites).
2. Transform: Clean, format, or modify the data so it’s consistent and ready to use.
3. Load: Put the transformed data into its final destination, usually a database or data warehouse.
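A minimal, traditional-style ETL sketch might look like this, using pandas and an in-memory SQLite database as the destination; the source data and table name are invented for the example:

```python
import io
import sqlite3
import pandas as pd

# Extract: read raw data from a source (an inlined CSV standing in for a real export).
raw = io.StringIO("name,amount\nJohn Smith,100\njohn smith,100\nAna Lopez,250\n")
df = pd.read_csv(raw)

# Transform: clean and standardize the data before loading it.
df["name"] = df["name"].str.title()   # make name casing consistent
df = df.drop_duplicates()             # remove the repeated John Smith row

# Load: write the cleaned data into its destination (an in-memory database here).
conn = sqlite3.connect(":memory:")
df.to_sql("sales", conn, index=False, if_exists="replace")
print(pd.read_sql("SELECT * FROM sales", conn))
```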
Traditional ETL vs. Big Data Approaches
Traditional ETL: In the past, ETL pipelines worked with smaller, more structured data. The data
was usually extracted from systems like databases, transformed (cleaned and organized) into a
specific format, and then loaded into a data warehouse where it could be analyzed. This method
works well for structured data that fits neatly into rows and columns.
Example: A company extracts data from its sales database, cleans it (removes duplicates or
errors), and loads it into a central database for reporting.
New Big Data Approaches: With Big Data, things are different because the data comes in
massive amounts and from many different sources, often including unstructured or semi-
structured data (like social media posts, images, or videos). Instead of cleaning and transforming
everything upfront, some pipelines load the raw data directly into storage and transform it later
when it’s actually needed.
Example: A company collects raw data from social media, customer service chats, and website
logs, and instead of cleaning it all first, they store it and analyze it later when needed.
Schema-on-Write vs. Schema-on-Read
1. Schema-on-Write (Traditional ETL)
Schema-on-write means that when data is being loaded into the database, it must be
organized and structured according to a predefined format (schema). This means all the
data is cleaned, organized, and stored in a specific way upfront, before any analysis can
happen.
Example: Before loading data into a database, you define how the data should look—like
specifying that a column should only hold dates—and the data must follow that structure.
Key point: The data has to fit the format (schema) before it can be written into the database.
This method is good for structured, well-organized data.
2. Schema-on-Read (Big Data Approach)
Schema-on-read is more flexible. In this approach, you store the raw data without worrying
about its structure, and you define the structure (schema) only when you are ready to read
or analyze the data. This is especially useful for big, unstructured data like social media
posts or sensor data, where you may not know how you’ll use the data at first.
Example: You store raw customer feedback data, and only when you need to analyze it do
you decide how to structure it (e.g., by sentiment, rating, or keywords).
Key point: The data is stored as-is, and the structure is applied later when needed for analysis.
This method works well for unstructured or semi-structured data.
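The contrast can be sketched in a few lines: with schema-on-write the table structure exists before any data goes in, while with schema-on-read the raw records are stored untouched and only given a structure when they are loaded for analysis. The table layout and JSON records below are invented for illustration:

```python
import json
import sqlite3
import pandas as pd

# Schema-on-write: the table structure is fixed before any data is stored,
# and every record must fit that structure at write time.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, signup_date TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'John Smith', '2024-10-24')")

# Schema-on-read: raw records are stored as-is (here, JSON strings), and a
# structure is imposed only at the moment they are read and analyzed.
raw_records = [
    '{"id": 2, "name": "Ana Lopez", "feedback": "Great service"}',
    '{"id": 3, "name": "Wei Chen"}',   # fields can differ from record to record
]
parsed = pd.DataFrame([json.loads(r) for r in raw_records])  # schema decided now
print(parsed)
```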
Summary:
ETL Pipeline:
Extract: Get data from different sources.
Transform: Clean and organize it.
Load: Put it into a database for use.
Traditional ETL: Works best with structured data, transforming it before storing it (schema-on-
write).
Big Data Approaches: Store large amounts of raw data and transform it when needed (schema-
on-read).
This shift allows companies to handle diverse, massive datasets more flexibly and efficiently.