Introduction To Data Analytics

Introduction

What is Data?
• Data is a collection of raw, unorganized facts, figures, or symbols that
can be processed and analyzed to extract useful information.
• It can be numerical or non-numerical and can exist in various forms,
including numbers, text, images, and sounds.
• Examples of data include sales figures, customer feedback, website visitor statistics, and the results of scientific experiments.
Data Vs Information Vs Knowledge
What is Data?

• Data refers to raw, unprocessed facts and figures that lack context or interpretation on their own. In its raw form, data can be difficult to understand or apply without extra processing.
• Text, multimedia (pictures, videos, audio), and numerical values are all acceptable formats for data collection. These types of data are critical in many disciplines, including science, business, and technology, where they serve as the foundation for analysis and decision-making.
What is Information?

• Information is data that has been processed, organized, or structured to convey meaning and significance.
• Unlike raw data, information is more comprehensible and provides
context that aids in understanding the data.
• The transformation from data to information generally involves
several key steps:
1. Data Collection: Gathering raw data from various sources, such as
weather stations, satellites, or sensors.
2. Data Cleaning: Ensuring the data is accurate, consistent, and free from
errors or outliers.
3. Data Analysis: Applying statistical methods and computational algorithms
to identify patterns, correlations, and trends within the data.
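The three steps above can be sketched in a few lines of Python. This is a toy example with invented sensor readings: the collection step is simulated by a hard-coded list, cleaning drops implausible values, and analysis summarizes the result into a human-readable statement.

```python
# Data collection (simulated): raw daily temperature readings, including
# two obvious sensor errors. All values here are invented for illustration.
raw_readings = [21.5, 22.0, -999.0, 23.1, 22.8, 150.0, 21.9]

# Data cleaning: drop readings outside a plausible physical range.
cleaned = [r for r in raw_readings if -50.0 <= r <= 60.0]

# Data analysis: summarize the cleaned data into meaningful information.
average = sum(cleaned) / len(cleaned)
information = f"Average temperature over {len(cleaned)} valid readings: {average:.1f}"
print(information)
```

The output sentence, unlike the raw list, carries context and meaning on its own, which is exactly the data-to-information transformation described above.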
What is Knowledge?

• Knowledge is information that has undergone further analysis, synthesis, and refinement, resulting in a deeper understanding and more profound insights.
• Knowledge builds on information by adding experience, context,
interpretation, and judgment, allowing it to be applied to solve
problems, develop new products, or create innovative solutions.
• It is the culmination of a continuous learning process, where raw data
is transformed into information and subsequently into knowledge,
empowering you to make informed decisions and take effective
actions.
• The process of transforming information into knowledge involves
several key steps:
1. Critical Analysis: Evaluating and interpreting information to
understand its implications and relevance.
2. Synthesis: Combining different pieces of information to form a
comprehensive understanding or new concepts.
3. Refinement: Continuously updating and improving knowledge
based on new data, insights, and experiences.
4. Application: Using knowledge to address real-world problems, innovate, and create value.
What is Data Analytics?
• Data analytics is the process of examining raw data to draw
conclusions, make predictions, and drive informed decision-making.
• It involves collecting, transforming, and organizing data to identify
trends, patterns, and correlations that can be used to solve problems,
improve efficiency, and discover new opportunities.
• Essentially, it's about turning raw data into actionable insights.
• As a data analyst, your role involves analyzing large datasets,
identifying hidden patterns, and transforming raw data into
actionable insights that drive informed decision-making.
• Organizations rely on data analysis to make informed decisions,
enhance efficiency, and predict future outcomes.
The Data Analysis Process

Define the Objective: Identify the goal of the analysis. Understand the
problem you're trying to solve or the question you need to answer.
Collect Data: Gather relevant data from various sources. This could
include internal data, surveys, or external datasets.
Clean the Data: Prepare the data by removing errors, duplicates, and
inconsistencies. This ensures the analysis is based on accurate and
reliable data.
Analyze the Data: Use statistical and analytical techniques to explore
the data. This may involve running queries, creating models, or using
machine learning algorithms to find patterns and trends.
Interpret the Results: Translate the analysis into meaningful insights.
Understand the significance of the findings in the context of the
objective.
Communicate the Findings: Present the results in a clear and concise
manner using visualizations, reports, or presentations to inform
decision-making.
Why is Data Analysis Important?

• Data analysis is crucial because it enables businesses to make decisions based on concrete, actionable insights rather than assumptions.
• By analyzing data, companies can uncover patterns and trends that
help them understand customer behavior, optimize operations, and
predict future outcomes.
• It also allows businesses to identify risks at an early stage so that they
can take proactive action before they turn into problems.
• Businesses use data analysis to refine their strategies, boost customer engagement, optimize product offerings, and streamline operations.
Types of Data Analysis

Descriptive Analysis
• The descriptive analysis type shows you what has already happened.
It's all about summarizing raw data into something easy to
understand.
• For instance, a business might use it to see how much each employee
sold and what the average sales look like. It's like asking: What
happened?
Diagnostic Analysis

• Once you know what happened, diagnostic analysis helps explain why.
• Say a hospital notices more patients than usual.
• By looking deeper into the data, you might find that many of them
had the same symptoms, helping you figure out the cause. This
analysis answers: Why did it happen?
Predictive Analysis

• Predictive analysis looks at trends from the past to help you guess
what might come next.
• For example, if a store knows that sales usually go up in certain
months, it can predict the same for the next year. The question here
is: What might happen?
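The store example above can be sketched as a "same month last years" forecast, one of the simplest predictive techniques. The sales figures and years below are invented for illustration.

```python
# Hypothetical monthly sales for two past years (figures invented).
past_sales = {
    2022: {"Nov": 120, "Dec": 180},
    2023: {"Nov": 130, "Dec": 200},
}

def forecast(month):
    """Predict next year's sales for a month as the average of that
    same month across the past years (a seasonal-average forecast)."""
    values = [year_data[month] for year_data in past_sales.values()]
    return sum(values) / len(values)

print(forecast("Dec"))  # average of 180 and 200 -> 190.0
```

Real predictive analysis would use far more history and stronger models, but the principle is the same: past patterns inform an estimate of what might happen next.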
Prescriptive Analysis

• This type gives you advice based on all the data you've gathered. If
you know when sales are high, prescriptive analysis suggests how to
boost them even more or improve slower months. It answers: What
should we do next?
• Having explored the various types of data analysis, let's now delve
into the top methods used to perform these analyses effectively.
• Descriptive, which answers the question, “What happened?”
• Diagnostic, which answers the question, “Why did this happen?”
• Predictive, which answers the question, “What might happen in the
future?”
• Prescriptive, which answers the question, “What should we do next?”
Top Data Analysis Methods With Examples

Descriptive Analysis
• Descriptive analysis involves summarizing and organizing data to
describe the current situation. It uses measures like mean, median,
mode, and standard deviation to describe the main features of a data
set.
• Example: A company analyzes sales data to determine the monthly
average sales over the past year. They calculate the mean sales figures
and use charts to visualize the sales trends.
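The measures named above (mean, median, mode, standard deviation) are available in Python's standard `statistics` module. A minimal sketch, using invented monthly sales figures:

```python
import statistics

# Twelve months of hypothetical sales figures (invented for illustration).
monthly_sales = [100, 120, 120, 140, 90, 160, 130, 110, 120, 150, 100, 140]

print("mean:", statistics.mean(monthly_sales))      # the average month
print("median:", statistics.median(monthly_sales))  # the middle value
print("mode:", statistics.mode(monthly_sales))      # the most common value
print("std dev:", round(statistics.stdev(monthly_sales), 2))  # spread
```

Together these numbers summarize "what happened" over the year without listing every data point, which is the essence of descriptive analysis.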
Diagnostic Analysis

• Diagnostic analysis goes beyond descriptive statistics to understand why something happened. It looks at data to find the causes of events.
• Example: After noticing a drop in sales, a retailer uses diagnostic
analysis to investigate the reasons. They examine marketing efforts,
economic conditions, and competitor actions to identify the cause.
Predictive Analysis

• Predictive analysis uses historical data and statistical techniques to forecast future outcomes. It often involves machine learning algorithms.
• Example: An insurance company uses predictive analysis to assess the
risk of claims by analyzing historical data on customer demographics,
driving history, and claim history.
Prescriptive Analysis

• Prescriptive analysis recommends actions based on data analysis. It combines insights from descriptive, diagnostic, and predictive analyses to suggest decision options.
• Example: An online retailer uses prescriptive analysis to optimize
its inventory management. The system recommends the best
products to stock based on demand forecasts and supplier lead times.
Quantitative Analysis

• Quantitative analysis involves using mathematical and statistical techniques to analyze numerical data.
• Example: A financial analyst uses quantitative analysis to evaluate a
stock's performance by calculating various financial ratios and
performing statistical tests.
Qualitative Research

• Qualitative research focuses on understanding concepts, thoughts, or experiences through non-numerical data like interviews, observations, and texts.
• Example: A researcher interviews customers to understand their
feelings and experiences with a new product, analyzing the interview
transcripts to identify common themes.
Time Series Analysis

• Time series analysis involves analyzing data points collected or recorded at specific intervals to identify trends, cycles, and seasonal variations.
• Example: A climatologist studies temperature changes over several
decades using time series analysis to identify patterns in climate
change.
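One basic time series technique is a moving average, which smooths out short-term fluctuations so a trend becomes visible. A minimal sketch, using invented annual temperature values:

```python
# Hypothetical annual average temperatures (values invented).
temps = [14.1, 14.3, 13.9, 14.6, 14.8, 14.7, 15.1, 15.0]

def moving_average(series, window):
    """Average each run of `window` consecutive points to smooth noise."""
    return [round(sum(series[i:i + window]) / window, 2)
            for i in range(len(series) - window + 1)]

print(moving_average(temps, 3))
```

In the smoothed output the year-to-year jitter shrinks and the gradual upward drift stands out, which is what a climatologist would be looking for.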
Regression Analysis

• Regression analysis assesses the relationship between a dependent variable and one or more independent variables.
• Example: An economist uses regression analysis to examine the impact of interest rates, inflation, and employment on economic growth.
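Simple linear regression (one independent variable) can be fit by ordinary least squares in a few lines of plain Python. The data points below are invented; a real analysis would use a statistics library and many more observations.

```python
# Hypothetical observations: xs is the independent variable,
# ys the dependent variable (all values invented for illustration).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Ordinary least squares: slope = covariance(x, y) / variance(x).
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

print(f"y = {slope:.2f}x + {intercept:.2f}")
```

The fitted slope quantifies how much the dependent variable changes per unit of the independent variable, which is exactly the "relationship" regression analysis assesses.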
Cluster Analysis

• Cluster analysis groups data points into clusters based on their similarities.
• Example: A marketing team uses cluster analysis to segment
customers into distinct groups based on purchasing behavior,
demographics, and interests for targeted marketing campaigns.
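The segmentation idea above can be illustrated with a toy one-dimensional k-means: customers are grouped by annual spend alone. All figures and starting centers are invented; real segmentation would use many features and a library implementation.

```python
# Hypothetical annual spend per customer (figures invented).
spend = [120, 150, 130, 900, 950, 870, 140, 910]

def kmeans_1d(values, centers, rounds=10):
    """Toy 1-D k-means: assign each value to its nearest center,
    then recompute each center as the mean of its members."""
    for _ in range(rounds):
        clusters = {c: [] for c in centers}
        for v in values:
            nearest = min(centers, key=lambda c: abs(v - c))
            clusters[nearest].append(v)
        centers = [sum(members) / len(members)
                   for members in clusters.values() if members]
    return sorted(centers)

print(kmeans_1d(spend, centers=[100, 1000]))
```

The two final centers correspond to a "low spend" and a "high spend" segment, which a marketing team could target with different campaigns.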
Applications of Data Analysis
Smart Cities and Urban Planning
• In smart cities, data analysis is used to manage traffic, reduce
congestion, and even lower pollution.
• By collecting data from sensors across the city, traffic lights can adjust
in real time to help improve the flow of vehicles and make cities more
efficient and cleaner.
Agriculture and Precision Farming
• Farmers are now using data to grow crops more effectively and
sustainably. With the help of tools farmers can track soil health,
weather conditions, and crop performance.
• This data helps them make smarter decisions about watering and
fertilizing, leading to better harvests and less waste.
Retail and Consumer Behavior Analysis
• Retailers are using data to understand customer behavior and offer
better shopping experiences.
• Companies like Starbucks use data from their app to track what
people like to buy and send personalized offers to keep customers
coming back. It’s a great way to enhance loyalty and increase sales.
Logistics and Route Optimization
• In logistics, companies like UPS are using data to find the fastest and
most fuel-efficient delivery routes.
• By analyzing traffic patterns and weather, they can adjust their routes
in real-time, cutting down on delivery times and reducing costs while
keeping customers happy with faster service.
Cybersecurity and Threat Detection
• Companies such as CrowdStrike use data to track what is happening
on a network in order to identify cyber threats before they have a
chance to wreak havoc.
• This helps companies protect their data and avoid the problems a
security breach can cause.
Data Analytics Challenges

• Not asking the right questions. The first step in getting actionable
insights is to know what you are trying to discover.
• Data silos. Data often resides in a variety of locations and is overseen by different stakeholders. Lack of coordination and siloed data make standardization more difficult.
• A data silo is a repository of data that's controlled by one department or
business unit and isolated from the rest of an organization
• Accuracy and quality. Collecting data from a lot of sources increases the risk that some of the data is lower quality or incomplete.
• The challenge for companies is in determining which data are good
and cleaning the various inputs so that everything is standardized and
usable.
• Security and privacy. The more data companies collect, the greater the likelihood it contains sensitive customer data that needs to be protected.
Role of Data Engineers, Data Analysts, Data Scientists,
Business Analysts, and Business Intelligence Analysts
Data Engineer

Building and maintaining the infrastructure (data pipelines, databases, data warehouses) that allows data to be collected, stored, and processed efficiently.
Key Responsibilities:
• Designing and implementing data architectures, developing ETL (Extract,
Transform, Load) processes, managing big data, and ensuring data quality and
availability.
Skills:
• Strong programming skills, knowledge of databases, understanding of data warehousing concepts.
• Data warehousing is the process of collecting, integrating, storing,
and managing data from multiple sources in a central repository. It
enables organizations to organize large volumes of historical data for
efficient querying, analysis, and reporting.

• A data pipeline is a series of automated processes that move and transform data from one or more sources to a destination, often for analysis or storage.
Data Analyst

Exploring and analyzing data to identify trends, patterns, and insights that can inform business decisions.
Key Responsibilities:
Collecting, cleaning, and organizing data, performing statistical analysis,
creating visualizations, and presenting findings to stakeholders.
Skills:
Proficiency in data analysis tools (SQL, Python, R), statistical analysis, data
visualization.
Data Scientist
Using data to build predictive models and solve complex business
problems using machine learning and advanced analytical techniques.
Key Responsibilities:
Developing and implementing machine learning models, performing hypothesis
testing, and creating data-driven solutions to improve business outcomes.
Skills:
Strong programming skills, knowledge of machine learning algorithms,
statistical modeling, and experience with big data technologies.
Business Analyst

Understanding business needs and recommending solutions based on data analysis and insights.
Key Responsibilities:
Gathering requirements, analyzing business processes, identifying areas for
improvement, and recommending solutions to enhance business performance.
Skills:
• Strong analytical and problem-solving skills, understanding of business processes and requirements.
Business Intelligence Analyst

Creating reports and dashboards to monitor key performance indicators (KPIs) and provide insights into business performance.
Key Responsibilities:
Developing and maintaining reports and dashboards, automating
data analysis processes, and generating ad-hoc reports for
decision-making.
Skills:
Proficiency in BI tools (Tableau, Power BI), data analysis, and reporting.
Skill Set Required to Be a Data Analyst

The data analysis process


• The data analysis process is a multi-step journey that starts with data collection and ends with actionable insights.
Programming languages (Python, R, SQL)

These languages allow you to manipulate data, perform statistical analyses, and create data visualizations.
• Python. Widely used for data manipulation and analysis, Python
boasts a rich ecosystem of libraries like Pandas and NumPy.
• R. Specialized for statistical analysis, R is another powerful tool often used in academic research and data visualization.
• SQL. The go-to language for database management, SQL allows you to query, update, and manipulate structured data.
Data Visualization Tools (Tableau, Power BI)
• Data visualization is not just about creating charts; it's about telling a
story with data.
“A picture is worth a thousand words.”
• Tableau. Known for its user-friendly interface, Tableau allows you to
create complex visualizations without any coding.
• Power BI. Developed by Microsoft, Power BI is another powerful tool
for creating interactive reports and dashboards. It integrates
seamlessly with various Microsoft products and allows for real-time
data tracking, making it popular in corporate settings.
Statistical Analysis
Statistical analysis is the backbone of data analytics, providing the
methodologies for making inferences from data.
• Descriptive statistics. Summarize and interpret data to provide a clear
overview of what the data shows.
• Inferential statistics. Make predictions and inferences about a
population based on a sample.
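The two branches above can be sketched side by side: first describe a sample, then make a rough inference about the population it came from. The sample values are invented, and the interval uses the normal approximation (1.96 standard errors), which is an assumption appropriate only as a quick sketch.

```python
import math
import statistics

# A small hypothetical sample (values invented for illustration).
sample = [4.8, 5.1, 4.9, 5.3, 5.0, 4.7, 5.2, 5.0]

# Descriptive statistics: summarize the sample itself.
mean = statistics.mean(sample)
sd = statistics.stdev(sample)

# Inferential statistics: estimate the population mean with a rough
# 95% confidence interval (normal approximation, 1.96 standard errors).
se = sd / math.sqrt(len(sample))
low, high = mean - 1.96 * se, mean + 1.96 * se
print(f"mean = {mean:.2f}, 95% CI roughly ({low:.2f}, {high:.2f})")
```

The descriptive numbers say what this sample shows; the interval says what the sample lets us infer about the wider population, which is the distinction the bullets above draw.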
Advanced Data Analyst Skills
Understanding the basics can significantly broaden your capabilities as
a data analyst. This includes:
• Supervised learning. Techniques for building models that can make
predictions based on labeled data.
• Unsupervised learning. Methods for finding patterns in unlabeled
data.
• Natural Language Processing (NLP). A subfield focusing on the
interaction between computers and human language.
Big data technologies
• According to the latest estimates, 402.74 million terabytes of data are
created each day.
• (“Created” includes data that is newly generated, captured, copied, or
consumed).
• As data continues to grow in volume and complexity, big data technologies
like Hadoop and Spark are becoming increasingly important. These
technologies allow you to work on:
• Data storage. Handle large datasets that are beyond the capacity of
traditional databases.
• Data processing. Perform complex computations and analyses on big data.
• Real-time analytics. Analyze data in real time to make immediate business decisions.
Types of Data
Structured data

• Structured data is data that has a standardized format defined by a schema.
• Structured data tends to be stored in a tabular format, meaning there
are rows and columns.
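A quick sketch of structured, tabular data in practice, using Python's built-in `sqlite3` module with an in-memory database (the table and figures are invented): because every row follows the same schema, SQL aggregation is straightforward.

```python
import sqlite3

# An in-memory SQLite database with a fixed, tabular schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("North", 100.0), ("South", 250.0), ("North", 150.0)])

# The fixed fields make querying and aggregation easy, as noted above.
for region, total in conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"):
    print(region, total)
conn.close()
```

Every row fitting the same two columns is precisely what makes the `GROUP BY` query possible without any preprocessing.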
Benefits of structured data

• Efficient querying and analysis: Because structured data is stored in fixed fields, it can be easily queried using tools like SQL, which speeds up decision-making processes.
• High accuracy: Structured data typically follows strict validation rules,
which reduces errors and ensures consistency across datasets.
• Automation-friendly: Many automated tools and algorithms work
well with structured data, making it ideal for analytics, reporting, and
machine learning tasks.
Challenges of structured data

• Limited flexibility: Structured data must conform to a rigid schema, which can make it difficult to adapt or capture complex, evolving data types, especially in dynamic environments.
• Cost of maintenance: Maintaining structured data requires constant
updates to data models, databases, and infrastructure, which can be
resource-intensive.
• Scaling issues: As the volume of structured data grows, storage and
processing demands can increase rapidly, potentially leading to
performance bottlenecks if not managed properly.
Unstructured data

Unstructured data does not have any standardized format or data model. It is stored in its native format, and there are many different types.
Common types of unstructured data include text files, photographs, videos, and audio recordings.
Benefits of unstructured data

• Rich insights: Unstructured data provides deeper insights into customer behavior, market trends, and organizational performance.
• Versatility: It can come in many formats, such as text, images, audio,
or video, allowing businesses to capture and analyze diverse types of
information from various sources.
• Growth potential: As businesses increasingly rely on data from social
media, IoT devices, and customer interactions, unstructured data
offers opportunities for innovation and competitive advantage.
Challenges of unstructured data

• Difficult to organize: Without a predefined structure, unstructured data is harder to classify, store, and retrieve, requiring advanced tools for processing and analysis.
• Complex analysis: Extracting meaningful insights from unstructured data
often involves sophisticated techniques like natural language processing
(NLP) or machine learning, which can be resource-intensive.
• Scalability issues: The sheer volume of unstructured data generated today
can overwhelm storage systems and make it difficult to scale infrastructure
without significant investments.
• Data quality concerns: Unstructured data can vary greatly in terms of accuracy and relevance, making it harder to ensure the quality of the data being used for decision-making.
What is semi-structured data?

Semi-structured data sits between structured and unstructured data, such that a portion of the data has a standardized format and a portion does not.
Benefits of semi-structured data

• Flexible structure: Semi-structured data doesn't rely on a rigid schema, allowing it to handle data that evolves over time. Formats like JSON, XML, and NoSQL databases allow for easy adjustments as data changes.
• Easier to analyze than unstructured data: Semi-structured data
includes tags or markers to indicate elements, which make it simpler
to search and analyze compared to purely unstructured data.
• Supports diverse data types: Semi-structured data can capture
various formats, including documents, emails, and social media posts.
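A small sketch of a semi-structured record using JSON and Python's standard `json` module: the `id` and `name` fields follow a predictable shape, while the nested `extras` portion can vary from record to record. The record itself is invented.

```python
import json

# A hypothetical customer record: partly regular (id, name),
# partly free-form (the nested "extras" object varies per record).
record = json.loads("""
{
  "id": 42,
  "name": "Asha",
  "extras": {"newsletter": true, "tags": ["loyal", "mobile"]}
}
""")

# The tags/markers (field names) make the data searchable and
# analyzable without requiring a rigid, table-like schema.
print(record["name"], record["extras"]["tags"])
```

This is why semi-structured data is easier to analyze than free text but still more flexible than a fixed database table.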
Challenges of semi-structured data

• Inconsistent formats: The lack of a rigid schema can lead to inconsistencies in how the data is stored or labeled, making it harder to maintain uniformity across datasets.
• Complexity in querying: Although easier to manage than unstructured
data, semi-structured data still requires specialized tools and techniques to
analyze effectively, which can increase technical complexity.
• Scalability concerns: As the volume of semi-structured data grows,
ensuring consistent performance and efficient storage can become difficult
without significant infrastructure and resources.
• Data quality issues: Without strict validation rules, semi-structured data
may suffer from accuracy and quality problems, requiring more effort in
data cleaning and governance.
Different Sources of Data for Data Analysis
Internal Data
Data generated within an organization is called internal data.
It is readily accessible and can be analysed to find business insights, which are then used in business decisions.
1. Operational Data
Operational data includes day-to-day business operations like sales
transactions, customer data, inventory records, and production data.
2. Customer Data
Customer data is among the most crucial data; it is collected directly from customers using CRM systems, feedback forms, surveys, and customer support systems. It is used in data analysis to understand customer opinions and sentiment.
3. Employee Data
It includes human resources data, which can be further analysed to assess employee performance, payroll information, and employee satisfaction.
4. Financial Data
It includes data generated through financial systems, such as budgets, profit and loss statements, balance sheets, and cash flow statements.
5. Marketing Data
Marketing data is also internal data, collected from marketing campaigns, website analytics, email marketing, and social media channels.
6. Production Data
Production data is also collected from internal sources, such as manufacturing processes. It includes machine performance, production output, and quality control metrics.
External Data

• Data collected from outside the boundaries of an organization is known as external data.
• External data is used by organizations to assess and model economic,
political, social, and environmental problems that influence business.
1. Public Data
The data which is collected from public platforms like government
databases, journals, magazines, industry reports, and newspapers etc.
2. Social Media Data
The data available on social media platforms like Twitter, Facebook,
LinkedIn, and Instagram. This encompasses user-generated content,
sentiment analysis, and engagement metrics.
3. Web Scraping
Data extracted from websites using automated scripts, such as product reviews and competitor data for feature and price comparisons.
4. IoT Data:
Data collected from sensors, smart devices, and wearables, providing
real-time data on environmental conditions, usage patterns, and more.
5. Partner Data
Data is shared between business partners, such as suppliers, distributors, or strategic partners, to improve mutual understanding of market conditions or client needs.
Data Collection
• The process of gathering and analyzing accurate data from various
sources to find answers to research problems, trends and
probabilities, etc., to evaluate possible outcomes is known as data
collection.
• During data collection, researchers must identify the data types, the
sources of data, and the methods being used.
Before an analyst begins collecting data, they must answer three
questions first:
• What’s the goal or purpose of this research?
• What kinds of data are they planning on gathering?
• What methods and procedures will be used to collect, store, and
process the information?
Data Collection Methods
Primary Data Collection

Primary data collection involves gathering original data directly from the source or through direct interaction with the respondents.
• Surveys and Questionnaires:
Researchers design structured questionnaires or surveys to collect data from
individuals or groups.

• Interviews:
Interviews involve direct interaction between the researcher and the
respondent. They can be conducted in person, over the phone, or through
video conferencing. Interviews can be structured (with predefined questions),
semi-structured (allowing flexibility), or unstructured (more conversational).
Observations:
• Researchers observe and record behaviors, actions, or events in their natural
setting. This method is useful for gathering data on human behavior,
interactions, or phenomena without direct intervention.
Experiments:
• Experimental studies involve manipulating variables to observe their impact
on the outcome. Researchers control the conditions and collect data to
conclude cause-and-effect relationships.
Secondary Data Collection

Secondary data collection involves using existing data collected by someone else for a purpose different from the original intent.
• Published Sources: Researchers refer to books, academic journals,
magazines, newspapers, government reports, and other published
materials that contain relevant data.
• Online Databases: Numerous online databases provide access to a
wide range of secondary data, such as research articles, statistical
information, economic data, and social surveys.
• Government and Institutional Records: Government agencies,
research institutions, and organizations often maintain databases or
records that can be used for research purposes.
• Past Research Studies: Previous research studies and their findings
can serve as valuable secondary data sources. Researchers can review
and analyze the data to gain insights or build upon existing
knowledge.
Common Challenges in Data Collection?

Data Quality Issues
The main threat to the broad and successful application of machine learning is poor data quality. Data quality must be your top priority if you want to make technologies like machine learning work for you.
Inconsistent Data
When working with various data sources, it's conceivable that the
same information will have discrepancies between sources. The
differences could be in formats, units, or occasionally spellings.
Data Downtime
Schema modifications and migration problems are just two examples
of the causes of data downtime. Due to their size and complexity, data
pipelines can be difficult to manage. Data downtime must be
continuously monitored and reduced through automation.
• Duplicate Data
• Inaccurate Data:
Data inaccuracies can be attributed to several things, including data degradation, human mistakes, and data drift.
• Finding Relevant Data
• Finding relevant data is not so easy. There are several factors that we need to
consider while trying to find relevant data, which include -
• Relevant Domain
• Relevant demographics
• We need to consider Relevant Time periods and many more factors while
trying to find appropriate data.
• Dealing With Big Data
Big data refers to massive data sets with more diversified structures. These traits typically create additional challenges in storing, analyzing, and extracting results from the data.

• Low Response and Other Research Issues
Poor design and low response rates were shown to be two issues with data collection, particularly in health surveys that used questionnaires. This might lead to an insufficient or inadequate data supply for the study.
Data Wrangling
Data Wrangling and Cleaning

Before any data analysis can occur, the data must be cleaned and
transformed into a usable format, a process known as data wrangling.
This involves:
• Data cleaning. Identifying and correcting errors, inconsistencies, and
inaccuracies in datasets.
• Data transformation. Converting data into a format that can be easily
analyzed, which may involve aggregating, reshaping etc.
• Data integration. Combining data from different sources and
providing a unified view.
• Data wrangling, or data munging, is a crucial process in the data
analytics workflow that involves cleaning, structuring, and enriching
raw data to transform it into a more suitable format for analysis.

• This process includes cleaning the data by removing or correcting inaccuracies, inconsistencies, and duplicates.
• It also involves structuring the data, often converting it into a tabular form that is easier to work with in analytical applications.
How Data Wrangling Works?

Collection
The first step in data wrangling is collecting raw data from various
sources. These sources can include databases, files, external APIs, web
scraping, and many other data streams.
The data collected can be structured (e.g., SQL databases), semi-structured (e.g., JSON, XML files), or unstructured (e.g., text documents, images).
Cleaning
• Once data is collected, the cleaning process begins. This step removes errors,
inconsistencies, and duplicates that can skew analysis results. Cleaning might
involve:
• Removing irrelevant data that doesn't contribute to the analysis.
• Correcting errors in data, such as misspellings or incorrect values.
• Dealing with missing values by removing them, imputing them from other data points, or estimating them through statistical methods.
• Identifying and resolving inconsistencies, such as different formats for dates
or currency.
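The cleaning steps listed above can be sketched on a few invented survey rows: the code drops a row with a missing value, normalizes an inconsistent date format, and removes an exact duplicate. All field names and values are illustrative.

```python
from datetime import datetime

# Hypothetical raw survey rows (invented), each with a date and a score.
rows = [
    {"date": "2024-01-05", "score": 7},
    {"date": "05/01/2024", "score": 9},     # inconsistent date format
    {"date": "2024-01-05", "score": 7},     # exact duplicate
    {"date": "2024-01-06", "score": None},  # missing value
]

def normalize_date(text):
    """Parse either ISO or day/month/year dates into a single ISO form."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d")
        except ValueError:
            pass
    raise ValueError(f"unrecognized date: {text}")

cleaned, seen = [], set()
for row in rows:
    if row["score"] is None:        # drop rows with missing values
        continue
    key = (normalize_date(row["date"]), row["score"])
    if key not in seen:             # drop exact duplicates
        seen.add(key)
        cleaned.append({"date": key[0], "score": key[1]})

print(cleaned)
```

After cleaning, only two distinct, consistently formatted rows remain, ready for the structuring step that follows.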
Structuring
• After cleaning, data needs to be structured or restructured into a
more analysis-friendly format. This often means converting
unstructured or semi-structured data into a structured form, like a
table in a database or a CSV file. This step may involve:
• Parsing data into structured fields.
• Normalizing data to ensure consistent formats and units.
• Transforming data, such as converting text to lowercase, to prepare
for analysis.
Enriching
• Data enrichment involves adding context or new information to the
dataset to make it more valuable for analysis. This can include:
• Merging data from multiple sources to develop a more
comprehensive dataset.
• Creating new variables or features that can provide additional insights
when analyzed.
Validating
• Validation ensures the data's accuracy and quality after it has been
cleaned, structured, and enriched. This step may involve:
• Data integrity checks, such as ensuring foreign keys in a database
match.
• Quality assurance testing to ensure the data meets predefined
standards and rules.
Storing
• The final wrangled data is then stored in a data repository, such as a
database or a data warehouse, making it accessible for analysis and
reporting. This storage not only secures the data but also organizes it
in a way that is efficient for querying and analysis.
Documentation
• Documentation is critical throughout the data wrangling process. It
records what was done to the data, including the transformations and
decisions. This documentation is invaluable for reproducibility,
auditing, and understanding the data analysis process.
Data Integration
• It is the process of collecting and consolidating data from all sources
into one single dataset or data warehouse.
• The ultimate goal of data integration is to provide users with consistent access and delivery of data and to meet the different needs of all business applications and processes.
• One of the most common use cases of data integration is in the
management of business and customer data.
• It helps to support business intelligence and advanced analytics with
a complete picture of financial risks, key performance indicators
(KPIs), supply chain operations, and other important business
processes
• Another important role of data integration is in the IT environment to
provide access to data stored on legacy systems. There are a number
of modern big data analytics environments (e.g., Hadoop) that are not
compatible with the data in legacy systems.
• Data integration can help bridge the gap between valuable legacy data and popular business intelligence applications.
Challenges to Data Integration

Data From Legacy Systems
The greatest challenge for data integration methods is integrating the data stored in legacy systems or mainframes.
These data often lack markers, such as the date and time of activities, that most modern systems would usually include.
Data From New Systems
• There are a number of new systems today generating different types of data
from a multitude of sources - IoT devices, cloud, sensors, etc.
• Now, this data can also be real-time data or unstructured data, which
provides another challenge.
External Data
• For any organization to flourish, it cannot always depend on its own internal data. There are a number of external sources that organizations have to draw on in order to stand out from their competition.
• However, most of these external sources of data may not have the same level
of detail or format as internal data, making it very difficult to integrate them.
There are also a number of contracts that may be signed with external
vendors which make it difficult to share the data across the entire
organization.
