Data Analytics

Data management involves the collection, organization, protection, and storage of data to facilitate business analysis and decision-making. It encompasses various techniques such as data preparation, pipelines, governance, and architecture, which are essential for ensuring data reliability, security, and scalability. Effective data management is crucial for organizations to derive insights, comply with regulations, and enhance productivity.

Uploaded by

Sai Nikhil

DATA ANALYTICS

Data Management
• Data management is the practice of collecting, organizing,
protecting, and storing an organization’s data so it can be
analyzed for business decisions.
• As organizations create and consume data at
unprecedented rates, data management solutions become
essential for making sense of the vast quantities of data.
• Today’s leading data management software ensures that
reliable, up-to-date data is always used to drive decisions.
• The software helps with everything from data preparation
to cataloging, search, and governance, allowing people to
quickly find the information they need for analysis.
Types of Data Management

• Data management plays several roles in an organization’s data
environment, making essential functions easier and less
time-intensive. These data management techniques include the
following:
• Data preparation is used to clean and transform raw data into the
right shape and format for analysis, including making corrections
and combining data sets.
• Data pipelines enable the automated transfer of data from one
system to another.
• ETL (Extract, Transform, Load) processes take data from one
system, transform it, and load it into the organization’s data
warehouse.
• Data catalogs help manage metadata to create a complete picture
of the data, providing a summary of its changes, locations, and
quality while also making the data easy to find.
• Data warehouses are places to consolidate various data
sources, contend with the many data types businesses
store, and provide a clear route for data analysis.
• Data governance defines standards, processes, and policies
to maintain data security and integrity.
• Data architecture provides a formal approach for creating
and managing data flow.
• Data security protects data from unauthorized access and
corruption.
• Data modeling documents the flow of data through an
application or organization.
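To make the data preparation and ETL techniques listed above more concrete, here is a minimal sketch using only the Python standard library. The CSV content, the `orders` table, and the in-memory SQLite database standing in for a data warehouse are all assumptions made for illustration; a real pipeline would read from and write to actual systems.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from a source system (names and values are made up).
RAW_CSV = """order_id,customer,amount
1, alice ,10.50
2,BOB,7.25
3,carol,
"""


def extract(text):
    """Extract: read rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: tidy names, convert types, skip rows with no amount."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # incomplete record; dropping it is a choice made for this sketch
        cleaned.append(
            (int(row["order_id"]), row["customer"].strip().title(), float(amount))
        )
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into a warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")  # stand-in for a real data warehouse
load(transform(extract(RAW_CSV)), warehouse)
total = warehouse.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(total)
```

Keeping extract, transform, and load as separate functions is one way to make each stage easy to test and to swap out on its own.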

Why data management is important

• Data management is a crucial first step to employing effective data analysis at
scale, which leads to important insights that add value to your customers and
improve your bottom line. With effective data management, people across an
organization can find and access trusted data for their queries. Some benefits of an
effective data management solution include:
• Visibility
• Data management can increase the visibility of your organization’s data assets,
making it easier for people to quickly and confidently find the right data for their
analysis. Data visibility allows your company to be more organized and productive,
allowing employees to find the data they need to better do their jobs.
• Reliability
• Data management helps minimize potential errors by establishing processes and
policies for usage and building trust in the data being used to make decisions
across your organization. With reliable, up-to-date data, companies can respond
more efficiently to market changes and customer needs.
• Security
• Data management protects your organization and its employees from data
losses, thefts, and breaches with authentication and encryption tools.
Strong data security ensures that vital company information is backed up
and retrievable should the primary source become unavailable.
Additionally, security becomes more and more important if your data
contains any personally identifiable information that needs to be carefully
managed to comply with consumer protection laws.
• Scalability
• Data management allows organizations to effectively scale data and usage
occasions with repeatable processes to keep data and metadata up to
date. When processes are easy to repeat, your organization can avoid the
unnecessary costs of duplication, such as employees conducting the same
research over and over again or re-running costly queries unnecessarily.
• Data management continues to evolve to address
challenges:
• Because data management plays a crucial role in
today’s digital economy, it’s important that systems
continue to evolve to meet your organization’s data
needs. Traditional data management processes make it
difficult to scale capabilities without compromising
governance or security. Modern data management
software must address several challenges to ensure
trusted data can be found.
• Challenge 1: Increased data volumes
• Challenge 2: New roles for analytics
• As your organization increasingly relies on data-driven decision-
making, more of your people are asked to access and analyze data.
When analytics falls outside a person’s skill set, understanding
naming conventions, complex data structures, and databases can be
a challenge. If it takes too much time or effort to convert the data,
analysis won’t happen and the potential value of that data is
diminished or lost.
• Challenge 3: Compliance requirements
• Constantly changing compliance requirements make it a challenge
to ensure people are using the right data. An organization needs its
people to quickly understand what data they should or should not
be using—including how and what personally identifiable
information (PII) is ingested, tracked, and monitored for compliance
and privacy regulations.
Data management best practices

• 1. Clearly identify your business goals
• 2. Focus on the quality of data
• 3. Allow the right people to access the data
• 4. Prioritize data security
• https://www.tableau.com/learn/articles/what-is-data-management
• https://www.geeksforgeeks.org/data-architecture-design-and-data-management/
Design Data Architecture and manage
the data for analysis
• Data architecture design is a set of standards, composed of
policies, rules, and models, that governs what type of data
is collected, where it is collected from, how the
collected data is arranged and stored, and how it is
used and secured in systems
and data warehouses for further analysis.
• Data is one of the essential pillars of enterprise
architecture, through which an enterprise succeeds in
executing its business strategy.
• Data architecture design is important for creating a
vision of the interactions between data systems.
For example, if a data architect wants to implement
data integration, two systems must interact, and the
data architecture provides the model of how data
moves between them during the process.
• Data architecture also describes the type of data
structures used to manage data, and it provides an
easy way to preprocess data. A data architecture
is built from three essential models, which are
then combined:
• Conceptual model –
It is a business-level model that uses the Entity Relationship (ER) model to
describe entities, their attributes, and the relationships between them.
• Logical model –
It is a model in which the problem is represented in logical form,
such as rows and columns of data, classes, XML tags, and other DBMS
techniques.
• Physical model –
The physical model holds the database design, such as which type of
database technology is suitable for the architecture.
• A data architect is responsible for the design, creation, management,
and deployment of the data architecture and defines how data is to be
stored and retrieved; other decisions are made by internal bodies.
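As a rough illustration of how the conceptual, logical, and physical models described above relate, the sketch below describes the same two entities at each level. The `Customer` and `Order` entities, their attributes, and the choice of SQLite as the physical technology are assumptions made only for this example.

```python
import sqlite3

# Conceptual model: entities, attributes and relationships (hypothetical example).
conceptual = {
    "entities": {"Customer": ["name", "email"], "Order": ["date", "total"]},
    "relationships": [("Customer", "places", "Order")],
}

# Logical model: the same entities as tables with typed columns and keys,
# still independent of any particular database product.
logical = {
    "customer": {"id": "integer, primary key", "name": "text", "email": "text"},
    "orders": {
        "id": "integer, primary key",
        "customer_id": "integer, references customer",
        "order_date": "date",
        "total": "decimal",
    },
}

# Physical model: concrete DDL for one chosen technology (SQLite here).
PHYSICAL_DDL = """
CREATE TABLE customer (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    email TEXT UNIQUE
);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(id),
    order_date TEXT,
    total REAL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(PHYSICAL_DDL)
tables = sorted(
    row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
print(tables)
```

Each level adds detail: the conceptual model names things, the logical model gives them structure, and the physical model commits to a specific database.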
Factors that influence Data
Architecture:
• Business requirements –
These include factors such as business expansion, system access
performance, data management, transaction
management, and making use of raw data by converting it into
image files and records and then storing it in data warehouses. Data
warehouses are the main place where business transactions are
stored.
• Business policies –
The policies are rules that are useful for describing the way of
processing data. These policies are made by internal organizational
bodies and other government agencies.
• Technology in use –
This includes drawing on previously completed data
architecture designs as well as existing licensed software
purchases and database technology.
• Business economics –
Economic factors such as business growth
and loss, interest rates, loans, the condition of the
market, and the overall cost also have an
effect on the architecture design.
• Data processing needs –
These include factors such as mining of the data,
large continuous transactions, database
management, and other data preprocessing
needs.
What are the characteristics and
components of a data architecture?
• Common characteristics of well-designed data architectures include
the following:
• a business-driven focus that's aligned with organizational strategies
and data requirements;
• flexibility and scalability to enable various applications and meet
new business needs for data; and
• strong security protections to prevent unauthorized data access and
improper use of data.
• Data architecture components don't include platforms, tools and
other technologies. Instead, a data architecture is a conceptual
infrastructure that's described by a set of diagrams and documents.
Data management teams then use them to guide technology
deployments and how data is managed.
• Examples of those components, or artifacts, are as follows:
• data models, data definitions and common vocabularies for data
elements;
• data flow diagrams that illustrate how data flows through systems
and applications;
• documents that map data usage to business processes, such as a
CRUD matrix -- short for create, read, update and delete;
• other documents that describe business goals, concepts and
functions to help align data management initiatives with them;
• policies and standards that govern how data is collected, integrated,
transformed and stored; and
• a high-level architectural blueprint, with different layers for
processes like data ingestion, data integration and data storage.
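One of the artifacts listed above, the CRUD matrix, can be sketched as a small lookup table. The business processes and data entities below are hypothetical; the point is only to show how such a matrix maps data usage to processes.

```python
# Hypothetical CRUD matrix: rows are business processes, columns are data
# entities, and each cell lists the operations (Create, Read, Update, Delete)
# that the process performs on the entity.
crud_matrix = {
    "Order entry": {"Customer": "R", "Order": "CRU"},
    "Billing": {"Customer": "R", "Order": "R", "Invoice": "CRU"},
    "Account cleanup": {"Customer": "RUD"},
}


def processes_that(operation, entity):
    """Return the processes that perform the given operation on an entity."""
    return sorted(
        process
        for process, usage in crud_matrix.items()
        if operation in usage.get(entity, "")
    )


print(processes_that("C", "Order"))
print(processes_that("D", "Customer"))
```

A query like this helps answer questions such as which processes would be affected if the structure of an entity changed.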
Data Management
• Data management is the process of managing tasks like extracting data,
storing data, transferring data, processing data, and then securing data
at low cost.
• The main aim of data management is to manage and safeguard
people’s and organizations’ data in an optimal way so that they can easily
create, access, delete, and update the data.
• Data management is essential to the growth of every
enterprise; without it, the policies and decisions needed
for business advancement cannot be made. The better the data management, the better
the productivity of the business.
• Large volumes of data, such as big data, are harder to manage with traditional methods, so
suitable technologies and tools for data
management, such as Hadoop, Scala, Tableau, and AWS, must be used. These can then
be used for big data analysis and for finding patterns in the data.
• Data management can be achieved by giving employees the necessary training
and by having DBAs, data analysts, and data architects maintain the data.
Data architecture vs. data modeling

• Data modeling focuses on the details of specific data assets.
It creates a visual representation of data entities, their
attributes and how different entities relate to each other.
That helps in scoping the data requirements for
applications and systems and then designing database
structures for the data, a process that's done through a
progression of conceptual, logical and physical data models.
• Data architecture takes a more global view of an
organization's data to create a framework for data
management and usage. But, as consultant Loshin wrote in
his article comparing the two, data modeling and data
architecture complement each other. Data models are a
crucial element in data architectures, and an established
data architecture simplifies data modeling.
Data Architecture Framework

• There are multiple enterprise architecture frameworks that are
used as the foundation for building the data architecture
framework of an organization.
• DAMA-DMBOK 2
• This refers to DAMA International's Data Management Body of
Knowledge – a framework designed specifically for data
management. It includes standard definitions of data management
terminology, functions, deliverables, roles, and also presents
guidelines on data management principles.
• Zachman Framework for Enterprise Architecture
• John Zachman created this enterprise ontology at IBM during the
1980s. The 'data' column of this framework includes multiple layers
like key architectural standards for the business, a semantic model
or conceptual/enterprise data model, an enterprise or logical data
model, a physical data model, and actual databases.
• The Open Group Architecture Framework (TOGAF)
• TOGAF is one of the most widely used enterprise
architecture methodologies. It offers a
framework for designing, planning,
implementing, and managing data
architecture best practices. It helps define
business goals and align them with
architecture objectives.
What Does a Data Architect Do?

• As the mastermind behind data architecture, the data
architect creates blueprints for data flow and data
management. They assess the organization's potential data
sources, and devise plans to centralize, integrate, protect
and sustain them. Thus, employees can access critical
information wherever they want, whenever they want.
• The role of a data architect involves:
• Collaborating with IT teams to devise data strategy
• Building the data inventory needed to implement the architecture
• Researching data acquisition opportunities
• Identifying and evaluating data management technologies in use
• Developing data models, etc.
Understand various
sources of Data like Sensors/Signals/GPS
• Data collection is the process of acquiring,
extracting, and storing large volumes of data, which
may be structured or unstructured: text, video, audio,
XML files, records, or image files. The collected data
is used in later stages of data analysis.
• The data is then divided mainly
into two types:
• Primary data
• Secondary data
• 1. Primary data:
• Data that is raw, original, and extracted
directly from the official sources is known as
primary data. This type of data is collected
directly through techniques such as
questionnaires, interviews, and surveys. The data
collected must match the demands and
requirements of the target audience on which the
analysis is performed; otherwise it becomes a
burden during data processing.
• A few methods of collecting primary data are interviews,
surveys, observation, and experiments.
• Features of Primary Data
• Primary Data has the following characteristics –
• Such data is being collected for the first time.
• Primary Data is original and thereby more reliable than other types
of data
• This kind of data has not been used for any statistical analysis
before.

Features of Secondary Data
• Secondary Data consists of the following features –
• Secondary data is considered as ‘second-hand information’.
• Secondary data is not original.
• This kind of data has gone through statistical analysis at least once.
• Secondary data is generally considered less reliable than primary data.
• Sources of data are of two types; they are as
follows –
• Statistical Data
• This type of data source refers to the collection of
data that are used for official purposes, such as
population census, official surveys, etc.
• Non-Statistical Data
• This type of data source refers to the collection of
data that are used for various administrative
purposes, mainly in the private sector.
• 1. Interview method:
• In this method, data is collected by interviewing the
target audience. The person asking the questions is the
interviewer and the person answering is the
interviewee. Some basic business or
product related questions are asked, and the answers are
recorded as notes, audio, or video and
stored for processing. Interviews can be structured
or unstructured, such as personal interviews or formal
interviews conducted by telephone, face to face, email, etc.
• 2. Survey method:
• The survey method is a research process
in which a list of relevant questions is asked and the
answers are recorded as text,
audio, or video. Surveys can be
conducted both online and offline, for example
through website forms and email. The
survey answers are then stored for analysis.
Examples are online surveys and
social media polls.
• 3. Observation method:
• The observation method is a method of data collection
in which the researcher keenly observes the behavior
and practices of the target audience using some data
collecting tool and stores the observed data in the form
of text, audio, video, or any raw format. In this
method, the data is collected directly by posing a few
questions to the participants. For example, observing a
group of customers and their behavior towards the
products. The data obtained is then sent for processing.
• 4. Experimental method:
• The experimental method is the process of collecting data through
performing experiments, research, and investigation. The most
frequently used experiment methods are CRD, RBD, LSD, FD.
• CRD – Completely Randomized Design is a simple experimental
design used in data analytics which is based on randomization and
replication. It is mostly used for comparing treatments.
• RBD – Randomized Block Design is an experimental design in which
the experimental units are grouped into small sets called blocks.
Treatments are assigned at random within each block and results are
drawn using a technique known as analysis of variance (ANOVA).
RBD originated in the agriculture sector.
• LSD – Latin Square Design is an experimental design
that is similar to CRD and RBD but arranges the blocks in rows
and columns. It is an N x N square with
an equal number of rows and columns, in which each
letter occurs only once in each row and each column. Hence the
differences can be easily found with fewer errors in the
experiment. A completed Sudoku grid is an example of a Latin
square.
• FD – Factorial Design is an experimental design where
each experiment has two or more factors, each with a set of possible
values, and trials are performed on the combinations of
those values.
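Three of the designs above (CRD, LSD, and FD) can be sketched in a few lines of Python. The function names, plot labels, and factor levels are hypothetical, and the cyclic construction used for the Latin square is just one simple way to build one; this is an illustration, not a full experimental-design toolkit.

```python
import itertools
import random


def completely_randomized(units, treatments, seed=0):
    """CRD: assign treatments to units at random, with equal replication.

    Assumes len(units) is a multiple of len(treatments).
    """
    reps = len(units) // len(treatments)
    plan = list(treatments) * reps
    random.Random(seed).shuffle(plan)
    return dict(zip(units, plan))


def latin_square(symbols):
    """LSD: cyclic n x n square; each symbol appears once per row and column."""
    n = len(symbols)
    return [[symbols[(r + c) % n] for c in range(n)] for r in range(n)]


def full_factorial(factors):
    """FD: every combination of the levels of every factor."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in itertools.product(*factors.values())]


crd = completely_randomized([f"plot{i}" for i in range(6)], ["A", "B", "C"])
square = latin_square(["A", "B", "C"])
runs = full_factorial({"price": ["low", "high"], "banner": ["red", "blue"]})

for row in square:
    print(" ".join(row))
print(len(runs), "factorial runs")
```

A Randomized Block Design could be built on the same idea by calling `completely_randomized` once per block.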
• 2. Secondary data:
• Secondary data is data which has already been
collected and is reused for some valid purpose. This
type of data was previously recorded from primary data
and it has two types of sources: internal sources
and external sources.
• Internal source:
• This type of data can easily be found within the
organization, such as market records, sales records,
transactions, customer data, accounting resources, etc.
Obtaining data from internal sources takes less cost
and time.
• External source:
• Data which can’t be found within the
organization and is obtained through external
third-party resources is external source data. The
cost and time required are higher because this
involves a huge amount of data. Examples of
external sources are Government publications,
news publications, Registrar General of India,
planning commission, international labor bureau,
syndicate services, and other non-governmental
publications.
• Other sources:
• Sensor data: With the advancement of IoT devices, the sensors of
these devices collect data which can be used for sensor data
analytics to track the performance and usage of products.
• Satellite data: Satellites collect terabytes of images and data
on a daily basis through surveillance cameras, which can be
used to extract useful information.
• Web traffic: Thanks to fast and cheap internet access, data in many
formats that users upload to different platforms can be
collected, with their permission, for data analysis.
Search engines also provide data on the keywords and
queries that are searched most often.
• https://www.vedantu.com/commerce/sources-of-data
• https://www.futurelearn.com/info/courses/data-analytics-python-statistics-and-analytics-fundamentals/0/steps/186574
Data Quality (noise, outliers, missing
values, duplicate data)
• Noise refers to measurement error in data
values
• Could be random error or systematic
error
• Outliers are data objects with characteristics
that are considerably different than most of
the other data objects in the data set
• Could indicate “interesting” cases, or could indicate
errors in the data
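As an illustration of how outliers might be flagged in practice, the sketch below applies the 1.5 x IQR rule to a list of readings. This rule is a common convention chosen here as an assumption, not something prescribed by these notes, and the sample readings are made up.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Flag values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical sensor readings: small variation (noise) plus one odd value.
readings = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2, 35.0]
print(iqr_outliers(readings))
```

Whether a flagged value is an error or an "interesting" case still has to be decided by someone who knows the data.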
Data missing
• Reasons for missing values
• Information is not collected (e.g., people decline
to give their age)
• Attributes may not be
applicable to all cases (e.g., annual income is not
applicable to children)
• Ways to handle missing values
• Eliminate entities with missing values
• Estimate attributes with missing values
• Ignore the missing values during analysis
• Replace with all possible values (weighted by their probabilities)
• Impute missing values
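Three of the strategies listed above (eliminating incomplete entities, ignoring missing values during analysis, and imputing them) can be sketched as follows. The records are hypothetical, and mean imputation is only one of many possible imputation methods.

```python
import statistics

# Hypothetical records; None marks a missing value.
records = [
    {"name": "A", "age": 34, "income": 52000},
    {"name": "B", "age": None, "income": 48000},
    {"name": "C", "age": 29, "income": None},
]


def drop_incomplete(rows):
    """Eliminate entities that have any missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def mean_ignoring_missing(rows, field):
    """Ignore missing values while computing a statistic."""
    return statistics.mean(r[field] for r in rows if r[field] is not None)


def impute_mean(rows, field):
    """Impute missing values with the mean of the known values."""
    fill = mean_ignoring_missing(rows, field)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]


print(len(drop_incomplete(records)))
print(mean_ignoring_missing(records, "income"))
print(impute_mean(records, "age")[1]["age"])
```

Dropping rows is simplest but loses information; imputation keeps every row at the cost of introducing estimated values.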
Duplicate data
• Data set may include data entities that are
duplicates, or almost duplicates of one
another
• Major issue when merging data from
heterogeneous sources
• Example: same person with multiple email addresses
• Data cleaning
• Finding and dealing with duplicate entities
• Finding and correcting measurement error
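The "same person with multiple email addresses" case above can be handled by merging records on a normalized key. In this sketch the key is the lower-cased name plus birth date; that matching rule, the field names, and the sample people are all assumptions for illustration, and real record linkage usually needs fuzzier matching.

```python
def normalize(name):
    """Lower-case a name and collapse repeated whitespace."""
    return " ".join(name.lower().split())


def merge_duplicates(records):
    """Merge records that share a normalized name and birth date."""
    merged = {}
    for rec in records:
        key = (normalize(rec["name"]), rec["birth_date"])
        entry = merged.setdefault(
            key,
            {"name": rec["name"].strip(), "birth_date": rec["birth_date"], "emails": []},
        )
        email = rec["email"].lower()
        if email not in entry["emails"]:
            entry["emails"].append(email)
    return list(merged.values())


# Hypothetical records merged from two different source systems.
people = [
    {"name": "Ravi Kumar", "birth_date": "1990-04-12", "email": "ravi@example.com"},
    {"name": "ravi  kumar ", "birth_date": "1990-04-12", "email": "r.kumar@example.org"},
    {"name": "Ravi Kumar", "birth_date": "1985-01-30", "email": "rk85@example.com"},
]

for person in merge_duplicates(people):
    print(person["name"], person["birth_date"], person["emails"])
```

Including the birth date in the key keeps two different people with the same name from being merged by mistake.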
What Is Data Processing: Cycle, Types,
Methods, Steps and Examples
• Data in its raw form is not useful to any organization. Data
processing is the method of collecting raw data and
translating it into usable information. It is usually
performed in a step-by-step process by a team of data
scientists and data engineers in an organization. The raw
data is collected, filtered, sorted, processed, analyzed,
stored, and then presented in a readable format.
• Data processing is essential for organizations to create
better business strategies and increase their competitive
edge. By converting the data into readable formats like
graphs, charts, and documents, employees throughout the
organization can understand and use the data.
• The data processing cycle consists of a series of
steps where raw data (input) is fed into a system
to produce actionable insights (output). Each step
is taken in a specific order, but the entire process
is repeated in a cyclic manner. The first data
processing cycle's output can be stored and fed
as the input for the next cycle.

• Step 1: Collection
• The collection of raw data is the first step of the
data processing cycle. The type of raw data
collected has a huge impact on the output
produced. Hence, raw data should be gathered
from defined and accurate sources so that the
subsequent findings are valid and usable. Raw
data can include monetary figures, website
cookies, profit/loss statements of a company,
user behavior, etc.
• Step 2: Preparation
• Data preparation or data cleaning is the process of sorting
and filtering the raw data to remove unnecessary and
inaccurate data. Raw data is checked for errors, duplication,
miscalculations or missing data, and transformed into a
suitable form for further analysis and processing. This is
done to ensure that only the highest quality data is fed into
the processing unit.
• The purpose of this step is to remove bad data (redundant,
incomplete, or incorrect data) so as to begin assembling
high-quality information that can be used in the best
possible way for business intelligence.
• Step 3: Input
• In this step, the raw data is converted into machine
readable form and fed into the processing unit. This can be
in the form of data entry through a keyboard, scanner or
any other input source.
• Step 4: Data Processing
• In this step, the input data is subjected to various data
processing methods, which may include machine learning and artificial
intelligence algorithms, to generate a desirable output. This
step may vary slightly from process to process depending
on the source of data being processed (data lakes, online
databases, connected devices, etc.) and the intended use of
the output.
• Step 5: Output
• The data is finally transmitted and displayed to the user
in a readable form like graphs, tables, vector files,
audio, video, documents, etc. This output can be
stored and further processed in the next data
processing cycle.
• Step 6: Storage
• The last step of the data processing cycle is storage,
where data and metadata are stored for further use.
This allows for quick access and retrieval of information
whenever needed, and also allows it to be used as
input in the next data processing cycle directly.
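The six steps above can be sketched as a chain of small functions, one per step. Everything here is a toy assumption: the readings are made up, the "processing" is a simple mean rather than a machine learning model, and a Python list stands in for real storage.

```python
import json
import statistics


def collect():
    """Step 1: gather raw data (hypothetical readings typed in as text)."""
    return ["12.5", "13.1", "", "12.5", "n/a", "14.0"]


def prepare(raw):
    """Step 2: drop blanks, invalid entries and repeated entries.

    Treating a repeated value as a duplicate is an assumption for this toy data.
    """
    seen, clean = set(), []
    for item in raw:
        item = item.strip()
        if not item or item in seen:
            continue
        try:
            float(item)
        except ValueError:
            continue  # not a number, so it is discarded
        seen.add(item)
        clean.append(item)
    return clean


def to_machine_form(clean):
    """Step 3: convert the prepared text into numbers the program can use."""
    return [float(x) for x in clean]


def process(values):
    """Step 4: turn the input into a result."""
    return {"count": len(values), "mean": round(statistics.mean(values), 2)}


def present(result):
    """Step 5: show the result in a readable form."""
    return f"{result['count']} readings, mean {result['mean']}"


def store(result, storage):
    """Step 6: keep the result so a later cycle can reuse it."""
    storage.append(json.dumps(result))


storage = []
result = process(to_machine_form(prepare(collect())))
report = present(result)
store(result, storage)
print(report)
```

Because the stored result is plain JSON, a later cycle could load it back and use it as part of its own input.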
• Types of Data Processing: There are different
types of data processing based on the source
of data and the steps taken by the processing
unit to generate an output. There is no one-
size-fits-all method that can be used for
processing raw data.
• Examples of Data Processing
• Data processing occurs in our daily lives whether we may
be aware of it or not. Here are some real-life examples of
data processing:
• Stock trading software converts millions of stock data
points into a simple graph
• An e-commerce company uses the search history of
customers to recommend similar products
• A digital marketing company uses demographic data of
people to strategize location-specific campaigns
• A self-driving car uses real-time data from sensors to detect
if there are pedestrians and other cars on the road
• Electronic Data Processing:
• Data is processed with modern technologies
using data processing software and programs.
A set of instructions is given to the software to
process the data and yield output.
• This method is the most expensive but
provides the fastest processing speeds with
the highest reliability and accuracy of output.
• The Future of Data Processing
• The future of data processing can best be summed up in one short
phrase: cloud computing.
• While the six steps of data processing remain unchanged, cloud
technology has brought major advances in data processing,
giving data analysts and scientists the fastest,
most advanced, most cost-effective, and most efficient data processing
methods available today.
• The cloud lets companies blend their platforms into one centralized
system that’s easy to work with and adapt. Cloud technology allows
seamless integration of new upgrades and updates to legacy
systems while offering organizations immense scalability.
• Cloud platforms are also affordable and serve as a great equalizer
between large organizations and smaller companies.
• Moving From Data Processing to Analytics
• If we had to pick one thing that stands out as the most
significant game-changer in today’s business world, it’s big
data. Although it involves handling a staggering amount of
information, the rewards are undeniable. That’s why
companies that want to stay competitive in the 21st-
century marketplace need an effective data processing
strategy.
• Analytics, the process of finding, interpreting, and
communicating meaningful patterns in data, is the next
logical step after data processing. Whereas data processing
changes data from one form to another, analytics takes
those newly processed forms and makes sense of them.
