Roles of Data Scientists in Business and Society
• The Data Scientist is responsible for advising the business on the potential of data,
to provide new insights into the business’s mission, and through the use of
advanced statistical analysis, data mining, and data visualization techniques, to
create solutions that enable enhanced business performance.
• The Data Scientist combines data, computational science, and technology with
consumer-oriented business knowledge to produce high-value insights into the
business and to drive high impact through the business levers at the
business’s disposal.
• The Data Scientist plays a strategic role in the development of new approaches to
understand the business’s consumer trends and behaviors as well as approaches to
solve complex business issues.
• The Data Scientist also takes the initiative to experiment with various technologies
and tools, with the vision of creating innovative data-driven insights for the business
and society.
• Data science helps to address real-life social issues.
• Nonprofits and nonprofit government organizations benefit from data science.
These organizations gather information to cater to society's needs.
• For nonprofits to run at their best, resources such as time, money, and products should
be used wisely. Data science enables these organizations to determine effective and
economical methods of helping those in need without running out of such valuable resources.
Module 2
Data Structure
• In computer science, a data structure is a particular way of organizing
and storing data in a computer such that it can be accessed and
modified efficiently.
• More precisely, a data structure is a collection of data values, the
relationships among them, and the functions or operations that can be
applied to the data.
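As a minimal sketch of this definition, the hypothetical `Stack` class below bundles the three parts together: the data values (a list), the relationship among them (insertion order), and the operations that can be applied (push, pop, peek). The class and method names are illustrative assumptions, not taken from the slides.

```python
class Stack:
    """Last-in, first-out collection: values + ordering + operations."""

    def __init__(self):
        self._items = []  # the stored data values, oldest first

    def push(self, value):
        self._items.append(value)  # place a new value on top

    def pop(self):
        if not self._items:
            raise IndexError("pop from empty stack")
        return self._items.pop()  # remove and return the top value

    def peek(self):
        if not self._items:
            raise IndexError("peek at empty stack")
        return self._items[-1]  # look at the top value without removing it

    def __len__(self):
        return len(self._items)


stack = Stack()
for reading in (10, 20, 30):
    stack.push(reading)
top = stack.pop()  # the most recently pushed value
```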
Structured Data
Structured data is data that adheres to a pre-defined data model and is therefore
straightforward to analyze. Structured data conforms to a tabular format with
relationships between the different rows and columns.
Common examples of structured data are Excel files and SQL databases, holding
values such as quantities, barcodes, and weblog statistics.
Unstructured data
It is information that either does not have a predefined data model or is not
organized in a pre-defined manner. Unstructured information is typically text-
heavy, but may contain data such as dates, numbers, and facts as well.
This results in irregularities and ambiguities that make it difficult to understand
using traditional programs as compared to data stored in structured databases.
Common examples of unstructured data include audio and video files, and data held in NoSQL databases.
Semi-structured Data
Semi-structured data is a form of structured data that does not conform
to the formal structure of data models associated with relational
databases or other forms of data tables.
Nonetheless, it contains tags or other markers that separate semantic
elements and enforce hierarchies of records and fields within the data.
It is therefore also known as a self-describing structure.
Delimited files are one example of semi-structured data: they contain elements
that break the data down into separate hierarchies.
Similarly, in digital photographs, the image does not have a pre-defined
structure itself. Still, if it is taken from a smartphone, it would have structured
attributes like geotag, device ID, and Date Time stamp.
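To make the idea of tags and hierarchies concrete, here is a small sketch that parses a JSON record describing a photo and flattens it into column-like names. The record contents and the `flatten` helper are hypothetical; JSON is used here only as one familiar semi-structured format.

```python
import json

# Hypothetical photo metadata: the keys act as the "tags" describing each value.
raw = """
{
  "file": "IMG_0001.jpg",
  "device": {"id": "phone-123", "model": "ExamplePhone"},
  "geotag": {"lat": 19.07, "lon": 72.87},
  "timestamp": "2021-03-05T10:15:00"
}
"""

record = json.loads(raw)


def flatten(obj, prefix=""):
    """Turn nested keys into dotted column names, e.g. 'geotag.lat'."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat


row = flatten(record)  # one flat row, ready for a structured table
```

Flattening is one possible way to move such a record toward a tabular, structured form; other designs keep the nesting and query it directly.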
What is a Database?
“A database is a collection of related data which represents some
elements of the real world. It is designed to be built and populated with
data for a specific task. It is also a building block of the data solution.”
What is a Data Warehouse?
“A data warehouse is an information system which stores historical and
cumulative data from single or multiple sources. It is designed to
analyze, report on, and integrate transaction data from different sources.”
A data warehouse eases the analysis and reporting process of an
organization. It also serves as a single version of the truth for the
organization's decision-making and forecasting processes.
Key Difference
• A database is a collection of related data that represents some elements
of the real world, whereas a data warehouse is an information system
that stores historical and cumulative data from single or multiple
sources.
• Database is designed to record data whereas the Data warehouse is
designed to analyze data.
• Database is application-oriented-collection of data whereas Data
Warehouse is the subject-oriented collection of data.
• Database uses Online Transactional Processing (OLTP) whereas Data
warehouse uses Online Analytical Processing (OLAP).
• Database tables and joins are complicated because they are normalized
whereas Data Warehouse tables and joins are easy because they are
denormalized.
• Entity-relationship (ER) modeling techniques are used for designing a database,
whereas data modeling techniques are used for designing a data warehouse.
Why use a Database?
• It secures data and controls access to it.
• A database offers a variety of techniques to store and retrieve data.
• A database acts as an efficient handler that balances the requirements of
multiple applications using the same data.
• A DBMS offers integrity constraints that provide a high level of protection
and help prevent access to prohibited data.
• A database allows concurrent access to data while ensuring that
only a single user can modify the same data at a time.
Why use Data Warehouse
• A data warehouse helps business users access critical data from several
sources in one place, which saves the time otherwise spent retrieving
information from each source. Data can also be accessed easily from the cloud.
• It provides consistent information on various cross-functional activities.
• It helps you integrate many sources of data to reduce stress on the
production system.
• It helps you reduce TAT (total turnaround time) for
analysis and reporting.
• A data warehouse allows you to store a large amount of historical data
and analyze different periods and trends to make future predictions.
• It enhances the value of operational business applications and customer
relationship management systems.
• It separates analytics processing from transactional databases,
improving the performance of both systems.
• Stakeholders and users may overestimate the quality of data in
the source systems. A data warehouse provides more accurate reports.
Characteristics of Database
• Offers security and removes redundancy
• Allows multiple views of the data
• Follows the ACID properties (Atomicity,
Consistency, Isolation, and Durability)
• Allows insulation between programs and data
• Supports sharing of data and multiuser transaction processing
• Relational databases support a multi-user environment
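The atomicity part of ACID can be illustrated with a short sketch using Python's built-in `sqlite3` module. The `accounts` table, the names, and the `transfer` helper are hypothetical; the point is only that a failed transaction leaves no partial update behind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money between accounts; either both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "alice", "bob", 30)       # both updates succeed
failed = transfer(conn, "alice", "bob", 500)  # second update violates the CHECK
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

In the failed transfer, the credit to `bob` is applied first and then undone by the rollback, so the balances reflect only the successful transfer.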
Characteristics of Data Warehouse
• A data warehouse is subject-oriented: it offers information related to a
theme rather than to a company's ongoing operations.
• Data also needs to be stored in the data warehouse in a common and
universally accepted manner.
• The time horizon for the data warehouse is relatively extensive
compared with operational systems.
• A data warehouse is non-volatile, which means that previous data is not
erased when new information is entered into it.
Applications of Database
Sector: Usage
• Banking: Customer information, account-related activities, payments, deposits, loans, credit cards, etc.
• Airlines: Reservations and schedule information.
• Universities: Student information, course registrations, colleges, and results.
• Telecommunication: Call records, monthly bills, balance maintenance, etc.
• Finance: Information related to stocks, and sales and purchases of stocks and bonds.
• Sales & Production: Customer, product, and sales details.
• Manufacturing: Data management of the supply chain, and tracking of item production and inventory status.
• HR Management: Details of employees' salaries, deductions, generation of paychecks, etc.
Applications of Data Warehousing
Sector: Usage
• Airline: Airline system management operations such as crew assignment, route analysis, frequent flyer program discount schemes for passengers, etc.
• Banking: Managing the resources available on the desk effectively.
• Healthcare sector: Strategizing and predicting outcomes, creating patients' treatment reports, etc. Advanced machine learning and big data enable data warehouse systems to predict ailments.
• Insurance sector: Analyzing data patterns and customer trends, and tracking market movements quickly.
• Retail chain: Tracking items, identifying customers' buying patterns, managing promotions, and determining pricing policy.
• Telecommunication: Product promotions, sales decisions, and distribution decisions.
Relational Vs Non Relational Database
• A relational database is structured, meaning the data is organized in
tables. Many times, the data within these tables have relationships
with one another, or dependencies.
• A non-relational database is often document-oriented, meaning that all
information is stored in more of a laundry-list order. Within a single
construct, or document, you will have all of your data listed out.
SQL Databases (Relational)
SQL is short for Structured Query Language: a strict, standardized way of
querying data organized into tables, columns, and rows.
• For example, if you want to look up what the weather is at a
certain time of day on a certain day, the data could be structured as
follows:
Table: Weather
Columns: Days of the Week
Rows: Time of Day
Data Points: Degrees Fahrenheit
In this structure, all queries would be related to this table and the structure
of the table would allow for easy sorting, filtering, computations, etc.
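The Weather structure above can be sketched with Python's built-in `sqlite3` module. The table and column names, the two days shown, and the temperature values are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per time of day, one column per day; values are degrees Fahrenheit.
conn.execute(
    "CREATE TABLE weather (time_of_day TEXT PRIMARY KEY, monday REAL, tuesday REAL)"
)
conn.executemany(
    "INSERT INTO weather VALUES (?, ?, ?)",
    [("06:00", 58.0, 61.0), ("12:00", 72.0, 75.0), ("18:00", 66.0, 64.0)],
)

# Filtering and sorting: Monday readings above 60 F, warmest first.
warm_monday = conn.execute(
    "SELECT time_of_day, monday FROM weather WHERE monday > 60 ORDER BY monday DESC"
).fetchall()

# Computation: average Tuesday temperature.
(avg_tuesday,) = conn.execute("SELECT AVG(tuesday) FROM weather").fetchone()
```

Because every query targets the same fixed columns, sorting, filtering, and aggregation each take a single SQL statement.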
Relational Database
A relational database works by linking information from multiple tables through the use of
“keys.” A key is a unique identifier which can be assigned to a row of data contained within a
table.
This unique identifier, called a “primary key,” can then be included in a record located in
another table when that record has a relationship to the primary record in the main table.
When this unique primary key is added to a record in another table, it is called a “foreign key”
in the associated table.
The connection between the primary and foreign key then creates the “relationship” between
records contained across multiple tables.
The Employees table contains one row per employee, with each employee
assigned a unique ID (the primary key). In this case, the primary key is named Employee Id.
The second table, Sales, contains individual sales records, each associated with the
employee who made the sale.
Because an employee can make multiple sales, their unique Employee Id (primary key)
can appear multiple times in the Sales table as a foreign key.
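The Employees and Sales relationship can be sketched as follows with Python's built-in `sqlite3` module. The column names beyond the employee ID, and all row values, are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.execute("CREATE TABLE employees (employee_id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE sales ("
    " sale_id INTEGER PRIMARY KEY,"
    " employee_id INTEGER NOT NULL REFERENCES employees(employee_id),"
    " amount REAL NOT NULL)"
)
conn.executemany("INSERT INTO employees VALUES (?, ?)", [(1, "Asha"), (2, "Ben")])
conn.executemany(
    "INSERT INTO sales (employee_id, amount) VALUES (?, ?)",
    [(1, 250.0), (1, 100.0), (2, 75.0)],  # employee 1 appears twice as a foreign key
)

# Follow the key relationship to total each employee's sales.
totals = conn.execute(
    "SELECT e.name, COUNT(s.sale_id), SUM(s.amount)"
    " FROM employees AS e JOIN sales AS s ON s.employee_id = e.employee_id"
    " GROUP BY e.employee_id ORDER BY e.employee_id"
).fetchall()

# A sale that points at a non-existent employee is rejected.
try:
    conn.execute("INSERT INTO sales (employee_id, amount) VALUES (?, ?)", (99, 10.0))
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```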
• Some popular SQL database systems include:
Oracle
Microsoft SQL Server
PostgreSQL
MySQL
MariaDB
NoSQL Databases (Non-Relational Databases)
• In contrast to a relational database, a NoSQL database is one that is less
structured/confined in format, and thus, allows for more flexibility and
adaptability.
• If you are going to be dealing with a dataset that isn’t clearly defined, meaning
not organized or structured, you likely won’t have the luxury of establishing
defined tables and relationships amongst the dataset.
Non Relational Database
• For example, Facebook Messenger uses a NoSQL database, because the
information being gathered isn’t structured enough to be segmented
into tables with defined relationships between them.
• Large volumes of unstructured information need to be held in a non-relational
database. Think of the information as being stored in one large word-processing
document. Everything is there. As more information gets entered, the
document gets longer. If you want to find and pull data, you in essence
press ‘control/command + F’ and search for the data itself.
• Some popular NoSQL databases include:
MongoDB
Cassandra
Redis
Apache HBase
Amazon DynamoDB
Relational Vs Non Relational Database
• Final Showdown: Pros and Cons of Relational and Non-Relational Databases
• Now we answer the question you’re really looking for. Which type of database
should you use?
• Well, there are some questions you should ask yourself that are outlined below. If
you answer yes to the relational questions, then use a SQL database. If you answer
yes to the non-relational questions, then use a NoSQL database.
Pros of a Relational Database
• Data is easily structured into categories.
• Your data is consistent in input and meaning, and easy to navigate.
• Relationships can be easily defined between data points.
Pros of a Non-Relational Database
• Data is not confined to a structured group.
• You can perform functions that allow for greater flexibility.
• Your data and analysis can be more dynamic and allow for more variant inputs.
RDBMS
RDBMS stands for Relational Database Management System. RDBMS is the basis for
SQL, and for all modern database systems like MS SQL Server, IBM DB2, Oracle,
MySQL, and Microsoft Access. A Relational database management system (RDBMS) is a
database management system (DBMS) that is based on the relational model as introduced
by E. F. Codd.
Table
• The data in an RDBMS is stored in database objects called tables. A table
is basically a collection of related data entries, and it consists of numerous columns and
rows.
Field
• Every table is broken up into smaller entities called fields. The fields in the
CUSTOMERS table consist of ID, NAME, AGE, ADDRESS and SALARY.
Record or a Row
• A record, also called a row of data, is each individual entry that exists in a table.
For example, there are 7 records in the above CUSTOMERS table. Following is a
single row of data or record in the CUSTOMERS table −
+----+----------+-----+-----------+----------+
| 1  | Ramesh   | 32  | Ahmedabad | 2000.00  |
+----+----------+-----+-----------+----------+
Column
• A column is a vertical entity in a table that contains all information associated with
a specific field in a table.
• For example, ADDRESS is a column in the CUSTOMERS table; it holds the
location description for every record.
Database Normalization
Database normalization is the process of efficiently organizing data in a database. There are
two reasons for this normalization process −
∙ Eliminating redundant data, for example, the same data stored in more than one table.
∙ Ensuring data dependencies make sense.
• Both are worthy goals, as they reduce the amount of space a database
consumes and ensure that data is logically stored. Normalization consists of a series of
guidelines that help in creating a good database structure.
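As a rough sketch of what removing redundancy looks like, the snippet below splits a flat list of orders, where the customer's city is repeated on every row, into a customers table and an orders table linked by an ID. The data and field names are hypothetical, and this illustrates the general idea rather than any specific normal form.

```python
# Hypothetical denormalized rows: the customer's city is repeated on every order.
flat_orders = [
    {"order_id": 1, "customer": "Ramesh", "city": "Ahmedabad", "amount": 120.0},
    {"order_id": 2, "customer": "Ramesh", "city": "Ahmedabad", "amount": 80.0},
    {"order_id": 3, "customer": "Kavita", "city": "Delhi", "amount": 200.0},
]

customers = {}  # customer name -> {"customer_id", "city"}, each stored once
orders = []     # each order keeps only a reference to its customer

for row in flat_orders:
    if row["customer"] not in customers:
        customers[row["customer"]] = {
            "customer_id": len(customers) + 1,
            "city": row["city"],
        }
    orders.append(
        {
            "order_id": row["order_id"],
            "customer_id": customers[row["customer"]]["customer_id"],
            "amount": row["amount"],
        }
    )
```

After the split, a change of city is made in one place only, and the orders depend solely on the customer ID.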
When comparing relational and non-relational databases, it’s important to first note that
these two very different types of databases are equally useful in their own right—but for
contrasting reasons and use-cases. One type of database is not better than the other type,
and both relational and non-relational databases have their place.
Columnar Database
• A columnar database is a database management system (DBMS) that stores data
in columns instead of rows.
• The goal of a columnar database is to write and read data to and from
hard disk storage efficiently, in order to reduce the time it takes to return a query result.
• In a columnar database, all the column 1 values are physically together, followed
by all the column 2 values, etc.
• The data is stored in record order, so the 100th entry for column 1 and the 100th
entry for column 2 belong to the same input record.
• This allows individual data elements, such as customer name for instance, to be
accessed in columns as a group, rather than individually row-by-row.
• Here is an example of a simple database table with 4 columns and 3 rows.
ID Last First Bonus
1 Doe John 8000
2 Smith Jane 4000
3 Beck Sam 1000
• In a row-oriented database management system, the data would be stored like
this: 1,Doe,John,8000; 2,Smith,Jane,4000; 3,Beck,Sam,1000;
• In a column-oriented database management system, the data would be stored like
this: 1,2,3;Doe,Smith,Beck;John,Jane,Sam;8000,4000,1000;
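The two layouts can be reproduced from the example table with a few lines of Python. This is only a sketch of the storage order, not of how any particular DBMS serializes data; the variable names are illustrative.

```python
header = ["ID", "Last", "First", "Bonus"]
rows = [
    (1, "Doe", "John", 8000),
    (2, "Smith", "Jane", 4000),
    (3, "Beck", "Sam", 1000),
]

# Row-oriented layout: all fields of one record are stored together.
row_layout = " ".join(",".join(str(v) for v in row) + ";" for row in rows)

# Column-oriented layout: all values of one column are stored together.
columns = list(zip(*rows))
column_layout = "".join(",".join(str(v) for v in col) + ";" for col in columns)

# An aggregate over one column only needs to touch that column's values.
bonus = dict(zip(header, columns))["Bonus"]
total_bonus = sum(bonus)
```

Summing the bonuses reads one contiguous group of values in the column layout, whereas the row layout would require stepping through every record.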
• One of the main benefits of a columnar database is that data can be
highly compressed.
• The compression permits columnar operations — like MIN, MAX, SUM,
COUNT and AVG— to be performed very rapidly.
• Another benefit is that, because a column-based DBMS is self-indexing, it uses
less disk space than a relational database management system (RDBMS)
containing the same data.
Data Mining
• Data mining is the process of extracting information from huge sets of data to
identify patterns, trends, and useful data that allow the business to make
data-driven decisions.
• In other words, data mining is the process of investigating hidden
patterns of information from various perspectives and categorizing them into useful data.
That data is collected and assembled in particular areas such as data warehouses,
where efficient analysis and data mining algorithms support decision making and other data
requirements, ultimately cutting costs and generating revenue.
• Data mining is the act of automatically searching large stores of information to
find trends and patterns that go beyond simple analysis procedures. Data mining
uses complex mathematical algorithms to segment the data and evaluate the
probability of future events. Data mining is also called Knowledge Discovery in
Databases (KDD).
• Data Mining is a process used by organizations to extract specific data from huge
databases to solve business problems. It primarily turns raw data into useful
information.
• Data Mining is similar to Data Science carried out by a person, in a specific
situation, on a particular data set, with an objective.
• This process includes various types of services such as text mining, web mining,
audio and video mining, pictorial data mining, and social media mining. It is done
through software that is simple or highly specific.
• By outsourcing data mining, the work can be done faster and at lower operating
costs. Specialized firms can also use new technologies to collect data that is
impossible to locate manually. There are many powerful tools and
techniques available to mine data and gain better insight from it.
Data Mining Techniques
Clustering
• Clustering is the division of information into groups of similar objects.
Describing the data by a few clusters inevitably loses certain fine details, but
achieves simplification.
• It is the task of grouping a set of objects in such a way that objects in the same
group (called a cluster) are more similar (in some sense) to each other than to
those in other groups (clusters).
• It models data by its clusters. From a data modeling point of view, clustering has
historical roots in statistics, mathematics, and numerical analysis.
• From a machine learning point of view, clusters correspond to hidden patterns, the
search for clusters is unsupervised learning, and the resulting model
represents a data concept.
• From a practical point of view, clustering plays an extraordinary role in data
mining applications, for example, scientific data exploration, text mining,
information retrieval, CRM, Web analysis, computational biology, medical
diagnostics, and much more.
• In other words, clustering analysis is a data mining technique for
identifying similar data.
• This technique helps to recognize the differences and similarities between the
data.
• Clustering is very similar to classification, but it groups chunks of
data together based on their similarities rather than on predefined class labels.
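One common way to group similar data is k-means. The slides do not name a specific algorithm, so the sketch below is a swapped-in illustration: a tiny one-dimensional k-means with hypothetical data and hand-picked starting centers.

```python
def kmeans_1d(values, centers, iterations=10):
    """Tiny k-means for 1-D data; `centers` are the initial guesses."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: each value joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its members.
        new_centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        if new_centers == centers:  # converged: assignments will not change
            break
        centers = new_centers
    return centers, clusters


# Hypothetical customer ages with two obvious groups.
ages = [21, 23, 25, 58, 60, 62]
centers, clusters = kmeans_1d(ages, centers=[20, 30])
```

No labels are supplied: the groups emerge from the distances between values alone, which is what makes the search for clusters unsupervised.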
Association
• Association is a data mining technique related to statistics. It indicates that certain
data (or events found in data) are linked to other data or data-driven events. It is
similar to the notion of co-occurrence in machine learning, in which the likelihood
of one data-driven event is indicated by the presence of another.
• The statistical concept of correlation is also similar to the notion of association.
This means that the analysis of data shows that there is a relationship between two
data events: such as the fact that the purchase of hamburgers is frequently
accompanied by that of French fries.
• This data mining technique helps to discover a link between two or more items. It
finds a hidden pattern in the data set.
• Association rules are if-then statements that help to show the probability of
interactions between data items within large data sets in different types of
databases. Association rule mining has several applications and is commonly used
to help discover sales correlations in transactional data or patterns in medical data sets.
• The algorithm starts from a collection of data, for example, a list of
grocery items that you have been buying for the last six months. It calculates the
percentage of items being purchased together.
There are three major measurement techniques:
Support:
This measures how often items A and B are purchased together, compared with the
overall dataset.
(Transactions with Item A and Item B) / (Transactions in entire dataset)
Confidence:
This measures how often item B is purchased when item A is purchased as well.
(Transactions with Item A and Item B) / (Transactions with Item A)
Lift:
This measures the strength of the confidence relative to how often item B is purchased overall.
(Confidence) / ((Transactions with Item B) / (Transactions in entire dataset))
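The three measures can be computed directly from a list of transactions. The sketch below uses a hypothetical five-basket dataset built around the hamburger and French fries example; the function names are illustrative.

```python
# Hypothetical shopping baskets, one set of items per transaction.
transactions = [
    {"hamburger", "fries"},
    {"hamburger", "fries", "cola"},
    {"hamburger"},
    {"fries"},
    {"cola"},
]


def support(items, transactions):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(a, b, transactions):
    """How often B is bought when A is bought: support(A and B) / support(A)."""
    return support({a, b}, transactions) / support({a}, transactions)


def lift(a, b, transactions):
    """Confidence of A -> B relative to how often B is bought overall."""
    return confidence(a, b, transactions) / support({b}, transactions)


support_value = support({"hamburger", "fries"}, transactions)
confidence_value = confidence("hamburger", "fries", transactions)
lift_value = lift("hamburger", "fries", transactions)
```

A lift above 1 suggests that buying item A makes item B more likely than its baseline frequency; a lift near 1 suggests the two are roughly independent.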
Decision Tree
• A decision tree is a structure that includes a root node, branches, and leaf nodes.
• Each internal node denotes a test on an attribute, each branch denotes the outcome
of a test, and each leaf node holds a class label.
• The topmost node in the tree is the root node.
• The following decision tree is for the concept buy_computer; it indicates whether
a customer at a company is likely to buy a computer or not.
The benefits of having a decision tree are as follows −
• It does not require any domain knowledge.
• It is easy to comprehend.
• The learning and classification steps of a decision tree are simple and fast.
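The structure described above (root node, attribute tests, branches, and leaf class labels) can be sketched as nested dictionaries. The attributes, outcomes, and labels below are hypothetical stand-ins for a buy_computer tree; they are not taken from the slides.

```python
# Internal nodes test an attribute; branches map each outcome to a subtree;
# leaves are plain strings holding the class label.
tree = {
    "attribute": "age",  # root node
    "branches": {
        "youth": {
            "attribute": "student",
            "branches": {"yes": "buys", "no": "does not buy"},
        },
        "middle_aged": "buys",
        "senior": {
            "attribute": "credit_rating",
            "branches": {"fair": "buys", "excellent": "does not buy"},
        },
    },
}


def classify(node, customer):
    """Walk from the root to a leaf, following the branch for each test outcome."""
    while isinstance(node, dict):
        outcome = customer[node["attribute"]]
        node = node["branches"][outcome]
    return node


label_student = classify(tree, {"age": "youth", "student": "yes"})
label_middle = classify(tree, {"age": "middle_aged"})
label_senior = classify(tree, {"age": "senior", "credit_rating": "excellent"})
```

Classification only follows one path from root to leaf, which is part of why the classification step is fast.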
Analytical Methodology
• In terms of methodology, analytics differs significantly from the traditional
statistical approach of experimental design. Analytics starts with data. Normally
we model the data in a way that explains a response.
• The objective of this approach is to predict the response behavior or understand
how the input variables relate to a response. In statistical experimental
design, by contrast, an experiment is developed first and data is retrieved as a result.
• This allows data to be generated in a way that can be used by a statistical model,
where certain assumptions hold, such as independence, normality, and
randomization.
• Normally, once the business problem is defined, a research stage is needed to
design the methodology to be used. However, some general guidelines are worth
mentioning and apply to almost all problems.
• One of the most important tasks in big data analytics is statistical modeling,
meaning supervised and unsupervised classification or regression problems.
• Once the data is cleaned, preprocessed, and available for modeling, care should be
taken to evaluate different models with reasonable loss metrics. Once
the model is implemented, further evaluation should be carried out and the results reported.
• A common pitfall in predictive modeling is to just implement the model and never
measure its performance.
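A minimal sketch of measuring performance rather than just fitting a model: fit a line on training data, then compare its mean squared error on held-out data against a naive baseline. The data, the train/test split, and the choice of mean squared error as the loss metric are all assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


def mse(predictions, targets):
    """Mean squared error: the average of the squared prediction errors."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


# Hypothetical data, split into a training part and a held-out test part.
train_x, train_y = [1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8]
test_x, test_y = [5, 6], [10.1, 11.9]

slope, intercept = fit_line(train_x, train_y)
model_mse = mse([slope * x + intercept for x in test_x], test_y)

# Baseline: always predict the training mean. A useful model should beat it.
baseline = sum(train_y) / len(train_y)
baseline_mse = mse([baseline] * len(test_y), test_y)
```

Reporting the model's loss next to a baseline makes it clear whether the model adds value on data it has not seen.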
• Preparing objectives & identifying data requirements
• Data collection
• Understanding data
• Data preparation – data cleansing, normalization