
TLM

INTRODUCTION TO DATA SCIENCE

Data science can be seen as the interdisciplinary field that deals with the creation of insights or data
products from a given set of data files (usually in unstructured form), using analytics methodologies.
The data it handles is often what is commonly known as "big data," although data science is also
applied to conventional data streams, such as those usually encountered in the databases,
spreadsheets, and text documents of a business. We'll take a closer look at big data in a later section.

Data science is not a guaranteed tool for finding the answers to the questions we have about the
data, though it does a good job at shedding some light on what we are investigating. For example, we
may be interested in figuring out the answer to “How can we predict customer attrition based on the
demographics data we have on them?” This is something that may not be possible with that data
alone.

However, investigating the data may help us come up with other questions, like "Can demographics
data supplement a prediction system for attrition based on the orders customers have made?" Also,
data science is only as good as the data we have, so it doesn't make sense to expect breathtaking
insights if that data is of low quality.

The term “data science” combines two key elements: “data” and “science.”
1. Data: It refers to the raw information that is collected, stored, and processed. In today’s digital
age, enormous amounts of data are generated from various sources such as sensors, social
media, transactions, and more. This data can come in structured formats (e.g., databases) or
unstructured formats (e.g., text, images, videos).

2. Science: It refers to the systematic study and investigation of phenomena using scientific
methods and principles. Science involves forming hypotheses, conducting experiments,
analyzing data, and drawing conclusions based on evidence.
What is data science used for?

Data science is used to study data in four main ways:

1. Descriptive analysis

Descriptive analysis examines data to gain insights into what happened or what is happening in
the data environment. It is characterized by data visualizations such as pie charts, bar charts, line
graphs, tables, or generated narratives. For example, a flight booking service may record data like the
number of tickets booked each day. Descriptive analysis will reveal booking spikes, booking slumps,
and high-performing months for this service.
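
As a minimal sketch of descriptive analysis in Python (pandas assumed installed; the file name and the date and tickets columns are hypothetical), we can summarize daily bookings and surface the highest-performing months:

```python
# Descriptive-analysis sketch with pandas; daily_bookings.csv is a
# hypothetical file with "date" and "tickets" columns.
import pandas as pd

df = pd.read_csv("daily_bookings.csv", parse_dates=["date"])

# Summary statistics: overall booking levels
print(df["tickets"].describe())

# Bookings aggregated by month, revealing spikes, slumps,
# and high-performing months
monthly = df.groupby(df["date"].dt.to_period("M"))["tickets"].sum()
print(monthly.sort_values(ascending=False).head())
```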

2. Diagnostic analysis

Diagnostic analysis is a deep-dive or detailed data examination to understand why something
happened. It is characterized by techniques such as drill-down, data discovery, data mining, and
correlations. Multiple data operations and transformations may be performed on a given data set to
discover unique patterns in each of these techniques. For example, the flight service might drill down
on a particularly high-performing month to better understand the booking spike. This may lead to the
discovery that many customers visit a particular city to attend a monthly sporting event.
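
A diagnostic drill-down might look like the following pandas sketch, which isolates one high-performing month and breaks its bookings down by destination. The month, file, and column names are assumptions:

```python
# Diagnostic drill-down sketch with pandas; columns "date", "tickets",
# and "destination" are hypothetical.
import pandas as pd

df = pd.read_csv("daily_bookings.csv", parse_dates=["date"])

# Drill down on one high-performing month to see where the spike came from
spike = df[df["date"].dt.strftime("%Y-%m") == "2023-07"]
print(spike.groupby("destination")["tickets"].sum()
           .sort_values(ascending=False))
```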

3. Predictive analysis

Predictive analysis uses historical data to make accurate forecasts about data patterns that may occur
in the future. It is characterized by techniques such as machine learning, forecasting, pattern
matching, and predictive modeling. In each of these techniques, computers are trained to reverse
engineer causality connections in the data. For example, the flight service team might use data
science to predict flight booking patterns for the coming year at the start of each year. The computer
program or algorithm may look at past data and predict booking spikes for certain destinations in
May. Having anticipated its customers' future travel requirements, the company could start
targeted advertising for those cities from February.
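
As a rough illustration, here is a minimal sketch of such a forecast in Python, fitting a linear trend to one year of monthly booking counts with scikit-learn. The numbers, and the assumption that a simple linear trend is adequate, are invented:

```python
# Toy predictive-analysis sketch: fit a linear trend to monthly
# bookings and forecast the next month.
import numpy as np
from sklearn.linear_model import LinearRegression

months = np.arange(1, 13).reshape(-1, 1)             # months 1..12
bookings = np.array([200, 210, 230, 250, 400, 420,
                     430, 410, 300, 280, 260, 270])  # past year (invented)

model = LinearRegression().fit(months, bookings)
print(model.predict([[13]]))  # forecast for the next month
```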

4. Prescriptive analysis

Prescriptive analytics takes predictive data to the next level. It not only predicts what is likely to
happen but also suggests an optimum response to that outcome. It can analyze the potential
implications of different choices and recommend the best course of action. It uses graph analysis,
simulation, complex event processing, neural networks, and recommendation engines from machine
learning.

Back to the flight booking example, prescriptive analysis could look at historical marketing campaigns
to maximize the advantage of the upcoming booking spike. A data scientist could project booking
outcomes for different levels of marketing spend on various marketing channels. These data forecasts
would give the flight booking company greater confidence in their marketing decisions.
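
A sketch of the prescriptive step under strong assumptions: we invent a diminishing-returns curve linking marketing spend to extra bookings, project profit at each spend level, and recommend the level with the highest projected profit. Every number and the response model itself are hypothetical:

```python
# Prescriptive-analysis sketch: compare scenarios and recommend one.
# The response curve and all figures are invented for illustration.
import numpy as np

spend_levels = np.array([10_000, 20_000, 30_000, 40_000])  # ad budgets
# Hypothetical diminishing-returns response: extra bookings per budget
extra_bookings = 500 * np.log1p(spend_levels / 10_000)
profit = extra_bookings * 60 - spend_levels                # $60 margin/booking

best = spend_levels[np.argmax(profit)]                     # best scenario
print(f"Recommended spend: ${best:,}")
```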

Applications of Data Science:

Healthcare

Finance

Marketing

Retail

Transportation

Education

Entertainment

Manufacturing

Energy

Government
DATA SCIENCE VS. BUSINESS INTELLIGENCE

Data Science: Data science is a field in which information and knowledge are extracted from data
by using various scientific methods, algorithms, and processes. It can thus be defined as a
combination of mathematical tools, algorithms, statistics, and machine learning techniques that
are used to find hidden patterns and insights in the data, which help in the decision-making
process. Data science deals with both structured and unstructured data. It is related to both data
mining and big data. Data science involves studying historic trends and using its conclusions to
redefine present trends and also predict future trends.

Business Intelligence: Business intelligence (BI) is a set of technologies, applications, and
processes that are used by enterprises for business data analysis. It is used to convert raw data
into meaningful information that supports business decision-making and profitable actions. It
deals with the analysis of structured and sometimes unstructured data, which paves the way for
new and profitable business opportunities. It supports decision-making based on facts rather
than assumptions, and thus has a direct impact on the business decisions of an enterprise.
Business intelligence tools enhance the chances of an enterprise entering a new market and help
in studying the impact of marketing efforts.

The key differences, factor by factor:

1. Concept: Data science is a field that uses mathematics, statistics, and various other tools to
discover hidden patterns in the data. Business intelligence is a set of technologies, applications,
and processes used by enterprises for business data analysis.

2. Focus: Data science focuses on the future. Business intelligence focuses on the past and present.

3. Data: Data science deals with both structured and unstructured data. Business intelligence
mainly deals only with structured data.

4. Flexibility: Data science is much more flexible, as data sources can be added as per requirement.
Business intelligence is less flexible, as data sources need to be pre-planned.

5. Method: Data science makes use of the scientific method. Business intelligence makes use of
the analytic method.

6. Complexity: Data science has a higher complexity in comparison to business intelligence, which
is much simpler.

7. Expertise: Data science's expert is the data scientist. Business intelligence's expert is the
business user.

8. Questions: Data science deals with the questions of what will happen and what if. Business
intelligence deals with the question of what happened.

9. Storage: In data science, the data to be used is disseminated in real-time clusters. In business
intelligence, a data warehouse is utilized to hold data.

10. Integration of data: The ELT (Extract-Load-Transform) process is generally used to integrate
data for data science applications. The ETL (Extract-Transform-Load) process is generally used to
integrate data for business intelligence applications.

11. Tools: Data science tools include SAS, BigML, MATLAB, Excel, etc. Business intelligence tools
include InsightSquared Sales Analytics, Klipfolio, ThoughtSpot, Cyfe, TIBCO Spotfire, etc.

12. Usage: Companies can harness their potential by anticipating future scenarios using data
science, in order to reduce risk and increase income. Business intelligence helps in performing
root cause analysis on a failure or in understanding the current status.

13. Business Value: Greater business value is achieved with data science, in comparison to
business intelligence, as it anticipates future events. Business intelligence has lesser business
value, as its extraction process is carried out statically by plotting charts and KPIs (Key
Performance Indicators).

14. Handling data sets: For data science, technologies such as Hadoop are available and others
are evolving for handling large data sets. For business intelligence, sufficient tools and
technologies are not available for handling large data sets.

BUSINESS INTELLIGENCE (BI) VS. STATISTICS


As for business intelligence, although it too deals with business data (almost exclusively), it does so
through rudimentary data analysis methods (mainly statistics), data visualization, and other
techniques, such as reports and presentations, with a focus on business applications. Also, it handles
mainly conventional sized data, almost always structured, with little to no need for in-depth data
analytics. Moreover, business intelligence is primarily concerned with getting useful information from
the data and doesn’t involve the creation of data products (unless you count fancy plots as data
products). Business intelligence is not a kind of data science, nor is it a scientific field. Business
intelligence is essential in many organizations, but if you are after hard-to-find insights or have
challenging data streams in your company’s servers, then business intelligence is not what you are
after. Nevertheless, business intelligence is not completely unrelated to data science either. Given
some training and a lot of practice, a business intelligence analyst can evolve into a data scientist.

Statistics is a field that is similar to data science and business intelligence, but it has its own domain.
Namely, it involves doing basic manipulations on a set of data (usually tidy and easy to work with) and
applying a set of tests and models to that data. It’s like a conventional vehicle that you drive on city
roads. It does a decent job, but you wouldn’t want to take that vehicle to the country roads or off-
road. For this kind of terrain you’ll need something more robust and better-equipped for messy data:
data science. If you have data that comes straight from a database, it’s fairly clean, and all you want to
do is create a simple regression model or check to see if February sales are significantly different from
January sales, analyzing statistics will work. That’s why statisticians remain in business, even if most
of the methods they use are not as effective as the techniques a data scientist employs. Scientists
make use of statistics, though it is not formally a scientific field. This is an important point. In fact,
even mathematicians look down on the field of statistics, for the simple reason that it fails to create
robust theories that can be generalized to other aspects of Mathematics. So, even though statistical
techniques are employed in various areas, they are inherently inferior to most principles of
Mathematics and of Science. Also, statistics is not a fool-proof framework when it comes to drawing
inferences about the data. Despite the confidence metrics it provides, its results are only as good as
the assumptions it makes about the distribution of each variable, and how well these assumptions
hold. This is why many scientists also employ simulation methods to ensure that the conclusions their
statistical models come up with are indeed viable and robust enough to be used in the real world.

The key differences, factor by factor:

Purpose: Business intelligence converts raw data into meaningful information for decision-making
and identifying business opportunities. Statistics summarizes, analyzes, and infers properties of a
population based on sample data.

Focus: Business intelligence focuses on past and present data to support decision-making and
strategic planning. Statistics focuses on understanding patterns, testing hypotheses, and making
predictions based on sample data.

Data Type: Business intelligence handles primarily structured data, but can also handle
semi-structured and unstructured data. Statistics deals with structured data, but also with
semi-structured and unstructured data in some advanced applications.

Methodology: Business intelligence uses analytic methods and data visualization techniques to
identify trends and insights. Statistics uses mathematical theories, probabilistic models, and
statistical tests to draw conclusions.

Tools: Business intelligence uses data warehousing, dashboards, and reporting tools like Power BI,
Tableau, and SQL. Statistics uses statistical software like SPSS, R, SAS, and Python libraries like
NumPy and Pandas.

Output: Business intelligence produces visual reports, dashboards, KPIs, and actionable business
insights. Statistics produces statistical summaries, p-values, confidence intervals, and regression
models.

Expertise: Business intelligence is practiced by business analysts, BI developers, and data
engineers. Statistics is practiced by statisticians, data scientists, and researchers.

Data Handling: Business intelligence relies on the ETL (Extract-Transform-Load) process and
real-time data processing. Statistics relies on data sampling, hypothesis testing, and inferential
statistics.

Complexity: Business intelligence is less complex, focusing on user-friendly tools and interfaces
for business users. Statistics is more complex, involving in-depth mathematical and statistical
analysis.

Time Orientation: Business intelligence is retrospective (what happened) and descriptive (what is
happening). Statistics is predictive (what will happen) and inferential (what is likely to be true).

Flexibility: Business intelligence is less flexible, relying on predefined data sources and structures.
Statistics is more flexible, adaptable to various types of data and methodologies.

Business Value: Business intelligence has a direct impact on business decisions, strategic planning,
and performance measurement. Statistics provides insights for making informed decisions and
understanding underlying data patterns.

BIG DATA

Big data refers to extremely large datasets that are complex, grow rapidly, and require advanced
techniques and technologies for storage, analysis, and processing. Here’s an overview of big data, its
characteristics, and its applications:

Characteristics of Big Data

1. Volume: The sheer amount of data generated every second from various sources, such as social
media, sensors, transactions, and more.

2. Velocity: The speed at which new data is generated and the pace at which it must be processed
to be useful.

3. Variety: The different types of data, including structured, semi-structured, and unstructured data
(e.g., text, images, videos, sensor data).
4. Veracity: The quality and accuracy of the data, which can vary and affect the reliability of
analysis.

5. Value: The potential insights and benefits that can be derived from analyzing the data.

6. Variability: Variability often applies to sets of big data, which might have multiple meanings or be
formatted differently in separate data sources.
Technologies and Tools for Big Data

1. Storage Solutions: Distributed file systems like Hadoop Distributed File System (HDFS) and
cloud storage solutions such as Amazon S3.

2. Processing Frameworks:

Hadoop: An open-source framework for distributed storage and processing.

Spark: A fast and general-purpose cluster computing system for big data processing (a minimal sketch follows this list).

3. Data Management: NoSQL databases (e.g., MongoDB, Cassandra) for handling unstructured
data.

4. Data Integration: Tools like Apache Kafka and Apache NiFi for data ingestion and streaming.

5. Analytics and Machine Learning: Tools and frameworks like Apache Mahout, TensorFlow, and
H2O.ai for big data analytics and machine learning.

6. Visualization: Tools like Tableau, Power BI, and D3.js for visualizing large datasets.
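
To make the Spark entry above concrete, here is a minimal PySpark sketch that counts bookings per destination. It assumes a local Spark installation and a hypothetical bookings.csv file; it is illustrative rather than a production job:

```python
# Minimal PySpark sketch: aggregate a (hypothetical) bookings dataset.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("BookingCounts").getOrCreate()

# Read a CSV of flight bookings and count bookings per destination
bookings = spark.read.csv("bookings.csv", header=True, inferSchema=True)
(bookings.groupBy("destination")
         .agg(F.count("*").alias("num_bookings"))
         .orderBy(F.desc("num_bookings"))
         .show())

spark.stop()
```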

Applications of Big Data

1. Healthcare: Predictive analytics for patient care, personalized medicine, and epidemic outbreak
prediction.

2. Finance: Fraud detection, algorithmic trading, and risk management.

3. Retail: Customer behavior analysis, inventory management, and personalized marketing.

4. Telecommunications: Network optimization, predictive maintenance, and customer churn
analysis.

5. Government: Smart city initiatives, public safety, and efficient resource management.

6. Entertainment: Content recommendation systems, audience analysis, and sentiment analysis on
social media.

Big data benefits


Organizations that use and manage large data volumes correctly can reap many benefits, such as the
following:

Enhanced decision-making. An organization can glean important insights, risks, patterns or
trends from big data. Large data sets are meant to be comprehensive and encompass as much
information as the organization needs to make better decisions. Big data insights let business
leaders quickly make data-driven decisions that impact their organizations.
Better customer and market insights. Big data that covers market trends and consumer habits
gives an organization the important insights it needs to meet the demands of its intended
audiences. Product development decisions, in particular, benefit from this type of insight.

Cost savings. Big data can be used to pinpoint ways businesses can enhance operational
efficiency. For example, analysis of big data on a company's energy use can help it be more
efficient.

Positive social impact. Big data can be used to identify solvable problems, such as improving
healthcare or tackling poverty in a certain area.

Big data challenges


There are common challenges for data experts when dealing with big data. They include the
following:

Architecture design. Designing a big data architecture focused on an organization's processing
capacity is a common challenge for users. Big data systems must be tailored to an organization's
particular needs. These types of projects are often do-it-yourself undertakings that require IT
and data management teams to piece together a customized set of technologies and tools.

Skill requirements. Deploying and managing big data systems also requires new skills compared
to the ones that database administrators and developers focused on relational software typically
possess.

Costs. Using a managed cloud service can help keep costs under control. However, IT managers
still must keep a close eye on cloud computing use to make sure costs don't get out of hand.

Migration. Migrating on-premises data sets and processing workloads to the cloud can be a
complex process.

Accessibility. Among the main challenges in managing big data systems is making the data
accessible to data scientists and analysts, especially in distributed environments that include a
mix of different platforms and data stores. To help analysts find relevant data, data management
and analytics teams are increasingly building data catalogs that incorporate metadata
management and data lineage functions.

Integration. The process of integrating sets of big data is also complicated, particularly when data
variety and velocity are factors.

Introduction to Machine Learning


What is Machine Learning?
Machine learning is a subset of artificial intelligence (AI) that involves the development of algorithms
and statistical models that enable computers to perform tasks without explicit instructions. Instead of
being programmed to follow specific instructions, machine learning algorithms identify patterns in
data and make predictions or decisions based on that data.
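
To make this concrete, here is a minimal scikit-learn sketch in which the pattern (roughly, "more study hours means passing") is learned from labeled examples rather than coded by hand. The data is invented:

```python
# Minimal illustration of the core idea: the rule is learned from
# examples, not programmed explicitly. Data is invented.
from sklearn.tree import DecisionTreeClassifier

# Feature: hours of study; label: 1 = passed the exam, 0 = failed
X = [[1], [2], [3], [7], [8], [9]]
y = [0, 0, 0, 1, 1, 1]

clf = DecisionTreeClassifier().fit(X, y)  # the pattern is learned here
print(clf.predict([[2.5], [8.5]]))        # -> [0 1]
```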

Types of Machine Learning


1. Supervised Learning: In supervised learning, the model is trained on a labeled dataset, which
means that each training example is paired with an output label. The goal is for the model to
learn to predict the output from the input data. Examples include regression and classification
tasks.
2. Unsupervised Learning: In unsupervised learning, the model is trained on an unlabeled dataset,
and it tries to learn the underlying structure of the data. Examples include clustering and
association tasks.

3. Semi-Supervised Learning: This approach combines a small amount of labeled data with a large
amount of unlabeled data during training. It falls between supervised and unsupervised learning.

4. Reinforcement Learning: In reinforcement learning, the model learns by interacting with an
environment and receiving rewards or penalties based on its actions. The goal is to learn a
strategy that maximizes cumulative rewards.
Applications of Machine Learning

Machine learning is used in various domains to solve complex problems and automate tasks. Some
common applications include:

1. Natural Language Processing (NLP): Used in language translation, sentiment analysis, and
chatbots.

2. Computer Vision: Applied in facial recognition, image classification, and autonomous vehicles.

3. Healthcare: Used for disease prediction, personalized treatment plans, and medical imaging
analysis.

4. Finance: Used for fraud detection, algorithmic trading, and credit scoring.

5. Marketing: Applied in customer segmentation, recommendation systems, and targeted
advertising.

Life Cycle of Machine Learning


1. Data gathering

2. Data preparation

3. Data wrangling

4. Data analysis

5. Model training

6. Model testing

7. Deployment
Types of Machine Learning Algorithms

Machine learning algorithms can be broadly classified into several categories based on their
learning styles and the nature of tasks they are designed to solve. Here are the primary types of
machine learning algorithms:

1. Supervised Learning Algorithms

Supervised learning algorithms are trained on labeled data. This means that each training example
is paired with an output label. The algorithm learns to predict the output from the input data.

Linear Regression: Used for predicting continuous values. It models the relationship between
a dependent variable and one or more independent variables using a linear equation.

Logistic Regression: Used for binary classification problems. It predicts the probability of a
binary outcome using a logistic function.

Support Vector Machines (SVM): Used for both classification and regression tasks. It finds the
optimal hyperplane that separates data points of different classes with the maximum margin.

Decision Trees: Used for classification and regression tasks. It splits the data into subsets
based on the value of input features, forming a tree-like structure.
Random Forests: An ensemble learning method that combines multiple decision trees to
improve predictive performance and reduce overfitting.

k-Nearest Neighbors (k-NN): A simple, instance-based learning algorithm that classifies a data
point based on the majority class among its k nearest neighbors.

Naive Bayes: Based on Bayes' theorem, it assumes independence between features and is
used for classification tasks.
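
As a minimal end-to-end sketch of supervised learning, the following scikit-learn snippet trains a logistic regression classifier on a tiny, invented churn dataset and evaluates it on held-out examples:

```python
# Supervised-learning sketch: train on labeled data, predict on new data.
# The dataset is invented purely for illustration.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Features: [tickets booked last year, years as a customer]
X = [[1, 1], [2, 1], [12, 5], [0, 1], [9, 4], [3, 2], [15, 6], [1, 2]]
# Labels: 1 = customer churned, 0 = customer stayed (hypothetical)
y = [1, 1, 0, 1, 0, 1, 0, 1]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LogisticRegression().fit(X_train, y_train)  # learn the mapping
print(model.predict(X_test))         # predicted labels for held-out data
print(model.score(X_test, y_test))   # accuracy on the held-out data
```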
2. Unsupervised Learning Algorithms

Unsupervised learning algorithms are trained on unlabeled data. They try to learn the underlying
structure of the data without any explicit instructions on what to predict.

K-Means Clustering: Partitions data into k clusters based on feature similarity. Each data
point is assigned to the nearest cluster center.

Hierarchical Clustering: Builds a hierarchy of clusters either by merging smaller clusters into
larger ones (agglomerative) or by splitting larger clusters into smaller ones (divisive).

Principal Component Analysis (PCA): Reduces the dimensionality of data by transforming it
into a set of linearly uncorrelated components called principal components.

t-Distributed Stochastic Neighbor Embedding (t-SNE): Used for dimensionality reduction and
visualization of high-dimensional data.

Association Rule Learning: Identifies interesting relationships between variables in large
databases. Apriori and Eclat are common algorithms used for this purpose.
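
A minimal unsupervised sketch: k-means clustering with scikit-learn on a small, made-up 2-D dataset with two obvious groups:

```python
# Unsupervised-learning sketch: k-means on invented 2-D points.
import numpy as np
from sklearn.cluster import KMeans

# Two loose groups of points, purely illustrative
X = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2],
              [8.0, 9.0], [8.2, 8.8], [7.8, 9.1]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment for each point
print(kmeans.cluster_centers_)  # learned cluster centers
```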

3. Semi-Supervised Learning Algorithms

Semi-supervised learning algorithms use a combination of a small amount of labeled data and a
large amount of unlabeled data. This approach helps improve learning accuracy when labeled data
is scarce.

Self-Training: Uses a model trained on labeled data to predict labels for the unlabeled data.
The model is then retrained on the combined dataset.

Co-Training: Utilizes two or more models trained on different views of the data to label the
unlabeled data, iteratively improving each other.
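
A brief self-training sketch using scikit-learn's SelfTrainingClassifier, which follows the library's convention of marking unlabeled examples with -1. The data is invented:

```python
# Semi-supervised sketch: self-training with a small labeled set.
import numpy as np
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

X = np.array([[0.0], [0.2], [0.1], [5.0], [5.2], [4.9], [0.15], [5.1]])
y = np.array([0, 0, -1, 1, 1, -1, -1, -1])  # -1 means "label unknown"

base = SVC(probability=True, gamma="auto")
model = SelfTrainingClassifier(base).fit(X, y)  # labels propagate iteratively
print(model.predict([[0.05], [5.05]]))          # labels for new points
```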

4. Reinforcement Learning Algorithms

Reinforcement learning algorithms learn by interacting with an environment. The algorithm, known
as an agent, takes actions and receives rewards or penalties based on the outcomes of those
actions. The goal is to learn a strategy that maximizes cumulative rewards.

Q-Learning: A value-based method that aims to learn the value of taking a particular action in
a particular state.

Deep Q-Networks (DQN): Combines Q-learning with deep neural networks to handle high-
dimensional state spaces.

Policy Gradient Methods: Directly optimizes the policy by adjusting the parameters in the
direction that maximizes expected rewards.

Proximal Policy Optimization (PPO): An advanced policy gradient method that balances
exploration and exploitation to improve training stability.
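
A tiny tabular Q-learning sketch on an invented 1-D corridor: the agent starts in state 0 and is rewarded only on reaching state 3, so the learned Q-values come to favor moving right in every state:

```python
# Tabular Q-learning sketch on a made-up 4-state corridor.
import random

n_states, n_actions = 4, 2          # actions: 0 = left, 1 = right
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma, epsilon = 0.1, 0.9, 0.2

for _ in range(500):                # episodes
    s = 0
    while s != 3:
        # epsilon-greedy action selection
        a = (random.randrange(n_actions) if random.random() < epsilon
             else max(range(n_actions), key=lambda x: Q[s][x]))
        s_next = max(s - 1, 0) if a == 0 else s + 1
        r = 1.0 if s_next == 3 else 0.0
        # Q-learning update rule
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
        s = s_next

print(Q)  # "right" ends up with the higher value in each state
```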
5. Ensemble Learning Algorithms

Ensemble learning algorithms combine the predictions of multiple base models to produce a final
prediction. This approach often improves the accuracy and robustness of the model.

Bagging (Bootstrap Aggregating): Builds multiple models from different subsamples of the
training dataset and aggregates their predictions. Random Forest is a popular bagging
algorithm.

Boosting: Builds models sequentially, each trying to correct the errors of the previous one.
Common boosting algorithms include AdaBoost, Gradient Boosting, and XGBoost.

Stacking: Combines multiple models by training a meta-model to make the final prediction
based on the outputs of the base models.
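
To illustrate the ensemble idea, the following scikit-learn sketch compares a single decision tree with a random forest (a bagging ensemble) on the library's built-in iris dataset, using cross-validation:

```python
# Ensemble sketch: single tree vs. random forest on the iris dataset.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

print(cross_val_score(tree, X, y, cv=5).mean())    # single model
print(cross_val_score(forest, X, y, cv=5).mean())  # bagged ensemble
```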
