Unit 2: Data Science

This unit outlines the tools used in Data Science, categorized into programming, visualization, machine learning, big data, cloud, and deployment tools. Each category covers key features, popular tools, and use cases, with emphasis on open-source tools such as Jupyter and Google Colab for collaboration and efficiency. It also maps tools to the stages of the Data Science workflow, highlighting their applications in data collection, cleaning, modeling, and deployment.

Tools for Data Science

Data Science involves collecting, processing, analyzing, and visualizing data to extract
meaningful insights.
To perform these tasks efficiently, different categories of tools are used at various stages of the
Data Science workflow.

1. Programming Tools 🧑‍💻

Programming tools are the backbone of Data Science. They are used for data manipulation,
analysis, automation, and modeling.

Key Features

• Handle large datasets efficiently.
• Provide statistical and mathematical libraries.
• Offer automation and integration with other tools.
• Support data cleaning, preprocessing, and modeling.

Popular Programming Tools


| Tool | Description | Use Cases |
|------|-------------|-----------|
| Python | Open-source, versatile, easy to learn, with rich libraries like NumPy, Pandas, Matplotlib, and Scikit-learn | Data cleaning, ML models, visualization |
| R | Statistical computing and data visualization tool | Hypothesis testing, regression, advanced statistics |
| SQL | Structured Query Language for databases | Fetching, updating, and analyzing database data |
| Scala | High-performance language for big data (works with Apache Spark) | Streaming data processing |
| Julia | High-performance computing and numerical analysis | Scientific computing and large-scale data |

Example

• Using Python (Pandas) to clean raw data (see the sketch below).
• Using R (ggplot2) to create advanced statistical graphs.
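
A minimal Pandas sketch of the first bullet, assuming a hypothetical raw_data.csv with duplicate rows and missing values:

import pandas as pd

# Load the raw data (raw_data.csv is a hypothetical file for illustration)
df = pd.read_csv("raw_data.csv")

# Drop exact duplicate rows and fill missing numeric values with column means
df = df.drop_duplicates()
df = df.fillna(df.mean(numeric_only=True))

# Standardize column names for downstream analysis
df.columns = df.columns.str.strip().str.lower()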

2. Data Visualization Tools 📊

Visualization tools help in representing data graphically to make insights clear and
understandable.

Key Features

• Provide charts, plots, dashboards, and heatmaps.
• Allow interactive exploration of datasets.
• Help in data storytelling for business decisions.

Popular Visualization Tools


| Tool | Description | Use Cases |
|------|-------------|-----------|
| Matplotlib | Python library for static visualizations | Line charts, bar graphs, scatter plots |
| Seaborn | Built on Matplotlib, with better styling | Heatmaps, correlation plots |
| Tableau | Business intelligence tool for dashboards | Sales forecasting, KPI dashboards |
| Power BI | Microsoft's interactive visualization tool | Real-time business analytics |
| Google Data Studio | Free dashboarding tool | Online reporting and dashboards |
| Plotly | Interactive visualization library | 3D plots, live dashboards |


Example

• Using Tableau to create an interactive sales performance dashboard.
• Using Seaborn to visualize a correlation heatmap in Python (see the sketch below).
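
A minimal sketch of the Seaborn bullet, assuming a hypothetical sales.csv with several numeric columns:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("sales.csv")

# Compute pairwise correlations of the numeric columns and plot them as a heatmap
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.title("Correlation heatmap")
plt.show()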

3. Machine Learning (ML) Tools 🤖

Machine learning tools are used to train models, make predictions, and automate decision-making.

Key Features

• Provide frameworks and libraries for ML models.
• Support supervised, unsupervised, and deep learning.
• Integrate with Python, R, and other programming tools.

Popular ML Tools
| Tool | Description | Use Cases |
|------|-------------|-----------|
| Scikit-learn | Python library for basic ML models | Regression, classification, clustering |
| TensorFlow | Open-source deep learning framework by Google | Image recognition, NLP, AI chatbots |
| Keras | Simplified API for TensorFlow | Fast neural network prototyping |
| PyTorch | Facebook's deep learning framework | NLP, computer vision, research |
| H2O.ai | AutoML tool | Automated model building |
| RapidMiner | Visual ML platform | Predictive analytics |

Example

• Using TensorFlow to build an image recognition system.
• Using Scikit-learn for spam email classification (see the sketch below).
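
A minimal Scikit-learn sketch of the spam classification bullet, using a tiny made-up dataset:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data (1 = spam, 0 = not spam)
emails = ["Win a free prize now", "Meeting agenda for Monday",
          "Claim your free reward", "Lunch at noon?"]
labels = [1, 0, 1, 0]

# Bag-of-words features feeding a Naive Bayes classifier, in one pipeline
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)

print(model.predict(["free prize waiting for you"]))  # expected: [1] (spam)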
4. Big Data Tools 🗄️

Big Data tools handle large-scale, high-volume, and high-velocity datasets that traditional
databases cannot process.

Key Features

• Manage structured, semi-structured, and unstructured data.
• Enable distributed computing.
• Perform real-time data processing.

Popular Big Data Tools


| Tool | Description | Use Cases |
|------|-------------|-----------|
| Apache Hadoop | Open-source framework for distributed data storage & processing | Batch processing of petabytes of data |
| Apache Spark | Faster alternative to Hadoop with in-memory processing | Real-time data analytics |
| Hive | Data warehouse built on Hadoop | SQL-like queries on big data |
| Kafka | Real-time data streaming tool | Event-driven applications |
| Flink | Stream processing framework | IoT and real-time analytics |
| MongoDB | NoSQL database | Handling unstructured big data |

Example

• Using Apache Spark to process Twitter data streams in real time.
• Using Hadoop for distributed storage of e-commerce logs (see the Spark sketch below).
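
A minimal PySpark sketch of distributed log analysis, assuming a hypothetical logs.csv with user_id and amount columns:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local Spark session; on a cluster the same code runs distributed
spark = SparkSession.builder.appName("log-analysis").getOrCreate()

logs = spark.read.csv("logs.csv", header=True, inferSchema=True)

# Total spend per user, computed in parallel across partitions
per_user = logs.groupBy("user_id").agg(F.sum("amount").alias("total_spent"))
per_user.show()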

5. Cloud Tools ☁️

Cloud platforms provide scalable infrastructure and services for data storage, computation, and
model deployment.
Key Features

• On-demand storage and compute power.
• Integration with ML, AI, and big data frameworks.
• Enable collaboration and remote access.

Popular Cloud Tools


| Tool | Description | Use Cases |
|------|-------------|-----------|
| AWS (Amazon Web Services) | Popular cloud platform offering ML, big data, and storage services | Hosting ML models, data lakes |
| Google Cloud Platform (GCP) | AI- and ML-friendly cloud services | TensorFlow integration |
| Microsoft Azure | Cloud computing with strong ML tools | Business analytics, MLOps |
| Databricks | Unified data analytics platform | Big data + ML integration |
| Snowflake | Cloud data warehouse | Handling structured data |

Example

• Using AWS S3 for data storage (see the sketch below).
• Using GCP AI Platform to train and deploy ML models.
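
A minimal sketch of the AWS S3 bullet using the boto3 library (the bucket and key names are hypothetical, and AWS credentials are assumed to be configured in the environment):

import boto3

s3 = boto3.client("s3")

# Upload a local file into the data lake bucket
s3.upload_file("sales.csv", "my-data-lake-bucket", "raw/sales.csv")

# List everything stored under the raw/ prefix
response = s3.list_objects_v2(Bucket="my-data-lake-bucket", Prefix="raw/")
for obj in response.get("Contents", []):
    print(obj["Key"])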

6. Deployment Tools 🚀

Deployment tools help in deploying ML models and data science applications for real-world
use.

Key Features

• Enable continuous integration (CI) and continuous deployment (CD).
• Support API creation for models.
• Ensure scalability and monitoring.
Popular Deployment Tools
| Tool | Description | Use Cases |
|------|-------------|-----------|
| Docker | Containerization platform for packaging apps | Deploying ML models as containers |
| Kubernetes | Container orchestration platform | Scaling ML services |
| Flask / FastAPI | Python frameworks for building ML APIs | Serving predictions via APIs |
| Streamlit | Interactive app development framework | Building dashboards and ML apps |
| MLflow | End-to-end ML lifecycle management | Model tracking and deployment |

Example

• Using Docker + Kubernetes to deploy a fraud detection ML model at scale (the API layer for such a model is sketched below).
• Using Streamlit to build an interactive loan prediction app.
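
A minimal FastAPI sketch of serving predictions via an API (model.joblib is a hypothetical pre-trained Scikit-learn model saved with joblib; the feature names are also made up):

from fastapi import FastAPI
import joblib

app = FastAPI()
model = joblib.load("model.joblib")  # hypothetical trained classifier

@app.get("/predict")
def predict(income: float, loan_amount: float):
    # Wrap the features into the 2D shape Scikit-learn models expect
    prediction = model.predict([[income, loan_amount]])
    return {"prediction": int(prediction[0])}

Run with uvicorn (e.g. uvicorn app:app) and the model is available over HTTP; the same app can then be packaged with Docker and scaled with Kubernetes.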

Summary Table 🧩
| Category | Purpose | Examples |
|----------|---------|----------|
| Programming | Data cleaning, analysis, modeling | Python, R, SQL |
| Visualization | Graphical data representation | Tableau, Power BI, Matplotlib |
| Machine Learning | Training and testing models | TensorFlow, Scikit-learn, PyTorch |
| Big Data | Handling large-scale datasets | Hadoop, Spark, Kafka |
| Cloud | Storage, computation, scalability | AWS, GCP, Azure |
| Deployment | Making models usable in production | Docker, Flask, Kubernetes |


"Tools for Data Science: Mapping Tools to Data Science Workflow Stages & Open-Source
Ecosystem (Jupyter, Google Colab)"

This will cover:

1. Mapping tools to Data Science workflow stages (with diagrams + examples).


2. Introduction to open-source ecosystem.
3. Detailed explanation of Jupyter & Google Colab.

Let’s start step by step.

Tools for Data Science


Data Science is a multistage process that involves data collection, cleaning, analysis,
visualization, modeling, deployment, and monitoring.
Different tools are mapped to each workflow stage based on their capabilities.

1. Mapping Tools to Data Science Workflow Stages 🗺️

The Data Science workflow has seven key stages:


(1) Data Collection → (2) Data Cleaning → (3) Data Exploration & Visualization → (4)
Data Modeling (ML) → (5) Model Evaluation → (6) Deployment → (7) Monitoring &
Maintenance.

Below is a detailed mapping:

| Workflow Stage | Purpose | Popular Tools | Examples / Use Cases |
|----------------|---------|---------------|----------------------|
| 1. Data Collection | Collecting raw data from various sources | SQL, APIs, web scraping tools, Pandas, Google BigQuery | Extracting customer purchase history from MySQL; scraping news data with Python (BeautifulSoup) |
| 2. Data Cleaning & Preprocessing | Handling missing, duplicate, and inconsistent data | Python (Pandas, NumPy), R, OpenRefine, Excel | Removing null values, correcting inconsistent data formats |
| 3. Data Exploration & Visualization | Understanding data patterns and relationships | Matplotlib, Seaborn, Tableau, Power BI, Plotly | Creating heatmaps to find correlations, visualizing sales trends |
| 4. Data Modeling (Machine Learning) | Building predictive and analytical models | Scikit-learn, TensorFlow, PyTorch, Keras, H2O.ai | Training ML models to predict house prices |
| 5. Model Evaluation | Measuring model performance | Scikit-learn metrics, MLflow, Keras | Using accuracy, F1-score, and confusion matrices to evaluate a model |
| 6. Deployment | Making models available for real-world use | Flask, FastAPI, Docker, Kubernetes, Streamlit | Deploying a spam detection model via a web app |
| 7. Monitoring & Maintenance | Tracking performance and updating models | Prometheus, Grafana, MLflow, Airflow | Monitoring data drift, updating ML models over time |
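
As an illustration of stages 5 and 7, a minimal MLflow sketch that logs a run's parameters and metrics so model performance can be tracked over time (the values here are made up):

import mlflow

# Each run's parameters and metrics become comparable in the MLflow UI
with mlflow.start_run():
    mlflow.log_param("model_type", "RandomForest")
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("accuracy", 0.93)
    mlflow.log_metric("f1_score", 0.91)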

Example Workflow

If you're building a movie recommendation system:

• Collect movie ratings from a database using SQL.
• Clean the dataset using Pandas.
• Visualize trends using Seaborn.
• Train a recommendation model using Scikit-learn (see the sketch below).
• Deploy the model using Flask on a website.
• Monitor performance using MLflow.
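
A minimal sketch of the modeling step, using Scikit-learn's NearestNeighbors for item-based recommendations (the ratings matrix is made up for illustration):

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical ratings matrix: rows = movies, columns = users
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [0, 1, 5, 4],
    [1, 0, 4, 5],
])

# Movies with similar rating patterns are "close" under cosine distance
model = NearestNeighbors(metric="cosine", n_neighbors=2).fit(ratings)
distances, indices = model.kneighbors(ratings[[0]])
print("Most similar movie to movie 0:", indices[0][1])  # skips movie 0 itself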

2. Open-Source Ecosystem for Data Science 🌐

Open-source tools are freely available, community-driven, and widely used in Data Science.
They allow researchers and engineers to collaborate, share code, and build solutions faster.

Advantages of Open-Source Ecosystem

• Free & accessible: No licensing costs.


• Community support: Large developer community for help and contributions.
• Integration: Works well with other tools and platforms.
• Transparency: Source code is open for review and customization.
• Innovation: Regular updates and features contributed by global experts.
Popular Open-Source Tools
| Tool | Category | Use Case |
|------|----------|----------|
| Jupyter Notebook | Interactive environment | Writing, running, and sharing Python code |
| Google Colab | Cloud-based notebooks | Running Python/ML code without local setup |
| Apache Spark | Big data processing | Large-scale distributed data analysis |
| Scikit-learn | Machine learning | Predictive modeling |
| TensorFlow / PyTorch | Deep learning | AI, NLP, image processing |
| Pandas & NumPy | Data manipulation | Data cleaning and transformation |
| Matplotlib / Seaborn / Plotly | Visualization | Graphs, charts, dashboards |

3. Jupyter Notebook 📝

Jupyter Notebook is one of the most widely used open-source tools in Data Science.

Definition

Jupyter Notebook is an interactive web-based environment for writing and running Python code, visualizations, and documentation in one place.

Key Features

• Supports live code execution.
• Allows combining code, text, formulas, and visuals in one file.
• Compatible with Python, R, Julia, Scala, and other languages via kernels.
• Supports data visualization libraries like Matplotlib, Seaborn, and Plotly.
• Produces interactive dashboards.

Advantages

• Great for experimentation and analysis.
• Easy data visualization integration.
• Widely used in education and research.
• Notebooks can be shared via .ipynb files or GitHub.

Use Cases

• Exploratory Data Analysis (EDA)
• Prototyping ML models
• Visualization and reporting
• Interactive learning

Example
import pandas as pd
import seaborn as sns

# sales.csv is assumed to contain "month" and "revenue" columns
df = pd.read_csv("sales.csv")
sns.barplot(x="month", y="revenue", data=df)

This renders a bar chart of revenue by month directly inside the notebook.

4. Google Colab ☁️

Google Colab (Colaboratory) is a free, cloud-based platform for running Jupyter notebooks online.

Definition

Google Colab is an online Python notebook environment hosted by Google that provides free GPU and TPU support for data science and machine learning.

Key Features

• Cloud-based: No installation required.
• Free GPU/TPU access: Speeds up ML and deep learning tasks.
• Supports Jupyter notebooks: Fully compatible with .ipynb files.
• Integration with Google Drive: Easy sharing and collaboration.
• Pre-installed ML and data libraries like TensorFlow, Keras, and Scikit-learn.

Advantages

• Beginner-friendly: No setup needed.
• Scalable: Handles large datasets easily.
• Collaborative: Multiple users can work together in real time.
• Ideal for deep learning experiments.

Use Cases

• Training deep learning models using TensorFlow or PyTorch.
• Sharing ML projects with teams.
• Running data science experiments directly in the cloud.

Example
import tensorflow as tf

# Load the MNIST handwritten digit dataset
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Scale pixel values from 0-255 to 0-1 for stable training
x_train, x_test = x_train / 255.0, x_test / 255.0

# A small feed-forward network: flatten the 28x28 image, one hidden layer,
# and a 10-way softmax output (one class per digit)
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5)

You can train a neural network on Colab without any local hardware setup.
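
Since the free accelerator is Colab's main draw, a quick check (assuming a TensorFlow runtime) confirms whether a GPU is actually attached:

import tensorflow as tf

# An empty list means the runtime is CPU-only; change it under Runtime settings
print("GPUs available:", tf.config.list_physical_devices("GPU"))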

5. Jupyter vs Google Colab 🆚


| Aspect | Jupyter Notebook | Google Colab |
|--------|------------------|--------------|
| Installation | Requires local setup | No installation (cloud-based) |
| Hardware | Uses your PC's resources | Uses Google's free GPU/TPU |
| Storage | Local storage | Integrated with Google Drive |
| Collaboration | Manual file sharing | Real-time sharing like Google Docs |
| Best For | Offline experimentation | Cloud-based ML and big data projects |

Final Summary 🧩

• The Data Science workflow involves data collection, cleaning, analysis, visualization, modeling, deployment, and monitoring.
• Each stage uses specific tools:
  o Pandas, SQL, Spark → Data collection & cleaning
  o Seaborn, Tableau → Visualization
  o TensorFlow, PyTorch, Scikit-learn → Modeling
  o Flask, Docker → Deployment
• Open-source tools like Jupyter and Google Colab make Data Science easier, faster, and more collaborative.
• Jupyter is best for offline analysis, while Google Colab is ideal for cloud-based ML projects.
