What Is Data Science by IBM

This is the written material from the course named What is Data Science by IBM

Uploaded by Ahmer Essa

Categories of Data Science Tools!

Introduction:
Data scientists perform various tasks to transform raw data into useful insights.
Key tasks: data management, data integration and transformation, data visualization, model building, model
deployment, and model monitoring and assessment.
Supporting elements: data asset management, code asset management, execution environments, and development
environments.

1. Data Management
Collects, stores, and retrieves data securely, efficiently, and cost-effectively.
Sources: social media, e-commerce, sensors, etc.
Persistent storage ensures data is available when needed.

2. Data Integration and Transformation


ETL Process:
Extraction: Extract data from multiple repositories (databases, data cubes, flat files) into a central repository like a
Data Warehouse.
Transformation: Convert data values, structure, and format (e.g., converting height and weight to metric units).
Loading: Load transformed data back to the Data Warehouse.
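The three ETL steps above can be sketched with only the Python standard library. The CSV source, the unit conversions, and the SQLite table standing in for a Data Warehouse are toy stand-ins for illustration, not part of the course material:

```python
# Minimal ETL sketch: extract rows from a CSV source, transform height
# and weight to metric units, and load the result into a SQLite table
# standing in for a data warehouse.
import csv
import io
import sqlite3

raw_csv = io.StringIO(
    "name,height_in,weight_lb\n"
    "alice,65,130\n"
    "bob,70,180\n"
)

# Extract: read rows from the source repository (here, an in-memory CSV).
rows = list(csv.DictReader(raw_csv))

# Transform: convert height and weight to metric units.
transformed = [
    (r["name"],
     round(float(r["height_in"]) * 2.54, 1),    # inches -> cm
     round(float(r["weight_lb"]) * 0.4536, 1))  # pounds -> kg
    for r in rows
]

# Load: write the transformed rows into the warehouse table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, height_cm REAL, weight_kg REAL)")
conn.executemany("INSERT INTO people VALUES (?, ?, ?)", transformed)

heights = dict(conn.execute("SELECT name, height_cm FROM people"))
```

Production ETL tools add scheduling, error handling, and scale on top of this same extract–transform–load skeleton.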

3. Data Visualization
Graphically represents data to aid decision-making.
Forms of visualizations: bar charts, treemaps, line charts, map charts, etc.
Enhances data comprehension for decision-makers.
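As a toy illustration of what a bar chart does, mapping values to graphical marks so a decision-maker can compare them at a glance, here is a text-only sketch; real dashboards would use matplotlib, Tableau, or similar, and the quarterly figures are made up:

```python
# Toy text-based bar chart: each value becomes a row of '#' marks
# scaled so the largest value fills the full width.
sales = {"Q1": 12, "Q2": 18, "Q3": 9, "Q4": 15}

def ascii_bar_chart(data, width=30):
    """Render each value as a bar of '#' marks scaled to `width`."""
    peak = max(data.values())
    lines = []
    for label, value in data.items():
        bar = "#" * round(value / peak * width)
        lines.append(f"{label:>3} | {bar} {value}")
    return "\n".join(lines)

chart = ascii_bar_chart(sales)
print(chart)
```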

4. Model Building
Involves training machine learning models on data to analyze patterns and make predictions.
Example tool: IBM Watson Machine Learning.
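The train-then-predict cycle can be sketched with closed-form least squares for a simple linear model. The data and function names below are illustrative, not taken from any IBM tool:

```python
# Sketch of model building: fit y ≈ slope*x + intercept to training
# data with the closed-form least-squares solution, then predict.
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Training data follows y = 2x + 1 exactly, so the fit recovers it.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)

def predict(x):
    return slope * x + intercept
```

Tools like Watson Machine Learning automate this same fit/predict loop across many algorithms and at much larger scale.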

5. Model Deployment
Integrates the developed model into a production environment.
Models are accessed via APIs by third-party applications, enabling business users to make data-driven decisions.
Example tool: SPSS Collaboration and Deployment Services.
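The API-based deployment pattern can be sketched with the standard library's HTTP server. The `/predict` path, the JSON schema, and the stand-in model below are invented for illustration and are not the actual interface of any product named here:

```python
# Minimal sketch of model deployment: expose a "model" behind an HTTP
# API so third-party applications can request predictions.
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

def model(x):
    return 2 * x + 1  # stand-in for a trained model

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        result = json.dumps({"prediction": model(body["x"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(result)

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), PredictHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# A client application calls the API over HTTP:
req = urllib.request.Request(
    f"http://127.0.0.1:{server.server_port}/predict",
    data=json.dumps({"x": 5}).encode(),
    headers={"Content-Type": "application/json"},
)
response = json.loads(urllib.request.urlopen(req).read())
server.shutdown()
```

Serving platforms add authentication, versioning, and scaling around this request/response core.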

6. Model Monitoring and Assessment


Ensures continuous quality of models by checking accuracy, fairness, and robustness.
Tools like Fiddler monitor deployed model performance.
Evaluation metrics: F1 score, true positive rate, sum of squared errors.
Example tool: IBM Watson OpenScale.
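The evaluation metrics listed above can be computed by hand for a toy set of binary labels (the data is made up for illustration):

```python
# F1 score and true positive rate for binary predictions (1 = positive),
# plus sum of squared errors as used for regression-style evaluation.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)  # recall is the true positive rate
f1 = 2 * precision * recall / (precision + recall)

# Sum of squared errors between predictions and ground truth.
sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
```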
Supporting Tools and Platforms:
1. Code Asset Management
Manages code inventory with version control for updates, bug fixes, and feature improvements.
Centralized repositories allow collaborative code management.
Example platform: GitHub.

2. Data Asset Management (DAM)


Organizes and manages data collected from various sources.
Controls access, versioning, replication, and backup.
Enables collaboration and proper data organization.

3. Development Environments (IDEs)


Provide tools for developing, executing, testing, and deploying code.
Include testing and simulation tools for real-world emulation.
Example: IBM Watson Studio.

4. Execution Environments
Compile and verify source code with necessary libraries and system resources.
Cloud-based environments offer flexibility and are not tied to specific hardware.
Example: IBM Watson Studio.

Integrated Tools:
Tools like IBM Watson Studio and IBM Cognos Dashboard Embedded offer comprehensive support for deep learning
and machine learning model development, covering all the previous components.

Summary:
Data Science Task Categories: Data Management, Data Integration and Transformation, Data Visualization, Model
Building, Model Deployment, and Model Monitoring and Assessment.
Supported by: Data Asset Management, Code Asset Management, Execution Environments, and Development
Environments

Open-Source Tools for Data Science! Part 1

1. Data Management Tools:


Relational Databases: MySQL, PostgreSQL
NoSQL Databases: MongoDB, Apache CouchDB, Apache Cassandra
File-Based Tools: Hadoop Distributed File System (HDFS), Ceph (a cloud file system)
Text Data Storage: Elasticsearch (creates search indices for fast document retrieval)
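The search-index idea behind Elasticsearch can be sketched as a toy inverted index that maps each term to the documents containing it, so lookups avoid scanning every document. The documents and terms below are invented:

```python
# Toy inverted index: term -> set of document IDs containing that term.
docs = {
    1: "data science transforms raw data",
    2: "cloud storage for raw sensor data",
    3: "model building with machine learning",
}

index = {}
for doc_id, text in docs.items():
    for term in set(text.split()):
        index.setdefault(term, set()).add(doc_id)

def search(term):
    """Return the sorted IDs of documents containing `term`."""
    return sorted(index.get(term, set()))
```

Real search engines add tokenization, ranking, and distributed sharding on top of this structure.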

2. Data Integration and Transformation Tools:


ETL and ELT Processes: Extraction, Transformation, and Loading (ETL) or Extraction, Loading, and Transformation
(ELT)
Tools:
Apache Airflow (originally developed by Airbnb)
Kubeflow (executes data science pipelines on Kubernetes)
Apache Kafka (originated at LinkedIn)
Apache NiFi (visual editor for data flows)
Apache Spark SQL (scalable SQL)
Node-RED (visual editor, runs on low-resource devices like the Raspberry Pi)
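The pipeline idea these tools share can be sketched as a chain of transformation steps, each passing its output to the next. The steps and data below are invented; real tools add scheduling, retries, and distributed execution on top of this basic flow:

```python
# Toy data-flow pipeline in the spirit of Airflow/Kubeflow/Node-RED:
# each step is a function, run in order, output feeding the next input.
def extract():
    return [" Alice,24 ", "bob,31", "CAROL,28"]

def clean(rows):
    return [r.strip().lower() for r in rows]

def parse(rows):
    return [{"name": n, "age": int(a)}
            for n, a in (r.split(",") for r in rows)]

def run_pipeline(steps):
    data = None
    for step in steps:
        data = step(data) if data is not None else step()
    return data

records = run_pipeline([extract, clean, parse])
```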

3. Data Visualization Tools:


Libraries with User Interfaces:
PixieDust (facilitates plotting in Python)
Hue (creates visualizations from SQL queries)
Web Applications:
Kibana (limited to Elasticsearch data)
Apache Superset (data exploration and visualization)

4. Model Tools:
Model Deployment:
Apache PredictionIO (supports Apache Spark ML)
Seldon (supports multiple frameworks, runs on Kubernetes and Red Hat OpenShift)
MLeap (deploys SparkML models)
TensorFlow Serving (serves TensorFlow models, also supports TensorFlow Lite and TensorFlow.js)
Model Monitoring:
ModelDB (metadata storage for models, supports Apache Spark ML and scikit-learn)
Prometheus (generic monitoring tool)
IBM AI Fairness 360 (detects and mitigates bias)
IBM Adversarial Robustness 360 Toolbox (detects adversarial attacks)
IBM AI Explainability 360 (explains model decisions)

5. Code Asset Management Tools:


Version Management:
Git (standard for version control)
GitHub (prominent Git service)
GitLab (open-source platform, can be self-hosted)
Bitbucket

6. Data Asset Management Tools:


Data Governance and Lineage:
Apache Atlas (supports data management tasks)
ODPi Egeria (open ecosystem for metadata repositories)
Kylo (data management software platform)

7. Model Development:
• IBM Watson Studio: Engineered as an integrated environment, Watson Studio simplifies developing, training, and deploying models. It supports multiple languages and frameworks, such as Python, R, and TensorFlow, alongside collaboration features, data preparation tools, and versatile deployment options.
• IBM AutoAI: A notable feature embedded within Watson Studio, IBM AutoAI streamlines the machine learning model construction process. By dynamically exploring various algorithms and hyperparameters, it aims to identify the optimal model for a given dataset.
• IBM Watson OpenScale: As a platform for overseeing and managing AI models in production, Watson OpenScale plays a pivotal role in ensuring model fairness, explainability, and bias mitigation. It furnishes insights into model performance and drift over time, facilitating informed decision-making.
• IBM Watson Machine Learning: Available as a service on the IBM Cloud platform, Watson Machine Learning enables users to scale the training and deployment of machine learning models. It supports popular frameworks like TensorFlow, PyTorch, and scikit-learn, and offers APIs for integration with other applications.

Open-Source Tools for Data Science! Part 2


Objectives:
Compare and contrast different open-source tools.
Describe the relevant features of open-source tools.

Development Environments:
Jupyter Notebooks and Jupyter Lab:
Jupyter Notebooks: Interactive programming tool originally for Python, now supports 100+ languages through
kernels. Combines documentation, code, output, shell commands, and visualizations in a single document.
Jupyter Lab: The next generation of Jupyter Notebooks; more modern and modular. Allows opening and arranging various file types on the canvas.

Apache Zeppelin:
Inspired by Jupyter Notebooks.
Integrated plotting capabilities without the need for external libraries.
Extensible with additional libraries.

RStudio:
Oldest development environment for statistics and data science, originating in 2011.
Primarily runs R and its libraries, though it also supports Python development.
Unifies programming, execution, debugging, remote data access, data exploration, and visualization.

Spyder:
Mimics RStudio for Python.
Integrates code, documentation, and visualizations into a single canvas.
Widely used in the Python community, though not as comprehensive as RStudio.

Cluster Execution Environments:


Apache Spark:
Widely used across industries, especially Fortune 500 companies.
Key feature: linear scalability (performance roughly doubles with doubling servers).
Primarily a batch data processing engine.
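The batch map/reduce pattern that Spark scales out can be sketched in plain Python: count words within each partition independently (the parallelizable map stage), then merge the partial results (the reduce stage). The partitions and data below are illustrative:

```python
# Word count in map/reduce style: Spark runs this same pattern across
# a cluster, which is where its near-linear scalability comes from.
from collections import Counter
from functools import reduce

partitions = [
    ["spark handles batch data", "flink handles stream data"],
    ["ray trains deep learning models"],
]

# Map stage: count words within each partition independently.
partial_counts = [
    Counter(word for line in part for word in line.split())
    for part in partitions
]

# Reduce stage: merge the per-partition counts into a global result.
word_counts = reduce(lambda a, b: a + b, partial_counts)
```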

Apache Flink:
Developed after Apache Spark.
Focuses on real-time data stream processing.
Supports both batch and stream data processing, but specializes in the latter.

Ray:
Latest in execution environments.
Focuses on large-scale deep learning model training.
Fully Integrated and Visual Tools:

KNIME:
Originated from the University of Konstanz in 2004.
Visual user interface with drag-and-drop capabilities.
Built-in visualization, extensible with R and Python programming.
Connectors available for Apache Spark.

Orange:
Easier to use but less flexible than KNIME.
Also provides visual interface for data integration, transformation, and model building.

Key Takeaways:
Jupyter is highly versatile and supports a broad range of programming languages and functionalities through its
Notebooks and Lab versions.
Apache Zeppelin offers similar functionalities to Jupyter with built-in plotting capabilities.
RStudio is the go-to for R users and offers a comprehensive suite of tools for data science tasks.
Spyder serves as an alternative for Python users, integrating key features into a unified interface.
Apache Spark and Apache Flink are essential for handling large-scale data processing, with Spark being more popular
for batch processing and Flink for stream processing.
Ray is emerging as a key player for deep learning model training.
KNIME and Orange provide visual tools that reduce the need for programming knowledge, streamlining data science
tasks through intuitive interfaces.


Commercial Tools for Data Science!


Overview
This section covers commercial tools used in various stages of the data science lifecycle: data management, data
integration and transformation, data visualization, model building, deployment, monitoring, assessment, code asset
management, data asset management, and development environments.

Data Management Tools


Oracle Database: Industry-standard for enterprise data management.
Microsoft SQL Server: Another widely used enterprise data management tool.
IBM Db2: Known for robust performance and security in enterprise environments.
Key Benefits:
Commercial support from vendors and partners.
Reliability and stability for critical enterprise data.

Data Integration and Transformation Tools


Informatica PowerCenter: Leader in ETL tools as per Gartner Magic Quadrant.
IBM InfoSphere DataStage: Another top ETL tool.
Other Notable Tools: SAP, Oracle, SAS, Talend, and Microsoft products.
Watson Studio Desktop: Includes Data Refinery for spreadsheet-style data integration and transformation.

Data Visualization Tools


Tableau: Leading BI tool for creating visual reports and live dashboards.
Microsoft Power BI: Widely used for business intelligence and data visualization.
IBM Cognos Analytics: Another major BI tool.
Watson Studio Desktop: Offers visualization targeting data scientists, showing relationships between data columns.

Model Building Tools


SPSS Modeler: Prominent tool for machine learning and data mining.
SAS Enterprise Miner: Another significant tool for model building.
Integration: SPSS Modeler is also part of Watson Studio Desktop.

Model Deployment Tools


SPSS Collaboration and Deployment Services: Used for deploying models created with SPSS tools.
Model Export: Models can be exported in PMML (Predictive Model Markup Language) for compatibility with other
software.
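As a hedged sketch of what a PMML export looks like, here is a minimal document for a linear regression model (y = 2x + 1); the field names, model name, and coefficients are illustrative:

```xml
<!-- Minimal illustrative PMML document for a simple regression model. -->
<PMML version="4.4" xmlns="http://www.dmg.org/PMML-4_4">
  <Header description="Example exported regression model"/>
  <DataDictionary numberOfFields="2">
    <DataField name="x" optype="continuous" dataType="double"/>
    <DataField name="y" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression" modelName="example_model">
    <MiningSchema>
      <MiningField name="x"/>
      <MiningField name="y" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="1.0">
      <NumericPredictor name="x" coefficient="2.0"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
```

Because PMML describes the model declaratively, a model exported this way from SPSS can be scored by any other PMML-aware engine.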

Model Monitoring Tools


Current Landscape: Model monitoring is emerging, with limited commercial tools available. Open-source tools are
currently more prevalent.
Code Asset Management Tools
Industry Standard: Open-source tools like Git and GitHub are the de facto standards.

Data Asset Management Tools


Informatica Enterprise Data Governance: Comprehensive tool for data governance.
IBM Information Governance Catalog:
Features: Data dictionary, data stewardship, data lineage, metadata annotation, regulatory compliance.
Data Lineage: Tracks transformation steps and references source data.
Policies and Rules: Manage data privacy and retention requirements.

Development Environment
Watson Studio: Fully integrated environment for data science.
Versions: Available in both cloud and desktop versions.
Features: Combines Jupyter Notebooks with graphical tools for enhanced performance.
Watson OpenScale: Complements Watson Studio by covering the entire data science lifecycle.
Deployment: Can be deployed locally, on Kubernetes, or RedHat OpenShift.
H2O Driverless AI: Another example of a fully integrated commercial tool covering the complete data science
lifecycle.

Key Takeaways
Commercial tools provide robust support for various data science tasks.
Leading tools in different categories ensure enterprise-grade performance, reliability, and support.
Watson Studio and H2O Driverless AI exemplify comprehensive solutions for data science lifecycle management.

Cloud-Based Tools for Data Science!

Fully Integrated Visual Tools:


Watson Studio: Supports the complete development life cycle for data science, machine learning, and AI tasks. Offers
large-scale execution of data science workflows in compute clusters.
Microsoft Azure Machine Learning: Cloud-hosted offering supporting the complete development life cycle for data
science, machine learning, and AI tasks.
H2O Driverless AI: Covers the complete data science life cycle with one-click deployment for standard cloud service
providers.
Data Management Tools:
Oracle Database, Microsoft SQL Server, IBM Db2: Industry-standard data management tools available as Software-
as-a-Service (SaaS) offerings in the cloud.
Amazon Web Services DynamoDB: NoSQL database allowing storage and retrieval of data in key-value or document
store format.
Cloudant (based on Apache CouchDB): Database as a service offering for storage and retrieval of data with complex
operational tasks managed by the cloud provider.

Data Integration Tools:


Informatica Cloud Data Integration: Widely used for extract, transform, and load (ETL) processes in the cloud.
IBM Data Refinery (part of Watson Studio): Enables transformation of raw data into consumable, quality information
through a spreadsheet-like user interface.

Data Visualization Tools:


Datameer: Cloud-based data visualization tool offering visual reporting and live dashboards.
IBM Cognos Business Intelligence Suite: Provides data exploration and visualization functionality as a cloud solution.
Watson Studio: Offers various visualizations to depict data relationships and patterns for better understanding.

Model Building and Deployment:


Watson Machine Learning: Service for training and building models using various open-source libraries. Also deploys
models and makes them available to consumers using a REST interface.
Amazon SageMaker Model Monitor: Cloud tool for continuously monitoring deployed machine learning and deep
learning models.
SPSS Collaboration and Deployment Services: Example of tightly integrated model deployment tool in commercial
software, supporting various model formats such as Predictive Model Markup Language (PMML).

Model Monitoring:
Watson OpenScale: Tool for monitoring deployed machine learning and AI models continuously, ensuring model
performance and fairness.

Key Takeaways:
Cloud-based tools offer comprehensive support for data science tasks, covering data management, integration,
visualization, model building, deployment, and monitoring.
Integration within cloud tools enables seamless execution of multiple tasks, enhancing efficiency and productivity for
data scientists and developers.
These tools leverage cloud infrastructure to provide scalable and flexible solutions for data science projects, catering
to diverse business needs and requirements.

Module 1 Summary
• The Data Science Task Categories include:
o Data Management - storage, management and retrieval of data
o Data Integration and Transformation - streamline data pipelines and automate data processing tasks
o Data Visualization - provide graphical representation of data and assist with communicating insights
o Modelling - enable building, deployment, monitoring and assessment of data and machine learning models
• Data Science Tasks support the following:
o Code Asset Management - store & manage code, track changes and allow collaborative development
o Data Asset Management - organize and manage data, provide access control, and backup assets
o Development Environments - develop, test and deploy code
o Execution Environments - provide computational resources and run the code

The data science ecosystem consists of many open source and commercial options, and includes both traditional desktop applications and server-based tools, as well as cloud-based services that can be accessed using web browsers and mobile interfaces.

Data Management Tools: include Relational Databases, NoSQL Databases, and Big Data platforms:

• MySQL and PostgreSQL are examples of Open Source Relational Database Management Systems (RDBMS), and IBM Db2 and SQL Server are examples of commercial RDBMSes that are also available as Cloud services.
• MongoDB and Apache Cassandra are examples of NoSQL databases.
• Apache Hadoop and Apache Spark are used for Big Data analytics.
Data Integration and Transformation Tools: include Apache Airflow and Apache Kafka.

Data Visualization Tools: include commercial offerings such as Cognos Analytics, Tableau and PowerBI and can
be used for building dynamic and interactive dashboards.

Code Asset Management Tools: Git is an essential code asset management tool. GitHub is a popular web-based
platform for storing and managing source code. Its features make it an ideal tool for collaborative software
development, including version control, issue tracking, and project management.

Development Environments: Popular development environments for Data Science include Jupyter Notebooks
and RStudio.

• Jupyter Notebooks provides an interactive environment for creating and sharing code, descriptive text, data visualizations, and other computational artifacts in a web-browser based interface.
• RStudio is an integrated development environment (IDE) designed specifically for working with the R programming language, which is a popular tool for statistical computing and data analysis.
