With Automated ML,
Is Everyone an ML Engineer?
Dan Sullivan – DevFest Northeast 2020 – October 31, 2020
Bio
• Principal Engineer, PEAK6 Technologies
• Author
• Instructor
• Udemy
• Google Cloud
• LinkedIn Learning
• Data Science
• Machine Learning
• Databases & Data Modeling
Overview
• Machine Learning Workflow
• Formulating an ML Problem
• Building ML Models in GCP
• Data Engineering
• Monitoring and Evaluating Fairness
0. Machine Learning Workflow
• Formulate problem
• Identify data sources
• Prepare data
• Train, evaluate, and tune model
• Deploy model
• Use model in production
• Monitor and Evaluate Fairness
https://thenounproject.com/term/workflow/2409348/
1. Formulating an ML Problem
Define the Problem to Be Solved
• Informal description
• What is the value of solving the problem?
• How can the problem be solved?
• Regression (e.g., predict a numeric value, such as a price)
• Classification (e.g., predict a category, such as spam vs. not spam)
https://static.thenounproject.com/png/230138-200.png
Identify Data Sources
• Amount of data available
• Quality
• Rate of generation
• Requirements to access
• Limitations on the use of data
https://commons.wikimedia.org/wiki/File:Data_types_-_en.svg
2. Building ML Models in GCP
Services for Building Models
• Cloud AutoML
• AI Platform Training
• Kubeflow
• Dataproc with Spark ML
• BigQuery ML
Cloud AutoML
• Designed for model builders with limited ML experience
• GUI for training, evaluating, and tuning
• Services for sight, language, and structured data
• AutoML Tables uses structured data to build regression and classification models
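Although the GUI is the primary interface, AutoML Tables can also be driven from Python. A rough sketch, assuming the v1beta1 TablesClient from the google-cloud-automl library; the project ID, dataset name, column, and budget are hypothetical placeholders:

```python
# A rough sketch of training an AutoML Tables model programmatically.
# Assumes the google-cloud-automl library's v1beta1 TablesClient;
# 'my-project', 'sales_data', and the BigQuery URI are placeholders.
from google.cloud import automl_v1beta1 as automl

client = automl.TablesClient(project="my-project", region="us-central1")

# Create a dataset and import rows from BigQuery.
dataset = client.create_dataset(dataset_display_name="sales_data")
client.import_data(
    dataset=dataset,
    bigquery_input_uri="bq://my-project.sales.training_rows",
).result()  # import_data returns a long-running operation

# Train a model with a one-node-hour budget; AutoML handles feature
# engineering, architecture search, and hyperparameter tuning.
client.set_target_column(dataset=dataset, column_spec_display_name="revenue")
response = client.create_model(
    model_display_name="revenue_model",
    dataset=dataset,
    train_budget_milli_node_hours=1000,
)
model = response.result()
```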
AI Platform Training
• Trains and runs models built in
• TensorFlow
• scikit-learn
• XGBoost
• Hosted frameworks but can run custom containers
• Service provisions the compute resources needed for a job and then executes the job
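The training code itself is ordinary Python that gets packaged and submitted as a job (e.g., with gcloud ai-platform jobs submit training). A minimal sketch of a scikit-learn script of the kind the service would run; the Cloud Storage paths are hypothetical, and reading gs:// URLs with pandas assumes gcsfs is installed:

```python
# task.py -- a minimal scikit-learn training script of the kind
# packaged and submitted to AI Platform Training.
# The gs:// paths are hypothetical placeholders (pandas needs gcsfs).
import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("gs://my-bucket/data/train.csv")
X, y = df.drop(columns=["label"]), df["label"]
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("validation accuracy:", model.score(X_val, y_val))

# Export the trained model so it can be deployed for prediction.
joblib.dump(model, "model.joblib")
```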
Kubeflow
• Kubeflow is a machine learning toolkit for Kubernetes
• Packages models like applications
• Compose, deploy, and manage ML workflows
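Workflows are typically defined with the Kubeflow Pipelines (kfp) Python SDK and compiled into a spec the cluster can run. A minimal sketch assuming the kfp v1 SDK; the container images are hypothetical:

```python
# A minimal Kubeflow Pipelines definition (kfp v1 SDK).
# The container images are hypothetical placeholders.
import kfp
from kfp import dsl

@dsl.pipeline(name="train-pipeline", description="Preprocess then train")
def train_pipeline():
    preprocess = dsl.ContainerOp(
        name="preprocess",
        image="gcr.io/my-project/preprocess:latest",
    )
    train = dsl.ContainerOp(
        name="train",
        image="gcr.io/my-project/train:latest",
    )
    train.after(preprocess)  # express the dependency between steps

if __name__ == "__main__":
    # Compile to a spec that can be uploaded to a Kubeflow cluster.
    kfp.compiler.Compiler().compile(train_pipeline, "train_pipeline.yaml")
```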
Dataproc and Spark ML
• Dataproc is a managed Spark and Hadoop service
• Spark ML is Spark's machine learning library
• ML algorithms
• Feature engineering
• Pipelines
• Persistence
• Utilities
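A minimal PySpark sketch showing those pieces together: a feature-engineering stage and an algorithm composed into a pipeline, fit, and persisted. The input path and column names are hypothetical:

```python
# Spark ML in PySpark: feature engineering + algorithm in a Pipeline.
# The input path and column names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("sparkml-demo").getOrCreate()
df = spark.read.parquet("gs://my-bucket/training_data/")  # Dataproc reads GCS natively

# Feature engineering stage: pack raw columns into a feature vector.
assembler = VectorAssembler(inputCols=["age", "income"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

# Compose, fit, and persist the whole pipeline.
pipeline = Pipeline(stages=[assembler, lr])
model = pipeline.fit(df)
model.write().overwrite().save("gs://my-bucket/models/lr_pipeline")
```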
BigQuery ML
• BigQuery is a serverless analytical database
• BigQuery ML brings machine learning functions to SQL
• Key advantages:
• Ability to train and run models in BigQuery
• Use SQL, not Python or ML frameworks
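In other words, training and prediction are just SQL statements. A minimal sketch, here submitted through the google-cloud-bigquery Python client (the same SQL runs unchanged in the BigQuery console); the dataset, table, and column names are hypothetical:

```python
# Training a BigQuery ML model: the model is a SQL statement.
# Dataset, table, and column names are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

create_model_sql = """
CREATE OR REPLACE MODEL `my_dataset.churn_model`
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
SELECT age, plan_type, monthly_spend, churned
FROM `my_dataset.customers`
"""
client.query(create_model_sql).result()  # training runs inside BigQuery

# Predictions are also just SQL.
predict_sql = """
SELECT * FROM ML.PREDICT(MODEL `my_dataset.churn_model`,
                         (SELECT age, plan_type, monthly_spend
                          FROM `my_dataset.new_customers`))
"""
for row in client.query(predict_sql).result():
    print(dict(row))
```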
3. Data Engineering
Cloud Composer
• Managed Apache Airflow service
• Executes workflows defined in directed acyclic graphs (DAGs)
• Accessed through the console or command line (gcloud composer environments)
Apache Airflow DAGs
• A workflow is a collection of tasks with dependencies
• DAGs are stored in Cloud Storage
• Supports custom plugins for operators, hooks, and interfaces
• Python dependencies (packages)
DAGs are Python Programs
Source: https://cloud.google.com/composer/docs/how-to/using/writing-dags
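A minimal sketch of such a DAG, using the Airflow 1.x import paths that Cloud Composer ran at the time; the task commands are hypothetical:

```python
# A minimal Airflow DAG: two tasks with a dependency between them.
# Airflow 1.x import paths (as used by Cloud Composer circa 2020);
# the bash commands are hypothetical placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "data-eng",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="daily_etl",
    default_args=default_args,
    start_date=datetime(2020, 10, 1),
    schedule_interval="@daily",  # run once per day
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    load = BashOperator(task_id="load", bash_command="echo loading")

    extract >> load  # 'load' runs only after 'extract' succeeds
```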
Cloud Composer Environments
• Deployed in environments, which are collections of GCP service components based on Kubernetes Engine
• Uses a combination of tenant and customer project resources
Architecture
Source: https://cloud.google.com/composer/docs/concepts/overview
Cloud Data Fusion
• Managed service based on the open source CDAP data analytics platform
• Code-free ETL/ELT development tool
• Over 160 connectors and transformations
• Drag-and-drop ETL/ELT construction
Execution Environment
• Cloud Data Fusion is deployed as an instance
• Two editions
• Basic – visual designer, transformations, SDK, etc.
• Enterprise – Basic plus streaming pipelines, integration metadata repository, high availability, triggers, schedules, etc.
Visual Interface
Source: https://cloud.google.com/data-fusion/docs/quickstart
4. Monitoring and Evaluating Fairness
Objective
• Understand model performance
• Metrics to monitor:
• Traffic patterns
• Error rates
• Latency
• Resource utilization
• Configure alerts in Cloud Monitoring
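A rough sketch of creating such an alert with the google-cloud-monitoring (monitoring_v3) client. The metric type is AI Platform's prediction error count; the project ID and threshold are hypothetical, and exact field names can vary across client-library versions:

```python
# A rough sketch: alert when AI Platform prediction errors spike.
# Assumes the google-cloud-monitoring (monitoring_v3) client library;
# the project ID and threshold are hypothetical placeholders.
from google.cloud import monitoring_v3

client = monitoring_v3.AlertPolicyServiceClient()

policy = monitoring_v3.AlertPolicy(
    display_name="High prediction error count",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.AND,
    conditions=[
        monitoring_v3.AlertPolicy.Condition(
            display_name="error_count > 100 for 5 minutes",
            condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
                filter='metric.type = "ml.googleapis.com/prediction/error_count"',
                comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
                threshold_value=100,
                duration={"seconds": 300},
            ),
        )
    ],
)

created = client.create_alert_policy(
    name="projects/my-project", alert_policy=policy
)
print("created", created.name)
```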
Monitoring AI Platform
• AI Platform exports metrics to
Cloud Monitoring
• Metrics
• Error count
• Latencies
• Accelerator utilization
• Memory utilization
• CPU utilization
• Network
• Prediction counts
Monitoring ML Models: Best Practices
• Monitor for data skew (see the sketch after this list)
• Watch for changes in dependencies
• Refresh models as needed
• Assess model prediction quality
• Test for unfairness
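One lightweight way to monitor for data skew is to compare serving-time feature distributions against the training distribution. A minimal sketch using a two-sample Kolmogorov-Smirnov test from SciPy; the arrays are hypothetical:

```python
# A minimal data-skew check: compare serving-time feature distributions
# against training distributions with a two-sample KS test.
# The arrays here are hypothetical placeholders.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(seed=0)
train_ages = rng.normal(40, 10, size=10_000)   # distribution at training time
serving_ages = rng.normal(45, 10, size=1_000)  # distribution seen in production

stat, p_value = ks_2samp(train_ages, serving_ages)
if p_value < 0.01:
    # In practice this would raise a Cloud Monitoring alert
    # or trigger a retraining pipeline.
    print(f"Possible data skew detected (KS={stat:.3f}, p={p_value:.2g})")
```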
Fairness
• Anti-classification
• Protected attributes are not used in the model
• Example: gender
• Classification parity
• Measures of predictive performance are equal across groups
• Calibration
• Among instances with the same risk score, outcomes are independent of protected attributes
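Classification parity can be checked by computing the same performance metric per protected group and comparing. A minimal sketch with pandas and scikit-learn; the labels, predictions, and group values are hypothetical:

```python
# A minimal classification-parity check: compare a performance metric
# (here, recall / true positive rate) across protected groups.
# The data is a hypothetical placeholder.
import pandas as pd
from sklearn.metrics import recall_score

df = pd.DataFrame({
    "y_true": [1, 0, 1, 1, 0, 1, 0, 1],
    "y_pred": [1, 0, 0, 1, 0, 1, 1, 1],
    "group":  ["a", "a", "a", "a", "b", "b", "b", "b"],
})

# Recall per group; large gaps suggest a classification-parity violation.
per_group = df.groupby("group").apply(
    lambda g: recall_score(g["y_true"], g["y_pred"])
)
print(per_group)
print("max recall gap between groups:", per_group.max() - per_group.min())
```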
Fairness Resources
• Google’s Machine Learning Fairness
• https://developers.google.com/machine-learning/fairness-overview
• AI Fairness 360 https://github.com/IBM/AIF360
• FairML https://github.com/adebayoj/fairml
• What-If Tool https://pair-code.github.io/what-if-tool/
Quick Summary
• Machine learning workflows are multi-step
• Automated ML addresses some, but not all, steps
• Lots of data engineering and monitoring still required
