GCCF Unit 5

The lecture notes discuss introducing big data managed services in Google Cloud including tools like Cloud Dataproc, Cloud Dataflow, and BigQuery. It also covers introducing machine learning services like AI Platform and Cloud AutoML and Google's pre-trained machine learning APIs.

Uploaded by

dhanushbabu363

Please read this disclaimer before proceeding:
This document is confidential and intended solely for the educational purpose of
RMK Group of Educational Institutions. If you have received this document through
email in error, please notify the system manager. This document contains
proprietary information and is intended only to the respective group / learning
community as intended. If you are not the addressee you should not disseminate,
distribute or copy through e-mail. Please notify the sender immediately by e-mail
if you have received this document by mistake and delete this document from your
system. If you are not the intended recipient you are notified that disclosing,
copying, distributing or taking any action in reliance on the contents of this
information is strictly prohibited.
20CS929

GOOGLE CLOUD COMPUTING FOUNDATIONS

Department : CSE

Batch/Year : 2021-2025 / III Year

Created by:

Ms. A. Jasmine Gilda, Assistant Professor / CSE


Ms. T. Sumitha, Assistant Professor / CSE

Date : 04.09.2023
1. CONTENTS

S. No. Contents

1 Contents

2 Course Objectives

3 Pre Requisites

4 Syllabus

5 Course outcomes

6 CO- PO/PSO Mapping

7 Lecture Plan

8 Activity based learning

9 Lecture Notes

10 Assignments

11 Part A Questions & Answers

12 Part B Questions

13 Supportive online Certification courses

14 Real time Applications

15 Contents beyond the Syllabus

16 Assessment Schedule

17 Prescribed Text Books & Reference Books

18 Mini Project Suggestions


2. COURSE OBJECTIVES

• To describe the different ways a user can interact with Google Cloud.
• To discover the different compute options in Google Cloud and implement a variety of
structured and unstructured storage models.
• To discuss the different application managed service options in the cloud and outline
how security in the cloud is administered in Google Cloud.
• To demonstrate how to build secure networks in the cloud and identify cloud automation
and management tools.
• To determine a variety of managed big data services in the cloud.
3. PRE REQUISITES

• Pre-requisite Chart

20CS929 - GOOGLE CLOUD COMPUTING FOUNDATIONS

20CS404 – OPERATING SYSTEMS
20IT403 – DATABASE MANAGEMENT SYSTEMS
4. SYLLABUS
20CS929   GOOGLE CLOUD COMPUTING FOUNDATIONS        L T P C
                                                    3 0 0 3
UNIT I INTRODUCTION TO GOOGLE CLOUD 9
Cloud Computing - Cloud Versus Traditional Architecture - IaaS, PaaS, and SaaS - Google Cloud
Architecture - The GCP Console - Understanding projects - Billing in GCP - Install and configure
Cloud SDK - Use Cloud Shell - GCP APIs - Cloud Console Mobile App.

UNIT II COMPUTE AND STORAGE 9


Compute options in the cloud - Exploring IaaS with Compute Engine - Configuring elastic apps
with autoscaling - Exploring PaaS with App Engine - Event driven programs with Cloud Functions
- Containerizing and orchestrating apps with Google Kubernetes Engine - Storage options in the
cloud - Structured and unstructured storage in the cloud - Unstructured storage using Cloud
Storage - SQL managed services - Exploring Cloud SQL - Cloud Spanner as a managed service
- NoSQL managed service options - Cloud Datastore, a NoSQL document store - Cloud Bigtable
as a NoSQL option.
UNIT III APIs AND SECURITY IN THE CLOUD 9
The purpose of APIs - Cloud Endpoints - Using Apigee Edge - Managed message services - Cloud
Pub/Sub - Introduction to security in the cloud - The shared security model - Encryption options
- Authentication and authorization with Cloud IAM - Identify Best Practices for Authorization
using Cloud IAM.
UNIT IV NETWORKING, AUTOMATION AND MANAGEMENT TOOLS 9
Introduction to networking in the cloud - Defining a Virtual Private Cloud - Public and private IP
address basics - Google’s network architecture - Routes and firewall rules in the cloud - Multiple
VPC networks - Building hybrid clouds using VPNs, interconnecting, and direct peering - Different
options for load balancing - Introduction to Infrastructure as Code - Cloud Deployment Manager
- Public and private IP address basics - Monitoring and managing your services, applications,
and infrastructure - Stackdriver.
UNIT V BIG DATA AND MACHINE LEARNING SERVICES 9
Introduction to big data managed services in the cloud - Leverage big data operations with
Cloud Dataproc - Build Extract, Transform, and Load pipelines using Cloud Dataflow - BigQuery,
Google’s Enterprise Data Warehouse - Introduction to machine learning in the cloud - Building
bespoke machine learning models with AI Platform - Cloud AutoML - Google’s pre-trained
machine learning APIs.
TOTAL: 45 PERIODS
5. COURSE OUTCOME

At the end of this course, the students will be able to:

CO1: Describe the different ways a user can interact with Google Cloud.

CO2: Discover the different compute options in Google Cloud and implement a variety of
structured and unstructured storage models.

CO3: Discuss the different application managed service options in the cloud and outline
how security in the cloud is administered in Google Cloud.

CO4: Demonstrate how to build secure networks in the cloud and identify cloud automation
and management tools.

CO5: Discover a variety of managed big data services in the cloud.


6. CO - PO / PSO MAPPING

PROGRAM OUTCOMES (PO-1 to PO-12) AND PSO (PSO1 to PSO3)

CO      HKL  PO-1  PO-2  PO-3  PO-4  PO-5      PO-6  PO-7  PO-8  PO-9  PO-10  PO-11  PO-12  PSO1  PSO2  PSO3
Level        K3    K4    K5    K5    K3,K4,K5  A3    A2    A3    A3    A3     A3     A2
C306.1  K3   2     1     1     -     -         -     -     -     -     2      2      2      2     2     2
C306.2  K3   3     3     3     -     -         -     -     2     2     2      2      2      2     2     2
C306.3  K3   3     3     3     -     -         2     -     2     2     2      2      2      2     2     2
C306.4  K3   3     3     3     -     -         -     -     2     2     2      2      2      2     2     2
C306.5  K3   3     3     3     -     -         2     -     -     2     2      2      2      2     2     2

Correlation Level:
1. Slight (Low)
2. Moderate (Medium)
3. Substantial (High)
If there is no correlation, put “-”.
7. LECTURE PLAN

Sl.   Topic                                                              Number of  Proposed    Actual Lecture  CO    Taxonomy  Mode of
No.                                                                      Periods    Date        Date                  Level     Delivery
1     Introduction to big data managed services in the cloud             1          14.10.2023                  CO5   K2        PPT
2     Leverage big data operations with Cloud Dataproc                   1          16.10.2023                  CO5   K3        PPT
3     Build Extract, Transform, and Load pipelines using Cloud Dataflow  1          18.10.2023                  CO5   K3        PPT
4     BigQuery, Google’s Enterprise Data Warehouse                       1          19.10.2023                  CO5   K3        PPT
5     Introduction to machine learning in the cloud                      1          20.10.2023                  CO5   K2        PPT
6     Building bespoke machine learning models with AI Platform          1          21.10.2023                  CO5   K3        PPT
7     Cloud AutoML                                                       1          02.11.2023                  CO5   K3        PPT
8     Google’s pre-trained machine learning APIs                         1          03.11.2023                  CO5   K3        PPT
9     Demo                                                               1          04.11.2023                  CO5   K3        PPT


8. ACTIVITY BASED LEARNING

Prepare a comparison chart for various Big Data tools and ML tools available in Google
Cloud.
9. UNIT V - LECTURE NOTES
BIG DATA AND MACHINE LEARNING SERVICES

INTRODUCTION TO BIG DATA MANAGED SERVICES IN THE CLOUD

What is Big Data?

Big Data is a collection of datasets that is large and complex, making it very difficult to
process using legacy data processing applications.

So, legacy or traditional systems cannot process a large amount of data in one go.

Tools for big data can help with the volume of the data collected, the speed at which that
data becomes available to an organization for analysis, and the complexity or varieties of
that data.

Data can be a company’s most valuable asset. Using big data to reveal insights can help
understand the areas that affect your business—from market conditions and customer
purchasing behaviors to your business processes.

Big Data Overview:

Big data is typically measured in petabytes (1 petabyte is about 1 million gigabytes) and
exabytes (1 exabyte is about 1 billion gigabytes), as opposed to the gigabytes common for
personal devices.
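
As a quick check of the scale quoted above, the following sketch converts petabytes and exabytes to gigabytes. The constant and function names are illustrative only. With decimal (SI) units the ratios are exactly 1 million and 1 billion; with binary units they are slightly larger, which is one way to read the "more than" wording.

```python
# Decimal (SI) units, in bytes.
GB, PB, EB = 10**9, 10**15, 10**18
# Binary (IEC) units, in bytes.
GIB, PIB, EIB = 2**30, 2**50, 2**60


def ratio(larger: int, smaller: int) -> int:
    """Return how many whole `smaller` units fit into one `larger` unit."""
    return larger // smaller


print(ratio(PB, GB))    # 1000000      (decimal petabyte in gigabytes)
print(ratio(EB, GB))    # 1000000000   (decimal exabyte in gigabytes)
print(ratio(PIB, GIB))  # 1048576      (binary: a little over 1 million)
print(ratio(EIB, GIB))  # 1073741824   (binary: a little over 1 billion)
```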

As big data emerged, so did computing models with the ability to store and manage it.
Centralized or distributed computing systems provide access to big data. Centralized
computing means the data is stored on a central computer and processed by computing
platforms like BigQuery.

Distributed computing means big data is stored and processed on different computers,
which communicate over a network. A software framework like Hadoop makes it possible
to store the data and run applications to process it.
There are benefits to using centralized computing and analyzing big data where it lives,
rather than extracting it for analysis from a distributed system. Insights are accessible to
every user in your company—and integrated into daily workflows—when big data is
housed in one place and analyzed by one platform.

Characteristics of big data:

Big data differs from typical data assets in its volume and complexity and in its need
for advanced business intelligence tools to process and analyze it. The attributes that
define big data are volume, variety, velocity, and variability. These attributes are
commonly referred to as the four V's.

Volume: The key characteristic of big data is its scale—the volume of data that is
available for collection by your enterprise from a variety of devices and sources.

Variety: Variety refers to the formats that data comes in, such as email messages, audio
files, videos, sensor data, and more. Classifications of big data variety include structured,
semi-structured, and unstructured data.

Velocity: Big data velocity refers to the speed at which large datasets are acquired,
processed, and accessed.

Variability: Big data variability means the meaning of the data constantly changes.
Therefore, before big data can be analyzed, the context and meaning of the datasets
must be properly understood.

Types of Big Data:

Big Data is essentially classified into three types:

• Structured Data
• Unstructured Data
• Semi-structured Data

The above three types of Big Data are technically applicable at all levels of analytics. It is
critical to understand the source of raw data and its treatment before analysis while
working with large volumes of big data. Because there is so much data, extraction of
information needs to be done efficiently to get the most out of the data.

OVERVIEW OF BIG DATA MANAGED SERVICES:

• Dataproc
• Dataflow
• BigQuery

Google Cloud Dataproc:

Cloud Dataproc is a managed Apache Hadoop and Spark service for batch processing,
querying, streaming and machine learning. Users can quickly spin up Hadoop or Spark
clusters and resize them at any time without compromising data pipelines through
automation and orchestration. It integrates with other Google big data services, such as
BigQuery and Bigtable, as well as Stackdriver Logging and Monitoring (now called Cloud
Logging and Cloud Monitoring).

Google Cloud Dataflow:

Cloud Dataflow is a serverless stream and batch processing service. Users can build a
pipeline to manage and analyze data in the cloud, while Cloud Dataflow automatically
manages the resources. It was built to integrate with other Google services, including
BigQuery and Cloud Machine Learning, as well as third-party products, such as Apache
Spark and Apache Beam.

Google BigQuery:

BigQuery is a data warehouse that processes and analyzes large data sets using SQL
queries. It can also capture and examine streaming data for real-time analytics.
It stores data with Google's Capacitor columnar data format, and users can load data via
streaming or batch loads. To load, export, query and copy data, use the classic web UI,
the web UI in the GCP Console, the bq command-line tool or client libraries. Since
BigQuery is a serverless offering, enterprises only pay for the storage and compute they
consume.

LEVERAGE BIG DATA OPERATIONS WITH CLOUD DATAPROC

DATAPROC:

Dataproc is a managed Spark and Hadoop service that lets you take advantage of open-
source data tools for batch processing, querying, streaming, and machine learning.

Dataproc is a fully managed service for hosting open-source distributed processing
platforms such as Apache Spark, Presto, Apache Flink and Apache Hadoop on Google
Cloud. Unlike on-premises clusters, Dataproc gives organizations the flexibility to
provision and configure clusters of varying size on demand.

Dataproc automation helps to create clusters quickly, manage them easily, and save
money by turning clusters off when you don't need them.

In addition, Dataproc has powerful features to enable your organization to lower costs,
increase performance and streamline operational management of workloads running on
the cloud.

Hadoop and Spark:

Apache Hadoop and Apache Spark are open-source technologies that often are the
foundation of big data processing.

Apache Hadoop is a set of tools and technologies which enables a cluster of computers
to store and process large volumes of data. The Apache Hadoop software library is a
framework that allows for the distributed processing of large data sets across clusters of
computers using simple programming models.
Apache Spark is a unified analytics engine for large-scale data processing and achieves
high performance for both batch and stream data.

Cloud Dataproc features:

Cost effective: Dataproc is priced at 1 cent per virtual CPU per cluster per hour, on top
of any other Google Cloud resources you use. In addition, Dataproc clusters can include
preemptible instances that have lower compute prices. You use and pay for things only
when you need them.

Fast and scalable: Dataproc clusters are quick to start, scale, and shut down, and each
of these operations takes 90 seconds or less, on average. Clusters can be created and
scaled quickly with many virtual machine types, disk sizes, number of nodes, and
networking options.

Open-source ecosystem: You can use Spark and Hadoop tools, libraries, and
documentation with Dataproc.

Fully managed: You can easily interact with clusters and Spark or Hadoop jobs, without
the assistance of an administrator or special software, through the Cloud Console, the
Google Cloud SDK, or the Dataproc REST API.

Image versioning: Dataproc’s image versioning feature lets you switch between
different versions of Apache Spark, Apache Hadoop, and other tools.

Built-in integration: Dataproc has built-in integration with Cloud Storage, BigQuery, and
Cloud Bigtable, so data can be kept outside the cluster and is not lost when a cluster is
shut down. This, together with Cloud Logging and Cloud Monitoring, provides a complete
data platform and not just a Spark or Hadoop cluster.
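
The pricing rule under "Cost effective" above can be worked through with a small calculation. This is a minimal sketch that assumes the 1 cent per vCPU per hour rate quoted in these notes and covers only the Dataproc fee, not the Compute Engine or storage charges that come on top of it. The cluster size and running time are hypothetical.

```python
from decimal import Decimal

# Rate quoted in the notes: 1 cent per vCPU per hour (Dataproc fee only).
DATAPROC_FEE_PER_VCPU_HOUR = Decimal("0.01")


def dataproc_fee(nodes: int, vcpus_per_node: int, hours: Decimal) -> Decimal:
    """Dataproc service fee in USD, excluding Compute Engine and other charges."""
    return DATAPROC_FEE_PER_VCPU_HOUR * nodes * vcpus_per_node * hours


# Hypothetical cluster: 1 master + 4 workers, 4 vCPUs each, running for 2.5 hours.
fee = dataproc_fee(nodes=5, vcpus_per_node=4, hours=Decimal("2.5"))
print(f"Dataproc fee: ${fee:.2f}")
```

Because the fee scales with vCPU-hours, deleting a cluster as soon as its jobs finish reduces the fee in direct proportion to the time saved.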

Cluster web interfaces:

Some of the core open-source components included with Dataproc clusters, such as
Apache Hadoop and Apache Spark, provide web interfaces. These interfaces can be used
to manage and monitor cluster resources and facilities, such as the YARN resource
manager, the Hadoop Distributed File System (HDFS), MapReduce, and Spark.

Connecting to web interfaces:

You can connect to web interfaces running on a Dataproc cluster using the Dataproc
Component Gateway, your project's Cloud Shell, or the Google Cloud CLI (the gcloud
command-line tool).

Dataproc components:

When you create a cluster, standard Apache Hadoop ecosystem components are
automatically installed on the cluster. You can install additional components, called
"optional components", on the cluster when you create the cluster. Adding optional
components to a cluster has the following advantages:

• Faster cluster startup times
• Tested compatibility with specific Dataproc versions
• Use of a cluster parameter instead of an initialization action script
• Optional components are integrated with other Dataproc components.

Available optional components:


Optional component    COMPONENT_NAME in gcloud commands and API requests
Anaconda              ANACONDA
Docker                DOCKER
Druid                 DRUID
Flink                 FLINK
HBase                 HBASE
Hive WebHCat          HIVE_WEBHCAT
Jupyter Notebook      JUPYTER
Presto                PRESTO
Ranger                RANGER
Solr                  SOLR
Zeppelin Notebook     ZEPPELIN
Zookeeper             ZOOKEEPER

Adding optional components:


To create a Dataproc cluster and install one or more optional components on the cluster,
use the gcloud dataproc clusters create cluster-name command with the
--optional-components flag.

gcloud dataproc clusters create cluster-name \
    --optional-components=COMPONENT-NAME(s) \
    ... other flags
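
When the command is generated from a script, it helps to validate component names before calling gcloud. The sketch below only builds the argument list for the command shown above and does not run it. The helper name build_create_command and the cluster name are hypothetical, the set of names is copied from the table of optional components, and a real invocation would normally need further flags (for example a region) that are omitted here.

```python
import shlex

# Component names copied from the table of optional components above.
OPTIONAL_COMPONENTS = {
    "ANACONDA", "DOCKER", "DRUID", "FLINK", "HBASE", "HIVE_WEBHCAT",
    "JUPYTER", "PRESTO", "RANGER", "SOLR", "ZEPPELIN", "ZOOKEEPER",
}


def build_create_command(cluster_name: str, components: list[str]) -> list[str]:
    """Return the argv list for creating a cluster with optional components."""
    unknown = [c for c in components if c not in OPTIONAL_COMPONENTS]
    if unknown:
        raise ValueError(f"unknown optional components: {unknown}")
    return [
        "gcloud", "dataproc", "clusters", "create", cluster_name,
        # Multiple components are passed as one comma-separated value.
        "--optional-components=" + ",".join(components),
    ]


argv = build_create_command("example-cluster", ["JUPYTER", "ZEPPELIN"])
print(shlex.join(argv))
```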

Dataproc compute options:

Dataproc clusters are built on Compute Engine instances. Machine types define the
virtualized hardware resources available to an instance. Compute Engine offers both
predefined machine types and custom machine types. Dataproc clusters can use
predefined or custom machine types for master and worker nodes.

Dataproc supports the following Compute Engine predefined machine types in clusters:

• General purpose machine types, which include N1, N2, N2D, and E2 machine types.
Dataproc also supports N1, N2, N2D, and E2 custom machine types.
• Compute-optimized machine types, which include C2 machine types.
• Memory-optimized machine types, which include M1 and M2 machine types.

Custom machine types:

Custom machine types are ideal for the following workloads:

• Workloads that are not a good fit for the predefined machine types.
• Workloads that require more processing power or more memory, but don't
need all of the upgrades that are provided by the next machine type level.

Dataproc Data Storage:

Dataproc integrates with Apache Hadoop and the Hadoop Distributed File System (HDFS).
The following features and considerations can be important when selecting compute and
data storage options for Dataproc clusters and jobs:

• HDFS with Cloud Storage: Dataproc uses the Hadoop Distributed File System
(HDFS) for storage. Additionally, Dataproc automatically installs the HDFS-compatible
Cloud Storage connector, which enables the use of Cloud Storage
in parallel with HDFS. Data can be moved in and out of a cluster through
upload/download to HDFS or Cloud Storage.

• VM disks:
  • By default, when no local SSDs are provided, HDFS data and intermediate
  shuffle data are stored on VM boot disks, which are Persistent Disks.
  • If you use local SSDs, HDFS data and intermediate shuffle data are stored on
  the SSDs.
  • Persistent Disk size and type affect performance and VM size, whether using
  HDFS or Cloud Storage for data storage.
  • VM boot disks are deleted when the cluster is deleted.

Cloud Dataproc use cases:

1. Log processing
2. Ad-hoc data analysis
3. Machine learning

Dataproc Workflow Templates:

The Dataproc WorkflowTemplates API provides a flexible and easy-to-use mechanism for
managing and executing workflows. A Workflow Template is a reusable workflow
configuration. It defines a graph of jobs with information on where to run those jobs.

Key Points:

Instantiating a Workflow Template launches a Workflow. A Workflow is an operation that
runs a Directed Acyclic Graph (DAG) of jobs on a cluster.

If the workflow uses a managed cluster, it creates the cluster, runs the jobs, and then
deletes the cluster when the jobs are finished.

If the workflow uses a cluster selector, it runs jobs on a selected existing cluster.

Workflows are ideal for complex job flows. You can create job dependencies so that a job
starts only after its dependencies complete successfully.

When you create a workflow template, Dataproc does not create a cluster or submit jobs
to a cluster. Dataproc creates or selects a cluster and runs workflow jobs on the cluster
only when the workflow template is instantiated.
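
The rule that a job starts only after its dependencies complete is a topological ordering of the job graph. The following is a minimal sketch of that idea using Python's standard graphlib module. It is not the Dataproc WorkflowTemplates API, and the job names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each job maps to the set of jobs it depends on.
job_dependencies = {
    "ingest": set(),
    "clean": {"ingest"},
    "aggregate": {"clean"},
    "train-model": {"clean"},
    "report": {"aggregate", "train-model"},
}


def run_workflow(dependencies: dict[str, set[str]]) -> list[str]:
    """Run jobs in dependency order and return the order that was used."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not acyclic
    order = []
    while sorter.is_active():
        for job in sorted(sorter.get_ready()):  # jobs whose prerequisites are done
            order.append(job)   # a real workflow would submit the job here
            sorter.done(job)    # mark success so dependent jobs can start
    return order


print(run_workflow(job_dependencies))
```

Jobs returned together by get_ready() have no dependency on each other, so a real scheduler could run them in parallel; this sketch runs them one after another for simplicity.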

Kinds of Workflow Templates:

Managed cluster:

A workflow template can specify a managed cluster. The workflow will create an
"ephemeral" cluster to run workflow jobs, and then delete the cluster when the workflow
is finished.

Cluster selector:

A workflow template can specify an existing cluster on which to run workflow jobs by
specifying one or more user labels previously attached to the cluster. The workflow will
run on a cluster that matches all of the labels. If multiple clusters match all labels,
Dataproc selects the cluster with the most available YARN memory to run all workflow
jobs. At the end of the workflow, Dataproc does not delete the selected cluster.
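
The selection rule described above (match every label, then prefer the cluster with the most available YARN memory) can be sketched as follows. The Cluster class, its field names, and the sample clusters are hypothetical and serve only to illustrate the rule; they are not Dataproc API objects.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    name: str
    labels: dict[str, str] = field(default_factory=dict)
    yarn_available_mb: int = 0


def select_cluster(clusters: list[Cluster], selector: dict[str, str]) -> Cluster:
    """Pick the cluster matching every selector label with the most free YARN memory."""
    matches = [
        c for c in clusters
        if all(c.labels.get(key) == value for key, value in selector.items())
    ]
    if not matches:
        raise LookupError("no cluster matches all selector labels")
    return max(matches, key=lambda c: c.yarn_available_mb)


clusters = [
    Cluster("etl-a", {"env": "prod", "team": "data"}, yarn_available_mb=8192),
    Cluster("etl-b", {"env": "prod", "team": "data"}, yarn_available_mb=16384),
    Cluster("dev-1", {"env": "dev", "team": "data"}, yarn_available_mb=32768),
]
chosen = select_cluster(clusters, {"env": "prod", "team": "data"})
print(chosen.name)  # etl-b
```

Here dev-1 has the most free memory but is excluded because its env label does not match, so etl-b is chosen.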

Parameterized:

If you run a workflow template multiple times with different values, use parameters
to avoid editing the workflow template for each run:

• define parameters in the template, then
• pass different values for the parameters for each run.

Inline:

Workflows can be instantiated inline using the gcloud command with workflow template
YAML files or by calling the Dataproc InstantiateInline API. Inline workflows do not create
or modify workflow template resources.

BUILD EXTRACT, TRANSFORM, AND LOAD PIPELINES USING CLOUD DATAFLOW

DATAFLOW:

Dataflow is a serverless, fast and cost-effective service that supports both stream and
batch processing. It provides portability with processing jobs written using the open-
source Apache Beam libraries and removes operational overhead from your data
engineering teams by automating the infrastructure provisioning and cluster
management.

The Apache Beam SDK is an open-source programming model that enables you to develop
both batch and streaming pipelines. You create pipelines with an Apache Beam program
and then run them on the Dataflow service.

Apache Beam Programming Model:

Apache Beam is an open-source, unified model for defining both batch and streaming
data-parallel processing pipelines. The Apache Beam programming model simplifies the
mechanics of large-scale data processing. Using one of the Apache Beam SDKs, you build
a program that defines the pipeline. Then, one of Apache Beam's supported distributed
processing backends, such as Dataflow, executes the pipeline.

Basic concepts:
Pipelines
A pipeline encapsulates the entire series of computations involved in reading
input data, transforming that data, and writing output data. The input source and
output sink can be the same or of different types, allowing you to convert data
from one format to another. Apache Beam programs start by constructing
a Pipeline object, and then using that object as the basis for creating the pipeline's
datasets. Each pipeline represents a single, repeatable job.

PCollection
A PCollection represents a potentially distributed, multi-element dataset that
acts as the pipeline's data. Apache Beam transforms use PCollection objects as
inputs and outputs for each step in your pipeline. A PCollection can hold a dataset
of a fixed size or an unbounded dataset from a continuously updating data source.

Transforms
A transform represents a processing operation that transforms data. A
transform takes one or more PCollections as input, performs an operation that you
specify on each element in that collection and produces one or more PCollections
as output. A transform can perform nearly any kind of processing operation,
including performing mathematical computations on data, converting data from
one format to another, grouping data together, reading and writing data, filtering
data to output only the elements you want or combining data elements into single
values.

ParDo
ParDo is the core parallel processing operation in the Apache Beam SDKs,
invoking a user-specified function on each of the elements of the
input PCollection. ParDo collects the zero or more output elements into an
output PCollection. The ParDo transform processes elements independently and
possibly in parallel.

Pipeline I/O
Apache Beam I/O connectors let you read data into your pipeline and write
output data from your pipeline. An I/O connector consists of a source and a sink.
All Apache Beam sources and sinks are transforms that let your pipeline work with
data from several different data storage formats. You can also write a custom I/O
connector.

Aggregation
Aggregation is the process of computing some value from multiple input
elements. The primary computational pattern for aggregation in Apache Beam is
to group all elements with a common key and window. Then, it combines each
group of elements using an associative and commutative operation.

User-defined functions (UDFs)
Some operations within Apache Beam allow executing user-defined code as a
way of configuring the transform. For ParDo, user-defined code specifies the
operation to apply to every element, and for Combine, it specifies how values
should be combined. A pipeline might contain UDFs written in a different language
than the language of your runner. A pipeline might also contain UDFs written in
multiple languages.

Runner
Runners are the software that accepts a pipeline and executes it. Most runners
are translators or adapters to massively parallel big-data processing systems.
Other runners exist for local testing and debugging.

Source
A transform that reads from an external storage system. A pipeline typically
reads input data from a source. The source has a type, which may be different
from the sink type, so you can change the format of data as it moves through the
pipeline.

Sink
A transform that writes to an external data storage system, like a file or a database.
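
The concepts above (source, ParDo, aggregation with a combine function, sink) can be tied together in a small word count. This sketch deliberately uses plain Python rather than the Apache Beam SDK, so the function names par_do and combine_per_key are illustrative stand-ins and not Beam API calls, and an in-memory list stands in for a PCollection.

```python
from collections import defaultdict
from typing import Callable, Iterable, Iterator


def par_do(elements: Iterable, fn: Callable[[object], Iterable]) -> Iterator:
    """Apply fn to each element independently; fn may emit zero or more outputs."""
    for element in elements:
        yield from fn(element)


def combine_per_key(pairs: Iterable[tuple], combine_fn: Callable[[Iterable], object]) -> dict:
    """Group (key, value) pairs by key, then reduce each group with combine_fn."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: combine_fn(values) for key, values in groups.items()}


# Source: a bounded in-memory collection standing in for an external store.
lines = ["big data in the cloud", "", "data pipelines in the cloud"]

words = par_do(lines, lambda line: line.split())   # the empty line emits nothing
pairs = par_do(words, lambda word: [(word, 1)])    # one output per input element
counts = combine_per_key(pairs, sum)               # sum is associative and commutative

# Sink: here simply a sorted list of formatted rows.
output = sorted(f"{word}: {n}" for word, n in counts.items())
print(output)
```

Because each par_do call treats elements independently and the combine function is associative and commutative, a runner is free to split this work across many machines without changing the result.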

Streaming pipelines:

Unbounded PCollections, or unbounded collections, represent data in streaming pipelines.
An unbounded collection contains data from a continuously updating data source such as
Pub/Sub.

To aggregate elements in unbounded collections, windows, watermarks, and triggers can
be used.

The concept of windows also applies to bounded PCollections that represent data in batch
pipelines.

Windows and windowing functions:

Windowing functions divide unbounded collections into logical components, or windows.
Windowing functions group unbounded collections by the timestamps of the individual
elements. Each window contains a finite number of elements.

You can set the following windows with the Apache Beam SDK or Dataflow SQL streaming
extensions:

• Tumbling windows (called fixed windows in Apache Beam)
• Hopping windows (called sliding windows in Apache Beam)
• Session windows
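
The three window types differ only in how a timestamp is mapped to windows. The sketch below shows one way to express each mapping in plain Python, using integer timestamps in seconds and half-open [start, end) windows. It is a simplified model under those assumptions, not the Apache Beam windowing API, and the function names are illustrative.

```python
def fixed_window(ts: int, size: int) -> tuple[int, int]:
    """Tumbling/fixed: each timestamp belongs to exactly one [start, end) window."""
    start = ts - ts % size
    return (start, start + size)


def sliding_windows(ts: int, size: int, period: int) -> list[tuple[int, int]]:
    """Hopping/sliding: windows of length `size` start every `period` seconds."""
    last_start = ts - ts % period
    # Walk backwards over every window start that still covers ts.
    starts = range(last_start, ts - size, -period)
    return sorted((s, s + size) for s in starts)


def session_windows(timestamps: list[int], gap: int) -> list[list[int]]:
    """Session: a new window starts once the pause since the previous element reaches `gap`."""
    sessions: list[list[int]] = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] < gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


print(fixed_window(65, size=60))                     # (60, 120)
print(sliding_windows(65, size=60, period=30))       # [(30, 90), (60, 120)]
print(session_windows([1, 3, 4, 20, 22, 50], gap=10))  # [[1, 3, 4], [20, 22], [50]]
```

Note that an element falls into exactly one fixed window, into size / period sliding windows, and into a session whose boundaries depend on the data itself.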
Watermarks:

A watermark is a threshold that indicates when Dataflow expects all of the data in a
window to have arrived. If new data arrives with a timestamp that's in the window but
older than the watermark, the data is considered late data.

Dataflow tracks watermarks because of the following:

• Data is not guaranteed to arrive in time order or at predictable intervals.
• Data events are not guaranteed to appear in pipelines in the same order that they
were generated.

The data source determines the watermark. You can allow late data with the Apache
Beam SDK. Dataflow SQL does not process late data.
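
The definition of late data given above can be illustrated with a short classification. This is a simplified model that follows the wording of these notes (an element is late when its timestamp lies in the window but is older than the watermark at arrival time). The arrival log and the function name is_late are hypothetical, and a real runner tracks the watermark itself rather than receiving it as input.

```python
def is_late(event_ts: int, watermark: int) -> bool:
    """An element is late if its event timestamp is older than the current watermark."""
    return event_ts < watermark


# Hypothetical arrival log: (event timestamp, watermark when it arrived), in seconds.
arrivals = [(10, 5), (42, 30), (25, 60), (61, 60), (58, 90)]

window_start, window_end = 0, 60  # one fixed window, [0, 60)
on_time, late = [], []
for event_ts, watermark in arrivals:
    if not window_start <= event_ts < window_end:
        continue  # this element belongs to a different window
    if is_late(event_ts, watermark):
        late.append(event_ts)
    else:
        on_time.append(event_ts)

print(on_time)  # [10, 42]
print(late)     # [25, 58]
```

Whether the late elements are then dropped or still folded into the window's result is a policy decision; as noted above, the Apache Beam SDK lets you allow late data, while Dataflow SQL does not process it.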

Triggers:

Triggers determine when to emit aggregated results as data arrives. By default, results
are emitted when the watermark passes the end of the window.

You can use the Apache Beam SDK to create or modify triggers for each collection in a
streaming pipeline. You cannot set triggers with Dataflow SQL.

The Apache Beam SDK can set triggers that operate on any combination of the following
conditions:

• Event time, as indicated by the timestamp on each data element.
• Processing time, which is the time that the data element is processed at any given
stage in the pipeline.
• The number of data elements in a collection.
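
To make the third condition concrete, the sketch below models a trigger that fires on element count: it emits the running sum every few elements, then emits once more when the input ends, which stands in for the watermark passing the end of the window. It is plain Python rather than the Beam trigger API, the function name is illustrative, and it assumes each emitted result accumulates all elements seen so far.

```python
from typing import Iterable, Iterator


def count_trigger_sums(values: Iterable[int], every: int) -> Iterator[int]:
    """Emit the running sum each time `every` new elements have arrived."""
    total = 0
    pending = 0
    for value in values:
        total += value
        pending += 1
        if pending == every:   # trigger condition: element count reached
            yield total        # emit an early, partial result
            pending = 0
    if pending:                # stand-in for the watermark closing the window
        yield total            # final result includes the leftover elements


panes = list(count_trigger_sums([3, 1, 4, 1, 5, 9, 2], every=3))
print(panes)  # [8, 23, 25]
```

Early results like 8 and 23 let downstream consumers see an approximate answer before the window closes, at the cost of emitting more than one result per window.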


Dataflow Templates:

Dataflow templates allow you to package a Dataflow pipeline for deployment. Anyone
with the correct permissions can then use the template to deploy the packaged pipeline.
You can create your own custom Dataflow templates, and Google provides pre-built
templates for common scenarios.

Advantages of pipeline templates

Templates have several advantages over directly deploying a pipeline to Dataflow:

• Templates separate pipeline design from deployment. For example, a developer
can create a template, and a data scientist can deploy the template at a later time.
• Templates can have parameters that let you customize the pipeline when you
deploy the template.
• You can deploy a template by using the Google Cloud console, the Google Cloud
CLI, or REST API calls. You don't need a development environment or any
pipeline dependencies installed on your local machine.
• A template is a code artifact that can be stored in a source control repository and
used in continuous integration and continuous delivery (CI/CD) pipelines.

Dataflow supports two types of template: Flex templates, which are newer, and classic
templates. If you are creating a new Dataflow template, Google recommends creating it
as a Flex template.

Template workflow

Using Dataflow templates involves the following high-level steps:

1. Developers set up a development environment and develop their pipeline. The
environment includes the Apache Beam SDK and other dependencies.

2. Depending on the template type (Flex or classic):
   • For Flex templates, the developers package the pipeline into a Docker
   image, push the image to Container Registry or Artifact Registry, and upload
   a template specification file to Cloud Storage.
   • For classic templates, developers run the pipeline, create a template file,
   and stage the template to Cloud Storage.

3. Other users submit a request to the Dataflow service to run the template.

4. Dataflow creates a pipeline from the template.

BIGQUERY:

• BigQuery is a fully managed enterprise data warehouse that helps you manage
and analyze your data with built-in features like machine learning, geospatial
analysis, and business intelligence.
• BigQuery's serverless architecture lets you use SQL queries to answer your
organization's biggest questions with zero infrastructure management.
• BigQuery maximizes flexibility by separating the compute engine that analyzes
your data from your storage choices.
• BigQuery interfaces include the Google Cloud console interface and the BigQuery
command-line tool.
• Developers and data scientists can use client libraries with familiar programming
languages, including Python, Java, JavaScript, and Go, as well as BigQuery's REST API
and RPC API, to transform and manage data.

BigQuery storage:

• BigQuery stores data using a columnar storage format that is optimized for
analytical queries.
• BigQuery presents data in tables, rows, and columns and provides full support for
database transaction semantics (ACID).
• BigQuery storage is automatically replicated across multiple locations to provide
high availability.
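
Why a columnar format suits analytical queries can be seen with a toy example. The sketch below pivots a few hypothetical records from a row layout into a column layout and then computes an aggregate that touches a single column. It illustrates the general idea only and says nothing about BigQuery's internal storage format.

```python
# The same three hypothetical records, first in a row-oriented layout.
rows = [
    {"order_id": 1, "country": "IN", "amount": 250.0},
    {"order_id": 2, "country": "US", "amount": 120.5},
    {"order_id": 3, "country": "IN", "amount": 99.5},
]


def to_columns(records: list[dict]) -> dict[str, list]:
    """Pivot a list of row dicts into one list per column."""
    columns: dict[str, list] = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Equivalent of SELECT SUM(amount): only the "amount" list is read;
# the order_id and country columns are never touched.
total = sum(columns["amount"])
print(total)  # 470.0
```

With a row layout the same query would have to read every field of every record, which is why analytical systems that scan a few columns over many rows tend to favor columnar storage.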

BigQuery administration:

• BigQuery provides centralized management of data and compute resources while
Identity and Access Management (IAM) helps to secure those resources with the
access model that's used throughout Google Cloud.

Loading data into BigQuery:

There are several ways to ingest data into BigQuery:

• Batch load a set of data records.
• Stream individual records or batches of records.
• Use queries to generate new data and append or overwrite the results to a table.
• Use a third-party application or service.

Batch loading:

With batch loading, you load the source data into a BigQuery table in a single batch
operation.

Options for batch loading in BigQuery include the following:

• Load jobs. Load data from Cloud Storage or from a local file by creating a load
job.
• SQL. The LOAD DATA SQL statement loads data from one or more files into a new
or existing table.
• BigQuery Data Transfer Service. Use BigQuery Data Transfer Service to
automate loading data from Google Software as a Service (SaaS) apps or from
third-party applications and services.
• BigQuery Storage Write API. The Storage Write API lets you batch-process an
arbitrarily large number of records and commit them in a single atomic operation.
If the commit operation fails, you can safely retry the operation. Unlike BigQuery
load jobs, the Storage Write API does not require staging the data to intermediate
storage such as Cloud Storage.
• Other managed services. Use other managed services to export data from an
external data store and import it into BigQuery.

Batch loading can be done as a one-time operation or on a recurring schedule.
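
As a rough sketch of the SQL option above, the following Python helper builds a LOAD DATA statement. The table name, bucket, and file pattern are hypothetical, and the helper does no escaping, so it assumes trusted inputs. Submitting the statement (for example through the console or a client library) is not shown.

```python
def build_load_data_sql(table: str, uris: list[str], file_format: str = "CSV") -> str:
    """Return a LOAD DATA statement that appends the given files to `table`."""
    uri_list = ", ".join(f"'{uri}'" for uri in uris)
    return (
        f"LOAD DATA INTO {table}\n"
        f"FROM FILES (format = '{file_format}', uris = [{uri_list}])"
    )

# Hypothetical dataset, table, and Cloud Storage path.
sql = build_load_data_sql("mydataset.sales", ["gs://example-bucket/sales/2024-*.csv"])
print(sql)
```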

Streaming:

With streaming, you continually send smaller batches of data in real time, so the data is
available for querying as it arrives. Options for streaming in BigQuery include the
following:

 Storage Write API. The Storage Write API supports high-throughput streaming
ingestion with exactly-once delivery semantics.
 Dataflow. Use Dataflow with the Apache Beam SDK to set up a streaming
pipeline that writes to BigQuery.
 BigQuery Connector for SAP. The BigQuery Connector for SAP enables near
real-time replication of SAP data directly into BigQuery.
Generated data:

You can use SQL to generate data and store the results in BigQuery. Options for
generating data include:

 Use data manipulation language (DML) statements to perform bulk inserts into an
existing table or store query results in a new table.

 Use a CREATE TABLE ... AS statement to create a new table from a query result.
 Run a query and save the results to a table. You can append the results to an
existing table or write to a new table.
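
The two SQL approaches above can be sketched as follows. This is only an illustration: the dataset, table, and column names are hypothetical, and the Python helpers just assemble the statement text.

```python
def create_table_as(new_table: str, query: str) -> str:
    """CREATE TABLE ... AS: store a query result in a new table."""
    return f"CREATE TABLE {new_table} AS\n{query}"

def insert_select(existing_table: str, query: str) -> str:
    """DML bulk insert: append a query result to an existing table."""
    return f"INSERT INTO {existing_table}\n{query}"

# Hypothetical aggregation over a hypothetical sales table.
daily_totals = (
    "SELECT order_date, SUM(amount) AS total\n"
    "FROM mydataset.sales\n"
    "GROUP BY order_date"
)
print(create_table_as("mydataset.daily_totals", daily_totals))
print(insert_select("mydataset.daily_totals", daily_totals))
```
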
BigQuery analytics:

BigQuery is optimized to run analytic queries on large datasets, including terabytes of
data in seconds and petabytes in minutes. Understanding its capabilities and how it
processes queries can help you maximize your data analysis investments.

Analytic workflows:

BigQuery supports several data analysis workflows:

Ad hoc analysis. BigQuery uses Google Standard SQL, the SQL dialect in BigQuery, to
support ad hoc analysis.

Geospatial analysis. BigQuery uses geography data types and Google Standard SQL
geography functions to let you analyze and visualize geospatial data.

Machine learning. BigQuery ML uses Google Standard SQL queries to let you create
and execute machine learning (ML) models in BigQuery.

Business intelligence. BigQuery BI Engine is a fast, in-memory analysis service that
lets you build rich, interactive dashboards and reports without compromising
performance, scalability, security, or data freshness.

Queries:
The primary unit of analysis in BigQuery is the SQL query. BigQuery has two SQL dialects:
Google Standard SQL and legacy SQL. Google Standard SQL is the preferred dialect.

 Data sources:

BigQuery lets you query the following types of data sources:

Data stored in BigQuery - load data into BigQuery for analysis.

External data - query various external data sources such as other Google Cloud storage
services (like Cloud Storage) or database services (like Cloud Spanner or Cloud SQL).

Multi-cloud data - query data that's stored in other public clouds such as AWS or Azure.

Public datasets - analyze any of the datasets that are available in the public dataset
marketplace.

 Types of queries:

After you load your data into BigQuery, you can query the data using one of the following
query job types:

 Interactive query jobs. By default, BigQuery runs interactive (on-demand)
query jobs as soon as possible.

 Batch query jobs. With these jobs, BigQuery queues each batch query on your
behalf and then starts the query when idle resources are available, usually within
a few minutes.
INTRODUCTION TO MACHINE LEARNING IN THE CLOUD

Machine learning lets businesses use data to teach a system how to solve
the problem at hand with machine learning algorithms, and how to get better over time.

Machine learning (ML) is a subfield of artificial intelligence (AI). The goal of ML is to make
computers learn from the data that you give them. Instead of writing code that describes
the action the computer should take, your code provides an algorithm that adapts based
on examples of intended behavior. The resulting program, consisting of the algorithm
and associated learned parameters, is called a trained model.

● Artificial intelligence, or AI, is an umbrella term that includes anything related to
computers mimicking human intelligence. For example, in an online word
processor, software robots perform human actions such as spelling and grammar checks.

● Machine learning is a toolset, like Newton’s laws of mechanics. Just as you can use
Newton’s laws to learn how long it will take a ball to fall to the ground if you drop
it off a cliff, you can use machine learning to solve certain kinds of problems at
scale by using data examples, but without the need for custom code.

● You might have also heard the term deep learning, or deep neural networks. Deep
learning is a subset of machine learning that adds layers in between input data
and output results to make a machine learn at more depth. It’s a type of machine
learning that works even when the data is unstructured, like images, speech,
video, natural language text, and so on. Image classification is a type of deep
learning. A machine can learn how to classify images into categories when it is
shown lots of examples.

The basic difference between machine learning and other techniques in AI is that in
machine learning, machines learn. They don’t start out intelligent, they become
intelligent.
So, how do machines become intelligent? Intelligence requires training. To train a
machine learning model, examples are required.

For example, to train a model to estimate how much you’ll owe in taxes, you must show
the model many, many examples of tax returns. Or, if you want to train a model to estimate
trip time between one location and another, you’ll need to show it many examples of
previous journeys.

The first stage of ML is to train an ML model with examples.

Imagine you work for a manufacturing company, and you want to train a machine learning
model to detect defects in the parts before they are assembled into products. You’d start
by creating a dataset of images of parts. Some of those parts would be good and some
parts would be defective. For each image, you will assign a corresponding label and use
that set of examples to train the model.

An important detail to emphasize is that a machine learning model is only as good as the
data used to train it. And a good model requires a lot of training data of historical
examples of rejected parts and parts in good condition. With these elements, you can
train a model to categorize parts as defective or not.

The basic reason why ML models need high-quality data is that they don’t have
human general knowledge; data is the only thing they have access to.
After the model has been trained, it can be used to make predictions on data it has never
seen before.

In this example, the input for the trained model is an image of a part. Because the model
has been trained on specific examples of good and defective parts, it can correctly predict
that this part is in good condition.
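
The two stages, training on labeled examples and then predicting on unseen input, can be sketched in a few lines of Python. This is a toy stand-in, assuming two hypothetical numeric features per part and a simple nearest-centroid rule; real defect detection would use images and an image classification network.

```python
from statistics import mean

# Hypothetical labeled examples: (features, label). Two numeric features
# stand in for the part images described in the text.
training_data = [
    ([0.1, 0.2], "good"),
    ([0.2, 0.1], "good"),
    ([0.9, 0.8], "defective"),
    ([0.8, 0.9], "defective"),
]

def train(examples):
    """Learn one centroid (average feature vector) per label."""
    by_label = {}
    for features, label in examples:
        by_label.setdefault(label, []).append(features)
    return {
        label: [mean(column) for column in zip(*vectors)]
        for label, vectors in by_label.items()
    }

def predict(model, features):
    """Return the label whose centroid is closest to the input."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: squared_distance(model[label]))

model = train(training_data)         # training stage
print(predict(model, [0.15, 0.25]))  # prediction on a part it has never seen
```

The learned centroids play the role of the "learned parameters" in a trained model: the code stays the same, and only the data changes what it predicts.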

Standard algorithms use cases

Algorithms, or ML models, are standard. That means that they exist independently of the
use case. Although detecting manufacturing defects in images and detecting diseased
leaves in images are two different use cases, the same algorithm, which is an image
classification network, works for both.
Similarly, standard algorithms exist to predict the future value of a time series or to transcribe
human speech to text. ResNet, for example, is a standard algorithm for image
classification. It’s not essential to understand how an image classification algorithm
works, only that it’s the algorithm that we should use if we want to classify images of
automotive parts.

ML OPTIONS ON GOOGLE CLOUD

Google Cloud offers four options for building machine learning models.

1. The first option is BigQuery ML. You’ll remember from the previous module of
this course that BigQuery ML is a tool for using SQL queries to create and execute
machine learning models in BigQuery. If you already have your data in BigQuery
and your problems fit the predefined ML models, this could be your best choice.

2. The second option is AutoML, which is a no-code solution, so you can build your
own machine learning models on Vertex AI through a point-and-click interface.

3. The third option is custom training, through which you can code your own
machine learning environment, the training, and the deployment, which provides
you with flexibility and control over the ML pipeline.

4. And finally, there are pre-built APIs, which are application programming
interfaces. This option lets you use machine learning models that have already
been built and trained by Google, so you don’t have to build your own machine
learning models if you don’t have enough training data or sufficient machine
learning expertise in-house.

BigQuery ML

BigQuery ML lets you create and execute machine learning models in BigQuery using
standard SQL queries. BigQuery ML democratizes machine learning by letting SQL
practitioners build models using existing SQL tools and skills. BigQuery ML increases
development speed by eliminating the need to move data.

BigQuery ML functionality is available by using:

 The Google Cloud console
 The bq command-line tool
 The BigQuery REST API
 An external tool such as a Jupyter notebook or business intelligence platform

Machine learning on large datasets requires extensive programming and knowledge of
ML frameworks. These requirements restrict solution development to a very small set of
people within each company, and they exclude data analysts who understand the data
but have limited machine learning knowledge and programming expertise.

BigQuery ML empowers data analysts to use machine learning through existing SQL tools
and skills. Analysts can use BigQuery ML to build and evaluate ML models in BigQuery.
Analysts don't need to export small amounts of data to spreadsheets or other applications
or wait for limited resources from a data science team.

Supported models in BigQuery ML


A model in BigQuery ML represents what an ML system has learned from the training
data.

BigQuery ML supports the following types of models:


Linear regression for forecasting; for example, the sales of an item on a given day.
Labels are real-valued (they cannot be +/- infinity or NaN).

Binary logistic regression for classification; for example, determining whether a
customer will make a purchase. Labels must only have two possible values.

Multiclass logistic regression for classification. These models can be used to
predict multiple possible values such as whether an input is "low-value," "medium-value,"
or "high-value." Labels can have up to 50 unique values. In BigQuery ML, multiclass
logistic regression training uses a multinomial classifier with a cross-entropy loss function.

K-means clustering for data segmentation; for example, identifying customer
segments. K-means is an unsupervised learning technique, so model training requires
neither labels nor a split of the data for training and evaluation.

Matrix Factorization for creating product recommendation systems. You can create
product recommendations using historical customer behavior, transactions, and product
ratings and then use those recommendations for personalized customer experiences.

Time series for performing time-series forecasts. You can use this feature to create
millions of time series models and use them for forecasting. The model automatically
handles anomalies, seasonality, and holidays.

Boosted Tree for creating XGBoost based classification and regression models.

Deep Neural Network (DNN) for creating TensorFlow-based Deep Neural Networks
for classification and regression models.

AutoML Tables to create best-in-class models without feature engineering or model
selection. AutoML Tables searches through a variety of model architectures to decide the
best model.

TensorFlow model importing. This feature lets you create BigQuery ML models from
previously trained TensorFlow models, then perform prediction in BigQuery ML.

Autoencoder for creating TensorFlow-based BigQuery ML models with the support of
sparse data representations. The models can be used in BigQuery ML for tasks such as
unsupervised anomaly detection and non-linear dimensionality reduction.
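
The multiclass logistic regression entry above mentions a multinomial classifier with a cross-entropy loss. As a minimal sketch of the textbook definitions (not BigQuery ML's internal implementation), with hypothetical scores for one example:

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probabilities, true_index):
    """Loss is the negative log of the probability given to the true class."""
    return -math.log(probabilities[true_index])

labels = ["low-value", "medium-value", "high-value"]
probs = softmax([2.0, 1.0, 0.1])  # hypothetical scores for one example

# The loss is small when the true class received a high probability
# and large when it received a low one.
print(cross_entropy(probs, labels.index("low-value")))
print(cross_entropy(probs, labels.index("high-value")))
```

Training adjusts the model so that this loss, averaged over the training examples, becomes as small as possible.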

Model Selection Guide


Advantages of BigQuery ML

BigQuery ML has the following advantages over other approaches to using ML with a
cloud-based data warehouse:

1. BigQuery ML democratizes the use of ML by empowering data analysts, the
primary data warehouse users, to build and run models using existing business
intelligence tools and spreadsheets. Predictive analytics can guide business
decision-making across the organization.
2. There is no need to program an ML solution using Python or Java. Models are
trained and accessed in BigQuery using SQL, a language data analysts know.
3. BigQuery ML increases the speed of model development and innovation by
removing the need to export data from the data warehouse. Instead, BigQuery ML
brings ML to the data. The need to export and reformat data has the following
disadvantages:
 Increases complexity because multiple tools are required.
 Reduces speed because moving and formatting large amounts of data for
Python-based ML frameworks takes longer than model training in BigQuery.
 Requires multiple steps to export data from the warehouse, restricting the
ability to experiment on your data.
 Can be prevented by legal restrictions such as HIPAA guidelines.


Steps to create Machine Learning models in BigQuery:

1. Create your dataset
2. Create your model
3. Get training statistics
4. Evaluate your model
5. Use your model to predict outcomes
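
The five steps above map onto SQL statements roughly as sketched below. The dataset, model, table, and label column names are hypothetical, and the Python code only assembles the statement text; running each statement in BigQuery is left out.

```python
DATASET = "mydataset"                  # hypothetical dataset name
MODEL = f"{DATASET}.purchase_model"    # hypothetical model name
TRAIN = f"{DATASET}.training_data"     # hypothetical training table
NEW = f"{DATASET}.new_visitors"        # hypothetical table to score

steps = {
    "1_create_dataset": f"CREATE SCHEMA IF NOT EXISTS {DATASET}",
    "2_create_model": (
        f"CREATE OR REPLACE MODEL `{MODEL}`\n"
        "OPTIONS (model_type = 'logistic_reg', input_label_cols = ['purchased'])\n"
        f"AS SELECT * FROM `{TRAIN}`"
    ),
    "3_training_statistics": f"SELECT * FROM ML.TRAINING_INFO(MODEL `{MODEL}`)",
    "4_evaluate": f"SELECT * FROM ML.EVALUATE(MODEL `{MODEL}`)",
    "5_predict": f"SELECT * FROM ML.PREDICT(MODEL `{MODEL}`, (SELECT * FROM `{NEW}`))",
}

for name, sql in steps.items():
    print(f"-- {name}\n{sql}\n")
```
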
VERTEX AI

Build, deploy, and scale machine learning (ML) models faster, with fully managed ML
tools for any use case. Vertex AI allows users to build machine learning models with
either AutoML, a no-code solution, or Custom Training, a code-based solution. AutoML is
easy to use and lets data scientists spend more time turning business problems into ML
solutions, but custom training enables data scientists to have full control over the
development environment and process.

Vertex AI brings AutoML and AI Platform together into a unified API, client library, and
user interface. AutoML lets you train models on image, tabular, text, and video datasets
without writing code, while training in AI Platform lets you run custom training code. With
Vertex AI, both AutoML training and custom training are available options. Whichever
option you choose for training, you can save models, deploy models, and request
predictions with Vertex AI.

Google’s solution to many of the production and ease-of-use challenges is Vertex AI, a
unified platform that brings all the components of the machine learning ecosystem and
workflow together.

Vertex AI is a unified platform

Being a unified platform means having one digital experience to create, manage,
and deploy models over time and at scale. In Vertex AI, the machine learning process is
carried out in four stages:

● Data readiness:

During the data readiness stage, users can upload data from wherever it’s
stored: Cloud Storage, BigQuery, or a local machine.

● Feature readiness:
Then, during the feature readiness stage, users can create features, which
are the processed data that will be put into the model, and then share them with
others by using the feature store.

● Training and hyperparameter tuning:

After that, it’s time for training and hyperparameter tuning. This means that
when the data is ready, users can experiment with different models and adjust
hyperparameters.

● Deployment and model monitoring:

And finally, during deployment and model monitoring, users can set up the
pipeline to move the model into production, with automatic monitoring and
continuous improvement.

Benefits of Vertex AI

 It’s seamless. Vertex AI provides a smooth user experience from uploading and
preparing data all the way to model training and production.
 It’s scalable. The machine learning operations (MLOps) provided by Vertex AI help
to monitor and manage the ML production and therefore scale the storage and
computing power automatically.
 It’s sustainable. All the artifacts and features created with Vertex AI can be reused
and shared.

 And it’s speedy. According to Google, training a model with Vertex AI requires nearly
80% fewer lines of code than competing platforms.

Components of Vertex AI

This section describes the pieces that make up Vertex AI and the primary purpose of each
piece.
 Model training

You can train models on Vertex AI by using AutoML, or if you need the wider range of
customization options available in AI Platform Training, use custom training.

In custom training, you can select from among many different machine types to power
your training jobs, enable distributed training, use hyperparameter tuning, and accelerate
with GPUs.

 Model deployment for prediction

You can deploy models on Vertex AI and get an endpoint to serve predictions on Vertex
AI whether or not the model was trained on Vertex AI.

 Vertex AI Data Labeling

Data Labeling jobs let you request human labeling for a dataset that you plan to use to
train a custom machine learning model. You can submit a request to label your video,
image, or text data.

To submit a labeling request, you provide a representative sample of labeled data, specify
all the possible labels for your dataset, and provide some instructions for how to apply
those labels. The human labelers follow your instructions, and when the labeling request
is complete, you get your annotated dataset that you can use to train a machine learning
model.

 Vertex AI Feature Store

Vertex AI Feature Store is a fully managed repository where you can ingest, serve, and
share ML feature values within your organization. Vertex AI Feature Store manages all
the underlying infrastructure. For example, it provides storage and compute resources
for you and can easily scale as needed.

Tools to interact with Vertex AI


 The Google Cloud console
You can deploy models to the cloud and manage your datasets, models, endpoints,
and jobs on the Google Cloud console. This option provides a user interface for
working with your machine learning resources. As part of Google Cloud, your
Vertex AI resources are connected to useful tools like Cloud Logging and Cloud
Monitoring. The best place to start using the Google Cloud console is the
Dashboard page of the Vertex AI section

 Cloud Client Libraries


Vertex AI provides client libraries for some languages to help you make calls to
the Vertex AI API. The client libraries provide an optimized developer experience
by using each supported language's natural conventions and styles.
Alternatively, you can use the Google API Client Libraries to access the Vertex AI
API by using other languages, such as Dart. When using the Google API Client
Libraries, you build representations of the resources and objects used by the API.
This is easier and requires less code than working directly with HTTP requests.

 REST API
The Vertex AI REST API provides RESTful services for managing jobs, models, and
endpoints, and for making predictions with hosted models on Google Cloud.
 Vertex AI Workbench notebook instances
Vertex AI Workbench is a Jupyter notebook-based development environment for
the entire data science workflow. You can interact with Vertex AI services from
within a Vertex AI Workbench instance's Jupyter notebook. Vertex AI Workbench
integrations and features can make it easier to access your data, process data
faster, schedule notebook runs, and more.

 Deep Learning VM Images


Deep Learning VM Images is a set of virtual machine images optimized for data
science and machine learning tasks. All images come with key ML frameworks and
tools pre-installed. You can use them out of the box on instances with GPUs to
accelerate your data processing tasks.
Deep Learning VM images are available to support many combinations of
framework and processor. There are currently images supporting TensorFlow
Enterprise, TensorFlow, PyTorch, and generic high-performance computing, with
versions for both CPU-only and GPU-enabled workflows.

 Deep Learning Containers


Deep Learning Containers are a set of Docker containers with key data science
frameworks, libraries, and tools pre-installed. These containers provide you with
performance-optimized, consistent environments that can help you prototype and
implement workflows quickly.

Vertex AI workflow

Vertex AI uses a standard machine learning workflow:

 Gather your data: Determine the data you need for training and testing your model
based on the outcome you want to achieve.

 Prepare your data: Make sure your data is properly formatted and labeled.
 Train: Set parameters and build your model.
 Evaluate: Review model metrics.
 Deploy and predict: Make your model available to use.

But before you start gathering your data, you need to think about the problem you are
trying to solve. This will inform your data requirements.
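
The workflow stages can be sketched end to end in plain Python. This is a toy illustration under stated assumptions: the records are hypothetical, and the "model" is just a learned threshold rather than anything Vertex AI would train.

```python
# 1. Gather: hypothetical (hours_studied, passed) records; None marks a gap.
raw = [(1, 0), (2, 0), (3, 0), (4, 0), (None, 1), (6, 1), (7, 1), (8, 1), (9, 1), (5, 1)]

# 2. Prepare: drop incomplete rows, then hold out every third row for testing.
clean = [(x, y) for x, y in raw if x is not None]
test_set = clean[::3]
train_set = [row for i, row in enumerate(clean) if i % 3 != 0]

# 3. Train: the "model" is a threshold halfway between the class averages.
def train(examples):
    passed = [x for x, y in examples if y == 1]
    failed = [x for x, y in examples if y == 0]
    return (sum(passed) / len(passed) + sum(failed) / len(failed)) / 2

def predict(threshold, x):
    return 1 if x >= threshold else 0

threshold = train(train_set)

# 4. Evaluate: accuracy on rows the model never saw during training.
accuracy = sum(predict(threshold, x) == y for x, y in test_set) / len(test_set)

# 5. Deploy and predict: use the trained model on new input.
print(threshold, accuracy, predict(threshold, 7))
```

Holding out part of the data in step 2 is what makes the evaluation in step 4 meaningful: the model is scored on examples it was not trained on.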

Vertex AI provides two solutions in one

Vertex AI allows users to build machine learning models with either AutoML, a no-code
solution, or Custom Training, a code-based solution. AutoML is easy to use and lets
data scientists spend more time turning business problems into ML solutions, but custom
training enables data scientists to have full control over the development environment
and process.

AUTOML

AutoML is a no-code solution to build ML models through Vertex AI. It allows users to
train high-quality custom machine learning models with minimal effort or machine
learning expertise.

If you've worked with ML models before, you know that training and deploying ML models
can be incredibly time consuming, because you need to repeatedly add new data and
features, try different models, and tune parameters to achieve the best result.

To solve this problem, when AutoML was first announced in January of 2018, the goal
was to automate machine learning pipelines to save data scientists from manual work,
such as tuning hyperparameters and comparing against multiple models.

Two technologies Used:

 Transfer Learning:

In transfer learning, you build a knowledge base in the field. You can think of this like
gathering many books to create a library.

Transfer learning is a powerful technique that lets people with smaller datasets, or less
computational power, achieve state-of-the-art results by using pre-trained models that
have been trained on similar, larger datasets. Because the model learns through transfer
learning, it doesn’t have to learn from the beginning, so it can generally reach higher
accuracy with much less data and computation time than models that don’t use transfer
learning.

 Neural architecture search:

The second technology is neural architecture search. The goal of neural architecture
search is to find the optimal model for the relevant project.

AutoML is powered by the latest machine-learning research, so although it looks as if a
single model is being trained, the AutoML platform actually trains and evaluates multiple
models and compares them to each other. This neural architecture search produces an
ensemble of ML models and chooses the best one.
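
The idea of training several candidate models and keeping the best one can be illustrated with a much simpler stand-in. The sketch below is ordinary model selection on a validation set, not an actual neural architecture search; the data points and the two candidate models are hypothetical.

```python
from statistics import mean

train_xy = [(0, 1.0), (1, 3.1), (2, 4.9), (3, 7.2)]  # hypothetical training data
valid_xy = [(4, 9.0), (5, 11.1)]                     # hypothetical validation data

def fit_constant(data):
    """Candidate 1: always predict the average target."""
    avg = mean(y for _, y in data)
    return lambda x: avg

def fit_linear(data):
    """Candidate 2: least-squares line y = slope * x + intercept."""
    mx, my = mean(x for x, _ in data), mean(y for _, y in data)
    slope = sum((x - mx) * (y - my) for x, y in data) / sum((x - mx) ** 2 for x, _ in data)
    return lambda x: slope * x + (my - slope * mx)

def mse(model, data):
    """Mean squared error of a fitted model on a dataset."""
    return mean((model(x) - y) ** 2 for x, y in data)

# Train every candidate, score each on held-out data, keep the best one.
candidates = {"constant": fit_constant, "linear": fit_linear}
scores = {name: mse(fit(train_xy), valid_xy) for name, fit in candidates.items()}
best = min(scores, key=scores.get)
print(scores, best)
```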

AutoML supports four types of data: image, tabular, text, and video. For each data
type, AutoML solves different types of problems, called objectives.

Benefits of AutoML

 Using these technologies has produced a tool that can significantly benefit data
scientists
 One of the biggest benefits is that it’s a no-code solution. That means it can
train high-quality custom machine learning models with minimal effort and requires
little machine learning expertise.
 This allows data scientists to focus their time on tasks like defining business
problems or evaluating and improving model results.
 Others might find AutoML useful to quickly prototype models and explore new
datasets before investing in development. This might involve using it to identify
the best features in a dataset

Difference between Traditional ML Development and AutoML

AutoML is short for automated machine learning.

In traditional ML development, training and deploying models can be incredibly time consuming,
because you need to repeatedly add new data and features, try different models, and
tune parameters to achieve the best result.

To solve this problem, when AutoML was first announced in January of 2018, the goal
was to automate machine learning pipelines to save data scientists from manual work,
such as tuning hyperparameters and comparing against multiple models.

The AutoML product offerings are:

 Sight — AutoML Vision, AutoML Video Intelligence


 Language — AutoML Natural Language, AutoML Translation
 Structured data — AutoML Tables

How AutoML works

Here are the components you need:

Google Account + Enable Billing

Cloud AutoML and Storage API

AutoML Tables takes your dataset and starts training for multiple model architectures at
the same time. This lets Cloud AutoML determine the best model architecture for your
data quickly, without having to serially iterate over the many possible model
architectures. The model architectures to test include:

 Linear
 Feedforward deep neural network
 Gradient Boosted Decision Tree
 AdaNet
 Ensembles of various model architectures

As new model architectures come out of the research community, Google will add those
as well.

Dataset → AutoML → Generate Predictions with a REST API call. Image Source, Cloud
AutoML

Data Preparation

Before we can start training, we must ensure that the training data is in a format that is
supported by the platform. AutoML Tabular supports two ways of importing training data;
if we are already utilising Google’s BigQuery platform, we can import the data straight
from a BigQuery table, or we can upload a comma-separated values (CSV) file to Cloud
Storage.

If you are dealing with structured data, AutoML Tables is recommended. Depending on
use cases, AutoML Tables will create the necessary model to solve the problem:

 A binary classification model predicts a binary outcome (one of two classes). Use
this for yes or no questions, for example, predicting whether a customer would
buy a subscription (or not). All else being equal, a binary classification problem
requires less data than other model types.
 A multi-class classification model predicts one class from three or more discrete
classes. Use this to categorize things. For the retail example, you’d want to build
a multi-class classification model to segment customers into different personas.
 A regression model predicts a continuous value. For the retail example, you’d also
want to build a regression model to forecast customer spending over the next
month.

Creating a Dataset

Importing a dataset can be done through the Google Cloud Platform (GCP) Console. In
the Vertex AI section there is a Datasets page where we can manage our datasets. From
there we can create a tabular dataset and select a data source for the dataset.
Sample structured data that you can use with Cloud AutoML

A visualisation of the splits, Image Source: About Train, Validation and Test Sets in
Machine Learning

On the other hand, working with unstructured data is more complicated. The Vision API
classifies images into thousands of predefined categories, detects individual objects and
faces within images, and finds and reads printed words contained within images. If you
want to detect individual objects, faces, and text in your dataset, or your need for image
classification is quite generic, try out the Vision API and see if it works for you.
The Vision API classification images

Coupled with image labeling in the Vision API, AutoML Vision enables you to perform
supervised learning, which involves training a computer to recognize patterns from
labeled data. Using supervised learning, we can train a model to recognize the patterns
and content that we care about in images.

Training a Model

Once the dataset has been created we can start training our first model.
The model outputs a series of numbers that communicate how strongly it associates each
label with that example. If the number is high, the model has high confidence that the
label should be applied to that document.
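
As a small sketch of how such scores might be turned into labels, assuming a hypothetical score dictionary and a confidence threshold of 0.5 chosen only for illustration:

```python
def labels_above_threshold(scores, threshold=0.5):
    """Keep labels whose confidence meets the threshold, highest first."""
    kept = [(label, score) for label, score in scores.items() if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Hypothetical model output for one example.
scores = {"defective": 0.91, "good": 0.07, "needs_review": 0.55}

print(labels_above_threshold(scores))  # every label the model is confident about
print(max(scores, key=scores.get))     # the single most confident label
```

Raising the threshold returns fewer, more certain labels; lowering it returns more labels at the cost of more mistakes.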

After you have created (trained) a model, you can request a prediction for an image using
the predict method. The available options for online (individual) prediction are:

 Web UI
 Command-Line
 Python
 Java
 Node.js
Deployment, Testing and Explainability

The trained model that is produced by AutoML can then be deployed in two ways; we
can export the model as a saved TensorFlow (TF) model which we can then serve
ourselves in a Docker container, or we can deploy the model directly to a GCP endpoint.

CUSTOM TRAINING

Custom training lets you control the entire ML development process
by writing your own code, from data preparation to model deployment.

If you want to code your machine learning model, you can use this option by building a
custom training solution with Vertex AI Workbench.

Workbench is a single development environment for the entire data science
workflow, from exploring, to training, and then deploying a machine learning model with
code.

Before any coding begins, you must determine what environment you want your ML
training code to use.

Choose a training code structure

First, determine what structure you want your ML training code to take. You can provide
training code to Vertex AI in one of the following forms:

 Pre-built container
 Custom container

Pre-built container

Imagine that a container is a kitchen. A pre-built container would represent a fully
furnished room with cabinets and appliances (which represent the dependencies) that
includes all the cookware (which represents the libraries) you need to make a meal.

So, if your ML training needs a platform like TensorFlow, PyTorch, scikit-learn, or XGBoost,
and Python code to work with the platform, a pre-built container is probably your best
solution.

To use a pre-built container, you provide a Python training application.
Create a Python source distribution with code that trains an ML model and exports it to
Cloud Storage. This training application can use any of the dependencies included in the
pre-built container that you plan to use it with.
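
A training application of this kind has a simple shape: train, then export. The sketch below shows that shape with a toy model and the standard library only. It assumes the output location is provided through the AIP_MODEL_DIR environment variable, and it writes to a local temporary folder when that variable is not set. In a real job the location would be a Cloud Storage URI, which would need a Cloud Storage client rather than a local file write.

```python
import json
import os
import tempfile
from pathlib import Path

def train(examples):
    """Toy 'model': the average target value of the training examples."""
    return {"mean_target": sum(y for _, y in examples) / len(examples)}

def export_model(model, model_dir):
    """Write the trained model as JSON and return the file path."""
    out_dir = Path(model_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / "model.json"
    out_path.write_text(json.dumps(model))
    return out_path

# Assumption: the training service passes the output location in
# AIP_MODEL_DIR; fall back to a local temporary folder for this sketch.
model_dir = os.environ.get("AIP_MODEL_DIR") or tempfile.mkdtemp()

# Hypothetical (feature, target) training examples.
path = export_model(train([(1, 2.0), (2, 4.0)]), model_dir)
print(f"model written to {path}")
```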

Custom container

A custom container, alternatively, is like an empty room with no cabinets, appliances, or
cookware.

Create a Docker container image with code that trains an ML model and exports it to
Cloud Storage. Include any dependencies required by your code in the container image.

Use this option if you want to use dependencies that are not included in one of the Vertex
AI pre-built containers for training. For example, if you want to train using a Python ML
framework that is not available in a pre-built container, or if you want to train using a
programming language other than Python, then this is the better option.

Pre-built APIs

When using AutoML, you define a domain-specific labeled training dataset that is used to
create the custom ML model you require. However, if you don’t need a domain-specific
dataset, Google’s suite of pre-built ML APIs might meet your needs.

In this section, you’ll explore some of those APIs.

Good machine learning models require lots of high-quality training data. You should aim
for hundreds of thousands of records to train a custom model. If you don't have that kind
of data, pre-built APIs are a great place to start.

Pre-built APIs are offered as services. Often, they can act as building blocks to create
the application you want without the expense or complexity of creating your own
models. They save the time and effort of building, curating, and training a new dataset,
so you can directly deal with predictions.

Examples of Pre-built APIs

● The Speech-to-Text API converts audio to text for data processing.

● The Cloud Natural Language API recognizes parts of speech called entities and
sentiment.

● The Cloud Translation API converts text from one language to another.

● The Text-to-Speech API converts text into high-quality voice audio.

● The Vision API works with and recognizes content in static images.

● And the Video Intelligence API recognizes motion and action in video.

Google has already done a lot of work to train these models by using Google datasets.
For example:

● The Vision API is based on Google’s image datasets.

● The Speech-to-Text API is trained on YouTube captions.

● And the Translation API is built on Google Translate.

Recall that how well a model is trained depends on how much data is available to
train it. As you might expect, Google has many images, a lot of text, and many ML
researchers to train its pre-built models. This means less work for you.

1. Speech-to-Text API

Speech-to-Text enables easy integration of Google speech recognition technologies into
developer applications. Send audio and receive a text transcription from the
Speech-to-Text API service.

Speech requests
Speech-to-Text has three main methods to perform speech recognition. These are listed
below:

Synchronous Recognition (REST and gRPC) sends audio data to the Speech-to-Text
API, performs recognition on that data, and returns results after all audio has been
processed. Synchronous recognition requests are limited to audio data of 1 minute or less
in duration.

Asynchronous Recognition (REST and gRPC) sends audio data to the Speech-to-
Text API and initiates a Long Running Operation. Using this operation, you can
periodically poll for recognition results. Use asynchronous requests for audio data of any
duration up to 480 minutes.

Streaming Recognition (gRPC only) performs recognition on audio data provided
within a gRPC bi-directional stream. Streaming requests are designed for real-time
recognition purposes, such as capturing live audio from a microphone. Streaming
recognition provides interim results while audio is being captured, allowing results to
appear, for example, while a user is still speaking.

Requests contain configuration parameters as well as audio data. The following sections
describe these types of recognition requests, the responses they generate, and how to
handle those responses in more detail.
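
The duration limits above can be turned into a small decision helper. The following is a minimal sketch, assuming the limits stated in these notes (1 minute for synchronous, 480 minutes for asynchronous); the function name choose_recognition_method is hypothetical and is not part of any Google client library.

```python
# Duration limits taken from the notes above.
SYNC_LIMIT_SECONDS = 60           # synchronous: audio of 1 minute or less
ASYNC_LIMIT_SECONDS = 480 * 60    # asynchronous: audio of up to 480 minutes


def choose_recognition_method(duration_seconds=None, live=False):
    """Pick a recognition method name (hypothetical helper, for illustration)."""
    if live:
        # Live audio, such as a microphone feed, suits streaming recognition.
        return "streaming"
    if duration_seconds is None:
        raise ValueError("duration_seconds is required for recorded audio")
    if duration_seconds <= SYNC_LIMIT_SECONDS:
        return "synchronous"
    if duration_seconds <= ASYNC_LIMIT_SECONDS:
        return "asynchronous"
    raise ValueError("audio exceeds the 480-minute asynchronous limit")


if __name__ == "__main__":
    print(choose_recognition_method(45))         # short recorded clip
    print(choose_recognition_method(3600))       # one-hour recording
    print(choose_recognition_method(live=True))  # microphone capture
```

Keeping the limits as named constants makes it easy to adjust them if the service limits change.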

Speech-to-Text API recognition

A Speech-to-Text API synchronous recognition request is the simplest method for
performing recognition on speech audio data. Speech-to-Text can process up to 1 minute
of speech audio data sent in a synchronous request. After Speech-to-Text processes and
recognizes all of the audio, it returns a response.

A synchronous request is blocking, meaning that Speech-to-Text must return a
response before processing the next request. Speech-to-Text typically processes audio
faster than real time, processing 30 seconds of audio in 15 seconds on average. In cases
of poor audio quality, your recognition request can take significantly longer.
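
As an illustration, a synchronous request body can be assembled with the Python standard library alone. This is a sketch under stated assumptions: the JSON field names follow the public REST method speech:recognize as best understood here, the helper names are hypothetical, and the encoding, sample rate, and language code are example values. No network call is made.

```python
import base64
import json


def build_recognize_request(audio_bytes, language_code="en-US",
                            sample_rate_hertz=16000):
    """Build a JSON-serializable body for a synchronous recognition request.

    The encoding, sample rate, and language code are example values.
    """
    return {
        "config": {
            "encoding": "LINEAR16",
            "sampleRateHertz": sample_rate_hertz,
            "languageCode": language_code,
        },
        # Inline audio is sent as base64 text inside the JSON body.
        "audio": {"content": base64.b64encode(audio_bytes).decode("ascii")},
    }


def estimate_processing_seconds(audio_seconds):
    """Rough estimate from the average quoted above (30 s of audio in 15 s)."""
    return audio_seconds * 15 / 30


if __name__ == "__main__":
    body = build_recognize_request(b"\x00\x01")
    print(json.dumps(body, indent=2))
    print(estimate_processing_seconds(30))
```

The estimate is only an average; as noted above, poor audio quality can make a request take significantly longer.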

2. Cloud Natural Language API

The Cloud Natural Language API provides natural language understanding technologies,
such as sentiment analysis, entity recognition, entity sentiment analysis, and other text
annotations, to developers.

Service: language.googleapis.com

To call this service, we recommend that you use the Google-provided client libraries. If
your application needs to use your own libraries to call this service, use the following
information when you make the API requests.

Discovery document

A Discovery Document is a machine-readable specification for describing and consuming
REST APIs. It is used to build client libraries, IDE plugins, and other tools that interact
with Google APIs. One service may provide multiple discovery documents. This service
provides the following discovery documents:

https://language.googleapis.com/$discovery/rest?version=v1

https://language.googleapis.com/$discovery/rest?version=v1beta2

Service endpoint

A service endpoint is a base URL that specifies the network address of an API service.
One service might have multiple service endpoints. This service has the following service
endpoint, and all URIs are relative to it:
https://language.googleapis.com
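
To show how the service endpoint and discovery documents fit together, here is a minimal sketch that builds the URLs and a sentiment-analysis request body without sending anything. The path documents:analyzeSentiment and the body fields follow the public v1 REST reference as best understood here; the helper names are hypothetical.

```python
import json

# Base URL from the notes; every request path is relative to it.
SERVICE_ENDPOINT = "https://language.googleapis.com"


def discovery_url(version="v1"):
    """URL of the discovery document for the given API version."""
    return f"{SERVICE_ENDPOINT}/$discovery/rest?version={version}"


def build_sentiment_call(text, version="v1"):
    """Return (url, json_body) for a sentiment analysis request (not sent)."""
    url = f"{SERVICE_ENDPOINT}/{version}/documents:analyzeSentiment"
    body = {
        "document": {"type": "PLAIN_TEXT", "content": text},
        "encodingType": "UTF8",
    }
    return url, json.dumps(body)


if __name__ == "__main__":
    print(discovery_url("v1beta2"))
    url, body = build_sentiment_call("The lecture was excellent.")
    print(url)
    print(body)
```

In practice the Google-provided client libraries recommended above handle URL construction and authentication for you.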

3. Text-to-Speech API
Text-to-Speech converts text or Speech Synthesis Markup Language (SSML) input into
audio data of natural human speech, in formats such as MP3 or LINEAR16 (the encoding
used in WAV files).

Text-to-Speech allows developers to create natural-sounding, synthetic human speech as
playable audio. You can use the audio data files you create using Text-to-Speech to power
your applications or augment media like videos or audio recordings.

Text-to-Speech is ideal for any application that plays audio of human speech to users. It
allows you to convert arbitrary strings, words, and sentences into the sound of a person
speaking the same things.

Imagine that you have a voice assistant app that provides natural language feedback to
your users as playable audio files. Your app might take an action and then provide human
speech as feedback to the user.

Speech synthesis

The process of translating text input into audio data is called synthesis and the output of
synthesis is called synthetic speech.

Text-to-Speech takes two types of input:

● Raw text
● SSML-formatted data (discussed below).

To create a new audio file, you call the synthesize endpoint of the API.

The speech synthesis process generates raw audio data as a base64-encoded string. You
must decode the base64-encoded string into an audio file before an application can play
it. Most platforms and operating systems have tools for decoding base64 text into
playable media files.
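
The decoding step can be sketched in a few lines of Python. This example assumes the synthesis response is JSON with the audio in a field named audioContent (the field name used by the public REST API as best understood here); the helper names and the output path are hypothetical.

```python
import base64
import json


def decode_audio_content(response_json):
    """Extract and decode the base64 audio from a synthesis response."""
    audio_b64 = json.loads(response_json)["audioContent"]
    return base64.b64decode(audio_b64)


def save_audio(response_json, path):
    """Write the decoded audio bytes to a file and return the byte count."""
    audio = decode_audio_content(response_json)
    with open(path, "wb") as f:
        f.write(audio)
    return len(audio)


if __name__ == "__main__":
    # Stand-in response: "aGVsbG8=" is base64 for b"hello", not real audio.
    fake_response = '{"audioContent": "aGVsbG8="}'
    print(decode_audio_content(fake_response))
```

A real response would contain MP3 or LINEAR16 bytes, which could then be written with save_audio and played back.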

Voices
Text-to-Speech creates raw audio data of natural, human speech. That is, it creates audio
that sounds like a person talking. When you send a synthesis request to Text-to-Speech,
you must specify a voice that 'speaks' the words.

Text-to-Speech has a wide selection of custom voices available for you to use. The
voices differ by language, gender, and accent (for some languages). For example, you
can create audio that mimics the sound of a female English speaker with a British accent.
You can also convert the same text into a different voice, say a male English speaker with
an Australian accent.
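
The two voices in the example above could be requested roughly as follows. This is a sketch, assuming the request fields input, voice, and audioConfig from the public REST method text:synthesize; the helper name is hypothetical, and the language codes en-GB and en-AU are example values.

```python
def build_synthesis_request(text, language_code, ssml_gender,
                            audio_encoding="MP3"):
    """Build a synthesis request body that selects a voice by language and gender."""
    return {
        "input": {"text": text},
        "voice": {"languageCode": language_code, "ssmlGender": ssml_gender},
        "audioConfig": {"audioEncoding": audio_encoding},
    }


# Same text, two different voices.
british_female = build_synthesis_request("Hello, world", "en-GB", "FEMALE")
australian_male = build_synthesis_request("Hello, world", "en-AU", "MALE")

if __name__ == "__main__":
    print(british_female["voice"])
    print(australian_male["voice"])
```

Only the voice block differs between the two requests, which is what allows the same text to be rendered in different voices.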

4. Cloud Translation API

Cloud Translation enables your websites and applications to translate text dynamically
through an API. Translation uses a Google pre-trained or a custom machine learning
model to translate text. By default, Translation uses a Google pre-trained Neural
Machine Translation (NMT) model, which Google updates on a semi-regular cadence
when more training data or better techniques become available.

Benefits

Cloud Translation can translate text for more than 100 language pairs. If you don't know
the language of your source text, Cloud Translation can detect it for you.

Cloud Translation scales seamlessly and allows unlimited character translations per day.
However, there are restrictions on content size for each request and request rates.
Additionally, you can use quota limits to manage your budget.

Cloud Translation editions

Cloud Translation offers two editions:

Cloud Translation - Basic (v2)

Cloud Translation - Advanced (v3)


Both editions support language detection and translation. They also use the same Google
pre-trained model to translate content. Cloud Translation - Advanced includes features
such as batch requests, AutoML custom models, and glossaries.

Cloud Translation - Basic (v2)

Cloud Translation - Basic enables you to dynamically translate between languages by
using Google's pre-trained Neural Machine Translation (NMT) model. Basic is optimized
for simplicity and scale. It's a good fit for applications that handle primarily casual
user-generated content, such as chat, social media, or comments.
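
A Basic (v2) translation request might be assembled as in the sketch below. It assumes the v2 REST URL and the body fields q, target, source, and format as best understood from the public reference; the helper name is hypothetical and nothing is sent over the network.

```python
import json

# Assumed v2 (Basic) REST URL.
TRANSLATE_V2_URL = "https://translation.googleapis.com/language/translate/v2"


def build_translate_request(texts, target, source=None):
    """Build a Basic (v2) request body for one or more strings."""
    body = {"q": list(texts), "target": target, "format": "text"}
    if source is not None:
        # When the source is omitted, the service can detect the language.
        body["source"] = source
    return body


if __name__ == "__main__":
    body = build_translate_request(["Good morning", "Thank you"], target="es")
    print(TRANSLATE_V2_URL)
    print(json.dumps(body))
```

Leaving out source relies on the language detection mentioned under Benefits.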

Cloud Translation - Advanced (v3)

Cloud Translation - Advanced has all of the capabilities of Cloud Translation - Basic and
more. Cloud Translation - Advanced edition is optimized for customization and long-form
content use cases. Customization features include glossaries and custom model selection.

Pricing

Translation charges you on a monthly basis based on the number of characters that
you send.

5. Vision API

Cloud Vision allows developers to easily integrate vision detection features within
applications, including image labeling, face and landmark detection, optical character
recognition (OCR), and tagging of explicit content.

Vision API currently allows you to use the following features:

● Text detection
Optical character recognition (OCR) for an image; text recognition and conversion
to machine-coded text. Identifies and extracts UTF-8 text in an image.

● Document text detection (dense text / handwriting)
Optical character recognition (OCR) for a file (PDF/TIFF) or dense text image;
dense text recognition and conversion to machine-coded text.

● Landmark detection
Landmark Detection detects popular natural and human-made structures within
an image.

● Logo detection
Logo Detection detects popular product logos within an image.

● Label detection
Provides generalized labels for an image.
For each label, returns a textual description, confidence score, and topicality rating.

● Image properties
Returns dominant colors in an image.
Each color is represented in the RGBA color space, has a confidence score, and
displays the fraction of pixels occupied by the color [0, 1].

● Object localization
The Vision API can detect and extract multiple objects in an image with Object
Localization.

● Crop hint detection
Provides a bounding polygon for the cropped image, a confidence score, and an
importance fraction of this salient region with respect to the original image for
each request.
You can provide up to 16 image ratio values (width:height) for a single image.

● Web entities and pages
Provides a series of Web content related to an image.

● Face detection
Locates faces with bounding polygons and identifies specific facial "landmarks"
such as eyes, ears, nose, mouth, etc. along with their corresponding confidence
values.
Returns likelihood ratings for emotion (joy, sorrow, anger, surprise) and general
image properties (underexposed, blurred, headwear present).
Likelihood ratings are expressed as 6 different values: UNKNOWN,
VERY_UNLIKELY, UNLIKELY, POSSIBLE, LIKELY, or VERY_LIKELY.
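
The features above are selected per request, and the likelihood values form an ordered scale. The sketch below shows one way to build an annotation request body and compare likelihoods. It assumes the request shape of the public REST method images:annotate as best understood here; the helper names and the gs:// path are hypothetical.

```python
# The six likelihood values listed above, from least to most likely.
LIKELIHOOD_ORDER = [
    "UNKNOWN", "VERY_UNLIKELY", "UNLIKELY", "POSSIBLE", "LIKELY", "VERY_LIKELY",
]


def build_annotate_request(image_uri, feature_types, max_results=5):
    """Build a request body asking for several features on one image."""
    return {
        "requests": [{
            "image": {"source": {"imageUri": image_uri}},
            "features": [
                {"type": t, "maxResults": max_results} for t in feature_types
            ],
        }]
    }


def at_least(likelihood, threshold):
    """True if a likelihood rating is at or above the given threshold."""
    return LIKELIHOOD_ORDER.index(likelihood) >= LIKELIHOOD_ORDER.index(threshold)


if __name__ == "__main__":
    # The bucket and file name are placeholders.
    req = build_annotate_request(
        "gs://example-bucket/photo.jpg", ["LABEL_DETECTION", "FACE_DETECTION"]
    )
    print(req)
    print(at_least("LIKELY", "POSSIBLE"))
```

A helper like at_least is useful when, for example, only faces rated LIKELY or higher for joy should be flagged.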

6. Video Intelligence API

The Video Intelligence API allows developers to use Google video analysis technology as
part of their applications. The REST API enables users to annotate videos stored locally
or in Cloud Storage, or live-streamed, with contextual information at the level of the entire
video, per segment, per shot, and per frame.
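
As a rough illustration, an annotation request for a video stored in Cloud Storage could be built as follows. This sketch assumes the body fields inputUri and features from the public REST method videos:annotate as best understood here; the helper name and the gs:// path are hypothetical.

```python
def build_video_annotate_request(input_uri, features):
    """Build a request body to annotate a video stored in Cloud Storage."""
    if not features:
        raise ValueError("at least one feature is required")
    return {"inputUri": input_uri, "features": list(features)}


if __name__ == "__main__":
    # The bucket and file name are placeholders.
    req = build_video_annotate_request(
        "gs://example-bucket/lecture.mp4",
        ["LABEL_DETECTION", "SHOT_CHANGE_DETECTION"],
    )
    print(req)
```

Video annotation generally takes longer than image annotation, so results are typically retrieved after the request has been submitted rather than in the same call.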
10. ASSIGNMENT

1. Create custom mode VPC networks with firewall rules. Create VM instances using
Compute Engine. Explore the connectivity for VM instances across VPC networks.
Create a VM instance with multiple network interfaces.

2. Create an nginx web server. Create tagged firewall rules. Create a service account
with IAM roles. Explore permissions for the Network Admin and Security Admin roles.

3. Configure an HTTP Load Balancer with global backends, as shown in the diagram
below. Then stress test the Load Balancer and denylist the stress test IP with
Cloud Armor.
Perform the following tasks: Create HTTP and health check firewall rules. Configure
two instance templates. Create two managed instance groups. Configure an HTTP
Load Balancer with IPv4 and IPv6. Stress test an HTTP Load Balancer. Denylist an IP
address to restrict access to an HTTP Load Balancer.
11. PART A QUESTIONS AND ANSWERS

1. What is Big Data?


Big Data is a collection of datasets that is large and complex, making it very difficult
to process using legacy data processing applications. Tools for big data can help with
the volume of the data collected, the speed at which that data becomes available to
an organization for analysis, and the complexity or varieties of that data.

2. What are the characteristics of Big Data?


The attributes that define big data are volume, variety, velocity, and variability. These
big data attributes are commonly referred to as the four V's.
Volume: The key characteristic of big data is its scale—the volume of data that is
available for collection by your enterprise from a variety of devices and sources.
Variety: Variety refers to the formats that data comes in, such as email messages,
audio files, videos, sensor data, and more. Classifications of big data variety
include structured, semi-structured, and unstructured data.
Velocity: Big data velocity refers to the speed at which large datasets are
acquired, processed, and accessed.
Variability: Big data variability means the meaning of the data constantly
changes. Therefore, before big data can be analyzed, the context and meaning of
the datasets must be properly understood.

3. What is Cloud Dataproc?


Cloud Dataproc is a managed Apache Hadoop and Spark service for batch processing,
querying, streaming and machine learning. Users can quickly spin up Hadoop or Spark
clusters and resize them at any time without compromising data pipelines through
automation and orchestration. It can be fully integrated with other Google big data
services, such as BigQuery and Bigtable, as well as Stackdriver Logging and Monitoring.

4. What is Cloud Dataflow?


Cloud Dataflow is a serverless stream and batch processing service. Users can build a
pipeline to manage and analyze data in the cloud, while Cloud Dataflow automatically
manages the resources. It was built to integrate with other Google services, including
BigQuery and Cloud Machine Learning, as well as third-party products, such as Apache
Spark and Apache Beam.

5. What is Google BigQuery?


BigQuery is a data warehouse that processes and analyzes large data sets using SQL
queries. It can capture and examine streaming data for real-time
analytics. It stores data with Google's Capacitor columnar data format, and users can
load data via streaming or batch loads. To load, export, query and copy data, use the
classic web UI, the web UI in the GCP Console, the bq command-line tool or client
libraries. Since BigQuery is a serverless offering, enterprises only pay for the storage
and compute they consume.

6. What is Apache Hadoop?


Apache Hadoop is a set of tools and technologies which enables a cluster of
computers to store and process large volumes of data. The Apache Hadoop software
library is a framework that allows for the distributed processing of large data sets
across clusters of computers using simple programming models.

7. List the features of Cloud Dataproc.


● Cost effective
● Fast and scalable
● Open-source ecosystem
● Fully managed
● Image versioning
● Built-in integration
8. How will you create a Dataproc cluster with an optional component?
To create a Dataproc cluster and install one or more optional components on the
cluster, use the gcloud beta dataproc clusters create cluster-name command with the
--optional-components flag.
gcloud dataproc clusters create cluster-name \
    --optional-components=COMPONENT-NAME(s) \
    ... other flags
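
When the command is generated from a script, building it as an argument list avoids quoting problems. The following is a minimal sketch: the helper name is hypothetical, the cluster name and the component names JUPYTER and ZEPPELIN are example values, and joining several components with commas is an assumption of this sketch. The command is only constructed here, not executed.

```python
import shlex


def build_create_cluster_command(cluster_name, components, extra_flags=()):
    """Build the gcloud argument list for a cluster with optional components."""
    if not components:
        raise ValueError("at least one optional component is required")
    return [
        "gcloud", "dataproc", "clusters", "create", cluster_name,
        # Assumption: multiple components are passed as a comma-separated list.
        "--optional-components=" + ",".join(components),
        *extra_flags,
    ]


if __name__ == "__main__":
    cmd = build_create_cluster_command(
        "my-cluster", ["JUPYTER", "ZEPPELIN"], ["--region=us-central1"]
    )
    # Print a shell-safe rendering; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```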
9. List the use cases of Dataproc.
● Log processing
● Ad-hoc data analysis
● Machine learning
10. What is a Dataproc workflow template?
The Dataproc WorkflowTemplates API provides a flexible and easy-to-use mechanism
for managing and executing workflows. A Workflow Template is a reusable workflow
configuration. It defines a graph of jobs with information on where to run those jobs.

11. List the types of workflow templates.


● Managed cluster
● Cluster selector
● Parameterized
● Inline
12. What is a pipeline?
A pipeline encapsulates the entire series of computations involved in reading input
data, transforming that data, and writing output data. The input source and output
sink can be the same or of different types, allowing you to convert data from one
format to another. Apache Beam programs start by constructing a Pipeline object, and
then using that object as the basis for creating the pipeline's datasets. Each pipeline
represents a single, repeatable job.
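
The read, transform, and write stages can be illustrated with plain Python generators. This is a conceptual sketch only: it does not use the Apache Beam API, and the function names and the sample CSV columns (name, score) are hypothetical. It converts CSV input into JSON Lines output, showing how a source and sink can be of different formats.

```python
import csv
import io
import json


def read_csv(text):
    """Source stage: yield one dict per CSV row."""
    yield from csv.DictReader(io.StringIO(text))


def transform(rows):
    """Transform stage: clean the name and convert the score to an integer."""
    for row in rows:
        yield {"name": row["name"].strip().title(), "score": int(row["score"])}


def write_json_lines(rows):
    """Sink stage: serialize each record as one JSON line."""
    return "\n".join(json.dumps(r) for r in rows)


def run_pipeline(csv_text):
    """Chain the three stages into a single repeatable job."""
    return write_json_lines(transform(read_csv(csv_text)))


if __name__ == "__main__":
    print(run_pipeline("name,score\nada,90\ngrace,85\n"))
```

In Apache Beam the same shape appears as a Pipeline object with read, transform, and write steps applied to its datasets, with the runner handling distribution.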

13. What is BigQuery storage?


● BigQuery stores data using a columnar storage format that is optimized for
analytical queries.
● BigQuery presents data in tables, rows, and columns and provides full support for
database transaction semantics (ACID).
● BigQuery storage is automatically replicated across multiple locations to provide
high availability.

14. What are the ways to ingest data into BigQuery?


There are several ways to ingest data into BigQuery:
● Batch load a set of data records.
● Stream individual records or batches of records.
● Use queries to generate new data and append or overwrite the results to a table.
● Use a third-party application or service.
15. Define query. What are the types of data sources?
The primary unit of analysis in BigQuery is the SQL query. BigQuery has two SQL
dialects: Google Standard SQL and legacy SQL. Google Standard SQL is the preferred
dialect.

● Data sources:

BigQuery lets you query the following types of data sources:

Data stored in BigQuery - load data into BigQuery for analysis.

External data - query various external data sources, such as other Google Cloud storage
services (like Cloud Storage) or database services (like Cloud Spanner or Cloud SQL).

Multi-cloud data - query data that's stored in other public clouds, such as AWS or Azure.

Public datasets - analyze any of the datasets that are available in the public dataset
marketplace.

16. Define Machine Learning


Machine learning (ML) is a subfield of artificial intelligence (AI). The goal of ML is to make
computers learn from the data that you give them. Instead of writing code that describes
the action the computer should take, your code provides an algorithm that adapts based
on examples of intended behavior. The resulting program, consisting of the algorithm
and associated learned parameters, is called a trained model.
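
A tiny example makes the idea of a trained model concrete. The sketch below fits a straight line to example pairs using ordinary least squares, chosen here only for illustration; the function names are hypothetical. The algorithm is the fitting procedure, and the learned parameters are the slope w and intercept b.

```python
def train_linear_model(examples):
    """Fit y = w * x + b to (x, y) examples with ordinary least squares."""
    n = len(examples)
    mean_x = sum(x for x, _ in examples) / n
    mean_y = sum(y for _, y in examples) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in examples)
    variance = sum((x - mean_x) ** 2 for x, _ in examples)
    w = covariance / variance
    b = mean_y - w * mean_x
    # The trained model is the algorithm plus these learned parameters.
    return {"w": w, "b": b}


def predict(model, x):
    """Apply the trained model to a new input."""
    return model["w"] * x + model["b"]


if __name__ == "__main__":
    # Examples of intended behavior; no rule is written into the code.
    examples = [(0, 1), (1, 3), (2, 5), (3, 7)]
    model = train_linear_model(examples)
    print(model)
    print(predict(model, 10))
```

Nothing in the code states the relationship between x and y; it is recovered from the examples, which is the sense in which the computer learns from data.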

17. List out the ML options available in Google Cloud


● BigQuery ML
● AutoML
● Custom training
● Pre-built APIs
18. Describe the AutoML option in Google Cloud.
AutoML is a no-code solution to build ML models through Vertex AI. It allows users to
train high-quality custom machine learning models with minimal effort or machine
learning expertise.

19. In what ways can customers access BigQuery ML functionality?
● The Google Cloud console
● The bq command-line tool
● The BigQuery REST API
● An external tool such as a Jupyter notebook or business intelligence platform
20. Define Vertex AI.
Vertex AI is a platform to build, deploy, and scale machine learning (ML) models faster,
with fully managed ML tools for any use case. Vertex AI allows users to build machine
learning models with either AutoML, a no-code solution, or Custom Training, a
code-based solution. AutoML is easy to use and lets data scientists spend more time
turning business problems into ML solutions, while custom training gives data scientists
full control over the development environment and process.
21. List out and describe the stages in VERTEX AI.
● Data readiness:

During the data readiness stage, users can upload data from wherever it’s stored:
Cloud Storage, BigQuery, or a local machine.

● Feature readiness:

Then, during the feature readiness stage, users can create features, which are the
processed data that will be put into the model, and then share them with others
by using the feature store.

● Training and hyperparameter tuning:

After that, it’s time for training and hyperparameter tuning. This means that when
the data is ready, users can experiment with different models and adjust
hyperparameters.

● Deployment and model monitoring:

And finally, during deployment and model monitoring, users can set up the pipeline
to transform the model into production by automatically monitoring and
performing continuous improvements.

22. What are the benefits of Vertex AI?

● It’s seamless. Vertex AI provides a smooth user experience from uploading and
preparing data all the way to model training and production.
● It’s scalable. The machine learning operations (MLOps) provided by Vertex AI help
to monitor and manage ML production and therefore scale the storage and
computing power automatically.
● It’s sustainable. All the artifacts and features created with Vertex AI can be reused
and shared.
● And it’s speedy. Vertex AI produces models that have 80% fewer lines of code
than competitors.
23. Which model architectures does AutoML support?
● Linear
● Feedforward deep neural network
● Gradient Boosted Decision Tree
● AdaNet
● Ensembles of various model architectures
24. Benefits of AutoML
● Using these technologies has produced a tool that can significantly benefit data
scientists.
● One of the biggest benefits is that it’s a no-code solution. That means it can train
high-quality custom machine learning models with minimal effort and requires little
machine learning expertise.
● This allows data scientists to focus their time on tasks like defining business
problems or evaluating and improving model results.
● Others might find AutoML useful to quickly prototype models and explore new
datasets before investing in development. This might involve using it to identify
the best features in a dataset.

25. Examples of Pre-built API:


● The Speech-to-Text API converts audio to text for data processing.
● The Cloud Natural Language API recognizes parts of speech called entities and
sentiment.

● The Cloud Translation API converts text from one language to another.
● The Text-to-Speech API converts text into high-quality voice audio.
● The Vision API works with and recognizes content in static images.
● And the Video Intelligence API recognizes motion and action in video.
12. PART B QUESTIONS

1. How will you leverage Big Data operations with Cloud Dataproc?
2. How will you build extract, transform, and load pipelines using Cloud Dataflow?
3. Explain in detail about BigQuery.
4. Explain in detail about Machine Learning tools in Google Cloud.
5. Demonstrate the concept of Vertex AI in detail.
6. List out the pre-built APIs and explain them in detail.
7. Explain the concept of AutoML in detail.
13. ONLINE CERTIFICATIONS

1. Cloud Digital Leader

Cloud Digital Leader | Google Cloud

2. Associate Cloud Engineer:

Associate Cloud Engineer Certification | Google Cloud

3. Google Cloud Computing Foundations Course

https://onlinecourses.nptel.ac.in/noc20_cs55/preview

4. Google Cloud Computing Foundations

https://learndigital.withgoogle.com/digitalgarage/course/gcloud-computing-foundations
14. REAL TIME APPLICATIONS

Twitter: Helping customers find meaningful Spaces with AutoML


Since Twitter introduced Spaces in 2020 to enable live audio conversations on its
platform, the Twitter Spaces Engineering team has been continually testing, building, and
updating this feature in the open. Today, anyone can join, listen, and speak in a Space
on Twitter, and the feature’s popularity has taken off. But this success also poses a
challenge: with millions of people creating and joining Spaces at any time, how can they
find the Spaces to engage with while they’re happening? Taking this as an opportunity to
further improve the experience of its customers, Twitter has turned to machine learning
(ML) and cloud technology for answers.

“ML fits into the natural progression of Twitter consumer and revenue product building,
especially for a product feature such as Spaces,” explains Diem Nguyen, Senior Machine
Learning Engineer and Data Scientist at Twitter. “We launched Spaces with a base-line
algorithm using the ‘most popular’ heuristic which assumes that if a Space is popular,
there’s a good chance you'd like it too. But our aim is to leverage ML to surface the most
interesting and relevant Spaces to a particular Twitter customer, making it easier for them
to find and join the conversations they personally care about. This is a complex
functionality that Google Cloud ML capabilities help us to enable.”

While looking for the right tools to power this vision, Nguyen and her team started
evaluating in December 2021 whether the Vertex AI platform and AutoML in particular
could solve challenges observed when they first started building Spaces. These included
a lack of dedicated ML resources to build and deploy the product feature, and the need
to work on a multi-cloud environment.

“We had three key questions in mind during our assessment,” Nguyen explains. “Can we
realistically deploy the AutoML model off-platform? Once deployed, can it solve for the
request load that we get from the service we’re serving (in this case, the Spaces tab)?
And finally, can we develop and maintain such a solution without a dedicated team of ML
experts for this project?” The answer to all three questions was yes.

Positive answers motivated the Spaces Engineering team to take the solution to
production in February 2022. “We started using AutoML Tables to train high-accuracy
models with minimal ML expertise or effort, alleviating our resource constraint,” says
Nguyen of the results. “Soon AutoML also stood out for its high performance and for
supporting easy deployment beyond the Google Cloud Platform, making it ideal for this
project hosted in a multi-cloud environment.”

With a classification model in place to predict the probability of user engagement in a
particular Space, Twitter now aims to optimize its model with aggregated data around
Twitter features that can help it better understand customer preferences. For example, if
a customer has historically engaged with a particular topic and a new Space matches that
topic, the ML model increases the score of that Space being served to that user on the
Spaces tab.

Because Spaces are live audio conversations, the Spaces tab needs to be ranked to
customers in near real time, so they don’t miss out. With this in mind, Twitter’s model
currently performs 900 queries per second on the Spaces tab and evaluates 50,000
candidates per second. Meanwhile, 99% of these requests are faster than 100
milliseconds, and 90% of requests are faster than 50 milliseconds.

Twitter leveraging AutoML for Spaces recommendations | Google Cloud Blog


16. ASSESSMENT SCHEDULE

• Tentative schedule for the assessment during the 2023-2024 odd semester

S.NO   Name of the Assessment   Start Date   End Date     Portion
1      IAT 1                    09.09.2023   15.09.2023   UNIT 1 & 2
2      IAT 2                    26.10.2023   01.11.2023   UNIT 3 & 4
3      MODEL                    15.11.2023   25.11.2023   ALL 5 UNITS


17. PRESCRIBED TEXTBOOKS AND REFERENCES

REFERENCES:

1. https://cloud.google.com/docs
2. https://www.cloudskillsboost.google/course_templates/153
3. https://nptel.ac.in/courses/106105223
18. MINI PROJECT

As a junior data engineer at Jooli Inc., recently trained in Google Cloud and a number
of data services, you have been asked to demonstrate your newly learned skills. The
team has asked you to complete the following tasks.

● Create a simple Dataproc job
● Create a simple Dataflow job
● Create a simple Dataprep job
● Perform one of the three Google machine learning backed API tasks
Thank you
