Google Cloud DevOps Engineer Cheat Sheet

The document serves as a cheat sheet for the Google Cloud DevOps Engineer Master exam, covering key topics such as resource hierarchy design, infrastructure as code management, CI/CD architecture, and environment management. It details various tools and practices for effective DevOps implementation on Google Cloud, including Cloud Build, Terraform, and IAM strategies. Additionally, it emphasizes security considerations, artifact management, and the importance of testing and approval workflows in CI/CD pipelines.

Uploaded by

Valentin Garcia

SKILLCERTPRO

Google Cloud DevOps Engineer Master Cheat Sheet
Section 1: Bootstrapping a Google Cloud organization for DevOps (~17% of the exam)
1.1 Designing the overall resource hierarchy for an organization. Considerations include:
Projects and Folders:

 Projects: Represent a collection of related Cloud resources, like Compute Engine instances,
Cloud Storage buckets, and Cloud Functions. Each project has a unique identifier and serves
as a unit of isolation for billing, IAM, and quotas.

 Folders: Organize projects into a hierarchical structure for better management and access
control. Folders group related projects based on function, department, or environment (e.g.,
development, staging, production).

Shared Networking:

 VPC (Virtual Private Cloud): A logically isolated network within Google Cloud Platform (GCP)
that provides a private address space for your resources. You can create multiple VPCs for
different environments or workloads.

 Cloud NAT: Enables outbound internet access for resources in private VPC subnets without
exposing them directly to the public internet.

 Cloud Interconnect: Creates a dedicated private connection between your on-premises network and GCP, offering high bandwidth and low latency for hybrid cloud deployments.

Identity and Access Management (IAM):

 Governs who (users, service accounts) can access what resources (projects, buckets, etc.)
and what actions they can perform (read, write, delete).

 Roles: Predefined sets of permissions that grant specific access levels to resources. (e.g.,
"Owner" has full access, "Editor" can modify resources).

 Policies: Organization-wide rules that define IAM permissions at a higher level, inheriting
down to projects and folders.

Service Accounts:

 Special Google accounts used by applications or services to access GCP resources without
requiring a human user.

 Enhance security by avoiding the need for hardcoded credentials in code.

 Permissions: Assigned IAM roles to grant service accounts the necessary access to perform
their tasks.

Designing the Resource Hierarchy:


 Project Granularity: Consider the size and complexity of your organization. Create projects
for specific applications, environments, or teams.

 Folder Structure: Organize projects logically to reflect your organizational structure and
access control needs.

 IAM Strategy: Define clear and granular IAM policies at the organization, folder, and project
levels to minimize access sprawl and improve security.

 Service Account Management: Create and manage service accounts with appropriate IAM
roles for application authentication and access control.
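
The design considerations above can be sketched with Terraform's Google provider. This is a hedged illustration, not a prescribed layout: the organization ID, folder and project names, and billing account below are all placeholders.

```hcl
# Placeholder org ID, names, and billing account -- adjust to your organization.
resource "google_folder" "engineering" {
  display_name = "engineering"
  parent       = "organizations/123456789012"
}

resource "google_folder" "production" {
  display_name = "production"
  parent       = google_folder.engineering.name
}

resource "google_project" "payments_prod" {
  name            = "payments-prod"
  project_id      = "payments-prod-example" # must be globally unique
  folder_id       = google_folder.production.name
  billing_account = "AAAAAA-BBBBBB-CCCCCC"
}
```

IAM policy can then be granted at the folder level (e.g., with `google_folder_iam_member`) so permissions inherit down to every project beneath it.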

1.2 Managing infrastructure as code. Considerations include:


Infrastructure as Code Tooling

 Cloud SDK (gcloud): Google's official command-line toolset for interacting
with GCP services. It allows provisioning and managing resources directly through
commands. (The related Cloud Foundation Toolkit is a separate offering: opinionated
Terraform blueprints that encode Google's best practices.)

 Config Connector: Bridges the gap between GCP resources and IaC tools like Terraform. It
lets you manage GCP resources using familiar IaC syntax within existing workflows.

 Terraform: A popular, open-source IaC tool known for its flexibility and wide adoption across
cloud platforms. Terraform uses HashiCorp Configuration Language (HCL) to define
infrastructure resources.

 Helm: Primarily used for packaging and managing Kubernetes applications. Helm charts
define deployments, configurations, and dependencies for containerized applications on
Kubernetes clusters.

Choosing the Right Tool:

Selecting the most suitable tool depends on your specific needs and existing workflows. Here's a
brief consideration guide:

 Cloud SDK: Ideal for quick deployments or scripting interactions with GCP services.

 Config Connector: Integrates well with existing Terraform workflows for managing GCP
resources.

 Terraform: A versatile option for managing complex infrastructure across multiple cloud
providers.

 Helm: If your application relies on Kubernetes, Helm is the go-to tool for managing its
deployment and configuration.

Making Infrastructure Changes with Best Practices

 Leverage IaC Blueprints: Define infrastructure configurations using IaC tools. These
blueprints act as version-controlled templates, ensuring consistency and repeatability during
deployments.

 Version Control: Store your IaC code in a version control system (VCS) like Git. This allows
tracking changes, collaboration, and rollbacks if necessary.


 Testing and CI/CD Integration: Implement testing procedures to validate your IaC code
before deployments. Integrate your IaC workflows with CI/CD pipelines for automated
infrastructure provisioning and updates.

 Use Google Recommendations: Refer to Google's documented best practices for infrastructure management on GCP (https://cloud.google.com/security/best-practices).
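
The testing and CI/CD integration point above can be sketched as a Cloud Build config that validates and plans Terraform on every push. The image tag and state bucket are placeholders, and gating `apply` behind a separate trigger is one common pattern, not the only one.

```yaml
steps:
  - id: init
    name: hashicorp/terraform:1.7
    args: ["init", "-backend-config=bucket=my-tf-state-bucket"]
  - id: validate
    name: hashicorp/terraform:1.7
    args: ["validate"]
  - id: plan
    name: hashicorp/terraform:1.7
    args: ["plan", "-out=tfplan"]
# "terraform apply tfplan" typically runs from a separate, approval-gated trigger.
```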

Immutable Infrastructure

Immutable infrastructure is a core principle for managing infrastructure as code. It revolves around
treating infrastructure resources as immutable objects. Here's the core idea:

 Provisioning: When changes are needed, a new infrastructure configuration is created, essentially a new version of your infrastructure.

 Deployment: The new configuration is deployed alongside the existing infrastructure.

 Cutover: Once the new infrastructure is verified, traffic is switched over to it, and the old
infrastructure is decommissioned.

Benefits of Immutable Infrastructure:

 Improved Reliability: Rollbacks become easier as you can simply switch back to the previous
version.

 Increased Consistency: Infrastructure configurations are defined and deployed in a controlled manner, minimizing configuration drift.

 Simplified Testing: Testing new infrastructure versions is easier when deploying them
alongside existing ones.

While powerful, immutable infrastructure can have trade-offs:

 Increased Resource Consumption: You might temporarily have both old and new
infrastructure running during cutover.

 More Complex Deployments: Deployment processes might be slightly more intricate compared to modifying existing resources directly.
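
On Compute Engine, the provision/deploy/cutover cycle above is commonly implemented with instance templates and a managed instance group (MIG). A hedged sketch, with placeholder names and the assumption that a new image version has already been baked (e.g., with Packer):

```shell
# 1. Provision: create a new, versioned instance template from the new image.
gcloud compute instance-templates create app-template-v2 \
    --image=app-image-v2 --image-project=my-project

# 2. Deploy/cutover: roll the MIG to the new template; old VMs are
#    replaced rather than modified in place.
gcloud compute instance-groups managed rolling-action start-update app-mig \
    --version=template=app-template-v2 --zone=us-central1-a

# 3. Rollback: start another update pointing back at app-template-v1.
```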

1.3 Designing a CI/CD architecture stack in Google Cloud, hybrid, and multi-cloud
environments. Considerations include:
CI with Cloud Build

 What is it? Cloud Build is a Google-managed CI/CD service. It triggers builds based on code
changes in repositories like Cloud Source Repositories, GitHub, or Bitbucket. Builds can be
customized with various steps like building container images, running tests, and deploying
artifacts.

 Benefits:

o Serverless: No infrastructure management needed.

o Scalable: Handles high volumes of builds effortlessly.


o Docker Integration: Seamless integration with Docker for containerized builds.

o Customizable: Define workflows with scripts and commands for specific needs.

 Security Considerations:

o Service Account Permissions: Grant Cloud Build service accounts least privilege
access to repositories and resources.

o Build Step Security: Secure build steps by using trusted container images and
avoiding hardcoded secrets.
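
A minimal cloudbuild.yaml illustrating these build steps might build an image, run its tests, and push it to Artifact Registry. The repository path and the assumption of a pytest-based test suite baked into the image are hypothetical.

```yaml
steps:
  - id: build
    name: gcr.io/cloud-builders/docker
    args: ["build", "-t", "us-docker.pkg.dev/$PROJECT_ID/app-repo/app:$SHORT_SHA", "."]
  - id: test
    # Images built in earlier steps are available in the local Docker cache.
    name: us-docker.pkg.dev/$PROJECT_ID/app-repo/app:$SHORT_SHA
    entrypoint: pytest
images:
  - us-docker.pkg.dev/$PROJECT_ID/app-repo/app:$SHORT_SHA
```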

CD with Google Cloud Deploy

 What is it? Google Cloud Deploy is a managed service for continuous delivery. It automates
deployments based on build artifacts created by Cloud Build. Cloud Deploy offers features
like:

o Multi-environment deployments (staging, production)

o Rollback capabilities

o Detailed deployment history

 Benefits:

o Automated Deployments: Reduces manual intervention and errors.

o Rollback: Easily revert to previous deployments in case of issues.

o Auditing: Tracks deployment history for troubleshooting and compliance.

 Security Considerations:

o Deployment Permissions: Grant Cloud Deploy service accounts appropriate permissions to deploy to specific environments.

o Environment Variables: Use Secret Manager to store sensitive environment variables accessed during deployments.
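
A Cloud Deploy delivery pipeline and its targets are declared in YAML; a hedged sketch with placeholder project, cluster, and target names:

```yaml
apiVersion: deploy.cloud.google.com/v1
kind: DeliveryPipeline
metadata:
  name: app-pipeline
serialPipeline:
  stages:
    - targetId: staging
    - targetId: production
---
apiVersion: deploy.cloud.google.com/v1
kind: Target
metadata:
  name: staging
gke:
  cluster: projects/my-project/locations/us-central1/clusters/staging-cluster
```

The pipeline is registered with `gcloud deploy apply --file=clouddeploy.yaml --region=us-central1`, after which releases promote through the stages in order.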

Widely Used Third-Party Tooling

While Google Cloud offers excellent CI/CD tools, some scenarios might require additional options:

 Jenkins: A popular open-source CI/CD server offering extensive plugin support for various
tasks and integrations.

 Git: The de-facto standard version control system (VCS) for code management. Integrates
seamlessly with CI/CD pipelines for triggering builds on code changes.

 ArgoCD: An open-source GitOps tool that continuously applies infrastructure and application
configurations declared as Git repositories. Useful for managing deployments in multi-cluster
environments.

 Packer: An open-source tool for creating identical machine images for various cloud
providers. Helpful for building consistent base images for deployments across hybrid and
multi-cloud environments.


Security Considerations for Third-Party Tools:

 Regular Updates: Keep third-party tools updated with the latest security patches.

 Access Control: Implement access controls to restrict who can manage configurations and
deployments.

 Vulnerability Scanning: Regularly scan infrastructure and application configurations for vulnerabilities.

1.4 Managing multiple environments (e.g., staging, production). Considerations include:


Determining the Number of Environments and Their Purpose

This is crucial for establishing a smooth development workflow with clear separation of concerns.
Here's a breakdown:

 Typical Environments:

o Development (Dev): Used by developers for building and testing new features. Code
changes are frequently deployed here.

o Staging (QA): Simulates a production environment for rigorous testing before pushing updates live.

o Production (Prod): The live environment where your application serves real users.

 Additional Environments (Optional):

o Integration (Int): Dedicated environment for testing how different microservices or components interact.

o Performance (Perf): Environment specifically designed for load and performance testing.

o Canary: A small subset of production users receive updates here first, allowing for
early detection of issues.

 Factors to Consider:

o Project Complexity: For larger projects, additional environments for specific purposes might be beneficial.

o Deployment Frequency: High-frequency deployments might necessitate more staging environments for proper testing.

o Team Size and Workflow: Tailor the environments to support efficient collaboration
within your development team.

Creating Environments Dynamically with GKE and Terraform

Here, automation plays a key role in streamlining environment creation:

 Infrastructure as Code (IaC): Tools like Terraform allow you to define infrastructure
configurations (networks, VMs, GKE clusters) as code. This code can be version controlled
and used to provision environments automatically.


 Google Kubernetes Engine (GKE): A managed Kubernetes service for deploying and
managing containerized applications. Terraform can be used to create GKE clusters for each
environment.

 Benefits:

o Consistency: Ensures all environments are built identically, minimizing configuration drift.

o Repeatability: Quickly rebuild or tear down environments for testing or rollbacks.

o Scalability: Easily manage multiple environments across a large project.
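
A hedged Terraform sketch of the per-environment GKE pattern (the region, naming scheme, and single-variable parameterization are assumptions):

```hcl
variable "environment" {
  type = string # "dev", "staging", or "prod"
}

resource "google_container_cluster" "env" {
  name     = "app-${var.environment}"
  location = "us-central1"

  # Real setups usually manage node pools as separate resources.
  remove_default_node_pool = true
  initial_node_count       = 1
}
```

Running `terraform apply -var environment=staging` (for example, from a Cloud Build trigger) then stamps out an identically configured cluster for each environment.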

Config Management

This refers to the practice of managing the configuration of your application and infrastructure in a
centralized and automated way. Here's how it applies:

 Tools: Solutions like Anthos Config Management or Cloud Build can be used to define and
deploy configurations to different environments. These configurations can include
application settings, environment variables, secrets, and more.

 Version Control: Configuration files are version controlled alongside your application code,
ensuring consistency across environments.

 Benefits:

o Reduced Errors: Minimizes manual configuration changes and potential errors.

o Simplified Rollbacks: Easily rollback configuration changes across environments.

o Auditing: Provides a centralized location to track and audit configuration changes.

Section 2: Building and implementing CI/CD pipelines for a service (~23% of the exam)
2.1 Designing and managing CI/CD pipelines. Considerations include:
Artifact Management with Artifact Registry:

 What it is: Artifact Registry is a managed service on Google Cloud Platform (GCP) for storing
and managing container images, build artifacts, and other files used in your software
development process.

 Benefits in CI/CD:

o Centralized repository: Store all your build artifacts (e.g., Docker images, compiled
code) in one place, simplifying access and management for your CI/CD pipeline.

o Version control: Easily track different versions of artifacts, allowing rollbacks or deployments of specific versions if needed.

o Security: Control access to artifacts with granular permissions, ensuring only authorized users can access and deploy them.


o Integration with CI/CD tools: Artifact Registry integrates seamlessly with Cloud
Build, allowing automatic pushing and pulling of artifacts during the build and
deployment process.

Deployment to Hybrid and Multi-Cloud Environments (e.g., Anthos, GKE):

 The Scenario: Your application might run across different environments, like on-premises
infrastructure, public clouds (GCP, AWS, Azure), or a combination (hybrid/multi-cloud).

 CI/CD Considerations: Your CI/CD pipeline needs to be flexible enough to handle deployments across these environments. Here's how GCP helps:

o Cloud Build triggers: Configure triggers in Cloud Build that initiate the pipeline based
on events in different environments (e.g., Git push on a specific branch for on-prem
deployments, Pub/Sub message for cloud deployments).

o Containerization: Package your application as container images, making them portable and easily deployable across any environment that supports containers (like Anthos for on-prem or GKE for GCP).

o Service Meshes (like Istio): Use service meshes to manage service discovery, routing,
and communication between your application components across different cloud
environments.

CI/CD Pipeline Triggers:

 What they are: Triggers are events that initiate your CI/CD pipeline. They automate the
pipeline execution, reducing manual intervention and ensuring timely deployments.

 Common Triggers in GCP:

o Git events: Triggers based on Git actions in your source code repository (e.g., push to
a specific branch, merge to main branch).

o Cloud Storage events: Triggers based on changes in Cloud Storage buckets (e.g., new
file upload signifying a new build artifact).

o Pub/Sub messages: Triggers based on messages published to a Pub/Sub topic (e.g., a message from a testing framework indicating successful tests).

o Schedule triggers: Run the pipeline periodically at a specific time or interval (e.g.,
nightly builds for continuous integration).
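
For example, a Git-event trigger can be created from the command line; a sketch with placeholder repository details:

```shell
# Run cloudbuild.yaml on every push to main in a connected GitHub repository.
gcloud builds triggers create github \
    --repo-name=my-app --repo-owner=my-org \
    --branch-pattern='^main$' \
    --build-config=cloudbuild.yaml
```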

Testing a new application version in the pipeline:

 Integration Tests: In your CI pipeline, you can leverage Cloud Build to trigger automated
integration tests whenever there's a code commit. These tests ensure different parts of your
application work together seamlessly. Tools like JUnit or Google Test can be used within
Cloud Build for unit and integration testing.

 Deployment Testing: Before promoting a new version to production, consider using Cloud
Run for deployment testing. Cloud Run allows deploying containerized applications on a fully
managed serverless platform. You can deploy your new version to a staging environment on
Cloud Run and perform functional and performance tests there.


 Load Testing: Utilize load-testing tools (such as JMeter or Locust, which can be run at
scale on GKE) to simulate real-world traffic and assess your application's performance
under stress. This helps identify potential bottlenecks before pushing the new version live.

Configuring deployment processes (e.g., approval flows):

 Environments: Set up different environments (development, staging, production) on GCP. Cloud Build triggers can be configured to deploy code based on the branch being pushed to (e.g., deploy to staging on pushes to the staging branch).

 Approval Workflows: Integrate Cloud Build with Cloud Functions to create custom approval
workflows. A Cloud Function can be triggered upon a successful build, sending notifications
or requiring manual approval before deploying to a specific environment.

 Deployment Strategies: Utilize deployment strategies like blue/green or canary
deployments. Blue/green involves switching all traffic to a completely new version (green)
after successful deployment, while canary deployments introduce the new version to a small
subset of users first. Google Cloud Deploy can be used to automate these strategies.
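
Cloud Build also supports gating builds behind manual approval directly on a trigger, without a custom Cloud Function; a hedged sketch with placeholder repository details:

```shell
# Builds from this trigger wait in a pending state until a user with the
# Cloud Build Approver role approves or rejects them.
gcloud builds triggers create github \
    --repo-name=my-app --repo-owner=my-org \
    --branch-pattern='^release/.*$' \
    --build-config=cloudbuild.yaml \
    --require-approval
```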

CI/CD of serverless applications:

 Cloud Build Triggers: Configure Cloud Build triggers to automatically build and deploy your
serverless code upon changes. For example, triggers can be set up for code pushes to your
Git repository hosted on Cloud Source Repositories.

 Cloud Functions: Cloud Functions are a perfect fit for serverless deployments. Cloud Build
can directly deploy your function code upon a successful build.

 Cloud Run vs. Cloud Functions: While both are serverless execution environments, Cloud
Run offers a container-based approach, allowing for more complex deployments requiring
frameworks or specific runtimes. Choose the environment that best suits your application's
needs.

 Testing Considerations: Testing serverless applications within the CI/CD pipeline requires
some adjustments. Mockito or Sinon.JS can be used for mocking dependencies during unit
testing. Integration testing can be done by deploying the entire serverless application to a
staging environment and testing its functionality there.
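
A hedged sketch of a Cloud Build config that deploys a Cloud Function on each push (the function name, runtime, and region are assumptions):

```yaml
steps:
  - id: deploy-function
    name: gcr.io/google.com/cloudsdktool/cloud-sdk
    entrypoint: gcloud
    args:
      - functions
      - deploy
      - my-function
      - --runtime=python312
      - --trigger-http
      - --region=us-central1
      - --source=.
```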

2.2 Implementing CI/CD pipelines. Considerations include:


Auditing and Tracking Deployments

 Visibility and Traceability:

o Cloud Build Logs: These logs provide detailed information about each build step's
execution, success/failure status, and any errors encountered. This allows you to
diagnose issues and track changes made during the build process.

o Artifact Registry: When using Artifact Registry to store build artifacts (container
images, packages, etc.), it automatically logs versioning information and timestamps.
This helps you identify which specific artifacts were deployed in a particular version.


o Cloud Deploy Logs: Cloud Deploy logs track the deployment process, including the
deployed version, target environment, and any encountered issues. This provides a
clear audit trail for deployments.

o Cloud Audit Logs: GCP offers centralized Cloud Audit Logging, which captures a
comprehensive record of all API calls made to GCP services. This includes calls
triggered during your CI/CD pipeline, allowing you to track who made changes, what
was changed, and when.

 Best Practices:

o Implement a consistent logging strategy across all pipeline stages (Cloud Build, Cloud
Deploy, etc.) to ensure comprehensive auditing.

o Leverage Cloud IAM and roles to control access to different pipeline stages and
resources, ensuring only authorized users can make deployments.

o Define clear naming conventions for artifacts and versions for easy identification in
logs and registries.

Deployment Strategies

 Canary Deployments:

o A small subset of production traffic is directed to a new version of your application running alongside the existing version.

o You can monitor the canary deployment's performance and stability before gradually
rolling it out to all users.

o Ideal for low-risk changes or for testing new features in a production-like environment.

 Blue/Green Deployments:

o Two identical production environments (blue and green) are maintained.

o Deployments are made to the green environment first.

o Once testing is complete and the green environment is deemed stable, traffic is
switched over from blue to green, effectively replacing the old version with the new
one.

o Minimal downtime as the switch is typically instantaneous.

 Rolling Deployments:

o The new version is gradually rolled out to a percentage of production servers at a time.

o Offers a balance between risk mitigation and faster deployments compared to blue/green.

o Well-suited for applications that can tolerate short periods of inconsistency during
the rollout.

 Traffic Splitting:


o Route a specific percentage of traffic to the new version of your application while the
remaining traffic continues using the existing version.

o Allows for A/B testing and gradual adoption of new features.

o Provides a way to monitor user behavior and application performance with both
versions before fully committing.

 Choosing the Right Strategy:

o The best deployment strategy depends on your application's characteristics, risk tolerance, and desired rollout speed.

o Consider factors like downtime tolerance, rollback mechanisms, and the need for
A/B testing when making your decision.
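
On Cloud Run, traffic splitting and canary rollouts map directly to revision traffic assignments; a hedged sketch with placeholder service and revision names:

```shell
# Send 10% of traffic to the new revision, keep 90% on the old one.
gcloud run services update-traffic my-service \
    --region=us-central1 \
    --to-revisions=my-service-00002-new=10,my-service-00001-old=90

# Promote fully once the canary looks healthy.
gcloud run services update-traffic my-service \
    --region=us-central1 --to-latest
```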

Types of Rollbacks:

o Blue/Green Deployment: This approach involves maintaining two identical production environments (Blue and Green). New deployments happen on Green, and if successful, traffic is switched from Blue to Green. If there are issues, you can simply switch traffic back to Blue.

o Canary Deployment: A small subset of users (canary) are exposed to the new
deployment first. If successful, the rollout continues to a larger group. This allows for
early detection of problems.

o Rollback to Previous Version: This is a simpler approach where you revert the
deployment configuration to the previous working version.

 Considerations for Rollback Strategies:

o Automated vs. Manual: Decide if rollbacks should be triggered automatically based on pre-defined criteria (e.g., failed tests, performance degradation) or require manual intervention.

o Rollback History: Maintain a history of deployments for easy rollback to a specific version.

o Testing Rollbacks: Test your rollback procedures regularly to ensure they function
smoothly.

 Tools for Rollbacks in GCP:

o Cloud Deployment Manager: Lets you define infrastructure configurations and roll back to previous deployments.

o Cloud Spanner: Offers point-in-time recovery, which can help revert unintended database changes.

o Cloud Build Logs & Artifact Registry: Provide historical logs and artifacts for
identifying the rollback point.

Troubleshooting Deployment Issues

Troubleshooting deployment issues is an inevitable part of CI/CD pipelines. Here are some key points
to consider:


 Monitoring & Logging:

o Implement comprehensive monitoring of your application's health and performance throughout the deployment process.

o Utilize Cloud Monitoring and Cloud Logging to gather detailed logs about the
deployment process.

 Identifying the Root Cause:

o Analyze logs to pinpoint the stage in the pipeline where the issue occurred.

o Leverage debugging tools and techniques to identify the specific code or configuration causing the problem.

 Rollback vs. Fix in Place:

o Depending on the severity of the issue, decide if a rollback or a fix-in-place approach is more suitable.

 Version Control & Reproducibility:

o Maintain clear version control of your codebase to easily identify changes that might
have caused the issue.

o Aim to reproduce the deployment issue in a non-production environment to isolate the root cause.

2.3 Managing CI/CD configuration and secrets. Considerations include:


Secure Storage Methods and Key Rotation Services:

 Cloud Key Management Service (KMS): Google Cloud's KMS provides a centralized location
to manage encryption keys. You can create, rotate, and control access to these keys used for
data encryption at rest and in transit.

 Secret Manager: This service securely stores and manages API keys, passwords, certificates,
and other sensitive information. Secret Manager integrates with CI/CD pipelines to inject
secrets securely into your applications at runtime.

Benefits:

 Centralized Management: Both KMS and Secret Manager offer a single point of control for
all your encryption keys and secrets, simplifying administration.

 Enhanced Security: They eliminate the need to store secrets in plain text within your code or
configuration files, reducing the risk of exposure.

 Granular Access Control: You can define fine-grained access permissions for who can access
and use these secrets, ensuring only authorized services can utilize them.

Secret Management:


 Secret Rotation: Regularly rotate your secrets to minimize the window of vulnerability if
compromised. Both KMS and Secret Manager allow automated rotation of secrets.

 Least Privilege: Grant only the minimum level of access required for each service or user to
access secrets. This reduces the impact if a credential is compromised.

 Auditing and Logging: Enable audit logging to track all access attempts and modifications
made to secrets. This helps identify suspicious activity and potential breaches.

Build vs. Runtime Secret Injection:

 Build-Time Injection: Secrets are injected into the application image during the build
process. This approach simplifies deployment but can be risky if the image leaks.

 Runtime Injection: Secrets are injected into the application at runtime using environment
variables or credential providers. This offers better security as secrets don't persist within the
application image.

Choosing the Right Approach:

The best approach depends on the type of secret and your security requirements. Build-time
injection is suitable for non-sensitive data like database connection strings, while runtime injection is
preferred for highly sensitive credentials like API keys.

Additional Tips:

 Use environment variables for runtime secret injection to avoid hardcoding them in your
application code.

 Consider leveraging Secret Manager's native integrations (such as exposing secrets as
environment variables to Cloud Run, Cloud Functions, or Cloud Build workloads) for
secure access to secrets within workloads running on Google Cloud.

 Regularly review and update your CI/CD pipeline configurations to ensure they adhere to
best security practices.
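
Runtime injection in Cloud Build uses the availableSecrets block; a hedged sketch in which the project, secret, and script names are placeholders:

```yaml
steps:
  - id: deploy
    name: gcr.io/cloud-builders/gcloud
    entrypoint: bash
    args: ["-c", "./deploy.sh"]   # the script reads $$API_KEY at runtime
    secretEnv: ["API_KEY"]
availableSecrets:
  secretManager:
    - versionName: projects/my-project/secrets/api-key/versions/latest
      env: API_KEY
```

The secret value never appears in the build config or logs; it is fetched from Secret Manager and exposed only to the steps that list it in secretEnv.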

2.4 Securing the CI/CD deployment pipeline. Considerations include:


Vulnerability Analysis with Artifact Registry

 What it is: Artifact Registry is a GCP service that stores container images and other build
artifacts. It has built-in vulnerability scanning capabilities that automatically identify security
weaknesses in your container images during the build process.

 Benefits:

o Early detection of vulnerabilities: Catches security issues early in the pipeline, preventing deployment of risky code.

o Streamlined workflow: Vulnerability scanning happens automatically, reducing manual effort and saving time.

o Improved security posture: Proactive identification of vulnerabilities helps maintain a strong security posture.

Binary Authorization


 What it is: Binary Authorization is a GCP service that enforces policy-based control over
deployments. It allows you to define rules that dictate which container images can be
deployed to specific environments.

 Benefits:

o Enforces security policies: Ensures only authorized and approved images are
deployed, preventing unauthorized code from reaching production.

o Reduces risk of breaches: Mitigates the risk of deploying vulnerable or malicious images.

o Improves compliance: Helps meet compliance requirements for secure deployments.

IAM Policies per Environment

 What it is: Identity and Access Management (IAM) is a core GCP service for controlling
access to resources. By defining IAM policies per environment (development, staging,
production), you can restrict access to specific users or groups for each stage of the pipeline.

 Benefits:

o Principle of Least Privilege: Provides granular control over access, ensuring users only
have the permissions necessary for their role in the pipeline.

o Reduces accidental deployments: Prevents unauthorized users from deploying code to production environments.

o Improved accountability: Provides clear audit trails for tracking who has access to
what resources within the pipeline.

Utilizing these tools together:

By combining these features, you can create a robust security posture for your CI/CD pipeline. Here's
a possible workflow:

1. Developer pushes code to a version control system like Git.

2. Cloud Build triggers a build pipeline based on the push.

3. The pipeline builds the container image and scans it for vulnerabilities using Artifact Registry.

4. If vulnerabilities are found, the build fails, and developers are notified.

5. Once the build succeeds, Binary Authorization verifies if the image is authorized for
deployment to the target environment.

6. If authorized, Cloud Build deploys the image to the appropriate environment (dev, staging,
production) with access restricted by IAM policies.
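
The Binary Authorization policy itself is managed as a document you export, edit, and re-import; a hedged sketch of that workflow (the project and attestor names are placeholders, and the attestor is assumed to exist already):

```shell
# Export the current Binary Authorization policy for the project.
gcloud container binauthz policy export > policy.yaml

# Edit policy.yaml to require attestations, e.g.:
#   defaultAdmissionRule:
#     evaluationMode: REQUIRE_ATTESTATION
#     enforcementMode: ENFORCED_BLOCK_AND_AUDIT_LOG
#     requireAttestationsBy:
#       - projects/my-project/attestors/build-attestor

# Re-import the edited policy so deployments are enforced against it.
gcloud container binauthz policy import policy.yaml
```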

Section 3: Applying site reliability engineering practices to a service (~23% of the exam)
3.1 Balancing change, velocity, and reliability of the service. Considerations include:
Discovering SLIs (Service Level Indicators):


 These are measurable metrics that reflect a service's performance from the user's
perspective. They act as the foundation for monitoring and evaluating your service's health.

 Examples of SLIs on Google Cloud Platform (GCP) include:

o Availability: Percentage of time your service is operational (e.g., uptime). You can
use Cloud Monitoring to track this metric.

o Latency: Response time experienced by users when interacting with your service.

Defining SLOs (Service Level Objectives) and Understanding SLAs (Service Level Agreements):

 SLOs: These are targets you set for your SLIs. They define the acceptable level of
performance for your service. For instance, an SLO for availability might be 99.9% uptime.

 SLAs: These are formal agreements between you (service provider) and your users
(customers) that outline the expected level of service. They often translate SLOs into
business terms with potential repercussions for not meeting them.

Error Budgets:

 This concept treats reliability as a spendable resource. You define an acceptable error rate
(budget) for your service based on your SLOs and risk tolerance.

 Each error "costs" from your budget. Proactive monitoring and mitigation strategies help
ensure you don't exceed your budget and compromise service quality.
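The budget arithmetic is simple enough to work through directly. In this sketch the SLO, window, and consumed downtime are illustrative numbers, not values from any real service:

```python
# Error budget for a 99.9% availability SLO over a rolling 30-day window.
SLO = 0.999
WINDOW_MINUTES = 30 * 24 * 60                 # 43,200 minutes in the window

budget_minutes = (1 - SLO) * WINDOW_MINUTES   # allowed "bad" minutes: 43.2
consumed_minutes = 20                         # downtime observed so far (example)
remaining_minutes = budget_minutes - consumed_minutes

print(f"budget={budget_minutes:.1f}min consumed={consumed_minutes}min "
      f"remaining={remaining_minutes:.1f}min")
```

When the remaining budget approaches zero, SRE practice is to slow feature releases until reliability work replenishes it.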

Toil Automation:

 DevOps engineers often get bogged down in repetitive, manual tasks (toil). Automating these
tasks using tools like Cloud Functions or Cloud Workflows frees up your time for more
strategic initiatives.

 By automating toil, you streamline processes, improve efficiency, and reduce the risk of
human error.

Opportunity Cost of Risk and Reliability (e.g., Number of Nines):

 The more "nines" you strive for in your availability SLO (e.g., 99.999% uptime), the more
resources it requires. There's a trade-off between achieving high reliability and the
associated cost (infrastructure, development effort).

 You need to assess the impact of downtime on your business and users to determine the
optimal level of reliability to target.
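The cost of each additional nine is easiest to see as the downtime it leaves you per year:

```python
# Maximum allowed downtime per year for common availability targets.
MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600

allowed_downtime = {
    "99.9%":   (1 - 0.999)   * MINUTES_PER_YEAR,   # ~525.6 min (~8.8 hours)
    "99.99%":  (1 - 0.9999)  * MINUTES_PER_YEAR,   # ~52.6 min
    "99.999%": (1 - 0.99999) * MINUTES_PER_YEAR,   # ~5.3 min
}
for target, minutes in allowed_downtime.items():
    print(f"{target:>8} -> {minutes:7.1f} min/year of allowed downtime")
```

Each extra nine cuts the allowed downtime by a factor of ten, which is why the engineering cost rises so steeply.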

3.2 Managing service lifecycle. Considerations include:


Service Management:

This encompasses the various stages a service goes through, ensuring a smooth and efficient
journey:

 Pre-service Onboarding: Before launching a new service, a DevOps Engineer implements a
structured approach. This might involve using a checklist to guarantee essential
configurations are in place. This could include setting up security best practices, defining
monitoring tools, and ensuring proper logging mechanisms.

 Launch Plan & Deployment Plan: Developing a clear launch plan ensures a coordinated
rollout of the new service. This plan outlines tasks, dependencies, and communication
strategies for a successful launch. A deployment plan specifies the technical steps for
deploying the service to GCP. This includes choosing the appropriate deployment method
(e.g., blue-green deployment, canary deployments) and automating the process using tools
like Cloud Build and Cloud Deploy.

 Deployment & Monitoring: The DevOps Engineer oversees the actual deployment of the
service to GCP. This involves monitoring the deployment process for any errors and ensuring
a smooth transition. Following deployment, ongoing monitoring is crucial. The engineer
utilizes tools like Cloud Monitoring to track the health and performance of the service,
identifying and resolving any issues promptly.

 Maintenance & Updates: Services require ongoing maintenance to ensure optimal
performance and security. DevOps engineers implement processes for patching
vulnerabilities, updating dependencies, and performing regular backups. This may involve
automating these tasks as part of a CI/CD pipeline.

 Retirement: When a service reaches its end-of-life, a well-defined retirement plan helps
minimize disruption. This includes migrating data to alternative services, notifying users, and
gracefully shutting down the old service.

Capacity Planning (Quotas & Limits Management):

Efficient use of GCP resources is vital. A DevOps Engineer plays a key role in managing quotas and
limits:

 Quotas: GCP enforces quotas on resource usage to prevent exceeding costs or impacting
other users. The engineer understands these quotas and sets them appropriately for each
service, ensuring smooth operation while staying within budget. Tools like Cloud Billing can
help monitor resource usage and identify potential quota issues.

 Limits: GCP also has inherent limits on specific resources like CPU or memory. The engineer
considers these limits during service design and scales the service infrastructure effectively
to meet anticipated demand. This may involve using autoscaling features to automatically
adjust resource allocation based on real-time needs.

Autoscaling using managed instance groups, Cloud Run, Cloud Functions, or GKE

Autoscaling is a critical technique for ensuring your service can handle fluctuating demand without
compromising performance or incurring unnecessary costs. Google Cloud offers various options for
autoscaling depending on your application type:

 Managed Instance Groups (MIGs): MIGs are pre-configured virtual machine (VM) templates
that can be automatically scaled up or down based on predefined metrics like CPU utilization
or network traffic. This is ideal for stateful applications that require persistent storage.

 Cloud Run: This serverless platform automatically scales your containerized applications
based on incoming requests. You simply deploy your container image, and Cloud Run
manages the underlying infrastructure. This is perfect for stateless, web-based applications.


 Cloud Functions: Similar to Cloud Run, Cloud Functions are another serverless offering for
deploying event-driven, short-lived functions. They automatically scale based on the number
of incoming events, making them ideal for microservices and background tasks.

 Google Kubernetes Engine (GKE): GKE, a managed Kubernetes service, allows you to
leverage horizontal pod autoscaler (HPA) for automatic scaling of containerized applications
within your Kubernetes cluster. HPA scales pods based on CPU or memory usage.
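The GKE option is typically configured with a HorizontalPodAutoscaler manifest. A minimal sketch, where the Deployment name `web` and the 70% CPU target are hypothetical values:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:        # the workload to scale
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # add pods when average CPU exceeds 70%
```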

Implementing feedback loops to improve a service

Feedback loops are essential for continuous improvement of your services. They allow you to gather
data on performance, identify areas for optimization, and iterate on your deployments. Here's how
to implement feedback loops in Google Cloud:

 Monitoring and Logging: Utilize services like Cloud Monitoring and Cloud Logging to collect
data on service health, performance metrics, and user interactions. This data provides
insights into potential bottlenecks or areas for improvement.

 Alerting: Set up alerts based on predefined thresholds in your monitoring data. This allows
you to proactively address issues before they significantly impact users.

 CI/CD Pipeline Integration: Integrate feedback loops into your CI/CD pipeline. Analyze
metrics and logs during deployments to identify regressions or performance degradations.
This enables you to rollback problematic deployments or trigger corrective actions
automatically.

 A/B Testing: Implement A/B testing to compare different versions of your service and
identify features or configurations that improve performance or user experience. Cloud Run
or Firebase can be helpful tools for A/B testing.

 User Feedback: Actively collect user feedback through surveys, in-app ratings, or customer
support channels. Analyze this feedback to understand user needs and identify areas for
service improvement.

3.3 Ensuring healthy communication and collaboration for operations. Considerations include:
Preventing Burnout

Burnout is a serious threat to any team's productivity and morale. As a DevOps Engineer, you can
take steps to mitigate it:

 Implement Automation Processes: Repetitive tasks are prime candidates for automation
using tools like Google Cloud Functions or Cloud Workflows. This frees up your team's time
and mental energy for more strategic initiatives.

 Encourage Breaks and Time Off: Constant work cycles can lead to burnout. Promote a
culture of taking breaks throughout the day and scheduling time off for vacations and mental
health.

 Promote Work-Life Balance: A healthy work-life balance is essential. Set clear expectations
and avoid creating an environment where people feel pressured to work long hours all the
time.


Fostering a Culture of Learning and Blamelessness

A positive team environment that values learning and collaboration is key to success:

 Encourage Open Communication and Knowledge Sharing: Create a space where team
members feel comfortable asking questions, sharing knowledge, and learning from each
other's experiences.

 Focus on Learning from Mistakes: Shift the focus from assigning blame to identifying the
root cause of problems and using them as opportunities to learn and improve processes.

 Provide Opportunities for Learning: Support your team's professional development by
sponsoring training courses, conferences, or allowing time for them to explore new
technologies relevant to their roles.

Establishing Joint Ownership of Services

Breaking down team silos and fostering a sense of shared responsibility is essential for effective
DevOps:

 Shared Responsibility for Services: Assign ownership of services across teams instead of
within a single team. This promotes collaboration and ensures no single team becomes a
bottleneck.

 Encourage Cross-Training: Invest in cross-training team members to give everyone a basic
understanding of all the services involved in your applications. This broadens skillsets and
allows for better coverage during absences or emergencies.

 Clear Ownership Models: While promoting shared responsibility, it's also important to have
clear ownership models to avoid confusion and duplication of effort. Define roles and
responsibilities for each service to ensure everyone is accountable for their part.

3.4 Mitigating incident impact on users. Considerations include:


Communicating during an incident:

 Establish a clear communication plan: Having a predefined communication plan ensures
everyone involved knows their roles and how to communicate effectively during an incident.
This plan should outline who will communicate with users, what information will be shared,
and how often updates will be provided.

 Provide updates on the incident status, root cause, and resolution timeline: Transparency is
key. Users appreciate timely updates that explain the current situation, what's causing the
problem, and the estimated timeframe for a resolution.

 Use multiple communication channels: Reach a wider audience by using a combination of
channels like email, a status page, and social media. This ensures users who prefer a specific
platform are kept informed.

 Be transparent and honest about the situation: Even if the news isn't ideal, honesty builds
trust with users. Explain the situation clearly and avoid sugarcoating the problem.

Draining/redirecting traffic:


 Divert traffic away from the affected service: When possible, minimize the impact on users
by temporarily routing traffic away from the malfunctioning service. This can be achieved
using load balancers or traffic management tools like Traffic Director. By diverting
traffic, you can prevent further overloading and allow for a faster recovery process.

Adding capacity:

 Scale up resources: In certain scenarios, adding resources like servers or databases can
mitigate the impact of an incident. This approach is particularly effective for incidents caused
by resource exhaustion, where the system is overloaded.

 Consideration before adding capacity: Scaling up resources might take time and may not
always be the most suitable solution. Evaluate the situation to determine if it's the most
efficient course of action.

3.5 Conducting a postmortem. Considerations include:


Documenting Root Causes

 Identify the Chain of Events: Chronologically detail what happened, when it happened, and
the sequence of events that led to the incident. Utilize monitoring data, logs, and timelines
to reconstruct the incident.

 Focus on "Why" over "Who": The goal is to understand the underlying causes, not assign
blame. Analyze code changes, configuration errors, infrastructure issues, or external
dependencies that might have triggered the problem.

 Gather Evidence: Include screenshots, error messages, and relevant data points to support
your analysis.

Creating and Prioritizing Action Items

 Define Solutions: Based on the identified root causes, propose actions to prevent similar
incidents in the future.

 Prioritize Actions: Evaluate the impact and likelihood of recurrence for each root cause.
Focus on high-impact, likely-to-recur issues first.

 Assign Ownership: Clearly assign ownership for each action item to ensure accountability
and timely completion.

 Set Deadlines: Establish clear deadlines for implementing the action items to maintain
momentum and prevent issues from lingering.

Communicating the Postmortem to Stakeholders

 Target the Audience: Tailor the level of technical detail based on the audience. A technical
report for engineers might differ from a high-level summary for business stakeholders.

 Focus on Impact and Resolution: Clearly communicate the impact of the incident on users or
services, and the steps taken to resolve it.

 Transparency and Learnings: Emphasize the learnings gained from the incident to build trust
and demonstrate a commitment to improvement.


 Choose the Right Format: Consider using written reports, presentations, or even blameless
postmortem discussions (https://www.youtube.com/watch?v=C_nywn1aR44) to share the
information effectively.

Section 4: Implementing service monitoring strategies (~21% of the exam)


4.1 Managing logs. Considerations include:
Collecting Logs with Cloud Logging:

 Structured vs. Unstructured Logs:

o Structured logs: Machine-readable logs with a defined format (e.g., JSON, CSV).
Easier to analyze and filter.

o Unstructured logs: Human-readable text logs often containing free-form text and
varying formats. Require additional processing for analysis.

 Sources: Cloud Logging can collect logs from various Google Cloud services:

o Compute Engine: Logs from virtual machines (VMs) running on Compute Engine.

o GKE (Google Kubernetes Engine): Logs from containerized applications running on Kubernetes clusters.

o Serverless Platforms: Logs from Google Cloud Functions and other serverless
services.

 Cloud Logging Agent: This agent is installed on VMs and GKE nodes to automatically collect
and send logs to Cloud Logging. You can configure the agent to:

o Specify the log files to collect (based on paths or filters).

o Set the severity level of logs (e.g., debug, info, warn, error).

o Define log retention policies.

Configuring the Cloud Logging Agent:

 The agent configuration file (logging.yaml) specifies what logs to collect and how to send
them.

 You can configure:

o Filters: Include or exclude specific log messages based on patterns.

o Resource labels: Attach labels to logs for easier identification and filtering (e.g.,
environment, application name).

o Destinations: Specify where to send logs (Cloud Logging, Cloud Storage, etc.).
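A sketch of such an agent configuration is below. Note that the exact file name and schema depend on which agent generation you run; this follows the Ops Agent's config.yaml layout, and the log path and receiver/pipeline names are hypothetical:

```yaml
logging:
  receivers:
    app_logs:
      type: files
      include_paths:
        - /var/log/myapp/*.log   # hypothetical application log path
  processors:
    parse_json:
      type: parse_json           # treat each line as a structured JSON entry
  service:
    pipelines:
      app_pipeline:
        receivers: [app_logs]
        processors: [parse_json]
```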

Collecting Logs from Outside Google Cloud:

Cloud Logging can also collect logs from on-premises environments or other cloud providers. Here
are common methods:


 Fluentd: An open-source log aggregator that can forward logs from various sources to Cloud
Logging.

 Syslog forwarding: Configure on-premises systems to send logs to a central syslog server that
then forwards them to Cloud Logging.

 APIs and SDKs: Use Cloud Logging APIs or SDKs in custom applications to directly send logs
from any environment.

Sending Application Logs Directly to the Cloud Logging API:

 Cloud Logging API: This is Google Cloud's managed service for collecting, storing, processing,
and analyzing logs. It offers a scalable and centralized way to handle logs from your
applications and infrastructure.

 Benefits:

o Centralized View: All logs are stored in one place, making it easier to search, analyze,
and troubleshoot issues.

o Scalability: Cloud Logging can handle massive volumes of logs without impacting
performance.

o Integration: Integrates with other Google Cloud services like Cloud Monitoring
for comprehensive monitoring.

Log Levels:

 Log levels define the severity and verbosity of logged information. Common levels include:

o Debug: Detailed information for troubleshooting (most verbose)

o Info: Informational messages about application flow

o Warning: Potential issues that might not cause immediate failures

o Error: Messages indicating errors that have occurred

o Fatal: Critical errors that have caused the application to crash (least verbose)

 Choosing Log Levels:

o Select log levels based on the desired level of detail and potential impact on log
volume and storage costs.

o Generally, use higher levels (debug) during development and testing, and lower
levels (info, error) in production.

Optimizing Logs:

 Optimizing logs ensures you capture essential information without generating excessive data:

o Multiline Logging: Break long messages into smaller chunks for readability.

o Exceptions: Log only relevant information from exceptions, not the entire stack
trace.

o Log Size: Minimize the size of each log message by avoiding unnecessary data.


o Cost: Consider the volume and retention period of logs to manage storage costs.
Cloud Logging offers tiered pricing based on log volume.

Additional Tips:

 Structured Logs: Use structured logging formats (e.g., JSON) for easier parsing and analysis
by tools.

 Log Rotation: Set up log rotation policies to automatically archive and manage older logs.

 Monitoring Logs: Integrate Cloud Logging with Cloud Monitoring to create alerts based
on specific log patterns.

4.2 Managing metrics with Cloud Monitoring. Considerations include:


Collecting and Analyzing Application and Platform Metrics:

 Application Metrics: These are measurements that reflect the health and performance of
your application code. Examples include request latency, API call success rates, memory
usage, and thread pool saturation.

o Instrumentation: You can instrument your application code to emit these metrics
using libraries or frameworks provided by your programming language or platform
(e.g., SDKs for Java, Python, etc.).

o Supported Sources: Cloud Monitoring collects application metrics from various sources:

 Cloud Monitoring agents: These agents are deployed alongside your
application on Compute Engine, Kubernetes Engine (GKE), or Cloud
Functions and automatically collect metrics.

 OpenTelemetry: This vendor-neutral framework allows collecting metrics
from diverse sources and sending them to Cloud Monitoring.

 APIs: You can directly send metrics to Cloud Monitoring via its API.

 Platform Metrics: These are metrics provided by Google Cloud for its services and
infrastructure. They offer insights into resource utilization, network performance, and overall
platform health.

o Examples: CPU usage, disk I/O, network throughput, and API quota usage for Google
Cloud services.

o Pre-configured: Cloud Monitoring automatically collects these metrics, so you don't
need to instrument your application.

Collecting Networking and Service Mesh Metrics:

 Networking Metrics: These provide insights into the health and performance of your
network traffic.

o Sources: Cloud Monitoring collects network metrics from:


 Cloud Load Balancing: Measures traffic distribution, latency, and backend
health.

 Cloud VPN: Monitors tunnel health and throughput.

 Cloud CDN: Provides CDN edge server performance metrics.

 VPC Flow Logs: These detailed logs capture network traffic information for
analysis with Cloud Monitoring.

 Service Mesh Metrics: If you use a service mesh like Istio, Cloud Monitoring can collect
metrics related to service calls, latency, error rates, and overall mesh health.

o Integration: The specific integration method depends on your service mesh
implementation.

Using Metrics Explorer for Ad Hoc Metric Analysis:

 Metrics Explorer: This is a web-based tool in Cloud Monitoring that allows you to:

o Visualize Metrics: Plot graphs of collected metrics over time.

o Filter and Aggregate Data: Narrow down and group metrics for specific resources,
time ranges, or labels.

o Identify Trends and Anomalies: Use visualizations to spot unusual patterns or
performance issues.

o Perform Calculations: Create custom metrics by deriving new metrics from existing
ones using mathematical functions.

Creating Custom Metrics from Logs:

 Custom Metrics: These are metrics you define based on patterns or trends found in your
application logs.

o Use Case: For example, you might create a custom metric for the rate of failed login
attempts from your logs.

 Process: Cloud Monitoring allows you to define filters that extract specific log data points
and convert them into custom metrics.

o Benefits: Custom metrics offer insights beyond standard application metrics and
provide a more holistic view of your system's behavior.
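A log-based counter metric behaves like the sketch below: a filter pattern is applied to incoming entries, and matches are counted. The log lines and the filter pattern here are hypothetical:

```python
import re

# Hypothetical log lines and the filter a log-based counter would apply.
logs = [
    "INFO login ok user=alice",
    "WARNING login failed user=bob reason=bad_password",
    "INFO login ok user=carol",
    "WARNING login failed user=eve reason=bad_password",
]
pattern = re.compile(r"login failed")       # the metric's filter expression
failed_logins = sum(1 for line in logs if pattern.search(line))
print("failed_logins =", failed_logins)     # the value the counter reports
```

In Cloud Monitoring you would define the same filter once, and the resulting metric could then drive dashboards and alerting policies like any built-in metric.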

4.3 Managing dashboards and alerts in Cloud Monitoring. Considerations include:


Creating a Monitoring Dashboard

A monitoring dashboard is a visual representation of key metrics and alerts related to your Google
Cloud resources. It helps you stay informed about the health and performance of your applications
and infrastructure at a glance. Here's how to create one:

1. Navigate to Cloud Monitoring: Go to the Google Cloud Console and select "Cloud
Monitoring" from the navigation menu.


2. Create Dashboard: Click the "Create dashboard" button.

3. Add Charts: Select the metrics you want to display from the available options. You can
choose from various chart types (line, gauge, pie chart, etc.) to visualize the data effectively.

4. Customize Appearance: Edit titles, labels, and chart properties to make the dashboard
visually appealing and clear.

5. Save and Share: Give your dashboard a descriptive name and save it. You can also share it
with other users by granting them appropriate viewing permissions.

Filtering and Sharing Dashboards

 Filtering: You can filter the data displayed in your dashboard by applying time ranges,
resource types, or specific metric values. This allows you to focus on specific aspects of your
monitoring data. For example, you might filter a dashboard to show only metrics for a
particular application or service during a specific deployment window.

 Sharing: Granting others access to your dashboards helps them stay informed and
collaborate on monitoring tasks. You can set different permission levels, such as "View only"
or "Edit," to control how others interact with your dashboards.

Configuring Alerting

Alerts notify you when specific conditions are met in your monitoring data. This helps you proactively
identify and address potential issues before they impact your users or applications. Here's how to
configure alerts:

1. Define Alert Policy: Create an alert policy that specifies the metric, condition (e.g., exceeding
a threshold), and severity level (e.g., critical, warning) to trigger an alert.

2. Set Notification Channels: Choose how you want to receive alerts, such as email, SMS, or
integration with a ticketing system.

3. Refine Configuration: You can further refine your alerts by setting escalation policies,
silencing rules (temporarily disabling alerts), and defining auto-remediation actions
(automatic responses to specific alerts).

Additional Considerations:

 Dashboard Design: When creating dashboards, consider the audience and their needs. Focus
on presenting information clearly and concisely. Use a logical layout and intuitive
visualizations to make the data easy to understand.

 Alerting Thresholds: Set realistic and relevant thresholds for your alerts to avoid alert fatigue
(receiving too many unnecessary notifications).

 Proactive Monitoring: Don't wait for alerts to identify issues. Regularly review your
dashboards to identify trends and potential problems before they escalate.

Defining Alerting Policies Based on SLOs and SLIs

 SLO (Service Level Objective): An SLO is a measurable target that defines the expected
performance of a service. It's a high-level goal, like "Our website will be available 99.95% of
the time."


 SLI (Service Level Indicator): An SLI is a metric you track to measure progress towards your
SLO. For website availability, an SLI could be "percentage of successful requests over a 5-minute window."

Alerting based on SLOs and SLIs:

1. Set SLOs: Define your service's desired performance level using SLOs.

2. Identify SLIs: Choose metrics (SLIs) that accurately reflect your SLOs. Cloud Monitoring
provides various metrics for resources like VMs, databases, and custom metrics.

3. Create Alerting Policies: Configure alerts in Cloud Monitoring to trigger when an SLI deviates
from your SLO targets. For example, an alert could fire if the website's successful request
percentage falls below 99.95% for 5 minutes.
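The core of such an alerting condition is a threshold check on the SLI. A minimal sketch, where the SLO target and the low-traffic guard (a common trick to avoid noisy alerts on tiny request counts) use hypothetical values:

```python
def should_alert(good, total, slo=0.9995, min_requests=100):
    """Fire when the success-ratio SLI drops below the SLO target.

    Windows with very little traffic are skipped so a single failed
    request cannot trip the alert on its own.
    """
    if total < min_requests:
        return False
    return (good / total) < slo

print(should_alert(99940, 100000))  # 0.9994 < 0.9995 -> True
print(should_alert(50, 50))         # too little traffic -> False
```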

Benefits:

 Proactive issue detection

 Faster troubleshooting

 Improved service uptime and performance

Automating Alerting Policy Definition Using Terraform

Terraform is an Infrastructure as Code (IaC) tool that lets you manage cloud resources using code.
You can leverage Terraform to automate the creation and configuration of alerting policies in Cloud
Monitoring.

Here's how:

1. Define Terraform configuration: Write Terraform code that specifies the resources (e.g.,
metric, condition, notification channel) for your alerting policy.

2. Reference SLOs and SLIs: Use variables or data sources in Terraform to reference your
defined SLOs and SLIs.

3. Version control and deployment: Manage your Terraform code in a version control system
(like Git) and use CI/CD pipelines to automatically deploy and update alerting policies
whenever your infrastructure changes.
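The steps above can be sketched in Terraform roughly as follows. The resource names, metric filter, and threshold variable are hypothetical; check the Google provider documentation for the full schema of `google_monitoring_alert_policy`:

```hcl
resource "google_monitoring_alert_policy" "slo_burn" {
  display_name = "Checkout availability below SLO"  # hypothetical name
  combiner     = "OR"

  conditions {
    display_name = "Failure ratio above target"
    condition_threshold {
      # Hypothetical log-based metric standing in for the SLI.
      filter          = "metric.type=\"logging.googleapis.com/user/failed_requests\" resource.type=\"gce_instance\""
      comparison      = "COMPARISON_GT"
      threshold_value = var.error_rate_threshold  # e.g. 0.0005 for a 99.95% SLO
      duration        = "300s"
    }
  }

  notification_channels = [google_monitoring_notification_channel.oncall.id]
}
```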

Benefits:

 Consistent and repeatable alerting configurations

 Reduced manual configuration errors

 Easier policy updates and rollbacks

Using Google Cloud Managed Service for Prometheus to Collect Metrics and Set Up Monitoring
and Alerting

Prometheus is an open-source monitoring tool that collects and analyzes metrics. Google Cloud
Managed Service for Prometheus is a fully managed service that simplifies running Prometheus on
Google Cloud.

How it works:


1. Deploy Managed Prometheus: Provision a Managed Prometheus instance in your GCP project.

2. Configure Metric Scrapers: Define configurations (called "scrapers") in Prometheus to collect
metrics from your resources. These can be Google Cloud services, custom applications, or
external systems.

3. Set Up Alerting Rules: Write alerting rules within Prometheus based on the collected
metrics. These rules trigger alerts when specific conditions are met.

4. Integrate with Cloud Monitoring: You can integrate Managed Prometheus with Cloud
Monitoring for centralized alerting and visualization.
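A Prometheus alerting rule (step 3) looks like the sketch below. The metric name `http_requests_total`, the 0.1% threshold, and the label values are hypothetical:

```yaml
groups:
  - name: availability
    rules:
      - alert: HighErrorRate            # hypothetical alert name
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.001
        for: 10m                        # must hold for 10 minutes before firing
        labels:
          severity: page
        annotations:
          summary: "5xx ratio has exceeded 0.1% for 10 minutes"
```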

Benefits:

 Scalable metric collection and analysis

 Powerful alerting capabilities with Prometheus rule language

 Integration with other Google Cloud monitoring tools

Choosing the Right Approach:

 Simple monitoring: If you need basic alerting for a few resources, manual configuration in
Cloud Monitoring might suffice.

 Large-scale deployments: For complex deployments with numerous metrics and
customizable alerting, Managed Prometheus is a good fit.

 Infrastructure as Code: If you prefer managing infrastructure configurations in code (IaC),
Terraform automation is a valuable option.

Additional Considerations:

 Alerting Fatigue: Avoid creating too many alerts, leading to alert fatigue (ignoring important
alerts).

 Alert Routing and Escalation: Define who gets notified for different types of alerts and how
escalation happens for critical issues.

 Alerting Best Practices: Follow best practices for writing effective alerting rules to ensure
they trigger at the right time and provide actionable information.

4.4 Managing Cloud Logging platform. Considerations include:


Cloud Logging Platform

Cloud Logging is a managed service in Google Cloud that centralizes log data collection, storage,
analysis, and export. It simplifies log management by offering a unified platform for various sources,
including applications, infrastructure, and platforms.

Key Considerations:

Enabling Data Access Logs (Cloud Audit Logs):

o Cloud Audit Logs record API calls, user activities, and resource changes within your
Google Cloud projects.


o To enable them:

 Go to the IAM & Admin console (https://console.cloud.google.com/iam-admin)

 Select the project you want to monitor.

 Navigate to "Audit Logs" under "Activity."

 Choose the logging bucket and sink to store and export logs.

o Benefits:

 Provide insights into user actions and resource modifications for security and
compliance purposes.

Enabling VPC Flow Logs:

o VPC Flow Logs capture information about network traffic traversing your Virtual
Private Cloud (VPC) network.

o To enable them:

 Go to the VPC networks page in the Google Cloud console.

 Select the VPC network and open the subnet you want to monitor.

 Edit the subnet and set "Flow logs" to On.

 In Cloud Logging, choose the logging bucket and sink for log storage and export.

o Benefits:

 Aid in network troubleshooting by providing details on ingress and egress
traffic, source and destination IP addresses, protocols, and packet counts.

Viewing Logs in the Google Cloud Console:

o The Cloud Logging console offers a user-friendly interface to explore your logs.

o Access it from the Cloud Logging menu in the Google Cloud console
(https://cloud.google.com/logging).

o Features:

 View logs from various sources in a single location.

 Filter logs based on timestamps, severity levels, resources, and log entries.

 Search for specific log messages using keywords.

 Create custom dashboards to visualize log data over time.

Basic vs. Advanced Log Filters:

o Cloud Logging provides filtering capabilities to narrow down log entries for specific
analysis.

o Basic filters:


 Filter by timestamps (e.g., logs within the last hour).

 Filter by severity levels (e.g., errors, warnings).

 Filter by resource type (e.g., logs from a specific Compute Engine instance).

o Advanced filters (using expressions):

 Combine multiple filters using logical operators (AND, OR, NOT).

 Filter by log entry content using regular expressions.

o Choosing the right filter depends on the complexity of your logs and the level of
detail needed.

Logs Exclusion vs. Logs Export:

o Exclusion:

 Involves filtering out unwanted logs before they are stored or exported.

 Reduces storage costs and simplifies log analysis by excluding irrelevant data.

 You can define exclusion filters in Cloud Logging's configuration.

o Export:

 Involves sending logs to external destinations like BigQuery, Cloud Storage,
or third-party SIEM (Security Information and Event Management) systems.

 Allows for further analysis, long-term archiving, or integration with other
tools.

 You can configure log exports (sinks) in Cloud Logging.

Project-level vs. Organization-level Export:

 Project-level export: This means you're sending logs generated within a specific GCP project
to a destination like Cloud Storage or BigQuery. This is useful for isolating logs from individual
projects for analysis or compliance purposes.

 Organization-level export: Here, you're exporting logs from all projects within your GCP
organization to a centralized location. This is efficient for large-scale log management and
works well if you have consistent logging needs across projects.

Choosing between them depends on factors like:

 Log Volume: For high-volume projects, organization-level export might be more efficient.

 Compliance Needs: If specific projects have stricter compliance requirements, project-level
export offers better isolation.

 Cost Optimization: Organization-level export can be more cost-effective for centralized
storage.

Managing and Viewing Log Exports:


 Sinks: Google Cloud Logging uses "sinks" to define where logs are exported. You can
configure sinks to send logs to various destinations like Cloud Storage buckets, BigQuery
datasets, or external systems.

 Cloud Logging Viewer: The Google Cloud Console provides a built-in Cloud Logging Viewer
for exploring exported logs. You can filter and search logs based on severity, timestamps,
resource types, and other criteria.

 APIs and Logs Explorer: Additionally, you can manage and view logs programmatically using
the Cloud Logging API or through advanced tools like the Logs Explorer for more granular
analysis.

Sending Logs to an External Logging Platform:

 Cloud Logging allows integration with external logging systems like Splunk or ELK Stack. You
can configure sinks to send logs to these platforms for centralized log management alongside
logs from other sources.

 This is beneficial for organizations already invested in an external logging ecosystem and
wanting to consolidate all logs in one place.

Filtering and Redacting Sensitive Data (PII and PHI):

 Cloud Logging offers powerful filtering capabilities to focus on relevant logs. You can filter
based on log severity, resource type, timestamps, and custom log fields.

 Redacting sensitive data is crucial for security and compliance. Cloud Logging supports log
exclusion filters to remove specific data patterns (e.g., credit card numbers) before export.
You can also use redaction engines to dynamically redact sensitive data based on pre-defined
rules.
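The redaction idea can be sketched with ordinary regular expressions. The rule names and patterns below are illustrative examples, not Cloud DLP infoTypes or Cloud Logging features:

```python
import re

# Illustrative redaction rules; real deployments would use a redaction
# engine or Cloud DLP with vetted detectors rather than ad hoc regexes.
REDACTION_RULES = {
    "CREDIT_CARD": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def redact(text):
    """Replace each sensitive match with a [REDACTED:<type>] placeholder."""
    for name, pattern in REDACTION_RULES.items():
        text = pattern.sub(f"[REDACTED:{name}]", text)
    return text

entry = "user bob@example.com paid with 4111 1111 1111 1111"
print(redact(entry))
```

Applying this before export means the sensitive values never leave the logging pipeline, which is the property compliance reviews usually ask for.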

4.5 Implementing logging and monitoring access controls. Considerations include:


Restricting Access to Audit Logs and VPC Flow Logs with Cloud Logging

 Audit Logs: These logs record API activity within your Google Cloud project. They provide a
crucial record of who did what, when, and where. Restricting access ensures only authorized
users can view these logs, protecting sensitive information about your project activity.

o IAM Roles: Use Google Cloud Identity and Access Management (IAM) roles to grant
granular access to audit logs. The Logs Viewer role (roles/logging.viewer) lets users
read most logs, while the Private Logs Viewer role (roles/logging.privateLogViewer) is
additionally required to read Data Access audit logs.

 VPC Flow Logs: These logs capture network traffic information flowing into, out of, and
within your Virtual Private Cloud (VPC). Restricting access prevents unauthorized users from
seeing details about your network traffic patterns.

o IAM Policies: Create IAM policies at the folder, organization, or project level to
control access to VPC Flow Logs. You can define who can view or export these logs
based on specific IAM roles or service accounts.

Restricting Export Configuration with Cloud Logging


 Log Exports: Cloud Logging allows you to export logs to external destinations like BigQuery
for analysis or compliance purposes. Restricting export configuration ensures only authorized
users can modify these destinations, preventing accidental data exposure or loss.

o IAM Permissions: Use IAM permissions such as logging.sinks.create, logging.sinks.get,
and logging.sinks.update to control who can create, view, or edit log export
configurations. This ensures only authorized users can set up or modify where your logs
are exported.
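The permission names above are real, but the helper below is only an illustration of a least-privilege review, not an IAM API call. It checks whether a principal's granted permissions cover sink management:

```python
# Permissions required to fully manage log export sinks (real permission
# names); the checking helper itself is a hypothetical review aid.
SINK_ADMIN_PERMISSIONS = {
    "logging.sinks.create",
    "logging.sinks.get",
    "logging.sinks.list",
    "logging.sinks.update",
    "logging.sinks.delete",
}

def missing_permissions(granted, required=SINK_ADMIN_PERMISSIONS):
    """Return the required permissions a principal does not yet hold."""
    return sorted(required - set(granted))

viewer = {"logging.sinks.get", "logging.sinks.list"}
print(missing_permissions(viewer))  # the viewer cannot create, update, or delete sinks
```

A review like this makes it explicit when a role grants more (or less) than the task needs, which is the heart of the least-privilege principle discussed below.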

Allowing Metric and Log Writing with Cloud Monitoring

 Writing Metrics and Logs: While access controls restrict viewing, it's important to allow
authorized applications and services to write metrics and logs to Cloud Monitoring. This data
is essential for monitoring system health, performance, and identifying issues.

o IAM Roles: Grant narrowly scoped IAM roles such as Monitoring Metric Writer
(roles/monitoring.metricWriter) for metrics and Logs Writer (roles/logging.logWriter)
for logs to applications or service accounts. This allows them to push telemetry to
Cloud Monitoring and Cloud Logging for analysis and alerting without also granting
read access.

Additional Considerations:

 Principle of Least Privilege: Grant users the minimum level of access required for their role.
This minimizes the risk of unauthorized access to sensitive information.

 Monitoring Access Controls: Regularly review and update access controls to ensure they
reflect current user permissions and project requirements.

 Logging Best Practices: Implement log rotation policies to manage log storage and retention
effectively. Consider filtering logs to reduce noise and focus on relevant data.

Section 5: Optimizing service performance (~16% of the exam)


5.1 Identifying service performance issues. Considerations include:
Using Google Cloud's operations suite to identify cloud resource utilization:

Google Cloud offers a comprehensive suite of tools for monitoring and analyzing resource utilization.
This helps you identify potential bottlenecks and optimize your infrastructure costs. Here's a
breakdown of key services:

 Cloud Monitoring: The central hub for collecting, visualizing, and alerting on metrics from
your cloud resources. You can monitor CPU, memory, disk I/O, network traffic, and more for
VMs, Cloud Storage buckets, Cloud SQL instances, and many other services.

 Cloud Logging: Aggregates logs from your applications and infrastructure. Logs provide
valuable insights into application behavior and potential errors. You can filter and analyze
logs to identify issues like slow queries, high error rates, or resource spikes.

 Cloud Billing: Provides detailed reports on your cloud resource usage and costs. You can
identify underutilized resources and optimize your spending by using features like committed
use discounts or preemptible VMs.

Interpreting service mesh telemetry:


Service mesh is an architectural layer that simplifies application communication and provides
observability. When using a service mesh like Istio on Google Cloud, it generates telemetry data that
provides insights into how your microservices interact. Here's what you can glean from telemetry:

 Request latency: Measure how long it takes for requests to travel between services. High
latency can indicate overloaded services, network issues, or slow code execution.

 Error rates: Track the number of failed requests between services. This helps identify service
instability or configuration problems.

 Throughput: Monitor the volume of traffic flowing between services. Analyze spikes or dips
to understand service capacity and potential bottlenecks.
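These three signals can be computed from raw request records. The sketch below uses illustrative field names, not Istio's actual telemetry schema:

```python
# Toy request records, as a service mesh's telemetry might summarize them.
requests = [
    {"latency_ms": 40, "status": 200},
    {"latency_ms": 55, "status": 200},
    {"latency_ms": 900, "status": 503},
    {"latency_ms": 60, "status": 200},
]

def p95_latency(reqs):
    """Nearest-rank 95th percentile of request latency."""
    latencies = sorted(r["latency_ms"] for r in reqs)
    idx = max(0, -(-len(latencies) * 95 // 100) - 1)  # ceil(0.95 * n) - 1
    return latencies[idx]

def error_rate(reqs):
    """Fraction of requests that failed with a 5xx status."""
    errors = sum(1 for r in reqs if r["status"] >= 500)
    return errors / len(reqs)

print(p95_latency(requests), error_rate(requests))
```

Tail percentiles (p95/p99) matter more than averages here: one slow hop of 900 ms dominates the user experience even though the mean stays modest.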

Troubleshooting issues with compute resources:

Compute resources like VMs are the workhorses of your applications. When performance problems
arise, troubleshooting compute resources is crucial. Here are some techniques:

 Monitoring metrics: Use Cloud Monitoring to track CPU, memory, and disk utilization on
your VMs. Identify situations where resources are maxed out, causing performance
degradation.

 Analyzing logs: Review VM logs for errors related to resource exhaustion, application
crashes, or kernel panics. Logs can pinpoint specific issues within the VM.

 Scaling resources: If your VMs are consistently overloaded, consider scaling them up by
increasing CPU, memory, or storage. Conversely, you can scale down underutilized VMs to
save costs.

 Cloud Profiler: This tool helps identify performance bottlenecks within your application code
running on VMs. It can pinpoint CPU-intensive functions or memory leaks.

Troubleshooting Deploy Time and Runtime Issues with Applications

Deployments should be smooth and applications should function flawlessly at runtime. Here's how
to identify issues in these stages:

 Cloud Logging: Google Cloud Logging aggregates logs from various sources like applications,
infrastructure, and platforms. Analyze logs for errors, warnings, or unexpected behavior
during deployments or application runtime. Look for patterns and timestamps to pinpoint
the problem area.

 Error Reporting: This service automatically collects crash reports, exceptions, and application
health data. It helps identify issues that might not be apparent in standard logs, like client-
side errors.

 Cloud Monitoring: Monitor key application metrics like CPU usage, memory consumption,
request latency, and throughput. Deviations from normal behavior can indicate performance
bottlenecks. Define Service Level Indicators (SLIs) and set alerts based on these metrics to
proactively identify issues.

 Application Instrumentation: Instrument your application code to capture detailed
performance data like function execution times, database queries, and external API calls.
Tools like Cloud Profiler can help pinpoint performance bottlenecks within the application
code.

 CI/CD Pipeline Monitoring: Monitor your CI/CD pipeline (e.g., Cloud Build or third-party
tools) for failures during the build, test, or deployment stages. Identify errors or bottlenecks
in the pipeline itself that might be causing deployment delays.
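The SLI-based alerting mentioned for Cloud Monitoring above can be sketched as follows; the 99.9% target is an example, not a recommendation:

```python
def availability_sli(good_requests, total_requests):
    """SLI = proportion of requests that succeeded."""
    return good_requests / total_requests

def should_alert(sli, slo=0.999):
    """Fire an alert when the measured SLI drops below the SLO target.
    The 0.999 default is an illustrative target only."""
    return sli < slo

sli = availability_sli(good_requests=99_800, total_requests=100_000)
print(sli, should_alert(sli))  # 0.998 is below a 99.9% target, so alert
```

In Cloud Monitoring the same idea is expressed as an alerting policy on a ratio of metrics; the point is that the threshold comes from the SLO, not from ad hoc intuition.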

Troubleshooting Network Issues

Network connectivity and performance are vital for any cloud application. Here's how to identify
network-related issues:

 VPC Flow Logs: These logs record all ingress and egress traffic within your Virtual Private
Cloud (VPC). Analyze flow logs to identify unexpected traffic patterns, potential security
threats, or asymmetry in traffic flow (indicating issues with specific instances).

 Firewall Logs: Firewall logs track all traffic that interacts with your Cloud Firewall rules.
Analyze these logs to identify blocked connections, suspicious activity, or rule
misconfigurations that might be impacting legitimate traffic.

 Latency Monitoring: Monitor network latency between different components (instances,
databases, etc.) in your application architecture. High latency can indicate overloaded
network resources, routing issues, or problems with specific network paths.

 Network Details: Use tools like Cloud Monitoring's Network section to visualize network
topology, identify bottlenecks, and monitor resource utilization for Cloud Load Balancing or
Cloud CDN.

Additional Tips:

 Use Google Cloud's operations suite (formerly Stackdriver) for a unified view of
application performance, logs, and infrastructure metrics.

 Leverage infrastructure as code tools like Terraform to ensure consistent configuration and
minimize manual errors that might lead to network issues.

 Consider cost optimization by using preemptible VMs for non-critical workloads. However, be
aware that preemptible VMs might be terminated during high demand periods, potentially
causing application downtime.

5.2 Implementing debugging tools in Google Cloud. Considerations include:


Application Instrumentation

 Concept: The process of embedding code or configurations within your application to gather
detailed information about its behavior during runtime. This information becomes invaluable
for troubleshooting issues and understanding application performance.

 Implementation in Google Cloud: Google Cloud offers various tools and libraries for
instrumenting your applications:

o OpenTelemetry: An open-source framework providing vendor-neutral instrumentation
APIs. Google Cloud provides OpenTelemetry libraries for popular languages like Java,
Python, Go, and Node.js.


o Cloud Profiler (formerly Stackdriver Profiler): A managed profiling service that helps
you identify performance bottlenecks in your applications. It integrates seamlessly with
Cloud Monitoring for centralized data collection and analysis.

o Cloud Debugger: A suite of debugging tools (including snapshots, variable inspection,
and stack traces) that can be used with instrumented applications. Note that Google has
since deprecated Cloud Debugger in favor of the open source Snapshot Debugger.

 Benefits:

o Facilitates pinpointing the root cause of errors and performance problems.

o Enables gathering detailed runtime data about function calls, resource usage, and
application behavior.

o Provides insights into application health and performance metrics for proactive
monitoring and optimization.

Cloud Logging

 Concept: A managed logging service from Google Cloud that centralizes log data from your
applications, infrastructure, and platforms. It offers a scalable and cost-effective solution for
collecting, storing, processing, analyzing, and exporting logs.

 Key Features:

o Structured Logging: Enables logs to be formatted with relevant key-value pairs for
easier searching and filtering.

o Sinks: Logs can be routed to various destinations (Cloud Storage, BigQuery, Cloud
Monitoring, external systems) for further analysis and storage.

o Filters: Powerful filtering capabilities allow you to focus on specific logs based on
severity, timestamp, resource type, and custom labels.

o Log Viewer: A user-friendly interface for exploring and analyzing logs, facilitating
troubleshooting and performance monitoring.

 Benefits in Debugging:

o Provides a central repository for all your application logs, simplifying the process of
finding relevant information.

o Enables filtering and searching logs based on specific criteria to pinpoint issues
quickly.

o Offers insights into application behavior, errors, and performance metrics.
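The structured-logging idea can be illustrated with Python's stdlib logging module. The field names below mirror the spirit of Cloud Logging's jsonPayload entries but are not its exact schema:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit each record as a single JSON object of key-value pairs, so that
    downstream filters can match on fields instead of parsing free text."""
    def format(self, record):
        return json.dumps({
            "severity": record.levelname,
            "message": record.getMessage(),
            "logger": record.name,
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.info("order placed order_id=%s", "A-123")
```

When an application on GCP writes one JSON object per line like this, Cloud Logging ingests the fields as structured payload, which is what makes the severity and label filters described above precise.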

Cloud Trace

 Concept: A distributed tracing service from Google Cloud that helps you visualize and
analyze the flow of requests through a complex application or microservice architecture. It
tracks the latency and path of each request across different services and components.

 Benefits in Debugging:

o Provides a visual representation of the request flow, making it easier to identify
bottlenecks and pinpoint where issues originate.


o Enables correlating logs from different services involved in a request for a more
comprehensive understanding.

o Helps you troubleshoot complex distributed system problems by tracing requests
across multiple components.
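What a trace viewer computes can be sketched from raw span timings. The span records below are illustrative, not the Cloud Trace API schema:

```python
# Toy spans from one request, as a distributed trace might record them.
spans = [
    {"name": "frontend", "start_ms": 0,  "end_ms": 120},
    {"name": "auth",     "start_ms": 5,  "end_ms": 20},
    {"name": "checkout", "start_ms": 25, "end_ms": 115},
    {"name": "db-query", "start_ms": 30, "end_ms": 110},
]

def span_durations(trace):
    """Duration of each span, the basic quantity a trace viewer charts."""
    return {s["name"]: s["end_ms"] - s["start_ms"] for s in trace}

def slowest_child(trace, root="frontend"):
    """The longest span below the root often points at the bottleneck."""
    children = [s for s in trace if s["name"] != root]
    return max(children, key=lambda s: s["end_ms"] - s["start_ms"])["name"]

print(span_durations(spans))
print(slowest_child(spans))
```

Reading durations off the span tree is exactly how a latency investigation starts: the root span tells you the user-visible time, and the longest child tells you where to look next.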

Effective Debugging with These Tools

 Combined Approach: Utilize application instrumentation to collect detailed data, Cloud
Logging to centralize and analyze logs, and Cloud Trace to visualize request flow. This
combined approach provides a holistic view of application behavior for efficient debugging.

 Alerting and Monitoring: Set up alerts based on log entries and trace data to proactively
identify potential issues before they impact users. Integrate logging and tracing information
with Cloud Monitoring for continuous monitoring and performance insights.

 Best Practices: Follow Google Cloud's best practices for logging and tracing:

o Use structured logs with appropriate labels for easy filtering and analysis.

o Include relevant context (timestamps, user IDs, request IDs) in your logs to facilitate
troubleshooting.

o Implement distributed tracing across all services in your application.

o Leverage Cloud Monitoring dashboards and integrations to gain comprehensive
insights.

Error Reporting

 Focuses on identifying and understanding application errors in your Google Cloud
deployments.

 Automatically captures exceptions thrown by your code.

 Provides detailed information like:

o Stack traces to pinpoint the location of the error.

o Occurrence counts to identify the most frequent errors.

o Affected user data to understand user impact.

o Time charts to visualize error trends.

 Integrates with other tools like Cloud Monitoring for comprehensive debugging.

 Offers real-time alerting to notify you of new errors immediately.

Cloud Profiler

 Analyzes the performance of your applications running in production.

 Helps identify bottlenecks and optimize resource usage.

 Offers two profiling modes:


o CPU Profiling: Measures how your application spends CPU time, pinpointing
functions that consume excessive resources.

o Heap Profiling: Analyzes memory allocation patterns, identifying potential memory
leaks.

 Provides detailed call stacks and allocation traces for in-depth analysis.

 Low-overhead, minimizing performance impact during profiling.
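Cloud Profiler itself is a managed service, but the underlying idea of measuring where CPU time goes can be tried locally with Python's stdlib cProfile:

```python
import cProfile
import io
import pstats

def slow_sum(n):
    """Deliberately naive workload so the profiler has something to report."""
    total = 0
    for i in range(n):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
result = slow_sum(100_000)
profiler.disable()

# Render the top functions by cumulative time into a report string.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(result)
print("slow_sum" in report.getvalue())  # the hot function shows up in the report
```

Cloud Profiler does the same kind of attribution continuously in production with sampling, which is why its overhead stays low enough to leave enabled.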

Cloud Monitoring

 Centralized platform for monitoring your entire Google Cloud infrastructure.

 Collects and analyzes various metrics:

o Application metrics like request latency, throughput, and error rates.

o Resource metrics like CPU usage, memory utilization, and disk I/O.

o Network metrics like traffic volume and latency.

 Creates custom dashboards to visualize key metrics and identify trends.

 Defines alerts based on specific thresholds to be notified proactively of potential issues.

 Integrates with other debugging tools like Error Reporting and Cloud Trace for a holistic view.

5.3 Optimizing resource utilization and costs. Considerations include:


Preemptible/Spot Virtual Machines (VMs):

 Preemptible VMs (and their successor, Spot VMs) are virtual machines offered by GCP at
a significantly lower cost than standard on-demand VMs. The catch? Google can reclaim
them at any time with only about 30 seconds' notice to meet its own capacity needs, and
preemptible VMs run for at most 24 hours before being stopped.

 Use Cases: These VMs are ideal for workloads that are:

o Fault-tolerant and can be restarted without significant impact.

o Short-lived tasks like batch processing, data analysis, or CI/CD builds.

 Pros:

o Substantial cost savings compared to sustained use VMs.

o Still offer good performance for many workloads.

 Cons:

o VMs can be interrupted at any time, requiring applications to be designed for such
interruptions.

o Not suitable for critical production workloads.
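A back-of-the-envelope comparison shows why restartable batch work suits preemptible capacity even after accounting for redone work. The hourly prices below are placeholders, not a quote:

```python
# Illustrative hourly prices; real prices vary by machine type and region.
ON_DEMAND_PER_HOUR = 0.10
PREEMPTIBLE_PER_HOUR = 0.025

def batch_cost(work_hours, per_hour, redo_hours=0.0):
    """Cost of a restartable batch job, including work redone after preemptions."""
    return (work_hours + redo_hours) * per_hour

on_demand = batch_cost(100, ON_DEMAND_PER_HOUR)
# Assume preemptions force ~10 hours of the work to be redone.
preemptible = batch_cost(100, PREEMPTIBLE_PER_HOUR, redo_hours=10)
print(round(on_demand, 2), round(preemptible, 2))
```

Even with 10% rework overhead the preemptible run stays far cheaper, which is exactly the fault-tolerant batch profile the bullet points above describe; a latency-sensitive serving workload could not absorb the interruptions at all.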

Committed-use Discounts (CUDs):

 CUDs offer significant discounts on compute resources (VMs, Cloud Storage, etc.) in
exchange for a commitment to a specific level of usage over a set period (typically 1 or 3
years).


 Types of CUDs:

o Flexible (spend-based): You commit to a minimum amount of hourly spend, and the
discount applies across eligible machine families and regions. Provides more flexibility
in resource usage.

o Resource-based: You commit to specific resources (for example, vCPUs and memory of
a given machine family in a given region). Provides the highest discount but the least
flexibility.

 Benefits:

o Significant cost savings compared to on-demand pricing.

o Predictable costs for budgeting purposes.

 Considerations:

o Requires accurate forecasting of resource needs to avoid under or over-commitment.

o Less flexibility compared to on-demand pricing.
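The forecasting trade-off can be sketched numerically. The discount and prices below are example figures, not published rates; a commitment bills continuously, so it only wins above a break-even utilization:

```python
def cud_breakeven_utilization(discount=0.37):
    """A commitment bills 100% of the time at (1 - discount) of the
    on-demand rate, so it pays off once actual utilization exceeds that
    fraction. The 37% discount is an illustrative figure only."""
    return 1.0 - discount

def monthly_cost(hours_used, on_demand_rate=0.10, committed=False,
                 discount=0.37, hours_in_month=730):
    if committed:
        # Commitments are billed for every hour of the term, used or not.
        return hours_in_month * on_demand_rate * (1 - discount)
    return hours_used * on_demand_rate

print(round(cud_breakeven_utilization(), 2))  # 0.63
print(round(monthly_cost(500), 2), round(monthly_cost(500, committed=True), 2))
```

At 500 hours (about 68% utilization) the commitment already beats on-demand; below the break-even point it would be paying for idle capacity, which is the under-commitment risk noted above.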

Sustained-use Discounts:

 GCP offers discounts for resources that are used consistently over a prolonged period. These
discounts reward predictable workloads and can significantly reduce your cloud bill.

 Key characteristics:

o Sustained-use discounts are applied automatically to eligible Compute Engine
resources (such as N1 machine types); no sign-up or commitment is required.

o The discount grows the longer a VM's vCPUs and memory run within the billing
month, up to roughly a 30% discount for a full month of N1 usage.

o They do not stack with committed-use discounts: usage covered by a commitment is
billed at the committed rate instead.
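The tiered mechanics of the automatic sustained-use discount can be sketched as follows; the incremental rates shown are the ones documented for N1 machine types, and other machine families differ:

```python
# Incremental sustained-use rates for N1 machine types: each successive
# quarter of the month bills at a lower fraction of the base rate.
N1_TIER_RATES = [1.00, 0.80, 0.60, 0.40]

def sustained_use_cost(hours_used, base_rate_per_hour, hours_in_month=730):
    """Cost with the tiered discount applied automatically as usage grows."""
    quarter = hours_in_month / 4
    cost, remaining = 0.0, hours_used
    for rate in N1_TIER_RATES:
        in_tier = min(remaining, quarter)
        cost += in_tier * base_rate_per_hour * rate
        remaining -= in_tier
        if remaining <= 0:
            break
    return cost

full_month = sustained_use_cost(730, base_rate_per_hour=1.0)
print(round(full_month, 2), round(full_month / 730, 2))  # effective rate 0.7, i.e. 30% off
```

Averaging the four tier rates gives 70% of the base rate for a full month of use, which is where the commonly quoted "up to 30%" figure comes from.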

Network Tiers:

 GCP offers two network service tiers that cater to varying traffic needs:

o Premium Tier (the default): Routes traffic over Google's global backbone for high
performance and low latency. Ideal for user-facing, latency-sensitive applications such
as real-time workloads or large data transfers. (Comes at a higher cost.)

o Standard Tier: Routes traffic over the public internet at a lower price. A cost-effective
option for workloads that can tolerate somewhat higher latency, such as development
environments or low-traffic applications.

By choosing the appropriate network tier based on your application's traffic patterns, you can
optimize network costs without compromising performance.

Sizing Recommendations:


 GCP provides recommendations for optimal machine type and size based on your
application's resource requirements. Utilizing these recommendations can help you:

o Avoid overprovisioning: Allocate only the resources your application needs,
preventing unnecessary spending.

o Prevent under-provisioning: Ensure sufficient resources to handle peak loads and
maintain application performance.

GCP offers various tools for sizing recommendations, including:

o Cloud Monitoring: Monitors resource utilization metrics like CPU, memory, and disk
usage. You can identify underutilized or overloaded resources and adjust their size
accordingly.

o Cloud Architecture Center: Provides recommendations for machine types and
configurations based on your workload type.

o Firebase Performance Monitoring: Analyzes application performance on mobile and
web platforms, helping you identify resource bottlenecks and optimize instance sizing.
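A greatly simplified version of such a sizing recommendation can be sketched from utilization samples; the thresholds are arbitrary examples, not Google's actual recommender logic:

```python
def rightsizing_recommendation(cpu_utilization_samples, low=0.20, high=0.85):
    """Look at near-peak (p95) CPU utilization and suggest a resize.
    Thresholds are illustrative: sustained low peaks suggest downsizing,
    saturated peaks suggest upsizing."""
    samples = sorted(cpu_utilization_samples)
    p95 = samples[max(0, int(len(samples) * 0.95) - 1)]
    if p95 < low:
        return "downsize"
    if p95 > high:
        return "upsize"
    return "keep"

print(rightsizing_recommendation([0.05, 0.08, 0.10, 0.12]))  # downsize
print(rightsizing_recommendation([0.60, 0.70, 0.92, 0.95]))  # upsize
print(rightsizing_recommendation([0.40, 0.45, 0.50, 0.55]))  # keep
```

Using a high percentile rather than the average avoids shrinking a machine that looks idle on average but saturates at peak load.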

By effectively utilizing sustained-use discounts, choosing the right network tier, and following sizing
recommendations, you can significantly reduce your GCP costs without impacting application
performance.

 For a full set of 700+ questions, go to
https://skillcertpro.com/product/google-professional-cloud-devops-engineer-exam-questions/
 SkillCertPro offers detailed explanations for each question, which helps you
understand the concepts better.
 It is recommended to score above 85% on SkillCertPro exams before attempting
the real exam.
 SkillCertPro updates exam questions every 2 weeks.
 You will get lifetime access and lifetime free updates.
 SkillCertPro assures a 100% pass guarantee on the first attempt.

Disclaimer: All data and information provided on this site is for informational
purposes only. This site makes no representations as to the accuracy, completeness,
correctness, suitability, or validity of any information on this site, and will not be
liable for any errors, omissions, or delays in this information, or for any losses,
injuries, or damages arising from its display or use. All information is provided on
an as-is basis.
