Google Cloud DevOps Engineer Cheat Sheet
Projects: Represent a collection of related Cloud resources, like Compute Engine instances,
Cloud Storage buckets, and Cloud Functions. Each project has a unique identifier and serves
as a unit of isolation for billing, IAM, and quotas.
Folders: Organize projects into a hierarchical structure for better management and access
control. Folders group related projects based on function, department, or environment (e.g.,
development, staging, production).
Shared Networking:
VPC (Virtual Private Cloud): A logically isolated network within Google Cloud Platform (GCP)
that provides a private address space for your resources. You can create multiple VPCs for
different environments or workloads.
Cloud NAT: Enables outbound internet access for resources in private VPC subnets without
exposing them directly to the public internet.
IAM (Identity and Access Management): Governs who (users, service accounts) can access which
resources (projects, buckets, etc.) and what actions they can perform (read, write, delete).
Roles: Predefined sets of permissions that grant specific access levels to resources. (e.g.,
"Owner" has full access, "Editor" can modify resources).
Policies: Organization-wide rules that define IAM permissions at a higher level, inheriting
down to projects and folders.
Service Accounts:
Special Google accounts used by applications or services to access GCP resources without
requiring a human user.
Permissions: Assigned IAM roles to grant service accounts the necessary access to perform
their tasks.
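For illustration, an IAM policy binding that grants a service account read access to Cloud Storage objects and log-writing permission might look like the following (the project and account names are hypothetical placeholders):

```yaml
# IAM policy fragment, in the format returned by `gcloud projects get-iam-policy`.
# The service account and project names below are placeholders.
bindings:
  - role: roles/storage.objectViewer
    members:
      - serviceAccount:app-sa@my-project.iam.gserviceaccount.com
  - role: roles/logging.logWriter
    members:
      - serviceAccount:app-sa@my-project.iam.gserviceaccount.com
```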
pg. 1
SKILLCERTPRO
Project Granularity: Consider the size and complexity of your organization. Create projects
for specific applications, environments, or teams.
Folder Structure: Organize projects logically to reflect your organizational structure and
access control needs.
IAM Strategy: Define clear and granular IAM policies at the organization, folder, and project
levels to minimize access sprawl and improve security.
Service Account Management: Create and manage service accounts with appropriate IAM
roles for application authentication and access control.
Cloud SDK (gcloud CLI): Google's official command-line toolset for interacting with GCP
services. It allows provisioning and managing resources directly through commands. (The
Cloud Foundation Toolkit is a separate offering: opinionated Terraform blueprints for GCP.)
Config Connector: Bridges the gap between GCP resources and IaC tools like Terraform. It
lets you manage GCP resources using familiar IaC syntax within existing workflows.
Terraform: A popular, open-source IaC tool known for its flexibility and wide adoption across
cloud platforms. Terraform uses HashiCorp Configuration Language (HCL) to define
infrastructure resources.
Helm: Primarily used for packaging and managing Kubernetes applications. Helm charts
define deployments, configurations, and dependencies for containerized applications on
Kubernetes clusters.
Selecting the most suitable tool depends on your specific needs and existing workflows. Here's a
brief consideration guide:
Cloud SDK: Ideal for quick deployments or scripting interactions with GCP services.
Config Connector: Integrates well with existing Terraform workflows for managing GCP
resources.
Terraform: A versatile option for managing complex infrastructure across multiple cloud
providers.
Helm: If your application relies on Kubernetes, Helm is the go-to tool for managing its
deployment and configuration.
Leverage IaC Blueprints: Define infrastructure configurations using IaC tools. These
blueprints act as version-controlled templates, ensuring consistency and repeatability during
deployments.
Version Control: Store your IaC code in a version control system (VCS) like Git. This allows
tracking changes, collaboration, and rollbacks if necessary.
Testing and CI/CD Integration: Implement testing procedures to validate your IaC code
before deployments. Integrate your IaC workflows with CI/CD pipelines for automated
infrastructure provisioning and updates.
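As a minimal sketch of such validation, a Cloud Build configuration can run Terraform checks on every commit before anything is applied (the Terraform image tag is illustrative, and the code is assumed to live in the repository root):

```yaml
# cloudbuild.yaml: validate Terraform code on each commit before it is applied.
steps:
  - name: 'hashicorp/terraform:1.7'
    args: ['init', '-backend=false']   # initialize providers without touching remote state
  - name: 'hashicorp/terraform:1.7'
    args: ['validate']                 # fail the build on invalid configuration
```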
Immutable Infrastructure
Immutable infrastructure is a core principle for managing infrastructure as code. It revolves around
treating infrastructure resources as immutable objects: instead of modifying running resources in
place, you provision fresh infrastructure from an updated configuration and replace the old version.
Here's how it plays out:
Cutover: Once the new infrastructure is verified, traffic is switched over to it, and the old
infrastructure is decommissioned.
Improved Reliability: Rollbacks become easier as you can simply switch back to the previous
version.
Simplified Testing: Testing new infrastructure versions is easier when deploying them
alongside existing ones.
Increased Resource Consumption: You might temporarily have both old and new
infrastructure running during cutover.
1.3 Designing a CI/CD architecture stack in Google Cloud, hybrid, and multi-cloud
environments. Considerations include:
CI with Cloud Build
What is it? Cloud Build is a Google-managed CI/CD service. It triggers builds based on code
changes in repositories like Cloud Source Repositories, GitHub, or Bitbucket. Builds can be
customized with various steps like building container images, running tests, and deploying
artifacts.
Benefits:
o Customizable: Define workflows with scripts and commands for specific needs.
Security Considerations:
o Service Account Permissions: Grant Cloud Build service accounts least privilege
access to repositories and resources.
o Build Step Security: Secure build steps by using trusted container images and
avoiding hardcoded secrets.
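A minimal Cloud Build configuration illustrating the build-and-push flow might look like this (the Artifact Registry repository and region names are placeholders):

```yaml
# cloudbuild.yaml: build a container image and push it to Artifact Registry.
# The `images` field tells Cloud Build to push the tagged image after the steps succeed.
steps:
  - name: 'gcr.io/cloud-builders/docker'
    args: ['build', '-t', 'us-central1-docker.pkg.dev/$PROJECT_ID/my-repo/app:$SHORT_SHA', '.']
images:
  - 'us-central1-docker.pkg.dev/$PROJECT_ID/my-repo/app:$SHORT_SHA'
```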
What is it? Google Cloud Deploy is a managed service for continuous delivery. It automates
deployments based on build artifacts created by Cloud Build. Cloud Deploy offers features
like:
o Rollback capabilities
Benefits:
Security Considerations:
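A minimal Cloud Deploy pipeline definition, for illustration, promotes releases through a fixed sequence of targets (the pipeline and target names are placeholders; each target is defined in its own resource):

```yaml
# Cloud Deploy delivery pipeline: releases are promoted from staging to production.
apiVersion: deploy.cloud.google.com/v1
kind: DeliveryPipeline
metadata:
  name: app-pipeline
serialPipeline:
  stages:
    - targetId: staging
    - targetId: production
```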
While Google Cloud offers excellent CI/CD tools, some scenarios might require additional options:
Jenkins: A popular open-source CI/CD server offering extensive plugin support for various
tasks and integrations.
Git: The de facto standard version control system (VCS) for code management. It integrates
seamlessly with CI/CD pipelines, triggering builds on code changes.
ArgoCD: An open-source GitOps tool that continuously applies infrastructure and application
configurations declared as Git repositories. Useful for managing deployments in multi-cluster
environments.
Packer: An open-source tool for creating identical machine images for various cloud
providers. Helpful for building consistent base images for deployments across hybrid and
multi-cloud environments.
Regular Updates: Keep third-party tools updated with the latest security patches.
Access Control: Implement access controls to restrict who can manage configurations and
deployments.
This is crucial for establishing a smooth development workflow with clear separation of concerns.
Here's a breakdown:
Typical Environments:
o Development (Dev): Used by developers for building and testing new features. Code
changes are frequently deployed here.
o Production (Prod): The live environment where your application serves real users.
o Canary: A small subset of production users receives updates here first, allowing for
early detection of issues.
Factors to Consider:
o Team Size and Workflow: Tailor the environments to support efficient collaboration
within your development team.
Infrastructure as Code (IaC): Tools like Terraform allow you to define infrastructure
configurations (networks, VMs, GKE clusters) as code. This code can be version controlled
and used to provision environments automatically.
Google Kubernetes Engine (GKE): A managed Kubernetes service for deploying and
managing containerized applications. Terraform can be used to create GKE clusters for each
environment.
Benefits:
Config Management
This refers to the practice of managing the configuration of your application and infrastructure in a
centralized and automated way. Here's how it applies:
Tools: Solutions like Anthos Config Management or Cloud Build can be used to define and
deploy configurations to different environments. These configurations can include
application settings, environment variables, secrets, and more.
Version Control: Configuration files are version controlled alongside your application code,
ensuring consistency across environments.
Benefits:
Section 2: Building and implementing CI/CD pipelines for a service (~23% of the exam)
2.1 Designing and managing CI/CD pipelines. Considerations include:
Artifact Management with Artifact Registry:
What it is: Artifact Registry is a managed service on Google Cloud Platform (GCP) for storing
and managing container images, build artifacts, and other files used in your software
development process.
Benefits in CI/CD:
o Centralized repository: Store all your build artifacts (e.g., Docker images, compiled
code) in one place, simplifying access and management for your CI/CD pipeline.
o Integration with CI/CD tools: Artifact Registry integrates seamlessly with Cloud
Build, allowing automatic pushing and pulling of artifacts during the build and
deployment process.
The Scenario: Your application might run across different environments, like on-premises
infrastructure, public clouds (GCP, AWS, Azure), or a combination (hybrid/multi-cloud).
o Cloud Build triggers: Configure triggers in Cloud Build that initiate the pipeline based
on events in different environments (e.g., Git push on a specific branch for on-prem
deployments, Pub/Sub message for cloud deployments).
o Service Meshes (like Istio): Use service meshes to manage service discovery, routing,
and communication between your application components across different cloud
environments.
What they are: Triggers are events that initiate your CI/CD pipeline. They automate the
pipeline execution, reducing manual intervention and ensuring timely deployments.
o Git events: Triggers based on Git actions in your source code repository (e.g., push to
a specific branch, merge to main branch).
o Cloud Storage events: Triggers based on changes in Cloud Storage buckets (e.g., new
file upload signifying a new build artifact).
o Schedule triggers: Run the pipeline periodically at a specific time or interval (e.g.,
nightly builds for continuous integration).
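As a sketch, a Git-event trigger can be expressed as a Cloud Build trigger resource (this is the format accepted by `gcloud builds triggers import`; the repository owner and name are placeholders):

```yaml
# Cloud Build trigger: start the pipeline on every push to the main branch.
name: build-on-main
github:
  owner: my-org
  name: my-repo
  push:
    branch: '^main$'
filename: cloudbuild.yaml   # build config to run, relative to the repo root
```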
Integration Tests: In your CI pipeline, you can leverage Cloud Build to trigger automated
integration tests whenever there's a code commit. These tests ensure different parts of your
application work together seamlessly. Tools like JUnit or Google Test can be used within
Cloud Build for unit and integration testing.
Deployment Testing: Before promoting a new version to production, consider using Cloud
Run for deployment testing. Cloud Run allows deploying containerized applications on a fully
managed serverless platform. You can deploy your new version to a staging environment on
Cloud Run and perform functional and performance tests there.
Load Testing: Utilize load-testing tools (for example, open-source tools like JMeter or Locust,
or a distributed load-testing setup on GKE) to simulate real-world traffic and assess
your application's performance under stress. This helps identify potential bottlenecks before
pushing the new version live.
Approval Workflows: Integrate Cloud Build with Cloud Functions to create custom approval
workflows. A Cloud Function can be triggered upon a successful build, sending notifications
or requiring manual approval before deploying to a specific environment.
Cloud Build Triggers: Configure Cloud Build triggers to automatically build and deploy your
serverless code upon changes. For example, triggers can be set up for code pushes to your
Git repository hosted on Cloud Source Repositories.
Cloud Functions: Cloud Functions are a perfect fit for serverless deployments. Cloud Build
can directly deploy your function code upon a successful build.
Cloud Run vs. Cloud Functions: While both are serverless execution environments, Cloud
Run offers a container-based approach, allowing for more complex deployments requiring
frameworks or specific runtimes. Choose the environment that best suits your application's
needs.
Testing Considerations: Testing serverless applications within the CI/CD pipeline requires
some adjustments. Mockito or Sinon.JS can be used for mocking dependencies during unit
testing. Integration testing can be done by deploying the entire serverless application to a
staging environment and testing its functionality there.
o Cloud Build Logs: These logs provide detailed information about each build step's
execution, success/failure status, and any errors encountered. This allows you to
diagnose issues and track changes made during the build process.
o Artifact Registry: When using Artifact Registry to store build artifacts (container
images, packages, etc.), it automatically logs versioning information and timestamps.
This helps you identify which specific artifacts were deployed in a particular version.
o Cloud Deploy Logs: Cloud Deploy logs track the deployment process, including the
deployed version, target environment, and any encountered issues. This provides a
clear audit trail for deployments.
o Cloud Audit Logs: GCP offers centralized Cloud Audit Logging, which captures a
comprehensive record of all API calls made to GCP services. This includes calls
triggered during your CI/CD pipeline, allowing you to track who made changes, what
was changed, and when.
Best Practices:
o Implement a consistent logging strategy across all pipeline stages (Cloud Build, Cloud
Deploy, etc.) to ensure comprehensive auditing.
o Leverage Cloud IAM and roles to control access to different pipeline stages and
resources, ensuring only authorized users can make deployments.
o Define clear naming conventions for artifacts and versions for easy identification in
logs and registries.
Deployment Strategies
Canary Deployments:
o You can monitor the canary deployment's performance and stability before gradually
rolling it out to all users.
Blue/Green Deployments:
o Once testing is complete and the green environment is deemed stable, traffic is
switched over from blue to green, effectively replacing the old version with the new
one.
Rolling Deployments:
o Well-suited for applications that can tolerate short periods of inconsistency during
the rollout.
Traffic Splitting:
o Route a specific percentage of traffic to the new version of your application while the
remaining traffic continues using the existing version.
o Provides a way to monitor user behavior and application performance with both
versions before fully committing.
o Consider factors like downtime tolerance, rollback mechanisms, and the need for
A/B testing when making your decision.
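For illustration, a percentage-based split on Cloud Run can be declared in the service's Knative-style YAML (a fragment; the service and revision names are placeholders):

```yaml
# Cloud Run service fragment: 90% of requests stay on the current revision
# while 10% canary onto the new one.
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: my-app
spec:
  traffic:
    - revisionName: my-app-00001
      percent: 90
    - revisionName: my-app-00002
      percent: 10
```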
Types of Rollbacks:
o Canary Deployment: A small subset of users (canary) are exposed to the new
deployment first. If successful, the rollout continues to a larger group. This allows for
early detection of problems.
o Rollback to Previous Version: This is a simpler approach where you revert the
deployment configuration to the previous working version.
o Testing Rollbacks: Test your rollback procedures regularly to ensure they function
smoothly.
o Cloud Spanner: Offers point-in-time recovery and backups that allow reverting database changes.
o Cloud Build Logs & Artifact Registry: Provide historical logs and artifacts for
identifying the rollback point.
Troubleshooting deployment issues is an inevitable part of CI/CD pipelines. Here are some key points
to consider:
o Utilize Cloud Monitoring and Cloud Logging to gather detailed logs about the
deployment process.
o Analyze logs to pinpoint the stage in the pipeline where the issue occurred.
o Leverage debugging tools and techniques to identify the specific code or configuration
causing the problem.
o Maintain clear version control of your codebase to easily identify changes that might
have caused the issue.
Cloud Key Management Service (KMS): Google Cloud's KMS provides a centralized location
to manage encryption keys. You can create, rotate, and control access to these keys used for
data encryption at rest and in transit.
Secret Manager: This service securely stores and manages API keys, passwords, certificates,
and other sensitive information. Secret Manager integrates with CI/CD pipelines to inject
secrets securely into your applications at runtime.
Benefits:
Centralized Management: Both KMS and Secret Manager offer a single point of control for
all your encryption keys and secrets, simplifying administration.
Enhanced Security: They eliminate the need to store secrets in plain text within your code or
configuration files, reducing the risk of exposure.
Granular Access Control: You can define fine-grained access permissions for who can access
and use these secrets, ensuring only authorized services can utilize them.
Secret Management:
Secret Rotation: Regularly rotate your secrets to minimize the window of vulnerability if
compromised. Both KMS and Secret Manager allow automated rotation of secrets.
Least Privilege: Grant only the minimum level of access required for each service or user to
access secrets. This reduces the impact if a credential is compromised.
Auditing and Logging: Enable audit logging to track all access attempts and modifications
made to secrets. This helps identify suspicious activity and potential breaches.
Build-Time Injection: Secrets are injected into the application image during the build
process. This approach simplifies deployment but can be risky if the image leaks.
Runtime Injection: Secrets are injected into the application at runtime using environment
variables or credential providers. This offers better security as secrets don't persist within the
application image.
The best approach depends on the type of secret and your security requirements. Build-time
injection is suitable for less sensitive configuration (e.g., connection endpoints), while runtime
injection is preferred for highly sensitive credentials like API keys.
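As a sketch of secret injection in a pipeline, Cloud Build can pull a secret from Secret Manager and expose it to a single step as an environment variable, so it is never baked into the image (the secret name and script are hypothetical):

```yaml
# cloudbuild.yaml fragment: fetch a Secret Manager secret and expose it
# only to the step that needs it, via the API_KEY environment variable.
steps:
  - name: 'gcr.io/cloud-builders/gcloud'
    entrypoint: 'bash'
    args: ['-c', './deploy.sh']   # hypothetical script that reads $API_KEY
    secretEnv: ['API_KEY']
availableSecrets:
  secretManager:
    - versionName: projects/$PROJECT_ID/secrets/api-key/versions/latest
      env: 'API_KEY'
```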
Additional Tips:
Use environment variables for runtime secret injection to avoid hardcoding them in your
application code.
Consider leveraging Secret Manager's native integrations for exposing secrets as environment
variables or mounted volumes in workloads running on Google Cloud (e.g., Cloud Run and Cloud
Functions).
Regularly review and update your CI/CD pipeline configurations to ensure they adhere to
best security practices.
What it is: Artifact Registry is a GCP service that stores container images and other build
artifacts. It has built-in vulnerability scanning capabilities that automatically identify security
weaknesses in your container images during the build process.
Benefits:
Binary Authorization
What it is: Binary Authorization is a GCP service that enforces policy-based control over
deployments. It allows you to define rules that dictate which container images can be
deployed to specific environments.
Benefits:
o Enforces security policies: Ensures only authorized and approved images are
deployed, preventing unauthorized code from reaching production.
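For illustration, a Binary Authorization policy can require an attestation before any image is admitted (this mirrors the format used by `gcloud container binauthz policy export`; the attestor name is a placeholder):

```yaml
# Binary Authorization policy: only images attested by the named attestor
# may be deployed; everything else is blocked and logged.
globalPolicyEvaluationMode: ENABLE
defaultAdmissionRule:
  evaluationMode: REQUIRE_ATTESTATION
  enforcementMode: ENFORCED_BLOCK_AND_AUDIT_LOG
  requireAttestationsBy:
    - projects/my-project/attestors/build-attestor
```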
What it is: Identity and Access Management (IAM) is a core GCP service for controlling
access to resources. By defining IAM policies per environment (development, staging,
production), you can restrict access to specific users or groups for each stage of the pipeline.
Benefits:
o Principle of Least Privilege: Provides granular control over access, ensuring users only
have the permissions necessary for their role in the pipeline.
o Improved accountability: Provides clear audit trails for tracking who has access to
what resources within the pipeline.
By combining these features, you can create a robust security posture for your CI/CD pipeline. Here's
a possible workflow:
1. A developer commits code changes to the source repository.
2. A Cloud Build trigger starts the CI/CD pipeline.
3. The pipeline builds the container image and scans it for vulnerabilities using Artifact Registry.
4. If vulnerabilities are found, the build fails, and developers are notified.
5. Once the build succeeds, Binary Authorization verifies if the image is authorized for
deployment to the target environment.
6. If authorized, Cloud Build deploys the image to the appropriate environment (dev, staging,
production) with access restricted by IAM policies.
SLIs (Service Level Indicators): These are measurable metrics that reflect a service's performance
from the user's perspective. They act as the foundation for monitoring and evaluating your
service's health.
o Availability: Percentage of time your service is operational (e.g., uptime). You can
use Cloud Monitoring to track this metric.
o Latency: Response time experienced by users when interacting with your service.
Defining SLOs (Service Level Objectives) and Understanding SLAs (Service Level Agreements):
SLOs: These are targets you set for your SLIs. They define the acceptable level of
performance for your service. For instance, an SLO for availability might be 99.9% uptime.
SLAs: These are formal agreements between you (service provider) and your users
(customers) that outline the expected level of service. They often translate SLOs into
business terms with potential repercussions for not meeting them.
Error Budgets:
This concept treats reliability as a spendable resource. You define an acceptable error rate
(budget) for your service based on your SLOs and risk tolerance. For example, a 99.9%
availability SLO leaves an error budget of 0.1%, roughly 43 minutes of downtime per 30-day
month.
Each error "spends" from your budget. Proactive monitoring and mitigation strategies help
ensure you don't exceed your budget and compromise service quality.
Toil Automation:
DevOps engineers often get bogged down in repetitive, manual tasks (toil). Automating these
tasks using tools like Cloud Functions or Cloud Workflows frees up your time for more
strategic initiatives.
By automating toil, you streamline processes, improve efficiency, and reduce the risk of
human error.
The more "nines" you strive for in your availability SLO (e.g., 99.999% uptime), the more
resources it requires. There's a trade-off between achieving high reliability and the
associated cost (infrastructure, development effort).
You need to assess the impact of downtime on your business and users to determine the
optimal level of reliability to target.
This encompasses the various stages a service goes through, ensuring a smooth and efficient
journey:
Pre-launch Checklist: Before launch, the engineer verifies that the necessary
configurations are in place. This could include setting up security best practices, configuring
monitoring tools, and ensuring proper logging mechanisms.
Launch Plan & Deployment Plan: Developing a clear launch plan ensures a coordinated
rollout of the new service. This plan outlines tasks, dependencies, and communication
strategies for a successful launch. A deployment plan specifies the technical steps for
deploying the service to GCP. This includes choosing the appropriate deployment method
(e.g., blue-green deployment, canary deployments) and automating the process using tools
like Cloud Build and Cloud Deploy.
Deployment & Monitoring: The DevOps Engineer oversees the actual deployment of the
service to GCP. This involves monitoring the deployment process for any errors and ensuring
a smooth transition. Following deployment, ongoing monitoring is crucial. The engineer
utilizes tools like Cloud Monitoring to track the health and performance of the service,
identifying and resolving any issues promptly.
Retirement: When a service reaches its end-of-life, a well-defined retirement plan helps
minimize disruption. This includes migrating data to alternative services, notifying users, and
gracefully shutting down the old service.
Efficient use of GCP resources is vital. A DevOps Engineer plays a key role in managing quotas and
limits:
Quotas: GCP enforces quotas on resource usage to prevent exceeding costs or impacting
other users. The engineer understands these quotas and sets them appropriately for each
service, ensuring smooth operation while staying within budget. Tools like Cloud Billing can
help monitor resource usage and identify potential quota issues.
Limits: GCP also has inherent limits on specific resources like CPU or memory. The engineer
considers these limits during service design and scales the service infrastructure effectively
to meet anticipated demand. This may involve using autoscaling features to automatically
adjust resource allocation based on real-time needs.
Autoscaling using managed instance groups, Cloud Run, Cloud Functions, or GKE
Autoscaling is a critical technique for ensuring your service can handle fluctuating demand without
compromising performance or incurring unnecessary costs. Google Cloud offers various options for
autoscaling depending on your application type:
Managed Instance Groups (MIGs): MIGs create identical virtual machines (VMs) from an
instance template and can be automatically scaled up or down based on predefined metrics
like CPU utilization or network traffic. This is ideal for VM-based workloads; stateful MIGs are
also available for applications that require persistent storage.
Cloud Run: This serverless platform automatically scales your containerized applications
based on incoming requests. You simply deploy your container image, and Cloud Run
manages the underlying infrastructure. This is perfect for stateless, web-based applications.
Cloud Functions: Similar to Cloud Run, Cloud Functions are another serverless offering for
deploying event-driven, short-lived functions. They automatically scale based on the number
of incoming events, making them ideal for microservices and background tasks.
Google Kubernetes Engine (GKE): GKE, a managed Kubernetes service, allows you to
leverage horizontal pod autoscaler (HPA) for automatic scaling of containerized applications
within your Kubernetes cluster. HPA scales pods based on CPU or memory usage.
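A minimal HPA manifest illustrating this (the deployment name and thresholds are placeholders):

```yaml
# HorizontalPodAutoscaler: keep average CPU utilization near 70% by scaling
# the target deployment between 2 and 10 replicas.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```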
Feedback loops are essential for continuous improvement of your services. They allow you to gather
data on performance, identify areas for optimization, and iterate on your deployments. Here's how
to implement feedback loops in Google Cloud:
Monitoring and Logging: Utilize services like Cloud Monitoring and Cloud Logging to collect
data on service health, performance metrics, and user interactions. This data provides
insights into potential bottlenecks or areas for improvement.
Alerting: Set up alerts based on predefined thresholds in your monitoring data. This allows
you to proactively address issues before they significantly impact users.
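As a sketch of such an alert, a Cloud Monitoring alerting policy can be defined in YAML (this follows the AlertPolicy resource schema used by `gcloud monitoring policies`; the metric filter and threshold are illustrative assumptions, not recommendations):

```yaml
# Alerting policy: fire when p99 request latency on a Cloud Run service
# stays above 1000 ms for 5 minutes.
displayName: High p99 latency
combiner: OR
conditions:
  - displayName: p99 latency above 1s
    conditionThreshold:
      filter: 'resource.type = "cloud_run_revision" AND metric.type = "run.googleapis.com/request_latencies"'
      aggregations:
        - alignmentPeriod: 300s
          perSeriesAligner: ALIGN_PERCENTILE_99
      comparison: COMPARISON_GT
      thresholdValue: 1000
      duration: 300s
```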
CI/CD Pipeline Integration: Integrate feedback loops into your CI/CD pipeline. Analyze
metrics and logs during deployments to identify regressions or performance degradations.
This enables you to rollback problematic deployments or trigger corrective actions
automatically.
A/B Testing: Implement A/B testing to compare different versions of your service and
identify features or configurations that improve performance or user experience. Cloud Run
or Firebase can be helpful tools for A/B testing.
User Feedback: Actively collect user feedback through surveys, in-app ratings, or customer
support channels. Analyze this feedback to understand user needs and identify areas for
service improvement.
Burnout is a serious threat to any team's productivity and morale. As a DevOps Engineer, you can
take steps to mitigate it:
Implement Automation Processes: Repetitive tasks are prime candidates for automation
using tools like Google Cloud Functions or Cloud Workflows. This frees up your team's time
and mental energy for more strategic initiatives.
Encourage Breaks and Time Off: Constant work cycles can lead to burnout. Promote a
culture of taking breaks throughout the day and scheduling time off for vacations and mental
health.
Promote Work-Life Balance: A healthy work-life balance is essential. Set clear expectations
and avoid creating an environment where people feel pressured to work long hours all the
time.
A positive team environment that values learning and collaboration is key to success:
Encourage Open Communication and Knowledge Sharing: Create a space where team
members feel comfortable asking questions, sharing knowledge, and learning from each
other's experiences.
Focus on Learning from Mistakes: Shift the focus from assigning blame to identifying the
root cause of problems and using them as opportunities to learn and improve processes.
Breaking down team silos and fostering a sense of shared responsibility is essential for effective
DevOps:
Shared Responsibility for Services: Assign ownership of services across teams instead of
within a single team. This promotes collaboration and ensures no single team becomes a
bottleneck.
Clear Ownership Models: While promoting shared responsibility, it's also important to have
clear ownership models to avoid confusion and duplication of effort. Define roles and
responsibilities for each service to ensure everyone is accountable for their part.
Provide updates on the incident status, root cause, and resolution timeline: Transparency is
key. Users appreciate timely updates that explain the current situation, what's causing the
problem, and the estimated timeframe for a resolution.
Be transparent and honest about the situation: Even if the news isn't ideal, honesty builds
trust with users. Explain the situation clearly and avoid sugarcoating the problem.
Draining/redirecting traffic:
Divert traffic away from the affected service: When possible, minimize the impact on users
by temporarily routing traffic away from the malfunctioning service. This can be achieved
using load balancers or traffic management tools like Traffic Director. By diverting
traffic, you can prevent further overloading and allow for a faster recovery process.
Adding capacity:
Scale up resources: In certain scenarios, adding resources like servers or databases can
mitigate the impact of an incident. This approach is particularly effective for incidents caused
by resource exhaustion, where the system is overloaded.
Consideration before adding capacity: Scaling up resources might take time and may not
always be the most suitable solution. Evaluate the situation to determine if it's the most
efficient course of action.
Identify the Chain of Events: Chronologically detail what happened, when it happened, and
the sequence of events that led to the incident. Utilize monitoring data, logs, and timelines
to reconstruct the incident.
Focus on "Why" over "Who": The goal is to understand the underlying causes, not assign
blame. Analyze code changes, configuration errors, infrastructure issues, or external
dependencies that might have triggered the problem.
Gather Evidence: Include screenshots, error messages, and relevant data points to support
your analysis.
Define Solutions: Based on the identified root causes, propose actions to prevent similar
incidents in the future.
Prioritize Actions: Evaluate the impact and likelihood of recurrence for each root cause.
Focus on high-impact, likely-to-recur issues first.
Assign Ownership: Clearly assign ownership for each action item to ensure accountability
and timely completion.
Set Deadlines: Establish clear deadlines for implementing the action items to maintain
momentum and prevent issues from lingering.
Target the Audience: Tailor the level of technical detail based on the audience. A technical
report for engineers might differ from a high-level summary for business stakeholders.
Focus on Impact and Resolution: Clearly communicate the impact of the incident on users or
services, and the steps taken to resolve it.
Transparency and Learnings: Emphasize the learnings gained from the incident to build trust
and demonstrate a commitment to improvement.
Choose the Right Format: Consider using written reports, presentations, or even blameless
postmortem discussions (https://www.youtube.com/watch?v=C_nywn1aR44) to share the
information effectively.
o Structured logs: Machine-readable logs with a defined format (e.g., JSON, CSV).
Easier to analyze and filter.
o Unstructured logs: Human-readable text logs often containing free-form text and
varying formats. Require additional processing for analysis.
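For illustration, a structured log entry in Cloud Logging looks roughly like this (shown in YAML; entries are stored as JSON, and the fields under jsonPayload are application-defined examples):

```yaml
# A structured Cloud Logging entry: fixed fields like severity and timestamp,
# plus an application-defined payload that can be queried and filtered.
severity: ERROR
timestamp: '2024-05-01T12:00:00Z'
resource:
  type: cloud_run_revision
jsonPayload:
  message: 'payment failed'
  orderId: '12345'
  retryable: true
```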
Sources: Cloud Logging can collect logs from various Google Cloud services:
o Compute Engine: Logs from virtual machines (VMs) running on Compute Engine.
o Serverless Platforms: Logs from Google Cloud Functions and other serverless
services.
Cloud Logging Agent: This agent is installed on VMs and GKE nodes to automatically collect
and send logs to Cloud Logging. You can configure the agent to:
o Set the severity level of logs (e.g., debug, info, warn, error).
The agent configuration file (for example, /etc/google-cloud-ops-agent/config.yaml for the
Ops Agent) specifies which logs to collect and how to send them.
o Resource labels: Attach labels to logs for easier identification and filtering (e.g.,
environment, application name).
o Destinations: Specify where to send logs (Cloud Logging, Cloud Storage, etc.).
Cloud Logging can also collect logs from on-premises environments or other cloud providers. Here
are common methods:
Fluentd: An open-source log aggregator that can forward logs from various sources to Cloud
Logging.
Syslog forwarding: Configure on-premises systems to send logs to a central syslog server that
then forwards them to Cloud Logging.
APIs and SDKs: Use Cloud Logging APIs or SDKs in custom applications to directly send logs
from any environment.
Cloud Logging API: The programmatic entry point to Cloud Logging, Google Cloud's managed
service for collecting, storing, processing, and analyzing logs. It offers a scalable and
centralized way to handle logs from your applications and infrastructure.
Benefits:
o Centralized View: All logs are stored in one place, making it easier to search, analyze,
and troubleshoot issues.
o Scalability: Cloud Logging can handle massive volumes of logs without impacting
performance.
o Integration: Integrates with other Google Cloud services like Cloud Monitoring
(formerly Stackdriver Monitoring) for comprehensive monitoring.
Log Levels:
Log levels define the severity and verbosity of logged information. Common levels include:
o Fatal: Critical errors that have caused the application to crash (least verbose)
o Select log levels based on the desired level of detail and potential impact on log
volume and storage costs.
o Generally, enable verbose levels (debug) during development and testing, and limit
production to less verbose levels (info, warn, error).
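As a sketch of this practice, the environment can drive the logger's threshold (Python stdlib logging; the logger names are illustrative):

```python
import logging

def make_logger(environment: str) -> logging.Logger:
    """Verbose DEBUG logging in development; quieter WARNING+ in production."""
    logger = logging.getLogger(f"app.{environment}")
    logger.setLevel(logging.DEBUG if environment == "development" else logging.WARNING)
    return logger

dev = make_logger("development")
prod = make_logger("production")

# DEBUG records are emitted in dev but filtered out in prod:
dev_emits_debug = dev.isEnabledFor(logging.DEBUG)    # True
prod_emits_debug = prod.isEnabledFor(logging.DEBUG)  # False
```

The same idea applies to the Cloud Logging agent: filtering out low-severity entries at the source keeps log volume (and storage cost) down.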
Optimizing Logs:
Optimizing logs ensures you capture essential information without generating excessive data:
o Multiline Logging: Combine related multi-line output (such as stack traces) into a
single entry so each event is stored and searched as one record.
o Exceptions: Log only relevant information from exceptions, not the entire stack
trace.
o Log Size: Minimize the size of each log message by avoiding unnecessary data.
o Cost: Consider the volume and retention period of logs to manage storage costs.
Cloud Logging offers tiered pricing based on log volume.
Additional Tips:
Structured Logs: Use structured logging formats (e.g., JSON) for easier parsing and analysis
by tools.
Log Rotation: Set up log rotation policies to automatically archive and manage older logs.
Monitoring Logs: Integrate Cloud Logging with Cloud Monitoring to create alerts based
on specific log patterns.
Application Metrics: These are measurements that reflect the health and performance of
your application code. Examples include request latency, API call success rates, memory
usage, and thread pool saturation.
o Instrumentation: You can instrument your application code to emit these metrics
using libraries or frameworks provided by your programming language or platform
(e.g., SDKs for Java, Python, etc.).
o APIs: You can directly send metrics to Cloud Monitoring via its API.
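The instrumentation idea can be sketched with a timing decorator; the in-memory `recorded_latencies` sink is a hypothetical stand-in for a real metrics client:

```python
import time
from functools import wraps

# Stand-in metrics backend; in practice you would push these values to
# Cloud Monitoring via a client library (hypothetical sink for illustration).
recorded_latencies: dict[str, list[float]] = {}

def instrument(metric_name: str):
    """Record wall-clock latency of each call under `metric_name`."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                elapsed = time.perf_counter() - start
                recorded_latencies.setdefault(metric_name, []).append(elapsed)
        return wrapper
    return decorator

@instrument("checkout/latency")
def handle_checkout(order_id: int) -> str:
    return f"order {order_id} processed"

result = handle_checkout(7)
```

Wrapping handlers this way yields per-request latency samples without touching the business logic itself.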
Platform Metrics: These are metrics provided by Google Cloud for its services and
infrastructure. They offer insights into resource utilization, network performance, and overall
platform health.
o Examples: CPU usage, disk I/O, network throughput, and API quota usage for Google
Cloud services.
Networking Metrics: These provide insights into the health and performance of your
network traffic.
VPC Flow Logs: These detailed logs capture network traffic information for
analysis with Cloud Monitoring.
Service Mesh Metrics: If you use a service mesh like Istio, Cloud Monitoring can collect
metrics related to service calls, latency, error rates, and overall mesh health.
Metrics Explorer: This is a web-based tool in Cloud Monitoring that allows you to:
o Filter and Aggregate Data: Narrow down and group metrics for specific resources,
time ranges, or labels.
o Perform Calculations: Create custom metrics by deriving new metrics from existing
ones using mathematical functions.
Custom Metrics: These are metrics you define based on patterns or trends found in your
application logs.
o Use Case: For example, you might create a custom metric for the rate of failed login
attempts from your logs.
Process: Cloud Monitoring allows you to define filters that extract specific log data points
and convert them into custom metrics.
o Benefits: Custom metrics offer insights beyond standard application metrics and
provide a more holistic view of your system's behavior.
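A minimal sketch of the idea behind a log-based custom metric: counting entries that match a filter pattern (the log lines and pattern are illustrative):

```python
import re

# Hypothetical raw log lines; the pattern below plays the role of a
# log-based metric filter in Cloud Logging.
log_lines = [
    "2024-05-01T10:00:01Z INFO login succeeded user=alice",
    "2024-05-01T10:00:03Z WARNING login failed user=bob",
    "2024-05-01T10:00:07Z WARNING login failed user=bob",
    "2024-05-01T10:00:09Z INFO login succeeded user=carol",
]

FAILED_LOGIN = re.compile(r"login failed user=(\w+)")

def failed_login_count(lines):
    """Count entries matching the filter -- the value of the custom metric."""
    return sum(1 for line in lines if FAILED_LOGIN.search(line))

metric_value = failed_login_count(log_lines)  # 2
```

Cloud Logging applies the same pattern-matching idea server-side: the filter selects entries, and the count (or an extracted value) becomes a time series you can chart and alert on.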
A monitoring dashboard is a visual representation of key metrics and alerts related to your Google
Cloud resources. It helps you stay informed about the health and performance of your applications
and infrastructure at a glance. Here's how to create one:
1. Navigate to Cloud Monitoring: Go to the Google Cloud Console and select "Cloud
Monitoring" from the navigation menu.
2. Create a Dashboard: Open the "Dashboards" page and click "Create Dashboard" to start a
new, empty dashboard.
3. Add Charts: Select the metrics you want to display from the available options. You can
choose from various chart types (line, gauge, pie chart, etc.) to visualize the data effectively.
4. Customize Appearance: Edit titles, labels, and chart properties to make the dashboard
visually appealing and clear.
5. Save and Share: Give your dashboard a descriptive name and save it. You can also share it
with other users by granting them appropriate viewing permissions.
Filtering: You can filter the data displayed in your dashboard by applying time ranges,
resource types, or specific metric values. This allows you to focus on specific aspects of your
monitoring data. For example, you might filter a dashboard to show only metrics for a
particular application or service during a specific deployment window.
Sharing: Granting others access to your dashboards helps them stay informed and
collaborate on monitoring tasks. You can set different permission levels, such as "View only"
or "Edit," to control how others interact with your dashboards.
Configuring Alerting
Alerts notify you when specific conditions are met in your monitoring data. This helps you proactively
identify and address potential issues before they impact your users or applications. Here's how to
configure alerts:
1. Define Alert Policy: Create an alert policy that specifies the metric, condition (e.g., exceeding
a threshold), and severity level (e.g., critical, warning) to trigger an alert.
2. Set Notification Channels: Choose how you want to receive alerts, such as email, SMS, or
integration with a ticketing system.
3. Refine Configuration: You can further refine your alerts by setting escalation policies,
silencing rules (temporarily disabling alerts), and defining auto-remediation actions
(automatic responses to specific alerts).
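The core of a threshold condition in step 1 can be sketched as a pure function, assuming one sample per minute so that `duration_points` plays the role of the alert's duration:

```python
def breaches_threshold(samples, threshold, duration_points):
    """True if the metric stays above `threshold` for `duration_points`
    consecutive samples -- the shape of a 'threshold for N minutes' condition."""
    consecutive = 0
    for value in samples:
        consecutive = consecutive + 1 if value > threshold else 0
        if consecutive >= duration_points:
            return True
    return False

# CPU utilisation sampled once per minute (illustrative values):
cpu = [0.42, 0.91, 0.95, 0.97, 0.60]

fires = breaches_threshold(cpu, threshold=0.90, duration_points=3)  # True
quiet = breaches_threshold(cpu, threshold=0.90, duration_points=4)  # False
```

Requiring the breach to persist for a duration, rather than firing on a single spike, is one of the simplest ways to cut down alert noise.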
Additional Considerations:
Dashboard Design: When creating dashboards, consider the audience and their needs. Focus
on presenting information clearly and concisely. Use a logical layout and intuitive
visualizations to make the data easy to understand.
Alerting Thresholds: Set realistic and relevant thresholds for your alerts to avoid alert fatigue
(receiving too many unnecessary notifications).
Proactive Monitoring: Don't wait for alerts to identify issues. Regularly review your
dashboards to identify trends and potential problems before they escalate.
SLO (Service Level Objective): An SLO is a measurable target that defines the expected
performance of a service. It's a high-level goal, like "Our website will be available 99.95% of
the time."
SLI (Service Level Indicator): An SLI is a metric you track to measure progress towards your
SLO. For website availability, an SLI could be "percentage of successful requests over a 5-
minute window."
1. Set SLOs: Define your service's desired performance level using SLOs.
2. Identify SLIs: Choose metrics (SLIs) that accurately reflect your SLOs. Cloud Monitoring
provides various metrics for resources like VMs, databases, and custom metrics.
3. Create Alerting Policies: Configure alerts in Cloud Monitoring to trigger when an SLI deviates
from your SLO targets. For example, an alert could fire if the website's successful request
percentage falls below 99.95% for 5 minutes.
Benefits:
Faster troubleshooting
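A minimal sketch of the SLI/SLO comparison described in the steps above (the request counts are illustrative):

```python
def availability_sli(successes: int, total: int) -> float:
    """SLI: percentage of successful requests in the window."""
    return 100.0 * successes / total

SLO_TARGET = 99.95  # per cent

# 10,000 requests in a 5-minute window, 7 of them failed:
sli = availability_sli(successes=9993, total=10000)  # 99.93
alert_should_fire = sli < SLO_TARGET  # SLI has dipped below the SLO target
```

In Cloud Monitoring this comparison is what the alerting policy encodes: the SLI is a metric ratio, and the policy fires when it crosses the SLO-derived threshold for the configured window.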
Terraform is an Infrastructure as Code (IaC) tool that lets you manage cloud resources using code.
You can leverage Terraform to automate the creation and configuration of alerting policies in Cloud
Monitoring.
Here's how:
1. Define Terraform configuration: Write Terraform code that specifies the resources (e.g.,
metric, condition, notification channel) for your alerting policy.
2. Reference SLOs and SLIs: Use variables or data sources in Terraform to reference your
defined SLOs and SLIs.
3. Version control and deployment: Manage your Terraform code in a version control system
(like Git) and use CI/CD pipelines to automatically deploy and update alerting policies
whenever your infrastructure changes.
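As a hedged sketch of step 1 (attribute names may vary by Google provider version; the filter and thresholds are illustrative), an alerting policy in Terraform might look like:

```hcl
resource "google_monitoring_alert_policy" "high_cpu" {
  display_name = "High CPU utilization"
  combiner     = "OR"

  conditions {
    display_name = "CPU above 90% for 5 minutes"
    condition_threshold {
      filter          = "resource.type = \"gce_instance\" AND metric.type = \"compute.googleapis.com/instance/cpu/utilization\""
      comparison      = "COMPARISON_GT"
      threshold_value = 0.9
      duration        = "300s"
    }
  }

  # notification_channels = [google_monitoring_notification_channel.email.id]
}
```

Keeping the policy in code means the same threshold and duration are reviewed, versioned, and deployed like any other infrastructure change.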
Benefits:
Using Google Cloud Managed Service for Prometheus to Collect Metrics and Set Up Monitoring
and Alerting
Prometheus is an open-source monitoring tool that collects and analyzes metrics. Google Cloud
Managed Service for Prometheus is a fully managed service that simplifies running Prometheus on
Google Cloud.
How it works:
1. Deploy Collectors: Run the managed collectors (or a self-deployed Prometheus) to scrape
metrics from your workloads.
2. Ingest Metrics: Scraped metrics are sent to the managed Prometheus backend for storage
and querying.
3. Set Up Alerting Rules: Write alerting rules within Prometheus based on the collected
metrics. These rules trigger alerts when specific conditions are met.
4. Integrate with Cloud Monitoring: You can integrate Managed Prometheus with Cloud
Monitoring for centralized alerting and visualization.
Benefits:
Simple monitoring: If you need basic alerting for a few resources, manual configuration in
Cloud Monitoring might suffice.
Additional Considerations:
Alert Fatigue: Avoid creating too many alerts, which leads to alert fatigue (important
alerts get ignored).
Alert Routing and Escalation: Define who gets notified for different types of alerts and how
escalation happens for critical issues.
Alerting Best Practices: Follow best practices for writing effective alerting rules to ensure
they trigger at the right time and provide actionable information.
Cloud Logging is a managed service in Google Cloud that centralizes log data collection, storage,
analysis, and export. It simplifies log management by offering a unified platform for various sources,
including applications, infrastructure, and platforms.
Key Considerations:
o Cloud Audit Logs record API calls, user activities, and resource changes within your
Google Cloud projects.
o To enable them:
Choose the logging bucket and sink to store and export logs.
o Benefits:
Provide insights into user actions and resource modifications for security and
compliance purposes.
o VPC Flow Logs capture information about network traffic traversing your Virtual
Private Cloud (VPC) network.
o To enable them:
Choose the logging bucket and sink for log storage and export.
o Benefits:
o The Cloud Logging console offers a user-friendly interface to explore your logs.
o Access it from the Cloud Logging menu in the Google Cloud console
(https://cloud.google.com/logging).
o Features:
Filter logs based on timestamps, severity levels, resources, and log entries.
o Cloud Logging provides filtering capabilities to narrow down log entries for specific
analysis.
o Basic filters:
Filter by resource type (e.g., logs from a specific Compute Engine instance).
o Choosing the right filter depends on the complexity of your logs and the level of
detail needed.
o Exclusion:
Involves filtering out unwanted logs before they are stored or exported.
Reduces storage costs and simplifies log analysis by excluding irrelevant data.
o Export:
Project-level export: This means you're sending logs generated within a specific GCP project
to a destination like Cloud Storage or BigQuery. This is useful for isolating logs from individual
projects for analysis or compliance purposes.
Organization-level export: Here, you're exporting logs from all projects within your GCP
organization to a centralized location. This is efficient for large-scale log management and
works well if you have consistent logging needs across projects.
Log Volume: For high-volume projects, organization-level export might be more efficient.
Sinks: Google Cloud Logging uses "sinks" to define where logs are exported. You can
configure sinks to send logs to various destinations like Cloud Storage buckets, BigQuery
datasets, or external systems.
Cloud Logging Viewer: The Google Cloud Console provides a built-in Cloud Logging Viewer
for exploring exported logs. You can filter and search logs based on severity, timestamps,
resource types, and other criteria.
APIs and Logs Explorer: Additionally, you can manage and view logs programmatically using
the Cloud Logging API or through advanced tools like the Logs Explorer for more granular
analysis.
Cloud Logging allows integration with external logging systems like Splunk or ELK Stack. You
can configure sinks to send logs to these platforms for centralized log management alongside
logs from other sources.
This is beneficial for organizations already invested in an external logging ecosystem and
wanting to consolidate all logs in one place.
Cloud Logging offers powerful filtering capabilities to focus on relevant logs. You can filter
based on log severity, resource type, timestamps, and custom log fields.
Redacting sensitive data is crucial for security and compliance. Cloud Logging supports log
exclusion filters to remove specific data patterns (e.g., credit card numbers) before export.
You can also use Cloud Data Loss Prevention (now Sensitive Data Protection) to dynamically
redact sensitive data based on pre-defined rules.
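A minimal sketch of pattern-based redaction before export; the regex is illustrative, not a production-grade card-number matcher:

```python
import re

# Illustrative pattern for 16-digit card numbers, optionally space/dash separated.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){15}\d\b")

def redact(entry: str) -> str:
    """Replace card-number-like substrings before the entry is exported."""
    return CARD_PATTERN.sub("[REDACTED]", entry)

clean = redact("charge failed for card 4242 4242 4242 4242, retrying")
```

In practice this kind of transformation happens in an exclusion filter or a redaction step before the log leaves your control, so the sensitive value is never stored downstream.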
Audit Logs: These logs record API activity within your Google Cloud project. They provide a
crucial record of who did what, when, and where. Restricting access ensures only authorized
users can view these logs, protecting sensitive information about your project activity.
o IAM Roles: Use Google Cloud Identity and Access Management (IAM) roles to grant
granular access to audit logs. Specific roles like "Logging Viewer" or "Logging Editor"
allow users to view or modify logs, respectively.
VPC Flow Logs: These logs capture network traffic information flowing into, out of, and
within your Virtual Private Cloud (VPC). Restricting access prevents unauthorized users from
seeing details about your network traffic patterns.
o IAM Policies: Create IAM policies at the folder, organization, or project level to
control access to VPC Flow Logs. You can define who can view or export these logs
based on specific IAM roles or service accounts.
Log Exports: Cloud Logging allows you to export logs to external destinations like BigQuery
for analysis or compliance purposes. Restricting export configuration ensures only authorized
users can modify these destinations, preventing accidental data exposure or loss.
Writing Metrics and Logs: While access controls restrict viewing, it's important to allow
authorized applications and services to write metrics and logs to Cloud Monitoring. This data
is essential for monitoring system health, performance, and identifying issues.
o IAM Roles: Grant appropriate IAM roles like "Monitoring Writer" or "Monitoring
Editor" to applications or service accounts. This allows them to push metrics and logs
to Cloud Monitoring for analysis and alerting.
Additional Considerations:
Principle of Least Privilege: Grant users the minimum level of access required for their role.
This minimizes the risk of unauthorized access to sensitive information.
Monitoring Access Controls: Regularly review and update access controls to ensure they
reflect current user permissions and project requirements.
Logging Best Practices: Implement log rotation policies to manage log storage and retention
effectively. Consider filtering logs to reduce noise and focus on relevant data.
Google Cloud offers a comprehensive suite of tools for monitoring and analyzing resource utilization.
This helps you identify potential bottlenecks and optimize your infrastructure costs. Here's a
breakdown of key services:
Cloud Monitoring: The central hub for collecting, visualizing, and alerting on metrics from
your cloud resources. You can monitor CPU, memory, disk I/O, network traffic, and more for
VMs, Cloud Storage buckets, Cloud SQL instances, and many other services.
Cloud Logging: Aggregates logs from your applications and infrastructure. Logs provide
valuable insights into application behavior and potential errors. You can filter and analyze
logs to identify issues like slow queries, high error rates, or resource spikes.
Cloud Billing: Provides detailed reports on your cloud resource usage and costs. You can
identify underutilized resources and optimize your spending by using features like committed
use discounts or preemptible VMs.
Service mesh is an architectural layer that simplifies application communication and provides
observability. When using a service mesh like Istio on Google Cloud, it generates telemetry data that
provides insights into how your microservices interact. Here's what you can glean from telemetry:
Request latency: Measure how long it takes for requests to travel between services. High
latency can indicate overloaded services, network issues, or slow code execution.
Error rates: Track the number of failed requests between services. This helps identify service
instability or configuration problems.
Throughput: Monitor the volume of traffic flowing between services. Analyze spikes or dips
to understand service capacity and potential bottlenecks.
Compute resources like VMs are the workhorses of your applications. When performance problems
arise, troubleshooting compute resources is crucial. Here are some techniques:
Monitoring metrics: Use Cloud Monitoring to track CPU, memory, and disk utilization on
your VMs. Identify situations where resources are maxed out, causing performance
degradation.
Analyzing logs: Review VM logs for errors related to resource exhaustion, application
crashes, or kernel panics. Logs can pinpoint specific issues within the VM.
Scaling resources: If your VMs are consistently overloaded, consider scaling them up by
increasing CPU, memory, or storage. Conversely, you can scale down underutilized VMs to
save costs.
Cloud Profiler: This tool helps identify performance bottlenecks within your application code
running on VMs. It can pinpoint CPU-intensive functions or memory leaks.
Deployments should be smooth and applications should function flawlessly at runtime. Here's how
to identify issues in these stages:
Cloud Logging: Google Cloud Logging aggregates logs from various sources like applications,
infrastructure, and platforms. Analyze logs for errors, warnings, or unexpected behavior
during deployments or application runtime. Look for patterns and timestamps to pinpoint
the problem area.
Error Reporting: This service automatically collects crash reports, exceptions, and application
health data. It helps identify issues that might not be apparent in standard logs, like client-
side errors.
Cloud Monitoring: Monitor key application metrics like CPU usage, memory consumption,
request latency, and throughput. Deviations from normal behavior can indicate performance
bottlenecks. Define Service Level Indicators (SLIs) and set alerts based on these metrics to
proactively identify issues.
Tools like Cloud Profiler can help pinpoint performance bottlenecks within the application
code.
CI/CD Pipeline Monitoring: Monitor your CI/CD pipeline (e.g., Cloud Build or third-party
tools) for failures during the build, test, or deployment stages. Identify errors or bottlenecks
in the pipeline itself that might be causing deployment delays.
Network connectivity and performance are vital for any cloud application. Here's how to identify
network-related issues:
VPC Flow Logs: These logs record a sample of the ingress and egress traffic within your
Virtual Private Cloud (VPC). Analyze flow logs to identify unexpected traffic patterns,
potential security threats, or asymmetry in traffic flow (indicating issues with specific
instances).
Firewall Logs: Firewall logs track all traffic that interacts with your Cloud Firewall rules.
Analyze these logs to identify blocked connections, suspicious activity, or rule
misconfigurations that might be impacting legitimate traffic.
Network Details: Use tools like Cloud Monitoring's Network section to visualize network
topology, identify bottlenecks, and monitor resource utilization for Cloud Load Balancing or
Cloud CDN.
Additional Tips:
Use Cloud Monitoring (which absorbed Stackdriver) for a unified view of application
performance, logs, and infrastructure metrics.
Leverage infrastructure as code tools like Terraform to ensure consistent configuration and
minimize manual errors that might lead to network issues.
Consider cost optimization by using preemptible VMs for non-critical workloads. However, be
aware that preemptible VMs might be terminated during high demand periods, potentially
causing application downtime.
Concept: The process of embedding code or configurations within your application to gather
detailed information about its behavior during runtime. This information becomes invaluable
for troubleshooting issues and understanding application performance.
Implementation in Google Cloud: Google Cloud offers various tools and libraries for
instrumenting your applications:
o Cloud Profiler (formerly Stackdriver Profiler): A managed profiling service that
helps you identify performance bottlenecks in your applications. It integrates
seamlessly with Cloud Monitoring for centralized data collection and analysis.
Benefits:
o Enables gathering detailed runtime data about function calls, resource usage, and
application behavior.
o Provides insights into application health and performance metrics for proactive
monitoring and optimization.
Cloud Logging
Concept: A managed logging service from Google Cloud that centralizes log data from your
applications, infrastructure, and platforms. It offers a scalable and cost-effective solution for
collecting, storing, processing, analyzing, and exporting logs.
Key Features:
o Structured Logging: Enables logs to be formatted with relevant key-value pairs for
easier searching and filtering.
o Sinks: Logs can be routed to various destinations (Cloud Storage, BigQuery, Cloud
Monitoring, external systems) for further analysis and storage.
o Filters: Powerful filtering capabilities allow you to focus on specific logs based on
severity, timestamp, resource type, and custom labels.
o Log Viewer: A user-friendly interface for exploring and analyzing logs, facilitating
troubleshooting and performance monitoring.
Benefits in Debugging:
o Provides a central repository for all your application logs, simplifying the process of
finding relevant information.
o Enables filtering and searching logs based on specific criteria to pinpoint issues
quickly.
Cloud Trace
Concept: A distributed tracing service from Google Cloud that helps you visualize and
analyze the flow of requests through a complex application or microservice architecture. It
tracks the latency and path of each request across different services and components.
Benefits in Debugging:
o Enables correlating logs from different services involved in a request for a more
comprehensive understanding.
Alerting and Monitoring: Set up alerts based on log entries and trace data to proactively
identify potential issues before they impact users. Integrate logging and tracing information
with Cloud Monitoring for continuous monitoring and performance insights.
Best Practices: Follow Google Cloud's best practices for logging and tracing:
o Use structured logs with appropriate labels for easy filtering and analysis.
o Include relevant context (timestamps, user IDs, request IDs) in your logs to facilitate
troubleshooting.
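One way to sketch the "include request IDs" practice with Python's stdlib logging: a filter stamps every record so entries can later be correlated across services (the names and ID format are illustrative):

```python
import logging

captured = []

class Capture(logging.Handler):
    """Handler that keeps records in memory instead of writing them out."""
    def emit(self, record):
        captured.append(record)

class RequestIdFilter(logging.Filter):
    """Stamp every record with the current request ID (illustrative name)."""
    def __init__(self, request_id: str):
        super().__init__()
        self.request_id = request_id

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = self.request_id  # now available to formatters/sinks
        return True

logger = logging.getLogger("frontend")
logger.propagate = False
logger.setLevel(logging.INFO)
logger.addFilter(RequestIdFilter("req-5f2a"))
logger.addHandler(Capture())

logger.warning("upstream call timed out")
entry = captured[0]
```

If every service stamps the same request ID, a single trace ID search in the log viewer reconstructs the whole request path.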
Error Reporting
Integrates with other tools like Cloud Monitoring for comprehensive debugging.
Cloud Profiler
o CPU Profiling: Measures how your application spends CPU time, pinpointing
functions that consume excessive resources.
Provides detailed call stacks and allocation traces for in-depth analysis.
Cloud Monitoring
o Resource metrics like CPU usage, memory utilization, and disk I/O.
Integrates with other debugging tools like Error Reporting and Cloud Trace for a holistic view.
Preemptible VMs are virtual machines offered by GCP at a significantly lower cost compared
to standard on-demand VMs. The catch? Google can reclaim these VMs at any time with only
about 30 seconds' notice, and they run for at most 24 hours.
Use Cases: These VMs are ideal for workloads that are:
Pros:
Cons:
o VMs can be interrupted at any time, requiring applications to be designed for such
interruptions.
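A sketch of designing for interruption: the preemption check is injected so the logic runs anywhere; on a real GCE VM you would instead poll the metadata server's `preempted` flag (`http://metadata.google.internal/computeMetadata/v1/instance/preempted`, with the `Metadata-Flavor: Google` header):

```python
def run_until_preempted(work_items, fetch_preempted):
    """Process items one at a time; stop taking new work once preemption
    is signalled, so progress so far is preserved."""
    done = []
    for item in work_items:
        if fetch_preempted() == "TRUE":
            break                      # drain: stop taking new work
        done.append(item * 2)          # placeholder "work"
    return done

# Simulate the preemption signal arriving after two items:
signals = iter(["FALSE", "FALSE", "TRUE"])
result = run_until_preempted([1, 2, 3], lambda: next(signals))  # [2, 4]
```

Checkpointing between items like this is what makes batch workloads a good fit for preemptible capacity: an interrupted job resumes from its last completed unit rather than starting over.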
CUDs offer significant discounts on compute resources (VMs, Cloud Storage, etc.) in
exchange for a commitment to a specific level of usage over a set period (typically 1 or 3
years).
Types of CUDs:
Benefits:
Considerations:
Sustained-use Discounts:
GCP offers discounts for resources that are used consistently over a prolonged period. These
discounts reward predictable workloads and can significantly reduce your cloud bill.
o Cloud Spanner committed usage discounts: Similar to CUDs, but specifically for
Cloud Spanner relational database instances.
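Assuming the classic N1 sustained-use tiering (each successive quarter of the month billed at 100%, 80%, 60%, and 40% of the base rate; check current pricing docs, as schemes vary by machine family), the effective price multiplier can be computed as:

```python
def sustained_use_multiplier(fraction_of_month: float) -> float:
    """Effective price multiplier under the assumed N1 sustained-use scheme:
    successive quarters of the month billed at 100%, 80%, 60%, 40% of base."""
    rates = [1.0, 0.8, 0.6, 0.4]
    remaining = fraction_of_month
    billed = 0.0
    for rate in rates:
        chunk = min(remaining, 0.25)
        billed += chunk * rate
        remaining -= chunk
        if remaining <= 0:
            break
    return billed / fraction_of_month

full_month = sustained_use_multiplier(1.0)  # 0.70 -> a 30% discount
half_month = sustained_use_multiplier(0.5)  # 0.90 -> a 10% discount
```

Running an instance the whole month thus earns the maximum discount automatically, with no commitment required, which is the key difference from CUDs.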
Network Tiers:
GCP offers different network tiers that cater to varying network traffic needs:
o Standard Tier: Cost-effective option for workloads with minimal network traffic. Ideal
for development environments or low-traffic applications.
By choosing the appropriate network tier based on your application's traffic patterns, you can
optimize network costs without compromising performance.
Sizing Recommendations:
GCP provides recommendations for optimal machine type and size based on your
application's resource requirements. Utilizing these recommendations can help you:
Cloud Monitoring: Monitors resource utilization metrics like CPU, memory, and disk usage.
You can identify underutilized or overloaded resources and adjust their size accordingly.
Cloud Architecture Center: Provides recommendations for machine types and configurations
based on your workload type.
By effectively utilizing sustained-use discounts, choosing the right network tier, and following sizing
recommendations, you can significantly reduce your GCP costs without impacting application
performance.
Disclaimer: All data and information provided on this site is for informational
purposes only. This site makes no representations as to accuracy, completeness,
correctness, suitability, or validity of any information on this site & will not be
liable for any errors, omissions, or delays in this information or any losses,
injuries, or damages arising from its display or use. All information is provided on
an as-is basis.