KEMBAR78
Data Engineering 101 - BigQuery | PDF | Cloud Computing | Table (Database)
0% found this document useful (0 votes)
22 views49 pages

Data Engineering 101 - BigQuery

Uploaded by

beginew28
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
22 views49 pages

Data Engineering 101 - BigQuery

Uploaded by

beginew28
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 49

Data

Engineering
101
Google
BigQuery
Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query

BigQuery Overview

BigQuery is a fully-managed, serverless data


warehouse that allows for super-fast SQL
queries using the processing power of Google's
infrastructure.

BigQuery allows you to run SQL queries on large


datasets without needing to manage infrastructure.

Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query

Datasets and Tables

Datasets are top-level containers that hold


tables. Tables contain the actual data,
structured in rows and columns.

CREATE TABLE mydataset.mytable


(id INT64, name STRING);

Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query

Loading Data

Data can be loaded into BigQuery from various


sources, including CSV files, JSON files, and
Google Cloud Storage.

LOAD DATA INTO mydataset.mytable


FROM FILE 'gs://my-bucket/my-file.csv'
FORMAT CSV;

Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query

Querying Data

BigQuery uses standard SQL for querying data.


Queries can join multiple tables, aggregate
data, and perform complex transformations.

SELECT name, COUNT(*) as name_count


FROM mydataset.mytable
GROUP BY name;

Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query

Partitions

Tables can be partitioned by date or range,


improving query performance and managing
large datasets.

CREATE TABLE mydataset.mytable_partitioned


PARTITION BY DATE(timestamp_column)
AS
SELECT * FROM mydataset.mytable;

Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query

Clustering

Clustering organizes data based on the values


in one or more columns, improving
performance for specific types of queries.

CREATE TABLE mydataset.mytable_clustered


CLUSTER BY category
AS
SELECT * FROM mydataset.mytable;

Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query

User-defined Functions

BigQuery allows you to create custom functions


using JavaScript or SQL. These functions can be
reused in multiple queries.

CREATE FUNCTION mydataset.mysquare(x FLOAT64)


RETURNS FLOAT64 AS (x * x);

SELECT mydataset.mysquare(3);

Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query

Stored Procedures

Stored procedures allow you to execute a


sequence of SQL statements. They can include
control flow logic, like IF and LOOP statements.

CREATE PROCEDURE mydataset.myprocedure()


BEGIN
INSERT INTO mydataset.mytable (id, name) VALUES
(1, 'example'); END;

Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query

Views

Views are virtual tables created by querying


data from one or more tables. Views do not
store data themselves but provide a way to
simplify complex queries.

CREATE VIEW mydataset.myview


AS
SELECT id, name
FROM mydataset.mytable;

Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query

Materialized Views

Materialized views store query results


physically, improving performance for
frequently run queries.

CREATE MATERIALIZED VIEW mydataset.mymview


AS
SELECT id, COUNT(*) F
ROM mydataset.mytable
GROUP BY id;

Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query

Data Security

BigQuery provides data security through access


controls, allowing you to specify who can view
or query your data.

Using IAM roles and policies to control access to


datasets and tables.

Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query

BigQuery ML

BigQuery ML allows you to create and execute


machine learning models using SQL queries
directly within BigQuery.

CREATE MODEL mydataset.mymodel


OPTIONS(model_type='linear_reg', input_label_cols=['label'])
AS
SELECT *
FROM mydataset.mytable;

Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query

Geospatial Analysis

BigQuery supports geospatial data types and


functions, allowing you to perform location-
based queries and analysis.

SELECT ST_DISTANCE(ST_GEOGPOINT(lng1, lat1),


ST_GEOGPOINT(lng2, lat2))
FROM mydataset.mytable;

Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query

BigQuery API

The BigQuery API allows you to interact with


BigQuery programmatically, enabling you to
automate tasks and integrate with other
applications.

Using client libraries in Python, Java, or other


languages to run queries and manage datasets.

Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query

Pricing

BigQuery pricing is based on data storage and


query usage. You pay for the amount of data
stored and the amount of data processed by
your queries.

Cost can be estimated using the BigQuery pricing


calculator and can be reduced by optimizing queries
to reduce data processed.

Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query

BigQuery Data Transfer


Service
This service automates data movement into
BigQuery on a scheduled, managed basis. It
can transfer data from various Google services
and external data sources.

Example use case could like setting up a data


transfer from Google Ads or Google Analytics to
BigQuery using the Data Transfer Service.

Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query

Integration with Google


Cloud
BigQuery integrates seamlessly with other
Google Cloud services like Google Data Studio,
Google Cloud Storage, and Google Cloud
Pub/Sub.

For example creating dashboards in Google Data


Studio using BigQuery as the data source or
exporting BigQuery results to Google Cloud Storage.

Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query

Federated Queries

BigQuery allows you to query data stored


outside of BigQuery, such as in Google Cloud
Storage or Google Drive, without loading it into
BigQuery.

SELECT * FROM
EXTERNAL_QUERY('external_source_connection',
'SELECT * FROM my_external_table');

Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query

BigQuery BI Engine

BI Engine is an in-memory analysis service that


provides sub-second query response times,
specifically for interactive dashboards and
reports.

BI Engine can enabled to speed up queries for


Google Data Studio reports connected to BigQuery.

Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query

Scheduling Queries

BigQuery allows you to schedule queries to run


at regular intervals, automating routine data
processing tasks.

For example a scheduled query to run daily and


load new data into a table using the BigQuery web
UI.

Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query

BigQuery Storage API

The Storage API provides fast access to


BigQuery-managed storage for advanced
analytical use cases.

Storage API can be used to read data from BigQuery


into a Pandas DataFrame for further analysis in
Python.

Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query

Streaming Inserts

BigQuery supports real-time data ingestion,


allowing you to stream data into tables as it
arrives.

Using the BigQuery API or client libraries to stream


data into BigQuery tables in real-time for immediate
analysis.

Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query

BigQuery Reservations

Reservations allow you to purchase dedicated


query processing capacity for predictable
workloads, improving cost management and
performance.

Reserving a certain number of slots for a critical


dashboard to ensure consistent performance during
business hours.

Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query

Access Control

BigQuery provides fine-grained access control


at the dataset, table, and column levels to
protect sensitive data.

Assigning roles and permissions to users and groups


to control who can view and query specific datasets.

Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query

Data Masking

BigQuery supports dynamic data masking to


protect sensitive information in query results.

Configuring a policy tag to mask a social security


number column so that only authorized users can
see the full data.

Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query

BigQuery Audit Logs

Audit logs provide detailed records of all the


operations performed on BigQuery resources,
helping with security and compliance.

Using Cloud Logging to monitor and analyze


BigQuery usage patterns and detect any unusual
activity.

Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query

External Data Sources

BigQuery can query external data sources like


Google Sheets, Cloud SQL, and other databases
without moving the data.

SELECT *
FROM
EXTERNAL_QUERY('my-google-sheets-connection',
'SELECT * FROM sheet1');

Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query

Backup and Restore

BigQuery provides mechanisms to back up and


restore datasets to protect against data loss.

Using scheduled queries and export jobs to back


up critical datasets to Google Cloud Storage for
disaster recovery.

Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query

BigQuery Sandbox

The Sandbox provides a free tier with limited


usage for learning and experimenting with
BigQuery.

Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query

BigQuery in Dataflow

Dataflow integrates with BigQuery for ETL


(Extract, Transform, Load) operations, enabling
scalable data processing pipelines.

Writing a Dataflow job that reads from a Pub/Sub


topic, transforms the data, and writes it to a
BigQuery table.

Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query

Machine Learning

BigQuery ML supports various machine


learning models, including linear regression,
logistic regression, k-means clustering, and
more.

CREATE MODEL mydataset.mykmeans


MODEL_TYPE='kmeans' OPTIONS(num_clusters=3)
AS SELECT * FROM mydataset.training_data;

Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query

Geospatial Indexing

BigQuery supports geospatial indexing to


improve the performance of location-based
queries.

Creating a geography column index to speed up


spatial queries involving large datasets.

Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query

Data Governance

BigQuery provides data governance features,


including audit logging, data classification, and
policy enforcement.

Implementing a data governance strategy to ensure


compliance with industry regulations and
organizational policies.

Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query

BigQuery Analytics Hub

The Analytics Hub enables sharing and


discovering analytics assets within and across
organizations.

Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query

Data Catalog Integration

BigQuery integrates with Google Data Catalog


for metadata management, data discovery,
and data lineage tracking.

Using Data Catalog to search and discover datasets,


tables, and views across your BigQuery
environment.

Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query

BigQuery Admin Console

The Admin Console provides tools for


managing and monitoring BigQuery resources,
including jobs, datasets, and reservations.

Using the Admin Console to monitor query


performance, manage reservations, and view job
history.

Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query

Column-level Security

BigQuery supports column-level security,


allowing you to restrict access to sensitive
columns within a table.

Configuring access policies to allow certain users to


view only non-sensitive columns in a dataset.

Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query

Cost Optimization

BigQuery offers various features to optimize


costs, including slot reservations, query
optimization techniques, and monitoring tools.

Using cost control features to manage and reduce


the expenses of running large-scale queries and
storing data.

Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query

Query Execution Plans

BigQuery provides query execution plans to


help you understand and optimize the
performance of your queries.

Reviewing the query execution plan to identify and


address performance bottlenecks in complex
queries.

Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query

Data Import Formats

BigQuery supports multiple data import


formats, including CSV, JSON, Avro, Parquet,
and ORC.

Loading data from a CSV file stored in Google Cloud


Storage into a BigQuery table for analysis.

Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query

Data Export

BigQuery allows exporting data to various


formats and destinations, including CSV, JSON,
Avro, Parquet, and Google Cloud Storage.

Exporting query results to a Parquet file in Google


Cloud Storage for further processing in another tool.

Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query

BigQuery Data QnA

Data QnA allows natural language queries over


BigQuery datasets, making it easier for non-
technical users to explore data.

Using Data QnA to ask questions in plain English and


get answers from your BigQuery datasets without
writing SQL.

Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query

BigQuery Audit Reports

Audit reports provide detailed insights into the


usage and performance of your BigQuery
environment, helping with capacity planning
and optimization.

Generating audit reports to track the most


frequently accessed tables, query performance, and
data growth trends.

Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query

Performance Tuning

BigQuery provides various techniques for


performance tuning, including partitioning,
clustering, denormalization, and query
optimization.

Applying best practices for performance tuning to


ensure efficient and fast query execution in
BigQuery.

Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query

BigQuery in Looker

Looker integrates with BigQuery to provide


powerful data visualization and exploration
capabilities.

Creating interactive dashboards and reports in


Looker using BigQuery as the backend data source.

Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query

BigQuery Export to Sheets

BigQuery allows exporting query results


directly to Google Sheets for further analysis
and collaboration.

Using the EXPORT DATA statement to export query


results to a Google Sheets document for easy
sharing and collaboration.

Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query

Resource Management

BigQuery provides tools for managing and


monitoring resources, including jobs, datasets,
and reservations, to ensure efficient use of
resources.

Monitoring resource usage and managing quotas to


optimize performance and control costs in BigQuery.

Shwetank Singh
GritSetGrow - GSGLearn.com

You might also like