Data
Engineering
101
Google
BigQuery
Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query
BigQuery Overview
BigQuery is a fully-managed, serverless data
warehouse that allows for super-fast SQL
queries using the processing power of Google's
infrastructure.
BigQuery allows you to run SQL queries on large
datasets without needing to manage infrastructure.
Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query
Datasets and Tables
Datasets are top-level containers that hold
tables. Tables contain the actual data,
structured in rows and columns.
CREATE TABLE mydataset.mytable
(id INT64, name STRING);
Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query
Loading Data
Data can be loaded into BigQuery from various
sources, including CSV files, JSON files, and
Google Cloud Storage.
LOAD DATA INTO mydataset.mytable
FROM FILE 'gs://my-bucket/my-file.csv'
FORMAT CSV;
Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query
Querying Data
BigQuery uses standard SQL for querying data.
Queries can join multiple tables, aggregate
data, and perform complex transformations.
SELECT name, COUNT(*) as name_count
FROM mydataset.mytable
GROUP BY name;
Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query
Partitions
Tables can be partitioned by date or range,
improving query performance and managing
large datasets.
CREATE TABLE mydataset.mytable_partitioned
PARTITION BY DATE(timestamp_column)
AS
SELECT * FROM mydataset.mytable;
Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query
Clustering
Clustering organizes data based on the values
in one or more columns, improving
performance for specific types of queries.
CREATE TABLE mydataset.mytable_clustered
CLUSTER BY category
AS
SELECT * FROM mydataset.mytable;
Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query
User-defined Functions
BigQuery allows you to create custom functions
using JavaScript or SQL. These functions can be
reused in multiple queries.
CREATE FUNCTION mydataset.mysquare(x FLOAT64)
RETURNS FLOAT64 AS (x * x);
SELECT mydataset.mysquare(3);
Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query
Stored Procedures
Stored procedures allow you to execute a
sequence of SQL statements. They can include
control flow logic, like IF and LOOP statements.
CREATE PROCEDURE mydataset.myprocedure()
BEGIN
INSERT INTO mydataset.mytable (id, name) VALUES
(1, 'example'); END;
Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query
Views
Views are virtual tables created by querying
data from one or more tables. Views do not
store data themselves but provide a way to
simplify complex queries.
CREATE VIEW mydataset.myview
AS
SELECT id, name
FROM mydataset.mytable;
Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query
Materialized Views
Materialized views store query results
physically, improving performance for
frequently run queries.
CREATE MATERIALIZED VIEW mydataset.mymview
AS
SELECT id, COUNT(*) F
ROM mydataset.mytable
GROUP BY id;
Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query
Data Security
BigQuery provides data security through access
controls, allowing you to specify who can view
or query your data.
Using IAM roles and policies to control access to
datasets and tables.
Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query
BigQuery ML
BigQuery ML allows you to create and execute
machine learning models using SQL queries
directly within BigQuery.
CREATE MODEL mydataset.mymodel
OPTIONS(model_type='linear_reg', input_label_cols=['label'])
AS
SELECT *
FROM mydataset.mytable;
Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query
Geospatial Analysis
BigQuery supports geospatial data types and
functions, allowing you to perform location-
based queries and analysis.
SELECT ST_DISTANCE(ST_GEOGPOINT(lng1, lat1),
ST_GEOGPOINT(lng2, lat2))
FROM mydataset.mytable;
Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query
BigQuery API
The BigQuery API allows you to interact with
BigQuery programmatically, enabling you to
automate tasks and integrate with other
applications.
Using client libraries in Python, Java, or other
languages to run queries and manage datasets.
Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query
Pricing
BigQuery pricing is based on data storage and
query usage. You pay for the amount of data
stored and the amount of data processed by
your queries.
Cost can be estimated using the BigQuery pricing
calculator and can be reduced by optimizing queries
to reduce data processed.
Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query
BigQuery Data Transfer
Service
This service automates data movement into
BigQuery on a scheduled, managed basis. It
can transfer data from various Google services
and external data sources.
Example use case could like setting up a data
transfer from Google Ads or Google Analytics to
BigQuery using the Data Transfer Service.
Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query
Integration with Google
Cloud
BigQuery integrates seamlessly with other
Google Cloud services like Google Data Studio,
Google Cloud Storage, and Google Cloud
Pub/Sub.
For example creating dashboards in Google Data
Studio using BigQuery as the data source or
exporting BigQuery results to Google Cloud Storage.
Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query
Federated Queries
BigQuery allows you to query data stored
outside of BigQuery, such as in Google Cloud
Storage or Google Drive, without loading it into
BigQuery.
SELECT * FROM
EXTERNAL_QUERY('external_source_connection',
'SELECT * FROM my_external_table');
Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query
BigQuery BI Engine
BI Engine is an in-memory analysis service that
provides sub-second query response times,
specifically for interactive dashboards and
reports.
BI Engine can enabled to speed up queries for
Google Data Studio reports connected to BigQuery.
Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query
Scheduling Queries
BigQuery allows you to schedule queries to run
at regular intervals, automating routine data
processing tasks.
For example a scheduled query to run daily and
load new data into a table using the BigQuery web
UI.
Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query
BigQuery Storage API
The Storage API provides fast access to
BigQuery-managed storage for advanced
analytical use cases.
Storage API can be used to read data from BigQuery
into a Pandas DataFrame for further analysis in
Python.
Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query
Streaming Inserts
BigQuery supports real-time data ingestion,
allowing you to stream data into tables as it
arrives.
Using the BigQuery API or client libraries to stream
data into BigQuery tables in real-time for immediate
analysis.
Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query
BigQuery Reservations
Reservations allow you to purchase dedicated
query processing capacity for predictable
workloads, improving cost management and
performance.
Reserving a certain number of slots for a critical
dashboard to ensure consistent performance during
business hours.
Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query
Access Control
BigQuery provides fine-grained access control
at the dataset, table, and column levels to
protect sensitive data.
Assigning roles and permissions to users and groups
to control who can view and query specific datasets.
Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query
Data Masking
BigQuery supports dynamic data masking to
protect sensitive information in query results.
Configuring a policy tag to mask a social security
number column so that only authorized users can
see the full data.
Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query
BigQuery Audit Logs
Audit logs provide detailed records of all the
operations performed on BigQuery resources,
helping with security and compliance.
Using Cloud Logging to monitor and analyze
BigQuery usage patterns and detect any unusual
activity.
Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query
External Data Sources
BigQuery can query external data sources like
Google Sheets, Cloud SQL, and other databases
without moving the data.
SELECT *
FROM
EXTERNAL_QUERY('my-google-sheets-connection',
'SELECT * FROM sheet1');
Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query
Backup and Restore
BigQuery provides mechanisms to back up and
restore datasets to protect against data loss.
Using scheduled queries and export jobs to back
up critical datasets to Google Cloud Storage for
disaster recovery.
Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query
BigQuery Sandbox
The Sandbox provides a free tier with limited
usage for learning and experimenting with
BigQuery.
Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query
BigQuery in Dataflow
Dataflow integrates with BigQuery for ETL
(Extract, Transform, Load) operations, enabling
scalable data processing pipelines.
Writing a Dataflow job that reads from a Pub/Sub
topic, transforms the data, and writes it to a
BigQuery table.
Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query
Machine Learning
BigQuery ML supports various machine
learning models, including linear regression,
logistic regression, k-means clustering, and
more.
CREATE MODEL mydataset.mykmeans
MODEL_TYPE='kmeans' OPTIONS(num_clusters=3)
AS SELECT * FROM mydataset.training_data;
Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query
Geospatial Indexing
BigQuery supports geospatial indexing to
improve the performance of location-based
queries.
Creating a geography column index to speed up
spatial queries involving large datasets.
Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query
Data Governance
BigQuery provides data governance features,
including audit logging, data classification, and
policy enforcement.
Implementing a data governance strategy to ensure
compliance with industry regulations and
organizational policies.
Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query
BigQuery Analytics Hub
The Analytics Hub enables sharing and
discovering analytics assets within and across
organizations.
Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query
Data Catalog Integration
BigQuery integrates with Google Data Catalog
for metadata management, data discovery,
and data lineage tracking.
Using Data Catalog to search and discover datasets,
tables, and views across your BigQuery
environment.
Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query
BigQuery Admin Console
The Admin Console provides tools for
managing and monitoring BigQuery resources,
including jobs, datasets, and reservations.
Using the Admin Console to monitor query
performance, manage reservations, and view job
history.
Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query
Column-level Security
BigQuery supports column-level security,
allowing you to restrict access to sensitive
columns within a table.
Configuring access policies to allow certain users to
view only non-sensitive columns in a dataset.
Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query
Cost Optimization
BigQuery offers various features to optimize
costs, including slot reservations, query
optimization techniques, and monitoring tools.
Using cost control features to manage and reduce
the expenses of running large-scale queries and
storing data.
Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query
Query Execution Plans
BigQuery provides query execution plans to
help you understand and optimize the
performance of your queries.
Reviewing the query execution plan to identify and
address performance bottlenecks in complex
queries.
Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query
Data Import Formats
BigQuery supports multiple data import
formats, including CSV, JSON, Avro, Parquet,
and ORC.
Loading data from a CSV file stored in Google Cloud
Storage into a BigQuery table for analysis.
Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query
Data Export
BigQuery allows exporting data to various
formats and destinations, including CSV, JSON,
Avro, Parquet, and Google Cloud Storage.
Exporting query results to a Parquet file in Google
Cloud Storage for further processing in another tool.
Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query
BigQuery Data QnA
Data QnA allows natural language queries over
BigQuery datasets, making it easier for non-
technical users to explore data.
Using Data QnA to ask questions in plain English and
get answers from your BigQuery datasets without
writing SQL.
Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query
BigQuery Audit Reports
Audit reports provide detailed insights into the
usage and performance of your BigQuery
environment, helping with capacity planning
and optimization.
Generating audit reports to track the most
frequently accessed tables, query performance, and
data growth trends.
Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query
Performance Tuning
BigQuery provides various techniques for
performance tuning, including partitioning,
clustering, denormalization, and query
optimization.
Applying best practices for performance tuning to
ensure efficient and fast query execution in
BigQuery.
Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query
BigQuery in Looker
Looker integrates with BigQuery to provide
powerful data visualization and exploration
capabilities.
Creating interactive dashboards and reports in
Looker using BigQuery as the backend data source.
Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query
BigQuery Export to Sheets
BigQuery allows exporting query results
directly to Google Sheets for further analysis
and collaboration.
Using the EXPORT DATA statement to export query
results to a Google Sheets document for easy
sharing and collaboration.
Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Big Query
Resource Management
BigQuery provides tools for managing and
monitoring resources, including jobs, datasets,
and reservations, to ensure efficient use of
resources.
Monitoring resource usage and managing quotas to
optimize performance and control costs in BigQuery.
Shwetank Singh
GritSetGrow - GSGLearn.com