
Top 20 AWS Data Engineering Glue Interview Q & A
Curated by:
Sachin Chandrashekhar
Founder – Data Engineering Hub
LinkedIn: https://www.linkedin.com/in/sachincw/
WhatsApp Community:
https://chat.whatsapp.com/FAqHgo4YpUsLFScpiMvtSF
Top mate link: https://lnkd.in/d28ETqaN
AWS Program Waitlist: https://masterclass.sachin.cloud


1) What is AWS Glue Catalog?

Answer: AWS Glue Catalog is a fully managed metadata repository that is used to store
metadata about data sources, transformations, and targets. It is a central repository for
storing table definitions, schema information, and other metadata related to ETL jobs.

2) What is the purpose of the AWS Glue Catalog in the ETL process?

Answer: The AWS Glue Catalog plays a crucial role in the ETL (Extract, Transform, Load)
process by providing a centralized metadata store. It stores information about data
sources, target destinations, and transformations, enabling the Glue ETL service to
discover, catalog, and transform data efficiently.

3) How are schemas defined in the AWS Glue Catalog?


Answer: Schemas in the AWS Glue Catalog are defined by creating metadata tables.
Each table contains information about the data's structure, including column names,
data types, and other relevant details.
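As an illustration, a table definition can also be created programmatically instead of through a crawler. Below is a minimal boto3 sketch; the database name, table name, and S3 location are hypothetical:

import boto3

# A minimal sketch, assuming a Glue database named "sales_db" already exists
# and the CSV data lives under the hypothetical S3 prefix below.
glue = boto3.client("glue")

glue.create_table(
    DatabaseName="sales_db",
    TableInput={
        "Name": "sales_records",
        "StorageDescriptor": {
            "Columns": [
                {"Name": "product_id", "Type": "int"},
                {"Name": "sales_amount", "Type": "decimal(10,2)"},
                {"Name": "transaction_date", "Type": "date"},
            ],
            "Location": "s3://your-s3-bucket/sales-data/",
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                "Parameters": {"field.delim": ","},
            },
        },
        "PartitionKeys": [
            {"Name": "year", "Type": "int"},
            {"Name": "month", "Type": "int"},
        ],
    },
)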

4) Can AWS Glue Catalog be integrated with external data sources?


Answer: Yes, AWS Glue Catalog can be integrated with external data sources. It
supports connections to various data stores, including Amazon S3, Amazon RDS,
Amazon Redshift, and more. This allows Glue to catalog and process data from
different sources.

5) How does AWS Glue Catalog handle changes in schema over time?
Answer: AWS Glue Catalog supports versioning of table definitions. When there are
changes in the schema, a new version of the table is created in the catalog, preserving
historical versions and allowing for schema evolution.
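For example, you can list the stored versions of a table through the Glue API. A minimal boto3 sketch (database and table names are hypothetical):

import boto3

glue = boto3.client("glue")

# Each table version carries the full table definition, including its schema.
response = glue.get_table_versions(DatabaseName="sales_db", TableName="sales_records")
for version in response["TableVersions"]:
    columns = [c["Name"] for c in version["Table"]["StorageDescriptor"]["Columns"]]
    print(version["VersionId"], columns)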

6) How can you optimize the performance of queries using the AWS Glue Catalog?


Answer: Performance optimization in AWS Glue Catalog involves partitioning tables,
using column statistics to improve query planning, and optimizing transformations.
Leveraging partitioning and indexing can significantly enhance query performance.

7) Explain the concept of partitioning in AWS Glue Catalog.

Answer: Partitioning in AWS Glue Catalog involves dividing large datasets into smaller,
more manageable parts based on specific columns. It improves query performance by
allowing the system to skip irrelevant data when querying based on the partition
columns.
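For example, sales data partitioned by year and month is typically laid out in S3 as one folder per partition value; Spark can then prune partitions when a query filters on those columns. A minimal sketch, with a hypothetical bucket and layout:

# Hypothetical partitioned layout in S3:
#   s3://your-s3-bucket/sales-data/year=2023/month=1/part-0000.parquet
#   s3://your-s3-bucket/sales-data/year=2023/month=2/part-0000.parquet
#   ...

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-pruning").getOrCreate()

# Spark discovers year/month as partition columns from the directory names and
# reads only the folders that match the filter (partition pruning).
january_2023 = (
    spark.read.parquet("s3://your-s3-bucket/sales-data/")
    .where("year = 2023 AND month = 1")
)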

8) What is a custom classifier in Glue Crawler?

Answer: In AWS Glue, a custom classifier is a user-defined pattern or logic that the Glue
Crawler uses to infer the schema of data stored in various formats within a data store.
By default, Glue provides built-in classifiers for common data formats like JSON, CSV,
Parquet, and others. However, in cases where your data follows a specific format or
structure that is not covered by the built-in classifiers, you can create a custom
classifier.

Here are the key points about custom classifiers in Glue:

● User-Defined Patterns: With a custom classifier, you define patterns or regular expressions that match the structure of your data. This allows the Glue Crawler to identify and understand the schema of your data based on these patterns.

● Flexible Schema Inference: Custom classifiers provide flexibility when dealing with data that might not conform to standard formats. They enable you to teach the crawler how to interpret and classify your data correctly.

● Support for Multiple Formats: Custom classifiers can be applied to various data formats, including those not covered by the built-in classifiers. This is particularly useful when dealing with proprietary or custom data formats.

● Example Use Case: Consider a scenario where your data is stored in log files, and the structure of each log entry follows a specific pattern. By creating a custom classifier, you can instruct the Glue Crawler on how to interpret the log file entries, allowing for accurate schema inference.

● Integration with Crawlers: Once a custom classifier is created, you associate it with a Glue Crawler. When the Crawler runs, it uses the custom classifier along with built-in classifiers to infer the schema of the data in your data store.

Using custom classifiers is a way to extend the capabilities of AWS Glue to handle diverse and non-standard data formats. It allows you to tailor the schema inference process to the specific structure of your data, ensuring accurate and meaningful results during the ETL (Extract, Transform, Load) process.

Let's walk through an example of creating a custom classifier in AWS Glue for a
hypothetical scenario where your data is stored in log files, and each log entry follows a
specific pattern.

Scenario:

Assume you have log files stored in Amazon S3, and each log entry has the following
format:

[2023-12-01 10:30:15] INFO: This is a log entry.

Here, the log entry includes a timestamp, log level (INFO in this case), and the log
message.

Steps to Create a Custom Classifier:

Go to the AWS Glue Console:
● Open the AWS Management Console and navigate to the AWS Glue service.

Create a Custom Classifier:
● In the Glue Console, go to the "Classifiers" section.
● Click on "Add classifier" and choose "Custom classifier."
● Provide a name for your custom classifier (e.g., LogEntryClassifier).
● Specify the classification patterns using regular expressions. In this case, you might use a pattern like:

\[\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\] ([A-Z]+): .*

Associate the Custom Classifier with a Crawler:
● After creating the custom classifier, go to the "Crawlers" section in the Glue Console.
● Edit the existing crawler or create a new one that points to your S3 location.
● In the crawler settings, associate the custom classifier you created.

Run the Crawler:
● Execute the crawler to scan and catalog the data in your S3 location.

Result:

When the Glue Crawler runs, it uses the custom classifier along with any built-in
classifiers to infer the schema of the log files. The custom classifier helps Glue
understand the structure of each log entry, allowing for accurate schema inference.

After the crawl is complete, you can view the cataloged tables in the Glue Data Catalog.
The table schema should reflect the timestamp, log level, and message components of
each log entry.

This example illustrates how a custom classifier in AWS Glue can be used to handle
non-standard data formats, providing flexibility in schema inference for diverse
datasets.
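A similar classifier can also be created programmatically. Note that the Glue API models log-style custom classifiers as Grok classifiers, so the sketch below uses a Grok pattern rather than the raw regex shown above; the classifier name, classification label, and pattern are illustrative assumptions:

import boto3

glue = boto3.client("glue")

# Create a Grok-based custom classifier for log lines such as:
#   [2023-12-01 10:30:15] INFO: This is a log entry.
glue.create_classifier(
    GrokClassifier={
        "Name": "LogEntryClassifier",
        "Classification": "application-logs",
        "GrokPattern": "\\[%{TIMESTAMP_ISO8601:timestamp}\\] %{LOGLEVEL:level}: %{GREEDYDATA:message}",
    }
)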

9) Explain partition indexes in Glue Catalog with an example.


Answer: In AWS Glue, partition indexes are a feature used in the Glue Data Catalog to
improve the efficiency of query performance, especially when dealing with large
datasets. Partition indexes are metadata structures that store information about the
partitions of a table, making it easier for the Glue ETL service to skip irrelevant data
when executing queries.

Here's an explanation of partition indexes in Glue Catalog with an example:

Example Scenario:

Consider a scenario where you have a large dataset of sales records stored in Amazon
S3, and the data is partitioned by the year and month of the sales transactions. Each
sales record has attributes such as product_id, sales_amount, and transaction_date.

Without Partition Indexes:

Without partition indexes, querying the entire dataset might involve scanning all the data
files, even if you are interested in a specific subset of the data, such as sales records for
a particular month or year. This can result in slower query performance and increased
resource usage.

With Partition Indexes:

Partitioning Setup:
● In the Glue Console, you define a table in the Data Catalog for the sales
records.
● You set up partitioning on the transaction_date column, creating
partitions for each year and month.
Partition Index Creation:
● After defining the partitioned table, you add a partition index on the partition keys (year, month), either in the Glue console or with the CreatePartitionIndex API. Glue then maintains the index as new partitions are added, so queries that filter on those keys can use it.

CREATE EXTERNAL TABLE sales_records (
    product_id INT,
    sales_amount DECIMAL(10, 2),
    transaction_date DATE
)
PARTITIONED BY (year INT, month INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://your-s3-bucket/sales-data/';

Query Optimization:
● With partition indexes in place, when you execute a query that filters data
based on the partitioned columns (e.g., querying sales for a specific
month or year), Glue can use the partition indexes to skip reading
irrelevant data files. This results in faster query execution.
SELECT *
FROM sales_records
WHERE year = 2023 AND month = 1;

Reduced Data Scanning:
● Instead of scanning the entire dataset, Glue can narrow down the data files that need to be read based on the specified partition values. This significantly reduces the amount of data scanned, improving query performance.

Benefits of Partition Indexes:

● Improved Query Performance: Partition indexes help skip unnecessary data files during query execution, leading to faster query performance.
● Resource Efficiency: By reading only the relevant partitions, the query consumes fewer resources, reducing the cost and time required for query processing.
● Optimized for Filtered Queries: Partitioning is particularly effective when queries involve filtering data based on the partitioned columns.

In summary, partition indexes in the Glue Catalog optimize query performance by allowing the Glue ETL service to skip irrelevant data when querying partitioned tables. This is especially beneficial for large datasets where selective querying based on partition columns is common.
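For reference, a partition index can also be added to an existing catalog table through the Glue API. A minimal boto3 sketch (database, table, and index names are hypothetical):

import boto3

glue = boto3.client("glue")

# Add a partition index on the year and month partition keys of an existing table.
glue.create_partition_index(
    DatabaseName="sales_db",
    TableName="sales_records",
    PartitionIndex={
        "Keys": ["year", "month"],
        "IndexName": "year_month_idx",
    },
)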

10) How does “Crawl based on events” work?


Answer: Suppose you want the crawler to crawl, and then take the necessary resulting actions, only when new files are added to the S3 bucket path the crawler is set to crawl. You can use the "Crawl based on events" option and provide an SQS queue to which the events about the new files are sent. The crawler can then read the events from the SQS queue, crawl on its schedule only the objects referenced in those events, and then take the necessary actions.
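As an illustration, an event-based crawl can be configured when the crawler is created. A minimal boto3 sketch, assuming hypothetical crawler, role, bucket, and queue names:

import boto3

glue = boto3.client("glue")

# Create a crawler that only re-crawls objects reported by S3 event notifications
# delivered to the given SQS queue, instead of re-scanning the whole prefix.
glue.create_crawler(
    Name="sales-event-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="sales_db",
    Targets={
        "S3Targets": [
            {
                "Path": "s3://your-s3-bucket/sales-data/",
                "EventQueueArn": "arn:aws:sqs:us-east-1:123456789012:sales-data-events",
            }
        ]
    },
    RecrawlPolicy={"RecrawlBehavior": "CRAWL_EVENT_MODE"},
)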

For more information, refer -


https://www.youtube.com/watch?v=HrppGyFdPjw&t=1598s

11) What is the difference between SparkContext, Spark session and GlueContext?
Answer: In the context of Apache Spark and AWS Glue, SparkContext, SparkSession,
and GlueContext are important components that serve different purposes. Let's
discuss each one:

SparkContext:
● SparkContext is the entry point for any Spark functionality. It sets up
internal services and establishes a connection to a Spark cluster. In
traditional Spark applications, you would create a SparkContext to
interact with the Spark engine.


● In AWS Glue, when using PySpark for scripting, the SparkContext is often
automatically created for you, and you don't need to explicitly create it.
You can typically access it using sc.

from pyspark.context import SparkContext

sc = SparkContext.getOrCreate()

SparkSession:
● SparkSession is a higher-level abstraction on top of SparkContext. It is
the entry point for structured and SQL functionality in Spark. A
SparkSession is required for creating DataFrames & working with SQL
queries
● In AWS Glue scripts, a SparkSession is typically automatically created for
you, and you can access it using spark.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()

GlueContext:
● GlueContext is specific to AWS Glue and is part of the Glue ETL (Extract,
Transform, Load) service. It extends the functionality of SparkContext and
SparkSession to provide additional features that are specific to AWS Glue.
● GlueContext simplifies interactions with the AWS Glue Data Catalog,
allowing you to read and write metadata about your data. It also provides
additional methods for working with DynamicFrames, which are a Glue-
specific abstraction on top of Spark DataFrames.

from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)

In summary:

● Use SparkContext for low-level Spark functionality to work with RDDs. In AWS
Glue scripts, it's often created automatically.
● Use SparkSession for higher-level structured Spark functionality, including
DataFrames and SQL operations. In AWS Glue scripts, it's often created
automatically.
● Use GlueContext for additional Glue-specific functionality, especially when
working with the AWS Glue Data Catalog and DynamicFrames in ETL jobs.

When working with AWS Glue scripts, you typically interact more with SparkSession and
GlueContext as they provide higher-level abstractions for ETL tasks.

For more info, refer - https://www.youtube.com/watch?v=-Vdgkg-UqfY&pp=ygULZ2x1ZWNvbnRleHQ%3D

12) What is a Glue Dynamic Frame?


Answer: In AWS Glue, a DynamicFrame is a higher-level abstraction on top of Apache
Spark DataFrames. It is part of the Glue ETL (Extract, Transform, Load) service and is
designed to provide a more flexible and schema-less representation of data compared
to traditional Spark DataFrames.

Key characteristics of Glue DynamicFrames include:

Schema Flexibility:
● Unlike Spark DataFrames, DynamicFrames are schema-less, allowing
them to handle semi-structured or schema-less data more effectively. This is particularly useful when dealing with data sources like JSON, where the
schema may vary.
Built for ETL:
● DynamicFrames are specifically designed for ETL operations in the AWS
Glue environment. They are the primary data structure used in Glue ETL
jobs to perform data transformations.
Support for Nested Data Structures:
● Glue DynamicFrames can represent nested data structures more easily
than traditional Spark DataFrames. This is advantageous when dealing
with complex data formats like JSON or Avro.
Ease of Use:
● Glue DynamicFrames provide a higher-level, more user-friendly API
compared to Spark DataFrames. They come with built-in transformations
that simplify common ETL tasks.
Integration with Glue Catalog:
● DynamicFrames seamlessly integrate with the AWS Glue Data Catalog,
which helps manage metadata and facilitates the discovery of data.
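For example, a DynamicFrame is typically created from a catalog table inside a Glue ETL job and then manipulated with Glue-specific transforms. A minimal sketch (database, table, and column names are hypothetical):

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read a catalog table as a DynamicFrame (no rigid schema is imposed up front).
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db",
    table_name="sales_records",
)

# Glue-specific transform: resolve a column with an ambiguous type by casting it.
dyf = dyf.resolveChoice(specs=[("sales_amount", "cast:double")])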

13) Why do we have to convert from a DynamicFrame to a Spark DataFrame?


Answer: In AWS Glue, DynamicFrames and Spark DataFrames are two representations
of distributed data in Apache Spark. Each has its own advantages and use cases. The
conversion from a DynamicFrame to a Spark DataFrame, and vice versa, might be
necessary in certain scenarios based on the operations and transformations you need
to perform in your ETL (Extract, Transform, Load) processes.

Here are some reasons why you might need to convert from a DynamicFrame to a Spark
DataFrame:

Spark SQL Operations:


● Spark DataFrames are particularly useful when you need to perform Spark
SQL operations or use DataFrame-specific transformations. If you have
SQL-like queries or need to leverage Spark's DataFrame API, converting to
a DataFrame is beneficial.
Integration with Spark Libraries:


● Some Spark libraries and functions work directly with Spark DataFrames.
If you plan to use such libraries or functions in your ETL job, you might
need to convert your data to a DataFrame.
Complex Transformations:
● If you need to perform complex transformations that are more easily
expressed using Spark SQL or DataFrame operations, converting to a
DataFrame can simplify your code.
Parquet and ORC Formats:
● Certain file formats, such as Parquet and ORC, are commonly used with
Spark DataFrames. If you plan to write your data to these formats,
converting to a DataFrame before writing might be necessary.
Better Integration with Spark Ecosystem:
● In some cases, integration with other components of the Spark
ecosystem, such as Spark MLlib for machine learning, is more seamless
with Spark DataFrames.

Here is an example of converting from a DynamicFrame to a Spark DataFrame in AWS Glue:

# Assuming dynamic_frame is your DynamicFrame
data_frame = dynamic_frame.toDF()

Converting from a DataFrame to a DynamicFrame might be necessary when you want to take advantage of AWS Glue-specific transformations, especially when dealing with semi-structured or nested data. AWS Glue provides certain transformations and features that are more easily expressed using DynamicFrames.

from awsglue.dynamicframe import DynamicFrame

# Assuming data_frame is your DataFrame and glue_context is your GlueContext
dynamic_frame = DynamicFrame.fromDF(data_frame, glue_context, "dynamic_frame_name")

In summary, the choice between DynamicFrames and DataFrames depends on your specific use case and the operations you need to perform. Conversion allows you to leverage the strengths of each representation based on your requirements.



14) What is a Glue job bookmark? What is it used for?

Answer: AWS Glue job bookmarks are used to keep track of the last processed record in a data source when an AWS Glue job runs. This allows the job to resume processing from where it left off in case of interruptions, failures, or incremental data updates.

AWS Glue tracks data that has already been processed during a previous run of an ETL job by persisting state information from the job run. This persisted state information is called a job bookmark. Job bookmarks help AWS Glue maintain state information and prevent the reprocessing of old data. With job bookmarks, you can process new data when rerunning on a scheduled interval. A job bookmark is composed of the states for various elements of jobs, such as sources, transformations, and targets. For example, your ETL job might read new partitions in an Amazon S3 file. AWS Glue tracks which partitions the job has processed successfully to prevent duplicate processing and duplicate data in the job's target data store.

Job bookmarks are implemented for JDBC data sources and Amazon Simple Storage Service (Amazon S3) sources.

The job bookmark option is passed as a parameter when the job is started. The following options can be set for job bookmarks on the AWS Glue console.


Job bookmark options and their descriptions:

● Enable: Causes the job to update the state after a run to keep track of previously processed data. If your job has a source with job bookmark support, it will keep track of processed data, and when a job runs, it processes new data since the last checkpoint.

● Disable: Job bookmarks are not used, and the job always processes the entire dataset. You are responsible for managing the output from previous job runs. This is the default.

● Pause: Process incremental data since the last successful run, or the data in the range identified by the following sub-options, without updating the state of the last bookmark. You are responsible for managing the output from previous job runs. The two sub-options are:

    ● job-bookmark-from <from-value> is the run ID which represents all the input that was processed until the last successful run before and including the specified run ID. The corresponding input is ignored.
    ● job-bookmark-to <to-value> is the run ID which represents all the input that was processed until the last successful run before and including the specified run ID. The corresponding input excluding the input identified by the <from-value> is processed by the job. Any input later than this input is also excluded for processing.

    The job bookmark state is not updated when this option set is specified. The sub-options are optional; however, when used, both sub-options need to be provided.

For Amazon S3 input sources, AWS Glue job bookmarks check the last modified time of the objects to verify which objects need to be reprocessed. If your input source data has been modified since your last job run, the files are reprocessed when you run the job again.
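In the job script itself, bookmarks only take effect when the job is initialized and committed and when sources are given a transformation_ctx. A minimal sketch of the usual pattern (database, table, and output path are hypothetical):

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)  # load the bookmark state for this job

# transformation_ctx names the checkpoint used by the bookmark for this source.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db",
    table_name="sales_records",
    transformation_ctx="read_sales",
)

glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://your-s3-bucket/processed/"},
    format="parquet",
    transformation_ctx="write_sales",
)

job.commit()  # persist the bookmark state so the next run skips already processed data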


There is a lot more to job bookmarks; to understand them in detail, I highly recommend the following resources:

Refer:

https://www.youtube.com/watch?v=AX2KC0_RZvA&t=844s
https://www.youtube.com/watch?v=XdkxI6Xs9RA
https://www.youtube.com/watch?v=BbWiomEsw5Y
https://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html
https://docs.aws.amazon.com/glue/latest/dg/programming-etl-connect-bookmarks.html

15) What are the 4 different types of workers in Glue Spark jobs and Glue Python Shell jobs? How do we determine which type of worker to choose?

Answer: In AWS Glue Spark jobs, there are four standard worker types that can be used to process data: G.1X, G.2X, G.4X, and G.8X. G.1X has 4 vCPUs and 16 GB of memory, and the remaining types scale these resources by the factor in their names. So, for example, G.4X has 4 times the vCPU and memory of G.1X.

In Glue Python Shell jobs, you cannot choose the number of workers; there are only two capacity options to choose from, 1/16 DPU and 1 DPU, where DPU stands for Data Processing Unit. A DPU is a relative measure of processing power that consists of 4 vCPUs of compute capacity and 16 GB of memory.

The choice of worker type depends on factors such as the complexity of the ETL transformations, the size of the dataset, and the performance requirements of the job. When deciding which type of worker to choose for your AWS Glue Spark job, consider the following factors:

● Data Size and Complexity: If you are dealing with a large dataset or complex ETL transformations, you may benefit from choosing a worker type with higher compute resources (e.g., G.2X, G.4X, or G.8X).
● Performance Requirements: If your job has strict performance requirements and you need faster execution, you might choose a worker type with higher compute power.
● Cost Considerations: Worker types with higher compute resources are generally more expensive. Consider your budget constraints and choose a worker type that provides the necessary resources without exceeding your cost limits.
● Trial and Error: In some cases, it may be beneficial to experiment with different worker types to determine the optimal configuration for your specific job. AWS Glue allows you to adjust the worker type easily.

AWS Glue provides flexibility in choosing the appropriate worker type based on the specific requirements of your ETL job, allowing you to balance performance and cost effectively.
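For illustration, the worker type and count are set when a job is created or updated. A minimal boto3 sketch; the job name, role, and script location are hypothetical:

import boto3

glue = boto3.client("glue")

# Create a Spark ETL job that runs on ten G.2X workers (8 vCPUs / 32 GB each).
glue.create_job(
    Name="sales-etl",
    Role="arn:aws:iam::123456789012:role/GlueJobRole",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://your-s3-bucket/scripts/sales_etl.py",
        "PythonVersion": "3",
    },
    GlueVersion="4.0",
    WorkerType="G.2X",
    NumberOfWorkers=10,
)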


16) What is Flex Execution Feature in Glue jobs?

Answer: Flex execution supports only ETL-type jobs, and no streaming sources.

Flexible execution is ideal for customer workloads that don't require fast job start times or consistent execution times. By using flexible execution, AWS Glue Studio jobs run on spare capacity in AWS instead of dedicated reserved or on-demand EC2 instances.

It is unlikely, but possible, that jobs will run significantly longer due to resource reallocation. We recommend preventing that by setting a shorter timeout; for example, 120 minutes.

Source: AWS Documentation
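For reference, the execution class can be chosen per run. A minimal boto3 sketch, assuming a hypothetical job name:

import boto3

glue = boto3.client("glue")

# Run an existing job on Flex (spare) capacity instead of standard capacity.
glue.start_job_run(
    JobName="sales-etl",
    ExecutionClass="FLEX",
    Timeout=120,  # minutes; a shorter timeout guards against unusually long Flex runs
)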

17) Explain auto scaling of Glue workers.

Answer: Auto Scaling is available for your AWS Glue ETL and streaming jobs with AWS Glue version 3.0 or later.

With Auto Scaling enabled, you will get the following benefits:

● AWS Glue automatically adds and removes workers from the cluster depending on the parallelism at each stage or micro-batch of the job run.
● It removes the need for you to experiment and decide on the number of workers to assign for your AWS Glue ETL jobs.
● You only choose the maximum number of workers, and AWS Glue will choose the right-size resources for the workload.


● You can see how the size of the cluster changes during the job run by looking at CloudWatch metrics on the job run details page in AWS Glue Studio.

Auto Scaling for AWS Glue ETL and streaming jobs enables on-demand scaling up and scaling down of the computing resources of your AWS Glue jobs. On-demand scale-up helps you to allocate only the required computing resources initially on job run startup, and also to provision the required resources on demand during the job.

Auto Scaling also supports dynamic scale-down of the AWS Glue job resources over the
course of a job. Over a job run, when more executors are requested by your Spark
application, more workers will be added to the cluster. When the executor has been idle
without active computation tasks, the executor and the corresponding worker will be
removed.

Refer:

https://docs.aws.amazon.com/glue/latest/dg/auto-scaling.html
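For illustration, Auto Scaling is typically turned on per job via a job parameter together with a maximum number of workers. A minimal boto3 sketch; the job name, role, and script location are hypothetical:

import boto3

glue = boto3.client("glue")

# Enable Auto Scaling for an existing Glue 3.0+ job; NumberOfWorkers becomes the
# maximum the job is allowed to scale up to during a run.
glue.update_job(
    JobName="sales-etl",
    JobUpdate={
        "Role": "arn:aws:iam::123456789012:role/GlueJobRole",
        "Command": {
            "Name": "glueetl",
            "ScriptLocation": "s3://your-s3-bucket/scripts/sales_etl.py",
            "PythonVersion": "3",
        },
        "GlueVersion": "4.0",
        "WorkerType": "G.2X",
        "NumberOfWorkers": 20,
        "DefaultArguments": {"--enable-auto-scaling": "true"},
    },
)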

18) How do you optimize Glue jobs?


Answer: Optimizing AWS Glue jobs is essential to ensure efficient and cost-effective
data processing. Here are some tips and best practices for optimizing Glue jobs:

Data Partitioning:
● Use data partitioning to reduce the amount of data that needs to be
processed. Partitioning can significantly improve the performance of
queries.
● Partition your data based on commonly used filters in your queries to
minimize the data scanned.
Crawler Configuration:
● Optimize the Glue crawler settings to avoid unnecessary scans of the
entire dataset. Use custom classifiers and exclusion patterns to focus on
relevant data.
DynamicFrames and Pushdown Predicates:


● Leverage DynamicFrames to handle semi-structured and nested data
efficiently.
● Use pushdown predicates to push filtering operations down to the underlying data storage, reducing the amount of data that needs to be processed (see the sketch at the end of this answer).
Choose the Right Worker Type:
● Select an appropriate worker type based on the complexity and size of
your ETL job. Experiment with different worker types to find the optimal
balance between performance and cost.
Use Parquet or ORC Formats:
● When possible, use columnar storage formats like Parquet or ORC. These
formats can significantly improve performance and reduce the amount of
data scanned.
Optimize Transformations:
● Minimize unnecessary transformations and filters to reduce the
processing time.
● Use built-in functions and expressions whenever possible, as they are
optimized for performance.
Use Glue ETL Python and Scala Scripts:
● For complex transformations, consider using Glue ETL Python or Scala
scripts instead of relying solely on the visual ETL interface. This can
provide more control over the execution flow and optimize performance.
Manage Connections and Partitions in JDBC Sources:
● When dealing with JDBC sources, manage the number of connections and
partitions appropriately to avoid overwhelming the source database and to
ensure efficient data retrieval.
Monitoring and Logging:
● Regularly monitor job runs and use AWS CloudWatch Logs for detailed
logging.
● Identify and address any performance bottlenecks or errors in the job
runs.
Provision Adequate Resources:
● Ensure that your Glue job has sufficient resources (memory, CPU, and
DPUs) based on the complexity and size of your data. Adjust the worker
type and count accordingly.


Always test and iterate on your optimizations to find the best configuration for your
specific use case.
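As an example of the pushdown-predicate tip above, a partition filter can be passed when reading from the catalog so that only the matching partitions are listed and read. A minimal sketch; the database, table, and output path are hypothetical:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# push_down_predicate is evaluated against the partition columns (year, month),
# so Glue reads only the partitions that satisfy the filter.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db",
    table_name="sales_records",
    push_down_predicate="year = 2023 AND month = 1",
)

# Write the filtered data back out in a columnar format (another tip above).
glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://your-s3-bucket/optimized/"},
    format="parquet",
)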

19) What are the key components of AWS Glue?


Answer:
● AWS Glue Data Catalog: A centralized metadata repository for data
assets.
● Crawlers: Discover and define schema for data sources.
● Jobs: Execute ETL workflows using Spark or custom code.
● Triggers: Schedule or event-based job execution.
● Workflows: Create and visualize complex extract, transform, and load
(ETL) activities involving multiple crawlers, jobs, and triggers. Each
workflow manages the execution and monitoring of all its jobs and
crawlers. The AWS Glue console provides a visual representation of a
workflow as a graph.
● Transformers: Built-in and custom data transformation functions.

20) What are the different types of triggers in AWS Glue?


Answer: On-demand, schedule-based (time-based), and event-based triggers. Event-based triggers include conditional triggers, which fire when watched jobs or crawlers reach specified states, and, within workflows, triggers started by Amazon EventBridge events.
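For illustration, a scheduled trigger that starts a job could be created as follows. A minimal boto3 sketch; the trigger and job names are hypothetical:

import boto3

glue = boto3.client("glue")

# Run the sales-etl job every day at 02:00 UTC.
glue.create_trigger(
    Name="nightly-sales-etl",
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",
    Actions=[{"JobName": "sales-etl"}],
    StartOnCreation=True,
)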


Thank you so much for reading this document. I genuinely wish
you all the best in your AWS Data Engineering interviews.
- Sachin Chandrashekhar

Follow me on LinkedIn and click the bell 🔔

LinkedIn: https://www.linkedin.com/in/sachincw/

I conduct Real-world AWS Data Engineering (RADE) Programs.


Get on the waitlist
AWS RADE Waitlist: https://masterclass.sachin.cloud

I also post updates regularly on


WhatsApp Community:
https://chat.whatsapp.com/FAqHgo4YpUsLFScpiMvtSF

Look at other resources at:


Top mate link: https://lnkd.in/d28ETqaN
