Assignment AWS Glue
Assignment AWS Glue
2). To connect to the AWS Management Console, choose the AWS link in the upper-left
corner.
Tip: If a new browser tab does not open, a banner or icon is usually at the top of your
browser with the message that your browser is preventing the site from opening pop-up
windows. Choose the banner or icon, and then choose Allow pop-ups.
Theem College of Engineering, Boisar
s3://noaa-ghcn-pds/csv/by_year/
Subsequent crawler runs: Choose Crawl all sub-folders.
Choose Add an S3 data source.
Choose Next.
For Existing IAM role, choose gluelab.
This role was provided in the lab environment for you. For reference, see the lab's
CloudFormation template. The following is the YAML snippet for this role:
GlueLab:
Type: AWS::IAM::Role
Properties:
RoleName: "gluelab"
Path: "/"
AssumeRolePolicyDocument:
Version: 2012-10-17
Statement:
- Effect: Allow
Principal:
Theem College of Engineering, Boisar
Service:
- glue.amazonaws.com
Action:
- sts:AssumeRole
ManagedPolicyArns:
- arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
- arn:aws:iam::aws:policy/AmazonS3FullAccess
Choose Next.
In the Output configuration section, choose Add database.
A new browser tab opens.
For Name, enter weatherdata
Choose Create database.
Return to the browser tab that is open to the Set output and scheduling page in the
AWS Glue console.
For Target database, choose the weatherdata database that you just created.
Tip: To refresh the list of available databases, choose the refresh icon to the right of the
dropdown list.
In the Crawler schedule section, for Frequency, keep the default On demand.
Choose Next.
Confirm your crawler configuration is similar to the following.
Important: Wait for the status to change to Ready before moving to the next step. This
will take about 3 minutes.
AWS Glue creates a table to store metadata about the GHCN-D dataset. Next, you will
inspect the data that AWS Glue captured about the data source.
Theem College of Engineering, Boisar
Review the metadata that the weather crawler captured, as shown in the following
screenshot. The schema lists the columns that the crawler discovered in the imported
dataset.
Now you will edit the schema of the database, which is part of transforming data in the
ETL process.
Theem College of Engineering, Boisar
From the Actions menu in the upper-right corner of the page, choose Edit schema.
Change the column names according to the following table.
To change a column name, select the check box for the item that you want to modify,
and then choose Edit.
In the window that opens, change the value for the Name, and then choose Edit. Re-
peat these steps for each column name.
Date Date
element Type
data_value Observation
m_flag Mflag
q_flag Qflag
s_flag Sflag
obs_time Time
The schema for the table now looks like the following screenshot.
Theem College of Engineering, Boisar
The Athena console opens. Notice the error message that indicates that an output loca-
tion was not provided. Before you run a query in Athena, you need to specify an S3
bucket to hold query results.
Select Choose.
Keep the default settings for the other options, and choose Save.
Theem College of Engineering, Boisar
Tip: To view the column names and their data types in this table, choose the icon to the
left of the table name.
The first 10 records from the weatherdata table display, similar to the following screen-
shot:
Notice the run time and amount of data that was scanned for the query. As you develop
more complex applications, it is important to minimize resource consumption to optimize
costs. You will see examples of how to optimize cost for Athena queries later in this
task.
Theem College of Engineering, Boisar
First, you need to retrieve the name of the bucket that was created for you to store this
data.
In the search box next to Services, search for and choose S3.
In the Buckets list, copy the bucket name that contains glue-1950-bucket to a text
editor of your choice.
Return to the Athena query editor.
Copy and paste the following query into a query tab in the editor. Replace <glue-1950-
bucket> with the name of the bucket that you recorded:
Choose Run.
After the query runs, the run time and data scanned values are similar to the following:
First, create a view that only includes the maximum temperature reading, or TMAX, value.
To preview the results, in the Views section, to the right of the tmax view, choose
the ellipsis icon, and then choose Preview View.
Theem College of Engineering, Boisar
The purpose of this query is to calculate the average maximum temperature for each
year in the dataset.
After the query runs, the run time and data scanned values are similar to the following:
12. Find the Amazon Resource Number (ARN) for the gluelab IAM role. You need this ARN
to deploy the CloudFormation template.
o In the search box next to Services, search for and choose IAM to open the IAM con-
sole.
o In the navigation pane, choose Roles.
o Choose the link for the gluelab role.
A new browser tab opens and displays the AWS Cloud9 IDE.
Theem College of Engineering, Boisar
AWSTemplateFormatVersion: '2010-09-09'
Parameters: # The name of the crawler to be created
CFNCrawlerName:
Type: String
Default: cfn-crawler-weather
CFNDatabaseName:
Type: String
Default: cfn-database-weather
CFNTablePrefixName:
Type: String
Default: cfn_sample_1-weather
# Resources section defines metadata for the Data Catalog
Resources:
# Create a database to contain tables created by the crawler
CFNDatabaseWeather:
Type: AWS::Glue::Database
Properties:
CatalogId: !Ref AWS::AccountId
DatabaseInput:
Name: !Ref CFNDatabaseName
Description: "AWS Glue container to hold metadata tables for the weather crawler"
#Create a crawler to crawl the weather data on a public S3 bucket
CFNCrawlerWeather:
Type: AWS::Glue::Crawler
Properties:
Name: !Ref CFNCrawlerName
Role: <GLUELAB-ROLE-ARN>
#Classifiers: none, use the default classifier
Description: AWS Glue crawler to crawl weather data
#Schedule: none, use default run-on-demand
DatabaseName: !Ref CFNDatabaseName
Targets:
S3Targets:
# Public S3 bucket with the weather data
- Path: "s3://noaa-ghcn-pds/csv/by_year/"
TablePrefix: !Ref CFNTablePrefixName
SchemaChangePolicy:
UpdateBehavior: "UPDATE_IN_DATABASE"
DeleteBehavior: "LOG"
Theem College of Engineering, Boisar
Configuration:
"{\"Version\":1.0,\"CrawlerOutput\":{\"Partitions\":{\"AddOrUpdateBehavior\":\"InheritFromTable\"},\"T
ables\":{\"AddOrUpdateBehavior\":\"MergeNewColumns\"}}}"
Theem College of Engineering, Boisar
15. To validate the CloudFormation template, run the following command in the AWS
Cloud9 terminal:
Note: If you receive an error that says YAML not well-formed, check the value for the
name of the gluelab role. Also check the tabs and spacing for each line. YAML docu-
ments require exact spacing, and the parser will encounter errors if the spacing doesn't
match.
{
"Parameters": [
{
"ParameterKey": "CFNCrawlerName",
"DefaultValue": "cfn-crawler-weather",
"NoEcho": false
},
{
"ParameterKey": "CFNTablePrefixName",
"DefaultValue": "cfn_sample_1-weather",
"NoEcho": false
},
{
"ParameterKey": "CFNDatabaseName",
"DefaultValue": "cfn-database-weather",
"NoEcho": false
}
]
}
Note: The command includes the --capabilities parameter with the CAPABIL-
ITY_NAMED_IAM capability. This is because you are creating the following resources
with custom names, which affect permissions:
If the stack is validated, the CloudFormation ARN displays in the output, similar to the
following:
{
"StackId": "arn:aws:cloudformation:us-east-1:338778555682:stack/gluecrawler/2d8cec90-5c42-11ec-8fbf-
12034b0079a5"
}
The CloudFormation create-stack command creates the stack and deploys it. If valida-
tion passes and nothing causes the stack creation to roll back, proceed to the next step.
Tip: To check the progress of stack creation, navigate to the CloudFormation console.
In the navigation pane, choose Stacks.
Theem College of Engineering, Boisar
17. To verify that the AWS Glue database was created in the stack, run the following
command:
{
"DatabaseList": [
{
"Name": "cfn-database-weather",
"Description": "AWS Glue container to hold metadata tables for the weather crawler",
"Parameters": {},
"CreateTime": 1649267047.0,
"CreateTableDefaultPermissions": [
{
"Principal": {
"DataLakePrincipalIdentifier": "IAM_ALLOWED_PRINCIPALS"
},
"Permissions": [
"ALL"
]
}
],
"CatalogId": "034140262343"
},
{
"Name": "weatherdata",
"CreateTime": 1649263434.0,
"CreateTableDefaultPermissions": [
{
"Principal": {
"DataLakePrincipalIdentifier": "IAM_ALLOWED_PRINCIPALS"
},
"Permissions": [
"ALL"
]
}
],
"CatalogId": "034140262343"
}
]
}
Theem College of Engineering, Boisar
{
"CrawlerNames": [
"Weather",
"cfn-crawler-weather"
]
}
{
"Crawler": {
"Name": "cfn-crawler-weather",
"Role": "WeatherCrawler-001-CFNRoleWeather-17WB9OM5H5MFL",
"Targets": {
"S3Targets": [
{
"Path": "s3://noaa-ghcn-pds/csv/by_year/",
"Exclusions": []
}
],
"JdbcTargets": [],
"MongoDBTargets": [],
"DynamoDBTargets": [],
"CatalogTargets": [],
"DeltaTargets": []
},
"DatabaseName": "cfn-database-weather",
"Description": "AWS Glue crawler to crawl weather data",
"Classifiers": [],
"RecrawlPolicy": {
"RecrawlBehavior": "CRAWL_EVERYTHING"
},
"SchemaChangePolicy": {
"UpdateBehavior": "UPDATE_IN_DATABASE",
"DeleteBehavior": "LOG"
Theem College of Engineering, Boisar
},
"LineageConfiguration": {
"CrawlerLineageSettings": "DISABLE"
},
"State": "READY",
"TablePrefix": "cfn_sample_1-weather",
"CrawlElapsedTime": 0,
"CreationTime": 1649083535.0,
"LastUpdated": 1649083535.0,
"Version": 1,
"Configuration":
"{\"Version\":1.0,\"CrawlerOutput\":{\"Partitions\":{\"AddOrUpdateBehavior\":\"InheritFromTable\"},\"T
ables\":{\"AddOrUpdateBehavior\":\"MergeNewColumns\"}}}",
"LakeFormationConfiguration": {
"UseLakeFormationCredentials": false,
"AccountId": ""
}
}
}
Theem College of Engineering, Boisar
Task 4: Reviewing the IAM policy for Athena and AWS Glue access
In the search box to the right of Services, search for and choose IAM to open the IAM
console.
In the navigation pane, choose Users.
Note that mary is one of the IAM users that is listed. This user is part of the DataSci-
enceGroup IAM group.
The Policy-For-Data-Scientists details page opens. Review the permissions that are as-
sociated with this policy. Notice that the permissions provide limited access for only the
Athena, AWS Glue, and Amazon S3 services.
Theem College of Engineering, Boisar
Task 5: Confirming that Mary can access and use the AWS Glue crawler
20. Retrieve the credentials for the mary IAM user, and store these as bash variables.
o In the search box next to Services, search for and choose CloudFormation.
o In the navigation pane, choose Stacks.
o Choose the link for the stack that created the lab environment. The stack name includes
a random string of letters and numbers, and the stack should have the oldest creation
time.
o On the stack details page, choose the Outputs tab.
Note: When you create a CloudFormation template, you can choose to output infor-
mation about the resources that the template will create. The CloudFormation template
that created the resources in your lab environment output the access key and secret
access key for the mary user.
AK=<ACCESS-KEY>
SAK=<SECRET-ACCESS-KEY>
To test whether the mary user can perform a specific command, you can pass the us-
er's credentials as bash variables (AK and SAK) with the command. The API will then
try to perform that command as the specified user.
Theem College of Engineering, Boisar
The output is similar to the following and looks like the output that was displayed after
you ran the command earlier:
{
"CrawlerNames": [
"Weather",
"cfn-crawler-weather"
]
}
o To test whether the mary user can perform the get-crawler command, run the following
command:
The output is similar to the following and looks like the output that was displayed after
you ran the command earlier. Note that the state of the crawler is READY, but no status
information is displayed. This is because the crawler hasn't run yet.
{
"Crawler": {
"Name": "cfn-crawler-weather",
"Role": "gluelab",
"Targets": {
"S3Targets": [
{
"Path": "s3://noaa-ghcn-pds/csv/by_year/",
"Exclusions": []
}
],
"JdbcTargets": [],
"MongoDBTargets": [],
"DynamoDBTargets": [],
"CatalogTargets": [],
"DeltaTargets": []
},
"DatabaseName": "cfn-database-weather",
"Description": "AWS Glue crawler to crawl weather data",
Theem College of Engineering, Boisar
"Classifiers": [],
"RecrawlPolicy": {
"RecrawlBehavior": "CRAWL_EVERYTHING"
},
"SchemaChangePolicy": {
"UpdateBehavior": "UPDATE_IN_DATABASE",
"DeleteBehavior": "LOG"
},
"LineageConfiguration": {
"CrawlerLineageSettings": "DISABLE"
},
"State": "READY",
"TablePrefix": "cfn_sample_1-weather",
"CrawlElapsedTime": 0,
"CreationTime": 1649267047.0,
"LastUpdated": 1649267047.0,
"Version": 1,
"Configuration":
"{\"Version\":1.0,\"CrawlerOutput\":{\"Partitions\":{\"AddOrUpdateBehavior\":\"InheritFromTable\"},\"T
ables\":{\"AddOrUpdateBehavior\":\"MergeNewColumns\"}}}",
"LakeFormationConfiguration": {
"UseLakeFormationCredentials": false,
"AccountId": ""
}
}
}
Theem College of Engineering, Boisar
21. Test that the mary user can run the crawler.
o Run the following command.
If the crawler runs successfully, the terminal doesn't display any output.
o To observe the crawler running and adding data to the table, navigate to the AWS Glue
console.
o In the navigation pane, choose Crawlers.
Here you can see status information for the crawler, as shown in the following screen-
shot.
When the status changes to Ready, the crawler is finished running. It might take a few
minutes.
o To confirm that the crawler is finished running, run the following command.
{
"Crawler": {
"Name": "cfn-crawler-weather",
"Role": "gluelab",
"Targets": {
"S3Targets": [
{
"Path": "s3://noaa-ghcn-pds/csv/by_year/",
"Exclusions": []
}
],
"JdbcTargets": [],
"MongoDBTargets": [],
"DynamoDBTargets": [],
"CatalogTargets": [],
"DeltaTargets": []
},
"DatabaseName": "cfn-database-weather",
"Description": "AWS Glue crawler to crawl weather data",
"Classifiers": [],
"RecrawlPolicy": {
"RecrawlBehavior": "CRAWL_EVERYTHING"
},
Theem College of Engineering, Boisar
"SchemaChangePolicy": {
"UpdateBehavior": "UPDATE_IN_DATABASE",
"DeleteBehavior": "LOG"
},
"LineageConfiguration": {
"CrawlerLineageSettings": "DISABLE"
},
"State": "READY",
"TablePrefix": "cfn_sample_1-weather",
"CrawlElapsedTime": 0,
"CreationTime": 1649267047.0,
"LastUpdated": 1649267047.0,
"LastCrawl": {
"Status": "SUCCEEDED",
"LogGroup": "/aws-glue/crawlers",
"LogStream": "cfn-crawler-weather",
"MessagePrefix": "5ef3cff5-ce6c-45d5-8359-e223a4227570",
"StartTime": 1649267649.0
},
"Version": 1,
"Configuration":
"{\"Version\":1.0,\"CrawlerOutput\":{\"Partitions\":{\"AddOrUpdateBehavior\":\"InheritFromTable\"},\"T
ables\":{\"AddOrUpdateBehavior\":\"MergeNewColumns\"}}}",
"LakeFormationConfiguration": {
"UseLakeFormationCredentials": false,
"AccountId": ""
}
}
}
Notice that the LastCrawl section is included, and the status in that section is SUC-
CEEDED. This means that Mary was able to run the crawler successfully.
Theem College of Engineering, Boisar
23. To record your progress, choose Submit at the top of these instructions.