Module Big Data 1 BDAHQLB - Part
Module Big Data 1 BDAHQLB - Part
CDOSS Certificate
Big Data Analytics with Hive
Query Language and Beeline
© Dr Heni Bouhamed
Big Data Trainer
Senior Lecturer at Sfax University
Senior Lecturer at ESTYA University (France)
Cloudera Instructor at Elitech Paris
Certified CDOSS Big Data Instructor
Heni.bouhamed@fsegs.usf.tn
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
1
6/24/2021
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Interactive execution :
$ beeline –u jdbc:hive2://
-n (user name) u1
-p (password) hadoop
!quit to exit (!q also)
/databasesname
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q1: Once you're in the query editor in Hive or Beeline, Which of the following is a way to get the
schema of the makers table?
• Execute the command use toy; to make toy database as the active database and then run the
command DESCRIBE makers;
• Select the toy database as the active database and then run the command SHOW TABLE makers;
• Select the toy database as the active database and then run the command SHOW TABLES;
• Select the toy database as the active database and then run the command SHOW DATABASE;
• Select the toy database as the active database and then run the command DESCRIBE TABLES;
2
6/24/2021
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q2: Which of the following provide a user interface where you can enter and run SQL queries?
Check all that apply.
• Tableau
• Hue
• Linux
• Beeline
• ODBC
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
• Beeline
• Hive
3
6/24/2021
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q4: Which is a required argument for the beeline command on the VM?
• -u for user
• -u for URL
• -c for connect
• -n for name
• -p for password
• -c for command
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q5: Which are valid statements or commands once you are in Beeline? Check all that apply.
• !quit;
• !quit
4
6/24/2021
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
• SELECT………FROM…….;
Operator(s)
+ - * / %
Priority from left to right string decimal
/ % priority to - +
‘hello’+5 name/2 -name inventor*list_price
Expression with column reference and operator
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Data types
• Numeric • Character
Integer Data Types - String
- tinyint: -128 to 127 - Char
- smallint: -32768 to 32767 - Varchar
- Int: -2147483648 to 2147483647
- Bigint: -9.2 quintillion to 9.2 quintillion
• Decimal data types
- Float
- Double
- Decimal
5
6/24/2021
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Alias
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
6
6/24/2021
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Casting
Select concat(name, ‘is for’, min_age, ‘or older’ from games;
Hive does the casting implicitly
But it is better to caste min_age with cast(min_age as string)
Distinct
Select distinct min_age from games; only once at the beginning, ALL is its opposite .
Possible with several columns, possible with * and possible with functions Select distinct
concat(substring(year,1,3), «0s»)
Select distinct min_age, max_players min_age Max_players FROM games;
FROM games; 8 6
min_age Max_players 8 4
Concat(substring(year,1,3), « 0s »)
8 6 3 4
1900s
8 4 10 5
1950s
8 6
1930s
3 4
1940s
10 5
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
From
From database.table replace use databases;
• Data bases identifiers
From database.table replace use databases;
Identifiers possibilities Some examples of reserved words
• Letters, digits, underscores • FROM
• Letter for first character • AS
• Lowercase letters • DISTINCT
• Max length varies (recommend fewer than 30) • SHOW
• USE
www.tiny.cloudera.com/hive-reserved-words
You can use them with back quote:
7
6/24/2021
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
• SELECT
• FROM
• games
• FROM games
• SELECT name, min_players, max_players FROM games; (that is, the whole statement)
• name, min_players, max_players
• SELECT name, min_players, max_players
8
6/24/2021
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q2: The table toys has columns id (integer), name (string), price (decimal), and maker_id (integer). For
which of the following statements is the SELECT list a literal string? Check all that apply.
• SELECT Lite-Brite FROM toys
• SELECT 'toys'
• SELECT toys
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q3: The table toys has columns id (integer), name (string), price (decimal), and
maker_id (integer). Which of the following are valid SELECT statements? Check all that
apply. (You might want to use the VM and try them!)
9
6/24/2021
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q4: The following expression will cause an error when used in a SELECT statement:
"-7.5" % 3.2
What is the error? (You might want to use the VM and try it!)
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q5: The profit for an item sold can be found using the formula profit = price – cost. You want to
query an inventory table with columns name, price, and cost, and get the following result:
name profit
Widget 38 15
Widget 38e 49
Gadget 2000 72
Which SELECT statements will produce that table? Check all that apply.
• SELECT name, price - cost, profit FROM inventory
• SELECT name, price - cost profit FROM inventory
• SELECT name, price - cost AS profit FROM inventory
• SELECT name, price - cost (AS profit) FROM inventory
• SELECT name, price - cost FROM inventory
• SELECT name, profit FROM inventory
10
6/24/2021
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
SELECT round(3.47) ;
SELECT ceil(-2.47);
-2
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
• SELECT 3^4;
• SELECT pow(3,4);
• SELECT abs(3,4);
• SELECT round(3,4);
11
6/24/2021
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q8: First, try writing a SELECT statement to answer the question: When will each of the games in the
games table turn one hundred years old? In other words, what year will mark the one hundredth
anniversary of the invention of each game? Write a SELECT statement that answers this question, and run
it with Hive. You should include the names of the games in the result set. Then answer the following
question:
Which game has already had its 100th anniversary?
• Candy Land
• Scrabble
• Monopoly
Using SELECT name, cast(year AS INT) + 100 as century_year FROM games; shows that
• Risk Monopoly turned 100 in 2003.
• Clue
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q9: How many unique values are there in the min_players column of the fun.games
table? Write and run a SQL query to check.
You should get only one row if you run SELECT DISTINCT min_players FROM games;
12
6/24/2021
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q10: You are working in the default database and want to list all the data in the card_rank
table, which is in the fun database. Which of the following allow you to do that? Check all
that apply. (You might want to try this in the VM.)
• Change the current database to fun and run SELECT * FROM card_rank;
• Change the current database to card_rank and run SELECT * FROM fun;
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q11: True or false: In a SELECT statement executed with Hive, identifiers (such as names of
tables and columns) will work regardless of their case (upper, lower, or a mix).
• True
• False
13
6/24/2021
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q12: Which are true of the formatting of SELECT statements (for example, how they would
look in the Hive query editor)? Check all that apply.
• Newlines (line or paragraph breaks) can only be added just before a new keyword
• Indenting with a tab character is necessary when a clause is too long for a single line
• By convention, indent clauses every time you start them on a new line
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q13: Which is a correct command for running a SQL query through Beeline in a command line argument?
• beeline -u jdbc:hive2:// -e 'SELECT * FROM fun.games; !quit;'
14
6/24/2021
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q14: Here are some commands for you to try in the VM. The $ means you should type this at
your command line prompt, not within Beeline.
• $ beeline -h
• $ beeline -u jdbc:hive2:// -e 'SELECT * FROM fun.games; !quit;'
• $ beeline -u jdbc:hive2:// -e 'USE fun; SELECT * FROM games; !quit;'
• $ beeline -u jdbc:hive2:// --silent=true -e 'SELECT * FROM fun.games; !quit;'
Before doing more commands, first create a file and put the following queries in it and save it in
your current directory (unless you specify otherwise).
• USE fun;
• SELECT * FROM games;
• SELECT name, list_price, 0.8*list_price AS discounted_price FROM games; !quit
Then try these commands. Compare what happens.
• $ beeline -u jdbc:hive2:// -f commands.sql
• $ beeline -u jdbc:hive2:// --silent=true -f commands.sql
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q15: Suppose you want your query results to output to the terminal in this format:
1,Monopoly,Elizabeth Magie,1903,8,2,6,19.99
2,Scrabble,Alfred Mosher Butts,1938,8,2,4,17.99
3,Clue,Anthony E. Pratt,1944,8,2,6,9.99
4,Candy Land,Eleanor Abbott,1948,3,2,4,7.99
5,Risk,Albert Lamorisse,1957,10,2,5,29.99
15
6/24/2021
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q16: Which is a correct command for saving query results as a comma-delimited file?
Check all that apply. (Try these in the VM, and see what error messages say—you might
learn something new to try!)
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Operands :
• Alias in SELECT are not allowed in WHERE (the engine
• = != <> < > <= >=
starts by executing WHERE)
• The elements of an operand must be in the same large
• NOT AND OR
family (numeric, character string).
0 et 1 for :
16
6/24/2021
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Missing values
possible processing IS DISTINCT FROM
IS NOT DISTINCT FROM <=>
IS NULL
IS NOT NULL
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Missing values possible processing
17
6/24/2021
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
File instanciation
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Scripts :
- Clearer vision
- Scheduler of periodic executions
- Do not rewrite the same script
- Usage inside code (python *)
- .Sh extension (chmod 755)
- ./name.sh for execution
*Import subprocess
Subprocess.call([!/script.sh’])
18
6/24/2021
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q1: A table has 100 rows. You use a SELECT statement with a
WHERE clause to query the table. Which best describes how
many rows the result set must have?
• 100 or fewer
• 100 or more
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q2: For which of these tasks would you need to use a WHERE clause?
• For a table of web logs which show the IP addresses of every visit, removing
rows with duplicate IP addresses
• For a table of pets, including their owners and ages, finding the range of
values in their ages
• For a table of inventory items, including quantity and price, finding all
inventory items priced under $5
• For a table that includes which of many offices each employee works,
finding all the employees in the Chicago office
19
6/24/2021
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q3: True or false: To use a WHERE clause that filters a table based on
the value of column_x, the SELECT list must include column_x.
• True
• False
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q4: Write and run a query on the wax.crayons table to find all
the crayon colors with a value for the column red that is less
than 110. How many rows are returned?
20
6/24/2021
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q5: The following shows just a few rows from a table for students in a school. (GPA is grade
point average, where 4.0 means the student is getting the highest scores possible. Absences is
how many days the student has not attended school, and detention is a punishment for bad
behavior.)
id name age gpa absences detentions
930 Olufunmilayo 16 4.00 3 2
Ayton
667 Vincent 15 2.53 12 0
Michaelson
907 Asa Quigg 15 3.57 1 0
168 Kiran Patil 17 3.28 0 3
You're asked to find the most dedicated students to represent the school at a state-wide
meeting. Which of the following might be appropriate, even though they might give different
results? Check all that apply.
• SELECT name ... WHERE gpa >= 3.5;
• SELECT name ... WHERE detentions = 0;
• SELECT name ... WHERE id < 200;
• SELECT name ... WHERE absences = 0;
• SELECT name ... WHERE absences > detentions;
• SELECT name ... WHERE age = 17;
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q6: Which of the following evaluate as true? Check all that apply.
• 8 * -3 != -30 + 5
• 3 >= 1
• 2 * -12 > 6 * -4
• 10 != 10
21
6/24/2021
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q7: Which of the following crayon colors have exactly the same red and green values? Use
the VM to query the wax.crayons table. Check all that apply.
• Black
• Blue Bell
• Canary
• Laser Lemon
• Olive Green
• Spring Green
• Unmellow Yellow
• White
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q8: How many of the crayon colors have more blue than red in the
R-G-B color model?
22
6/24/2021
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q9: Which of the following crayon colors are dark, that is, the sum of red, green,
and blue values will be less than 325? (Not all colors meeting the criteria are listed.)
Check all that apply.
Note: Although it's not needed to answer the question, try writing a query whose
results include a column named dark, which is true when the sum is less than 325.
The result set for this query should only show rows where dark is true.
• Denim
• Eggplant
• Electric Lime
• Outer Space
• Plum
• Red
• Sepia
• Tropical Rain Forest
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q10: You have a table with integer columns int_x and int_y. Which expressions are
valid in SQL? Check all that apply.
• int_x NOT = 2
• int_x OR int_y = 3
• int_x != 2
• NOT int_x = 2
23
6/24/2021
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q11: You want a list of students who have a GPA of at least 3.5, and who have either no more than 3
detentions or more than 5 absences. Which queries will accomplish this? Check all that apply.
Please read the question carefully before answering; this is an unusual set of criteria!
• SELECT * FROM students WHERE gpa >= 3.5 AND NOT (detentions > 3 OR absences > 5)
• SELECT * FROM students WHERE gpa >= 3.5 AND (detentions <= 3 OR absences > 5)
• SELECT * FROM students WHERE (gpa >= 3.5 AND NOT detentions > 3) OR absences > 5
• SELECT * FROM students WHERE gpa >= 3.5 AND (NOT detentions > 3 OR absences > 5)
• SELECT * FROM students WHERE gpa >= 3.5 AND NOT detentions > 3 OR absences > 5
• SELECT * FROM students WHERE (gpa >= 3.5 AND detentions <= 3) OR absences > 5
• SELECT * FROM students WHERE gpa >= 3.5 AND detentions <= 3 OR absences > 5
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q12: Run a query on the VM using the IN operator to find the smallest pack of crayons
that includes all three of Plum, Salmon, and Vivid Tangerine. (Be careful—remember
that the pack column gives the smallest pack that includes a particular color. Every
pack that's larger than that also includes that color, but no packs that are smaller do.
You want the smallest pack that includes all three of these colors.) Enter the size of the
pack below.
24
6/24/2021
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q13: Run a query on the VM to find which of the following crayon colors has a
red value between 75 and 125 and a blue value between 125 and 175. Check all
that apply.
• Forest Green
• Manatee
• Royal Purple
• Screamin' Green
• Shadow
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q14: Write a query to return all rows for a flight in the flights table with
the following information: It departed on January 15, 2009, the carrier is
capital letters US, the flight number is 1549, and the origin airport is
capital letters LGA. Which column or columns in this row have NULL
values? Check all that apply.
• air_time
• arr_delay
• arr_time
• dep_time
25
6/24/2021
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q15: How many different games in the fun.inventory table are located
in Aisle 3 of the Dicey shop? Check the best answer.
shop game aisle
Dicey Monopoly 3
Dicey Clue NULL
Board 'Em Monopoly 2
Board 'Em Candy Land 2
Board 'Em Risk 1
• None
• One
• At least one
• At most one
• More than one
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q16: On the VM, write and run a SELECT statement that queries the
fly.flights table and returns all the rows representing flights on January
15, 2009 that have non-missing departure time (dep_time) and
missing arrival time (arr_time). You’ll need to use both IS NULL and IS
NOT NULL to do this.
26
6/24/2021
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q17: Which SELECT statements return all the rows in fly.flights in which
dep_delay and arr_delay are equal or both missing?
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q18: What value does this CASE expression return when size = 40?
CASE WHEN size >= 34 THEN 'small'
WHEN size >= 38 THEN 'medium'
WHEN size >= 42 THEN 'large'
WHEN size >= 46 THEN 'other'
ELSE 'other'
END
• 'small'
• 'medium'
• 'large'
• 'other'
• None, the expression causes an error
27
6/24/2021
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q19: Which expression(s) are equivalent to: nullif(air_time, 0)? Check all that
apply.
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
28
6/24/2021
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q21: Suppose that, as you are working, you need to run a bash script
query_script.sh with a SQL query in it. (That is, you want to run it now, not
schedule it for later.) You have never run this script before. Which of the
following is necessary to run the script? Check all that apply. (Note that the
order provided might not match the order in which you need to proceed.)
• Run the script from the command line using $ bash query_script.sh (assuming
it is in your current directory)
• Run the script from Beeline shell using BASH query_script.sh; (assuming it is
in your current directory)
• Run the script from the command line using $ ./query_script.sh (assuming it
is in your current directory)
• Give permission to the script using chmod
• Run the script from Beeline shell using RUN query_script.sh; (assuming it is in
your current directory)
• Use the root or superuser privileges when issuing the run command, so the
script has permissions to run
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
• Aggregation
GROUP BY
Aggregation functions
Example : MAX, MIN, AVG, SUM… WHERE preced GROUP BY
Row filtering before grouping
Aggregation functions and scalar functions
(example: ABS, ROUND ...) never together
• AVG(list_price), ABS(list_price)
• SELECT salary-AVG(salary)
29
6/24/2021
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Missing values
30
6/24/2021
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Alias
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
• Best practice : use the pushdown for Big Data (with aggregations,
groupings and filters) before using them with BI tools to avoid crashes
and exorbitant transfer costs
31
6/24/2021
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q1: Below is a table with four rows. What is the value of SUM(items)
for this table?
order_id items total
829 3 38.92
220 5 107.06
1043 2 19.98
762 1 20.49
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q2: What is the average list price of the games in the fun.games table
in US dollars? Use the virtual machine (VM) to calculate this.
32
6/24/2021
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
• 1147.43
Your query should look like
SELECT round(AVG(list_price) * 66.75, 2) FROM fun.games;
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q4: You could also have rounded the converted amounts and then
found the average; in this case, both calculations return the same
value.
Which of the following statements are valid? Check all that apply.
33
6/24/2021
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q5: The flights dataset includes the distance (in miles) and time (in
minutes) in the air for the included flights. Write and run a query to find
the average air speed, in miles per hour, of only those flights that were in
the air for longer than 60 minutes. Report to the nearest mile per hour.
(Hints: Speed is distance divided by time. Remember that the time is in
minutes, and this problem asks for speed in miles per hour. Your answer
should be an integer.)
• 428
The query should look like
SELECT round(AVG(distance/air_time)*60) FROM flights
WHERE air_time > 60;
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q6: Here is a portion of the fun.games table. The table has 8 columns, but not all are shown (for space
considerations).
id name inventor year min_age ...
1 Monopoly Elizabeth Magie 1903 8 ...
2 Scrabble Alfred Mosher Butts 1938 8 ...
3 Clue Anthony E. Pratt 1944 8 ...
4 Candy Land Eleanor Abbott 1948 3 ...
5 Risk Albert Lamorisse 1957 10 ...
How many columns and rows does the result of this query have? SELECT min_age, COUNT(*) FROM
fun.games GROUP BY min_age; Please attempt to answer this question without actually running the query.
• 2 columns, 1 row
• 2 columns, 3 rows
• 2 columns, 5 rows
• 8 columns, 1 row
• 8 columns, 3 rows
• 8 columns, 5 rows
• 10 columns, 1 row
• 10 columns, 3 rows
• 10 columns, 5 rows
34
6/24/2021
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q7: Here is a portion of the fun.games table. The table has 8 columns, but
not all are shown (for space considerations).
id name inventor year min_age ...
1 Monopoly Elizabeth Magie 1903 8 ...
2 Scrabble Alfred Mosher Butts 1938 8 ...
3 Clue Anthony E. Pratt 1944 8 ...
4 Candy Land Eleanor Abbott 1948 3 ...
5 Risk Albert Lamorisse 1957 10 ...
Which values occur in the second column of the result of the following
query? Check all that apply. SELECT min_age, COUNT(*) FROM fun.games
GROUP BY min_age;
•1
•2
•3
•8
• 10
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q8: Run the following query on the VM using hive. Then use the result set to
answer the following question:
SELECT min_age, COUNT(*)
FROM fun.games
WHERE list_price > 10
GROUP BY min_age;
How many games with a list price greater than $10 are suitable for players as
young as 3?
• Unknown
• 0
• 1
• 2
• 3
35
6/24/2021
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q9: Write and run a query on the fly.planes table that would answer
the question, "What is the average number of seats for each type of
aircraft in the table?" Then use the results to enter the average
number of seats for the blimps/dirigibles in the table.
•6
Correct. The query SELECT type, ROUND(AVG(seats)) FROM planes
GROUP BY type; gives 6 for the average number of seats in the row
with type = "Blimp/Dirigible".
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q10: Which of these expressions runs without error in Hive? Check all that apply. (If needed,
check your answers by attempting to run these queries in Hive.)
• SELECT list_price < 10, COUNT(*) FROM fun.games GROUP BY list_price < 10;
• SELECT list_price < 10 AS low_price, COUNT(*) FROM fun.games GROUP BY list_price < 10;
36
6/24/2021
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q11: Run this query in the VM using hive. Then use the result set to answer
the following question.
SELECT list_price > 20, max_players, COUNT(*)
FROM fun.games
GROUP BY list_price>20, max_players;
How many games cost more than $20 and have a maximum of 6 players?
• Unknown
• 0
• 1
• 2
• 3
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
37
6/24/2021
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
38
6/24/2021
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q15: The query SELECT MIN(price) FROM fun.inventory; gave one row
in the results, with only one column. The value was 9.99.
Choose which of the following statements is most accurate and
informative.
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q16: Write and run a query using hive to find the average air speed (distance
divided by air_time) of all flights in the fly.flights table, in miles per hour. Choose
the answer below that is most accurate and informative.
• Infinity mi/hr
• About 7 mi/hr
• Impossible to determine
• About 402 mi/hr The nullif function is needed to prevent division by 0. The
query should be similar to SELECT
AVG(distance/(nullif(air_time,0)/60)) FROM fly.flights;
39
6/24/2021
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q17: Which statement will return the same result as this one?
SELECT AVG(price) AS avg_price FROM fun.inventory;
Try to choose the correct answer without running these SELECT
statements. If you are uncertain, check your answer by running it in hive
on the VM.
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q18: Use hive in the VM to find how many unique non-NULL combinations
of year, month, and day exist in the fly.flights table.
• 3653
You probably used the query SELECT COUNT(DISTINCT year, month, day)
FROM fly.flights; or something similar.
40
6/24/2021
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q20: The fly.flights table includes the column flight, which gives a flight number
for each flight in the table. It also includes the column carrier, which gives the
airline for each flight. This query gives the number of carriers that use each
particular flight number:
SELECT flight, COUNT(DISTINCT carrier) FROM flights GROUP BY flight;
Which of the following is the best response to whether this is a good choice for
grouping? (If you are unsure, you might inspect the data in the VM, looking for
maximum and minimum values for the column, or counting the number of
distinct values.)
• It's not a good choice, because there could be any number of distinct values.
• It's a good choice because there would only be a few distinct values.
• It's a reasonable choice, because while there might be several thousand
values for flight numbers, for big data this is not an unreasonable number of
rows.
41
6/24/2021
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q21: The fly.planes table contains data about planes, including the
columns manufacturer (who built the plane) and seats (how many seats
the plane has). Which query will provide the average number of seats in
all planes built by a manufacturer, but only for manufacturers who have at
least one plane with more than 100 seats?
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q22: A “long-haul” flight is sometimes defined as a flight with air time of 7 hours or longer. Choose the
SELECT statement that returns a result set describing how many long-haul flights each carrier has, along with
the average air time of each carrier’s long-haul flights—but only for the carriers that have over 5000 long-
haul flights represented in the flights table.
• SELECT carrier, COUNT(*), AVG(air_time) FROM flights GROUP BY carrier WHERE air_time >= 7 * 60
HAVING COUNT(*) > 5000;
• SELECT carrier, COUNT(*), AVG(air_time) FROM flights GROUP BY carrier HAVING air_time >= 7 * 60 AND
COUNT(*) > 5000;
• SELECT carrier, COUNT(*), AVG(air_time) FROM flights WHERE air_time >= 7 * 60 GROUP BY carrier
HAVING COUNT(*) > 5000;
• SELECT carrier, COUNT(*), AVG(air_time) FROM flights WHERE air_time >= 7 * 60 AND COUNT(*) > 5000
GROUP BY carrier;
42
6/24/2021
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q23: The fly.flights table has enough information to calculate the flight speed for a flight, but it's a little long
and you probably don't want to repeat it any more than you have to. The calculation for a single flight, in
miles per hour, is distance/(nullif(air_time,0)/60)).
Which of the following queries is the most succinct (and correct) way to find the origin airport, destination
airport, average flight speed in miles per hour, and number of flights for origin-destination pairs for which the
average flight speed was over 575 miles per hour? (Recall that the nullif function is NULL if the two arguments
are equal, and the first argument if they are not. Using nullif here prevents division by 0.)
SELECT origin, dest,
AVG(distance/(nullif(air_time,0)/60)) AS avg_flight_speed,
COUNT(*) AS num_flights
FROM flights GROUP BY origin, dest
HAVING avg_flight_speed > 575;
SELECT origin, dest,
AVG(distance/(nullif(air_time,0)/60)),
COUNT(*) AS num_flights
FROM flights GROUP BY origin, dest
HAVING AVG(distance/(nullif(air_time,0)/60)) > 575;
SELECT origin, dest,
AVG(distance/(nullif(air_time,0)/60) AS flight_speed),
COUNT(*) AS num_flights
FROM flights GROUP BY origin, dest
HAVING AVG(flight_speed) > 575;
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q24: Run the following query, then answer the question below. (Note that this query will also
be used in the Discussion Prompt, “The Analytic Journey,” so you might want to go directly to
that discussion when you're done here.)
SELECT origin, dest,
AVG(distance/(nullif(air_time,0)/60)) AS avg_flight_speed,
COUNT(*) AS num_flights
FROM flights
GROUP BY origin, dest
HAVING avg_flight_speed > 575;
Which of these origin-destination pairs has highest reported flight speed?
• SLC-SYR
• SLC-BDL
• TUS-RNO
• MCO-JAX
43
6/24/2021
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
ORDER BY
• Execution after SELECT FROM WHERE GROUP BY HAVING
Hive :
• The column (s) in ORDER BY must be in SELECT
• Columns in an expression (or itself) must be in select
For new versions of Hive:
Shortcuts: SELECT shop, game from inventory order by 2;
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Missing values
NULLS FIRST
NULLS LAST
New version
NULL less than values
44
6/24/2021
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
LIMIT
• Last Position
• To use with a constant
• Inspect the data
• Avoid returning an exponential
number of lines
• Reduce the use of SQL engine
resources
• Top n element (beware of the limit) • Writing order: SELECT FROM WHERE GROUP BY HAVING ORDER BY LIMIT
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q1: Which games might be in the second row of the result set returned by running
the query below? Check all that apply. SELECT * FROM games ORDER BY min_age;
• Monopoly
• Risk
• Scrabble
• Candy Land
• Clue
45
6/24/2021
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q2: Choose the valid SELECT statements. Check all that apply.
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q3: Select all the statements that return the same result as SELECT * FROM crayons
ORDER BY red; If you are uncertain of your answers, run the queries to check.
46
6/24/2021
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q4: Run the following query with hive and use the result set to
answer the question below:
SELECT * FROM wax.crayons ORDER BY pack DESC, red DESC, green
ASC;
In the result set, which crayon color is represented in the second row
from the top?
• Yellow
• Cotton Candy
• Caribbean Green
• Mountain Meadow
• Canary
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q5: Write and run a SQL query to determine which color in the crayons table has the
lowest saturation value, excluding Black and White. The expression to compute
saturation is
(greatest(red, green, blue) - least(red, green, blue)) / greatest(red, green, blue)
Which color is it?
• Cadet Blue
• Gray
• Silver
• Timberwolf
This query will provide Timberwolf in the third row, after Black and White:
SELECT color FROM wax.crayons ORDER BY (greatest(red, green, blue) - least(red, green, blue)) / greatest(red, green, blue);
You could also exclude Black and White by using the query
SELECT color FROM wax.crayons WHERE color != "Black" AND color != "White" ORDER BY (greatest(red, green, blue) - least(red, green,
blue)) / greatest(red, green, blue);
47
6/24/2021
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q6: Select the queries that will run without error in Hive. Check all that apply.
• SELECT color, red + green + blue AS rgb_sum FROM wax.crayons ORDER BY rgb_sum;
• SELECT color, red, green, blue FROM wax.crayons ORDER BY red + green + blue;
• SELECT color, red + green + blue AS rgb_sum FROM wax.crayons ORDER BY red, green, blue;
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q7: Select the valid SQL queries. Check all that apply.
- SELECT month, AVG(dep_delay) AS avg_dep_delay FROM flights
LIMIT 1000 WHERE origin = 'SFO'
GROUP BY month HAVING avg_dep_delay > 15;
- SELECT month, AVG(dep_delay) AS avg_dep_delay, 10 AS row_limit FROM flights
WHERE origin = 'SFO' GROUP BY month
HAVING avg_dep_delay > 15 LIMIT row_limit;
- SELECT month, AVG(dep_delay) AS avg_dep_delay FROM flights
WHERE origin = 'SFO' GROUP BY month
HAVING avg_dep_delay > 15 LIMIT 1;
- SELECT month, AVG(dep_delay) AS avg_dep_delay FROM flights
WHERE origin = 'SFO' LIMIT 100
GROUP BY month HAVING avg_dep_delay > 15;
- SELECT month, AVG(dep_delay) AS avg_dep_delay FROM flights
WHERE origin = 'SFO' GROUP BY month
HAVING avg_dep_delay > 15 LIMIT -10000;
48
6/24/2021
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q8: Select the appropriate uses for the LIMIT clause. Check all that
apply:
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q9: The example in this video showed that the routes in the flights table that have the longest average
air time are the ones from the New York City airports to Honolulu. The query used in that example was
SELECT origin, dest,AVG(air_time) AS avg_air_time,COUNT(air_time) AS count_air_time
FROM flights GROUP BY origin, dest
ORDER BY avg_air_time DESC NULLS LAST LIMIT 10;
Now, write and run a new query with hive that displays only the two combinations of airline (carrier)
and airport (origin) had the quickest flights (smallest average air_time) from New York City to Honolulu.
The three New York City airports are EWR, JFK, and LGA. Honolulu airport is HNL.
Select the two correct answers:
• American (AA) flights from LaGuardia (LGA)
• Continental (CO) flights from Newark (EWR)
• Delta (DL) flights from Newark (EWR)
• Delta (DL) flights from Kennedy (JFK)
The query above should be modified so it ends this way:
• Hawaiian (HL) flights from Kennedy (JFK) WHERE dest = 'HNL' AND origin IN ('EWR', 'JFK', 'LGA')
• United (UA) flights from Newark (EWR) GROUP BY origin, carrier
ORDER BY avg_air_time ASC
LIMIT 2;
49
6/24/2021
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q10: What is the correct order for specifying the clauses in a SELECT
statement?
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q11: In what order does a SQL engine execute the clauses of a SELECT
statement?
50
6/24/2021
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
with ORDER BY
with LIMIT
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
JOIN =
It is possible to join
several tables
51
6/24/2021
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
52
6/24/2021
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q1: Which value is guaranteed to be in the top row of the result set
when you run the following query with hive?
SELECT country FROM customers
UNION ALL
SELECT country FROM offices
ORDER BY country DESC;
• ar
• ja
• pk
• ug
• us
• No particular value is guaranteed to be in the top row
53
6/24/2021
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q2: The customers table has 4 rows, and the offices table also has 4
rows. How many rows does the following query return when you run it
with hive?
SELECT country FROM customers
UNION ALL
SELECT country FROM offices
LIMIT 2;
•6
• Correct. With hive, the LIMIT 2 at the end of this query limits the
number of rows returned from the offices table (the table on the right
side of the UNION ALL). It has no effect on the number of rows
returned from the customers table (the table on the left side of the
UNION ALL). The customers table has 4 rows, so this query returns 4 +
2 = 6 rows.
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q3: Which questions can only be answered by using data from two different
tables in the fly database on the VM? (For more information about the tables
in the database, see the Data Reference reading.) Check all that apply.
• How many flights departed from SFO and arrived at ORD in 2014?
54
6/24/2021
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q4: Which of the following are valid join queries that hive will run successfully on the VM? Check all that
apply.
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q5: Which of the following are valid join queries that hive will run successfully on the VM? Check all that apply.
SELECT t.name, m.name AS maker
FROM toy.toys JOIN toy.makers ON t.maker_id = m.id
ORDER BY game;
55
6/24/2021
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q6: Review the contents of the employees and offices tables (see the
Data Reference reading), and try to answer the following question
without actually running the join query. (You can check your answers by
running the query in hive and viewing the result set.)
SELECT first_name, last_name, city
FROM employees e INNER JOIN offices o
ON e.office_id = o.office_id;
Which employees are included in the inner join result? Check all that
apply.
• Ambrosio Rojas
• Val Snyder
• Virginia Levitt
• Sabahattin Tilki
• Lujza Csizmadia
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q7: Review the contents of the employees and offices tables and try to answer the
following question without actually running the join query. (You can check your
answers by running the query in hive and viewing the result set.)
SELECT first_name, last_name, city
FROM employees e INNER JOIN offices o
ON e.office_id = o.office_id;
Which offices are included in the inner join result? Check all that apply.
• Istanbul
• Chicago
• Rosario
• Singapore
56
6/24/2021
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q8: Which FROM clauses could you use to return data about all the
customers, even the ones who have not placed any orders? Select all that
apply.
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q9: Which of the following queries returns only the employees whose office IDs do not match
any office IDs found in the offices table?
SELECT empl_id, first_name, last_name
FROM employees e LEFT OUTER JOIN offices o ON e.office_id = o.office_id
WHERE office_id IS NULL;
SELECT empl_id, first_name, last_name
FROM offices o LEFT OUTER JOIN employees e ON e.office_id = o.office_id
WHERE e.office_id IS NULL;
SELECT empl_id, first_name, last_name
FROM employees e LEFT OUTER JOIN offices o ON e.office_id = o.office_id
WHERE o.office_id IS NULL;
SELECT empl_id, first_name, last_name
FROM employees e LEFT OUTER JOIN offices o ON e.office_id = o.office_id
WHERE e.office_id IS NULL;
SELECT empl_id, first_name, last_name
FROM offices o LEFT OUTER JOIN employees e ON e.office_id = o.office_id
WHERE o.office_id IS NULL;
57
6/24/2021
CDOSS Certificate
Big Data Analytics with Hive Query Language and Beeline
Q10: How many rows will result if you cross join a table that has 20
rows with a table that has 30 rows?
• 600
• Correct. Each of the 20 rows in the first table will produce 30 rows
when joined with the second table. That's 20 * 30 or 600 rows.
58