QUERY PROCESSING AND SECURITY
• Overview of Query Processing
• Measures of Query Cost
• Selection Operation, Sorting
• Joining, Evaluation of Expressions
• Query Optimization
• Database Administrator: DBA Roles and Responsibilities
• Database Security Issues
• Types of Security
• Access Protection
• User Accounts and Database Audits
• Discretionary Access Control
• Mandatory Access Control
• Data Encryption and Decryption
QUERY PROCESSING
• translation of high-level queries into low-level expressions.
• requires the basic concepts of relational algebra and file structure.
• refers to the range of activities that are involved in extracting data
from the database.
• includes translation of queries in high-level database languages into
expressions that can be implemented at the physical level of the file
system.
• In this unit we look at how these queries
are processed and how they are optimized.
PARSER AND TRANSLATOR:
• In this step, the parser of the query processor checks the syntax of the query, the user's privileges to
execute the query, the table names, the attribute names, etc.
1. Syntax check – verifies that the statement is syntactically valid SQL. Example:
Select * form Employee;
Here the misspelling of FROM is reported by this check.
2. Semantic check – determines whether the statement is meaningful. Example: a query that refers to a table
which does not exist is rejected by this check.
3. Shared pool check – every query is given a hash code during its execution. This check looks for that
hash code in the shared pool; if the code already exists there, the database does not take additional steps for
optimization and execution.
• Translation converts the high-level query into low-level instructions in relational algebra. Example:
Select book_title, price From Book Where price > 400
• This query can be translated into either of the following relational-algebra expressions:
- π book_title, price ( σ price > 400 ( Book ) )
- σ price > 400 ( π book_title, price ( Book ) )
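To see that both expressions return the same rows, here is a minimal Python sketch. It assumes a tiny made-up Book relation held as a list of dictionaries; the helper names select and project are illustrative only, and duplicate elimination (which a true relational projection performs) is left out for brevity.

```python
# Hypothetical Book relation; the rows are invented for illustration.
book = [
    {"book_title": "Databases", "price": 550, "pid": "P01"},
    {"book_title": "Networks", "price": 380, "pid": "P02"},
    {"book_title": "Compilers", "price": 420, "pid": "P01"},
]


def select(rows, predicate):
    """sigma: keep only the rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


def project(rows, attributes):
    """pi: keep only the named attributes of every row."""
    return [{a: row[a] for a in attributes} for row in rows]


def expensive(row):
    return row["price"] > 400


# pi(sigma(Book)): select first, then project.
plan_a = project(select(book, expensive), ["book_title", "price"])
# sigma(pi(Book)): project first, then select. This is valid only because
# price survives the projection.
plan_b = select(project(book, ["book_title", "price"]), expensive)

print(plan_a == plan_b)  # True
```

The two plans give the same answer but do different amounts of work, which is what the optimizer compares.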
Optimization
• It is the process in which multiple query execution plans for satisfying a
query are examined and the most efficient plan is selected for
execution.
• Optimizer uses the statistical data stored as part of data dictionary.
• The statistical data are information about the size of the table, the
length of records, the indexes created on the table, etc.
• Optimizer also checks for the conditions and conditional attributes
which are parts of the query.
Execution Plan
• At this stage, the query processor module uses the information
collected in query optimization to find different relational-algebra
expressions that are equivalent to the one we have already written,
i.e. that return the same result.
• For our example, the query in relational algebra can be
written as the one given below:
π book_title, price ( σ price > 400 ( Book ) )
Query Cost Estimation
• Cost is generally measured as the total time elapsed in answering a query. To compare the ways of
evaluating a query we need some measures. The basic measures of query cost are:
disk accesses
CPU cycles
transit time in the network
• Of these, CPU cost is difficult to calculate.
• For disk access: how many disk accesses are required to evaluate the query.
• For CPU cycles: how many CPU cycles are consumed in evaluating the query. CPU speed
increases at a faster rate than disk speed, so CPU cost is relatively lower than disk cost.
Disk Access Cost:
• Cost is measured by taking:
- no. of seeks (no. of attempts)
- no. of blocks read
- no. of blocks written
• Normally, cost(w) > cost(r)
• cost(w): after a block is written it is read back to verify that the write succeeded, so a write costs more.
• cost(r): the block is read only once.
• To calculate disk access cost (N is the respective count):
- No. of seeks; Cost = N * average seek time
- No. of blocks read; Cost = N * average block read cost
- No. of blocks written; Cost = N * average block write cost
• For simplicity, we use the number of block transfers from disk and the number of seeks as the measure of
cost.
- tT = time to transfer 1 block
- tS = time for 1 seek
• So, cost for b block transfer plus s seeks = b * tT + s * tS
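As a rough illustration of the formula b * tT + s * tS, the sketch below evaluates it for one case. The function name and the timing values (100 microseconds per block transfer, 4000 microseconds per seek) are assumptions chosen for the example, not figures from these slides.

```python
def disk_cost(blocks, seeks, t_transfer, t_seek):
    """Cost of transferring `blocks` blocks and performing `seeks` seeks."""
    return blocks * t_transfer + seeks * t_seek


# Assumed timings in microseconds (illustrative only).
T_TRANSFER = 100   # tT: time to transfer one block
T_SEEK = 4000      # tS: time for one seek

# Reading 100 consecutive blocks after a single seek.
print(disk_cost(blocks=100, seeks=1, t_transfer=T_TRANSFER, t_seek=T_SEEK))  # 14000
```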
Selection Operation (σ)
• File scans are search algorithms that locate and retrieve records that
fulfil a selection condition.
• Selection algorithms that use an index are referred to as index scans.
• A select operation must search through the data files for records
meeting the selection criteria.
A1. Linear search
• All records are scanned to see whether they satisfy the selection condition.
• This algorithm is simple but slower than the other algorithms for implementing selection.
• It can be applied to any file regardless of the ordering of the file, the availability of indices or
the nature of the selection operation.
• There can be 2 extreme cases: the record is at the first position or at the last position in the
relation.
• If there are br blocks, then
Avg. Cost = (br / 2) * tT + tS (when the scan can stop at the matching record)
Worst case Cost = br * tT + tS
tT = time to transfer one block
tS = seek time to reach the beginning of the file
br = no. of blocks of the relation on disk
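The two linear-search formulas can be checked with a small sketch. The function name, the flag stop_at_match and the timings (in microseconds) are assumptions made for this example.

```python
def linear_search_cost(b_r, t_transfer, t_seek, stop_at_match=False):
    """Linear-scan cost: one seek plus the blocks transferred.

    With stop_at_match=True the average-case formula (b_r / 2 blocks) is
    used; otherwise the worst case (all b_r blocks) is used.
    """
    blocks = b_r / 2 if stop_at_match else b_r
    return blocks * t_transfer + t_seek


# Assumed values: 100 blocks, tT = 100 us, tS = 4000 us.
print(linear_search_cost(100, 100, 4000))                      # 14000
print(linear_search_cost(100, 100, 4000, stop_at_match=True))  # 9000.0
```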
A2. Binary search
• Applicable only when the file is ordered on the search key and the
selection is an equality condition on that key.
• Not suitable for range conditions or any other kind of condition.
• The filter condition should be search key column = 'value', like ID = 5.
• Cost = ⌈log2(br)⌉ * (tT + tS)
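A short sketch of the binary-search estimate, reading the bracket in the formula as a ceiling of the base-2 logarithm. The function name and the timings (in microseconds) are assumptions for illustration.

```python
import math


def binary_search_cost(b_r, t_transfer, t_seek):
    """Each probe of the binary search costs one seek plus one block transfer."""
    probes = math.ceil(math.log2(b_r))
    return probes * (t_transfer + t_seek)


# Assumed values: a file of 10000 blocks, tT = 100 us, tS = 4000 us.
# ceil(log2(10000)) is 14, so 14 probes are needed.
print(binary_search_cost(10000, 100, 4000))  # 57400
```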
A3. Use of primary index, Equality on key attribute
• The select condition involves an equality comparison on a key attribute
with a primary index.
• Here we use a B+ tree for maintaining the index. A primary index is
defined on an ordered data file.
• The data file is ordered on a key field. (A B+ tree is a file organization
method, similar to a binary search tree, but a node can have more than two
children. It stores all the records only at the leaf nodes. Intermediate
nodes hold pointers towards the leaf nodes; they do not contain any data
records.)
• The index is on the key attribute (PID in this example); each pointer (PTR) in the index points to the record with that PID.
• The cost to find the PID in the index is the height of the tree.
• Cost = (height of tree + 1) * (tT + tS)
A4. Use of primary index, Equality on non-key attribute
- The difference from the previous case (primary index, equality on key attribute) is that
multiple records can be fetched.
- We can retrieve multiple records using a primary index when the selection
condition specifies an equality comparison on a non-key attribute.
- σ pid='P01' (Book)
Cost = height of tree * (tT + tS) + b * tT
b = no. of blocks containing the records with the specified search key
A5. Use of secondary index, equality on key or non-key attributes
When the search key is a candidate key, a single record is retrieved.
• Equality on key attributes: Cost = (height of tree + 1) * (tT + tS)
When the search key is not a candidate key, multiple records can be retrieved.
• Equality on non-key attributes: Cost = (height of tree + B) * (tT + tS)
B = no. of blocks containing the matching records; each matching record may be on a different block.
A6. Use of primary index (comparison)
Cost = H * (tT + tS) + b * tT, where H is the height of the tree
A7. Use of secondary index (comparison)
A secondary index can also be used for comparison conditions.
Each matching record may need its own disk access, since the records may be on different blocks.
- Much more expensive
Cost = (height of tree + B) * (tT + tS)
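The index-based formulas A3 to A7 can be compared side by side with a small sketch. The function names, the tree height, the block counts and the timings (in microseconds) are all assumptions picked for illustration; the formulas themselves are the ones listed above.

```python
def primary_index_key_cost(height, t_transfer, t_seek):
    """A3 (and A5 on a key): traverse the tree, then fetch one record."""
    return (height + 1) * (t_transfer + t_seek)


def primary_index_multi_cost(height, b, t_transfer, t_seek):
    """A4 and A6: traverse the tree, then read b consecutive blocks."""
    return height * (t_transfer + t_seek) + b * t_transfer


def secondary_index_multi_cost(height, big_b, t_transfer, t_seek):
    """A5 (non-key) and A7: every one of the B blocks needs its own seek."""
    return (height + big_b) * (t_transfer + t_seek)


# Assumed values: tree height 3, 20 matching blocks, tT = 100 us, tS = 4000 us.
print(primary_index_key_cost(3, 100, 4000))          # 16400
print(primary_index_multi_cost(3, 20, 100, 4000))    # 14300
print(secondary_index_multi_cost(3, 20, 100, 4000))  # 94300
```

With these assumed numbers the secondary index is several times more expensive for the same 20 blocks, because each block is reached with a separate seek.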
Joining Algorithms
- Like selection, joining can be implemented in a variety of ways.
- In terms of disk access, joining can be expensive, so implementing and using efficient join algorithms is
critical in minimizing a query's execution time.
Notations:
- r, s = outer relation r and inner relation s
- tr = tuple of relation r
- ts = tuple of relation s
- nr = no. of records in relation r
- ns = no. of records in relation s
- br = no. of blocks with records in relation r
- bs = no. of blocks with records in relation s
Types of Joining Algorithms
1. Nested Loop Join
- Called the nested-loop join algorithm since it basically consists of a pair of nested for loops.
- Requires no indices.
- Can be used regardless of what the join condition is.
- Expensive, since it examines every pair of tuples in the two relations.
- In the algorithm below, each record tr of the outer table r is checked against each record ts of the
inner table s. Hence it is a very costly type of join.
- In the worst case, it requires nr + br seeks and (nr * bs) + br block transfers. It is always better to put the smaller
table in the inner loop if it fits entirely in memory; the cost then reduces to 2 seeks and br + bs
block transfers.
Algorithm:
for each tuple tr in r do begin
    for each tuple ts in s do begin
        test pair (tr, ts) to see if they satisfy the join condition θ;
        if they do, add tr · ts to the result.
    end
end
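The nested-loop algorithm translates almost line by line into Python. This is a minimal in-memory sketch: the EMPLOYEE and DEPT rows are invented for the example, and the join condition θ is passed in as a function.

```python
def nested_loop_join(r, s, theta):
    """Join r (outer) with s (inner) by testing every pair of tuples."""
    result = []
    for t_r in r:                            # outer loop over r
        for t_s in s:                        # inner loop over s
            if theta(t_r, t_s):              # join condition
                result.append({**t_r, **t_s})
    return result


# Hypothetical sample rows.
employee = [
    {"emp": "Ann", "dept_id": 1},
    {"emp": "Bob", "dept_id": 2},
    {"emp": "Cy", "dept_id": 1},
]
dept = [
    {"dept_id": 1, "dept_name": "Sales"},
    {"dept_id": 2, "dept_name": "IT"},
]

joined = nested_loop_join(employee, dept,
                          lambda e, d: e["dept_id"] == d["dept_id"])
for row in joined:
    print(row["emp"], row["dept_name"])
```

Because theta is an arbitrary function, the same code works for any join condition, which mirrors the "can be used regardless of the join condition" property.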
- Suppose we have EMPLOYEE and DEPT tables with the following number of records and blocks. Let us see their
costs based on these numbers and the position of the larger and smaller tables.
- Observe cases 1 and 2: when the smaller table DEPT is set as the outer table, the number of block
transfers doubles whereas the number of seeks reduces.
- In case 3, where both tables fit into memory, the number of seeks is reduced to 2 and the number of block
transfers also reduces considerably.
- But we can observe the real difference in case 4, when only the smaller table DEPT fits into memory and is used
as the inner table.
- From the above examples, we can see that the cost of a nested-loop join depends on the number of records,
the number of blocks and the positioning of the tables. However, a nested-loop join is very expensive compared to
the other joins below.
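The effect of table placement can be tried out with the worst-case and in-memory formulas for nested-loop join. The sizes used here (5000 records in 100 blocks for one relation, 10000 records in 400 blocks for the other) are hypothetical and are not the EMPLOYEE and DEPT figures referred to above; the function name is also an assumption.

```python
def nested_loop_cost(n_outer, b_outer, b_inner, inner_fits_in_memory=False):
    """Return (block_transfers, seeks) for a nested-loop join."""
    if inner_fits_in_memory:
        # Each relation is read exactly once.
        return b_outer + b_inner, 2
    # Worst case: the inner relation is re-read for every outer tuple.
    return n_outer * b_inner + b_outer, n_outer + b_outer


# Hypothetical sizes: relation A has 5000 records in 100 blocks,
# relation B has 10000 records in 400 blocks.
print(nested_loop_cost(5000, 100, 400))    # A outer: (2000100, 5100)
print(nested_loop_cost(10000, 400, 100))   # B outer: (1000400, 10400)
print(nested_loop_cost(5000, 100, 400, inner_fits_in_memory=True))  # (500, 2)
```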
2. Block Nested-Loop Join
- Also requires no indexes and can be used with any kind of join condition.
- Worst case: the DB buffer can hold only one block of each relation: br * bs + br disk accesses.
- Best case: both relations fit into the DB buffer: br + bs disk accesses.
- If the smaller relation fits completely into the DB buffer, use it as the inner relation.
- This reduces the cost estimate to br + bs disk accesses.
- Algorithm:
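One way to sketch block nested-loop join in Python is to pair up blocks first and tuples second. The helper blocks, the block_size parameter and the sample rows are assumptions for this illustration; a real system works on disk pages rather than list slices.

```python
def blocks(rows, block_size):
    """Yield the relation one block (a slice of rows) at a time."""
    for start in range(0, len(rows), block_size):
        yield rows[start:start + block_size]


def block_nested_loop_join(r, s, theta, block_size=2):
    """Pair every block of r with every block of s, then compare tuples."""
    result = []
    for block_r in blocks(r, block_size):        # one read per outer block
        for block_s in blocks(s, block_size):    # inner relation re-read per outer block
            for t_r in block_r:
                for t_s in block_s:
                    if theta(t_r, t_s):
                        result.append({**t_r, **t_s})
    return result


# Hypothetical sample rows.
employee = [
    {"emp": "Ann", "dept_id": 1},
    {"emp": "Bob", "dept_id": 2},
    {"emp": "Cy", "dept_id": 1},
]
dept = [
    {"dept_id": 1, "dept_name": "Sales"},
    {"dept_id": 2, "dept_name": "IT"},
]

joined = block_nested_loop_join(employee, dept,
                                lambda e, d: e["dept_id"] == d["dept_id"])
print(len(joined))  # 3
```

The inner relation is scanned once per outer block instead of once per outer tuple, which is where the br * bs + br estimate comes from.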
Index Nested Loop Join:
- Same as nested loop join, except that an index on the inner relation is used.
- If an index is available on the inner loop’s join attribute, index lookups can replace file scans.
- It can be used with existing indices as well as with temporary indices created for the sole purpose of
evaluating the join.
- Cost = br + nr * c
c = cost of a single selection on s using the join condition
If a B+ tree index is used and the height of the index is 4 (with br = 100 and nr = 5000):
- Cost of one B+ tree index lookup: c = h + 1 = 4 + 1 = 5
- Cost = 100 + 5000 * 5 = 25,100 disk accesses
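A minimal sketch of the idea, assuming a Python dictionary stands in for the B+ tree index on the inner relation. The function names and sample rows are hypothetical. The cost helper evaluates Cost = br + nr * c; with br = 100, nr = 5000 and c = 5 it gives 25,100.

```python
from collections import defaultdict


def build_index(s, attribute):
    """Stand-in for an index on the inner relation: value -> matching rows."""
    index = defaultdict(list)
    for t_s in s:
        index[t_s[attribute]].append(t_s)
    return index


def indexed_nested_loop_join(r, s_index, attribute):
    """For each outer tuple, look up matches in the index instead of scanning s."""
    result = []
    for t_r in r:
        for t_s in s_index.get(t_r[attribute], []):
            result.append({**t_r, **t_s})
    return result


def indexed_nested_loop_cost(b_r, n_r, c):
    """Read the outer relation once, plus one index lookup per outer tuple."""
    return b_r + n_r * c


# Hypothetical sample rows.
employee = [{"emp": "Ann", "dept_id": 1}, {"emp": "Bob", "dept_id": 2}]
dept = [{"dept_id": 1, "dept_name": "Sales"}, {"dept_id": 2, "dept_name": "IT"}]

joined = indexed_nested_loop_join(employee, build_index(dept, "dept_id"), "dept_id")
print(len(joined))                               # 2
print(indexed_nested_loop_cost(100, 5000, 5))    # 25100
```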
Merge Join:
- A cost-effective alternative to constructing an index for a nested-loop join.
- Can be used for equi-joins and natural joins.
- Each block needs to be read only once (assuming all tuples for any given value of the join attributes fit in memory).
Algorithm:
sort outer relation R in ascending order on the join column
sort inner relation S in ascending order on the join column
for each row in outer relation R do: read the row
    for each row from inner relation S with join-column value less than or equal to R.join_column do: read the row
        if R.join_column = S.join_column then
            accept the row pair and add it to the result set
        end if
    end for
end for
- Cost = br + bs block transfers, once both relations are sorted
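The merge step can be sketched in Python with two cursors that advance through the sorted inputs. This is an in-memory illustration with an assumed function name and made-up rows; it also handles repeated join values on both sides, which the pseudocode leaves implicit.

```python
def merge_join(r, s, attribute):
    """Equi-join r and s on `attribute` using sort followed by a single merge pass."""
    r_sorted = sorted(r, key=lambda row: row[attribute])
    s_sorted = sorted(s, key=lambda row: row[attribute])
    result = []
    i = j = 0
    while i < len(r_sorted) and j < len(s_sorted):
        r_val = r_sorted[i][attribute]
        s_val = s_sorted[j][attribute]
        if r_val < s_val:
            i += 1                      # r is behind: advance r
        elif r_val > s_val:
            j += 1                      # s is behind: advance s
        else:
            # Find the run of s rows that share this join value.
            j_end = j
            while j_end < len(s_sorted) and s_sorted[j_end][attribute] == r_val:
                j_end += 1
            # Pair every r row holding this value with the whole run.
            while i < len(r_sorted) and r_sorted[i][attribute] == r_val:
                for t_s in s_sorted[j:j_end]:
                    result.append({**r_sorted[i], **t_s})
                i += 1
            j = j_end
    return result


# Hypothetical sample rows.
employee = [{"emp": "Bob", "dept_id": 2}, {"emp": "Ann", "dept_id": 1},
            {"emp": "Cy", "dept_id": 1}]
dept = [{"dept_id": 2, "dept_name": "IT"}, {"dept_id": 1, "dept_name": "Sales"}]

joined = merge_join(employee, dept, "dept_id")
print(len(joined))  # 3
```

Neither cursor ever moves backwards past a finished run, which is why each block is read only once after sorting.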
Hash Join:
- Applicable for equi-joins and natural joins.
- Both relations to be joined are treated as two inputs: the build input and the probe input (the smaller
table usually serves as the build input).
- Uses two hash-table file structures to partition each relation’s records into sets containing identical hash
values on the join attributes.
- After the two hash tables are built, for each pair of matching partitions a hash index is built on the partition of the smaller
(build) relation and a nested loop is performed against the corresponding records of the other
relation, writing out the result of each join.
- If there is not enough memory to hold the hash index and the records of a partition of the
build relation, a process called recursive partitioning is performed.
- In recursive partitioning the system repeats the partitioning of the input until
each partition of the build input fits into memory.
- Recursive partitioning is needed when the value of nh is greater than or equal to the number of memory
blocks.
- Cost (without recursive partitioning) = 3 * (br + bs) + 4 * nh (nh is the no. of partitions in the hash table).
- Cost (with recursive partitioning) = 2 * (br + bs) * ⌈log_(M-1)(bs) - 1⌉ + br + bs
- M = no. of memory blocks used
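The partition, build and probe phases can be sketched in memory as follows. The function names, the default of four partitions and the sample rows are assumptions for illustration; recursive partitioning is not modelled. The cost helper evaluates the formula without recursive partitioning.

```python
from collections import defaultdict


def partition(rows, attribute, n_h):
    """Split rows into n_h partitions by hashing the join attribute."""
    parts = [[] for _ in range(n_h)]
    for row in rows:
        parts[hash(row[attribute]) % n_h].append(row)
    return parts


def hash_join(build, probe, attribute, n_h=4):
    """Partition both inputs, then build and probe one partition pair at a time."""
    result = []
    build_parts = partition(build, attribute, n_h)
    probe_parts = partition(probe, attribute, n_h)
    for build_part, probe_part in zip(build_parts, probe_parts):
        index = defaultdict(list)            # in-memory hash index on the build side
        for t_b in build_part:
            index[t_b[attribute]].append(t_b)
        for t_p in probe_part:               # probe with the other relation
            for t_b in index.get(t_p[attribute], []):
                result.append({**t_b, **t_p})
    return result


def hash_join_cost(b_r, b_s, n_h):
    """Block transfers without recursive partitioning."""
    return 3 * (b_r + b_s) + 4 * n_h


# Hypothetical sample rows; dept is the smaller (build) input.
dept = [{"dept_id": 1, "dept_name": "Sales"}, {"dept_id": 2, "dept_name": "IT"}]
employee = [{"emp": "Ann", "dept_id": 1}, {"emp": "Bob", "dept_id": 2},
            {"emp": "Cy", "dept_id": 1}]

joined = hash_join(dept, employee, "dept_id")
print(len(joined))                    # 3
print(hash_join_cost(100, 400, 5))    # 1520
```

Rows with equal join values always hash to the same partition number, so only matching partition pairs ever need to be compared.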
Evaluation of Expressions in DBMS
• So far we have seen how to write a query in SQL.
• Now we see how the DBMS evaluates a query written in SQL, i.e. how it
breaks the query into pieces to get the records quickly.
• There are two methods of evaluating a query:
• Materialization
• Pipelining
1. Materialization :
- Each operation in the expression is evaluated one by one in an appropriate order, and the result of each operation is
materialized (created) in a temporary relation, which becomes the input for subsequent operations.
- Example: πpname, emailid (σcategory = 'Nobel' (Book) |X| Publisher)
- Its expression tree is:
- The expression is broken into two sub-queries. The first sub-query is evaluated and its result is stored in a
temporary table in memory. The data in this temporary table is then used to
evaluate the second sub-query.
- Although this method looks simple, the cost of this type of evaluation is usually higher.
- It takes time to evaluate and write into a
temporary table, then to retrieve from this temporary
table and query it to get the next level of the result, and so on.
- Cost = cost of the individual SELECTs + cost of writing into the
temporary tables
2. Pipelining:
- In this method, the DBMS does not store the records in temporary tables. Instead, it evaluates each query and passes
its result to the next query for processing, and so on. The queries are processed one after the other, and
each uses the result of the previous query for its processing.
- For example, the CLASS_ID of DESIGN_01 is passed directly to the query on the STUDENT table to get the student details.
- This method has no extra cost of writing into temporary tables. It has only the cost of evaluating the individual
queries; hence it performs better than materialization.
Types of Pipelining
1. Demand-Driven or Lazy Evaluation
- In this method, the results of lower-level queries are not passed to the higher level automatically.
- They are passed to the higher level only when the higher level requests them.
- The lower level retains the result value and its state, and transfers them to the next level only when
requested.
- In the example, CLASS_ID for DESIGN_01 is retrieved, but it is passed to the STUDENT query only when it is
requested. Once the request arrives, it is passed to the STUDENT query and that query is processed.
2. Producer-Driven or Eager Pipelining
- In this method, the lower-level queries eagerly pass their results to the higher-level queries.
- They do not wait for the higher-level queries to request the results.
- The lower-level query creates a buffer to store its results, and the higher-level query takes the results from it.
- If the buffer is full, the lower-level query waits for the higher-level query to empty it. Demand-driven pipelining is
therefore also called PULL pipelining, and producer-driven pipelining PUSH pipelining.
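Demand-driven pipelining maps naturally onto Python generators, where each operator produces a row only when the operator above asks for one. The sketch below is an illustration under that assumption: the operator names, the scanned log and the Book rows are invented for the example, and producer-driven pipelining (which needs buffers between concurrently running operators) is not shown.

```python
scanned = []  # records which rows the scan has actually touched


def scan(rows):
    """Leaf operator: hands out one row per request."""
    for row in rows:
        scanned.append(row["book_title"])
        yield row


def select(rows, predicate):
    """Pulls rows from below and passes on only those that match."""
    for row in rows:
        if predicate(row):
            yield row


def project(rows, attributes):
    """Pulls rows from below and keeps only the named attributes."""
    for row in rows:
        yield {a: row[a] for a in attributes}


# Hypothetical Book rows.
book = [
    {"book_title": "Databases", "price": 550},
    {"book_title": "Networks", "price": 380},
    {"book_title": "Compilers", "price": 420},
]

pipeline = project(select(scan(book), lambda row: row["price"] > 400),
                   ["book_title"])

first = next(pipeline)     # the top operator requests exactly one row
print(first)               # {'book_title': 'Databases'}
print(scanned)             # ['Databases']: nothing else has been read yet

remaining = list(pipeline) # further requests pull the rest through
print(remaining)           # [{'book_title': 'Compilers'}]
```

No intermediate list is ever built, which corresponds to skipping the temporary tables that materialization would write.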
Query Optimization in DBMS
- We have seen so far how a query can be processed based on indexes
and joins, and how it can be transformed into relational expressions.
- The query optimizer uses these two techniques to determine which
process or expression to consider for evaluating the query.
- Example: Relation schema:
instructor (ID , name , dept_name , salary)
teaches (ID , course_id , sec_id , semester , year)
course (course_id , title , dept_name , credits)
- Find the names of all instructors in the Music department together
with the course titles of all the courses that each instructor teaches.
πname , title (σdept_name = 'Music'(instructor |X| (teaches |X| πcourse_id , title (course))))
Methods of Query Optimization
1. Cost based Optimization:
- This is based on the cost of the query.
- The query can use different paths based on indexes, constraints, sorting methods etc.
- This method mainly uses the statistics like record size, number of records, number of records per block,
number of blocks, table size, whether whole table fits in a block, organization of tables, uniqueness of
column values, size of columns etc.
2. Heuristic Optimization (Logical) :
- This method is also known as rule based optimization.
- This is based on the equivalence rules on relational expressions; hence the number of query combinations
to consider is reduced, and the cost of the query is reduced too.
- This method creates a relational tree for the given query based on the equivalence rules.
- The most important rules followed in this method are listed below:
- Perform all selection operations as early as possible in the query. This should be the first and foremost set
of actions on the tables in the query. By performing the selection early, we reduce the number of
records involved in the query, rather than using the whole tables throughout the query.
- Perform all projections as early as possible in the query. This is similar to selection but reduces the
number of columns in the query.
- Next, perform the most restrictive joins and selection operations. The most restrictive joins
and selections are those on the sets of tables and views that result in comparatively fewer
records.
Heuristic Based Optimization ( summary)
- It transforms the query tree using a set of rules (heuristics) that typically (but not in all cases) improve
the execution performance of the input query.
- Apply select operations before project, join or other binary operations.
- Apply project operations before join or other binary operations.
- SELECT and PROJECT operations reduce the size of the file and hence should be applied before join or other
binary operations.
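The benefit of applying selection before a join can be seen by counting how many tuple pairs the join has to examine. The sketch below uses made-up instructor and teaches rows and a hypothetical helper join_count; it is an illustration of the heuristic, not a model of a real optimizer.

```python
def join_count(r, s, attribute):
    """Nested-loop equi-join that also reports how many pairs it examined."""
    result, examined = [], 0
    for t_r in r:
        for t_s in s:
            examined += 1
            if t_r[attribute] == t_s[attribute]:
                result.append({**t_r, **t_s})
    return result, examined


# Hypothetical rows for two of the example relations.
instructor = [
    {"ID": 1, "name": "Mozart", "dept_name": "Music"},
    {"ID": 2, "name": "Einstein", "dept_name": "Physics"},
    {"ID": 3, "name": "Curie", "dept_name": "Physics"},
]
teaches = [
    {"ID": 1, "course_id": "MU-199"},
    {"ID": 2, "course_id": "PHY-101"},
    {"ID": 3, "course_id": "PHY-201"},
    {"ID": 3, "course_id": "PHY-301"},
]

# Plan 1: join first, apply the selection afterwards.
joined, examined_late = join_count(instructor, teaches, "ID")
late = [row for row in joined if row["dept_name"] == "Music"]

# Plan 2: apply the selection first, then join the smaller input.
music = [row for row in instructor if row["dept_name"] == "Music"]
early, examined_early = join_count(music, teaches, "ID")

print(late == early)                   # True: both plans give the same answer
print(examined_late, examined_early)   # 12 4
```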
Equivalence Rules