R-Trees and Geospatial Data Structures

CS 6213 – Advanced Data Structures – Lecture 7

 Instructor
Prof. Amrinder Arora
amrinder@gwu.edu
Please copy TA on emails
Please feel free to call as well
 TA
Iswarya Parupudi
iswarya2291@gwmail.gwu.edu
L7 - R-Trees - CS 6213 - Advanced Data Structures - Arora

CS 6213
Basics: Record / Struct / Arrays / LLs; Stacks / Queues; Graphs / Trees / BSTs; Heaps and PQs
Advanced: Trie, B-Tree; Splay Trees; R-Tree; Union Find
Applications: Databases; Spatial; String; In Memory

 Antonin Guttman, U. C. Berkeley
 K. A. Mohamed

 Spatial Data
 R-Tree Structure
 Operations
 Searching
 Insertion
 Deletion
 Variants
 Applications

Given a city map, "index" all university buildings in an efficient structure for quick topological search.

"Index" buildings in an efficient structure for quick search.
Spatial object: the contour (outline) of the area around the building(s).
MBR: the minimum bounding region of the object.
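As a minimal sketch of the MBR idea, assuming a building contour is given as a list of 2D points and that an MBR is stored as `(xmin, ymin, xmax, ymax)` (both representations are assumptions made for illustration, not part of the slides), the bounding rectangle can be computed like this:

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # assumed layout: (xmin, ymin, xmax, ymax)


def mbr_of_points(points: Iterable[Point]) -> Rect:
    """Smallest axis-aligned rectangle that encloses every contour point."""
    pts = list(points)
    if not pts:
        raise ValueError("need at least one point")
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    return (min(xs), min(ys), max(xs), max(ys))


def mbr_union(a: Rect, b: Rect) -> Rect:
    """MBR that encloses two MBRs, e.g. a neighbourhood enclosing two buildings."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


# Hypothetical contour of one building.
building = [(2.0, 1.0), (5.0, 1.5), (4.0, 4.0), (1.5, 3.0)]
print(mbr_of_points(building))                        # (1.5, 1.0, 5.0, 4.0)
print(mbr_union((0, 0, 1, 1), (2, 3, 4, 5)))          # (0, 0, 4, 5)
```

The same `mbr_union` step is what lets a neighbourhood or city MBR be derived from the MBRs it contains.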

MBRs of the city neighbourhoods.
MBR of the city, defining the overall search region.

Mostly involves 2D regions.
 Need to support 2D range queries.
 Multiple return values desired: Answering a query region by reporting
all spatial objects that are fully-contained-in or overlapping the query
region (Spatial-Access Method – SAM).
In general:
 Spatial data objects often cover areas in multidimensional spaces.
 Spatial data objects are not well-represented by point-location.
 An "index" based on an object's spatial location is desirable.
Problem Summary: To retrieve data items quickly and efficiently
according to their spatial locations.

 A B-Tree is an ordered, dynamic, multi-way structure of order m (i.e. each
node has at most m children).
 The keys and the subtrees are arranged in the fashion of a search tree.
 Each node may contain a large number of keys, and the number of subtrees
in each node, then, may also be large.
 The B-Tree is designed (among other objectives):
 to branch out in a large number of directions, and
 to contain a lot of keys in each node so that the height of the tree is relatively short.
[Figure: example B-Tree with root M, internal nodes (E H) and (P T X), and leaves (B D), (F G), (I K L), (N O), (Q S), (V W), (Y Z).]

 A height-balanced tree, similar to a B-Tree.
 Index records in the leaf nodes contain pointers to the actual
spatial-objects (entries) they represent.
 Each entry has a unique identifier that points to one spatial object,
and its MBR; i.e., entry = (MBR, pointer).
 Spatial searching requires visiting only a small number of nodes.
 The index is completely dynamic: inserts and deletes can be
intermixed with searches. (No periodic reorganization is
required.)

 Let M be the maximum number of entries that will fit in one node.
 Let m ≤ M/2 be a parameter specifying the minimum number of entries in one
node.
Then an R-Tree must satisfy the following properties:
1. Every leaf node contains between m and M index records, unless it is the
root.
2. For each index-record Entry (I, tuple-identifier) in a leaf node, I is the MBR
that spatially contains the n-dimensional data object represented by the
tuple-identifier.
3. Every non-leaf node has between m and M children, unless it is the root.
4. For each Entry (I, child-pointer) in a non-leaf node, I is the MBR that
spatially contains the regions in the child node.
5. The root has at least two children unless it is a leaf.
6. All leaves appear on the same level.

 An entry E in a leaf node is defined as:
E = (I, tuple-identifier)
 Where I refers to the smallest bounding n-dimensional region (MBR) that encompasses the spatial data pointed to by its tuple-identifier.
 I is a series of closed intervals, one for each dimension of the bounding region.
 Example. In 2D, I = (I_x, I_y),
where I_x = [x_a, x_b], and I_y = [y_a, y_b].

[Not limited to 2D – higher dimensions are certainly possible.]
 In general I = (I_0, I_1, …, I_{n-1}) for n dimensions, where I_k = [k_a, k_b].
 If either k_a or k_b (or both) is equal to ∞, the spatial object extends outward indefinitely along that dimension.
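The interval definition above can be sketched directly in code. This is a hedged illustration only: the names `Interval`, `Region`, `overlaps` and `contains` are hypothetical, and `math.inf` is used to stand for an endpoint equal to ∞.

```python
import math
from typing import Sequence, Tuple

Interval = Tuple[float, float]   # closed interval [k_a, k_b]
Region = Sequence[Interval]      # I = (I_0, ..., I_{n-1})


def overlaps(a: Region, b: Region) -> bool:
    """Two regions overlap when their intervals intersect in every dimension."""
    if len(a) != len(b):
        raise ValueError("regions must have the same number of dimensions")
    return all(a_lo <= b_hi and b_lo <= a_hi
               for (a_lo, a_hi), (b_lo, b_hi) in zip(a, b))


def contains(outer: Region, inner: Region) -> bool:
    """True when `inner` lies inside `outer` in every dimension."""
    if len(outer) != len(inner):
        raise ValueError("regions must have the same number of dimensions")
    return all(o_lo <= i_lo and i_hi <= o_hi
               for (o_lo, o_hi), (i_lo, i_hi) in zip(outer, inner))


box = [(0.0, 2.0), (0.0, 2.0)]
touching = [(2.0, 3.0), (1.0, 5.0)]          # shares the edge x = 2 with box
strip = [(-math.inf, math.inf), (4.0, 6.0)]  # unbounded along x

print(overlaps(box, touching))                          # True (closed intervals)
print(overlaps(box, strip))                             # False (y ranges are disjoint)
print(contains(strip, [(100.0, 101.0), (5.0, 5.0)]))    # True
```

Because the intervals are closed, rectangles that merely touch count as overlapping in this sketch.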

 An entry E in a non-leaf node is defined as: E = (I, child-pointer)
 Where the child-pointer points to the child of this node, and I is
the MBR that encompasses all the regions in the entries of that child node.
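[Figure: a non-leaf node with entries I(A), I(B), …, I(M); the entry I(B) points to a child node with entries I(a), I(b), I(c), I(d), whose rectangles a–d lie inside region B.]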

[Figure sequence: an R-Tree over the rectangles a–l, grouped under the parent regions m, n, o, p; successive slides highlight the rectangles enclosed by each parent region.]

Typical query: Find and report all university building sites that are within 5 km of the city centre.
Approach:
i. Build the R-Tree using rectangular regions a, b, … i.
ii. Formulate the query range Q.
iii. Query the R-Tree and report all regions overlapping Q.

Let Q be the query region.
Let T be the root of the R-Tree.
Find all entry-records whose regions overlap Q.
Search sub-trees:
 If T is not a leaf, then apply Search on every child-node entry E
whose I overlaps Q.
Search leaf nodes:
 If T is leaf, then check each entry E in the leaf and return E if E.I
overlaps Q.
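The search routine above can be sketched as follows. This is a minimal illustration under stated assumptions: the `Node` class, the `(xmin, ymin, xmax, ymax)` rectangle layout and the tiny hand-built tree are all hypothetical and not taken from the slides.

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple

Rect = Tuple[float, float, float, float]  # assumed layout: (xmin, ymin, xmax, ymax)


@dataclass(eq=False)
class Node:
    is_leaf: bool
    # Each entry is (I, payload): payload is a tuple-identifier in a leaf,
    # or a child Node in a non-leaf node.
    entries: List[Tuple[Rect, Any]] = field(default_factory=list)


def overlaps(a: Rect, b: Rect) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def search(node: Node, q: Rect) -> List[Any]:
    """Report every leaf entry whose MBR overlaps the query region q."""
    hits: List[Any] = []
    for mbr, payload in node.entries:
        if not overlaps(mbr, q):
            continue                           # prune this entry (and its subtree)
        if node.is_leaf:
            hits.append(payload)               # Search leaf nodes
        else:
            hits.extend(search(payload, q))    # Search sub-trees
    return hits


leaf1 = Node(True, [((0, 0, 1, 1), "a"), ((2, 2, 3, 3), "b")])
leaf2 = Node(True, [((8, 8, 9, 9), "c"), ((6, 7, 7, 9), "d")])
root = Node(False, [((0, 0, 3, 3), leaf1), ((6, 7, 9, 9), leaf2)])

print(search(root, (2.5, 2.5, 6.5, 8)))  # ['b', 'd']
```

In this example both subtrees overlap Q, so both are visited; that is the multi-path behaviour that separates R-Tree search from B-Tree search.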

[Figure: worked search example over the data rectangles a–i, indexed by the nodes r0–r8. The M and m settings are not recoverable from the slide.]

 The search algorithm descends the tree from the root in a manner
similar to a B-Tree.
 More than one subtree under a node visited may need to be
searched.
 Cannot guarantee good worst-case performance.
 Countered by the algorithms during insertion, deletion, and update
that maintain the tree in a form that allows the search algorithm to
eliminate irrelevant regions of the indexed space.
 So that only data near the search area need to be examined.
 Emphasis is on the optimal placement of spatial objects with respect
to the spatial location of other objects in the structure.

 A Node-Overflow happens when a new Entry is added to a fully
packed node, causing the resulting number of entries in the node
to exceed the upper-bound M.
 The "overflow" node must be split, and all its current entries, as
well as the new one, consolidated for local optimum arrangement.
 A Node-Underflow happens when one or more Entries are
removed from a node, causing the remaining number of entries in
that node to fall below the lower-bound m.
 The underflow node must be condensed, and its entries
dispersed for global optimum arrangement.

 New index entry-records are added to the leaves.
 Nodes that overflow are split, and splits propagate up the tree.
 A split-propagation may cause the tree to grow in height.
The main Insert routine
 Let E = (I, tuple-identifier) be the new entry to be inserted.
 Let T be the root of the R-Tree.
 [Ins_1] Locate a leaf L starting from T to insert E.
 [Ins_2] Add E to L. If L is already full (overflow), split L into L and L'.
 [Ins_3] Propagate MBR changes (enlarged or reduced) upwards.
 [Ins_4] Grow tree taller if node split propagation causes T to split.

 Similar to insertion into B+-tree but may insert into any leaf; leaf
splits in case capacity exceeded.
 Which leaf to insert into? (Choose Leaf)
 How to split a node? (Node Split)

[Ins_1] Locate a leaf L starting from T to insert E = (I, tuple-identifier).
 Notion (i): Select the path that would require the least enlargement to include E.I.
 Notion (ii): Resolve ties by choosing the child-node with the smallest MBR.
 Invoke: L = ChooseLeaf (E, T).
[Figure: node @rN holding the entries A, B, C, drawn with their regions inside rN together with the new rectangle E.I.]

Algorithm: ChooseLeaf (E, N)
Inputs: (i) Entry E = (I, tuple-identifier), (ii) A valid R-Tree node N.
Output: The leaf L where E should be inserted.
 If N is leaf Then Return N
 Let FS be the set of current entries in the node N
 Let F = (I, child-pointer) ∈ FS be the entry whose F.I best satisfies the
insertion notions (least enlargement, ties resolved by smallest MBR)
 Return ChooseLeaf (E, F.child-pointer)
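ChooseLeaf can be sketched as a short descent loop. The `Node` class, the rectangle layout and the helper names below are assumptions made for illustration; the selection key encodes the two notions (least enlargement first, then smallest area).

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple

Rect = Tuple[float, float, float, float]  # assumed layout: (xmin, ymin, xmax, ymax)


@dataclass(eq=False)
class Node:
    is_leaf: bool
    entries: List[Tuple[Rect, Any]] = field(default_factory=list)  # (I, payload)


def area(r: Rect) -> float:
    return (r[2] - r[0]) * (r[3] - r[1])


def union(a: Rect, b: Rect) -> Rect:
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def enlargement(r: Rect, new: Rect) -> float:
    """Extra area r would need in order to also cover `new`."""
    return area(union(r, new)) - area(r)


def choose_leaf(node: Node, rect: Rect) -> Node:
    """Descend from `node` to the leaf where an entry with MBR `rect` should go."""
    while not node.is_leaf:
        # Notion (i): least enlargement; Notion (ii): smallest MBR on ties.
        _, node = min(node.entries,
                      key=lambda e: (enlargement(e[0], rect), area(e[0])))
    return node


leaf_a = Node(True, [((0, 0, 4, 4), "a")])
leaf_b = Node(True, [((10, 10, 12, 12), "b")])
leaf_c = Node(True, [((1, 1, 3, 3), "c")])

root = Node(False, [((0, 0, 4, 4), leaf_a), ((10, 10, 12, 12), leaf_b)])
print(choose_leaf(root, (5, 5, 6, 6)) is leaf_a)        # True: enlargement 20 vs 45

tie = Node(False, [((0, 0, 4, 4), leaf_a), ((1, 1, 3, 3), leaf_c)])
print(choose_leaf(tie, (2, 2, 2.5, 2.5)) is leaf_c)     # True: both need 0, c is smaller
```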

[Ins_2] Add E to L.
 Notion (i): If L has room for another entry, install E.
 Notion (ii): Otherwise split L to obtain L and L', which between
them will contain all previous entries in L and the new E
(consolidated for local optima).
[Ins_3] Propagate MBR changes upwards by invoking
AdjustTree (L, L').
 Notion (i): Ascend from leaf L to the root T while adjusting the
covering rectangles (MBRs).
 Notion (ii): If L' exists, propagate node splits as necessary; i.e.
attempt to install a new entry in the parent of L to point to L'.

Example. Found L = @Y to insert new E = e. R-Tree settings: M = 3, m = 1.
[Figure: root @G holds the entry K; node @K holds the entries X, Y, Z; leaf @Y holds the entries a, b, c.]

Algorithm: AdjustTree (N, N')
Inputs: (i) A node N that has had its contents modified, (ii) The
resultant split node N', if not NULL, that accompanies N.
Outputs: (i) N as above, (ii) N' as above.
 If N is the root Then Return {(i) N, (ii) N'}
 Let PN be the parent node of N.
 Let EN = (I_N, child-pointer_N) in PN, where child-pointer_N points
to N.
 Adjust I_N so that it tightly encloses all entry regions in N.
 If N' is Not NULL Then
   Create a new Entry EN' = (I_N', child-pointer_N')
   If number of entries in PN < M Then
     Install EN' in PN
     Return AdjustTree (PN, NULL)
   Else
     Set {PN, PN'} = SplitNode (PN, EN')
     Return AdjustTree (PN, PN')
   End If
 Else
   Return AdjustTree (PN, NULL)
 End If

[Ins_4] Grow Tree taller.
 Notion: If the recursive node split propagation causes the root to
split, then create a new root whose children are the two resulting
nodes.
[Figure: root @T holds the entries A, B, C; the child node of C has been split into @C (entries E, F) and @C' (entries G, H).]
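Steps [Ins_1] to [Ins_4] can be tied together in one sketch. Everything named here is an assumption for illustration: the `Node` class with a `parent` link, the value `M = 3`, and in particular `split_node`, which is a deliberately naive stand-in (sort by x-centre and cut in half) because the slides do not specify the SplitNode heuristic.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional, Tuple

Rect = Tuple[float, float, float, float]  # assumed layout: (xmin, ymin, xmax, ymax)
M = 3  # assumed maximum number of entries per node


@dataclass(eq=False)
class Node:
    is_leaf: bool
    entries: List[Tuple[Rect, Any]] = field(default_factory=list)  # (I, payload)
    parent: Optional["Node"] = None


def area(r: Rect) -> float:
    return (r[2] - r[0]) * (r[3] - r[1])


def union(a: Rect, b: Rect) -> Rect:
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def mbr_of(node: Node) -> Rect:
    rect = node.entries[0][0]
    for r, _ in node.entries[1:]:
        rect = union(rect, r)
    return rect


def split_node(node: Node) -> Node:
    """Naive stand-in for SplitNode (assumption): sort by x-centre, cut in half."""
    items = sorted(node.entries, key=lambda e: e[0][0] + e[0][2])
    half = len(items) // 2
    node.entries = items[:half]
    sibling = Node(node.is_leaf, items[half:], node.parent)
    if not sibling.is_leaf:
        for _, child in sibling.entries:
            child.parent = sibling          # children that moved get a new parent
    return sibling


def insert(root: Node, rect: Rect, obj_id: Any) -> Node:
    """Insert (rect, obj_id) and return the (possibly new) root."""
    node = root
    while not node.is_leaf:                 # [Ins_1] ChooseLeaf
        _, node = min(node.entries,
                      key=lambda e: (area(union(e[0], rect)) - area(e[0]), area(e[0])))
    node.entries.append((rect, obj_id))     # [Ins_2] add E, split on overflow
    sibling = split_node(node) if len(node.entries) > M else None
    while node.parent is not None:          # [Ins_3] AdjustTree
        parent = node.parent
        parent.entries = [(mbr_of(c), c) for _, c in parent.entries]  # tighten MBRs
        if sibling is not None:
            parent.entries.append((mbr_of(sibling), sibling))
        sibling = split_node(parent) if len(parent.entries) > M else None
        node = parent
    if sibling is not None:                 # [Ins_4] root split: grow taller
        new_root = Node(False, [(mbr_of(node), node), (mbr_of(sibling), sibling)])
        node.parent = sibling.parent = new_root
        return new_root
    return node


root = Node(True)
for i in range(20):
    root = insert(root, (i, i, i + 1, i + 1), f"o{i}")
print(root.is_leaf, len(root.entries))
```

A production split would use one of the published heuristics (for example Guttman's quadratic or linear split); the halving used here only keeps the sketch short.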

 The height of an R-Tree containing n entry-records is at most
⌈log_m n⌉ – 1, because the branching factor of each node is at
least m.
 The maximum number of nodes is: ⌈n/m⌉ + ⌈n/m^2⌉ + … + 1.
 Worst case space utilisation for all nodes except the root is: m/M.
 Nodes will tend to have more than m entries, and this will: decrease the tree height and improve space utilisation.
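To get a feel for these bounds, the sketch below evaluates the height bound ⌈log_m n⌉ – 1 and the node-count bound ⌈n/m⌉ + ⌈n/m^2⌉ + … + 1 given in Guttman's original R-Tree paper. The function names are hypothetical, and integer arithmetic is used to avoid floating-point rounding in the logarithm.

```python
def height_bound(n: int, m: int) -> int:
    """ceil(log_m n) - 1, computed with integers only (assumes m >= 2, n >= 1)."""
    levels, capacity = 0, 1
    while capacity < n:
        capacity *= m
        levels += 1
    return max(levels - 1, 0)


def max_nodes(n: int, m: int) -> int:
    """ceil(n/m) + ceil(n/m^2) + ... + 1: every node holds only m entries."""
    total, count = 0, n
    while count > 1:
        count = -(-count // m)   # ceiling division
        total += count
    return max(total, 1)


print(height_bound(1000, 10), max_nodes(1000, 10))            # 2 111
print(height_bound(1_000_000, 50), max_nodes(1_000_000, 50))  # 3 20409
```

With n = 1000 and m = 10, the worst case has 100 leaves, 10 nodes above them, and one root: 111 nodes and a height of 2.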

 Current index entry-records are removed from the leaves.
 Nodes that underflow are condensed, and their contents redistributed
appropriately throughout the tree.
 A condense propagation may cause the tree to shorten in height.
The main Delete routine
 Let E = (I, tuple-identifier) be a current entry to be removed.
 Let T be the root of the R-Tree.
 [Del_1] Find the leaf L starting from T that contains E.
 [Del_2] Remove E from L, and condense "underflow" nodes.
 [Del_3] Propagate MBR changes upwards.
 [Del_4] Shorten tree if T contains only 1 entry after condense propagation.

 [Del_1] Find the leaf L starting from T that contains E.
 Algorithm: FindLeaf (E, N)
 Inputs: (i) Entry E = (I, tuple-identifier), (ii) A valid R-Tree node N.
 Output: The leaf L containing E.
 If N is leaf Then
 If N contains E Then Return N
 Else Return NULL
 Else
 Let FS be the set of current entries in N.
 For each F = (I, child-pointer) ∈ FS where F.I overlaps E.I Do
 Set L = FindLeaf (E, F.child-pointer)
 If L is not NULL Then Return L
 Next F
 Return NULL
 End If
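FindLeaf can be sketched as a recursive search that backtracks when an overlapping subtree turns out not to hold the entry. The `Node` class and the rectangle layout are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional, Tuple

Rect = Tuple[float, float, float, float]  # assumed layout: (xmin, ymin, xmax, ymax)


@dataclass(eq=False)
class Node:
    is_leaf: bool
    entries: List[Tuple[Rect, Any]] = field(default_factory=list)  # (I, payload)


def overlaps(a: Rect, b: Rect) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def find_leaf(node: Node, rect: Rect, obj_id: Any) -> Optional[Node]:
    """Return the leaf holding the entry (rect, obj_id), or None if it is absent."""
    if node.is_leaf:
        return node if (rect, obj_id) in node.entries else None
    for mbr, child in node.entries:
        if overlaps(mbr, rect):                 # only subtrees that could hold E
            found = find_leaf(child, rect, obj_id)
            if found is not None:
                return found
    return None


# Two leaves whose MBRs overlap, so the search may have to try both.
leaf1 = Node(True, [((0, 0, 1, 1), "a"), ((4, 4, 5, 5), "b")])
leaf2 = Node(True, [((3, 3, 4, 4), "c"), ((7, 7, 8, 8), "d")])
root = Node(False, [((0, 0, 5, 5), leaf1), ((3, 3, 8, 8), leaf2)])

print(find_leaf(root, (3, 3, 4, 4), "c") is leaf2)   # True (leaf1 is tried first)
print(find_leaf(root, (9, 9, 10, 10), "z"))          # None
```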

[Del_2] Remove E from L, and condense "underflow" nodes.
[Del_3] Propagate MBR changes upwards.
 Notion (i): Ascend from leaf L to root T while adjusting the covering
rectangles (MBRs).
 Notion (ii): If, after removing the entry E from L, the number of
entries in L becomes fewer than m, then the node L has to be
eliminated and its remaining contents relocated.

 Propagate these notions upwards by invoking CondenseTree (N,
QS), where N is an R-Tree node whose entries have been modified,
and QS is the set of eliminated nodes.
 Start the propagation by setting N = L, and QS = ∅.
 Re-insert the entries from the eliminated nodes in QS back into the
tree.
 Entries from eliminated leaf nodes are re-inserted as new entries
using the Insert routine discussed earlier.
 Entries from higher-level nodes must be placed higher in the tree so
that leaves of their dependent subtrees will be on the same level as
the leaves on the main tree.

 Example: Delete the index entry-record b. R-Tree settings: M = 4,
m = 2.
 Spatial constraint: a.I will form smallest MBR with r4.
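[Figure: R-Tree before the deletion. Root @r7 holds the entries r2, r6; node @r2 holds r0, r1; node @r6 holds r3, r4, r5; leaves: @r0 = {a, b}, @r1 = {c, d, e}, @r3 = {f, g, h}, @r4 = {i, j}, @r5 = {k, l, m}. The extracted figure also carries a stray label n.]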

Algorithm: CondenseTree (N, QS)
Inputs: (i) A node N whose entries have been modified, (ii) A set of
eliminated nodes QS.
 If N is NOT the root Then
   Let PN be the parent node of N.
   Let EN = (I_N, child-pointer_N) in PN.
   If N.entries < m Then
     Delete EN from PN
     Add N to QS
   Else
     Adjust I_N so that it tightly encloses all entry regions in N.
   End If
   CondenseTree (PN, QS)
 Else If N is root AND QS is NOT ∅ Then
   For each Q ∈ QS Do
     For each E ∈ Q Do
       If Q is leaf Then Insert (E)
       Else Insert (E) as a node entry at the same node level as Q
       End If
     Next E
   Next Q
 End If
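The upward pass of CondenseTree can be sketched as below. Two simplifications are assumptions of this sketch and differ from the algorithm above: eliminated higher-level nodes are flattened into their leaf entries instead of being re-inserted at their original level, and the re-insertion itself is left to the caller. The `Node` class, the helpers and the rectangles are hypothetical; the node names follow the example with M = 4, m = 2.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional, Tuple

Rect = Tuple[float, float, float, float]  # assumed layout: (xmin, ymin, xmax, ymax)


@dataclass(eq=False)
class Node:
    is_leaf: bool
    entries: List[Tuple[Rect, Any]] = field(default_factory=list)  # (I, payload)
    parent: Optional["Node"] = None


def mbr_of(node: Node) -> Rect:
    rects = [r for r, _ in node.entries]
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))


def leaf_entries(node: Node) -> List[Tuple[Rect, Any]]:
    if node.is_leaf:
        return list(node.entries)
    return [e for _, child in node.entries for e in leaf_entries(child)]


def condense_tree(leaf: Node, m: int):
    """Walk from `leaf` to the root; return (new_root, orphaned leaf entries)."""
    orphans: List[Tuple[Rect, Any]] = []
    node = leaf
    while node.parent is not None:
        parent = node.parent
        if len(node.entries) < m:            # underflow: eliminate the node
            parent.entries = [(r, c) for r, c in parent.entries if c is not node]
            orphans.extend(leaf_entries(node))
        else:                                # otherwise tighten its covering MBR
            parent.entries = [(mbr_of(c) if c is node else r, c)
                              for r, c in parent.entries]
        node = parent
    while not node.is_leaf and len(node.entries) == 1:   # shorten the tree
        node = node.entries[0][1]
        node.parent = None
    return node, orphans


def leaf(names: str, x0: int) -> Node:
    return Node(True, [((x0 + i, 0, x0 + i + 1, 1), n) for i, n in enumerate(names)])


def inner(*children: Node) -> Node:
    node = Node(False, [(mbr_of(c), c) for c in children])
    for c in children:
        c.parent = node
    return node


r0, r1, r3, r4, r5 = leaf("ab", 0), leaf("cde", 10), leaf("fgh", 20), leaf("ij", 30), leaf("klm", 40)
r2, r6 = inner(r0, r1), inner(r3, r4, r5)
r7 = inner(r2, r6)

r0.entries = [e for e in r0.entries if e[1] != "b"]      # delete entry b from its leaf
root, orphans = condense_tree(r0, m=2)
print(root is r6, [name for _, name in orphans])          # True ['a', 'c', 'd', 'e']
```

Here r0 underflows, which makes r2 underflow in turn, and the root is left with the single child r6, so the tree loses one level. The orphaned entries would then go back through the Insert routine.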

Why ‘re-insert’ orphaned entries?
 Alternatively, like the delete routine in B-Tree (Rosenberg & Snyder, 1981),
an "underflow" node can be merged with whichever adjacent sibling will
have its area increased the least, or its entries re-distributed among sibling
nodes.
 Both methods can cause the nodes to split.
 Eventually all changes need to be propagated upwards, anyway.
Re-insertion accomplishes the same thing, and:
 It is simpler to implement (and at comparable efficiency).
 It incrementally refines the spatial structure of the tree.
 It prevents the gradual deterioration that would occur if each entry stayed
permanently under the same parent node.

 A high value of m, nearer to M, is useful when the underlying
database represented by the R-Tree is mostly used for search
inquiries with very few updates.
 The height of the tree will be kept to a minimum.
 High search performance is maintained.
 However, the risk of overflow and underflow is also high.
 A small value of m is good when frequent updates and
modifications of the underlying database are required.
 The nodes are less dense.
 Maintenance is less costly.

 The R+-Tree variant avoids multiple paths during searching.
 Objects may be stored in multiple nodes
 MBRs of nodes at same tree level do not overlap
 On insertion/deletion the tree may change downward or upward in
order to maintain the structure
R-Tree Variants

http://perso.enst.fr/~saglio/bdas/EPFL0525/sld041.htm

 Similar to other R-Trees except that the Hilbert
value of its rectangle centroid is calculated.
 That key is used to guide the insertion
 On an overflow, evenly divide between two nodes
 Experiments have shown that this scheme
significantly improves performance and decreases
insertion complexity.
 Hilbert R-tree achieves up to 28% saving in the
number of pages touched compared to R*-tree.

 The Hilbert value of an object is found by interleaving the bits of
its x and y coordinates, and then chopping the binary string into 2-bit
strings.
 Then, for every 2-bit string, if the value is 0, we replace every 1 in
the original string with a 3, and vice-versa.
 If the value of the 2-bit string is 3, we replace all 2's and 0's in a
similar fashion.
 After this is done, you put all the 2-bit strings back together and
compute the decimal value of the binary string;
 This is the Hilbert value of the object.
 http://www-users.cs.umn.edu/research/shashi-group/CS8715/exercise_ans.doc
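As a hedged illustration, the sketch below computes a Hilbert value with the widely used iterative rotate-and-reflect conversion from grid coordinates to curve position. This is a swapped-in technique, not the bit-string replacement procedure described above, and the two need not number the cells identically. The function names, the integer grid of side `n` (a power of two) and the centroid rounding are assumptions.

```python
from typing import Tuple

Rect = Tuple[int, int, int, int]  # assumed layout on an integer grid: (xmin, ymin, xmax, ymax)


def hilbert_index(n: int, x: int, y: int) -> int:
    """Position of cell (x, y) along a Hilbert curve filling an n x n grid."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)       # which quadrant at this scale
        if ry == 0:                        # rotate/reflect into the sub-quadrant frame
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def hilbert_key(rect: Rect, n: int) -> int:
    """Hilbert value of a rectangle's centroid, rounded down to a grid cell."""
    cx = (rect[0] + rect[2]) // 2
    cy = (rect[1] + rect[3]) // 2
    return hilbert_index(n, cx, cy)


order = sorted(((x, y) for x in range(4) for y in range(4)),
               key=lambda p: hilbert_index(4, p[0], p[1]))
print(order[:4])                     # [(0, 0), (1, 0), (1, 1), (0, 1)]
print(hilbert_key((2, 2, 3, 3), 4))  # 8
```

Cells that are consecutive along the curve are always grid neighbours, which is why sorting rectangles by this key tends to place nearby objects in the same node.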

 Proposed by Norbert Beckmann, Hans-Peter Kriegel, Ralf
Schneider, and Bernhard Seeger in 1990
 Same algorithm as the regular R-tree for query and delete
operations.
 When inserting, the R*-tree uses a combined strategy.
 For leaf nodes, overlap is minimized
 For inner nodes, enlargement and area are minimized.
 When splitting, the R*-tree uses a topological split that chooses a
split axis based on perimeter, then minimizes overlap.
 In addition to an improved split strategy, the R*-tree also tries to
avoid splits by reinserting objects and subtrees into the tree,
inspired by the concept of balancing a B-tree.

 MBR: Minimum Bounding Rectangle
 R-Trees are an extremely compelling data structure for spatial
data.
 Largely based on B-Tree (Can be considered a generalization of
B-Tree)
 Can support more than two dimensions
 Support same basic operations (deletion, searching, insertion,
update, etc.)
 Many variants of R-Trees are available
