メモリより大きなデータの Sufix Array 構築方法の紹介

メモリより⼤大きいデータの
Suffix Tree 構築⽅方法の紹介
2014-08-29
星野喬
1

概要
• 主に以下の論論⽂文の紹介
• Suffix trees for inputs larger than main
memory
– Marina Barsky, Ulrike Stege and Alex Thomo
– Information Systems archive Volume 36 Issue
3, May, 2011
– CIKM2009 での short paper の発展版
2

論論⽂文の主張
• メモリより⼤大きいデータの suffix tree の
構築⽅方法の提案
• 既存の構築⽅方法と違い，ディスクアクセ
スはほぼシーケンシャルアクセスのみ
3

Suffix tree とは (復復習)
• ⾼高速な⽂文字列列操
作のための suffix
を⽤用いた⽊木構造
– 葉葉ノード:
suffix を表す
– 節ノード:
suffixes 間の共通
prefix を表す
4
most N −1 internal nodes in the tree. Hence, the maximum number
linear in N. The tree’s total space is linear in N in the case that each
stored in a constant space. Fortunately, this is the case for an implicit
substrings by their positions.
R
a b a b c
0 1 2 3 4
0
ab
abc
1-1 b
4-4
abc
1
c
4
3
c
c
2
0-1
2-4 4-4 2-4 4-4
= ababc. For clarity, the explicit edge labels are shown, which are represented as
in the actual suffix tree. Each suffix Si can be found by concatenating substrings

提案アルゴリズム: B2ST
• Big tree, Big string Suffix Tree construction
algorithm
• 3 つのフェーズ
– (1) Input partitioning
– (2) Pair-wise sorting
– (3) Merging
5

変数の意味
• X: ⼊入⼒力力⽂文字列列 (⻑⾧長さN)
• Xij: X[i]からX[j-1]までの部分⽂文字列列
• Si: i 番⽬目の suffix (0 = i N)
• LCPij: Si と Sj の longest common prefix
• |LCPij|: LCPij の⻑⾧長さ
• SA: suffix array
• ST: suffix tree
6

partition, and its positions are not included into the suffix array of the partition,
positions are not included into the suffix array of the partition. This ensures suffix Input partitioning
Si of a partition Xu (0 ≤ i |Xu|, 0 ≤ u k) will be sorted as representative of a suffix Sup+i of X. In practice, for real-life DNA sequences, the tail is negligibly small compared to the size of the partition itself (it never exceeded
• X を k 個の Xu に分割 (0 = u k)
• k = 2N/M (M: メモリサイズ)
• 各 Xu は tail を持つ
– Xu 内での suffix sorting 時の区別のため
– tail と同じ部分⽂文字列列が含まれないように
– 遺伝⼦子だと⾼高々 1000 くらい
characters in our experiments with DNA databases).
Consider the example in Figure 2. It shows the partitioning of input string ababaaabbabbbabaabab into four partitions. The main memory can hold up to 16 the input at a time. To facilitate the example, we represent our input as stands for 0-bit, b stands for 1-bit). In this illustration we also refer to the as A, B, C, and D in order to distinguish them from the numbers representing
positions. Note that the tail of partition XB is substring bbb, which never = aabba.
Partition A Partition B Partition C Partition D
a b a b a a a b b a b b b a b a a b a b
1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
Input string X = ababaaabbabbbabaabab and its four partitions. The tail of partition B 7is

Pair-wise sorting
• 全ての XuXv について SA を LCP 付きで構
築 (SAuv)
– XuXv: Xu と Xv を結合したもの (u v)
– オンメモリで処理理できるように k を調整済
– SA 内で隣隣り合う suffix 同⼠士の LCP も保存
• 出⼒力力
– SAu: Xu の SA
– Ruv: SAuv 内の LCP と uv を表す bit の列列
8

In this step we generate suffix arrays for each pair of partitions. The pseudocode for this
step is shown in Figure 3. We concatenate every possible pair u, v of partitions with their
tails (0 ≤ u k − 1, u + SAu 1 ≤ v k, main memory and build the suffix array と u v) into Ruv string XuXv. のWe 例例load
this input into the
SAuv with attached LCP length information for
each suffix. The suffixes which start in tail positions are excluded from the output suffix
array, they serve only for determining the relative order of suffixes starting at the end of
each partition. An LCP entry of SAuv is the length of the longest common prefix of each
suffix in SAuv with its immediate predecessor. Figure • SAu 4 shows size: such an 4N/array k for + the α
pair
A, B of partitions for the same input string as in Figure 2. From each SAuv, we extract
– N/k 2^32 が前提
• Ruv size: 8N/k + α
• SAu は k 個
• Ruv は k(k-1)/2 個
• 計 4kN + α: O(kN)
9
for determining the relative order of suffixes starting at the end of
LCP entry of SAuv is the length of the longest common prefix of each
immediate predecessor. Figure 4 shows such an array for the pair
the same input string as in Figure 2. From each SAuv, we extract
SAAB (in memory)
suffix start
LCP
partition bit
5 1 3 1 2 5 4 2 4 3
0 2 1 3 2 3 0 2 3 1
A B A A B B A A B B
RAB
SAAB (in memory)
suffix start
LCP
partition bit
5 1 3 1 2 5 4 2 4 3
0 2 1 3 2 3 0 2 3 1
A B A A B B A A B B
LCP 0 2 1 3 2 3 0 2 3 1
partition bit A B A A B B A A B B
SAA
5 3 1 4 2
RAB
LCP 0 2 1 3 2 3 0 2 3 1
partition bit A B A A B B A A B B
SAA
written 5 3 to 1 disk
4 2
written to disk
with LCP information for a pair of partitions A and B. Two structures are
the suffix array of partition A and (2) the order array RAB storing the relative
B. These two structures are written to disk.
Fig. 4. Suffix array SAAB with LCP information for a pair of partitions A and B. Two structures are
extracted from SAAB: (1) the suffix array of partition A and (2) the order array RAB storing the relative
order of suffixes in A and B. These two structures are written to disk.

Pair-wise sorting pseudo code
10
Algorithm pairwiseSorting
input: k partitions of string X
1. for Algorithm (u=0; upairwiseSorting
k-1; i++)
2. for input: (v=1; k partitions vk; v++)
of string X
3. 1. concatenate for (u=0; uk-1; 2. for (v=1; vk; XuXv i++)
v++)
and load into RAM
4. 3. build concatenate suffix array XuXv with and load LCP into SAuv
RAM
5. 4. during build sequential suffix array scan with LCP of SAuv
SAuv
6. 5. if during v==k-sequential 1 //last scan chunk
of SAuv
6. output if v==k-1 to //last chunk
7. 7. output to disk disk SAv
SAv
8. 8. output output to to disk disk SAu
SAu
9. 9. output output to to disk disk Ruv Ruv //order //order array
array
Fig. 3. Algorithm for pairwise sorting of suffixes in all partition pairs.
Fig. 3. Algorithm for pairwise sorting of suffixes in all partition pairs.
Step 2: Suffix sorting in partition pairs

Merging
• 2 つの部品で構成
– SAu と Ruv のストリームから SA のストリー
ムを作る
– SA のストリームから ST を作る
• オンメモリデータ
– SA_buf_u: SAu ⼊入⼒力力バッファ(size m, k 個)
– R_buf_uv: Ruv ⼊入⼒力力バッファ(size 2m/k,
k(k-1)/2 個)
– ST_buf: ST 出⼒力力バッファ (多分 size M-km)
11

SA ストリーム作成
12
SAu
...
Ruv
...
HeapSA with LCP
Heap 構造を使って SAu (0 = u k)
の中で⼀一番⼩小さい suffix を持つ u を
選んで出⼒力力

5. Merge algorithm
13
What happens with each suffix in the output buffer is the subject pseudocode of merge is shown in Figure 8.
Algorithm merge
1. lastTransferred = null
2. while heap is not empty
3. remove smallest suffix Si of partition u
from the top of the heap
4. rebalance heap
5. lcp = 0
6. if lastTransferred is not null
7. v = lastTransferred.partitionID
8. lcp = LCP from R_bufuv [current_pointer]
9. create leaf for Si using lcp in ST_buf
10. advancePointers (u)
11. lastTransferred=Si
12. if ST_buf is full
13. store Si (max suffix)
as a pointer to the current tree
14. write ST_buf to disk
4. rebalance heap
5. lcp = 0
16. Sj = get next suffix from SA_bufu
17. if Sj is not null
18. insert Sj into heap
Fig. 8. The general pseudocode for merge.
lcp = 0
16. Sj = get next suffix from SA_bufu
17. if Sj is not null
18. insert Sj into heap
Fig. 8. The general pseudocode for merge.

the information we need to efficiently perform the merge. As produce information. Since each Ruv contains an information only about two partitions, we need to the use suffix one bit to tree represent for the the partition entire ID input in Ruv. string Specifically, X. we We use are 0 for doing u entire v (input u v). Figure string InitialMerge 4 shows into SAA main and memory. RAB extracted algorithmIn from fact, SAAB we for never At the end of this step we have on disk suffix arrays for partitions
partition access pair (k k (of total size plus k(k − 1)/2 order arrays for each possible pair of partitions (of total size kN).
Algorithm initializeMerge
This is all the information we need to efficiently perform the merge. As a result of merge we produce the suffix tree for the entire input string X. We are doing this without
loading the entire input string into main memory. In fact, we never access X anymore.
14
1. for each SA_bufu
2. read first m start positions
from disk suffix array SAu
Algorithm initializeMerge
1. for each SA_bufu
2. read first m start positions
3. for each R_bufuv
4. read first m/k LCP+partitionBit from Ruv
from disk suffix array SAu
3. for each R_bufuv
4. read first m/k LCP+partitionBit from Ruv
5. for each SA_bufu
6. insert SA_bufu[0] into heap
5. for each SA_bufu
6. insert SA_bufu[0] into heap
Fig. 5. The pseudocode for buffer allocation as the initial step for merge.
Fig. 5. The pseudocode for buffer allocation as the initial step for merge.

CompareSuffix algorithm
15
Algorithm compareSuffix (Si from partition u,
Sj from partition v)
1. if (u = = v)
2. return -1 //Si lex Sj, since they are sorted
//in increasing order inside each partition
Algorithm compareSuffix (Si from partition u,
3. if (u v)
4. if (partitionBit in R_bufuv[Sj from partition v)
1. if (u = = v)
current pointer] = = 0)
5. return 2. return -1 -1 //Si lex Sj, //Si since they are sorted
lex Sj
6. else
//in increasing order inside each partition
7. return1 //Si lex Sj
8. if (u v)
9. if (partitionBit in R_bufvu[current pointer] = = 0)
10. return 1 //Sj lex Si
11. else
12. return -1 //Sj lex Si
3. if (u v)
4. if (partitionBit in R_bufuv[current pointer] = = 0)
5. return -1 //Si lex Sj
6. else
7. return1 //Si lex Sj
8. if (u v)
9. if (partitionBit in R_bufvu[current pointer] = = 0)
10. return 1 //Sj lex Si
11. else
12. return -1 //Sj lex Si
Fig. 6. Algorithm for suffix comparison which uses the pairwise suffix information from the order arrays
created during pairwise suffix sorting.
Algorithm for suffix comparison which uses the pairwise suffix information from the order during pairwise suffix sorting.

in all relevant order buffers. We insert into the heap the we continue in a similar way until all suffixes are merged.
Fig. 6. Algorithm for suffix comparison which uses the pairwise suffix information from the order arrays
AdvancePointers algorithm
Pseudocode of current pointer advancing in suffix array buffer and the order 16
created during pairwise suffix sorting.
Algorithm advancePointers (partition ID u)
1. SA_bufu.current_pointer++
2. if reached the end of SA_bufu
3. refill SA_bufu from disk-based SAu
4. for (i=0; iu; i++)
5. R_bufiu.current_pointer++
6. if reached the end of R_bufiu
7. refill R_bufiu from disk-based Riu
8. for (i=u; ik; i++)
9. R_bufui.current_pointer++
10. if reached the end of R_bufui
11. refill R_bufui from disk-based Rui
pointers for all the order array buffers containing information about partition u are also
advanced by 1, as shown in pseudocode of Figure 7. This means we have determined the
order of the current suffix of partition u, and we need to consider the next element both
in SA BUFu and in all relevant order buffers. We insert into the heap the next suffix of
partition u, and we continue in a similar way until all suffixes are merged.
Algorithm advancePointers (partition ID u)
1. SA_bufu.current_pointer++
2. if reached the end of SA_bufu
3. refill SA_bufu from disk-based SAu
4. for (i=0; iu; i++)
5. R_bufiu.current_pointer++
6. if reached the end of R_bufiu
7. refill R_bufiu from disk-based Riu
8. for (i=u; ik; i++)
9. R_bufui.current_pointer++
10. if reached the end of R_bufui
11. refill R_bufui from disk-based Rui
Fig. 7. Pseudocode of current pointer advancing in suffix array buffer and the order buffers.

ST 作成例例
• X: ababc
• SA: 02134
• binary alphabet: a: 00, b: 01, c: 10
• 本⼿手法では ST は binary alphabet で表現
– 必ず 2 分⽊木になるので扱いやすい
17

ST with binary alphabet
18
ST with alphabet
ST with binary alphabet
S4
S4
S0S2S1S3
S0S2S1S3
abbc
abccabcc
010
0011
0001101000011010
index や LCP が bit 単位か
byte (⽂文字) 単位かの違い

⽊木のどこに挿⼊入するか
• SA の順で挿⼊入するので，必ず右端のパス
内のどこかに internal node を挿⼊入
• LCP を⾒見見てどこに挿⼊入するか決定
19
L1
L1
L1
inserted
internal node
L1
L1inserted
leaf node
L1

Insert s0 then s2
20
X: ababc
Xb: 0001000110
SA: 02134
Insert s0 (0001000110, I0, L0)Insert s2 (000110, I4, L4)
S0
S0
I0
L4
S2
I0+4 I4+4
I0
IX: index is X (in unit of bit)
LX: LCP is X (in unit of bit)

Insert s1
21
X: ababc
Xb: 0001000110
SA: 02134
S0
I0
L4
S2
I4 I8
Ix: index is x (in unit of bit)
Lx: LCP is x (in unit of bit)
S0
I0+1
L4-1
S2
I4 I8
I0
L1
S1
I2+1

Insert s3
22
X: ababc
Xb: 0001000110
SA: 02134
S0
I1
L3
S2
I4 I8
I0
L1
S1
I3+1
S0
I1
L3
S2
I4 I8
I0
L1
S1
I3
I3
L2-1
S3
I6+2

Insert s4
23
X: ababc
Xb: 0001000110
SA: 02134
S0
I1
L3
S2
I4 I8
I0
L1
S1
I4
I3
L1
S3
I8
S0
I1
L3
S2
I4 I8
I0+0
L1-0
S1
I4
I3
L1
S3
I8
I0
L0
S4
I8+0

Index と LCP の意味
24
S0
I1
L3
S2
0 10
I0
L1
I0+1+3 I4+1+3
I3
L1
S1
I2+1+1
S3
I6+1+1
I0
L0
S4
I8
001 1
000110 10 000110
10
Xb: 0001000110
idx: 0123456789

ST 出⼒力力フォーマット
• 3 つのデータ
– Internal node ⽤用の array
– Leaf node ⽤用の array
– 最⼤大 suffix の prefix (実体): divider
• divider を使って検索索すべきファイルを特定
25
Internal node structure (順番は不不明)Leaf node structure
Left childRight childIndex
4bytes4bytes4bytes
partition id (1byte)
Index
4bytes
partition id (1byte)

他⼿手法との⽐比較
• TDD and ST-merge (2005年年)
– 6MB メモリ 20MB データで 8 時間
• Trellis+SB (2008年年)
– 3GB データ 512MB メモリで 11 時間
– ランダムアクセス 95% 減
• DiGeST (2008年年)
– ランダムアクセス 98% 減
• どれもこれも分割して処理理後 merge する
• ディスクのランダムアクセスをいかに減らすか
がポイント 26

TDD which works with uncompressed inputs. The results 評価: 他⼿手法との⽐比較
Program Time, hours
TDD 125
B2ST 3
• ⼊入⼒力力 3GB の DNA (1⽂文字1byte)
• 使⽤用メモリ 600MB
27
Trellis+SB 11
of different suffix tree construction algorithms for approximately which is larger than the total allocated main memory.
suffix tree for the above 3GB input in 125 hours. of Trellis+SB reported in [20]. The value in Figure similar settings on a comparable machine.
divided the 3GB into partitions of 1GB each and built

produced an intermediate on-disk output of size 234GB. Despite this, the merge in only 59 minutes, scanning all this on-disk data in sequential manner 2514 suffix tree files of total 215GB.
評価: ⼊入⼒力力サイズを変えて⽐比較
Total
time,
min
Merge,
min
Pairwise
sorting,
min
Number of
partition
pairs
Number
of
partitions
Input
size,
GB
6 3 3 441 27 468
8 4 6 730 34 764
10 5 10 1100 42 1142
• ⼊入⼒力力データは DNA を複数結合したもの
• メモリは 2GB 使⽤用
• Pairwise sorting は k^2 で効いてくるが並列列化で
きそう
• Merge はほぼリニア
28
12 6 15 1462 59 1521
Running time (min) of B2ST for different sets (approximately 6, 8, 10 and 12GB) of genomic 2GB of main memory.
example shows that we need a large temporary disk space for scaling up the Specifically, we need D = k2p = kN disk space to store the order arrays partition pairs. Since the number of partitions is k = N/M, from D = N2/M the size of the largest input that we can process with M bytes of internal bytes of disk space. If we substitute the common values for modern computers,

所感
• もしかしてメモリのランダムアクセスよりも
ディスクのシーケンシャルアクセスの⽅方がス
ループット⼤大きかったりする？？？
• k を定数と⾒見見做せるケースでしか使えない
– 実質的に partition id に 1 byte, suffix index に 4 byte
の計 5 bytes (N 2^40) 程度度のデータが対象
– Induce sort でやっているように再帰的に⽐比較して
O(kN) à O(N) にできないものか
• tail の件といい dividers の件といい，LCP が巨⼤大
な場合をあまり考慮していない
– 遺伝⼦子が対象ならそれで良良いが．．．
29

メモリより大きなデータの Sufix Array 構築方法の紹介

More Related Content

Viewers also liked

Similar to メモリより大きなデータの Sufix Array 構築方法の紹介

More from Takashi Hoshino

Recently uploaded

メモリより大きなデータの Sufix Array 構築方法の紹介