KEMBAR78
メモリより大きなデータの Sufix Array 構築方法の紹介 | PDF
メモリより⼤大きいデータの 
Suffix Tree 構築⽅方法の紹介 
2014-08-29 
星野 喬 
1
概要 
• 主に以下の論論⽂文の紹介 
• Suffix trees for inputs larger than main 
memory 
– Marina Barsky, Ulrike Stege and Alex Thomo 
– Information Systems archive Volume 36 Issue 
3, May, 2011 
– CIKM2009 での short paper の発展版 
2
論論⽂文の主張 
• メモリより⼤大きいデータの suffix tree の 
構築⽅方法の提案 
• 既存の構築⽅方法と違い,ディスクアクセ 
スはほぼシーケンシャルアクセスのみ 
3
Suffix tree とは (復復習) 
• ⾼高速な⽂文字列列操 
作のための suffix 
を⽤用いた⽊木構造 
– 葉葉ノード: 
suffix を表す 
– 節ノード: 
suffixes 間の共通 
prefix を表す 
4 
most N −1 internal nodes in the tree. Hence, the maximum number 
linear in N. The tree’s total space is linear in N in the case that each 
stored in a constant space. Fortunately, this is the case for an implicit 
substrings by their positions. 
R 
a b a b c 
0 1 2 3 4 
0 
ab 
abc 
1-1 b 
4-4 
abc 
1 
c 
4 
3 
c 
c 
2 
0-1 
2-4 4-4 2-4 4-4 
= ababc. For clarity, the explicit edge labels are shown, which are represented as 
in the actual suffix tree. Each suffix Si can be found by concatenating substrings
提案アルゴリズム: B2ST 
• Big tree, Big string Suffix Tree construction 
algorithm 
• 3 つのフェーズ 
– (1) Input partitioning 
– (2) Pair-wise sorting 
– (3) Merging 
5
変数の意味 
• X: ⼊入⼒力力⽂文字列列 (⻑⾧長さN) 
• Xij: X[i]からX[j-1]までの部分⽂文字列列 
• Si: i 番⽬目の suffix (0 = i  N) 
• LCPij: Si と Sj の longest common prefix 
• |LCPij|: LCPij の⻑⾧長さ 
• SA: suffix array 
• ST: suffix tree 
6
partition, and its positions are not included into the suffix array of the partition, 
positions are not included into the suffix array of the partition. This ensures suffix Input partitioning 
Si of a partition Xu (0 ≤ i  |Xu|, 0 ≤ u  k) will be sorted as representative of a suffix Sup+i of X. In practice, for real-life DNA sequences, the tail is negligibly small compared to the size of the partition itself (it never exceeded 
• X を k 個の Xu に分割 (0 = u  k) 
• k = 2N/M (M: メモリサイズ) 
• 各 Xu は tail を持つ 
– Xu 内での suffix sorting 時の区別のため 
– tail と同じ部分⽂文字列列が含まれないように 
– 遺伝⼦子だと⾼高々 1000 くらい 
characters in our experiments with DNA databases). 
Consider the example in Figure 2. It shows the partitioning of input string ababaaabbabbbabaabab into four partitions. The main memory can hold up to 16 the input at a time. To facilitate the example, we represent our input as stands for 0-bit, b stands for 1-bit). In this illustration we also refer to the as A, B, C, and D in order to distinguish them from the numbers representing 
positions. Note that the tail of partition XB is substring bbb, which never = aabba. 
Partition A Partition B Partition C Partition D 
a b a b a a a b b a b b b a b a a b a b 
1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 
Input string X = ababaaabbabbbabaabab and its four partitions. The tail of partition B 7is
Pair-wise sorting 
• 全ての XuXv について SA を LCP 付きで構 
築 (SAuv) 
– XuXv: Xu と Xv を結合したもの (u  v) 
– オンメモリで処理理できるように k を調整済 
– SA 内で隣隣り合う suffix 同⼠士の LCP も保存 
• 出⼒力力 
– SAu: Xu の SA 
– Ruv: SAuv 内の LCP と uv を表す bit の列列 
8
In this step we generate suffix arrays for each pair of partitions. The pseudocode for this 
step is shown in Figure 3. We concatenate every possible pair u, v of partitions with their 
tails (0 ≤ u  k − 1, u + SAu 1 ≤ v  k, main memory and build the suffix array と u  v) into Ruv string XuXv. のWe 例例load  
this input into the 
SAuv with attached LCP length information for 
each suffix. The suffixes which start in tail positions are excluded from the output suffix 
array, they serve only for determining the relative order of suffixes starting at the end of 
each partition. An LCP entry of SAuv is the length of the longest common prefix of each 
suffix in SAuv with its immediate predecessor. Figure • SAu 4 shows size: such an 4N/array k for + the α 
pair 
A, B of partitions for the same input string as in Figure 2. From each SAuv, we extract 
– N/k  2^32 が前提 
• Ruv size: 8N/k + α 
• SAu は k 個 
• Ruv は k(k-1)/2 個 
• 計 4kN + α: O(kN) 
9 
for determining the relative order of suffixes starting at the end of 
LCP entry of SAuv is the length of the longest common prefix of each 
immediate predecessor. Figure 4 shows such an array for the pair 
the same input string as in Figure 2. From each SAuv, we extract 
SAAB (in memory) 
suffix start 
LCP 
partition bit 
5 1 3 1 2 5 4 2 4 3 
0 2 1 3 2 3 0 2 3 1 
A B A A B B A A B B 
RAB 
SAAB (in memory) 
suffix start 
LCP 
partition bit 
5 1 3 1 2 5 4 2 4 3 
0 2 1 3 2 3 0 2 3 1 
A B A A B B A A B B 
LCP 0 2 1 3 2 3 0 2 3 1 
partition bit A B A A B B A A B B 
SAA 
5 3 1 4 2 
RAB 
LCP 0 2 1 3 2 3 0 2 3 1 
partition bit A B A A B B A A B B 
SAA 
written 5 3 to 1 disk 
4 2 
written to disk 
with LCP information for a pair of partitions A and B. Two structures are 
the suffix array of partition A and (2) the order array RAB storing the relative 
B. These two structures are written to disk. 
Fig. 4. Suffix array SAAB with LCP information for a pair of partitions A and B. Two structures are 
extracted from SAAB: (1) the suffix array of partition A and (2) the order array RAB storing the relative 
order of suffixes in A and B. These two structures are written to disk.
Pair-wise sorting pseudo code 
10 
Algorithm pairwiseSorting 
input: k partitions of string X 
1. for Algorithm (u=0; upairwiseSorting 
k-1; i++) 
2. for input: (v=1; k partitions vk; v++) 
of string X 
3. 1. concatenate for (u=0; uk-1; 2. for (v=1; vk; XuXv i++) 
v++) 
and load into RAM 
4. 3. build concatenate suffix array XuXv with and load LCP into SAuv 
RAM 
5. 4. during build sequential suffix array scan with LCP of SAuv 
SAuv 
6. 5. if during v==k-sequential 1 //last scan chunk 
of SAuv 
6. output if v==k-1 to //last chunk 
7. 7. output to disk disk SAv 
SAv 
8. 8. output output to to disk disk SAu 
SAu 
9. 9. output output to to disk disk Ruv Ruv //order //order array 
array 
Fig. 3. Algorithm for pairwise sorting of suffixes in all partition pairs. 
Fig. 3. Algorithm for pairwise sorting of suffixes in all partition pairs. 
Step 2: Suffix sorting in partition pairs
Merging 
• 2 つの部品で構成 
– SAu と Ruv のストリームから SA のストリー 
ムを作る 
– SA のストリームから ST を作る 
• オンメモリデータ 
– SA_buf_u: SAu ⼊入⼒力力バッファ(size m, k 個) 
– R_buf_uv: Ruv ⼊入⼒力力バッファ(size 2m/k, 
k(k-1)/2 個) 
– ST_buf: ST 出⼒力力バッファ (多分 size M-km) 
11
SA ストリーム作成 
12 
SAu 
... 
Ruv 
... 
HeapSA with LCP 
Heap 構造を使って SAu (0 = u  k) 
の中で⼀一番⼩小さい suffix を持つ u を 
選んで出⼒力力
5. Merge algorithm 
13 
What happens with each suffix in the output buffer is the subject pseudocode of merge is shown in Figure 8. 
Algorithm merge 
1. lastTransferred = null 
2. while heap is not empty 
3. remove smallest suffix Si of partition u 
from the top of the heap 
4. rebalance heap 
5. lcp = 0 
6. if lastTransferred is not null 
7. v = lastTransferred.partitionID 
8. lcp = LCP from R_bufuv [current_pointer] 
9. create leaf for Si using lcp in ST_buf 
10. advancePointers (u) 
11. lastTransferred=Si 
12. if ST_buf is full 
13. store Si (max suffix) 
as a pointer to the current tree 
14. write ST_buf to disk 
4. rebalance heap 
5. lcp = 0 
6. if lastTransferred is not null 
7. v = lastTransferred.partitionID 
8. lcp = LCP from R_bufuv [current_pointer] 
9. create leaf for Si using lcp in ST_buf 
10. advancePointers (u) 
11. lastTransferred=Si 
12. if ST_buf is full 
13. store Si (max suffix) 
as a pointer to the current tree 
14. write ST_buf to disk 
15. lastTransferred = null 
16. Sj = get next suffix from SA_bufu 
17. if Sj is not null 
18. insert Sj into heap 
Fig. 8. The general pseudocode for merge. 
lcp = 0 
6. if lastTransferred is not null 
7. v = lastTransferred.partitionID 
8. lcp = LCP from R_bufuv [current_pointer] 
9. create leaf for Si using lcp in ST_buf 
10. advancePointers (u) 
11. lastTransferred=Si 
12. if ST_buf is full 
13. store Si (max suffix) 
as a pointer to the current tree 
14. write ST_buf to disk 
15. lastTransferred = null 
16. Sj = get next suffix from SA_bufu 
17. if Sj is not null 
18. insert Sj into heap 
Fig. 8. The general pseudocode for merge.
the information we need to efficiently perform the merge. As produce information. Since each Ruv contains an information only about two partitions, we need to the use suffix one bit to tree represent for the the partition entire ID input in Ruv. string Specifically, X. we We use are 0 for doing u entire v (input u  v). Figure string InitialMerge 4 shows into SAA main and memory. RAB extracted algorithmIn from fact, SAAB we for never At the end of this step we have on disk suffix arrays for partitions  
partition access pair (k k (of total size plus k(k − 1)/2 order arrays for each possible pair of partitions (of total size kN). 
Algorithm initializeMerge 
This is all the information we need to efficiently perform the merge. As a result of merge we produce the suffix tree for the entire input string X. We are doing this without 
loading the entire input string into main memory. In fact, we never access X anymore. 
14 
1. for each SA_bufu 
2. read first m start positions 
from disk suffix array SAu 
Algorithm initializeMerge 
1. for each SA_bufu 
2. read first m start positions 
3. for each R_bufuv 
4. read first m/k LCP+partitionBit from Ruv 
from disk suffix array SAu 
3. for each R_bufuv 
4. read first m/k LCP+partitionBit from Ruv 
5. for each SA_bufu 
6. insert SA_bufu[0] into heap 
5. for each SA_bufu 
6. insert SA_bufu[0] into heap 
Fig. 5. The pseudocode for buffer allocation as the initial step for merge. 
Fig. 5. The pseudocode for buffer allocation as the initial step for merge.
CompareSuffix algorithm 
15 
Algorithm compareSuffix (Si from partition u, 
Sj from partition v) 
1. if (u = = v) 
2. return -1 //Si lex Sj, since they are sorted 
//in increasing order inside each partition 
Algorithm compareSuffix (Si from partition u, 
3. if (u  v) 
4. if (partitionBit in R_bufuv[Sj from partition v) 
1. if (u = = v) 
current pointer] = = 0) 
5. return 2. return -1 -1 //Si lex Sj, //Si since they are sorted 
lex Sj 
6. else 
//in increasing order inside each partition 
7. return1 //Si lex Sj 
8. if (u  v) 
9. if (partitionBit in R_bufvu[current pointer] = = 0) 
10. return 1 //Sj lex Si 
11. else 
12. return -1 //Sj lex Si 
3. if (u  v) 
4. if (partitionBit in R_bufuv[current pointer] = = 0) 
5. return -1 //Si lex Sj 
6. else 
7. return1 //Si lex Sj 
8. if (u  v) 
9. if (partitionBit in R_bufvu[current pointer] = = 0) 
10. return 1 //Sj lex Si 
11. else 
12. return -1 //Sj lex Si 
Fig. 6. Algorithm for suffix comparison which uses the pairwise suffix information from the order arrays 
created during pairwise suffix sorting. 
Algorithm for suffix comparison which uses the pairwise suffix information from the order during pairwise suffix sorting.
in all relevant order buffers. We insert into the heap the we continue in a similar way until all suffixes are merged. 
Fig. 6. Algorithm for suffix comparison which uses the pairwise suffix information from the order arrays 
AdvancePointers algorithm 
Pseudocode of current pointer advancing in suffix array buffer and the order 16 
created during pairwise suffix sorting. 
Algorithm advancePointers (partition ID u) 
1. SA_bufu.current_pointer++ 
2. if reached the end of SA_bufu 
3. refill SA_bufu from disk-based SAu 
4. for (i=0; iu; i++) 
5. R_bufiu.current_pointer++ 
6. if reached the end of R_bufiu 
7. refill R_bufiu from disk-based Riu 
8. for (i=u; ik; i++) 
9. R_bufui.current_pointer++ 
10. if reached the end of R_bufui 
11. refill R_bufui from disk-based Rui 
pointers for all the order array buffers containing information about partition u are also 
advanced by 1, as shown in pseudocode of Figure 7. This means we have determined the 
order of the current suffix of partition u, and we need to consider the next element both 
in SA BUFu and in all relevant order buffers. We insert into the heap the next suffix of 
partition u, and we continue in a similar way until all suffixes are merged. 
Algorithm advancePointers (partition ID u) 
1. SA_bufu.current_pointer++ 
2. if reached the end of SA_bufu 
3. refill SA_bufu from disk-based SAu 
4. for (i=0; iu; i++) 
5. R_bufiu.current_pointer++ 
6. if reached the end of R_bufiu 
7. refill R_bufiu from disk-based Riu 
8. for (i=u; ik; i++) 
9. R_bufui.current_pointer++ 
10. if reached the end of R_bufui 
11. refill R_bufui from disk-based Rui 
Fig. 7. Pseudocode of current pointer advancing in suffix array buffer and the order buffers.
ST 作成例例 
• X: ababc 
• SA: 02134 
• binary alphabet: a: 00, b: 01, c: 10 
• 本⼿手法では ST は binary alphabet で表現 
– 必ず 2 分⽊木になるので扱いやすい 
17
ST with binary alphabet 
18 
ST with alphabet 
ST with binary alphabet 
S4 
S4 
S0S2S1S3 
S0S2S1S3 
abbc 
abccabcc 
010 
0011 
0001101000011010 
index や LCP が bit 単位か 
byte (⽂文字) 単位かの違い
⽊木のどこに挿⼊入するか 
• SA の順で挿⼊入するので,必ず右端のパス 
内のどこかに internal node を挿⼊入 
• LCP を⾒見見てどこに挿⼊入するか決定 
19 
L1 
L1 
L1 
inserted 
internal node 
L1 
L1inserted 
leaf node 
L1
Insert s0 then s2 
20 
X: ababc 
Xb: 0001000110 
SA: 02134 
Insert s0 (0001000110, I0, L0)Insert s2 (000110, I4, L4) 
S0 
S0 
I0 
L4 
S2 
I0+4 I4+4 
I0 
IX: index is X (in unit of bit) 
LX: LCP is X (in unit of bit)
Insert s1 
21 
X: ababc 
Xb: 0001000110 
SA: 02134 
Insert s2 (000110, I4, L4)Insert s1 (01000110, I2, L1) 
S0 
I0 
L4 
S2 
I4 I8 
Ix: index is x (in unit of bit) 
Lx: LCP is x (in unit of bit) 
S0 
I0+1 
L4-1 
S2 
I4 I8 
I0 
L1 
S1 
I2+1
Insert s3 
22 
X: ababc 
Xb: 0001000110 
SA: 02134 
Insert s1 (01000110, I2, L1)Insert s3 (0110, I6, L2) 
Ix: index is x (in unit of bit) 
Lx: LCP is x (in unit of bit) 
S0 
I1 
L3 
S2 
I4 I8 
I0 
L1 
S1 
I3+1 
S0 
I1 
L3 
S2 
I4 I8 
I0 
L1 
S1 
I3 
I3 
L2-1 
S3 
I6+2
Insert s4 
23 
X: ababc 
Xb: 0001000110 
SA: 02134 
Insert s3 (0110, I6, L2)Insert s4 (10, I8, L0) 
Ix: index is x (in unit of bit) 
Lx: LCP is x (in unit of bit) 
S0 
I1 
L3 
S2 
I4 I8 
I0 
L1 
S1 
I4 
I3 
L1 
S3 
I8 
S0 
I1 
L3 
S2 
I4 I8 
I0+0 
L1-0 
S1 
I4 
I3 
L1 
S3 
I8 
I0 
L0 
S4 
I8+0
Index と LCP の意味 
24 
S0 
I1 
L3 
S2 
0 10 
I0 
L1 
I0+1+3 I4+1+3 
I3 
L1 
S1 
I2+1+1 
S3 
I6+1+1 
I0 
L0 
S4 
I8 
001 1 
000110 10 000110 
10 
Xb: 0001000110 
idx: 0123456789
ST 出⼒力力フォーマット 
• 3 つのデータ 
– Internal node ⽤用の array 
– Leaf node ⽤用の array 
– 最⼤大 suffix の prefix (実体): divider 
• divider を使って検索索すべきファイルを特定 
25 
Internal node structure (順番は不不明)Leaf node structure 
Left childRight childIndex 
4bytes4bytes4bytes 
partition id (1byte) 
Index 
4bytes 
partition id (1byte)
他⼿手法との⽐比較 
• TDD and ST-merge (2005年年) 
– 6MB メモリ 20MB データで 8 時間 
• Trellis+SB (2008年年) 
– 3GB データ 512MB メモリ で 11 時間 
– ランダムアクセス 95% 減 
• DiGeST (2008年年) 
– ランダムアクセス 98% 減 
• どれもこれも分割して処理理後 merge する 
• ディスクのランダムアクセスをいかに減らすか 
がポイント 26
TDD which works with uncompressed inputs. The results 評価: 他⼿手法との⽐比較 
Program Time, hours 
TDD 125 
B2ST 3 
• ⼊入⼒力力 3GB の DNA (1⽂文字1byte) 
• 使⽤用メモリ 600MB 
27 
Trellis+SB 11 
of different suffix tree construction algorithms for approximately which is larger than the total allocated main memory. 
suffix tree for the above 3GB input in 125 hours. of Trellis+SB reported in [20]. The value in Figure similar settings on a comparable machine. 
divided the 3GB into partitions of 1GB each and built
produced an intermediate on-disk output of size 234GB. Despite this, the merge in only 59 minutes, scanning all this on-disk data in sequential manner 2514 suffix tree files of total 215GB. 
評価: ⼊入⼒力力サイズを変えて⽐比較 
Total 
time, 
min 
Merge, 
min 
Pairwise 
sorting, 
min 
Number of 
partition 
pairs 
Number 
of 
partitions 
Input 
size, 
GB 
6 3 3 441 27 468 
8 4 6 730 34 764 
10 5 10 1100 42 1142 
• ⼊入⼒力力データは DNA を複数結合したもの 
• メモリは 2GB 使⽤用 
• Pairwise sorting は k^2 で効いてくるが並列列化で 
きそう 
• Merge はほぼリニア 
28 
12 6 15 1462 59 1521 
Running time (min) of B2ST for different sets (approximately 6, 8, 10 and 12GB) of genomic 2GB of main memory. 
example shows that we need a large temporary disk space for scaling up the Specifically, we need D = k2p = kN disk space to store the order arrays partition pairs. Since the number of partitions is k = N/M, from D = N2/M the size of the largest input that we can process with M bytes of internal bytes of disk space. If we substitute the common values for modern computers,
所感 
• もしかしてメモリのランダムアクセスよりも 
ディスクのシーケンシャルアクセスの⽅方がス 
ループット⼤大きかったりする??? 
• k を定数と⾒見見做せるケースでしか使えない 
– 実質的に partition id に 1 byte, suffix index に 4 byte 
の計 5 bytes (N  2^40) 程度度のデータが対象 
– Induce sort でやっているように再帰的に⽐比較して 
O(kN) à O(N) にできないものか 
• tail の件といい dividers の件といい,LCP が巨⼤大 
な場合をあまり考慮していない 
– 遺伝⼦子が対象ならそれで良良いが... 
29

メモリより大きなデータの Sufix Array 構築方法の紹介

  • 1.
    メモリより⼤大きいデータの Suffix Tree構築⽅方法の紹介 2014-08-29 星野 喬 1
  • 2.
    概要 • 主に以下の論論⽂文の紹介 • Suffix trees for inputs larger than main memory – Marina Barsky, Ulrike Stege and Alex Thomo – Information Systems archive Volume 36 Issue 3, May, 2011 – CIKM2009 での short paper の発展版 2
  • 3.
    論論⽂文の主張 • メモリより⼤大きいデータのsuffix tree の 構築⽅方法の提案 • 既存の構築⽅方法と違い,ディスクアクセ スはほぼシーケンシャルアクセスのみ 3
  • 4.
    Suffix tree とは(復復習) • ⾼高速な⽂文字列列操 作のための suffix を⽤用いた⽊木構造 – 葉葉ノード: suffix を表す – 節ノード: suffixes 間の共通 prefix を表す 4 most N −1 internal nodes in the tree. Hence, the maximum number linear in N. The tree’s total space is linear in N in the case that each stored in a constant space. Fortunately, this is the case for an implicit substrings by their positions. R a b a b c 0 1 2 3 4 0 ab abc 1-1 b 4-4 abc 1 c 4 3 c c 2 0-1 2-4 4-4 2-4 4-4 = ababc. For clarity, the explicit edge labels are shown, which are represented as in the actual suffix tree. Each suffix Si can be found by concatenating substrings
  • 5.
    提案アルゴリズム: B2ST •Big tree, Big string Suffix Tree construction algorithm • 3 つのフェーズ – (1) Input partitioning – (2) Pair-wise sorting – (3) Merging 5
  • 6.
    変数の意味 • X:⼊入⼒力力⽂文字列列 (⻑⾧長さN) • Xij: X[i]からX[j-1]までの部分⽂文字列列 • Si: i 番⽬目の suffix (0 = i N) • LCPij: Si と Sj の longest common prefix • |LCPij|: LCPij の⻑⾧長さ • SA: suffix array • ST: suffix tree 6
  • 7.
    partition, and itspositions are not included into the suffix array of the partition, positions are not included into the suffix array of the partition. This ensures suffix Input partitioning Si of a partition Xu (0 ≤ i |Xu|, 0 ≤ u k) will be sorted as representative of a suffix Sup+i of X. In practice, for real-life DNA sequences, the tail is negligibly small compared to the size of the partition itself (it never exceeded • X を k 個の Xu に分割 (0 = u k) • k = 2N/M (M: メモリサイズ) • 各 Xu は tail を持つ – Xu 内での suffix sorting 時の区別のため – tail と同じ部分⽂文字列列が含まれないように – 遺伝⼦子だと⾼高々 1000 くらい characters in our experiments with DNA databases). Consider the example in Figure 2. It shows the partitioning of input string ababaaabbabbbabaabab into four partitions. The main memory can hold up to 16 the input at a time. To facilitate the example, we represent our input as stands for 0-bit, b stands for 1-bit). In this illustration we also refer to the as A, B, C, and D in order to distinguish them from the numbers representing positions. Note that the tail of partition XB is substring bbb, which never = aabba. Partition A Partition B Partition C Partition D a b a b a a a b b a b b b a b a a b a b 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 Input string X = ababaaabbabbbabaabab and its four partitions. The tail of partition B 7is
  • 8.
    Pair-wise sorting •全ての XuXv について SA を LCP 付きで構 築 (SAuv) – XuXv: Xu と Xv を結合したもの (u v) – オンメモリで処理理できるように k を調整済 – SA 内で隣隣り合う suffix 同⼠士の LCP も保存 • 出⼒力力 – SAu: Xu の SA – Ruv: SAuv 内の LCP と uv を表す bit の列列 8
  • 9.
    In this stepwe generate suffix arrays for each pair of partitions. The pseudocode for this step is shown in Figure 3. We concatenate every possible pair u, v of partitions with their tails (0 ≤ u k − 1, u + SAu 1 ≤ v k, main memory and build the suffix array と u v) into Ruv string XuXv. のWe 例例load this input into the SAuv with attached LCP length information for each suffix. The suffixes which start in tail positions are excluded from the output suffix array, they serve only for determining the relative order of suffixes starting at the end of each partition. An LCP entry of SAuv is the length of the longest common prefix of each suffix in SAuv with its immediate predecessor. Figure • SAu 4 shows size: such an 4N/array k for + the α pair A, B of partitions for the same input string as in Figure 2. From each SAuv, we extract – N/k 2^32 が前提 • Ruv size: 8N/k + α • SAu は k 個 • Ruv は k(k-1)/2 個 • 計 4kN + α: O(kN) 9 for determining the relative order of suffixes starting at the end of LCP entry of SAuv is the length of the longest common prefix of each immediate predecessor. Figure 4 shows such an array for the pair the same input string as in Figure 2. From each SAuv, we extract SAAB (in memory) suffix start LCP partition bit 5 1 3 1 2 5 4 2 4 3 0 2 1 3 2 3 0 2 3 1 A B A A B B A A B B RAB SAAB (in memory) suffix start LCP partition bit 5 1 3 1 2 5 4 2 4 3 0 2 1 3 2 3 0 2 3 1 A B A A B B A A B B LCP 0 2 1 3 2 3 0 2 3 1 partition bit A B A A B B A A B B SAA 5 3 1 4 2 RAB LCP 0 2 1 3 2 3 0 2 3 1 partition bit A B A A B B A A B B SAA written 5 3 to 1 disk 4 2 written to disk with LCP information for a pair of partitions A and B. Two structures are the suffix array of partition A and (2) the order array RAB storing the relative B. These two structures are written to disk. Fig. 4. Suffix array SAAB with LCP information for a pair of partitions A and B. Two structures are extracted from SAAB: (1) the suffix array of partition A and (2) the order array RAB storing the relative order of suffixes in A and B. These two structures are written to disk.
  • 10.
    Pair-wise sorting pseudocode 10 Algorithm pairwiseSorting input: k partitions of string X 1. for Algorithm (u=0; upairwiseSorting k-1; i++) 2. for input: (v=1; k partitions vk; v++) of string X 3. 1. concatenate for (u=0; uk-1; 2. for (v=1; vk; XuXv i++) v++) and load into RAM 4. 3. build concatenate suffix array XuXv with and load LCP into SAuv RAM 5. 4. during build sequential suffix array scan with LCP of SAuv SAuv 6. 5. if during v==k-sequential 1 //last scan chunk of SAuv 6. output if v==k-1 to //last chunk 7. 7. output to disk disk SAv SAv 8. 8. output output to to disk disk SAu SAu 9. 9. output output to to disk disk Ruv Ruv //order //order array array Fig. 3. Algorithm for pairwise sorting of suffixes in all partition pairs. Fig. 3. Algorithm for pairwise sorting of suffixes in all partition pairs. Step 2: Suffix sorting in partition pairs
  • 11.
    Merging • 2つの部品で構成 – SAu と Ruv のストリームから SA のストリー ムを作る – SA のストリームから ST を作る • オンメモリデータ – SA_buf_u: SAu ⼊入⼒力力バッファ(size m, k 個) – R_buf_uv: Ruv ⼊入⼒力力バッファ(size 2m/k, k(k-1)/2 個) – ST_buf: ST 出⼒力力バッファ (多分 size M-km) 11
  • 12.
    SA ストリーム作成 12 SAu ... Ruv ... HeapSA with LCP Heap 構造を使って SAu (0 = u k) の中で⼀一番⼩小さい suffix を持つ u を 選んで出⼒力力
  • 13.
    5. Merge algorithm 13 What happens with each suffix in the output buffer is the subject pseudocode of merge is shown in Figure 8. Algorithm merge 1. lastTransferred = null 2. while heap is not empty 3. remove smallest suffix Si of partition u from the top of the heap 4. rebalance heap 5. lcp = 0 6. if lastTransferred is not null 7. v = lastTransferred.partitionID 8. lcp = LCP from R_bufuv [current_pointer] 9. create leaf for Si using lcp in ST_buf 10. advancePointers (u) 11. lastTransferred=Si 12. if ST_buf is full 13. store Si (max suffix) as a pointer to the current tree 14. write ST_buf to disk 4. rebalance heap 5. lcp = 0 6. if lastTransferred is not null 7. v = lastTransferred.partitionID 8. lcp = LCP from R_bufuv [current_pointer] 9. create leaf for Si using lcp in ST_buf 10. advancePointers (u) 11. lastTransferred=Si 12. if ST_buf is full 13. store Si (max suffix) as a pointer to the current tree 14. write ST_buf to disk 15. lastTransferred = null 16. Sj = get next suffix from SA_bufu 17. if Sj is not null 18. insert Sj into heap Fig. 8. The general pseudocode for merge. lcp = 0 6. if lastTransferred is not null 7. v = lastTransferred.partitionID 8. lcp = LCP from R_bufuv [current_pointer] 9. create leaf for Si using lcp in ST_buf 10. advancePointers (u) 11. lastTransferred=Si 12. if ST_buf is full 13. store Si (max suffix) as a pointer to the current tree 14. write ST_buf to disk 15. lastTransferred = null 16. Sj = get next suffix from SA_bufu 17. if Sj is not null 18. insert Sj into heap Fig. 8. The general pseudocode for merge.
  • 14.
    the information weneed to efficiently perform the merge. As produce information. Since each Ruv contains an information only about two partitions, we need to the use suffix one bit to tree represent for the the partition entire ID input in Ruv. string Specifically, X. we We use are 0 for doing u entire v (input u v). Figure string InitialMerge 4 shows into SAA main and memory. RAB extracted algorithmIn from fact, SAAB we for never At the end of this step we have on disk suffix arrays for partitions partition access pair (k k (of total size plus k(k − 1)/2 order arrays for each possible pair of partitions (of total size kN). Algorithm initializeMerge This is all the information we need to efficiently perform the merge. As a result of merge we produce the suffix tree for the entire input string X. We are doing this without loading the entire input string into main memory. In fact, we never access X anymore. 14 1. for each SA_bufu 2. read first m start positions from disk suffix array SAu Algorithm initializeMerge 1. for each SA_bufu 2. read first m start positions 3. for each R_bufuv 4. read first m/k LCP+partitionBit from Ruv from disk suffix array SAu 3. for each R_bufuv 4. read first m/k LCP+partitionBit from Ruv 5. for each SA_bufu 6. insert SA_bufu[0] into heap 5. for each SA_bufu 6. insert SA_bufu[0] into heap Fig. 5. The pseudocode for buffer allocation as the initial step for merge. Fig. 5. The pseudocode for buffer allocation as the initial step for merge.
  • 15.
    CompareSuffix algorithm 15 Algorithm compareSuffix (Si from partition u, Sj from partition v) 1. if (u = = v) 2. return -1 //Si lex Sj, since they are sorted //in increasing order inside each partition Algorithm compareSuffix (Si from partition u, 3. if (u v) 4. if (partitionBit in R_bufuv[Sj from partition v) 1. if (u = = v) current pointer] = = 0) 5. return 2. return -1 -1 //Si lex Sj, //Si since they are sorted lex Sj 6. else //in increasing order inside each partition 7. return1 //Si lex Sj 8. if (u v) 9. if (partitionBit in R_bufvu[current pointer] = = 0) 10. return 1 //Sj lex Si 11. else 12. return -1 //Sj lex Si 3. if (u v) 4. if (partitionBit in R_bufuv[current pointer] = = 0) 5. return -1 //Si lex Sj 6. else 7. return1 //Si lex Sj 8. if (u v) 9. if (partitionBit in R_bufvu[current pointer] = = 0) 10. return 1 //Sj lex Si 11. else 12. return -1 //Sj lex Si Fig. 6. Algorithm for suffix comparison which uses the pairwise suffix information from the order arrays created during pairwise suffix sorting. Algorithm for suffix comparison which uses the pairwise suffix information from the order during pairwise suffix sorting.
  • 16.
    in all relevantorder buffers. We insert into the heap the we continue in a similar way until all suffixes are merged. Fig. 6. Algorithm for suffix comparison which uses the pairwise suffix information from the order arrays AdvancePointers algorithm Pseudocode of current pointer advancing in suffix array buffer and the order 16 created during pairwise suffix sorting. Algorithm advancePointers (partition ID u) 1. SA_bufu.current_pointer++ 2. if reached the end of SA_bufu 3. refill SA_bufu from disk-based SAu 4. for (i=0; iu; i++) 5. R_bufiu.current_pointer++ 6. if reached the end of R_bufiu 7. refill R_bufiu from disk-based Riu 8. for (i=u; ik; i++) 9. R_bufui.current_pointer++ 10. if reached the end of R_bufui 11. refill R_bufui from disk-based Rui pointers for all the order array buffers containing information about partition u are also advanced by 1, as shown in pseudocode of Figure 7. This means we have determined the order of the current suffix of partition u, and we need to consider the next element both in SA BUFu and in all relevant order buffers. We insert into the heap the next suffix of partition u, and we continue in a similar way until all suffixes are merged. Algorithm advancePointers (partition ID u) 1. SA_bufu.current_pointer++ 2. if reached the end of SA_bufu 3. refill SA_bufu from disk-based SAu 4. for (i=0; iu; i++) 5. R_bufiu.current_pointer++ 6. if reached the end of R_bufiu 7. refill R_bufiu from disk-based Riu 8. for (i=u; ik; i++) 9. R_bufui.current_pointer++ 10. if reached the end of R_bufui 11. refill R_bufui from disk-based Rui Fig. 7. Pseudocode of current pointer advancing in suffix array buffer and the order buffers.
  • 17.
    ST 作成例例 •X: ababc • SA: 02134 • binary alphabet: a: 00, b: 01, c: 10 • 本⼿手法では ST は binary alphabet で表現 – 必ず 2 分⽊木になるので扱いやすい 17
  • 18.
    ST with binaryalphabet 18 ST with alphabet ST with binary alphabet S4 S4 S0S2S1S3 S0S2S1S3 abbc abccabcc 010 0011 0001101000011010 index や LCP が bit 単位か byte (⽂文字) 単位かの違い
  • 19.
    ⽊木のどこに挿⼊入するか • SAの順で挿⼊入するので,必ず右端のパス 内のどこかに internal node を挿⼊入 • LCP を⾒見見てどこに挿⼊入するか決定 19 L1 L1 L1 inserted internal node L1 L1inserted leaf node L1
  • 20.
    Insert s0 thens2 20 X: ababc Xb: 0001000110 SA: 02134 Insert s0 (0001000110, I0, L0)Insert s2 (000110, I4, L4) S0 S0 I0 L4 S2 I0+4 I4+4 I0 IX: index is X (in unit of bit) LX: LCP is X (in unit of bit)
  • 21.
    Insert s1 21 X: ababc Xb: 0001000110 SA: 02134 Insert s2 (000110, I4, L4)Insert s1 (01000110, I2, L1) S0 I0 L4 S2 I4 I8 Ix: index is x (in unit of bit) Lx: LCP is x (in unit of bit) S0 I0+1 L4-1 S2 I4 I8 I0 L1 S1 I2+1
  • 22.
    Insert s3 22 X: ababc Xb: 0001000110 SA: 02134 Insert s1 (01000110, I2, L1)Insert s3 (0110, I6, L2) Ix: index is x (in unit of bit) Lx: LCP is x (in unit of bit) S0 I1 L3 S2 I4 I8 I0 L1 S1 I3+1 S0 I1 L3 S2 I4 I8 I0 L1 S1 I3 I3 L2-1 S3 I6+2
  • 23.
    Insert s4 23 X: ababc Xb: 0001000110 SA: 02134 Insert s3 (0110, I6, L2)Insert s4 (10, I8, L0) Ix: index is x (in unit of bit) Lx: LCP is x (in unit of bit) S0 I1 L3 S2 I4 I8 I0 L1 S1 I4 I3 L1 S3 I8 S0 I1 L3 S2 I4 I8 I0+0 L1-0 S1 I4 I3 L1 S3 I8 I0 L0 S4 I8+0
  • 24.
    Index と LCPの意味 24 S0 I1 L3 S2 0 10 I0 L1 I0+1+3 I4+1+3 I3 L1 S1 I2+1+1 S3 I6+1+1 I0 L0 S4 I8 001 1 000110 10 000110 10 Xb: 0001000110 idx: 0123456789
  • 25.
    ST 出⼒力力フォーマット •3 つのデータ – Internal node ⽤用の array – Leaf node ⽤用の array – 最⼤大 suffix の prefix (実体): divider • divider を使って検索索すべきファイルを特定 25 Internal node structure (順番は不不明)Leaf node structure Left childRight childIndex 4bytes4bytes4bytes partition id (1byte) Index 4bytes partition id (1byte)
  • 26.
    他⼿手法との⽐比較 • TDDand ST-merge (2005年年) – 6MB メモリ 20MB データで 8 時間 • Trellis+SB (2008年年) – 3GB データ 512MB メモリ で 11 時間 – ランダムアクセス 95% 減 • DiGeST (2008年年) – ランダムアクセス 98% 減 • どれもこれも分割して処理理後 merge する • ディスクのランダムアクセスをいかに減らすか がポイント 26
  • 27.
    TDD which workswith uncompressed inputs. The results 評価: 他⼿手法との⽐比較 Program Time, hours TDD 125 B2ST 3 • ⼊入⼒力力 3GB の DNA (1⽂文字1byte) • 使⽤用メモリ 600MB 27 Trellis+SB 11 of different suffix tree construction algorithms for approximately which is larger than the total allocated main memory. suffix tree for the above 3GB input in 125 hours. of Trellis+SB reported in [20]. The value in Figure similar settings on a comparable machine. divided the 3GB into partitions of 1GB each and built
  • 28.
    produced an intermediateon-disk output of size 234GB. Despite this, the merge in only 59 minutes, scanning all this on-disk data in sequential manner 2514 suffix tree files of total 215GB. 評価: ⼊入⼒力力サイズを変えて⽐比較 Total time, min Merge, min Pairwise sorting, min Number of partition pairs Number of partitions Input size, GB 6 3 3 441 27 468 8 4 6 730 34 764 10 5 10 1100 42 1142 • ⼊入⼒力力データは DNA を複数結合したもの • メモリは 2GB 使⽤用 • Pairwise sorting は k^2 で効いてくるが並列列化で きそう • Merge はほぼリニア 28 12 6 15 1462 59 1521 Running time (min) of B2ST for different sets (approximately 6, 8, 10 and 12GB) of genomic 2GB of main memory. example shows that we need a large temporary disk space for scaling up the Specifically, we need D = k2p = kN disk space to store the order arrays partition pairs. Since the number of partitions is k = N/M, from D = N2/M the size of the largest input that we can process with M bytes of internal bytes of disk space. If we substitute the common values for modern computers,
  • 29.
    所感 • もしかしてメモリのランダムアクセスよりも ディスクのシーケンシャルアクセスの⽅方がス ループット⼤大きかったりする??? • k を定数と⾒見見做せるケースでしか使えない – 実質的に partition id に 1 byte, suffix index に 4 byte の計 5 bytes (N 2^40) 程度度のデータが対象 – Induce sort でやっているように再帰的に⽐比較して O(kN) à O(N) にできないものか • tail の件といい dividers の件といい,LCP が巨⼤大 な場合をあまり考慮していない – 遺伝⼦子が対象ならそれで良良いが... 29