Information Sciences 641 (2023) 119137

Contents lists available at ScienceDirect

Information Sciences
journal homepage: www.elsevier.com/locate/ins

LazySort: A customized sorting algorithm for non-volatile memory


Yang Liu a, Yang Ou b,∗, Wenhan Chen a, Zhiguang Chen c, Nong Xiao a

a Institute for Quantum Information & State Key Laboratory of High-Performance Computing, College of Computer, National University of Defense Technology, China
b College of Computer, National University of Defense Technology, China
c School of Computer Science and Engineering, Sun Yat-sen University, China

A R T I C L E  I N F O

Keywords:
External sorting
Non-volatile memory
Lazy processing
Data-intensive computing
Ordered data segment

A B S T R A C T

Due to the development of new non-volatile storage technologies, the access-delay gap between storage and memory has narrowed. Meanwhile, as the volume of data continues to increase, the demand for new types of non-volatile storage in high-performance computing and database management systems grows significantly. As a basic algorithm of database management systems, external sorting is widely used but suffers from numerous I/O operations caused by frequent data exchange between memory and external memory. Non-volatile memory (NVM) is a promising solution. However, the traditional external sorting algorithm is designed only for disk storage and cannot be applied directly to NVM. To solve this issue, we propose LazySort, a novel external sorting algorithm that exploits the NVM byte-addressing mechanism and the locally ordered distribution of data to improve sorting efficiency. Specifically, LazySort first detects ordered data segments in memory and then records the storage location of each segment as well as its maximum/minimum values in an index table named ITable. Next, to reduce memory usage and accelerate sorting, LazySort performs an optimization strategy, RunMerge, which merges non-intersecting data blocks according to the ranges recorded in ITable. To verify the performance of LazySort, we built an experimental platform with an NVM-DRAM storage architecture combining NVM and dynamic random access memory (DRAM) and conducted a series of experiments. The experimental results show that LazySort is more efficient than the traditional external sorting algorithm: its sorting time is shortened by up to 93.08%, and the number of NVM write operations is reduced by 49.50%. Moreover, the advantage of LazySort over traditional algorithms becomes more significant as the data volume increases.

1. Introduction

The rapid development of new non-volatile memory (NVM) technology has significantly boosted the storage market. Despite its high read and write speeds, dynamic random access memory (DRAM) is volatile and loses data on a sudden power failure. In addition, DRAM is reported to have reached the physical limit of its technological development, and its density cannot be improved further [16]. New non-volatile memory technology can bridge the significant access-delay gap between traditional persistent external storage and memory. Therefore, given the rich characteristics of recent non-volatile memory technology, optimizing large-scale data processing algorithms has become a focus of current research.

* Corresponding author.
E-mail address: michaelouyang@163.com (Y. Ou).

https://doi.org/10.1016/j.ins.2023.119137
Received 5 December 2022; Received in revised form 5 May 2023; Accepted 9 May 2023
Available online 11 May 2023
0020-0255/© 2023 Elsevier Inc. All rights reserved.

As a basic algorithm in data-intensive applications, external sorting manages datasets too large to be read into memory at once. External sorting algorithms are widely used in large data processing frameworks such as Spark [29], Hadoop [20], Storm [23], and chaotic systems [27,26]. In addition, the databases RocksDB and LevelDB also involve sorting algorithms [2]. The external sorting algorithm is I/O-intensive: the frequent data exchange between memory and external storage limits its performance [5,14].
Most hard disk drive (HDD)-based external sorting algorithms improve efficiency by reducing random accesses. With the development of flash memory technology, solid-state drives (SSDs) are gradually replacing HDDs in high-performance computing due to their advantages, including low access latency, high I/O bandwidth, random access, and internal concurrency. SSDs have neither a mechanical arm nor seek-time overhead; thus, they perform better on random reads and writes. However, algorithms designed for HDDs and SSDs rely on block addressing, which makes it difficult for existing solutions to match NVM features.
NVM can solve the traditional “storage wall” problem because of its byte-addressing ability, non-volatility, high density, and low energy consumption, showing potential to replace DRAM as the main memory technology [17,24,3]. However, unlike DRAM, NVM exhibits read-write asymmetry: its write latency is much higher than its read latency. In addition, both the durability and stability of NVM decrease after many writes (approximately 10^8) [8]. Therefore, we need a customized external sorting algorithm for NVM that considers these characteristics. In this paper, we focus on the Intel Optane DC Persistent Memory Module, a popular NVM device that has revolutionized storage technologies; the NVM mentioned later in this article refers specifically to this device. We consider three challenges in implementing external sorting with NVM. The first challenge is placing data in DRAM-NVM hybrid memory. Another challenge lies in taking full advantage of the NVM byte-addressing mechanism [4], since it avoids block-level transfers. Finally, the lifespan of NVM must be considered. Overall, an ideal external sorting algorithm for NVM improves efficiency with fewer write operations.

1.1. Study motivations

Due to the exponentially growing volume of generated data, sorting algorithms should be more scalable to manage large datasets with limited memory. Although various methods have been proposed to improve the performance of the external sorting algorithm, few take the I/O pattern and the byte-addressable feature of NVM into consideration, and no existing method designs an external merge sort for NVM. In short, the underlying inefficiency and write amplification of performing external sorting on NVM are ignored. This motivates us to design an external sorting algorithm suitable for the NVM-DRAM hybrid storage architecture.

1.2. Contributions

In this paper, we design and optimize the external sorting algorithm for NVM. The contributions of this paper are summarized as
follows:

• We propose an external sorting algorithm for NVM called LazySort, which exploits NVM byte addressing and locally ordered data to reduce write operations to NVM. LazySort contains two stages, i.e., the run generation stage and the merge stage, and shows evident superiority over the traditional external sorting algorithm on NVM.
• We propose an optimization strategy called RunMerge for the merge stage of LazySort. It merges non-intersecting data blocks according to the ranges of ITable records, thus reducing the total number of runs and the memory usage.
• We build a real NVM-DRAM experimental platform and conduct comprehensive experiments to verify the performance of LazySort. Experimental results show that LazySort achieves better time performance and fewer NVM writes: compared with the external sorting algorithm, the sorting time is shortened by 93.08%, and the number of NVM write operations is reduced by 49.50%.

The structure of this article is as follows. In the first part, the introduction, study motivations, and contributions are presented. In the second part, the background and related works are elaborated. In the third part, the traditional external sorting algorithm and new non-volatile memory technology are introduced. In the fourth part, the design of the LazySort algorithm is described, and its I/O overhead is analyzed. In the fifth part, we present the experimental results and analysis. Finally, in the sixth part, our research is summarized, and potential future research directions are discussed.

2. Background and related works

In April 2019, Intel unveiled NVM as a new technology with similarities to both SSD and DRAM. NVM is a disruptive technology that has brought revolutionary progress to the storage industry and has attracted wide attention from academia and industry. Persistent memory achieves DRAM-level low latency together with storage-level persistence and capacity by moving more data closer to the CPU. NVM is 1000 times faster than NAND flash [25], its cost is only half that of DRAM, its service life is 1000 times that of NAND, and its density is ten times that of traditional storage.
Optane NVM has received wide attention because of its byte addressability, random access, and good performance. Its read and write performance is better than that of SSD. In terms of random write performance, when the I/O granularity is less than 4 KB, the throughput of ext4 on NVM is 20-40 times higher than that of ext4 on SSD; when the I/O granularity is larger than 4 KB, the throughput of ext4 on NVM is 10 times higher than that on SSD. NVM has two main modes: memory mode and App Direct mode. In memory mode, NVM acts as volatile main memory, so data are lost on power failure. In App Direct mode, NVM is exposed as byte-addressable persistent memory.
NVM uses 3D XPoint technology [13], which is essentially different from NAND. 3D XPoint is based on resistance change: the resistance value of a cell is changed to distinguish between 0 and 1.
External sorting sorts data that are too large to be read into memory at once. It is I/O-intensive, since it produces many read and write operations to external storage. External sorting typically contains two stages: the run formation stage and the merge stage. The run formation stage divides the large file to be sorted into multiple data blocks, reads each data block into memory for sorting, and writes the sorted data back to external memory. After the first stage, multiple ordered data blocks are stored in the external memory. The merge stage merges the ordered data blocks into a final ordered file. To date, many external sorting algorithms have emerged with various motivations.
Merge on-the-run external sorting (MONTRES) exploits the random access and high I/O bandwidth of SSDs and reduces the total I/O overhead through its algorithm design [10]. MONTRES first detects the minimum value of each block in the input data, constructs a block-min index table, and reads the data into memory for sorting in ascending order of the block minima. In the merging stage, MONTRES mainly adopts a min-max heap. By constructing a run index table and reading data into memory in ascending order of the block minima recorded in the index table, any value smaller than the next block minimum can be written directly to the final output file. Therefore, the merging phase can be completed in one iteration, avoiding the I/O overhead of multiple merging iterations.
FMsort [11] uses the internal concurrency and high I/O bandwidth of SSDs to accelerate sorting. The algorithm constructs a block-min index table by calculating the minimum value of each block. The primary function of the index table is to determine the order in which data blocks are read from external storage into memory in the merge phase. At the same time, the algorithm adds input cache pages in memory to pre-read data. FMsort not only overlaps CPU processing with I/O operations but also uses the internal concurrency of SSDs to read multiple data pages simultaneously. FMsort is the first algorithm that uses the characteristics of SSDs to optimize the merging phase of external sorting.
Chen et al. propose the SMR-EMS (Shingled Magnetic Recording-based External Merge Sort) strategy for SMR-based large-scale storage systems, which alleviates the negative impact of the sequential-write constraint and enhances the performance of external sorting on SMR drives by utilizing the concept of storage-centric computing [7]. Gao takes advantage of the internal parallelism of SSDs to speed up the sorting process [6]. Lee et al. propose an external sorting algorithm based on active SSDs, called ActiveSort [12]. An active SSD is an “intelligent” SSD that can perform some computing functions internally. Active SSDs migrate part of the data processing from the host side to the SSD side, reducing the amount of data transferred. At the same time, the algorithm exploits the high internal I/O bandwidth of SSDs to speed up the execution of external sorting and improve performance. Compared with traditional external merge sorting, ActiveSort reduces write operations by 40%.
Based on the specific scenario in which the amount of data is slightly larger than the memory capacity and the test data are partially ordered, Kanza et al. [9] proposed an effective external sorting algorithm for flash-based portable mobile devices. Its core idea is to reduce the energy consumption of flash memory and prolong its service life by reducing the generation of intermediate data and write-backs to external memory. Andreou et al. [1] proposed FSort, an external sorting algorithm for flash-based sensor devices. FSort reduces energy consumption and improves performance by reducing write operations to external memory. Its core idea is to use the replacement selection algorithm to reduce the number of runs by generating longer runs. Therefore, the number of iterations in the merge phase is reduced, the I/O load on external memory is reduced, and the service life of flash memory is prolonged.
Park proposes FAST (Flash-Aware Sorting), an external sorting algorithm for mobile flash devices [19]. Flash-based mobile devices include mobile phones, digital cameras, and portable music players. With the development of flash memory technology and falling prices, flash memory has been widely used in various mobile devices. Mobile devices are characterized by very limited memory capacity. Therefore, for large amounts of data, the external sorting algorithm must be optimized to reduce the write overhead on flash memory.

3. Research method

With the widespread use of data-intensive applications and personal mobile devices, data volumes have skyrocketed. Quickly processing terabytes (TB) and petabytes (PB) of data has become a key problem in storage systems [22]. By 2025, the total amount of data is estimated to reach 185 ZB [18]. The key technical problem is storing data efficiently and safely while processing and responding to it quickly.
The external sorting algorithm is a basic algorithm in data-intensive applications. When the data are too large, memory cannot hold all of them simultaneously; in this case, an external sorting algorithm is needed. External merge sort is the most commonly used external sorting algorithm. It is divided into a memory sort phase (run generation phase) and a merge phase (run merge phase). The main idea of the run generation phase is to split a large input file into smaller files that fit in memory. Each small file read into memory is sorted to generate an ordered data block called a run, and each run is written back to the external storage device. The run merge phase takes multiple ordered runs and merges them over multiple iterations until all data are completely ordered.


Fig. 1. External sorting phases.

Table 1
Notations.

Notation  Definition
𝑁         The size of the initial input file
𝐸         The size of the memory input buffer
𝑅         The total number of generated Runs
𝐶𝑅        Unitary read operation cost
𝐶𝑊        Unitary write operation cost

Therefore, external sorting is an I/O-intensive algorithm. External merge sort generates many read and write operations on external memory during its iterations, affecting both the algorithm's performance and the service life of external memory devices. The ability to process data quickly and respond with results is an important evaluation factor for data-intensive applications. It is therefore of practical significance to optimize the external sorting algorithm: by utilizing the rich characteristics of new non-volatile storage media, we can accelerate its execution, reduce the number of read and write operations to the storage media, and thus extend the service life of external storage devices.

3.1. External sorting algorithm

External sorting is generally divided into two phases, as illustrated in Fig. 1. To facilitate describing and evaluating the algorithm, we use the notation summarized in Table 1. Unsorted data are initially stored on an external storage device (an SSD in the figure). We consider an input file containing 𝑁 values. The first phase is called the run generation phase. The data are divided into 𝑁∕𝐸 parts according to the size of the input buffer and sorted separately: this stage reads 𝐸 values into memory each time and sorts them with an in-memory sorting algorithm. The resulting ordered block of data is called a run. We define 𝑅 = 𝑁∕𝐸 as the number of runs. The run generation phase results in 𝑅 read operations on the input file to load data and 𝑅 write operations to write the generated runs (cache effects are ignored). The second phase is called the run merge phase. The smallest part of the data in each run is read into the input buffer for merging. When the number of generated runs 𝑅 exceeds 𝐸, the merge process needs ⌈log𝐸 𝑅⌉ passes to generate the final sorted file from all the generated runs. Thus, the run merge phase results in 𝑁⌈log𝐸 𝑅⌉ read and write operations, and the total I/O cost of the traditional merge sort algorithm is given by

𝐼∕𝑂_𝑐𝑜𝑠𝑡 = 𝑁 ⋅ (𝐶𝑅 + 𝐶𝑊 ) + 𝑁 ⋅ ⌈log𝐸 𝑅⌉ ⋅ (𝐶𝑅 + 𝐶𝑊 ) (1)


The two terms represent the total cost of the run generation and run merge phases, respectively.
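As a worked illustration (values assumed, not from the paper): with 𝑁 = 10^8 records and an input buffer holding 𝐸 = 10^6 records, the run generation phase produces 𝑅 = 𝑁∕𝐸 = 100 runs at a cost of 𝑁 ⋅ (𝐶𝑅 + 𝐶𝑊). Since 𝑅 < 𝐸, merging finishes in ⌈log𝐸 𝑅⌉ = 1 pass, so the total cost is 2𝑁 ⋅ (𝐶𝑅 + 𝐶𝑊), i.e., every record is read and written exactly twice.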

3.2. Non-volatile memory (NVM)

NVM has attracted wide attention because of its large capacity and non-volatility; it has the greatest potential to replace DRAM. Table 2 compares the performance parameters of several storage media. Unlike block storage devices, NVM supports byte addressing and can perform direct memory access operations. Byte addressing means that the minimum addressing unit of the storage space is a byte. NVM has low read and write access latency and retains data when the system is powered down. NVM is currently regarded as the storage device with the most development potential. Existing NVM research mainly includes the following:

• Data consistency.
• NVM-based file system.
• NVM-based data structure optimization.
• Other popular issues.


Table 2
Storage media performance comparison [21].

                  HDD       SATA SSD    NVMe SSD   NVM
Sequential write  260 MB/s  336 MB/s    1387 MB/s  1368 MB/s
Random read       1.2 MB/s  327 MB/s    1387 MB/s  1252 MB/s
Sequential read   262 MB/s  433 MB/s    1672 MB/s  1252 MB/s
Read latency      3-5 ms    15-35 μs    30-85 μs   20-60 μs
Write latency     3-5 ms    200-500 μs  20-30 μs   20-150 μs
Non-volatile      Yes       Yes         Yes        Yes
Byte-addressable  No        No          No         Yes

Fig. 2. Lazysort storage system architecture.

Regarding read and write speed, NVM is 1000 times faster than SSD. The lifespan (number of writes) of NVM is 1000 times that of NAND flash, and its density is nearly ten times higher than that of traditional storage. The density of NVM is also higher than that of DRAM, providing a larger data storage capacity, and its cost is only half that of DRAM. However, its read and write latency is slightly longer than that of DRAM, so 3D XPoint technology cannot wholly replace DRAM as the main memory [28].

4. LazySort design

In this section, the design of LazySort is detailed. First, in a database system, it is relatively common for the data distribution to be locally ordered: originally ordered data become partially out of order due to partial updates. For example, a database may hold 100 GB of fully ordered data of which the user randomly updates 1 GB. Second, the traditional external merge sort is unaware of the values and order of the data and rereads and sorts all of them, which increases the I/O and time overhead. Moreover, the traditional external sorting algorithm reads and writes data in units of pages and cannot exploit byte addressing when applied directly to NVM. In addition, the traditional external sorting algorithm performs many I/O operations, and too many writes damage the NVM. Therefore, we propose an external sorting algorithm for NVM that takes full advantage of its rich features. The LazySort storage system architecture is shown in Fig. 2.
The client writes and updates data in the system; the detailed steps are shown in Fig. 3. The dataset written by the client is first sorted, as shown in steps 1-5 in Fig. 3. LazySort then consists of two distinct phases. (1) The run generation phase, which has two parts, as shown in steps 6-10 in Fig. 3. The first part looks for unordered data segments, a procedure called FUDS (for unordered data segments), and sorts the unordered data to generate runs. The second part constructs an ordered-interval index table by detecting the partially ordered intervals of the input data. The partially ordered data are only recorded, not written back, which reduces the I/O cost and hence the processing time of the algorithm, while also effectively prolonging the service life of the NVM. The address of each ordered data segment and the maximum and minimum values and ID of each run are recorded. (2) The run merge phase, as shown in steps 11-12 in Fig. 3. To improve memory utilization, we merge the runs whose data ranges do not intersect. The result is a fully ordered dataset.


Fig. 3. Overview of the framework.

Table 3
Algorithm parameters.

Notation  Definition
𝑄         Size of raw data
𝑎𝑖        𝑖-th data block read into memory
𝐵         Ordered data size
𝐷         Ordered data size threshold
𝑂𝑅        Number of original ordered data segments
𝑀𝑅        Number of merged runs based on ITable
𝐾         Number of ORuns and Runs
𝑟𝑖        𝑖-th ORun or Run
𝐸         Input buffer size
𝑅         Number of Runs

To describe the two phases, we use the notation defined in Table 3. We define the raw data size as 𝑄, the 𝑖-th data block read into memory as 𝑎𝑖, the ordered data segment size as 𝐵, the threshold for ordered data segments as 𝐷, and the size of the input buffer as 𝐸. There are 𝐾 ordered runs (ORuns) and Runs in total. 𝑂𝑅 represents the number of original ordered data segments, and 𝑀𝑅 represents the number of merged runs based on ITable.
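To make an ITable entry concrete, the following C struct is one plausible layout (the field names are ours, not from the paper); its three 8-byte long long fields match the 24-byte entry size assumed in the I/O analysis of Section 4.3, and the starting NVM offset of an ORun, which Section 4.1 says is also recorded, could be kept in a parallel structure.

struct itable_entry {
    long long run_id;   /* sequence number of the Run/ORun (hypothetical name) */
    long long min_key;  /* smallest key in the data segment */
    long long max_key;  /* largest key in the data segment  */
};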

4.1. Run generation phase

Algorithm 1 aims to generate intermediate sorted files called runs. We read 𝑎𝑖 into DRAM. First, we look for an ordered data segment. When the ordered data segment is greater than or equal to the threshold 𝐷, we mark it as an ordered run (ORun) and discard it directly from memory instead of writing it back to NVM. Second, we place the remaining unsorted data into the input buffer and perform internal sorting to generate an ordered data block called a run. The run is written back to NVM. Finally, we record the maximum and minimum values of each Run and ORun, together with the starting position of the data segment, and write them into ITable.

Fig. 4. The process of finding an ordered data segment.

Fig. 5. The process of sorting unsorted data to generate runs.
Fig. 4 describes the process of finding an ordered data segment. The yellow part represents NVM, where the original data are stored. After reading part of the data into DRAM, we scan it to find continuous ordered data segments. Fig. 4 assumes that the threshold for a continuous data segment is three: when three or more consecutive ordered values are found, we mark the segment as ordered and color it green. After scanning, we find a continuous ordered data segment {1, 7, 8, 10, 12, 15}, which we define as an ORun. The remaining data are placed in the input buffer to await sorting.
The smaller the ordered-interval threshold 𝐷 is, the lower the requirement on the length of an ordered interval. This is conducive to retrieving more ordered intervals and better reduces the amount of data written back to external storage. However, because the location information of more ordered intervals must be recorded, the recording overhead increases. If the threshold 𝐷 is set larger, the criterion for an ordered data interval is stricter: for the same ordered data distribution, only longer ordered intervals are retrieved, and fewer of them, which effectively reduces the memory overhead of creating the index table. Therefore, choosing an appropriate value of 𝐷 is crucial.
Fig. 5 describes the stage of generating a run by sorting the unordered data. When the input buffer is filled with unsorted data, sorting begins to generate a run, which is written back to NVM. The maximum and minimum values and the ID of the Run are recorded in ITable.

4.2. Run merge phase

After multiple iterations, the run merge stage merges the multiple ordered intermediate results generated in the first stage into a final ordered data file. The merge phase is mainly implemented with a loser tree, a complete binary tree with high search efficiency that can retrieve the minimum value in log(𝑁) time [15].


Algorithm 1 FUDS.
Input: raw data 𝑎1 , ⋯ , 𝑎𝑛 and threshold for ordered data segments 𝐷
Output: sorted runs 𝑟1 , ⋯ , 𝑟𝑅 and ITable

1: for 𝑖 from 1 to 𝑛 do
2: read 𝑎𝑖
3: search sorted data segment 𝐵
4: ORun = 𝐵
5: 𝑂𝑅𝑢𝑛_𝑚𝑎𝑥 = max(ORun)
6: 𝑂𝑅𝑢𝑛_𝑚𝑖𝑛 = min(ORun)
7: 𝑂𝑅𝑢𝑛_𝐼𝐷 = ID(ORun)
8: if 𝐵 ≥ 𝐷 then
9: write 𝑂𝑅𝑢𝑛_𝑚𝑎𝑥 into ITable
10: write 𝑂𝑅𝑢𝑛_𝑚𝑖𝑛 into ITable
11: write 𝑂𝑅𝑢𝑛_𝐼𝐷 into ITable
12: else
13: write unsorted segments into the input buffer
14: if 𝑤𝑟𝑖𝑡𝑒𝐷𝑎𝑡𝑎𝑆𝑖𝑧𝑒 = 𝐸 then
15: 𝑅𝑢𝑛 ← sort input buffer data
16: write 𝑅𝑢𝑛 back to NVM
17: write 𝑅𝑢𝑛_𝑚𝑎𝑥 into ITable
18: write 𝑅𝑢𝑛_𝑚𝑖𝑛 into ITable
19: write 𝑅𝑢𝑛_𝐼𝐷 into ITable
20: end if
21: end if
22: end for
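To make Algorithm 1 concrete, the following self-contained C sketch implements the FUDS loop under simplifying assumptions: the input lives in an ordinary array standing in for NVM, the “write back” is only simulated by a print, and the thresholds are element counts rather than byte sizes. It is an illustration of the technique, not the authors' implementation.

#include <stdio.h>
#include <stdlib.h>

#define D 3   /* ordered-segment threshold (in elements, for the demo) */
#define E 4   /* input buffer capacity (in elements, for the demo)     */

static int cmp_ll(const void *a, const void *b) {
    long long x = *(const long long *)a, y = *(const long long *)b;
    return (x > y) - (x < y);
}

static void record_itable(const char *kind, long long mn, long long mx) {
    printf("%s: min=%lld max=%lld recorded in ITable\n", kind, mn, mx);
}

void fuds(const long long *data, size_t n) {
    long long buf[E];           /* DRAM input buffer for unordered data */
    size_t used = 0;
    size_t i = 0;
    while (i < n) {
        /* Find the maximal non-decreasing segment starting at i. */
        size_t j = i + 1;
        while (j < n && data[j] >= data[j - 1]) j++;
        if (j - i >= D) {
            /* ORun: record its bounds in ITable, do not write it back. */
            record_itable("ORun", data[i], data[j - 1]);
        } else {
            /* Segment too short: stage its values in the input buffer. */
            for (size_t k = i; k < j; k++) {
                buf[used++] = data[k];
                if (used == E) {
                    qsort(buf, used, sizeof buf[0], cmp_ll);
                    record_itable("Run", buf[0], buf[used - 1]);
                    /* A real implementation would write buf back to NVM here. */
                    used = 0;
                }
            }
        }
        i = j;
    }
    if (used > 0) {             /* flush the final partial run */
        qsort(buf, used, sizeof buf[0], cmp_ll);
        record_itable("Run", buf[0], buf[used - 1]);
    }
}

int main(void) {
    long long data[] = {1, 7, 8, 10, 12, 15, 9, 3, 11, 2, 6, 9, 10, 4};
    fuds(data, sizeof data / sizeof data[0]);
    return 0;
}

On this Fig. 4-style input, the sketch records the ORuns {1, ..., 15} and {2, ..., 10} in ITable without writing them back, and flushes one sorted Run {3, 4, 9, 11} from the input buffer.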

To fully use the byte-addressing feature of NVM, we propose Algorithm 2, called RunMerge. First, Algorithm 2 sorts ITable using the minimum value as the key. At this stage, ORuns and Runs are both called Runs. Then, the minimum and maximum values of different runs are compared: when one run's maximum value is no greater than the next run's minimum value, we merge the two runs into a new Run, and the maximum and minimum values and ID are updated in ITable. Finally, 𝑅 runs are produced for merge sorting.

Algorithm 2 Run merge phase.
Input: sorted runs 𝑟1 , ⋯ , 𝑟𝐾 and ITable
Output: sorted data 𝑄

1: for 𝑖 from 1 to 𝑛 do
2: if ITable.run_max(𝑖) ≤ ITable.run_min(𝑖 + 1) then
3: merge 𝑟𝑖 and 𝑟𝑖+1 into a new run 𝑟
4: write 𝑟.𝑚𝑎𝑥 into ITable
5: write 𝑟.𝑚𝑖𝑛 into ITable
6: write 𝑟.𝑖𝑑 into ITable
7: end if
8: end for
9: while not all pages have been processed do
10: if there are 𝐵 Runs in memory then
11: Read 𝑟1 , ⋯ , 𝑟𝑅 into DRAM
12: Sort(𝑟1 , ⋯ , 𝑟𝑅 )
13: Output minimum key into buffer
14: if The output buffer is full then
15: Flush the output buffer to the flash chip
16: end if
17: end if
18: end while
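A minimal C sketch of the ITable-level coalescing in Algorithm 2 follows (field names as in the sketch of Section 4; coalescing here only adjusts the entries, while a real implementation would also chain the runs' NVM locations). Sorting by minimum key first guarantees that only adjacent entries need to be compared.

#include <stdlib.h>

struct itable_entry { long long run_id, min_key, max_key; };

static int cmp_min(const void *a, const void *b) {
    const struct itable_entry *x = a, *y = b;
    return (x->min_key > y->min_key) - (x->min_key < y->min_key);
}

/* Coalesces non-intersecting entries in place; returns the new entry count. */
size_t run_merge(struct itable_entry *t, size_t n) {
    if (n == 0) return 0;
    qsort(t, n, sizeof *t, cmp_min);          /* order runs by min key */
    size_t out = 0;
    for (size_t i = 1; i < n; i++) {
        if (t[out].max_key <= t[i].min_key) {
            /* Ranges do not intersect: logically concatenate the runs
             * by extending the current entry's max key. */
            t[out].max_key = t[i].max_key;
        } else {
            t[++out] = t[i];                  /* keep as a separate run */
        }
    }
    return out + 1;
}

This matches the Fig. 6 example discussed below: the entry with maximum 43 is coalesced with the entry with minimum 48, and the result (maximum 94) is coalesced again with the following run.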

Because the ordered data intervals are not stored contiguously on NVM and their lengths are inconsistent, the merge phase must sequentially read the minimum value of each intermediate result from the NVM, which generates random access operations. The random read and write performance of traditional disks is extremely poor, but NVM provides fast random access and thus speeds up the execution of the merge phase. The algorithm therefore exploits the random access feature of NVM.
An example of a run merge is depicted in Fig. 6. The maximum value of Run 2 is 43, and the minimum value of Run 3 is 48.
Thus, the two runs are merged. The maximum value after the merge is 94, which is still less than the minimum value of Run 6 and
continues to be merged into the final Run 2. Similarly, the maximum value of Run 4 is 19, which is less than the minimum value of
21 for Run 5. Thus, the two runs are merged into a new Run 4. ITable is updated to obtain the final result.
In the merge phase, the runs generated by Algorithm 2 are merged. The minimum value of each run is added to a leaf node of the loser tree, and a merge comparison is performed, as shown in Fig. 7. Each parent node records the loser, and the winner continues to the next round of comparison. After the match, the minimum value among all leaf nodes has been retrieved. Then, we add the next value to the loser tree, iterate, and proceed to the next round. The current minimum value is retrieved each time, and the merge phase ends once all data have been processed, producing a completely ordered result file.

Fig. 6. Detailed description of the run merger.

Fig. 7. Schematic diagram of the merge phase.
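As a sketch of this final merge, the code below performs the k-way merge with a binary min-heap in place of the loser tree; both select the current minimum in O(log 𝐾) per step, the loser tree merely saving some comparisons, so this is a functional stand-in rather than the paper's exact structure. Cursor and buffer handling are simplified to in-memory arrays.

#include <stdio.h>

struct cursor { const long long *data; size_t pos, len; }; /* one run */

/* Binary min-heap over run cursors, keyed by each cursor's current value. */
static void sift_down(struct cursor **h, size_t n, size_t i) {
    for (;;) {
        size_t l = 2 * i + 1, r = l + 1, m = i;
        if (l < n && h[l]->data[h[l]->pos] < h[m]->data[h[m]->pos]) m = l;
        if (r < n && h[r]->data[h[r]->pos] < h[m]->data[h[m]->pos]) m = r;
        if (m == i) return;
        struct cursor *tmp = h[i]; h[i] = h[m]; h[m] = tmp;
        i = m;
    }
}

/* Merge k sorted runs into out; returns the number of values written. */
size_t kway_merge(struct cursor **heap, size_t k, long long *out) {
    for (size_t i = k; i-- > 0;) sift_down(heap, k, i);  /* heapify */
    size_t written = 0;
    while (k > 0) {
        struct cursor *top = heap[0];
        out[written++] = top->data[top->pos++];  /* emit current minimum */
        if (top->pos == top->len)                /* run exhausted: shrink */
            heap[0] = heap[--k];
        sift_down(heap, k, 0);
    }
    return written;
}

int main(void) {
    long long r1[] = {1, 5, 9}, r2[] = {2, 6}, r3[] = {3, 4, 7, 8};
    struct cursor c1 = {r1, 0, 3}, c2 = {r2, 0, 2}, c3 = {r3, 0, 4};
    struct cursor *heap[] = {&c1, &c2, &c3};
    long long out[9];
    size_t n = kway_merge(heap, 3, out);
    for (size_t i = 0; i < n; i++) printf("%lld ", out[i]); /* 1 2 ... 9 */
    printf("\n");
    return 0;
}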
The merge phase can be illustrated with a concrete example. The first stage detects partially ordered intervals in the input data and records the offsets of these intervals on the NVM. As shown in Fig. 5, the green data segments are the originally ordered intervals (5, 7, 8, 10, 12, 15) and (6, 9, 10). At the same time, the sorted intermediate results are also stored on NVM: the yellow data segments are the originally unsorted data that have been sorted and written back to NVM.

4.3. I/O analysis of LazySort

The I/O cost of LazySort consists of two sub-costs: the run generation phase cost and the run merge phase cost, both in terms of
read and write operations.


Table 4
Experimental platform configuration.

Machine  Attribute           Specification
M1       CPU type            Intel Xeon Gold 6230N
M1       CPU base frequency  2.30 GHz
M1       NVM capacity        2 x 128 GB (256 GB)
M1       DRAM capacity       6 x 32 GB (192 GB)
M1       OS release          CentOS 7.6
M1       OS kernel           Linux 3.10
M2       CPU type            Intel Xeon Gold 5218
M2       CPU base frequency  2.30 GHz
M2       NVM capacity        6 x 256 GB (1536 GB)
M2       DRAM capacity       6 x 32 GB (192 GB)
M2       OS release          Ubuntu 18.04.3 LTS
M2       OS kernel           Linux 4.15.0

In the run generation phase, we first look for ordered data segments based on the defined threshold. This operation requires a scan of the entire input file, which results in a read cost of 𝑁 ⋅ 𝐶𝑅 and a write-back cost of (𝑁 − 𝑂𝑅) ⋅ 𝐶𝑊. The maximum and minimum values of the ordered data segments are recorded in ITable. This does not add read operations but adds writes: each entry records the sequence number and the maximum and minimum run values, as shown in Fig. 4 and Fig. 5. Each recorded field is a long long, so an entry requires 3 × 8 = 24 bytes, for a total ITable write cost of 𝑅 ⋅ 24 bytes.
In the run merge phase, we first merge the runs whose data ranges in ITable do not intersect, to reduce the merge overhead. This operation requires reading all of ITable, at a read cost of 𝑅 ⋅ 24 bytes. The remaining runs are then merged. When the number of remaining runs (𝑁∕𝐸 − 𝑂𝑅 − 𝑀𝑅) exceeds 𝐸, the merge process takes log𝐸 (𝑁∕𝐸 − 𝑂𝑅 − 𝑀𝑅) passes to generate the final sorted file. Therefore, the run merge phase results in 𝑁 ⋅ log𝐸 (𝑁∕𝐸 − 𝑂𝑅 − 𝑀𝑅) read and write operations, and the total I/O cost of the LazySort algorithm is:
total 𝐼∕𝑂 cost of the LazySort algorithm is:
𝑁 ⋅ 𝐶𝑅 + (𝑁 − 𝑂𝑅) ⋅ 𝐶𝑊 + 𝑅 ⋅ 24 + (𝐶𝑅 + 𝐶𝑊 ) ⋅ 𝑁 ⋅ log𝐸 (𝑁∕𝐸 − 𝑂𝑅 − 𝑀𝑅) (2)
By comparing Equations (1) and (2), we can see that the amount of data written back by LazySort in the run generation phase is significantly reduced, and the read and write operations in the run merge phase are also reduced.
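As a hedged aid for comparing the two cost models (not the authors' code), the C helpers below evaluate Eqs. (1) and (2) numerically; note that, as in Eq. (2) itself, the 𝑅 ⋅ 24 term is in bytes while the other terms count unitary operations.

#include <math.h>   /* link with -lm */

/* I/O cost of traditional external merge sort, Eq. (1). */
double cost_external(double N, double E, double CR, double CW) {
    double R = N / E;                          /* number of runs */
    return N * (CR + CW) + N * ceil(log(R) / log(E)) * (CR + CW);
}

/* I/O cost of LazySort, Eq. (2); OR, MR, R as defined in Table 3. */
double cost_lazysort(double N, double E, double OR, double MR,
                     double R, double CR, double CW) {
    return N * CR + (N - OR) * CW + R * 24.0   /* ITable entries, in bytes */
         + (CR + CW) * N * (log(N / E - OR - MR) / log(E));
}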

5. Experimental results

In this section, the methodology used to evaluate LazySort is described. We compare the LazySort algorithm proposed in this paper against the external merge sort algorithm as a baseline, measuring the total running time and the total reads and writes to NVM. The results show that LazySort performs better as the original ordering of the data increases.

5.1. Experimental platform and dataset

The experiments are conducted on two machines equipped with real persistent memory (Intel Optane DC Persistent Memory Modules) of different capacities and with different processors. The NVM is configured in 100% App Direct mode, so software can directly access it through byte addressing. We use the ndctl tool to configure the region into devdax mode, and applications access the NVM by mapping it into user space through the mmap() system call. The relevant parameters of the experimental platform are shown in Table 4.
The implementation uses the persistent memory development kit (PMDK), an evolving collection of libraries and tools. These libraries build on the DAX functionality of the operating system and allow applications to access persistent memory-mapped files. We use the libpmem library of PMDK, a low-level persistent memory library. The pmem_map_file() function is called to create a persistent memory file, map it, and obtain a pointer to it. Calling the pmem_persist() function persists a range of persistent memory, and the pmem_unmap() function removes the mapping. Because the DRAM memory is limited and its size varies across experiments, large data files must be divided into multiple small files and read from NVM into DRAM in batches. Notably, each algorithm is allocated the same memory space.
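As a concrete illustration of the libpmem calls named above, the following minimal C program creates and maps a persistent-memory file, stores into the mapping, and persists the store; the path and size are assumptions for an fsdax mount (the experiments in this paper map a devdax region via mmap() instead).

#include <libpmem.h>    /* link with -lpmem */
#include <stdio.h>
#include <string.h>

int main(void) {
    size_t mapped_len;
    int is_pmem;
    /* Create (or open) a 4 KiB file on an assumed fsdax mount and map it. */
    char *addr = pmem_map_file("/mnt/pmem/run0", 4096, PMEM_FILE_CREATE,
                               0666, &mapped_len, &is_pmem);
    if (addr == NULL) {
        perror("pmem_map_file");
        return 1;
    }
    strcpy(addr, "sorted run payload");   /* ordinary store into the mapping */
    if (is_pmem)
        pmem_persist(addr, mapped_len);   /* flush CPU caches to the NVM media */
    else
        pmem_msync(addr, mapped_len);     /* fall back to msync semantics */
    pmem_unmap(addr, mapped_len);
    return 0;
}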
We used 41 different datasets to test the sorting algorithms. Each dataset is a randomly generated collection of integers. The data sizes are 100 MB, 500 MB, 100 GB, 180 GB, and 200 GB. The original data of each dataset are completely ordered; because the data are then updated locally, they become partially out of order. The data disorder ratio refers to the proportion of the original data that is randomly updated, making it out of order. The lower the update rate, the better the local order of the data; the higher the update rate, the less ordered the data. For each data size, the disorder ratios of the different datasets are set to 1%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, and 100%.
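The dataset construction described above can be sketched in C as follows; this is our reading of the description (start from fully ordered keys, then randomly overwrite the given fraction of positions), not the authors' generator.

#include <stdlib.h>

/* Make a sorted array of n keys "ratio" disordered by random updates.
 * ratio = 0.01 updates 1% of positions, as in the paper's datasets. */
void make_dataset(long long *a, size_t n, double ratio, unsigned seed) {
    for (size_t i = 0; i < n; i++)
        a[i] = (long long)i;              /* fully ordered original data */
    srand(seed);
    size_t updates = (size_t)(ratio * (double)n);
    for (size_t u = 0; u < updates; u++) {
        size_t pos = (size_t)rand() % n;  /* randomly chosen position   */
        a[pos] = (long long)rand();       /* random update breaks order */
    }
}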


Fig. 8. Algorithm running time under different unsorted ratios (100 MB and 500 MB).

Table 5
Run time of the two algorithms.

Ratio                      0.01   0.1    0.2    0.3    0.4    0.5    0.6    0.7  0.8    0.9  1
LazySort (500 MB)          6.15   11.83  17.49  22.82  27.96  31.61  35.21  40   41.44  43   45
External Sorting (500 MB)  64.86  37.01  36.72  38.81  39.15  41.05  42.06  43   44.02  45   46.87
LazySort (100 MB)          3      6.56   6      6.1    6.3    6.45   6.56   6.9  7.1    7.2  7.3
External Sorting (100 MB)  13     8.5    7      7.15   7.1    7      7.2    7.3  7.5    7.7  7.85

5.2. Experimental results

In the experiments, we evaluate the effect of different variables on the algorithms' performance. In theory, the more ordered the data are, the better LazySort performs; therefore, we first compare LazySort and external sorting under different unsorted ratios. Since the performance of the external sorting algorithm varies significantly with the total data volume, we also test different total data volumes. Finally, we test the effect of different distributions of unsorted data, i.e., different lengths of continuous ordered data for a fixed proportion of unsorted data. Notably, data sizes of 100 MB and 500 MB are used on Machine 1, and data sizes of 100 GB and 180 GB are used on Machine 2.

∙ The impact of the dataset disorder proportion on the performance.


We first evaluate the algorithms' running time under different data sizes and different proportions of data disorder. In this experiment, the threshold D is set to 4 KB, and the memory is set to 100 MB; the variable is the data disorder ratio.
Fig. 8 shows how the running time changes with an increasing unsorted ratio when the total data volume is 100 MB and 500 MB. The abscissa is the proportion of data disorder, and the ordinate is the processing time of the algorithm. The disorder ratio refers to the percentage of the original data that is randomly updated; the higher the update ratio, the worse the data locality. The running time includes everything from reading the original test data file to writing the fully ordered data back to NVM. LazySort outperforms external sorting by factors of up to 4.3 and 10.5 when the data are 100 MB and 500 MB, respectively. As the unsorted ratio increases, the performance of LazySort gradually approaches that of external sorting, but LazySort still executes faster. To display the experimental results more clearly, we provide the running time data in Table 5.
To further demonstrate the algorithm's feasibility, we expand the dataset to 100 GB and 180 GB. Fig. 9 shows that the overall times of both algorithms increase as the data volume increases. As the disorder ratio increases, external sorting gradually closes the gap. However, even when the disorder ratio is 100%, LazySort retains a certain advantage, because a small number of coincidentally ordered data segments still occur in fully disordered data, which reduces the overhead to some extent. For example, at 100 GB, LazySort still performs 1.2 times better than the external sorting algorithm.
When the data distribution is highly ordered, large-valued data may remain in memory for a long time. This reduces the frequency of DRAM-NVM page swapping and prolongs the algorithm's execution time. Conversely, the more evenly the value ranges of the runs are distributed, the more evenly the data read from each run into memory are consumed in the merge phase. If the data ranges of the runs are similar, no large-valued data are retained in memory for long: the values of each run are consumed at a similar pace, the data are written back to NVM steadily after the merge sort, memory utilization improves, and memory page swapping is accelerated, thereby speeding up the execution of the external merge sort algorithm.

Fig. 9. Algorithm running time under different unsorted ratios (100 GB and 180 GB).

Fig. 10. Comparison of data written under different unsorted ratios.
To further examine read and write behavior, we measure the amounts of data read and written during the execution of the two algorithms under different disorder ratios at 100 GB. The two algorithms read the same total amount of data, but LazySort writes less, as shown in Fig. 10: LazySort reduces both the total amount and the percentage of data written back to NVM compared with external sorting. When the disorder ratio is 1% and 10%, the total amount of data written back by LazySort is reduced by 49.5% and 45.03%, respectively. As the disorder ratio increases, the percentage by which LazySort reduces the written data gradually decreases; in the worst case, LazySort still reduces the amount of data written back by 13.56%.
∙ The impact of different data sizes on the performance.
On Machine 1, we verified the effect of small amounts of data on the algorithm. We tested six datasets with different data sizes,
including 100 MB, 200 MB, 300 MB, 400 MB, 500 MB, and 600 MB. On Machine 2, the effects of different large datasets on the
algorithm were verified, and 10 GB, 50 GB, 100 GB, 150 GB, and 200 GB datasets were tested. Both sets of tests were performed
under the condition that the proportion of unordered data was 1% and Threshold D was 4 KB. As seen from Fig. 11 and Fig. 12,
the execution time of the LazySort algorithm is much lower than that of the baseline algorithm in the case of different amounts
of data.
When the ratio of unsorted data is fixed, the performance improvement of LazySort grows with the dataset size. As the amount of data increases, the ordered intervals in the test data also increase, and LazySort can retrieve more ordered intervals, reducing the total amount of data written back. Fig. 11 and Fig. 12 show that as the dataset size increases, the processing time of LazySort increases slowly, while the time spent by external sorting (the red line) increases rapidly.


Fig. 11. Comparison of the running times with different dataset sizes (MB).

Fig. 12. Comparison of the running time with different dataset sizes (GB).

LazySort performs better across all dataset sizes, and its advantage is more prominent for large data volumes. When the test data are 150 GB, external sorting takes 6.63 hours, but LazySort takes only 29 minutes. When the test data are 200 GB, external sorting takes 7.35 hours, but LazySort takes only 36 minutes. The tests show that LazySort's performance improves by more than 92% on large-scale test datasets of various total sizes.
LazySort performs especially well on locally ordered data with a large dataset size. The test data were originally fully ordered; partial updates caused some data to become out of order, but the data still exhibit an orderly local distribution. For example, when the disorder ratio is 1%, approximately 1 GB of the 100 GB of test data is randomly updated. External sorting reorders all data and writes them back to NVM, generating many intermediate results during execution, which increases the I/O and time overhead. LazySort instead reduces the generation of intermediate results by retrieving the partially ordered intervals, thereby improving execution efficiency.
∙ The impact of the continuous ordered threshold on the performance.
The above two sets of experiments show that the sorting efficiency is directly related to the disorder ratio. However, the ordered-interval threshold is also a vital parameter affecting the algorithm's efficiency, so we analyze the performance under different threshold values. The continuous ordered threshold is a concept defined in the LazySort algorithm: LazySort detects the ordering of continuous data intervals, and if the total data volume of a continuous ordered interval exceeds the threshold, this interval is regarded as an ordered intermediate result. We do not need to reorder such an interval or write it back to persistent external memory; we only record the position offset of the interval on the NVM and build the ordered-interval position index table. However, when a detected ordered interval is shorter than the threshold, its data are regarded as unordered and need to be sorted and then written back to external memory. The primary purpose is to control the number of retrieved ordered intervals and to record long continuous ordered data intervals as much as possible instead of a large number of short ones.

Fig. 13. Comparison of the running times with different threshold values.
Increasing the sequential order threshold has two main advantages. One advantage is to reduce the memory overhead of the
index table because the memory overhead of the index table is proportional to the number of ordered intervals. The second
advantage is that reducing the number of ordered intervals reduces the input cache overhead in the merge phase. Because
DRAM space is limited, the input cache needs to reserve data space for each in-order intermediate result to participate in the
merge phase. Therefore, reducing the number of ordered intervals can improve memory utilization. The input cache can reserve
more space for each ordered interval, improving the read and write granularity of the merge phase and improving the algorithm
execution efficiency.
However, reducing the number of ordered intervals increases the write-back of out-of-order data, which also affects execution performance. These two dimensions therefore need to be weighed, and we analyze the trade-off experimentally. The total test data volume is 100 MB, and the unsorted data ratio is 1%. The ordered thresholds are set to 2 KB, 4 KB, 6 KB, 8 KB, 12 KB, 14 KB, and 16 KB.
The effect of different ordered thresholds on performance is studied experimentally, and the results show that the lower the threshold, the better the performance. The threshold affects the algorithm in two ways: the granularity of the data written back and the number of intervals to merge. As seen in Fig. 13, the baseline does not change with the threshold, because external sorting does not detect ordered intervals in the input data, whereas the processing time of LazySort increases gradually as the threshold increases. When the threshold is small, the length requirement for ordered intervals is loose, so more continuous ordered intervals can be detected in the first stage, reducing the number of data write-back operations. This also increases the memory consumption of the index, but that cost is still relatively small and can be ignored. As the threshold increases, only long continuous ordered intervals are recorded; fewer ordered intervals are detected, increasing the amount of data that LazySort writes back to NVM and extending the processing time of the algorithm.
∙ The impact of the data distribution on the performance.
For the same disorder ratio, different distributions of the disordered data may affect the algorithm's performance. The distribution of unordered data refers to how the randomly updated positions are spread when a certain proportion of the originally ordered data is updated. If the updated positions are scattered, the update granularity is small: for a fixed total amount of updated data, scattered updates cut the originally ordered data into many short ordered blocks. The continuous ordered intervals are then generally short, few of them exceed the ordered threshold, fewer ordered intervals can be retrieved, and the amount of data written back to the NVM increases. If the updated positions are concentrated, the update granularity is large, generally meaning that data in a contiguous interval are updated together. For a fixed total amount of updated data, concentrated updates leave the original data less fragmented, so more long continuous ordered intervals can be detected, the advantages of LazySort are better exploited, and the amount of data written back to NVM is reduced.


Fig. 14. Comparison of the running times with different update granularities.

Therefore, we test the algorithms under different unordered data distributions. The total amount of data is 500 MB, the available memory is 125 MB, the ordered-data threshold is set to 8 KB, and the data disorder ratio is 10%. The update granularity is set to 512 B, 1 KB, 2 KB, 4 KB, 8 KB, 16 KB, and 32 KB.
Fig. 14 shows that the external sorting performance is unchanged, while the processing time of LazySort decreases as the update granularity increases. When the update granularity is small and the updated proportion is fixed, the randomly updated positions are scattered, the original data lose much of their local order, and the LazySort processing time is relatively long. As the update granularity increases, there are more continuous ordered intervals, and they are longer; in this case, LazySort performs better and reduces the amount of data written back to NVM.

6. Conclusion

The new type of non-volatile memory brings new characteristics to the storage system. It has the advantages of non-volatility, low power consumption, and large capacity, but it also suffers from higher write latency and a shorter lifespan. In the context of rapidly growing data and frequent database updates, optimizing the external sorting algorithm by reducing NVM write operations is crucial. To this end, we improve the external sorting algorithm on real Intel Optane DC PMM devices and propose the LazySort algorithm. LazySort fully exploits the performance characteristics of the new non-volatile memory together with the distribution characteristics of the data. It effectively reduces the amount of data written back to NVM, improves execution performance, and prolongs the service life of the NVM. We first search for ordered data segments and record the maximum and minimum values and the location of each segment in ITable. Second, the out-of-order data segments are placed into the input buffer to await sorting; when the input buffer is full, these data are sorted to generate runs. In the merge phase, we take full advantage of the NVM byte-addressing feature, and the run merge coalesces runs of different sizes, reducing the total number of runs. Experiments show that LazySort's performance gradually approaches that of external sorting as the disorder ratio increases. In the best case, the sorting time is reduced by 93.08% and the amount of data written back to NVM by 49.50%. We also compare the performance under different threshold values: the running time of LazySort gradually increases as the threshold increases (Section 5.2).

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to
influence the work reported in this paper.

Data availability

Data will be made available on request.

Acknowledgements

We thank the reviewers for their insightful feedback to improve this paper. This work is supported by the National Key R&D
Program of China under Grant No. 2021YFB0300103, the National Natural Science Foundation of China under Grants No. 62272499 and No. 61832020, the Major Program of Guangdong Basic and Applied Research No. 2019B030302002, and the Guangdong Province Special Support Program for Cultivating High-Level Talents No. 2021TQ06X160.

References

[1] P. Andreou, O. Spanos, D. Zeinalipour-Yazti, G. Samaras, P.K. Chrysanthis, Fsort: external sorting on flash-based sensor devices, in: M.A. Nascimento, N. Tatbul
(Eds.), Proceedings of the 6th Workshop on Data Management for Sensor Networks, in Conjunction with VLDB, DMSN 2009, Lyon, France, August 24, 2009,
ACM, 2009.
[2] F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach, M. Burrows, T. Chandra, A. Fikes, R.E. Gruber, Bigtable: a distributed storage system for structured
data, ACM Trans. Comput. Syst. (TOCS) 26 (2008) 1–26.
[3] J. Coburn, A.M. Caulfield, A. Akel, L.M. Grupp, R.K. Gupta, R. Jhala, S. Swanson, Nv-heaps: making persistent objects fast and safe with next-generation,
non-volatile memories, ACM SIGARCH Comput. Archit. News 39 (2011) 105–118.
[4] T.H. Cormen, C.E. Leiserson, R.L. Rivest, C. Stein, Introduction to Algorithms, MIT Press, 2022.
[5] G. Dlamini, F. Jolha, Z. Kholmatova, G. Succi, Meta-analytical comparison of energy consumed by two sorting algorithms, Inf. Sci. 582 (2022) 767–777, https://
doi.org/10.1016/j.ins.2021.09.061, https://www.sciencedirect.com/science/article/pii/S0020025521009981.
[6] Y.C.L. Gao, Fssort: external sort for solid state drives, in: 2021 IEEE/ACM 21st International Symposium on Cluster, Cloud and Internet Computing (CCGrid),
2021.
[7] C.-H. Chen, S.-H. Chen, Y.-P. Liang, T.-Y. Chen, T.-S. Hsu, H.-W. Wei, W.-K. Shih, Facilitating external sorting on SMR-based large-scale storage systems, Future Gener. Comput. Syst. (2021) 333–348.
[8] J. Izraelevitz, J. Yang, L. Zhang, J. Kim, X. Liu, A. Memaripour, Y.J. Soh, Z. Wang, Y. Xu, S.R. Dulloor, et al., Basic performance measurements of the intel optane
dc persistent memory module, arXiv preprint, arXiv:1903.05714, 2019.
[9] Y. Kanza, H. Yaari, External sorting on flash storage: reducing cell wearing and increasing efficiency by avoiding intermediate writes, VLDB J. 25 (2016)
495–518, https://doi.org/10.1007/s00778-016-0426-5.
[10] A. Laga, J. Boukhobza, F. Singhoff, M. Koskas, MONTRES: merge on-the-run external sorting algorithm for large data volumes on SSD based storage systems,
IEEE Trans. Comput. 66 (2017) 1689–1702, https://doi.org/10.1109/TC.2017.2706678.
[11] J. Lee, H. Roh, S. Park, External mergesort for flash-based solid state drives, IEEE Trans. Comput. 65 (2016) 1518–1527, https://doi.org/10.1109/TC.2015.
2451631.
[12] Y. Lee, L.C. Quero, S. Kim, J. Kim, S. Maeng, Activesort: efficient external sorting using active ssds in the mapreduce framework, Future Gener. Comput. Syst.
65 (2016) 76–89, https://doi.org/10.1016/j.future.2016.03.003.
[13] J. Liu, S. Chen, Initial experience with 3d xpoint main memory, Distrib. Parallel Databases 38 (2020) 865–880, https://doi.org/10.1007/s10619-019-07277-8.
[14] J. Lobo, S. Kuwelkar, Performance analysis of merge sort algorithms, in: 2020 International Conference on Electronics and Sustainable Communication Systems
(ICESC), IEEE, 2020, pp. 110–115.
[15] Y. Luo, Z. Chu, P. Jin, S. Wan, Efficient sorting and join on nvm-based hybrid memory, in: International Conference on Algorithms and Architectures for Parallel
Processing, Springer, 2020, pp. 15–30.
[16] O. Mutlu, Main Memory Scaling: Challenges and Solution Directions, Springer, New York, 2015, pp. 127–153.
[17] D. Narayanan, O. Hodson, Whole-system persistence, in: Proceedings of the Seventeenth International Conference on Architectural Support for Programming
Languages and Operating Systems, 2012, pp. 401–410.
[18] N. Narayanaswamy, Flash memory integration: performance and energy issues, Comput. Rev. 60 (2019) 52.
[19] H. Park, K. Shim, FAST: flash-aware external sorting for mobile database systems, J. Syst. Softw. 82 (2009) 1298–1312, https://doi.org/10.1016/j.jss.2009.02.
028.
[20] K. Shvachko, H. Kuang, S. Radia, R. Chansler, The hadoop distributed file system, in: 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies
(MSST), IEEE, 2010, pp. 1–10.
[21] T.H. Takano, A prompt report on the performance of intel optane dc persistent memory module, IEICE Trans. Inf. Syst. (2020) 1168–1172.
[22] A. Thusoo, Z. Shao, S. Anthony, D. Borthakur, N. Jain, J.S. Sarma, R. Murthy, H. Liu, Data warehousing and analytics infrastructure at facebook, in: ACM, 2010.
[23] A. Toshniwal, S. Taneja, A. Shukla, K. Ramasamy, J.M. Patel, S. Kulkarni, J. Jackson, K. Gade, M. Fu, J. Donham, et al., Storm@ twitter, in: Proceedings of the
2014 ACM SIGMOD International Conference on Management of Data, 2014, pp. 147–156.
[24] S. Venkataraman, N. Tolia, P. Ranganathan, R.H. Campbell, Consistent and durable data structures for non-volatile byte-addressable memory, in: File and Storage
Technologies, 2011.
[25] D. Waddington, M. Kunitomi, C. Dickey, S. Rao, A. Abboud, J. Tran, Evaluation of intel 3d-xpoint NVDIMM technology for memory-intensive genomic workloads,
in: Proceedings of the International Symposium on Memory Systems, MEMSYS 2019, Washington, DC, USA, September 30 - October 03, 2019, ACM, 2019.
[26] Y. Xian, X. Wang, Fractal sorting matrix and its application on chaotic image encryption, Inf. Sci. 547 (2021) 1154–1169.
[27] Y. Xian, X. Wang, L. Teng, X. Yan, Q. Li, X. Wang, Cryptographic system based on double parameters fractal sorting vector and new spatiotemporal chaotic
system, Inf. Sci. 596 (2022) 304–320.
[28] J. Xu, L. Zhang, A. Memaripour, A. Gangadharaiah, A. Borase, T.B. Da Silva, S. Swanson, A. Rudoff, Nova-fortis: a fault-tolerant non-volatile main memory file
system, in: Proceedings of the 26th Symposium on Operating Systems Principles, 2017, pp. 478–496.
[29] M. Zaharia, R.S. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave, X. Meng, J. Rosen, S. Venkataraman, M.J. Franklin, et al., Apache spark: a unified engine for big
data processing, Commun. ACM 59 (2016) 56–65.
