Unit IV: Cluster Analysis and Mining
Introduction to cluster analysis, Mining complex type of data: Multidimensional analysis and
descriptive mining of complex data objects, Spatial databases, Multimedia databases,
Mining time series and sequence data, Mining text databases, Mining World Wide Web,
Data Chunking Techniques.
Assignment: Mining time series and sequence data
Data Mining - Cluster Analysis
Cluster is a group of objects that belongs to the same class. In other words, similar
objects are grouped in one cluster and dissimilar objects are grouped in another
cluster.
What is Clustering?
Clustering is the process of making a group of abstract objects into classes of
similar objects.
Points to Remember
A cluster of data objects can be treated as one group.
While doing cluster analysis, we first partition the set of data into groups based on data similarity and
then assign the labels to the groups.
The main advantage of clustering over classification is that, it is adaptable to changes and helps single
out useful features that distinguish different groups.
Applications of Cluster Analysis
Clustering analysis is broadly used in many applications such as market research, pattern recognition,
data analysis, and image processing.
Clustering can also help marketers discover distinct groups in their customer base. And they can
characterize their customer groups based on the purchasing patterns.
In the field of biology, it can be used to derive plant and animal taxonomies, categorize genes with
similar functionalities and gain insight into structures inherent to populations.
Clustering also helps in identification of areas of similar land use in an earth observation database. It
also helps in the identification of groups of houses in a city according to house type, value, and
geographic location.
Clustering also helps in classifying documents on the web for information discovery.
Clustering is also used in outlier detection applications such as detection of credit card fraud.
As a data mining function, cluster analysis serves as a tool to gain insight into the distribution of data to
observe characteristics of each cluster.
Requirements of Clustering in Data Mining
The following points throw light on why clustering is required in data mining −
Scalability − We need highly scalable clustering algorithms to deal with large databases.
Ability to deal with different kinds of attributes − Algorithms should be capable to be applied on any
kind of data such as interval-based (numerical) data, categorical, and binary data.
Discovery of clusters with attribute shape − The clustering algorithm should be capable of detecting
clusters of arbitrary shape. They should not be bounded to only distance measures that tend to find
spherical cluster of small sizes.
High dimensionality − The clustering algorithm should not only be able to handle low-dimensional
data but also the high dimensional space.
Ability to deal with noisy data − Databases contain noisy, missing or erroneous data. Some
algorithms are sensitive to such data and may lead to poor quality clusters.
Interpretability − The clustering results should be interpretable, comprehensible, and usable.
Clustering Methods
Clustering methods can be classified into the following categories −
Partitioning Method
Hierarchical Method
Density-based Method
Grid-Based Method
Model-Based Method
Constraint-based Method
Partitioning Method
Suppose we are given a database of ‘n’ objects and the partitioning method
constructs ‘k’ partition of data. Each partition will represent a cluster and k ≤ n. It
means that it will classify the data into k groups, which satisfy the following
requirements −
Each group contains at least one object.
Each object must belong to exactly one group.
Points to remember −
For a given number of partitions (say k), the partitioning method will create an initial partitioning.
Then it uses the iterative relocation technique to improve the partitioning by moving objects from one
group to other.
Hierarchical Methods
This method creates a hierarchical decomposition of the given set of data objects.
We can classify hierarchical methods on the basis of how the hierarchical
decomposition is formed. There are two approaches here −
Agglomerative Approach
Divisive Approach
Agglomerative Approach
This approach is also known as the bottom-up approach. In this, we start with each
object forming a separate group. It keeps on merging the objects or groups that are
close to one another. It keep on doing so until all of the groups are merged into one
or until the termination condition holds.
Divisive Approach
This approach is also known as the top-down approach. In this, we start with all of
the objects in the same cluster. In the continuous iteration, a cluster is split up into
smaller clusters. It is down until each object in one cluster or the termination
condition holds. This method is rigid, i.e., once a merging or splitting is done, it can
never be undone.
Approaches to Improve Quality of Hierarchical Clustering
Here are the two approaches that are used to improve the quality of hierarchical
clustering −
Perform careful analysis of object linkages at each hierarchical partitioning.
Integrate hierarchical agglomeration by first using a hierarchical agglomerative algorithm to group
objects into micro-clusters, and then performing macro-clustering on the micro-clusters.
Density-based Method
This method is based on the notion of density. The basic idea is to continue growing
the given cluster as long as the density in the neighborhood exceeds some
threshold, i.e., for each data point within a given cluster, the radius of a given cluster
has to contain at least a minimum number of points.
Grid-based Method
In this, the objects together form a grid. The object space is quantized into finite
number of cells that form a grid structure.
Advantages
The major advantage of this method is fast processing time.
It is dependent only on the number of cells in each dimension in the quantized space.
Model-based methods
In this method, a model is hypothesized for each cluster to find the best fit of data
for a given model. This method locates the clusters by clustering the density
function. It reflects spatial distribution of the data points.
This method also provides a way to automatically determine the number of clusters
based on standard statistics, taking outlier or noise into account. It therefore yields
robust clustering methods.
Constraint-based Method
In this method, the clustering is performed by the incorporation of user or
application-oriented constraints. A constraint refers to the user expectation or the
properties of desired clustering results. Constraints provide us with an interactive
way of communication with the clustering process. Constraints can be specified by
the user or the application requirement.
*Multidimensional Analysis and Descriptive Mining of Complex Data
Objects
What are complex data object?
1. Many advanced, data-intensive applications, such as scientific research and engineering
design, need to store, access, and analyze complex but relatively structured data objects.
2. These objects cannot be represented as simple and uniformly structured records (i.e., tuples)
in data relations.
3. These kinds of systems deal with the efficient storage and access of vast amounts of disk-
based complex structured data objects.
4. These systems organize a large set of complex data objects into classes, which are in turn
organized into class/subclass hierarchies.
5. Each object in a class is associated with (1) an object-identifier, (2) a set of attributes that
may contain sophisticated data structures, set- or list-valued data, class composition
hierarchies, multimedia data, and (3) a set of methods that specify the computational routines
or rules associated with the object class.
How can we Generalize the Structured Data
Typically, set-valued data can be generalized by:-
(1) Generalization of each value in the set to its corresponding higher-level concept,
(2) Derivation of the general behavior of the set, such as the number of elements in the set,
the types or value ranges in the set, the weighted average for numerical data, or the major
clusters formed by the set.
Suppose that the hobby of a person is a set-valued attribute containing the set of values
{tennis, hockey, soccer, violin, this set can be generalized to a set of high-level concepts,
such as {sports, music, computer games}
How Aggregation and Approximation is done in Spatial and Multimedia Data
Generalization
1. Aggregation and approximation are another important means of generalization
2. In a spatial merge, it is necessary to not only merge the regions of similar types within the
same general class but also to compute the total areas, average density, or other aggregate
functions while ignoring some scattered regions with different types if they are unimportant
to the study.
3. Spatial aggregation and approximation. Suppose that we have different pieces of land for
various purposes of agricultural usage, such as the planting of vegetables, grains, and fruits.
These pieces can be merged or aggregated into one large piece of agricultural land by a
spatial merge. However, such a piece of agricultural land may contain highways, houses, and
small stores. If the majority of the land is used for agriculture, the scattered regions for other
purposes can be ignored, and the whole region can be claimed as an agricultural area
by approximation
4. A multimedia database may contain complex texts, graphics, images, video fragments, maps,
voice, music, and other forms of audio/video information
5. Generalization on multimedia data can be performed by recognition and extraction of the
essential features and/or general patterns of such data. There are many ways to extract such
information. For an image, the size, color, shape, texture, orientation, and relative positions
and structures of the contained objects or regions in the image can be extracted by
aggregation and/or approximation. For a segment of music, its melody can be summarized
based on the approximate patterns that repeatedly occur in the segment, while its style can be
summarized based on its tone, tempo, or the major musical instruments played.
What is Spatial Data Mining?
1. A spatial database stores a large amount of space-related data, such as maps, preprocessed
remote sensing or medical imaging data, and VLSI chip layout data. Spatial databases have
many features distinguishing them from relational databases.
2. Spatial data mining refers to the extraction of knowledge, spatial relationships, or other
interesting patterns not explicitly stored in spatial databases. Such mining demands an
integration of data mining with spatial database technologies.
3. It can be used for understanding spatial data, discovering spatial relationships and
relationships between spatial and nonspatial data, constructing spatial knowledge bases,
reorganizing spatial databases, and optimizing spatial queries.
4. It is expected to have wide applications in geographic information systems, geomarketing,
remote sensing, image database exploration, medical imaging, navigation, traffic control,
environmental studies, and many other areas where spatial data are used.
5. A crucial challenge to spatial data mining is the exploration of efficient spatial data mining
techniques due to the huge amount of spatial data and the complexity of spatial data types and
spatial access methods.
“Can we construct a spatial data warehouse?” Yes, as with relational data, we can integrate
spatial data to construct a data warehouse that facilitates spatial data mining. A spatial data
warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of both
spatial and nonspatial data in support of spatial data mining and spatial-datarelated decision-
making processes.
There are three types of dimensions in a spatial data cube
1. A nonspatial dimension contains only nonspatial data. (such
as “hot” for temperature and “wet” for precipitation)
2. A spatial-to-nonspatial dimension is a dimension whose primitive-level data are spatial but
whose generalization, starting at a certain high level, becomes nonspatial.eg:-city
3. A spatial-to-spatial dimension is a dimension whose primitive level and all of its highlevel
generalized data are spatial.eg:equitemp.
Two types of measures in a spatial data cube:
1. A numerical measure contains only numerical data. For example, one measure in a
spatial data warehouse could be the monthly revenue of a region
2. A spatial measure contains a collection of pointers to spatial
objects.eg temperature and precipitation
There are several challenging issues regarding the construction and utilization of spatial
datawarehouses.
1. The first challenge is the integration of spatial data from heterogeneous sources and systems.
2. The second challenge is the realization of fast and flexible on-line analytical processing in
spatial data warehouses.
Mining Spatial Association
A spatial association rule is of the form A=>B [s%;c%], where A and B are sets of spatial or
nonspatial predicates, s% is the support of the rule, and c%is the confidence of the rule. For
example, the following
is a spatial association rule: is a(X; “school”)^close to(X; “sports center”))=>close to(X;
“park”) [0:5%;80%].
This rule states that 80% of schools that are close to sports centers are also close to parks, and
0.5% of the data belongs to such a case.
------------------------------------------------------------------------------------------------
What is Multimedia Data Mining?
A multimedia database system stores and manages a large collection of multimedia data, such
as audio, video, image, graphics, speech, text, document, and hypertext data, which contain
text, text markups, and linkages.
Similarity Search in Multimedia Data
For similarity searching in multimedia data, we consider two main families of multimedia
indexing and retrieval systems:
(1) description-based retrieval systems, which build indices and perform object retrieval
based on image descriptions, such as keywords, captions, size, and time of creation;
(2) content-based retrieval systems, which support retrieval based on the image content,
such as color histogram, texture, pattern, image topology, and the shape of objects and their
layouts and locations within the image.
In a content-based image retrieval system, there are often two kinds of queries:
Image sample- based queries and image feature specification queries.
Image-sample-based queries find all of the images that are similar to the given image
sample. This search compares the feature vector (or signature) extracted from the sample with
the feature vectors of images that have already been extracted and indexed in the image
database. Based on this comparison, images that are close to the sample image are returned.
Image feature specification queries specify or sketch image features like color, texture, or
shape, which are translated into a feature vector to be matched with the feature vectors of the
images in the database
Mining Associations in Multimedia Data
1. Associations between image content and non image content features:
A rule like “If at least 50% of the upper part of the picture is blue, then it is likely to
represent sky” belongs to this category since it links the image content to the keyword sky.
2. Associations among image contents that are not related to spatial relationships: A rule like
“If a picture contains two blue squares, then it is likely to contain one red circle a swell”
belongs to this category since the associations are all regarding image contents.
3. Associations among image contents related to spatial relationships: A rule like “If a red
triangle is between two yellow squares, then it is likely a big oval-shaped object is
underneath” belongs to this category since it associates objects in the image with spatial
relationship.
Several approaches have been proposed and studied for similarity-based retrieval in image
databases, based on image signature
1. Color histogram–based signature: In this approach, the signature of an image includes color
histograms based on the color composition of an image regardless of its scale or orientation.
This method does not contain any information about shape, image topology, or texture.
2. Multifeature composed signature: In this approach, the signature of an image includes a
composition of multiple features: color histogram, shape, image topology, and texture.
“Can we construct a data cube for multimedia data analysis?” To facilitate the
multidimensional analysis of large multimedia databases, multimedia data cubes can be
designed and constructed in a manner similar to that for traditional data cubes from relational
data. A multimedia data cube can contain additional dimensions and measures for multimedia
information, such as color, texture, and shape.
*What is Text Mining?
Text databases (or document databases), which consist of large collections of documents
from various sources, such as news articles, research papers, books, digital libraries, e-mail
messages, and Web pages. Text databases are rapidly growing due to the increasing amount
of information available in electronic form, such as electronic publications, various kinds of
electronic documents, e-mail, and the World Wide Web .
What is IR(Information Retrieval System)?
A typical information retrieval problem is to locate relevant documents in a document
collection based on a user’s query, which is often some keywords describing an information
need, although it could also be an example relevant document. In such a search problem, a
user takes the initiative to “pull” the relevant information out from the collection; this is most
appropriate when a user has some ad hoc (i.e., short-term)information need, such as finding
information to buy a used car. When a user has a long-term information need (e.g., a
researcher’s interests), a retrieval system may also take the initiative to “push” any newly
arrived information item to a user if the item is judged as being relevant to the user’s
information need. Such an information access
process is called information filtering, and the corresponding systems are often
called filtering
systems or recommender systems.
Basic Measures for Text Retrieval: Precision and Recall
“Suppose that a text retrieval system has just retrieved a number of documents for me based
on my input in the form of a query. How can we assess how accurate or correct the system
was?”
Precision: This is the percentage of retrieved documents that are in fact relevant to
the query (i.e., “correct” responses). It is formally defined as
Recall: This is the percentage of documents that are relevant to the query and were,
in fact, retrieved. It is formally defined as
How Mining theWorld WideWeb is done?
The World Wide web serves as a huge,widely distributed, global information service center
for news, advertisements, consumer information, financial management, education,
government, e-commerce, and many other information services. The Web also contains
a rich and dynamic collection of hyperlink information and Web page access and usage
information, providing rich sources for data mining.
challenges for effective resource and knowledge discovery in web
1. The Web seems to be too huge for effective data warehousing and data mining. The size of
the Web is in the order of hundreds of terabytes and is still growing rapidly. Many
organizations and societies place most of their public-accessible information on the Web. It is
barely possible to set up a data warehouse to replicate, store, or integrate all of the data on the
Web.
2. The complexity of Web pages is far greater than that of any traditional text document
collection. Web pages lack a unifying structure.
3. The Web is a highly dynamic information source. Not only does the Web grow rapidly,
but its information is also constantly updated.
4. TheWeb serves a broad diversity of user communities. The Internet currently connects
more than 100 million workstations, and its user community is still rapidly expanding.
These challenges have promoted research into efficient and effective discovery and use of
resources on the Internet.
Mining the WWW
1. Mining theWeb Page Layout Structure
The basic structure of a Web page is its DOM(Document Object Model) structure. The DOM
structure of a Web page is a tree structure, where every HTML tag in the page corresponds to
a node in the DOM tree. The Web page can be segmented by some predefined structural tags.
Thus the DOM structure can be used to facilitate information extraction.
Here, we introduce an algorithm called VIsion-based Page Segmentation (VIPS).
VIPS aims to extract the semantic structure of a Web page based on its visual presentation
2. Mining the Web’s Link Structures to Identify Authoritative Web Pages
The Web consists not only of pages, but also of hyperlinks pointing from one page to
another.
These hyperlinks contain an enormous amount of latent human annotation that can help
automatically infer the notion of authority. These properties of Web link structures have led
researchers to consider another important category of Web pages called a hub. A hub is one
or a set ofWeb pages that provides collections of links to authorities.
*What are the various Data Mining Applications?
Data Mining for Financial Data Analysis-
Design and construction of data warehouses for multidimensional data analysis and
data mining: Financial data collected in the banking and financial industry are often
relatively complete, reliable, and of high quality, which facilitates systematic data analysis
and data mining. One may like to view the debt and revenue changes by month, by region, by
sector, and by other factors, along with maximum, minimum, total, average, trend, and other
statistical information.
Loan payment prediction and customer credit policy analysis: Loan payment prediction
and customer credit analysis are critical to the business of a bank. Many factors can strongly
or weakly influence loan payment performance and customer credit rating.
Classification and clustering of customers for targeted marketing : Classification
and clustering methods can be used for customer group identification and targeted
marketing.
For example, we can use classification to identify the most crucial factors that may influence
a customer’s decision regarding banking. Customers with similar behaviors regarding loan
payments may be identified by multidimensional clustering techniques.
Detection of money laundering and other financial crimes: To detect money laundering
and other financial crimes, it is important to integrate information from multiple
databases (like bank transaction databases, and federal or state crime history databases), as
long as they are potentially related to the study
Data Mining for the Retail Industry
Design and construction of data warehouses based on the benefits of data
mining: Because retail data cover a wide spectrum (including sales, customers, employees,
goods transportation, consumption, and services), there can be many ways to design a data
warehouse for this industry.
Multidimensional analysis of sales, customers, products, time, and region:
The retail industry requires timely information regarding customer needs, product
sales, trends,and fashions, as well as the quality, cost, profit, and service of commodities
Analysis of the effectiveness of sales campaigns: The retail industry conducts sales
campaigns using advertisements, coupons, and various kinds of discounts and bonuses
to promote products and attract customers
Customer retention—analysis of customer loyalty: With customer loyalty card
information,
one can register sequences of purchases of particular customers. Customer loyalty and
purchase trends can be analyzed systematically
Product recommendation and cross-referencing of items: By mining associations
from sales records, one may discover that a customer who buys a digital camera is
likely to buy another set of items. Such information can be used to form product
recommendations. Collaborative recommender systems use data mining techniques
to make personalized product recommendations during live customer transactions,
based on the opinions of other customers
Data Mining for the Telecommunication Industry
Fraudulent pattern analysis and the identification of unusual patterns:
Fraudulent activity costs the telecommunication industry millions of dollars per year. It
is important to (1) identify potentially fraudulent users and their atypical usage patterns;
(2) detect attempts to gain fraudulent entry to customer accounts; and(3) discover unusual
patterns that may need special attention, such as busy-hour frustrated call attempts, switch
and route congestion patterns, and periodic calls from automatic dial-out equipment (like fax
machines) that have been improperly programmed
Multidimensional association and sequential pattern analysis: The discovery of
association
and sequential patterns in multidimensional analysis can be used to promote
telecommunication services. For example, suppose you would like to find usage patterns for a
set of communication services by customer group, by month, and by time of day.
Mobile telecommunication services: Mobile telecommunication, Web and information
services, and mobile computing are becoming increasingly integrated and
common in our work and life.
Explain the Social Impacts of Data Mining?
Ubiquitous data mining is the ever presence of data mining in many aspects of our daily
lives. It can influence how we shop, work, search for information, and use a computer, as
well as our leisure time, health, and well-being. In invisible data mining, “smart” software,
such as Web search engines, customer-adaptive Web services (e.g., using recommender
algorithms), e-mail managers, and so on, incorporates data mining into its functional
components, often unbeknownst to the user.
From grocery stores that print personalized coupons on customer receipts to on-line stores
that recommend additional items based on customer interests, data mining has innovatively
influenced what we buy, the way we shop, as well as our experience while shopping.
Data mining has shaped the on-line shopping experience. Many shoppers routinely turn to
on-line stores to purchase books, music, movies, and toys
Many companies increasingly use data mining for customer relationship management
(CRM), which helps provide more customized, personal service addressing individual
customer’s needs, in lieu of mass marketing
While you are viewing the results of your Google query, various ads pop up relating
to your query. Google’s strategy of tailoring advertising to match the user’s interests
is
successful—it has increased the clicks for the companies involved by four to five times.
Web-wide tracking is a technology that tracks a user across each site she visits. So,while
Surfing the Web, information about every site you visit may be recorded,which can provide
marketers with information reflecting your interests, lifestyle, and habits
Finally, data mining can contribute toward our health and well-being. Several
pharmaceutical companies use data mining software to analyze data when developing drugs
and to find associations between patients, drugs, and outcomes. It is also being used to detect
beneficial side effects of drugs
What are the Major concern of Data mining?
A major social concern of data mining is the issue of privacy and data security, particularly
as the amount of data collected on individuals continues to grow. Fair information practices
were established for privacy and data protection and cover aspects regarding the collection
and use of personal data. Data mining for counterterrorism can benefit homeland security and
save lives, yet raises additional concerns for privacy due to the possible access of personal
data. Efforts towards
ensuring privacy and data security include the development of privacy-preserving data
mining (which deals with obtaining valid data mining results without learning the underlying
data values) and data security–enhancing techniques (such as encryption)
What are the Recent trends in Data mining?
Trends in data mining include further efforts toward the exploration of new application areas,
improved scalable and interactive methods (including constraint-based mining), the
integration of data mining with data warehousing and database systems, the standardization
of data mining languages, visualization methods, and new methods for handling complex data
types. Other trends include biological data mining, mining software bugs, Web mining,
distributed and real-time mining, graph mining, social network analysis, multi relational and
multi database data mining, data privacy protection, and data security
Data chunking
Data deduplication is an emerging technology that introduces reduction of storage utilization and an efficient way
of handling data replication in the backup environment. In cloud data storage, the deduplication technology plays
a major role in the virtual machine framework, data sharing network, and structured and unstructured data
handling by social media and, also, disaster recovery. In the deduplication technology, data are broken down into
multiple pieces called “chunks” and every chunk is identified with a unique hash identifier. These identifiers are
used to compare the chunks with previously stored chunks and verified for duplication. Since the chunking
algorithm is the first step involved in getting efficient data deduplication ratio and throughput, it is very important
in the deduplication scenario.
What is data chunking? How can chunking help to organize large multidimensional datasets for both fast
and flexible data access? How should chunk shapes and sizes be chosen? Can software such as netCDF-4 or
HDF5 provide better defaults for chunking? If you're interested in those questions and some of the issues
they raise, read on ...
Is anyone still there? OK, let's start with a real-world example of the improvements possible with chunking
in netCDF-4. You may be surprised, as I was, by the results. Maybe looking at examples in this post will help
provide some guidance for effective use of chunking in other similar cases.
First let's consider a single large 3D variable from the NCEP North American Regional Reanalysis
representing air temperature (if you must know, it's at the 200 millibars level, at every 3 hours, on a 32.463
km resolution grid, over 33 years from 1979-01-01 through 2012-07-31):
dimensions:
y = 277 ;
x = 349 ;
time = 98128 ;
variables:
float T(time, y, x);
Of course the file has lots of other metadata specifying units, coordinate system, and data provenance, but in
terms of size it's mostly just that one big variable: 9.5 billion values comprising 38 GB of data.
This file is an example of PrettyBigData (PBD). Even though you can store it on a relatively cheap flash
drive, it's too big to deal with quickly. Just copying it to a 7200 rpm spinning disk takes close to 20 minutes.
Even copying to fast solid state disk (SSD) takes over 4 minutes. For a human-scale comparison, its close to
the storage used for a blu-ray version of a typical movie, about 50 GB. (As an example of ReallyBigData
(RBD), a data volume beyond the comprehension of ordinary humans, consider the 3D, 48 frame per second
version of "The Hobbit, Director's Cut".)
Access Patterns make a difference
For now, let's ignore issues of compression, and just consider putting that file on a server and permitting
remote access to small subsets of the data. Two common access patterns are:
1. Get a 1D time-series of all the data from a specific spatial grid point.
2. Get a 2D spatial cross section of all the data at a specific time.
The first kind of access is asking for the 1D array of values on one of the red lines, pictured on the left,
below; the second is asking for the 2D array of values on one of the green planes pictured on the right.
With a conventional contiguous (index-order) storage layout, the time dimension varies most slowly, y
varies faster, and x varies fastest. In this case, the spatial access is fast (0.013 sec) and the time series access
is slow (180 sec, which is 14,000 times slower). If we instead want the time series to be quick, we can
reorganize the data so x or y is the most slowly varying dimension and time varies fastest, resulting in fast
time-series access (0.012 sec) and slow spatial access (200 sec, 17,000 times slower). In either case, the slow
access is so slow that it makes the data essentially inaccessible for all practical purposes, e.g. in analysis or
visualization.
But what if we want both kinds of access to be relatively fast? Well, we could punt and make two versions of
the file, each organized appropriately for the kind of access desired. But that solution doesn't scale well. For
N-dimensional data, you would need N copies of the data to support optimal access along any axis, and N!
copies to support optimal access to any cross-section defined by some subset of the N dimensions.
A better solution, known for at least 30 years, is the use of chunking, storing multidimensional data in
multi-dimensional rectangular chunks to speed up slow accesses at the cost of slowing down fast accesses.
Programs that access chunked data can be oblivious to whether or how chunking is used. Chunking is
supported in the HDF5 layer of netCDF-4 files, and is one of the features, along with per-chunk
compression, that led to a proposal to use HDF5 as a storage layer for netCDF-4 in 2002.
Benefits of Chunking
I think the benefits of chunking are under-appreciated.
Large performance gains are possible with good choices of chunk shapes and sizes. Chunking also supports
efficiently extending multidimensional data along multiple axes (in netCDF-4, this is called "multiple
unlimited dimensions") as well as efficient per-chunk compression, so reading a subset of a compressed
variable doesn't require uncompressing the whole variable.
So why isn't chunking more widely used? I think reasons include at least the following:
1. Advice for how to choose chunk shapes and sizes for specific patterns of access is lacking.
2. Default chunk shapes and sizes for libraries such as netCDF-4 and HDF5 work poorly in some
common cases.
3. It's costly to rewrite big datasets that use conventional contiguous layouts to use chunking instead.
For example, even if you can fit the whole variable, uncompressed, in memory, chunking a 38GB
variable can take 20 or 30 minutes.
This series of posts and better guidance in software documentation will begin to address the first problem.
HDF5 already has a start with a white paper on chunking.
The second reason for under-use of chunking is not so easily addressed. Unfortunately, there are no general-
purpose chunking defaults that are optimal for all uses. Different patterns of access lead to different chunk
shapes and sizes for optimum access. Optimizing for a single specific pattern of access can degrade
performance for other access patterns.
Finally, the cost of chunking data means that you either need to get it right when the data is created, or the
data must be important enough that the cost of rechunking for many read accesses is justified. In the latter
case, you may want to consider acquiring a computing platform with lots of memory and SSD, just for the
purpose of rechunking important datasets.
What a difference the shape makes
We have claimed that good choices of chunk shapes and sizes can make large datasets useful for access in
multiple ways. For the specific example we've chosen, how well do the netCDF-4 library defaults work, and
what's the best we can do by tailoring the chunking to the specific access patterns we've chosen, 1D time
series at a point and 2D spatial access at a specific time?
Here's a table of timings for various shapes and sizes of chunks, using conventional local 7200 rpm spinning
disk with 4096-byte physical disk blocks, the kind of storage that's still prevalent on desk-top and
departmental scale platforms:
Storage layout, Read time Read spatial Performance bias
chunk shapes series (sec) slice (sec) (slowest / fastest)
Contiguous favoring time range 0.013 180 14000
Contiguous favoring spatial slice 200 0.012 17000
Default (all axes equal) chunks, 4673 x 12 x 16 1.4 34 24
36 KB chunks, 92 x 9 x 11 2.4 1.7 1.4
8 KB chunks, 46 x 6 x 8 1.4 1.1 1.2
We've already seen the timings in the first two rows of this table, showing huge performance bias when
using contiguous layouts. The third row shows the current netCDF-4 default for chunking this data,
choosing chunk sizes close to 4 MB and trying to equalize the number of chunks along any axis. This turns
out not to be particularly good for trying to balance 1D and 2D accesses. The fourth row shows results of
smaller chunk sizes, using shapes that provide a better balance between time series and spatial slice
accesses for this dataset.
I think the last row of this table supports the main point to be made in this first posting on chunking data.
By creating or rewriting important large multidimensional datasets using appropriate chunking, you can
tailor their access characteristics to make them more useful. Proper use of chunking can support more than
one common query pattern.
That's enough for now. In part 2, we'll discuss how to determine good chunk shapes, present a general way
to balance access times for 1D and 2D accesses in 3D variables, say something about generalizations to
higher dimension variables, and provide examples of rechunking times using the nccopy and h5repack
utilities.
In later posts, we plan to offer opinions on related issues, possibly including
chunk size tradeoffs: small chunks vs. large chunks
chunking support in tools
chunking and compression
complexity of the general rechunking problem
problems with big disk blocks
worst case performance
space-filling curves for improving access locality
In the meantime, Keep On Chunkin' ...
A note about timings
The times we quoted above are averaged from multiple consecutive runs on a desktop Linux system
(2.27GHz Xeon CPU, 4 cores, 24 GB of memory, 7200 rpm disk on a SATA-3 bus). Each timing uses a
different set of chunks, so we are not exploiting chunk caching. Before reading a subset of data from a file,
we run a command to flush and clear all the disk caches in memory, so running the timing repeatedly yields
nearly the same time. The command to clear disk caches varies from system to system. On Linux, it requires
privileges to set the SUID bit on a shell script:
#!/bin/sh
# Flush and clear the disk caches.
sync
sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"
On OSX, the only privilege required is knowledge of the "purge" command, which we wrap in a shell script
to make our benchmarks work the same on OSX as on Linux:
#!/bin/sh
# Flush and clear the disk caches.
purge