KEMBAR78
Efficient Caching for Developing Regions | PDF | Cpu Cache | Proxy Server
0% found this document useful (0 votes)
219 views18 pages

Efficient Caching for Developing Regions

1. HashCache is designed to serve caching needs in developing world environments with low resources. It can be configured for small classrooms or large networks using different indexing schemes. 2. HashCache stores cached objects on disk based on a hash of their URL. It aims to reduce the memory needed for indexing cached objects to allow caching of more objects than traditional caches within the same memory constraints. 3. The design of HashCache focuses on eliminating the need for a full in-memory index while still providing reasonable performance through techniques like caching frequently accessed metadata in memory and using minimal indexing selectively to improve performance further.

Uploaded by

Hari S Pillai
Copyright
© Attribution Non-Commercial (BY-NC)
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOC, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
219 views18 pages

Efficient Caching for Developing Regions

1. HashCache is designed to serve caching needs in developing world environments with low resources. It can be configured for small classrooms or large networks using different indexing schemes. 2. HashCache stores cached objects on disk based on a hash of their URL. It aims to reduce the memory needed for indexing cached objects to allow caching of more objects than traditional caches within the same memory constraints. 3. The design of HashCache focuses on eliminating the need for a full in-memory index while still providing reasonable performance through techniques like caching frequently accessed metadata in memory and using minimal indexing selectively to improve performance further.

Uploaded by

Hari S Pillai
Copyright
© Attribution Non-Commercial (BY-NC)
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOC, PDF, TXT or read online on Scribd
You are on page 1/ 18

SEMINAR REPORT 2010 HASH CACHE

1. Introduction

HashCache is designed to serve the needs of developingworld environments, starting


with classrooms but working toward backbone networks. In addition to good
performance with low resource consumption, HashCache provides a number of
additional benefits suitable for developing-world usage: (a) many HashCache policies
can be tailored to use main memory in proportion to system activity, instead of cache
size; (b) unlike commercial caching appliances, HashCache does not need to be the
sole application running on the machine; (c) by simply choosing the appropriate
indexing scheme, the same cache software can be configured as a low-resource end
user cache appropriate for small classrooms, as well as a high-performance backbone
cache for higher levels of the network; (d) in its lowest-memory configurations,
HashCache can run on laptop-class hardware attached to External multi-terabyte
storage (via USB, for example), a scenario not even possible with existing designs;
and (e) HashCache provides a flexible caching layer, allowing it to be used not only
for Web proxies, but also for other cache-oriented storage systems. A previous
analysis of Web traffic in developing regions shows great potential for improving
Web performance. According to the study, kiosks in Ghana and Cambodia, with 10
to 15 users per day, have downloaded over 100 GB of data within a few months,
involving 12 to 14 million URLs. The authors argue for the need for applications that
can perform HTTP caching, chunk caching for large downloads and other forms of
caching techniques to improve the Web performance. With the introduction of
personal laptops into these areas, it is reasonable to expect even higher network traffic
volumes. Since HashCache can be shared by many applications and is not HTTP-
specific, it avoids the problem of diminishing returns seen with large HTTP-only
caches. Hash- Cache can be used by both a Web proxy and a WAN accelerator, which
stores pieces of network traffic to provide protocol-independent network compression.
This combination allows the Web cache to store static Web content, and then use the
WAN accelerator to reduce redundancy in dynamically-generated content, such as
news sites,Wikipedia, or even locally-generated content, all of which may be marked
uncacheable, but which tend to only change slowly over time. While modern Web
pages may be large, they tend to be composed of many small objects, such as dozens
of small embedded images. These objects, along with tiny fragments of cached
DEPT OF IT 1 KMCTCE
SEMINAR REPORT 2010 HASH CACHE
network traffic from a WAN accelerator, put pressure on traditional caching
approaches using in-memory indexing. A Web proxy running on a terabyte-sized
HashCache can provide a large HTTP store, allowing us to not only cache a wide
range of traffic, but also speculatively preload content during off-peak hours.
Furthermore, this kind of system can be driven from a typical OLPC-class laptop,
with only 256MB of total RAM. One such laptop can act as a cache server for the rest
of the laptops in the deployment, eliminating the need for separate server class
hardware. In comparison, using current Web proxies, these laptops could only index
30GB of disk space.

DEPT OF IT 2 KMCTCE
SEMINAR REPORT 2010 HASH CACHE

2. System principle

While typical Web proxies implement a number of features, such as HTTP protocol
handling, connection management, DNS and in-memory object caching, their
performance is generally dominated by their file system organization. As such, we
focus on the file system component because it determines the overall performance of
a proxy in terms of the peak request rate and object cacheability. With regard to file
systems, the two main optimizations employed by proxy servers are hashing and
indexing objects by their URLs, and using raw disk to bypass file system
inefficiencies. We discuss both of these aspects below.

The Harvest cache introduced the design of storing objects by a hash of their
URLs, and keeping an inmemory index of objects stored on disk. Typically, two
levels of subdirectories were created, with the fan-out of each level configurable. The
high-order bits of the hash were used to select the appropriate directories, and the file
was ultimately named by the hash value. This approach not only provided a simple
file organization, but it also allowed most queries for the presence of objects to be
served from memory, instead of requiring disk access. The older CERN proxy, by
contrast, stored objects by creating directories that matched the components of the
URL. By hashing the URL, Harvest was able to con trol both the depth and fan-out of
the directories used to store objects. The CERN proxy, Harvest, and its descendant,
Squid, all used the file systems provided by the operating system, simplifying the
proxy and eliminating the need for controlling the on-disk layout. The next step in the

DEPT OF IT 3 KMCTCE
SEMINAR REPORT 2010 HASH CACHE
evolution of proxy design was using raw disk and custom file systems to eliminate
multiple levels of directory traversals and disk head seeks associated with them. The
in-memory index now stored the location on disk where the object was stored,
eliminating the need for multiple seeks to find the start of the object. 1 The first block
of the on-disk file typically includes extra metadata that is too big to be held in
memory, such as the complete URL, full response headers, and location of subsequent
parts of the object, if any, and is followed by the content fetched from the origin
server. In order to fully utilize the disk writing throughput, those blocks are often
maintained consecutively, using a technique similar to log-structured file system(LFS)
. Unlike LFS, which is expected to retain files until deleted by the user, cache file
systems can often perform disk cache replacement in LIFO order, even if other
approaches are used for main memory cache replacement. Table 1 summarizes the
object lookup and storage management of various proxy implementations that have
been used to build Web caches.

The upper bound on the number of cacheable objects per proxy is a function of
available disk cache and physical memory size. Attempting to use more memory than
the machine’s physical memory can be catastrophic for caches, since unpredictable
page faults in the application can degrade performance to the point of unusability.
When these applications run as a service at network access points, which is typically
the case, all users then suffer extra latency when page faults occur. The components

DEPT OF IT 4 KMCTCE
SEMINAR REPORT 2010 HASH CACHE
of the in-memory index vary from system to system. Each entry has some object-
specific information, such as its hash value and object size. Information such as the
location on disk, which disk, and which generation of log, to avoid problems with log
wrapping. The entries typically are stored in a chain per hash bin, and a doubly-linked
LRU list across all index entries. Finally, to shorten hash bin traversals (and the
associated TLB pressure), the number of hash bins is typically set to roughly the
number of entries. Using these fields and their sizes, the total consumption per index
entry can be as low as 32 bytes per object, but given that the average Web object is
roughly 8KB (where a page may have tens of objects), even 32 bytes per object
represents an in-memory index storage that is 1/256 the size of the on-disk storage.
With a more realistic index structure, which can include a larger hash value,
expiration time, and other fields, the index entry can be well over 80 bytes (as in the
case of Squid), causing the in-memory index to exceed 1% of the on-disk storage size.
With a single 1TB drive, the in-memory index alone would be over 10GB. Increasing
performance by using multiple disks would then require tens of gigabytes of
RAM.Reducing the RAM needed for indexing is desirable for several scenarios. Since
the growth in disk capacities has been exceeding the growth of RAM capacity for
some time, this trend will lead to systems where the disk cannot be fully indexed due
to a lack of RAM. Dedicated RAM also effectively limits the degree of
multiprogramming of the system, so as processors get faster relative to network
speeds, one may wish to consolidate multiple functions on a single server. WAN
accelerators, for example, cache network data , so having very large storage can
reduce bandwidth consumption more than HTTP proxies alone. Similarly, even in
HTTP proxies, RAM may be more useful as a hot object cache than as an index, as is
the case in reverse proxies (server accelerators) and content distribution networks.
One goal in designing HashCache is to determine how much index memory is really
necessary.

DEPT OF IT 5 KMCTCE
SEMINAR REPORT 2010 HASH CACHE

3. Design

In this section, we present the design of HashCache and show how performance can
be scaled with available memory. We begin by showing how to eliminate the in-
memory index while still obtaining reasonable performance, and then we show how
selective use of minimal indexing can improve performance.

3.1 Removing the In-Memory Index

We start by removing the in-memory index entirely, and incrementally introducing


minimal metadata to systematically improve performance. To remove the in-memory
index, we have to address the two functions the inmemory index serves: indicating the
existence of an object and specifying its location on disk. Using filesystem directories
to store objects by hash has its own performance problems, so we seek an alternative
solution – treating the disk as a simple hashtable. HashCache-Basic, the simplest
design option in the HashCache family, treats part of the disk as a fixed-size, non-
chained hash table, with one object stored in each bin. This portion is called the Disk
Table. It hashes the object name (a URL in the case of aWeb cache) and then
calculates the hash value modulo the number of bins to determine the location of the
corresponding file on disk. To avoid false positives from hash collisions, each stored
object contains metadata, including the original object name, which is compared with
the requested object name to confirm an actual match. New objects for a bin are
simply written over any previous object. Since objects may be larger than the fixed-
size bins in the Disk Table, we introduce a circular log that contains the remaining
portion of large objects. The object metadata stored in each Disk Table bin also
includes the location in the log, the object size, and the log generation number.

DEPT OF IT 6 KMCTCE
SEMINAR REPORT 2010 HASH CACHE

HashCache-Set: Objects with hash value I search through the i/N th set for the first
block of a file. Later blocks are in the circular log. Some arrows are shown crossed to
illustrate that objects that map on to a set can be placed anywhere in the set.

The performance impact of these decisions is as follows: in comparison to high-


performance caches, HashCache-Basic will have an increase in hash collisions
(reducing cache hit rates), and will require a disk access on every request, even cache
misses. Storing objects will require one seek per object (due to the hash randomizing
the location), and possibly an additional write to the circular log.

3.2 Collision Control Mechanism

While in-memory indexes can use hash chaining to eliminate the problem of hash
values mapped to the same bin, doing so for an on-disk index would require many
random disk seeks to walk a hash bin, so we devise a simpler and more efficient
approach while retaining most of the benefits.In HashCache-Set, we expand the Disk
Table to become an N-way set-associative hash table, where each bin can store N
elements. Each element still contains metadata with the full object name, size, and
location in the circular log of any remaining part of the object. Since these locations
are contiguous on disk, and since short reads have much lower latency than seeks,
reading all of the members of the set takes only marginally more time than reading
just one element. This approach reduces the impact of popular objects mapping to the

DEPT OF IT 7 KMCTCE
SEMINAR REPORT 2010 HASH CACHE
same hash bin, while only slightly increasing the time to access an object. While
HashCache-Set eliminates problems stemming from collisions in the hash bins, it still
has several problems: it requires disk access for cache misses, and lacks an efficient
mechanism for cache replacement within the set. Implementing something like LRU
within the set using the on-diskmechanismwould require a potential disk write on
every cache hit, reducing performance.

3.3 Avoiding Seeks for Cache Misses

Requiring a disk seek to determine a cache miss is a major issue for workloads with
low cache hit rates, since an index-less cache would spend most of its disk time
confirming cache misses. This behavior would add extra latency for the end-user, and
provide no benefit. To address the problem of requiring seeks for cache misses, we
introduce the first HashCache policy with any in-memory index, but employ several
optimizations to keep the index much smaller than traditional approaches. As a
starting point, we consider storing in main memory an H-bit hash values for each
cached object. These hash values can be stored in a two-dimensional array which
corresponds to the Disk Table, with one row for each bin, and N columns
corresponding to the N-way associativity. An LRU cache replacement policy would
need forward and reverse pointers per object to maintain the LRU list, bringing the
per-object RAM cost to (H + 64) bits assuming 32-bit pointers. However, we can
reducethis storage as follows.

Table 3: Summary of HashCache policies, with Squid and commercial entries


included for comparison. Main memory consumption values assume an average
object size of 8KB.

DEPT OF IT 8 KMCTCE
SEMINAR REPORT 2010 HASH CACHE

First, we note that all the entries in an N-entry set share the same modulo hash value
(%S) where S is the number of sets in the Disk Table. We can drop the lowest log(S)
bits from each hash value with no loss, reducing the hash storage to only H - log(S)
bits per object. Secondly, we note that cache replacement policies only need to be
implemented within the N-entry set, so LRU can be implemented by simply ranking
the entries from 0 to N-1, thereby using only log(N) bits per entry. We can further
choose to keep in-memory indexes for only some sets, not all sets, so we can restrict
the number of in-memory entries based on request rate, rather than cache size. This
approach keeps sets in an LRU fashion, and fetches the in-memory index for a set
from disk on demand. By keeping only partial sets, we need to also keep a bin number
with each set, LRU pointers per set, and a hash table to find a given set in memory.
Deciding when to use a complete two-dimensional array versus partial sets with bin
numbers and LRU pointers depends on the size of the hash value and the set
associativity. Assuming 8-way associativity and the 8 most significant hash bits per
object, the break-even point is around 50% – once more than half the sets will be
stored in memory, it is cheaper to remove the LRU pointers and bin number, and just
keep all of the sets. If the full array is kept in memory, we call it HashCache-SetMem,
and if only a subset are kept in memory, we call it HashCache-SetMemLRU. With a
low hash collision rate, HashCache-SetMem can determine most cache misses
without accessing disk, whereas HashCache-SetMemLRU, with its tunable memory
consumption, will need disk accesses for some fraction of the misses. However, once
a set is in memory, performing intra-set cache replacement decisions requires no disk
access for policy maintenance. Writing objects to disk will still require disk access.

3.4 Optimizing CacheWrites

With the previous optimizations, cache hits require one seek for small files, and cache
misses require no seeks (excluding false positives from hash collisions) if the
associated set’s metadata is in memory. Cache writes still require seeks, since object
locations are dictated by their hash values, leaving HashCache at a performance
disadvantage to high-performance caches that can write all content to a circular log.
This performance problem is not an issue for caches with low request rates, but will
become a problem for higher request rate workloads. To address this problem, we
DEPT OF IT 9 KMCTCE
SEMINAR REPORT 2010 HASH CACHE
introduce a new policy, HashCache-Log, that eliminates the Disk Table and treats the
disk as a log, similar to the high-performance caches. For some or all objects, we store
an additional offset (32 or 64 bits) specifying the location on disk. We retain the N-
way set associativity and per-set LRU replacement because they eliminate disk seeks
for cache misses with compact implementation. While this approach significantly
increases memory consumption, it can also yield a large performance advantage, so
this tradeoff is useful in many situations. However, even when adding the log
location, the in-memory index is still much smaller than traditional caches. For
example, for 8-way set associativity, per-set LRU requires 3 bits per entry, and 8 bits
per entry can minimize hash collisions within the set. Adding a 32-bit log position
increases the per-entry size from 11 bits to 43 bits, but virtually eliminates the impact
of write traffic, since all writes can now be accumulated and written in one disk seek.
Additionally, we need a few bits (assume 4) to record the log generation number,
driving the total to 47 bits. Even at 47 bits per entry, HashCache-Log still uses
indexes that are a factor of 6-12 times smaller than current high-performance proxies.
We can reduce this overhead even further if we exploitWeb object popularity, where
half of the objects are rarely, if ever, re-referenced . In this case, we can drop half of
the log positions from the in-memory index, and just store them on disk, reducing the
average perentry size to only 31 bits, for a small loss in performance. HashCache-
LogLRU allows the number of log position entries per set to be configured, typically
using N 2 log positions per N-object set. The remaining log offsets in the set are
stored on the disk as a small contiguous file. Keeping this file and the in-memory
index in sync requires a few writes reducing the performance by a small amount. The
in-memory index size, in this case, is 9-20 times smaller than traditional high-
performance systems.

3.5 Prefetching Cache Reads

With all of the previous optimizations, caching storage can require as little as 1 seek
per object read for small objects, with no penalty for cache misses, and virtually no
cost for cache writes that are batched together and written to the end of the circular
log. However, even this performance can be further improved, by noting that
prefetching multiple objects per read can amortize the read cost per object. Correlated
access can arise in situations like Web pages, where multiple small objects may be
DEPT OF IT 10 KMCTCE
SEMINAR REPORT 2010 HASH CACHE
embedded in the HTML of a page, resulting in many objects being accessed together
during a small time period. Grouping these objects together on disk would reduce disk
seeks for reading and writing. The remaining blocks for these pages can all be
coalesced together in the log and written together so that reading them can be faster,
ideally with one seek.

Table 4: Throughputs for techniques, rr = peak request rate, chr = cache hit rate, cbr =
cacheability rate, rel = average number of related objects, t = peak disk seek rate – all
calculations include read prefetching, so the results for Log and Grouped are the
same. To exclude the effects of read prefetching, simply set rel to one.

The only change necessary to support this policy is to keep a content length (in
blocks) for all of the related content written at the same time, so that it can be read
together in one seek. When multiple related objects are read together, the system will
perform reads at less than one seek per read on average. This approach can be applied
to many of the previously described Hash- Cache policies, and only requires that the
application using HashCache provide some information about which objects are
related. Assuming prefetch lengths of no more than 256 blocks, this policy only
requires 8 bits per index entry being read. In the case of HashCache- LogLRU, only
the entries with in-memory log position information need the additional length
information. Otherwise, this length can also be stored on disk. As a result, adding this
prefetching to HashCache-LogLRU only increases the in-memory index size to 35
bits per object, assuming half the entries of each set contain a log position and
prefetch length.
DEPT OF IT 11 KMCTCE
SEMINAR REPORT 2010 HASH CACHE
.
3.6 Expected Throughput

To understand the throughput implications of the various HashCache schemes, we


analyze their expected performance under various conditions. The maximum request
rate(rr) is a function of the disk seek rate, the hit rate, the miss rate, and the write rate.
The write rate is required because not all objects that are fetched due to cache misses
are cacheable. Table 4 presents throughputs for each system as a function of these
parameters. The cache hit rate(chr) is simply a number between 0 and 1, as is the
cacheability rate (cbr). Since the miss rate is (1 - chr), the write rate can be
represented as (1 - chr) · cbr. The peak disk seek rate(t) is a measured quantity that
is hardware-dependent, and the average number of related objects(rel) is always a
positive number. Due to space constraints, we omit the derivations for these
calculations. These throughputs are conservative estimates because we do not take
into account the in-memory hot object cache, where some portion of the main
memory is used as a cache for frequently used objects, which can further improve
throughput.

DEPT OF IT 12 KMCTCE
SEMINAR REPORT 2010 HASH CACHE

4.HashCache Implementation

We implement a common HashCache filesystem I/O layer so that we can easily use
the same interface with different applications. We expose this interface via POSIX-
like calls, such as open(), read(), write(), close(), seek(), etc., to operate on files being
cached. Rather than operate directly on raw disk, HashCache uses a large file in the
standard Linux ext2/ext3 filesystem, which does not require root privilege. Creating
this zero-filled large file on a fresh ext2/ext3 filesystem typically creates a mostly
contiguous on-disk layout. It creates large files on each physical disk and multiplexes
them for performance. The HashCache filesystem is used by the Hash- Cache Web
proxy cache as well as other applications we are developing.

4.1 External Indexing Interface

HashCache provides a simple indexing interface to support other applications. Given


a key as input, the interface returns a data structure containing the file descriptors for
the Disk Table file and the contiguous log file (if required), the location of the
requested content, and metadata such as the length of the contiguous blocks belonging
to the item, etc. We implement the interface for each indexing policy we have
described in the previous section. Using the data returned from the interface one can
utilize the POSIX calls to handle data transfers to and from the disk. Calls to the
interface can block if disk access is needed, but multiple calls can be in flight at the
same time. The interface consists of roughly 600 lines of code, compared to 21000
lines for the HashCache Web Proxy.

4.2 HashCache Proxy

The HashCache Web Proxy is implemented as an event-driven main process with


cooperating helper processes/ threads handling all blocking operations, such as DNS
lookups and disk I/Os, similar to the design of Flash . When the main event loop
receives a URL request from a client, it searches the in-memory hot-object cache to
see if the requested content is already in memory. In case of a cache miss, it looks up
the URL using one of the HashCache indexing policies. Disk I/O helper processes use
DEPT OF IT 13 KMCTCE
SEMINAR REPORT 2010 HASH CACHE
the HashCache filesystem I/O interface to read the object blocks into memory or to
write the fetched object to disk. To minimize inter-process communication (IPC)
between the main process and the helpers, only beacons are exchanged on IPC
channels and the actual data transfer is done via shared memory.

4.3 Flexible Memory Management

HTTP workloads will often have a small set of objects that are very popular, which
can be cached in main memory to serve multiple requests, thus saving disk I/O.
Generally, the larger the in-memory cache, the better the proxy’s performance.
HashCache proxies can be configured to use all the free memory on a system without
unduly harming other applications. To achieve this goal, we implement the hot object
cache via anonymous mmap() calls so that the operating system can evict pages as
memory pressure dictates. Before the HashCache proxy uses the hot object cache, it
checks thememory residency of the page via the mincore() system call, and
simply treats any missing page as a miss in the hot object cache. The hot object cache
is managed as an LRU list and unwanted objects or pages no longer in main memory
can be unmapped. This approach allows the Hash- Cache proxy to use the entire main
memory when no other applications need it, and to seamlessly reduce its memory
consumption when there is memory pressure in the system. In order to maximize the
disk writing throughput, the HashCache proxy buffers recently-downloaded objects so
that many objects can be written in one batch (often to a circular log). These dirty
objects can be served from memory while waiting to be written to disk. This dirty
object cache reduces redundant downloads during flash crowds because many popular
HTTP objects are usually requested by multiple clients. HashCache also provides for
grouping related objects to disk so that they can be read together later, providing the
benefits of prefetching. The HashCache proxy uses this feature to amortize disk seeks
over multiple objects, thereby obtaining higher read performance. One commercial
system parses HTML to explicitly find embedded objects , but we use a simpler
approach – simply grouping downloads by the same client that occur within a small
time window and that have the same HTTP Referrer field. We have found that this
approach works well in practice, with much less implementation complexity.

DEPT OF IT 14 KMCTCE
SEMINAR REPORT 2010 HASH CACHE

4.4 Parameter Selection

For the implementation, we choose some design parameters such as the block size, the
set size, and the hash size. Choosing the block size is a tradeoff between space usage
and the number of seeks necessary to read small objects. In the analysis of object sizes
from a live, widely-usedWeb cache called CoDeeN . We see that nearly 75% of
objects are less than 8KB, while 87.2%are less than 16KB. Choosing an 8KB block
would yield better disk usage, but would requiremultiple seeks for 25% of all objects.
Choosing the larger block size wastes some space, but may increase performance.
Since the choice of block size influences the set size, we make the decisions based on
the performance of current disks.

Table 6 shows the average number of seeks per second of three recent SATA disks
(18, 60 and 150 GB each). We notice the sharp degradation beyond 64KB, so we use
that as the set size. Since 64KB can hold 4 blocks of 16KB each or 8 blocks of 8KB
each, we opt for an 8KB block size to achieve 8-way set associativity. With 8 objects
per set, we choose to keep 8 bits of hash value per object for the in-memory indexes,
to reduce the chance of collisions. This kind of an analysis can be automatically
performed during initial system configuration, and are the only parameters needed
once the specific HashCache policy is chosen.

DEPT OF IT 15 KMCTCE
SEMINAR REPORT 2010 HASH CACHE

5. Testing and Deployments

HashCache is currently deployed at two different locations in Africa, at the Obafemi


Awolowo University (OAU) in Nigeria and at the Kokrobitey Institute (KI) in Ghana.
At OAU, it runs on their university server which has a 100 GB hard drive, 2 GB
memory and a dual core Xeon processor. For Internet connection, they pay $5,000 per
month for a 2 Mbps satellite link to an ISP in Europe and the link has a high variance
ICMP ping time from Princeton ranging 500 to 1200 ms. We installed HashCache-
Log on the machine but were asked to limit resource usage for HashCache to 50 GB
disk space and no more than 300 MB of physical memory. The server is running other
services such as a E-mail service and a firewall for the department and it is also used
for general computation for the students. Due to privacy issues we were not able to
analyze the logs from this deployment but the administrator has described the system
as useful and also noticed the significant memory and CPU usage reduction when
compared to Squid. At KI, HashCache runs on a wireless router for a small
department on a 2 Mbps LAN. The Internet connection is through a 256 Kbps sub-
marine link to Europe and the link has a ping latency ranging from 200 to 500 ms.
The router has a 30 GB disk and 128 MB of main memory and we were asked to use
20 GB of disk space and as little memory as possible. This prompted us to use the
HashCache-Set policy as there are only 25 to 40 peopleusing the router every day.
Logging is disabled on this machine as well since we were asked not to consume
network bandwidth on transferring the logs. In both these deployments we have used
HashCache policies to improve theWeb performance while consuming minimum
amount of resource. Other solutions like Squid would not have been able to meet
these resource constraints while providing any reasonable service. People at both
places told us that the idea of a faster Internet to popularWeb sites seemed like a
distant dream until we discussed the complete capabilities of HashCache. We are
currently working with OLPC to deploy HashCache at more locations with the OLPC
XS servers.

DEPT OF IT 16 KMCTCE
SEMINAR REPORT 2010 HASH CACHE

6. Conclusion and Future Work

HashCache, is a highperformance configurable cache storage for the developing


regions. HashCache provides a range of configurations that scale from using no
memory for indexing to ones that require only one-tenth as much as current high-
performance approaches. It provides this flexibility without sacrificing performance –
its lowest-resource configuration has performance comparable to free software
systems, while its high-end performance is comparable to the best commercial
systems. These configurations allow memory consumption and performance to be
tailored to application needs, and break the link between storage size and in-memory
index size that has been commonly used in caching systems for the past decade. The
benefits of HashCache’s low resource consumption allow it to share hardware with
other applications, share the filesystem, and to scale to storage sizes well beyond what
present approaches provide. On top of the HashCache storage layer, we have built a
Web caching proxy, the HashCache Proxy, which can run using any of the
HashCache configurations. Using industry-standard benchmarks and a range of
hardware configurations, we have shown that HashCache performs competitively with
existing systems across a range of workloads. This approach provides an economy of
scale in HashCache deployments, allowing it to be powered fromlaptops, low
resource desktops, and even highresource servers. In all cases, HashCache either
performs competitively or outperforms other systems suited to that class of hardware.
With its operation flexibility and a range of available performance options,
HashCache is well suited to providing the infrastructure for caching applications in
developing regions. Not only does it provide competitive performance with the
stringent resource constraint , but also enables new opportunities that were not
possible with existing approaches. We believe that HashCache can become the basis
for a number of network caching services, and are actively working toward this goal.

DEPT OF IT 17 KMCTCE
SEMINAR REPORT 2010 HASH CACHE

References

[1] BLOOM, B. H. Space/time trade-offs in hash coding with allowable errors.


Communications of the ACM 13 (1970), 422–426.
[2] BREWER, E., GAUTHIER, P., AND MCEVOY, D. Long-term viability of large-
scale caches. In Proceedings of the 3rd International WWW Caching Workshop
(1998).
[3] CHANKHUNTHOD, A., DANZIG, P. B., NEERDAELS, C., SCHWARTZ, M.
F., AND WORRELL, K. J. A hierarchical internet object cache. In Proceedings of the
USENIX Annual Technical Conference (1996).
[4] COX, A. L., HU, Y. C., PAI, V. S., PAI, V. S., AND ZWAENEPOEL,W. Storage
and retrieval system for WEB cache. U.S. Patent 7231494, 2000.
[5] DU, B., DEMMER, M., AND BREWER, E. Analysis of WWW traffic in
Cambodia and Ghana. In Proceedings of the 15th International conference on World
Wide Web (WWW) (2006).
[6] FELDMANN, A., CACERES, R., DOUGLIS, F., GLASS, G., AND
RABINOVICH, M. Performance of web proxy caching in heterogeneous bandwidth
environments. In Proceedings of the 18th IEEE INFOCOM (1999).

DEPT OF IT 18 KMCTCE

You might also like