CRAWLING HIDDEN OBJECTS WITH KNN QUERIES
ABSTRACT
Many websites offering Location Based Services (LBS) provide a k NN search interface
that returns the top-k nearest-neighbor objects (e.g., nearest restaurants) for a given query
location. This paper addresses the problem of crawling all objects efficiently from an
LBS website, through the public k NN web search interface it provides. Specifically, we
develop crawling algorithm for 2D and higher-dimensional spaces, respectively, and
demonstrate through theoretical analysis that the overhead of our algorithms can be
bounded by a function of the number of dimensions and the number of crawled objects,
regardless of the underlying distributions of the objects. We also extend the algorithms to
leverage scenarios where certain auxiliary information about the underlying data
distribution, e.g., the population density of an area which is often positively correlated
with the density of LBS objects, is available. Extensive experiments on real-world
datasets demonstrate the superiority of our algorithms over the state-of-the-art
competitors in the literature.
CHAPTER 1
INTRODUCTION
Database Systems and Knowledgebase Systems share many common
principles. Data & Knowledge Engineering (DKE) stimulates the exchange of ideas and
interaction between these two related fields of interest. DKE reaches a world-wide
audience of researchers, designers, managers and users. The major aim of the journal is to
identify, investigate and analyze the underlying principles in the design and effective use
of these systems.DKE achieves this aim by publishing original research results, technical
advances and news items concerning data engineering, knowledge engineering, and the
interface of these two fields.
Data & Knowledge Engineering (DKE) is a journal in database systems and
knowledgebase systems. It is published by Elsevier. It was founded in 1985, and is held
in over 250 academic libraries. The editor-in-chief is P.P. Chen (Dept. of Computer
Science, Louisiana State University, USA) This particular journal publishes 12 issues a
year. All articles from the Data & Knowledge Engineering journal can be viewed on
indexing services like Scopus and Science Citation Index.
The DKE delivers in-depth knowledge and competences on Data & Knowledge
Engineering, one of the most promising career areas for ambitious computer scientists. Its
subject area is "Engineering" for Data and for Knowledge, aiming to turn passive data
into exploitable knowledge:
It focusses on the representation, management and understanding of data and
knowledge assets. It encompasses technologies for the design and development of
advanced databases, knowledge bases and expert systems, methods for the extraction of
models and patterns from conventional data, texts and multimedia, modelling instruments
for the representation and updating of extracted knowledge. The Master DKE can be
studied on German or English and is thus open to students mastering either of the two
languages.
DKE COVERS THE FOLLOWING TOPICS:
1. Representation and Manipulation of Data & Knowledge: Conceptual data
models. Knowledge representation techniques. Data/knowledge manipulation languages
and techniques.
2. Architectures of database, expert, or knowledge-based systems: New
architectures for database / knowledge base / expert systems, design and implementation
techniques, languages and user interfaces, distributed architectures.
3. Construction of data/knowledge bases: Data / knowledge base design
methodologies and tools, data/knowledge acquisition methods,
integrity/security/maintenance issues.
4. Applications, case studies, and management issues: Data administration issues,
knowledge engineering practice, office and engineering applications.
5. Tools for specifying and developing Data and Knowledge Bases using tools
based on Linguistics or Human Machine Interface principles.
6. Communication aspects involved in implementing, designing and using KBSs
in Cyberspace.
AIMS & SCOPE
The scope includes the knowledge and data engineering aspects of computer
science, artificial intelligence, electrical engineering, computer engineering, and other
appropriate fields. This Transactions provides an international and interdisciplinary
forum to communicate results of new developments in knowledge and data engineering
and the feasibility studies of these ideas in hardware and software.
Specific areas to be covered are as follows: Fields and Areas of Knowledge and
Data Engineering: (a) Knowledge and data engineering aspects of knowledge based and
expert systems, (b) Artificial Intelligence techniques relating to knowledge and data
management, (c) Knowledge and data engineering tools and techniques, (d) Distributed
knowledge base and database processing, (e) Real-time knowledge bases and databases,
(f) Architectures for knowledge and data based systems, (g) Data management
methodologies, (h) Database design and modeling, (i) Query, design, and implementation
languages, (j) Integrity, security, and fault tolerance, (k) Distributed database control, (l)
Statistical databases, (m) System integration and modeling of these systems, (n)
Algorithms for these systems, (o) Performance evaluation of these algorithms, (p) Data
communications aspects of these systems, (q) Applications of these systems.
To reflect the current trends in knowledge and data engineering research and
development practice, TKDE gives priorities to submissions on the emerging topics,
including but not limited to big data and applications, new frontiers of knowledge and
data engineering, such as social networks, social media, and crowd sourcing.
Submissions purely focusing on the topics centered in some other sister IEEE
Transactions, such as core machine learning theory, pattern recognition, image
processing, computer vision, neural networks, and fuzzy systems, will not be considered.
CHAPTER 3
SYSTEM ANALYSIS
In this phase a detailed appraisal of the existing system is explained. This
appraisalincludes how the system works and what it does. It also includes findingout in
more detail- what are the problems with the system and what userrequires from the new
system or any new change in system.The output of this phase results in the detail model
of the system.The model describes the system functions and data and system
informationflow. The phase also contains the detail set of user requirements and
theserequirements are used to set objectives for the new system.
3.1 CURRENT SYSTEM:
It is important to note that the key technical challenge for crawling through a kNN
interface is to minimize the number of queries issued to the LBS service. The
requirement is caused by limitations imposed by most LBS services on the number
of queries allowed from an IP address or a user account (in case of an API service such as
Google Maps) for a given time period (e.g., one day). For example, Twitter limits the
search rate at 180 queries per 15 minute. Of course, no algorithm can possibly
accomplish the task without issuing at least n/k queries, where n is output size (i.e., the
number of crawled objects), because each query returns at most k of the n objects. As
such, we are bound to have an output-sensitive algorithm, which nevertheless should
have a query cost as close to n/k as possible.
3.2 SHORTCOMINGS OF THE CURRENT SYSTEM:
we study the problem of crawling the LBS through the restricted kNN search
interface
The crawling of an LBS database by issuing a small number of queries.
crawling through a kNN interface is to minimize the number of queries issued to
the LBS service.
3.3 PROPOSED SYSTEM:
We start with addressing the kNN crawling problem in 1-D spaces, and propose a 1-D
crawling algorithm with upper bound of the query cost being O(n/k), where n is the
number of output objects, and k is the top-k restriction. We then use the 1D algorithm as a
building block for kNN crawling over 2-D spaces, and present theoretical analysis which
shows that the query cost of the algorithm depends only on the number of output objects
n but not the data distribution in the spatial space. We extend the kNN crawling problem
to the general case of m-D spaces - which is the first such solution in the literature. Our
contributions also include a comprehensive set of experiments on both synthetic and real-
world data sets. The results demonstrate the superiority of our algorithms over the
existing solutions.
3.4 ADVANTAGE OF PROPOSED SYSTEM:
For 2-D space, we take external knowledge into consideration to improve the
crawling performance.
The experimental results show the effectiveness of our proposed algorithms.