Web Based Information Retrieval

Final year project that evaluates methods for retrieving information from internet content and describes the software development cycle and methodologies. It goes through Google's algorithms and techniques, and finally demonstrates a set of tools (created in Java as part of the final year project) for retrieving and ranking information using neural networks.

Uploaded by

tpitikaris
Copyright
© Attribution Non-Commercial (BY-NC)

WEB BASED INFORMATION RETRIEVAL
by
Theodoros Pitikaris
A thesis submitted in partial fulfillment of the requirements for the
degree of:
BSc in Computing and Information Technology
Department of Computing
University of Surrey
UNIVERSITY OF SURREY
ABSTRACT
WEB BASED INFORMATION RETRIEVAL
by
Theodoros Pitikaris
Supervisory Committee: Dr. Bogdan Vrusias
Department of Computing
Dr. Nick Antonopoulos
Department of Computing
The World Wide Web contains large sets of information. This
characteristic, however, can become a real burden for users who seek
sources that are both of high quality and relevant to their
information needs. In this final year project we examine some methods
for retrieving information stored on the web. The main focus is on
whether and how software agents could enhance the information
retrieval process.
Another topic examined in this final year project is the requirements,
phases and evaluation process that are necessary in the software
design and production process.
Table of Contents
Introduction
Final year project objectives
Final year Project Structure
Chapter 1. ANALYSIS I - LITERATURE REVIEW
GOOGLE search engine
Text Retrieval Methods
Natural Language processing
Neural Network as infrastructure in retrieval
Latent Semantic Indexing
Latent Semantic Algorithm
Advantages of Neural Network Models over Traditional IR Models
Special issues on web Information Retrieval
The Agent's Technology
Introduction
Categories of agents in more details
Chapter 2. SYSTEM Development Process
Definition of software development process
System Development Life Cycle (SDLC)
Agile Software Development in details
General Characteristics of SDLC
Requirement Gathering and Prioritization
Software requirements analysis
Requirements Gathering
Problems & Difficulties
Main techniques of Information Gathering
Chapter 3. Software Requirements Specification
Introduction
Identification
System overview
Definitions, Acronyms, and Abbreviations
Reference
General Description
User Personas and Characteristics
Product Perspective
Overview of Functional Requirements
Overview of Data Requirements
General Constraints, Assumptions, Dependencies, Guidelines
External Interface Requirements
Detailed Description of Functional Requirements
Performance Requirements
Quality Attributes
Other Requirements
Chapter 4. System Design
Methodology Chosen
System Overview
System Core and front-ends
Project development process
Chapter 5. Software Development PHASES in Details
Design Overview
Facilities
The core system
Software development platform
Integrated Development Environment Development
System Design
Unit Testing
Integration Testing
Chapter 6. DISCUSSION
Interesting parts during development process
Prototype evaluation
Comments on the evaluation results and related work
Overall project Evaluation
Chapter 7. Conclusions
Future work
INDEX
LIST OF TABLES
Table 1 Agile vs Waterfall methodology (available from
http://en.wikipedia.org/wiki/Agile_software_development)
Table 2 Development Phases
Table 3 Sample of a Matrix candidate for SVD
List of figures
Figure 1 Google database development
Figure 2 The Waterfall Model
Figure 4 Waterfall vs. Agile
Figure 5 System Use Case
Figure 6 System State Diagram
Figure 7 Users' opinion about the system
Acknowledgments
The author wishes to express sincere appreciation to Mr Staurakakis
Emanuel and Mr Tsagatsakis John for their assistance in the preparation of
this Final year Project report.
INTRODUCTION
In 1996 the Bank of Sweden Prize in Economic Sciences in Memory
of Alfred Nobel was awarded to James Mirrlees and William Vickrey
for their fundamental contributions to the economic theory of
incentives under asymmetric information.
With their work
(http://www.nobel.se/economics/laureates/2001/ecoadv.pdf)
they validated not only the importance of information but also the
importance of access to that information.
Nowadays everyone in the West, especially since the development of
the internet, has access to large amounts of data, in electronic or
paper form. The main problem we usually face is that the volume of
this information is so large that we cannot easily handle it or,
worse, that it is of no use to us.
In order to take advantage of this information we need to categorize
it by theme and thus turn it into manageable data. A few decades ago
this was the librarians' job but, as already mentioned, the volume of
data has increased so dramatically (Society, 2004) that the
traditional methods of indexing are not in a position to face this
new challenge.
The problem gets bigger when we need to categorize new documents
based on their content. Many documents do have an abstract at the
top, but in fact only documents with a special purpose take this
form: an abstract is essential for a scientific paper, for example,
but not for a newspaper or a magazine.
Some people believe that retrieving data through the internet is very
easy, because there we have the assistance of search engines.
Internet search engines are among the largest and most commonly used
retrieval systems. Their huge databases of millions of Web pages
typically index every word on each page.
By using them, searchers expect to find every page that contains an
occurrence of their search term, while the public in general hopes to
find pages on the subject of the terms they enter.
Web search engines and their databases, however, can find some pages
that contain the search terms and an occasional page that is actually
about the concept represented by the search terms. The majority of
engines do not understand the content of a page; they only play with
statistics and probability.
To make things worse, many web designers, in order to attract more
visitors to their pages, use common words in meta keywords or in the
body of their web pages that have no relation to the content of the
page
[http://webreference.com/content/search/how.html].
In addition, knowledge is not always well structured, which creates
extra difficulties in our effort to use and exploit the appropriate
data. Furthermore, knowledge is a dynamic entity generated as a result
of social interaction between actors.
In order to overcome all the aforementioned deficiencies, a number of
information-agent approaches suggest the use of intelligent agents or
multi-agent environments. In that case autonomous units will guide, pro
Information Retrieval techniques. This chapter also gives a gentle
introduction to agent technology.
The next chapter introduces the system development process. It
presents the most common system development techniques together with
a brief comparison between them.
The fourth chapter takes the form of a formal System Requirements
Specification report.
The fifth chapter presents the methodology that was followed for the
development of the final year project, and discusses the system
overview and the project development.
The sixth chapter gives details about each of the development phases:
the objectives, the challenges and the difficulties that were met in
each phase.
The final chapter contains the conclusions and points for future work.
CHAPTER 1. ANALYSIS I - LITERATURE REVIEW
GOOGLE search engine
Google, owned by Google Inc., is a key player in the search engine
market.
The mission statement of the company is to "organize the world's
information and make it universally accessible and useful."
Among the largest search engines on the web, Google receives over
200 million queries each day through its various services
(Economist 2006).
In 2006, Google had indexed over 25 billion web pages, 1.3 billion
images, and over one billion Usenet messages - in total,
approximately 12 billion items. It also caches much of the content
that it indexes. Google operates other tools and services including
Google News, Google Suggest, Froogle, and Google Desktop Search.
By checking the web archive website we can see that the size of the
Google database is growing at a high rate.
[Chart: Google database development by month; vertical axis from 5,00E+09 to 2,50E+10]
Figure 1 Google database development
To perform the above task Google uses a special algorithm called
"PageRank". PageRank is a patented method (an algorithm) that assigns
a numerical weighting to each element of a hyperlinked set of
documents, such as the World Wide Web, with the purpose of
"measuring" its relative importance within the set. The algorithm may
be applied to any collection of entities with reciprocal quotations
and references. The numerical weight that it assigns to any given
element E is also called the PageRank of E and is denoted by PR(E)
(Ther, 2993).
"PageRank is a probability distribution used to represent the
likelihood that a person randomly clicking on links will arrive at any
particular page". PageRank can be calculated for a collection of
documents of any size.
It is assumed in several research papers that the distribution is
evenly divided between all documents in the collection at the
beginning of the computational process. The PageRank computation
requires several passes ("iterations") through the collection to
adjust the approximate PageRank values so that they more closely
reflect the theoretical true value.
The probability is expressed as a numeric value between 0 and 1 (0%
and 100%). A PageRank of 0.1 therefore means that a person clicking on
a random link has a 10% probability of being directed to the document.
Suppose there are four web pages A, B, C and D. The initial
approximation of PageRank is apportioned equally between these 4
documents (PageRank = 1/4 = 0.25).
If pages B, C and D each link only to A, they each add 0.25 PageRank
to A, and so the PageRank of A would be:
\[ PR(A) = PR(B) + PR(C) + PR(D) = 0.75 \]
Suppose instead that page B also links to page C, while page D links
to all three other pages. The value of the link votes is divided among
all the outbound links on each page. Thus page B gives a vote which
contributes 0.125 to page A and a vote which contributes 0.125 to page
C. D contributes to A's PageRank
\[ \frac{PR(D)}{L(D)} \]
where L(D) is the number of pages that D points to (here 3, so D
contributes about 0.083). With C still linking only to A, this gives
\[ PR(A) = \frac{PR(B)}{2} + \frac{PR(C)}{1} + \frac{PR(D)}{3} \approx 0.458 \]
One iteration of equation (1) is equivalent to computing
$x^{t+1} = Z x^{t}$, where $x_j^{t} = P(j)$ at iteration t. After
convergence we have $x^{T+1} = x^{T}$, or $x^{T} = Z x^{T}$, which
means that $x^{T}$ is an eigenvector of Z. Furthermore s[...]
eigenvalue of [...]
In other words, [...]
equal to the d[...]
[...]malized number [...]
The PageRan[...]
[...]ing on links [...]
between nodes [...]
The probability, a[...]
damping factor d ([...]
It also assumed [...]
The damping factor [...]
added to the product [...]
[...]coming PageRank scores [...]
So any page's PageRank is derived in large part from the PageRanks of
other pages. The damping factor adjusts the derived value downward.
Google recalculates PageRank values every time it fetches and parses
the Web and rebuilds its index. Every time Google increases the number
of documents in its collection, the initial approximation of PageRank
decreases for all documents. If a page contains no links to other
pages, it becomes a "sink" and the random surfing process terminates
there; in that case the random surfer picks another URL at random and
continues surfing.
When calculating PageRank, pages with no outbound links are treated as
if they link out to all other pages in the collection. By doing that,
their PageRank scores are evenly distributed among all other pages:
\[ PR(p_i) = \frac{1-d}{N} + d \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)} \]
where p_1, p_2, ..., p_N are the pages under consideration, M(p_i) is
the set of pages that link to page p_i, L(p_j) is the number of
outbound links on page p_j, and N is the total number of pages.
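The formula above can be evaluated iteratively. The following is a minimal sketch rather than Google's implementation: the function name `pagerank`, the damping value d = 0.85 and the fixed number of iterations are assumptions made for illustration, and the small graph reuses the four-page example (A has no outbound links, so it is treated as linking to all other pages).

```python
def pagerank(links, d=0.85, iterations=50):
    """Iteratively evaluate PR(p_i) = (1-d)/N + d * sum(PR(p_j)/L(p_j)).

    links maps every page to the list of pages it links to.
    d = 0.85 is an assumed damping factor; the graph needs at least 2 pages.
    """
    pages = sorted(links)
    n = len(pages)
    # Start from the uniform distribution, as described in the text.
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - d) / n for page in pages}
        for page in pages:
            # A sink is treated as if it linked to all other pages.
            targets = links[page] or [p for p in pages if p != page]
            share = d * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: B -> A, C; C -> A; D -> A, B, C; A is a sink.
links = {"A": [], "B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}
ranks = pagerank(links)
for page in sorted(ranks):
    print(page, round(ranks[page], 3))
```

Because every page passes on all of its damped rank, the values keep summing to 1, which matches the reading of PageRank as a probability distribution.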
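The PageRank values are the entries of the dominant eigenvector of the
modified adjacency matrix. This makes the PageRank eigenvector:
\[ \mathbf{A} = \begin{bmatrix} PR(p_1) \\ PR(p_2) \\ \vdots \\ PR(p_N) \end{bmatrix} \]
where A is the solution of the equation
\[ \mathbf{A} = \begin{bmatrix} (1-d)/N \\ (1-d)/N \\ \vdots \\ (1-d)/N \end{bmatrix} + d \begin{bmatrix} \lambda(p_1,p_1) & \lambda(p_1,p_2) & \cdots & \lambda(p_1,p_N) \\ \lambda(p_2,p_1) & \ddots & & \vdots \\ \vdots & & \lambda(p_i,p_j) & \\ \lambda(p_N,p_1) & \cdots & & \lambda(p_N,p_N) \end{bmatrix} \mathbf{A} \]
where the adjacency function \lambda(p_i, p_j) is 0 if page p_j does
not link to p_i, and is normalized such that, for each j,
\[ \sum_{i=1}^{N} \lambda(p_i, p_j) = 1 \]
The above gives a variant of the eigenvector centrality measure. An
eigenvector gives a direction that is left unchanged when a vector is
multiplied by the matrix, while the corresponding eigenvalue gives the
factor by which the vector is scaled along that direction.
The values of the PageRank eigenvector are fast to approximate and the
computation is quite efficient.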
In general, the probability that the random surfer is on page j is
\[ P(j) = \frac{1-\beta}{N} + \beta \sum_{i \in B_j} \frac{P(i)}{|F_i|} \qquad (1) \]
where \beta is the probability that the random surfer follows a link
on the current page (so 1 - \beta is the probability of jumping to a
random page), N = |W| where W is the set of all nodes, F_i is the set
of pages that page i links to, and B_i is the set of pages which link
to page i.
Then the PageRank for page j is defined as this probability:
PR(j) = P(j). Because (1) is recursive, it must be evaluated
iteratively until P(j) converges. Typically, the initial distribution
for P(j) is uniform. PageRank is equivalent to the primary eigenvector
of the transition matrix Z:
\[ Z = (1-\beta)\left[\frac{1}{N}\right]_{N \times N} + \beta M, \quad \text{with } M_{ji} = \begin{cases} \frac{1}{|F_i|} & \text{if there is an edge from } i \text{ to } j \\ 0 & \text{otherwise} \end{cases} \]
(Richardson, 2002).
Text Retrieval Methods
In a conventional information retrieval system the stored text is
normally identified by sets of keywords known as index terms.
Requests for information are typically expressed by Boolean com-
binations of index terms, consisting of search terms and the Boo-
lean operators and, or, and not. The terms characterizing the
stored text may be assigned manually by trained personnel or al-
ternatively, automatic indexing methods can be used. In some
systems one can avoid the content analysis, or indexing opera-
tion, by using words contained in the text of the documents for
content identification. When all text words are used for document
identification (except for common words), we consider such a system
to be a full text retrieval system (Zaphiris and Zacharia, 2001).
All existing approaches to text retrieval are based on relevant terms
found in the text. A typical approach is to identify the individual
words occurring in the documents. A stop list of common function words
(an, of, the, and, but, etc.) is used to delete the high-frequency
words that are insufficiently specific to represent the document
content. A suffix-stripping routine is then applied to reduce the
remaining relevant words to word-stem form. At that point the vector
system assigns a weighting factor to each term in the document to
indicate term importance, and it represents each document by a set, or
vector, of weighted word stems.
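The pipeline just described (stop list, suffix stripping, term weighting) can be sketched in a few lines. This is only an illustration under stated assumptions: the stop list and suffix list are tiny hypothetical samples, the suffix stripper is far cruder than a real stemmer, and relative term frequency stands in for whatever weighting scheme a given system uses.

```python
from collections import Counter

# Illustrative stop list and suffix list; both are assumptions, not a standard.
STOP_WORDS = {"a", "an", "of", "the", "and", "but", "is", "in", "to"}
SUFFIXES = ("ing", "ed", "s")


def strip_suffix(word):
    """Remove the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def document_vector(text):
    """Return a {stem: weight} vector using relative term frequency."""
    words = [w.strip(".,;:!?()\"'").lower() for w in text.split()]
    stems = [strip_suffix(w) for w in words if w and w not in STOP_WORDS]
    counts = Counter(stems)
    total = sum(counts.values())
    return {stem: count / total for stem, count in counts.items()}


vector = document_vector("The engines index the indexed pages and the page links")
print(vector)
```

In this sample, "indexed" and "index" collapse to the same stem, as do "pages" and "page", so each of those stems receives twice the weight of "engine" or "link".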
Some other systems, such as signature files, generate signatures and
store them in some sort of access structure such as a sequential file
or an S-tree.
The most straightforward way of locating the documents that contain a
certain search string (term) is to search all documents for the
specified string (substring test). A "string" here is a sequence of
characters without "don't care" characters (Mock, 1996). If the query
is a complicated Boolean expression that involves many search strings,
then we need an additional step, namely to determine whether the term
matches found by the substring tests satisfy the Boolean expression
(query resolution). If the query is a regular expression, a finite
automaton can be built for it; the search time for this automaton is
linear in the document size, but the number of states of the automaton
may be exponential in the size of the regular expression. The obvious
algorithm for the substring test is as follows:
Compare the characters of the search string against the corresponding
characters of the document.
If a mismatch occurs, shift the search string by one position to the
right and continue until either the string is found or the end of the
document is reached.
Although simple to implement, this algorithm is too slow. If m is the
length of the search string and n is the length of the document (in
characters), then it needs up to O(m * n) comparisons [2].
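The two steps of the obvious substring test can be written out directly. This is a minimal sketch; the function name `naive_find` is chosen here for illustration and does not come from the cited sources.

```python
def naive_find(document, pattern):
    """Return the first position of pattern in document, or -1 if absent.

    The inner loop runs up to m times for each of up to n shifts,
    giving the O(m * n) worst case mentioned in the text.
    """
    n, m = len(document), len(pattern)
    for shift in range(n - m + 1):
        matched = 0
        # Compare the pattern against the document at the current shift.
        while matched < m and document[shift + matched] == pattern[matched]:
            matched += 1
        if matched == m:
            return shift
        # On a mismatch, the loop moves the pattern one position to the right.
    return -1


print(naive_find("information retrieval", "retrieval"))
print(naive_find("information retrieval", "agent"))
```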
Another approach to text retrieval is "signature files". In this
method, each document yields a bit string ("signature"), using hashing
on its words and superimposed coding. The signatures are stored
sequentially in a separate file (the signature file), which is much
smaller than the original file and can be searched much faster.
The main advantage of signature files is their low storage over-
head. However, the signature file size is proportional to the text
database size and that becomes a problem for massive text data-
bases like digital libraries.
Signature files contain hashed relevant terms from documents. Such
hashed terms are called signatures, and the files containing them are
signature files. There are several ways to extract the signatures from
the documents, the four basic methods being WS (Word Signatures), SC
(Superimposed Coding), BC (Bit-Block Compression) and RL (Run-Length
Compression). For example, in Superimposed Coding (SC) the text
database is divided into a number of blocks. A block Bi is associated
with a signature Si, which is a fixed-length bit vector. Si is
obtained by hashing each nontrivial word in the text block into a word
signature and OR-ing the word signatures into the block signature. The
query is hashed, using the same signature extraction method used for
the documents, into the query signature Sq. The document search is
then done by searching the signature file and retrieving the set of
qualifying signatures [Si such that Si AND Sq = Sq]. There are several
designs for signature file storage structures or organizations, such
as sequential organization, transposed file or bit-slice organization,
single and multi-level organization (Goncalves et al.), and structures
like S-trees (Deerwester et al., 1990). The most recent signature file
organization is the partitioning approach, whereby the signatures are
divided into partitions and the search is then limited to the relevant
partitions. The motivation for the partitioning approach is a
reduction in the search space, as measured either by the signature
reduction ratio (the ratio of the number of signatures searched to the
maximum number of signatures) or by the partition reduction ratio (the
ratio of the number of partitions searched to the maximum number of
partitions). Two approaches to partitioned signature files have been
published (Lee et al., 1995). One uses linear hashing to hash a
sequential signature file into partitions, or data buckets, containing
similar signatures. The second approach uses the notion of a key,
which is a substring selected from the signature by specifying two
parameters: the key starting position and the key length. The
signature file is then partitioned so that the signatures containing
one key are in one partition. The published performance results on
partitioned signature files are based mostly on simulations of small
text databases and are not conclusive. There has not been any attempt
to address the scalability of partitioned signature files to massive
text databases. Partitioned signature files grow linearly with the
text database size, and thus they exhibit the same scalability problem
as other text access structures.
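The superimposed coding step described above (hash each word, OR the word signatures into a block signature, then test Si AND Sq = Sq) can be sketched as follows. The signature length, the number of bits set per word and the use of MD5 as the hash are all assumptions made for illustration; real systems tune these parameters to control the false-drop rate.

```python
import hashlib

SIGNATURE_BITS = 64  # assumed length of the fixed-length bit vector
BITS_PER_WORD = 3    # assumed number of bit positions set by each word


def word_signature(word):
    """Hash a word into a signature with up to BITS_PER_WORD bits set."""
    digest = hashlib.md5(word.lower().encode("utf-8")).digest()
    signature = 0
    for k in range(BITS_PER_WORD):
        signature |= 1 << (digest[k] % SIGNATURE_BITS)
    return signature


def block_signature(words):
    """OR the word signatures of a text block into one block signature."""
    signature = 0
    for word in words:
        signature |= word_signature(word)
    return signature


def may_contain(block_sig, query_sig):
    """Qualifying test from the text: Si AND Sq = Sq."""
    return (block_sig & query_sig) == query_sig


blocks = [["web", "search", "engine"], ["neural", "network", "model"]]
signatures = [block_signature(block) for block in blocks]
query = word_signature("neural")
candidates = [i for i, sig in enumerate(signatures) if may_contain(sig, query)]
print(candidates)
```

A block that really contains the query word always qualifies, but a block that does not contain it may qualify by coincidence (a false drop), so the candidate blocks still have to be checked against the actual text.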
On the other hand, some scientists believe that the inversion method
can be used to achieve very good retrieval results. Each document can
be represented by a list of (key)words which describe the contents of
the document for retrieval purposes. Fast retrieval can be achieved if
we invert on those keywords. The keywords are stored, e.g.
alphabetically, in the "index file"; for each keyword we maintain a
list of pointers to the qualifying documents in the "postings file".
This method is followed by almost all the commercial systems
(Faloutsos and Oard, 1995).
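A small in-memory version of the inversion method is sketched below. It is an illustration only: the names `build_inverted_index` and `search_all` and the sample documents are hypothetical, and a single dictionary stands in for the separate index file and postings file.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each keyword (sorted alphabetically) to a sorted list of document ids."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            postings[word].add(doc_id)
    return {word: sorted(postings[word]) for word in sorted(postings)}


def search_all(index, terms):
    """Return the ids of documents that contain every term (Boolean AND)."""
    result = None
    for term in terms:
        docs = set(index.get(term.lower(), []))
        result = docs if result is None else result & docs
    return sorted(result or [])


documents = {
    1: "web search engine",
    2: "neural network search",
    3: "web agent",
}
index = build_inverted_index(documents)
print(index["search"])
print(search_all(index, ["web", "search"]))
```

Answering a query only touches the postings lists of the query terms, which is why the method is fast compared with scanning every document.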
Starting from the internet, a method called metadata-based indexing
has gained an important position in library science. Metadata is not
fully data, but it is a kind of fellow traveler with data, supporting
it from the sidelines. One definition is that "an element of metadata
describes an information resource or helps provide access to an
information resource" (Faloutsos and Oard, 1995).
In the context of Web pages on the Internet, the term "metadata"
usually refers to an invisible tag attached to a Web page which
facilitates the collection of information by automatic indexers; the
tag is invisible in the sense that it has no effect on the visual
appearance of the page when viewed using a standard Web browser such
as Netscape TM or Microsoft's Internet Explorer TM.
Natural Language processing
Natural language processing techniques seek to enhance performance by
matching the semantic content of queries with the semantic content of
documents [33, 49, 76]. Although it has often been claimed that deeper
semantic interpretation of texts and/or queries will be required
before information retrieval can reach its full potential, a
significant performance improvement from automated semantic analysis
techniques has yet to be demonstrated.
The boundary between natural language processing and shallower
information retrieval techniques is not as sharp as it might first
appear, however. The commonly used stoplists, for example, are
intended to remove words with low semantic content. The use of phrases
as indexing terms is another example of the integration of a simple
natural language processing technique with more traditional
information retrieval methods.
Neural Network as infrastructure in retrieval
The main idea in this class of methods is to use spreading activation.
The usual technique is to construct a thesaurus, either manually or
automatically, and then create one node in a hidden layer to
correspond to each concept in the thesaurus. Jennings and Higuchi have
reported results for a system designed to filter USENET news articles.
Their implementation achieves reasonable performance in a large-scale
information filtering task (Badal and Davies).
Latent Semantic Indexing
Latent Semantic Indexing (LSI) is a vector space information retrieval
method which has demonstrated improved performance over the
traditional vector space model. We begin with a basic implementation
which captures the essence of the technique. From the complete
collection of documents a term-document matrix is formed, in which
each entry is an integer representing the number of occurrences of a
specific term in a specific document. The Singular Value Decomposition
(SVD) of this matrix is then computed and small singular values are
eliminated. The effectiveness of LSI depends on the ability of the SVD
to extract key features from the term frequencies across a set of
documents. In order to understand this behaviour it is first necessary
to develop an operational interpretation of the three matrices which
make up the SVD (Deerwester et al., 1990).
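The first step of LSI, forming the term-document matrix of occurrence counts, can be sketched as below. The function name `term_document_matrix` and the sample documents are hypothetical. The sketch deliberately stops before the decomposition: in practice the SVD would be computed by a numerical library routine (for example `numpy.linalg.svd`), and only the largest singular values would be kept.

```python
def term_document_matrix(documents):
    """Return (terms, matrix) where matrix[i][j] counts term i in document j."""
    tokenized = [doc.lower().split() for doc in documents]
    terms = sorted({word for words in tokenized for word in words})
    matrix = [[words.count(term) for words in tokenized] for term in terms]
    return terms, matrix


documents = [
    "web search engine",
    "search engine ranking",
    "neural network ranking network",
]
terms, matrix = term_document_matrix(documents)
for term, row in zip(terms, matrix):
    print(f"{term:10s} {row}")
```

Each row is a term and each column a document, so terms that tend to occur in the same documents (here "search" and "engine") have identical or similar rows, which is the regularity the SVD exploits.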
Latent Semantic Algorithm
Latent semantic analysis (LSA) is used to define the theme of a text
and to generate summaries automatically. The theme information (the
already known information) in a text can be represented as a vector in
semantic space; the text provides new information about this theme,
potentially modifying and expanding the semantic space itself. Vectors
can similarly represent subsections of a text. LSA can be used to
select from each subsection the most typical and most important
sentence, thus generating a kind of summary automatically
(Turney, 2005).
Advantages of Neural Network Models o v er
Traditional IR Models
In neural network models, information is represented as a network
of weighted, interconnected nodes. In contrast to traditional infor-
mation processing methods, neural network models are "self-
processing" in that no external program operates on the network :
the network literally processes itself, with "intelligent behavior"
emerging from the local interactions that occur concurrently be-
tween the numerous network components (Reggia & Sutton,
1988). Neural network models in general are fundamentally differ-
ent from traditional information processing models in at least t wo
ways (Doszkocs, 1990).
First, they are self-processing. Traditional information processing
models typically make use of a passive data structure, which is
always manipulated by an active external process/procedure. In
contrast, the nodes and links in a neural network are active processing
agents. There is typically no external active agent that operates on
them. "Intelligent behavior" is a global property of neural network
models.
Second, neural network models exhibit global system behaviors
derived from concurrent local interactions among their numerous
components. The external process that manipulates the underlying
data structures in traditional IR models typically has global access
to the entire network/rule set, and processing is strongly and
explicitly sequentialized. Pandya and Macy (1996) have summarized
that neural networks are natural classifiers with significant and
desirable characteristics, which include, but are not limited to, the
following:
o Resistance to noise
o Tolerance to distorted images/patterns (ability to generalize)
o Superior ability to recognize partially occluded or degraded images
o Potential for parallel processing
Furthermore, the LSA algorithm has the distinctive ability to work
in a cross-language environment with fully automatic corpus
analysis (Dumais).
Special issues on web Information Retrieval
The techniques that should be followed when users are looking for
specific information on the internet differ considerably from
those used in conventional IR systems, because there are special
issues relating to user behaviour and to the nature of the data
that is stored on the WWW.
At a glance we can identify some key reasons that justify the
special nature of web information retrieval.
o Internet users often provide very short queries, and they seem
unwilling to provide more input. In addition, they do not pay
appropriate attention to the way they express their questions,
which sometimes makes a question vague; thus the results returned
from the search do not match the user's real information need.
o The page collection changes constantly: thousands of new pages
are generated daily on the World Wide Web, others are presented
in a different way, and some are removed.
o The usefulness of the information on every page varies. Certain
pages focus on one subject in particular, while others provide
information about a set of topics without any connection between
them; some pages work as directory services for other pages, and
sometimes pages are totally irrelevant to the search topic.
o The quality of the information carried by each page cannot be
verified in advance. Even worse, some authors provide inaccurate
data, or the page is even structured in such a way as to manipulate
user expectations (spam).
o The pre-processing of all pages living on the WWW demands a
high cost in terms of time and space, and it must be a continuous
process since the WWW is a dynamic, living entity.
The Agent's Technology
Introduction
In computer science, a software agent is an abstraction that describes
computer programs that can assist the user with computer
applications (Mosoud, 2004).
The term has recently been extended with adjectives such as
intelligent agents (agents that employ AI techniques), autonomous
agents (capable of modifying the way in which they achieve their
objectives), distributed agents (executed on different machines),
multi-agent systems (distributed agents that must communicate with
each other in order to accomplish their task), mobile agents
(agents that can relocate their execution to different processors)
and more.
Generally speaking, the agent concept includes properties that make
agents special in the Computer Science field. Among others we can
refer to the following:
o Autonomy: agents operate without the direct intervention
of humans or others, and have some kind of control over
their actions and internal state (Castelfranchi, 1995);
o Social ability: agents interact with other agents (and
sometimes with humans) via some agent-communication
language (Genesereth and Ketchpel, 1994);
o Reactivity: agents perceive their environment (which
may be the physical world, a user via a graphical user
interface, a collection of other agents, the Internet, or
perhaps all of these combined), and respond in a timely
fashion to changes that occur in it (Wooldridge, 1995).
Categories of agents in more detail
Intelligent agents:
The development of intelligent agents is a branch of AI research.
These types of agents are called intelligent because they have
special capabilities:
o Ability to learn
o Ability to adapt
There are several ways that agents can be trained to better
understand user preferences by using computational intelligence
techniques: neural networks, adaptive fuzzy logic, etc.
Mobile agents:
This category of agent has the property that it can spawn to another
processing unit, live there, perform some operations, or die.
Distributed agents:
Since agent design permits the required resources to be included in
their description, it is relatively easy to design software agents
that can execute threads on different systems and thus become
distributed agents.
Multi-agent systems:
Environments that include several agents performing tasks are
called multi-agent environments. It is very possible that in such
multi-agent systems an agent will not have all the data or all the
methods available to achieve an objective, and thus the agents will
have to exchange resources with other agents.
CHAPTER 2. SYSTEM DEVELOPMENT PROCESS
Definition of software development process
According to Wikipedia (Wikipedia, 2005) a software development
process "is a structure imposed on the development of a software
product. Synonyms include software life cycle and software process.
There are several models for such processes, each describing
approaches to a variety of tasks or activities that take place during
the process."
In other words, a software development process is a set of methods
that intend to provide guidelines for selecting, implementing and
monitoring a lifecycle for a software project.
Some of the most well-known models are (Wikipedia, 2005):
o Capability Maturity Model
o ISO 15504 (Software Process Improvement Capability
Determination (SPICE))
o Six Sigma
The aforementioned models are (with the exception of ISO 15504)
general project management models that can be applied in the
software industry in order to control and guide the software
production process.
System Development Life Cycle (SDLC)
The systems development life cycle (SDLC) is defined by the United
States Department of Justice (Justice, 2003) as "a software
development process, although it is also a distinct process
independent of software or other Information Technology
considerations. It is used by a systems analyst to develop an
information system, including requirements, validation, training,
and user ownership through investigation, analysis, design,
implementation and maintenance. SDLC is also known as information
systems development or application development."
Figure 1 SDLC Phases, available from:
http://www.usdoj.gov/jmd/irm/lifecycle/images/ch1.gif
Various SDLC methodologies have been developed to guide the
processes involved. The most common are:
o The waterfall model (Lowe, 1999), in which development
passes through the phases of:
1. Requirements analysis (system services, constraints, and
goals are established; definitions are understandable by
both developers and customers).
2. Design (system and software design).
3. Implementation (program units are produced).
4. Testing & debugging: finding bugs and defects and reducing
or eliminating them in order to meet the specification.
5. Integration: program units are integrated into the system.
The system as a whole is tested to verify that it meets the
specifications.
6. Maintenance: enhancing and optimizing deployed software,
integrating new needs and correcting defects.
The model was first described in 1970 by W. W. Royce.
o Rapid application development (RAD) suggests that products can
be developed faster and with higher quality by (Inc., 2000):
1. Using workshops or focus groups to gather requirements.
2. Prototyping and user testing of designs.
3. Re-using software components.
4. Following a schedule that defers design improvements to
the next product version.
5. Keeping review meetings and other team communication
informal.
o Joint application development (JAD): the JAD methodology aims
to involve the client in the design and development of an
application. This is accomplished through a series of
collaborative workshops called JAD sessions.
o The fountain model
o The spiral model (Wikipedia, 2006c): the spiral methodology
extends the waterfall model by introducing prototyping. It is
generally chosen over the waterfall approach for large, expensive,
and complicated projects.
o Agile software development: agile software development is a
conceptual framework for software engineering projects. There are
a number of agile software development methods, such as those
used by the Agile Alliance.
o XP (Extreme Programming)
o The Prototyping Methodology
Figure 2 The Waterfall Model
Figure 3 The RAD process flow
Agile Software Development in detail
While traditional development methodologies emphasise the
documentation process, agile methodologies define teamwork
and communication as key factors in a successful system design and
implementation procedure.
According to the Agile Manifesto (Fowler, 2002), the agile
methodology declares the following to be important aspects:
o Individuals and interactions over processes and tools.
o Working software over comprehensive documentation.
o Customer collaboration over contract negotiation.
o Responding to change over following a plan.
Most agile methods attempt to minimize risk by developing software
in short timeboxes, called iterations, with a typical length
between one and four weeks.
Every iteration is like a software project of its own, and includes
all of the tasks necessary to release the mini-increment of new
functionality: planning, requirements analysis, design, coding,
testing, and documentation.
While an iteration may not add enough functionality to release the
product, an agile software methodology aims to be capable of
releasing new software at the end of every iteration.
At the end of each iteration, the team re-evaluates project
priorities.
Adaptive methods:
Focus on adapting quickly to changing realities. When the needs of
a project change, an adaptive team changes as well. An adaptive
team will have difficulty describing exactly what will happen in
the future. The further away a date is, the more vague an adaptive
method will be about what will happen on that date. An adaptive
team can report exactly what tasks are being done next week, but
only which features are planned for next month. When asked about a
release six months from now, an adaptive team may only be able to
report the mission statement for the release, or a statement of
expected value vs. cost.
Predictive methods:
In contrast, focus on planning the future in detail. A predictive
team can report exactly what features and tasks are planned for the
entire length of the development process. Predictive teams have
difficulty changing direction. The plan is typically optimized for
the original destination, and changing direction can cause completed
work to be thrown away and done over differently. Predictive teams
will often institute a change control board to ensure that only the
most valuable changes are considered.
Adaptive vs predictive methodology
(http://en.wikipedia.org/wiki/Agile_software_development)
XP - Extreme Programming
Extreme Programming (XP) is a targeted and well-structured approach
to software development. In the years since XP came to life, this
way of software development has already been proven at many
companies of all different sizes and industries worldwide.
XP's success is based on the fact that it stresses customer
satisfaction. The methodology is designed to deliver "the software
your customer needs when it is needed". XP provides developers with
the appropriate knowledge to respond assertively to changing
customer requirements and needs, even when they are raised in a
later phase of the Software Development Life Cycle.
This methodology also emphasizes teamwork. Managers, customers,
and developers are all part of a team with the ultimate goal of
delivering quality software. XP implements a simple but efficient
way to empower the groupware style of development.
The XP methodology refers to the following principles:
Feedback is most useful if it is done rapidly. In Extreme
Programming, contact with customers occurs very often, in small
iterations. The customer has clear insight into the system that is
being developed, so he can provide feedback and contribute to the
development as needed.
Unit tests also contribute to the rapid feedback principle. When
writing code, the unit test provides direct feedback as to how the
system reacts to the changes one has made.
Assuming simplicity is about treating every problem as if it can be
solved "extremely simply", while at the same time XP rejects the
idea of interfaces for "future extension" and code reusability, as
it prioritises simplicity as more important.
Extreme Programming suggests that performing large-scale
changes all at once carries a high possibility of failure. Instead,
Extreme Programming has introduced the idea of incremental
change, which consists of taking many little steps in the software
development procedure in order to help the customer achieve
more control over the development process and the system that is
being developed.
The principle of embracing change is not about working against
changes but embracing them, thereby helping the developers
prepare for the inclusion of new customer needs and demands
during the next iteration phase.
The Prototyping Methodology
The prototyping methodology suggests that "users can point to
features they don't like about an existing system (or indicate
when a feature is missing) more easily than they can describe
what they think they would like in an imaginary system"
(Jenkins, 1985).
Rather than forcing the user to try to understand, and sometimes
guess, the many major and minor details of an Information System
presented in the form of a specification document, the developer
presents the user with a series of rough approximations
(prototypes) of the candidate computer system.
The prototype is a working model of the system, often incomplete.
The IS developer initially meets with the user in order to
gather enough information to build a "rough" initial system
prototype, which he then presents to the user to examine and
interact with, in order to provide feedback comments.
From this tangible approximation of the system, the user has
improved chances both to clarify system requirements and to
express those requirements to the developer. The developer
then takes account of the newly expressed requirements and
produces a new prototype, which is again presented to the user
for comments.
This continuous, iterative process is repeated until there are no
new requests from the customer. We can say that the four major
phases of the System Life Cycle methodology--analysis, design,
program development, and implementation--are combined into
one phase, repeated in each iteration (Parker, 1983).
Software engineering is a complex process that incorporates a
number of activities; in some SDLC methodologies these have a
sequential order, in others they may not.
General Characteristics of SDLC
Regardless of which methodology a development team is going
to follow, there are general phases in the software development
life cycle (Alexandrou, 2006).
These general steps are:
Requirements Analysis
o If there is an existing system, its deficiencies are identified.
This is possible by interviewing users and holding discussions with
the application's support personnel.
o The system requirements are defined. The important point at
this stage is to take into account any deficiencies in the existing
system, if there is one, with specific proposals for improvement.
Specification
o Software is precisely described in a mathematically rigorous
way. Specifications are most important for external interfaces that
must remain stable.
Software architecture
o A candidate system is designed. Plans are created and include
the hardware, operating systems, programming, and security
issues.
Coding
o The new system is developed.
Testing
o Users of the system must be trained in its use. Performance
tests must be carried out. If necessary, new adjustments must
take place.
Documentation
o Documenting the internal design of software for the purpose of
future maintenance and enhancement. Documentation is most
important for external interfaces.
Maintenance
o The system becomes operational either by replacing the old
system at once or by gradually replacing the old system with the
new one.
o Once the new system has been up and running for a period of time,
it should be evaluated in detail. Maintenance must be kept up at all
times. The users should be kept up to date concerning the latest
modifications/changes and any new procedures that may be
introduced.
Requirement Gathering and Prioritization
Software requirements analysis
Software requirements analysis is the activity of extracting,
analyzing, and recording requirements for information systems. It
sometimes overlaps with general system requirements, but as a part
of the Software Development Life Cycle it has its own specific
characteristics (Barrett, 1997).
In a typical software development project there is a trained
software practitioner called the Requirements Analyst (RA), whose
main area of responsibility is to communicate with the user in
order to understand what the requirements are.
Most of the time clients have a general idea about what they want
the system to do, but it is the Requirements Analyst's job to
define in detail what the real customer needs are.
As the next task, after the client's idea about the system has been
determined in detail, the requirements analysis team has to
determine whether or not the candidate system is:
o Feasible
o Schedulable
o Affordable
o Legal
o Ethical
In the rush of enthusiasm associated with a new project there is
always a temptation to downplay the importance of requirements
analysis. However, studies of previous projects reveal that costs
and technical risks can be reduced through rigorous and thorough
up-front requirements engineering.
The Requirement Analysis phase is divided into the following
sub-phases (Barrett, 1997):
o Requirements gathering
o Requirements analysis
o Requirements specification
o Requirements verification
Requirement Gathering
Requirement gathering is an important sub-phase of Requirements
analysis. At this stage the development team must investigate
and define the client's needs. Once the client's requirements have
been identified, the system designers are then in a position to
design a solution (urqEptg19e9).
A formal Requirement Gathering process includes the following
steps (Table 2):
1  Current Defects Evaluation Plan
2  "Prior Release" Problem Review Plan
3  Review of Existing Product (Project) Maintenance Plans
4  Review Preliminary Software Architectural Overview
5  Preliminary Requirements Gathering Phase Exit Criteria
Table 1 Requirement Gathering Steps
Problems & Difficulties
During the Requirements Gathering phase a number of
stakeholder issues may arise. Some of them are results of the
client's organizational behaviour and others are grounded in
human nature (for instance, some people tend to be
over-optimistic) (Tsagatakis, 2005).
The aforementioned difficulties can be categorized as
stakeholder issues, engineering issues, and general issues.
In the category that refers to general problems we can
classify the following situations:
o The right people with adequate experience, technical
expertise, and language skills may not be available, either
because the organisation structure doesn't include such
personnel or because management and other factors prevent
them from communicating with the software development
team. In that case the requirement acquisition people
must "reinvent the wheel" in order to cover the gap, and
sometimes make assumptions (not always correct) about the
existing system and the needs of the new candidate system.
o The initial speculations about what the needs are most of the
time don't cover all the aspects satisfactorily; they may be
incomplete, or optimistic assumptions about the nature
(time, user acceptance, integration, etc.) of the project may
be taken into account.
o The need for well-trained requirements acquisition personnel
and knowledge engineers, in combination with the difficulty
of using the complex tools and diverse methods linked to the
requirements gathering process, may dishearten the hope
for the benefits of a complete and detailed approach.
In addition to the preceding situations (McConnell, 2004), we
have to take into account the ways that users can affect the
requirements gathering process:
o Some users may not be in a position to understand what
they really want.
o Some users may be unwilling to commit to a set of written
requirements (in order to feel safe in case of future
undesired situations).
o Some users may not be in a position to express their real
needs in a suitable and understandable way.
o Some users may introduce new requirements after the cost
and the time schedule have been finalized.
o Communication with users is slow.
o Users often do not participate in reviews, or they don't have
the appropriate background to do so.
o Users don't understand the development process. This
commonly leads to the situation where user requirements keep
changing even when system or product development has been
started.
But users are not the only ones responsible for project delays
and/or inadequate information systems; sometimes engineers and
developers may cause problematic situations during the
requirements analysis process (Wikipedia, 2006b) (Tsagatakis,
2005):
o Technical personnel and end users often have different
vocabularies and codes of communication. As a result, both may
believe they are in perfect understanding, but when the product
is finished and becomes tangible they discover that they did not
cover all the necessary aspects.
o In the business systems domain, the duty to bridge that gap is
often assigned to Business Analysts, whose role is to analyze and
document the business processes of the business units that will be
affected by the candidate information system. In parallel,
Business Systems Analysts analyze and document the proposed
business solution from a systems perspective. This parallel
situation sometimes causes confusion and incorrect assumptions.
o Engineers and developers often try to refine the requirements
in order to fit an existing system or model, rather than
pursuing a clearer solution such as developing a system specific
to the needs of the client.
o Analysis is often carried out by engineers or programmers,
rather than knowledge engineers who have the appropriate
communication skills and sufficient domain knowledge to
understand the client's needs properly.
Main techniques of Information Gathering
The introduction of a new Information System is very likely to
change the environment and the relationships between people,
thus it is important to identify all the stakeholders, take into
account all their needs and ensure they understand the implications
of the new systems.
For that to happen we need a structured procedure that will help to
keep the requirement discussions between the development team and
the client well organized and efficient.
Knowledge engineers and systems analysts can employ several
techniques to get the requirements from the customer (Dr Vrusias.
B, 2005); these include interviews, questionnaires, recording,
group workshops (known as requirements workshops) and wish
lists.
More modern techniques include prototyping and use cases.
Where necessary, the analyst will employ a combination of these
methods to establish the exact requirements of the client so that
a system that meets the business needs is produced.
CHAPTER 3. SOFTWARE REQUIREMENTS SPECIFICATION
Introduction
This part of the Final Year project aims to give a full picture of
the software and system requirements as they have been captured
by the system developer. The structure of the document and the
basic elements are based on IEEE 830-1998 (IEEE).
Identification
This SRS (Software Requirements Specification) refers to a
web information retrieval system. The current version of this
software is #1 (one).
The purpose of this chapter is to describe in detail the
operation of the Web Information Retrieval software project. In a
normal SRS paper the first section of the document should
provide a document overview, the appropriate definitions and
references for the rest.
But due to the nature of the project (a Final Year project) and
the need for definitions and bibliography in other parts of this
document, it was decided to skip definitions and references at this
stage in order to prevent redundant material from appearing.
So in the first part of this chapter we will provide a document
review for consistency with IEEE 830-1998 (IEEE).
In the second section we will give details about the major
objective of the software in question and a fictional account
of its use. It will also specify some constraints and data
requirements.
In the last sections we give a more detailed description of
the technical aspects of this project, such as user limitations
and the technical requirements to use the product.
System overview
The proposed system runs in parts. The first part, which
constitutes the main application, receives queries from users
either by command line or by web interface.
The query is passed to the Google search engine, from which the
system receives as a result a list of URLs (maximum 50) that,
according to Google, are correlated to the user's initial query.
The initial query is stored in a map.
Then system crawl each of this URLS and produce two Hash-
Map type object; one that contains the total of terms occurred
i n al l document that have been crawl ed and a second one
that contains current document term index and in what fre-
quency this term occurred in the text.
During the map creation phase closed class words (Van Petten
C, 1991) are removed whi l e the remai ni ng term pass through
a stemmer that i mpl ements the Porter Steamer al gori thm(C.J.
van Rijsbergen, 198dJ.
After we have finished with the crawling of all URLs we end up
with 50 Hash Map objects one for each document and one
large Hash Map with all the words that we have met during the
42
URLs crawl i ng. Usi ng LSA and Eucl i di an di stance we produce a
rel evance to ori gi nal query l i st.
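The map creation phase described above can be sketched as follows.
This is only an illustration under stated assumptions, not the code
of the project: the class name TermIndexSketch, the tiny closed
class word list and the stem() placeholder (which returns the term
unchanged instead of applying the Porter algorithm) are all
hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class TermIndexSketch {
    // Hypothetical, deliberately tiny list of closed class words.
    static final Set<String> CLOSED_CLASS = new HashSet<String>(
            Arrays.asList("a", "an", "and", "in", "of", "on", "or", "the", "to"));

    // Placeholder only: a real system would apply a Porter stemmer here.
    static String stem(String term) {
        return term;
    }

    // Counts the terms of one document and adds each count to the global map.
    static Map<String, Integer> indexDocument(String text, Map<String, Integer> allTerms) {
        Map<String, Integer> documentTerms = new HashMap<String, Integer>();
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (token.length() == 0 || CLOSED_CLASS.contains(token)) {
                continue; // skip empty tokens and closed class words
            }
            String term = stem(token);
            Integer seen = documentTerms.get(term);
            documentTerms.put(term, seen == null ? 1 : seen + 1);
            Integer total = allTerms.get(term);
            allTerms.put(term, total == null ? 1 : total + 1);
        }
        return documentTerms;
    }

    public static void main(String[] args) {
        Map<String, Integer> allTerms = new HashMap<String, Integer>();
        Map<String, Integer> doc =
                indexDocument("The crawler and the index of the Web", allTerms);
        System.out.println(doc);
    }
}
```

Calling indexDocument once per crawled page with the same allTerms
map yields one small map per document plus the shared large map.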
Definitions, Acronyms, and Abbreviations
Subchapter omitted in order to prevent redundancy with the
general Glossary section of the Final Year Project report.
This part can be found in the appendices.
Reference
Subchapter omitted in order to prevent redundancy with the
general Bibliography section of the Final Year Project report.
This part can be found in the appendices.
General Description
User Personas and Characteristics
All users that this system is targeted at are people with average
computer literacy who have previously used a search engine like
Google or Yahoo.
Product Perspective
This software requires a graphical web browser such as Internet
Explorer (TM) version 6 or Mozilla Firefox version 1.5.0.1. Also a
General Constraints, Assumptions, Dependencies, Guidelines
Our user runs this software on a computer that is connected to the
internet with a connection of at least 256 Kbits/sec downstream
capacity; the OS that hosts this software is one of Windows XP,
Linux Fedora Core 4, Suse 10.
Java 1.5 is installed, together with both mysql-connector-java
(http://www.mysql.com) version 3.1.12 and htmlunit version 1.8
(http://www.GargoyleSoftware.com/).
Also Apache Jakarta is running on this system and is listening
on TCP port 8080.
We assume that the user specifically demands a web-based
application and that his computer is equipped with software able to
interpret HTML.
User View of Product Use
The complete vision of the project is a search engine similar to
Google. The key factors are the relevance of the returned results
and resources in the form of the time needed in order to get an
answer. The goal is to provide the user with an initial adequate
answer to his/her query with a low waiting time, at least inside
users' tolerance limits (Bhatti).
The help menu will be accessible via a help icon that appears on
the first page of each query.
In the first screen, the user inserts the query he wants to make.
The more common the word, the more results will be returned and, as
a consequence, the longer the result fetching process will last.
The user can perform his/her query using the command line or the
web interface.
3.0 Specific Requirements
External Interface Requirements
The program requires a PC with at least a Pentium 4/Celeron
processor running at 2GHz or an Athlon/Athlon 64 processor rated at
2600+ PR. The operating system must be one of the following:
• Windows XP
• Windows 2003 Server
• Fedora Linux 4
• Suse Linux 10
• Solaris 9 or later
• MacOS 10.1 or later
All systems must be equipped with at least 512 MB of
DDMM/333MHZ and a monitor capable of 800x600 screen
resolution with a minimum of 16 million colors, and 100MB of hard
disk space.
The time for a result to return depends on the network connection
and on the system.
A network connection is necessary for this application to
function. Also the Google site must be up and running in order to
get the appropriate URLs.
The product requires a web browser compatible with HTML 4.
The base requirement for the browser would be Internet Explorer
6.0 and above or Mozilla Firefox 1.5.0 and above.
Detailed Description of Functional Requirements
Introduction Page
Purpose: The introduction page prompts the user to enter his query.
Inputs: Mouse and keyboard inputs.
Processing: Display instructions for the search engine and allow
the user to insert a query.
Outputs: Screen where the user selects the appropriate method of
search from a drop down menu, and a submit button.
Waiting Screen
Purpose: Prompt the user to wait until the results are retrieved.
Result Screen
Purpose: Present to the user the answers to his/her query.
Inputs: N/A
Processing: N/A
Outputs: The user's query results page.
Help Page
Purpose: From the introduction page, the user clicks the help
button to receive help about how to use the application.
Inputs: From the user through the browser; the user simply clicks
on the help icon.
Processing: The processing is done through point and click actions.
Outputs: The help page with help information.
After24Hours Page
Information Page
Purpose: Contact information.
Inputs: N/A
Processing: N/A
Outputs: Author & supervisory committee.
Performance Requirements
The application will be loaded locally and accessed via a web
browser. It operates in single-user mode only. The system is highly
dependent on the availability and response time of the Google web
site. As for the response time experienced by the user, it depends
on the speed of the network connection and the computer's CPU.
As for the number of files and file sizes, there will be two files
per URL (one content file, and one serialized HashMap). The
database is also run locally and thus SQL query performance
depends on the system memory that is available and on CPU speed.
Quality Attributes
The generated results should correspond to relevant documents.
Mean average precision must be relatively high.
Other Requirements
NONE
CHAPTER 4. SYSTEM DESIGN
Methodology Chosen
A final year project is often the first large scale project carried
out by a student at undergraduate level. The methodology followed
during the project development must therefore, on the one hand,
provide enough flexibility for the software developer to overcome
design mistakes and inefficient roadmaps that may be caused by the
developer's low level of experience and, on the other hand, provide
the developer with stable ground on which to continue with the rest
of the system development and documentation.
These requirements are easily covered if the methodology gives the
appropriate tools to permit the segmentation of the project under
development into small semi-autonomous segments and the step by
step crystallization of project aspects.
The above requirements seem to be covered by the agile development
methodology, but if we take into account the fact that the specific
project has something of the nature of a research project, it seems
safer to choose the prototyping methodology.
As has been discussed in the SDLC methodology chapter, the
prototype is not a paper specification of the system, but a working
model of the system, albeit often incomplete.
The first step was to meet the supervisors and collect enough
information to build a "rough" initial system prototype, which
normally should be presented to the supervisory committee for
comments.
These comments would be taken into account in the next prototype
version. This iterative process is repeated until no new
comments are expressed by the supervisory committee.
The final system evolves gradually through this process of trial
and error, as the system is refined by each iteration.
System Overview
System Core and front-ends
The proposed system runs in two parts. The first part, which
consists of the main application, receives queries from users
either by command line or web interface.
The query is passed to the Google search engine, from which the
system receives as a result a list of URLs (maximum #50) that
according to Google are correlated to the user's initial query.
The initial query is stored in a map and in a database for future
reference.
Then the system crawls each of these URLs and produces two HashMap
type objects: one that contains the total of the terms that
occurred in all the documents that have been crawled, and a second
one that contains the current document's term index and the
frequency with which each term occurred in the text.
During the map creation phase closed class words (Van Petten C,
1991) are removed, while the remaining terms pass through a stemmer
that implements the Porter Stemmer algorithm (C.J. van Rijsbergen,
1980).
After we have finished with the crawling of all URLs we end up
with 50 Hash Map objects, one for each document, and one large
Hash Map with all the words that we have met during the URL
crawling.
Each URL is represented by an n x 1 dimensional vector, where n is
the number of terms that live in each document.
At this stage we combine the large Hash Map and each URL's
individual Hash Map in order to produce one large 2D array with
the terms of the large Hash Map as rows and the visited URLs as
columns.
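A minimal sketch of this combination step is given below, assuming
the per-document maps hold term counts. The class and method names
are hypothetical, and sorting the vocabulary is an assumption made
here only to obtain a repeatable row order, since a HashMap does not
guarantee any ordering.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

class TermDocumentMatrixSketch {
    // Rows follow the sorted global vocabulary; columns follow the document order.
    static double[][] build(Map<String, Integer> allTerms,
                            List<Map<String, Integer>> documents) {
        List<String> vocabulary = new ArrayList<String>(allTerms.keySet());
        Collections.sort(vocabulary); // fixed row order, HashMap order is unspecified
        double[][] matrix = new double[vocabulary.size()][documents.size()];
        for (int row = 0; row < vocabulary.size(); row++) {
            for (int col = 0; col < documents.size(); col++) {
                Integer count = documents.get(col).get(vocabulary.get(row));
                // A term that is absent from a document contributes zero.
                matrix[row][col] = (count == null) ? 0.0 : count;
            }
        }
        return matrix;
    }
}
```

Each column of the returned array is the term vector of one visited
URL, which is the form expected by the decomposition step.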
Then we decompose this large 2D array using Singular Value
Decomposition. The next step is to use the Latent Semantic
Analysis technique and Euclidean distance to classify the
relevance of each document to the original user query.
The Euclidean distance of two vectors $P = (p_x, p_y, p_z)$ and
$Q = (q_x, q_y, q_z)$ is defined by the formula
$$\mathrm{Edistance}(P,Q) = \sqrt{(p_x - q_x)^2 + (p_y - q_y)^2 + (p_z - q_z)^2}$$
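The distance formula translates directly into code. The sketch below
is illustrative only: the class name is hypothetical, and it
generalises the three-component formula to vectors of any equal
length, which is an assumption rather than something stated in the
text.

```java
class EuclideanDistanceSketch {
    // Square root of the sum of squared component differences.
    static double distance(double[] p, double[] q) {
        if (p.length != q.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double sum = 0.0;
        for (int i = 0; i < p.length; i++) {
            double difference = p[i] - q[i];
            sum += difference * difference;
        }
        return Math.sqrt(sum);
    }
}
```

A smaller distance between a document vector and the query vector is
read as higher relevance, so sorting documents by ascending distance
gives the ranked list.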
The user can access the relevance list through a web interface or
via the standard output.
The other part of this application implements some characteristics
of an agent; this agent-like part is initiated via a time scheduler
and its purpose is to rework the user queries of the previous 24
hours, but now by getting extra results from yahoo.com.
This part is launched daily at 5 GMT since, after several tests
were run, this was found to be the best time in terms of lower
network congestion both in Europe and the majority of the USA
(please refer to Appendix II with the Greek Networks Weathermap).
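One way such a daily launch could be scheduled from within Java is
sketched below. It is not the scheduler used in the project: the
class name, the reworkTask parameter and the reading of "5 GMT" as
05:00 GMT are assumptions made for illustration.

```java
import java.util.Calendar;
import java.util.TimeZone;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class DailyReworkSchedulerSketch {
    static final long DAY_MILLIS = 24L * 60L * 60L * 1000L;

    // Milliseconds from nowMillis until the next 05:00 GMT (assumed launch time).
    static long millisUntilNextRun(long nowMillis) {
        Calendar next = Calendar.getInstance(TimeZone.getTimeZone("GMT"));
        next.setTimeInMillis(nowMillis);
        next.set(Calendar.HOUR_OF_DAY, 5);
        next.set(Calendar.MINUTE, 0);
        next.set(Calendar.SECOND, 0);
        next.set(Calendar.MILLISECOND, 0);
        if (next.getTimeInMillis() <= nowMillis) {
            next.add(Calendar.DAY_OF_MONTH, 1); // today's slot has passed
        }
        return next.getTimeInMillis() - nowMillis;
    }

    // Runs the hypothetical rework task once a day, starting at the next 05:00 GMT.
    static ScheduledExecutorService start(Runnable reworkTask) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(reworkTask,
                millisUntilNextRun(System.currentTimeMillis()),
                DAY_MILLIS, TimeUnit.MILLISECONDS);
        return scheduler;
    }
}
```

An operating system scheduler such as cron would serve the same
purpose; the in-process variant is shown only because it keeps the
example self-contained.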
Project development process
In order to accomplish the tasks of this project the development
process was segmented into discrete phases, while the software,
since Java is a language that promotes reusability and object
orientation, was developed with a modular logic.
Phase I
1. The initial task was to determine the idea that the project
should serve. In the beginning there was a thought about
creating an intelligent search engine using AI.
2. But after some discussion with the project supervisor a
decision to incorporate some research about agents was taken.
Additionally there was an agreement to implement some of the
agents' characteristics, if time and resources were adequate.
Phase II
3. The second task was to decide on what platform the system
would be developed. The platform concept included the
programming language and the human-computer interaction.
• The next step was to undertake bibliographical research on
how the state-of-the-art web search engines work. The main
search engine of interest was Google. At this stage some
literature research took place in order to understand this
area of computer science.
Phase III
1. After the bibliographical research was finished the next task
was to decide on the modules that should comprise the final
project. The domain charts and initial class diagrams were
constructed. The top down strategy was employed in order to
visualize an overview of the system without going into detail
for any part of it. The first unit was the web crawler system
for collecting the web data.
2. Testing is a continuous process in the prototyping
methodology, and so after the WebCrawler design and
implementation finished some tests were carried out to
determine the efficiency and the stability of this specific
unit. Test and fix.
3. User Evaluation
4. The next module was the software part that would count the
occurrence of each term in every document, and its
interconnection with task 6.
5. Again some testing took place. Test and Fix.
6. Incorporate AI techniques in order to test the relevance of
each retrieved URL to the original question and interconnect
the new software with the software from tasks 8 and 6.
7. User Evaluation
8. Test and Fix.
Phase IV
1. System Front-End: how the user would interact with the core
system. For safety reasons (since knowledge of graphical
interfaces was limited) both the console mode and the web
interface methods have been employed.
2. User Evaluation
3. Test and Fix.
4. Introduce Agent characteristics. Autonomous
5. Test and Fix.
Phase V
1. Total system testing
2. Final System Evaluation from User
3. Produce the final documentation
Bibliographical research on how the state-of-the-art web search
engines work (Google & msn)
Bibliographical research on how to design and implement a
soft-agent
Design and implement an agent (crawler) for collecting the
web-data
Design the storage database
Decide what AI methods are applicable to our domain problem
Design and produce an output interface
Evaluate usability of the interface using users' feedback
Apply user feedback
Test the agent
Test the agent & database system
Produce final year project report
Table 2 Development Phases
CHAPTER 5. SOFTWARE DEVELOPMENT PHASES IN
DETAILS
Design Overview
Facilities
The retrieval system provides five basic facilities: user input,
candidate URLs retrieval, parsing, processing and taxonomical
presentation of the web sites that correspond to a specific user
query.
User Input: The user can input a query to the system using either
the command line or a web interface.
Candidate URLs retrieval: The system retrieves URLs from web
databases.
Parsing: URLs are parsed.
Processing: The content of the URLs is tested against the original
query.
Taxonomical Presentation: The user gets an output to his/her
question.
The core system
Software development platform
The core system is implemented using the Java Development Kit
version 1.5. The Java platform was selected because of the
following Java features:
• Java (version 1.4.2 and later) includes an efficient regular
expression mechanism. Some other advantages that make Java our
first choice have to do with the construction of the agent &
parser:
• Java is available with a free license for both commercial
and non-commercial purposes.
• Java API. There is a big number of ready classes, packages
and other utilities available that may fit our needs.
• Documentation. There is a big number of books currently in
print or online, while Sun offers an adequate number of online
resources.
• Object Oriented: through extension and interface
implementation we can reuse code and save time and effort.
Additionally third party libraries can be integrated into our
project.
• Memory Management: Java provides its own memory management
system.
• Availability. Java is everywhere: in PCs (Windows, Linux,
Unix), mainframes, laptops, PDAs, mobiles.
The web interface implementation will be performed in JSP
(JavaServer Pages), since it is a lightweight and well-known web
application language, while it has built-in interaction features
with Java.
As a result of the above decision, the Tomcat JSP server needs to
be installed and running at port 8080 of the system.
The database system will be based on MySQL version 4.1.14,
since it provides a strong database infrastructure, very good
documentation, and convenient interfaces for connectivity with
most well known databases, and is free of charge.
Integrated Development Environment
During the first days of this project there was a dilemma about
which IDE to use. In past years, during all coursework at the
University of Surrey, the NetBeans platform was suggested by most
of the lab instructors.
On the other hand a number of people (Vaughan-Nichols, 2003) were
suggesting Eclipse as the ultimate tool for Java coding. Among the
other benefits special reference is given to the following:
• Eclipse is Open Source
• There are a large number of plug-ins available for integration
with Eclipse
• The quick fix error system
• The convenient way to incorporate new libraries without the
need for CLASSPATH alteration.
• Major UML tools integration with Eclipse (Borland Together,
Poseidon, Visual Paradigm)
• No need for installation (fewer registry entries in the system)
Taking into account the aforementioned elements, Eclipse was
selected as the development IDE.
System Design
In the figure below (figure 5) we can see the use case that
supports the requirement gathering procedure for this project.
Figure 5 System Use Case
A user can input queries in the system. He can also select
whether cached results will be used or a fresh search will take
place. The system also gets the list of URLs from an external
actor, which in our case is Google. The state chart diagram in the
next figure (#6) provides a clear picture of the system
transitions.
The initialization event is the user query. If this question
occurred in the past, the system will suggest that the user take
advantage of the cached results.
Otherwise the query is stemmed and then stored in a DB for future
reference; the next step performed is to get a URL list from
Google. Then the closed class words are removed from the context of
each web page, since they have minor semantic value, and the
remaining terms get stemmed in order to reduce the total context
entropy (Pitikaris, 2002).
Figure 6 System State Diagram
In the next phase the system gets the frequency of each term that
occurred in the context of the web page; at the same time, if a
word has never been met during this query process, the word is
added to the total word HashMap. If no other URL is left for
processing then the system proceeds with Singular Value
Decomposition and Latent Semantic Analysis.
SVD applies to a matrix A which has the following structure:
Table 3 Sample of a Matrix candidate for SVD
       | URL1 | URL2 | URL3 | URL4
Term 1 | Times term 1 occurred in context of URL1 | Times term 1 occurred in context of URL2 | Times term 1 occurred in context of URL3 | Times term 1 occurred in context of URL4
Term 2 | Times term 2 occurred in context of URL1 | Times term 2 occurred in context of URL2 | Times term 2 occurred in context of URL3 | Times term 2 occurred in context of URL4
Term 3 | Times term 3 occurred in context of URL1 | Times term 3 occurred in context of URL2 | Times term 3 occurred in context of URL3 | Times term 3 occurred in context of URL4
Term 4 | Times term 4 occurred in context of URL1 | Times term 4 occurred in context of URL2 | Times term 4 occurred in context of URL3 | Times term 4 occurred in context of URL4
After the SVD operation matrix A becomes A = S x V x D, where V is
a diagonal matrix and S, D are normal matrices. Then we take the
first two columns from each matrix and we end up with the miniatures
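As a compact restatement of this step, and only as a sketch: the
dimension symbols m, n, r and the subscript 2 are introduced here
for illustration, and it is assumed that V holds the singular values
in decreasing order and that D plays the role of the transposed
right factor, so that keeping two dimensions means keeping two
columns of S, the leading 2 x 2 block of V and two rows of D.

```latex
A_{m \times n} = S_{m \times r}\, V_{r \times r}\, D_{r \times n},
\qquad
V = \operatorname{diag}(\sigma_1, \dots, \sigma_r),
\quad \sigma_1 \ge \dots \ge \sigma_r \ge 0

A \approx A_2 = S_2\, V_2\, D_2,
\qquad
S_2 \in \mathbb{R}^{m \times 2},\;
V_2 = \operatorname{diag}(\sigma_1, \sigma_2),\;
D_2 \in \mathbb{R}^{2 \times n}
```

Under these assumptions each URL is described by one two-component
column of D_2, which is the reduced space in which distances between
documents and the query can be compared.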
Figure 7 Class Diagram
Figure 5 Sequence diagram
Comments, challenges and interesting details
The first notable thing that made an impression was the concept of
the object in Java. Before this project it was something abstract
and very difficult to understand. But when object storage and
object comparisons took place in Map and HashMap operations, it
became clear how important and flexible this basic Java type is.
During the implementation a lot of redundant code was produced and
a large number of similar or quite similar classes were created,
most of the time to perform a single and specific piece of work.
To improve code readability and reduce the complexity in the
distribution of classes, packages were created and specific classes
were reengineered in order to provide more generic facilities.
This resulted in the reduction of the source directory size by 15%
without any loss in system functionality.
Another interesting thing is the importance of Javadoc. This
becomes apparent taking into consideration the vast amount of code
lines that were written for the classes and the methods used. In
order to keep track of the operational characteristics of each one
of the components used, an efficient documentation facility was
required. Javadoc provides an excellent tool for real-time
documentation and future reference.
The development phase started with the decision of which URL
source to use. After the decision to use the Google search engine
as the initial point for URL retrieval was taken, another problem
was encountered. For the initial tests the
• Stability
• Return types
At first, small and predefined data sets (mock objects) were used
to test each module. Since the data sets were predefined the
result was already known, so it was easy to check whether or not
the results returned by the module were consistent.
The next step was to input large data sets of exponentially
increasing sizes in order to test the consistency and the
behavior of each unit in extreme situations.
During this second testing phase a number of modules crashed with
exceptions, and this was useful since a number of "What can go
wrong" situations were caught.
Finally, small programs were created to reverse specific
operations, such as reading from a saved file and converting
matrices to Hash Maps and 2D arrays, as a kind of automated test
for consistency.
This procedure was extremely useful when, at a later time, it was
necessary to repeat testing on some modules during the Integration
phase. Unit testing helped to eliminate uncertainty about the
individual modules and was used in a bottom-up testing approach.
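A reverse-operation check of the kind described above can be
sketched for the serialized HashMap case. This is an illustration
under assumptions, not one of the project's test programs: the class
and method names are hypothetical and an in-memory byte array stands
in for the saved file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.HashMap;

class HashMapRoundTripSketch {
    // Serializes the map into a byte array (stand-in for the saved file).
    static byte[] save(HashMap<String, Integer> terms) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(bytes);
        out.writeObject(terms);
        out.close();
        return bytes.toByteArray();
    }

    // Reverse operation: rebuilds the map from the serialized bytes.
    @SuppressWarnings("unchecked")
    static HashMap<String, Integer> load(byte[] data)
            throws IOException, ClassNotFoundException {
        ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data));
        try {
            return (HashMap<String, Integer>) in.readObject();
        } finally {
            in.close();
        }
    }

    // Consistency check: saving and loading must give back an equal map.
    static boolean survivesRoundTrip(HashMap<String, Integer> terms) {
        try {
            return terms.equals(load(save(terms)));
        } catch (Exception e) {
            return false;
        }
    }
}
```

Because the expected result is simply the original input, such a
check needs no hand-written expected values and can be rerun
unchanged during integration testing.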
Integration Testing
After the individual module creation was completed and the
modules passed the unit tests, the individual software modules
were combined and tested as a group.
Throughout this stage it is essential to perform some integration
testing. Every time a new module is incorporated into the core
system, appropriate testing is applied in order to check whether
or not the system performs the combined task of the incorporated
modules in a consistent and sufficient manner.
The test method is similar to the unit test, but instead of
applying the tests to individual units we apply them to groups of
units following a bottom-up approach. Simulation also tests the
usage of shared data areas (Maps and files) and inter-process
communication.
great opportunity for the author to investigate areas of high interest such
as agent technologies and Information retrieval methods.
Future work
In a future release the following proposed extensions to this
project must be taken into account:
• Agent technology employment: especially for distributed
processing. According to what was extracted from the literature
review, agents are a convenient technique for implementing
distributed computing facilities. By distributing the processing
load we expect to get faster results and improve the system
performance in terms of the time that is required by the system
to provide results.
• Term weighting: Some papers (Turney, 2005) suggest the
transformation of word frequencies using the logarithmic
transformation and the application of the term entropy as a
measure of a term's importance in context.
• An adequate user interface must be produced in order to improve
user experience.
• Personalized search methods must be tested. It is possible that
we can perform such tasks by creating user profiles which will
contain the specific user's search interests and combining that
data with extended use of agent technology.
• An improved version of this project will include extensive use
of database features. Another improvement can be the use of
Hibernate technology, which suggests the encapsulation of Java
objects in a database schema (Hibernate, 2005).
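The term weighting extension listed above could, for example, take
the form of the log-entropy scheme that is commonly paired with LSA.
The sketch below is an assumption about how it might be applied to
the term by URL count matrix, not part of the delivered system; the
class and method names are hypothetical.

```java
class LogEntropyWeightingSketch {
    // counts[term][document] holds raw frequencies; the result has the same shape.
    static double[][] weigh(double[][] counts) {
        int terms = counts.length;
        int docs = terms == 0 ? 0 : counts[0].length;
        double[][] weighted = new double[terms][docs];
        for (int i = 0; i < terms; i++) {
            double rowTotal = 0.0;
            for (int j = 0; j < docs; j++) {
                rowTotal += counts[i][j];
            }
            // Sum of p * log(p) over the documents in which the term occurs.
            double entropySum = 0.0;
            if (rowTotal > 0.0 && docs > 1) {
                for (int j = 0; j < docs; j++) {
                    if (counts[i][j] > 0.0) {
                        double p = counts[i][j] / rowTotal;
                        entropySum += p * Math.log(p);
                    }
                }
            }
            // Global weight: 1 for a term concentrated in one document,
            // 0 for a term spread evenly over all documents.
            double global = docs > 1 ? 1.0 + entropySum / Math.log(docs) : 1.0;
            for (int j = 0; j < docs; j++) {
                // Local weight: logarithmic damping of the raw frequency.
                weighted[i][j] = global * Math.log(1.0 + counts[i][j]);
            }
        }
        return weighted;
    }
}
```

The weighted matrix would replace the raw count matrix as the input
to the decomposition, so that terms appearing evenly in every page
contribute little to the document vectors.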
Bibliography
ALEXANDROU, M. (2006) Systems Development Life Cycle (SDLC)
http://www.mariosalexandrou.com/methodologies/systems_development_life_cycle.asp
(10/4/2006)
BADAL, D. & DAVIES, M. Retrieval of Unstructured Text
BARRETT, L., M. (1997) Simulating requirements gathering. Technical
Symposium on Computer Science Education. San Jose, California,
United States, ACM Press New York, NY, USA.
BHATTI, N., BOUCH, A., KUCHINSKY, A. Integrating User-Perceived
Quality into Web Server Design http://www9.org/w9cdrom/92/92.html
2/4/2006
C.J. VAN RIJSBERGEN, S. E. R. A. M. F. P. (1980) New models in
probabilistic information retrieval.
DEERWESTER, S. C., DUMAIS, S. T., LANDAUER, T. K., FURNAS, G. W. &
HARSHMAN, R. A. (1990) Indexing by Latent Semantic Analysis.
Journal of the American Society of Information Science, 41, 391-407.
DOSZKOCS, T. E. R., J.; LIN, X (1990) Connectionist Models and
Information Retrieval. Annual Review of Information Science and
Technology (ARIST), vol. 25, 209-260.
DR VRUSIAS. B (2005) CS364 Artificial Intelligence II/2005
http://portal.surrey.ac.uk/portal/page?_pageid=798,463237&_dad=portal&_schema=PORTAL
10/2005
FALOUTSOS, C. & OARD, D. W. (1995) A Survey of Information Retrieval
and Filtering Methods.
FOWLER, M. (2002) The Agile Manifesto: where it came from and where
it may go http://www.martinfowler.com/articles/agileStory.html
2/04/2006
GONCALVES, M. A., FRANCE, R. K., FOX, E. A. & DOSZKOCS, T. E.
MARIAN: Searching and Querying across Heterogeneous Federated
Digital Libraries
HIBERNATE (2005) Relational Persistence for Java and .NET
http://www.hibernate.org/ 20/4/2006
INC., C. (2000) What is Rapid Application Development (RAD)?
20/4/2006
JARVELIN, K., KEKALAINEN, J. (2000) IR evaluation methods for
retrieving highly relevant documents. Annual ACM Conference on
Research and Development in Information Retrieval. Athens, ACM
Press New York, NY, USA.
JENKINS, A. M. (1985) "Prototyping: A Methodology for the design and
Development of Applications Systems". Spectrum, 2.
JUSTICE, U. S. D. O. (2003) INFORMATION RESOURCES MANAGEMENT
http://www.usdoj.gov/jmd/irm/lifecycle/table.htm 10/2005
LEE, D. L., KIM, Y. M. & PATEL, G. (1995) Efficient Signature File
Methods for Text Retrieval. Knowledge and Data Engineering, 7,
423-435.
LOWE, D., WENDY, H. (1999) Hypermedia and the Web, Wiley
MCCONNELL, S. (2004) Code Complete, Second Edition, Microsoft Press.
MOCK, K. J. (1996) Hybrid Hill-Climbing and Knowledge-Based Methods
for Intelligent News Filtering. AAAI/IAAI, Vol. 1.
MOSOUD, M., JENTZCH, R. (2004) Computational Intelligence Techniques
Driven Intelligent Agent for Web Data Mining ISBN: 159140
PARKER, D. C. (1983) "The Evolution of Management Decision
Support Systems". Information Systems, 4, 56-65.
PITIKARIS, T., PAPADOURAKIS, G., NIKITAKIS, M., WARE, J. A.,
TSAGATAKIS, G. (2002) Digital libraries: information retrieval with
neural networks. IN LARISSA, T. E. I. O. (Ed.) Hellenic Conference
on Academic Libraries, Larissa, Technological Educational Institute
of Larissa
RICHARDSON, M., DOMINGOS, P. (2002) The Intelligent Surfer:
Probabilistic Combination of Link and Content Information in
PageRank. Advances in Neural Information Processing Systems.
Vancouver, Canada, December 9-14, MIT Press.
SOCIETY, I. (2004) Internet Society 2004 Annual Report. Geneva,
Switzerland, Internet Society.
THER, H., H. (2003) Topic-Sensitive PageRank: A Context-Sensitive
Ranking Algorithm for Web Search. IEEE TRANSACTIONS ON KNOWLEDGE
AND DATA ENGINEERING, 74, 784-796.
TSAGATAKIS, G., PLESSAS, J., PITIKARIS, T. (2005) Information Society
Committee call No 114: An e-government platform for Lasithi
Prefecture. Agios Nikolaos, Technological Institute of Crete.
TURNEY, D., P. (2005) Measuring Semantic Similarity by Latent
Relational Analysis. Nineteenth International Joint Conference on
Artificial Intelligence. Edinburgh, Scotland.
URQUHART, C. (1999) Themes in early requirements gathering: The case
of the analyst, the client and the student assistance scheme.
Information Technology & People, 12, 44-70.
VAN PETTEN C, K. M. (1991) Influences of semantic and syntactic
context on open- and closed-class words. Mem Cognit., 95-112.
VAUGHAN-NICHOLS, S. J. (2003) The battle over the universal Java IDE.
Computer, 36, 21-23.
WIKIPEDIA (2005) Software development process. Wikipedia.
WIKIPEDIA (2006a) PageRank
http://en.wikipedia.org/wiki/Pagerank#Simplified_PageRank_algorithm
4/2006
WIKIPEDIA (2006b) Requirements analysis
http://en.wikipedia.org/wiki/Requirements_analysis 4/2006
WIKIPEDIA (2006c) System Development Life Cycle
http://en.wikipedia.org/wiki/System_Development_Life_Cycle
10/04/2006
WOOLDRIDGE (1995) Intelligent Agents: Theory and Practice. Knowledge
Engineering Review, VOL 10, 115-152.
ZAPHIRIS, P. & ZACHARIA, G. (2001) Design Methodology of an Online
Greek Language Course
Appendix I
1. Definitions and Abbreviations
LSA Latent Semantic Analysis
A mathematical/statistical technique for extracting and
representing the similarity of meaning of words and passages by
analysis of large bodies of text.
SVD Singular Value Decomposition
A method to factorize a rectangular real or complex matrix
SRS Software Requirements Specification
TCP Transmission Control Protocol
One of the core protocols of the Internet protocol suite. Using
TCP, applications on networked hosts can create connections to one
another, over which they can exchange data or packets
JVM Java Virtual Machine
DRAM Dynamic RAM
Specific type of computer memory
2D Two Dimensions
MB Megabyte
Computer measurement unit for storage media.
equation) that are sometimes also known as characteristic roots,
characteristic values (Hoffman and Kunze 1971), proper values, or
latent roots