ACEEE Int. J. on Network Security , Vol. 02, No. 02, Apr 2011


                Context Driven Technique for Document
                             Classification
                                     *Upasana Pandey, @ S. Chakraverty, # Rahul Jain
                                                   *upasana1978@gmail.com
                                                  @apmahs@rediffmail.com
                                                   #rahul.jain@nsitonline.in
                                         NSIT, SEC 3, Dwarka, New Delhi 110078, India

© 2011 ACEEE
DOI: 01.IJNS.02.02.196

Abstract: In this paper we present an innovative hybrid Text Classification (TC) system that bridges the gap between statistical and context based techniques. Our algorithm harnesses contextual information at two stages. First, it extracts a cohesive set of keywords for each category by using lexical references, implicit context as derived from Latent Semantic Analysis (LSA) and word-vicinity driven semantics. Second, each document is represented by a set of context rich features whose values are derived by considering both lexical cohesion and the extent of coverage of salient concepts via lexical chaining. After keywords are extracted, a subset of the input documents is apportioned as the training set. Its members are assigned categories based on their keyword representation. These labeled documents are used to train binary SVM classifiers, one for each category. The remaining documents are supplied to the trained classifiers in the form of their context-enhanced feature vectors. Each document is finally ascribed its appropriate category by an SVM classifier.

Keywords: Lexical references, Vicinity driven semantics, Lexical chaining.

I. INTRODUCTION

    Text Classification (TC) is the task of inspecting input documents with the aim of assigning them categories from a predefined set. TC serves a wide range of applications such as classifying scientific and medical documents, organizing documents for query-based information dissemination, email folder management and spam filtering, topic-specific search engines and library / web document indexing [1].
    The world-wide-web has witnessed a mind boggling growth in both the volume and the variety of text content, driving the need for sophisticated TC techniques. Over the years, numerous statistical approaches have been successfully devised. But increasingly, the so-called bag-of-words approach has reached a point of saturation where it can no longer handle the increasing complexities posed by polysemous words, multi-word expressions and closely related categories that require semantic analysis. Recently, researchers have illustrated the potential for applying context-oriented approaches to improve upon the quality of TC. This has resulted in a fruitful convergence of several fields such as Information Retrieval (IR), Natural Language Processing (NLP) and Machine Learning (ML). The aim is to derive the semantics encapsulated in groups of words and utilize it to improve classification accuracy.
    This paper showcases a proposal which bridges the gap between statistical and context based approaches for TC. The contextual information is harnessed at two stages. First, it is used to extract a meaningful and cohesive set of keywords for each input category. Secondly, it is used to refine the feature set representing the documents to be classified.
    For the rest of the paper, Section II presents a brief background of the field and the relevance of context based TC. Section III brings into perspective prior work in the area. Section IV presents our proposed context-enhanced TC model. In Section V, we compare the proposed TC scheme with current approaches and overview its implications and advantages. We conclude in Section VI.

II. BACKGROUND AND MOTIVATION

    A TC system accepts two primary inputs: a set of categories and a set of documents to be classified. Most TC systems use supervised learning methods that entail training a classifier. Support Vector Machines (SVM) use kernel functions to transform the feature space into a higher dimensional space and have been found to be especially suitable for TC applications [2]. Their training process may use prior labeled documents which guide the classifier in tuning its parameters. In another approach, as illustrated in the Sleeping Experts algorithm [3], a classifier is trained dynamically during the process of classification using a subset of the input documents and its parameters are progressively refined. Another approach is to identify a set of training documents from among the input dataset and label them with keyword-matched categories.
    A concern here is to derive a set of keywords for each category. While user-input keywords can be used, this is cumbersome and may be impractical. It is attractive to generate the keywords automatically. This motivates the application of contextual information to cull out meaningful keywords for each category.
    The documents must be pre-processed and represented by a well-defined and cohesive set of features; this is the most time-consuming part of TC. The main aim of feature extraction is to reduce the large dimensionality of a document's feature space and represent it in a concise manner within a meaningfully derived context space. Once a document is encoded as a feature set, a trained classifier can be invoked to assign it a suitable category. A plethora of classification techniques are available and have been tapped by TC researchers. They include Bayesian classifiers [4], SVM [5], Decision trees [6], AdaBoost [7], RIPPER [3], fuzzy classifiers [8] and
growing seeded clusters based on Euclidean distance between data points [9].
    Predominantly, statistical approaches have been applied for feature extraction with highly acceptable performance of about 98%. These methods employ statistical metrics for feature evaluation, the most popular being Term Frequency-Inverse Document Frequency (TF-IDF), Chi-square and Information Gain [10]. Statistical techniques have been widely applied to web applications and harnessed to capacity. Their applicability in further improving the quality of TC seems to have reached a saturation point. They have some inherent limitations, being unable to deal with synonyms, polysemous words and multi-word expressions.
    On the other hand, semantic or context-based feature extraction offers a lot of scope to experiment on the relevancy among words in a document. Semantic features encapsulate the relationship between words and the concepts or the mental signs they represent. Context can be interpreted in a variety of intuitively appealing ways such as:
     • Implicit correlation between words as expressed through Latent Semantic Analysis (LSA)
     • Lexical cohesiveness as implied by synonyms, hypernyms, hyponyms, meronyms or as sparse matrices of ordered words as described in [3]
     • Syntactic relation between words as reflected by Parts-Of-Speech (POS) phrases
     • Shallow semantic relationships as expressed by basic questions such as WHO did WHAT to WHOM, WHERE [2]
     • Enhancement of a word's importance based on the influence of other salient words in its vicinity [11]
     • Domain based context features expressed by domain specific ontology [7].
    With such a wide spectrum, different context-oriented features can be flexibly combined together to target different applications. The issue of real time performance can be handled through efficient learning algorithms and compact feature representation.
    In this paper, we propose a scheme for incorporating lexical, referential as well as latent semantics to produce context enhanced text classification.

III. REVIEW OF CONTEXT BASED TC

    Several authors have suggested the use of lexical cohesion in different schemes to aid the process of classification.
    L. Barak et al have used a lexical reference scheme to expand upon a given category name with the aim of automatically deriving a set of keywords for the category [12]. A reference vector comprises lexical references derived from WordNet and Wikipedia for each category, and a context vector is derived from the LSA space between categories and input documents. The two vectors yield two cosine similarity scores between categories and documents. These scores are multiplied to assign a final similarity score. The final bootstrapping step uses the documents that are identified for each category to train its classifier. With their combined reference plus context approach, the authors have reported improvement in precision with the Reuters-10 corpus.
    In [13], Diwakar et al have used lexical semantics such as synonyms, analogous words and metaphorical words/phrases to tag input documents before passing them to a statistical Bayesian classifier. After semantic pre-processing and normalization, the documents are fed to a Naïve Bayes classifier. The authors have reported improvement in classification accuracy with semantic tagging. However, processing is almost entirely manual.
    In [7], the authors locate the longest multiword expression leading to a concept. They derive concepts, synonyms and generalized concepts by utilizing ontology. Since concepts are predominantly noun phrases, a Part Of Speech (POS) analyzer is used to reject unlikely concepts. The set of terms and their lexically derived concepts is used to train and test an AdaBoost classifier. Using the Reuters-21578 [14], OHSUMED [15] and FAODOC [16] corpora, the authors report that the combined feature set of terms and derived concepts yields noticeable improvements over the classic term stem representation.
    In [17] the authors select a set of likely positive examples from an unlabeled dataset using co-training. They also improve the positive examples with semantic features. The expanded positive dataset as well as the unlabeled negative examples together train the SVM classifier, thus increasing its reliability. For the final classification, the TF-IDF values of all the semantic features used in the positive data sets are evaluated in test documents. The authors' experiments on the Reuters-21578 [14] and Usenet articles [18] corpora reveal that using positive and unlabeled negative datasets combined with semantic features gives improved results.
    Ernandes et al [11] propose a term weighting algorithm that recognizes words as social networks and enhances the default score of a word by using a context function that relates it to neighboring high scoring words. Their work has been suitably applied for single word QA as demonstrated for crossword solving.

IV. PROPOSED WORK

    Our proposal taps the potential of lexical and contextual semantics at two tiers:

A. Keyword extraction:
    Automatic keyword generation frees the user from the encumbrance of inputting them manually. Our scheme employs lexical references derived from WordNet [19], implicit context as derived from LSA and vicinity based context as given by the terms surrounding already known keywords. The three driving forces together yield a cohesive set of keywords. The set of extracted keywords for each category serves as a template for identifying representative documents for that category. These documents are utilized as supervisors to train an SVM classifier.

B. Document representation:
    Once training documents are identified, these and the remaining unlabeled test documents must be represented by a compact feature set. As during the keyword extraction phase, we use lexical relationships to identify generic
concepts, synonyms and specialized versions of terms. Taking a further step of refinement, we use anaphoric referential phrases and strong conjunctions to identify the length of cohesiveness or lexical chain of each salient concept. These parameters are utilized to fine tune TF-IDF feature values. Only those features whose values cross a stipulated threshold finally represent the document.
    The algorithm, Context-driven Text Classification (CTC), is described in the pseudo-code in Figure 1.

Step 1: Removal of stop words and stemming:
    The first step is pre-processing of documents. Stop words, i.e. those words that have negligible semantic significance such as 'the', 'an', 'a' etc., are removed. Weak conjunctions such as 'and' are removed but the strong conjunctions 'therefore' and 'so' are retained so that their contribution to lexical chaining can be utilized. Next the system carries out stemming, viz. the process of reducing inflected words to a root word form. The entire set of documents is partitioned into a one-third training set and a two-thirds testing set.

Step 2: Keyword extraction:
    Words that bear a strong semantic connection with category names are keywords. This concept can be extended to hypothesize that words strongly associated with identified keywords are themselves good candidates as keywords. We apply these principles to extract keywords from the following sources.

   Keyword set from lexical references:
       Words that are lexically related to a category c are collected from the WordNet resource. Taking a cue from [12], we use only the sense that corresponds to the category's meaning. First, synonyms are extracted. Then hypernyms of each term are transitively located to allow generalization. Hypernyms are collected up to a pre-specified level k so that term meanings are not obfuscated by excessive generalization. Next, hyponyms at the immediate lower level are located. At the end of this process, the system is armed with an initial set of keywords that explicitly refer to category names.

   Keyword set from implicit context:
       Latent Semantic Analysis (LSA) is a well known technique to help identify the implicit contextual bonding between category names and input documents. LSA uses the Singular Value Decomposition (SVD) to cull out groups of words which have high vector similarity in the LSA space [20]. This step generates a new set of contextually related keywords AKimc(c) for each category c.

   Augmented Keyword Sets from vicinity context:
       Words in the vicinity of keywords weave a semantic thread that correlates with the category. LSA does club together words co-occurring anywhere in a document but does not highlight semantics inherently present in neighboring words. The idea is to locate those words surrounding keywords that have a high potential for becoming keywords themselves. Given a keyword set, the algorithm carries out a syntactic parsing of the sentences containing keywords. Since concepts are usually represented by noun phrases, they are separated out. The number of instances of each vicinity related pair is counted. If this exceeds a pre-decided threshold τv, the keyword set is augmented with the new term. The two augmented keyword sets are now AKlex(c) and AKimc(c).

Step 3: Evaluate documents' representative index:
    The cosine similarity, which gives the cosine of the angle between two vectors, has been used in [12,22] to match documents with categories. However, because the keyword vectors are equally weighted, the cosine similarity favors documents that contain an equal number of instances of each keyword over others. Also, a document with more instances of a keyword may receive the same similarity score as a document with fewer instances of the keyword. An appropriate metric should reflect the fact that any document containing more keywords from a category is more representative of it and a document with more instances of each keyword deals with the category in greater detail. Given a keyword list AK(c) and a document d, we multiply two factors to derive its representative index ρ(AK(c),d): (1) the fraction of the total number of category keywords present in d, and (2) the ratio of the average numbers of keywords in d to the sum total of these averages for all documents. Thus:

$$\rho(AK(c), d) = \frac{|AK(d)|}{|AK(c)|} \cdot \frac{\sum_{i} \mathrm{TF\_IDF}(w_i \in AK(d))}{\sum_{k} \sum_{i} \mathrm{TF\_IDF}(w_i \in AK(d_k))}$$

    Here AK(d) is the set of category keywords present in d, and the index k runs over all documents.

Step 4: Assigning categories to documents:
    The next step labels documents with categories. The document with the highest r-score ρ for a category is assigned to it. Thereafter, all documents whose r-scores for that category are within a preset category assignment factor F of this maximum similarity score are also assigned to it.

Step 5: Document feature representation:
    The TC system utilizes contextual information to derive the feature set values of documents in the following ways:

Generalization with lexical references:
    As a document is browsed, new terms encountered are annotated with their synonyms, hypernyms for k-up levels and hyponyms for k-down levels. A rescanning of the document replaces any of the lexical references encountered with their equivalent base words. Term counts are simultaneously updated. When all documents have been tagged, TF-IDF values are evaluated.

Creating Lexical chains:
    Vertical search engines focus on a particular topic or text segment only. They need TC methods with greater depth of
search. This is in contrast with generic Web search engines. Towards this goal, we recognize that the extent of coverage of a concept is as important as its frequency of occurrence. In our framework, we have introduced an optional step that can be invoked for vertical categorization requirements.

Lexical chaining:
    Lexical chaining links consecutive sentences that are related through repetition of a concept term or its lexical references (iteration), anaphoric references and strong conjunctions [21]. Besides iteration, resolved anaphoric references also lead to the next link in the chain. Strong conjunctions such as 'therefore' or 'so' serve as final links. When a gap of a few sentences occurs, it indicates termination of a local chain.

Step 6: Classifier Training:
    The documents output by Step 4 are positively labeled documents. They are used to train a set of binary SVM classifiers, one per category. All documents labeled with a given category are employed to train its classifier.

Step 7: Testing:
    The test documents are applied to the trained classifiers to be finally assigned their appropriate categories.

V. DISCUSSION

    The TC system proposed above differs from reported schemes with the following innovations.

A. Application of context for TC:
    In this paper, context driven information is utilized for keyword generation as well as for document representation.

B. Keyword extraction:
    We use a three pronged approach to extracting keywords automatically, starting with only category names:
        i. An LSA space vector similarity between documents and category names
        ii. Deriving lexical references to category names and
        iii. Utilizing the strong semantic connection between concepts in the vicinity of identified keywords.
    While the first method has been tried out in [22] and the first two combined have been reported in [12], our algorithm augments both kinds of keywords by tapping distance based context. The combined approach for keyword extraction is geared towards an expanded and more cohesive keyword set.

Document to category matching:
    In [12,22], the authors have used the cosine similarity metric to map documents to categories. This does not truly capture their faithfulness to a given category. The r-score used here encodes the extent to which a document represents a category. It encapsulates both parameters: the fraction of the keywords present in a document and the weighted average keyword frequency.

Weighting TF-IDF with length of lexical chain:
    To allow categorization for applications that require a vertical search based on a topic, we use lexical chaining, which allows us to measure the extent of coverage of a concept. Because it is compute-intensive, we offer this feature as a user-requested option.

Tunable Guiding parameters:
    Parameters such as maximum level of lexical referencing, vicinity score threshold and category assignment factor can be experimentally tuned. There is flexibility to adjust them for different applications and input categories.

VI. CONCLUSION

    In conclusion, we have proposed a comprehensive scheme for context based TC that starts its job with only category names. A cohesive set of keywords is generated by exploiting three kinds of semantic cohesion: lexical references, implicit context and vicinity induced cohesion. Semantic cohesion is also utilized for representing documents as context oriented feature sets. While lexical references bring in generalization, an optional lexical chaining scheme allows the depth of coverage of concepts to influence the classification decision. The application of context at various levels gives a framework for more meaning-driven classification which cannot be expected from a purely bag-of-words approach.

REFERENCES

[1] F. Sebastiani, "Text categorization", Trans. of State-of-the-art in Science and Engineering, Procs Text Mining and its Applications to Intelligence, CRM and Knowledge Management, volume 17, 2005, WIT Press.
[2] S. Pradhan et al., "Support vector learning for semantic argument classification," Machine Learning, 60, pp 11-39, 2005.
[3] William W. Cohen and Y. Singer, "Context Sensitive Learning Methods for Text Categorization," ACM Transactions on Information Systems, Vol 17 No. 2, Pages 141-173, 1999.
[4] I. Androutsopoulos et al., "Learning to filter spam mail: a Bayes and a memory based approach," Procs of the workshop "Machine Learning and Textual Information Access", 4th European Conference on Principles and Practice of Knowledge Discovery in Databases, 2000.
[5] Q. Wang et al, "SVM Based Spam Filter with Active and Online Learning", Procs. of the TREC Conference, 2006.
[6] Jiang Su, Harry Zhang, "A Fast Decision Tree learning algorithm", Procs of the 21st conf. on AI-Vol. 1 Boston, Pages 500-505, 2006.
[7] Stephen Bloehdorn et al, "Boosting for text classification with semantic features", Procs. of the MSW 2004 workshop at the 10th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Aug 2004, p. 70-87.
[8] El-Sayed M. El-Alfy, Fares S. Al-Qunaieer, "A Fuzzy Similarity Approach for Automated Spam Filtering", Procs. of the 2008 IEEE/ACS International Conference on Computer Systems and Applications - Volume 00, Pages 544-550, 2008.
[9] P.I. Nakov, P.M. Dobrikov, "Non-Parametric Spam Filtering Based on KNN and LSA", Procs of the 33rd National Spring Conference, 2004.
[10] Yiming Yang et al, "A comparative Study on Feature Selection in Text Classification", Proceedings of ICML-97, 14th International Conference on Machine Learning, pages 412-420, Nashville, US, Morgan Kaufmann Publishers, San Francisco, US, 1997.
[11] Marco Ernandes et al, "An Adaptive Context Based Algorithm for Term Weighting", Proceedings of the 20th International Joint Conference on Artificial Intelligence, 2748-2753, 2007.
[12] Libby Barak et al, "Text Categorization from Category Name via Lexical Reference", Proc. of NAACL HLT 2009: Short Papers, pages 33-36, June 2009.
[13] Diwakar Padmaraju et al, "Applying Lexical Semantics to Improve Text Classification", http://web2py.iiit.ac.in/publications/default/download/inproceedings.pdf.9ecb6867-0fb0-48a5-8020-0310468d3275.pdf
[14] Reuters dataset: www.reuters.com
[15] OHSUMED test collection dataset: http://medir.ohsu.edu/~hersh/sigir-94-ohsumed.pdf
[16] FAODOC test collection dataset: www.tesisenxarxa.net/TESIS_UPC/AVAILABLE/TDX.../12ANNEX2.pdf
[17] Na Luo et al, "Using Co-training and semantic feature extraction for positive and unlabeled text classification", 2008 International Seminar on Future Information Technology and Management Engineering, http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=04746478
[18] http://www.brothersoft.com/downloads/nzb- senet.html
[19] http://WordNet.princeton.edu/WordNet/download/
[20] http://en.wikipedia.org/wiki/Latent_semantic_analysis
[21] Halliday, M.A.K. and Hasan, R., Cohesion in English, Longman, 1976.
[22] Gliozzo et al, "Investigating Unsupervised Learning for Text Categorization Bootstrapping", Proc. of Human Languages Technology Conference and Conference on Empirical Methods in Natural Languages, Pages 129-136, October 2005.
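
As an illustration of Steps 3 and 4 of the algorithm in Section IV, the representative index ρ(AK(c),d) and the category assignment rule can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, documents are assumed to be already tokenized and stemmed, TF-IDF uses a smoothed idf chosen here for convenience, and "within a factor F of the maximum" is read as "at least F times the best score" with 0 < F <= 1.

```python
import math
from collections import Counter


def tf_idf_tables(docs):
    """Return one {term: tf-idf} dict per tokenized document.

    Assumption: smoothed idf = ln((1 + N) / (1 + df)) + 1, so terms that occur
    in every document still carry a small positive weight.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    tables = []
    for doc in docs:
        counts = Counter(doc)
        tables.append({
            term: (count / len(doc)) * (math.log((1 + n) / (1 + df[term])) + 1.0)
            for term, count in counts.items()
        })
    return tables


def representative_index(keywords, docs):
    """Compute rho(AK(c), d) for every document d (Step 3).

    keywords: set of category keywords AK(c); docs: list of token lists.
    """
    tables = tf_idf_tables(docs)
    # Summed TF-IDF weight of the category keywords found in each document.
    mass = [sum(w for term, w in table.items() if term in keywords)
            for table in tables]
    total = sum(mass)  # double sum over all documents d_k
    scores = []
    for doc, m in zip(docs, mass):
        coverage = len(keywords & set(doc)) / len(keywords)  # |AK(d)| / |AK(c)|
        scores.append(coverage * m / total if total else 0.0)
    return scores


def assign_documents(scores, factor):
    """Step 4 under the assumed reading: keep documents scoring >= factor * best."""
    if not scores or max(scores) == 0:
        return []
    best = max(scores)
    return [i for i, s in enumerate(scores) if s >= factor * best]


# Toy corpus (invented for illustration only).
docs = [
    ["stock", "market", "trade", "stock"],
    ["market", "weather"],
    ["rain", "weather"],
]
keywords = {"stock", "market", "trade"}
scores = representative_index(keywords, docs)
print(scores)
print(assign_documents(scores, 0.5))
```

In this toy corpus the first document contains all three keywords, the second only one and the third none, so the first receives the highest index and the third an index of zero. Lowering the assumed factor F admits more documents into the category's training set.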




© 2011 ACEEE                                                                                       27
DOI: 01.IJNS.02.02.196

Context Driven Technique for Document Classification

  • 1.
ACEEE Int. J. on Network Security, Vol. 02, No. 02, Apr 2011
© 2011 ACEEE, DOI: 01.IJNS.02.02.196

*Upasana Pandey, @S. Chakraverty, #Rahul Jain
*upasana1978@gmail.com
@apmahs@rediffmail.com
#rahul.jain@nsitonline.in
NSIT, SEC 3, Dwarka, New Delhi 110078, India

Abstract: In this paper we present an innovative hybrid Text Classification (TC) system that bridges the gap between statistical and context based techniques. Our algorithm harnesses contextual information at two stages. First, it extracts a cohesive set of keywords for each category by using lexical references, implicit context as derived from LSA and word-vicinity driven semantics. Second, each document is represented by a set of context rich features whose values are derived by considering both lexical cohesion and the extent of coverage of salient concepts via lexical chaining. After keywords are extracted, a subset of the input documents is apportioned as a training set. Its members are assigned categories based on their keyword representation. These labeled documents are used to train binary SVM classifiers, one for each category. The remaining documents are supplied to the trained classifiers in the form of their context-enhanced feature vectors. Each document is finally ascribed its appropriate category by an SVM classifier.

Keywords: Lexical references, Vicinity driven semantics, Lexical chaining.

I. INTRODUCTION

Text Classification (TC) is the task of inspecting input documents with the aim of assigning them categories from a predefined set. TC serves a wide range of applications such as classifying scientific and medical documents, organizing documents for query-based information dissemination, email folder management and spam filtering, topic-specific search engines and library / web document indexing [1].

The world-wide-web has witnessed a mind-boggling growth in both the volume and the variety of text content, driving the need for sophisticated TC techniques. Over the years, numerous statistical approaches have been successfully devised. But increasingly, the so-called bag-of-words approach has reached a point of saturation where it can no longer handle the increasing complexities posed by polysemous words, multi-word expressions and closely related categories that require semantic analysis. Recently, researchers have illustrated the potential for applying context-oriented approaches to improve upon the quality of TC. This has resulted in a fruitful convergence of several fields such as Information Retrieval (IR), Natural Language Processing (NLP) and Machine Learning (ML). The aim is to derive the semantics encapsulated in groups of words and utilize it to improve classification accuracy.

This paper showcases a proposal which bridges the gap between statistical and context based approaches for TC. The contextual information is harnessed at two stages. First, it is used to extract a meaningful and cohesive set of keywords for each input category. Secondly, it is used to refine the feature set representing the documents to be classified.

For the rest of the paper, Section II presents a brief background of the field and the relevance of context based TC. Section III brings into perspective prior work in the area. Section IV presents our proposed context-enhanced TC model. In Section V, we compare the proposed TC scheme with current approaches and overview its implications and advantages. We conclude in Section VI.

II. BACKGROUND AND MOTIVATION

A TC system accepts two primary inputs: a set of categories and a set of documents to be classified. Most TC systems use supervised learning methods that entail training a classifier. Support Vector Machines (SVM) use kernel functions to transform the feature space into a higher dimensional space and have been found to be especially suitable for TC applications [2]. Their training process may use prior labeled documents which guide the classifier in tuning its parameters. In another approach, as illustrated in the Sleeping Experts algorithm [3], a classifier is trained dynamically during the process of classification using a subset of the input documents and its parameters are progressively refined. Another approach is to identify a set of training documents from among the input dataset and label them with keyword-matched categories.

A concern here is to derive a set of keywords for each category. While user-input keywords can be used, this is cumbersome and may be impractical. It is attractive to generate the keywords automatically. This motivates the application of contextual information to cull out meaningful keywords for each category.

The documents must be pre-processed and represented by a well-defined and cohesive set of features; this is the most time-consuming part of TC. The main aim of feature extraction is to reduce the large dimensionality of a document’s feature space and represent it in a concise manner within a meaningfully derived context space. Once a document is encoded as a feature set, a trained classifier can be invoked to assign it a suitable category. A plethora of classification techniques are available and have been tapped by TC researchers. They include Bayesian classifiers [4], SVM [5], Decision trees [6], AdaBoost [7], RIPPER [3], fuzzy classifiers [8] and
growing seeded clusters based on Euclidean distance between data points [9].

Statistical approaches have predominantly been applied for feature extraction, with highly acceptable performance of about 98%. These methods employ statistical metrics for feature evaluation, the most popular being Term Frequency-Inverse Document Frequency (TF-IDF), Chi-square and Information Gain [10]. Statistical techniques have been widely applied to web applications and harnessed to capacity. Their applicability in further improving the quality of TC seems to have reached a saturation point. They have some inherent limitations, being unable to deal with synonyms, polysemous words and multi-word expressions.

On the other hand, semantic or context-based feature extraction offers a lot of scope to experiment on the relevancy among words in a document. Semantic features encapsulate the relationship between words and the concepts or the mental signs they represent. Context can be interpreted in a variety of intuitively appealing ways such as:
• Implicit correlation between words as expressed through Latent Semantic Analysis (LSA)
• Lexical cohesiveness as implied by synonyms, hypernyms, hyponyms, meronyms or as sparse matrices of ordered words as described in [3]
• Syntactic relation between words as reflected by Parts-Of-Speech (POS) phrases
• Shallow semantic relationships as expressed by basic questions such as WHO did WHAT to WHOM, WHERE [2]
• Enhancement of a word’s importance based on the influence of other salient words in its vicinity [11]
• Domain based context features expressed by domain specific ontology [7].

With such a wide spectrum, different context-oriented features can be flexibly combined to target different applications. The issue of real time performance can be handled through efficient learning algorithms and compact feature representation.

In this paper, we propose a scheme for incorporating lexical, referential as well as latent semantics to produce context enhanced text classification.

III. REVIEW OF CONTEXT BASED TC

Several authors have suggested the use of lexical cohesion in different schemes to aid the process of classification.

L. Barak et al have used a lexical reference scheme to expand upon a given category name with the aim of automatically deriving a set of keywords for the category [12]. A reference vector comprises lexical references derived from WordNet and Wikipedia for each category, and a context vector is derived from the LSA space between categories and input documents. The two vectors are used to calculate two cosine similarity scores between categories and documents. These scores are multiplied to assign a final similarity score. The final bootstrapping step uses the documents that are identified for each category to train its classifier. With their combined reference plus context approach, the authors have reported improvement in precision with the Reuters-10 corpus.

In [13], Diwakar et al have used lexical semantics such as synonyms, analogous words and metaphorical words/phrases to tag input documents before passing them to a statistical Bayesian classifier. After semantic pre-processing and normalization, the documents are fed to a Naïve Bayes classifier. The authors have reported improvement in classification accuracy with semantic tagging. However, processing is almost entirely manual.

In [7], the authors locate the longest multiword expression leading to a concept. They derive concepts, synonyms and generalized concepts by utilizing ontology. Since concepts are predominantly noun phrases, a Part Of Speech (POS) analyzer is used to reject unlikely concepts. The set of terms and their lexically derived concepts is used to train and test an AdaBoost classifier. Using the Reuters-21578 [14], OHSUMED [15] and FAODOC [16] corpora, the authors report that the combined feature set of terms and derived concepts yields noticeable improvements over the classic term stem representation.

In [17] the authors select a set of likely positive examples from an unlabeled dataset using co-training. They also improve the positive examples with semantic features. The expanded positive dataset as well as the unlabeled negative examples together train the SVM classifier, thus increasing its reliability. For the final classification, the TF-IDF values of all the semantic features used in the positive data sets are evaluated in test documents. The authors’ experiments on the Reuters-21578 [14] and Usenet articles [18] corpora reveal that using positive and unlabeled negative datasets combined with semantic features gives improved results.

Ernandes et al [11] propose a term weighting algorithm that recognizes words as social networks and enhances the default score of a word by using a context function that relates it to neighboring high scoring words. Their work has been suitably applied for single word QA as demonstrated for crossword solving.

IV. PROPOSED WORK

Our proposal taps the potential of lexical and contextual semantics at two tiers:

A. Keyword extraction:
Automatic keyword generation frees the user from the encumbrance of inputting them manually. Our scheme employs lexical references derived from WordNet [19], implicit context as derived from an LSA and vicinity based context as given by the terms surrounding already known keywords. The three driving forces together yield a cohesive set of keywords. The set of extracted keywords for each category serves as a template for identifying representative documents for that category. These documents are utilized as supervisors to train an SVM classifier.

B. Document representation:
Once training documents are identified, these and the remaining unlabeled test documents must be represented by a compact feature set. As during the keyword extraction phase, we use lexical relationships to identify generic
concepts, synonyms and specialized versions of terms. Taking a further step of refinement, we use anaphoric referential phrases and strong conjunctions to identify the length of cohesiveness or lexical chain of each salient concept. These parameters are utilized to fine tune TF-IDF feature values. Only those features whose values cross a stipulated threshold finally represent the document.

The algorithm, Context-driven Text Classification (CTC), is described in the pseudo-code in Figure 1.

Step 1: Removal of stop words and stemming:
The first step is pre-processing of documents. Stop words, i.e. those words that have negligible semantic significance such as ‘the’, ‘an’, ‘a’ etc., are removed. Weak conjunctions such as ‘and’ are removed but the strong conjunctions ‘therefore’ and ‘so’ are retained so that their contribution to lexical chaining can be utilized. Next the system carries out stemming, viz. the process of reducing inflected words to a root word form. The entire set of documents is partitioned into a one-third training set and a two-thirds testing set.

Step 2: Keyword extraction:
Words that bear a strong semantic connection with category names are keywords. This concept can be extended to hypothesize that words strongly associated with identified keywords are themselves good candidates as keywords. We apply these principles to extract keywords from the following sources.

Keyword set from lexical references:
Words that are lexically related to a category c are collected from the WordNet resource. Taking a cue from [12], we use only the sense that corresponds to the category’s meaning. First, synonyms are extracted. Then hypernyms of each term are transitively located to allow generalization. Hypernyms are collected up to a pre-specified level k so that term meanings are not obfuscated by excessive generalization. Next, hyponyms at the immediate lower level are located. At the end of this process, the system is armed with an initial set of keywords that explicitly refer to category names.

Keyword set from implicit context:
Latent Semantic Analysis (LSA) is a well known technique to help identify the implicit contextual bonding between category names and input documents. LSA uses the Singular Value Decomposition (SVD) to cull out groups of words which have high vector similarity in the LSA space [20]. This step generates a new set of contextually related keywords AKimc(c) for each category c.

Augmented keyword sets from vicinity context:
Words in the vicinity of keywords weave a semantic thread that correlates with the category. LSA does club together words co-occurring anywhere in a document but does not highlight semantics inherently present in neighboring words. The idea is to locate those words surrounding keywords that have a high potential for becoming keywords themselves. Given a keyword set, the algorithm carries out a syntactic parsing of the sentences containing keywords. Since concepts are usually represented by noun phrases, they are separated out. The number of instances of each vicinity related pair is counted. If this exceeds a pre-decided threshold τv, the keyword set is augmented with the new term. The two augmented keyword sets are now AKlex(c) and AKimc(c).

Step 3: Evaluate documents’ representative index:
The cosine similarity, which gives the cosine of the angle between two vectors, has been used in [12,22] to match documents with categories. However, since the keyword vectors are equally weighted, the cosine similarity will favor documents with an equal number of instances of each keyword over others. Also, a document with more instances of a keyword may receive the same similarity score as a document with fewer instances of the keyword. An appropriate metric should reflect the fact that any document containing more keywords from a category is more representative of it, and a document with more instances of each keyword deals with the category in greater detail. Given a keyword list AK(c) and a document d, we multiply two factors to derive its representative index ρ(AK(c),d): (1) the fraction of the total number of category keywords present in d, and (2) the ratio of the average number of keywords in d to the sum total of these averages for all documents. Thus:

\[
\rho(AK(c), d) = \frac{|AK(d)|}{|AK(c)|} \cdot \frac{\sum_{i} TF\_IDF(w_i \in AK(d))}{\sum_{k} \sum_{i} TF\_IDF(w_i \in AK(d_k))}
\]

Here AK(d) denotes the keywords of AK(c) that occur in d, and d_k ranges over all documents.

Step 4: Assigning categories to documents:
The next step labels documents with categories. The document with the highest r-score (its representative index ρ) for a category is assigned to it. Thereafter, all documents whose r-scores for that category are within a preset category assignment factor F of this maximum score are also assigned to it.

Step 5: Document feature representation:
The TC system utilizes contextual information to derive the feature set values of documents in the following ways:

Generalization with lexical references:
As a document is browsed, new terms encountered are annotated with their synonyms, hypernyms for k levels up and hyponyms for k levels down. A rescanning of the document replaces any of the lexical references encountered with their equivalent base words. Term counts are simultaneously updated. When all documents have been tagged, TF-IDF values are evaluated.

Creating lexical chains:
Vertical search engines focus on a particular topic or text segment only. They need TC methods with greater depth of
search. This is in contrast with generic Web search engines. Towards this goal, we recognize that the extent of coverage of a concept is as important as its frequency of occurrence. In our framework, we have introduced an optional step that can be invoked for vertical categorization requirements.

Lexical chaining:
A lexical chain links consecutive sentences that are related through repetition of a concept term or its lexical references (iteration), anaphoric references and strong conjunctions [21]. Besides iteration, resolved anaphoric references also lead to the next link in the chain. Strong conjunctions such as ‘therefore’ or ‘so’ serve as final links. When a gap of a few sentences occurs, it indicates termination of a local chain.

Step 6: Classifier Training:
The documents output by Step 4 are positively labeled documents. They are used to train a set of binary SVM classifiers, one per category. All documents labeled with a given category are employed to train its classifier.

Step 7: Testing:
The test documents are applied to the trained classifiers to be finally assigned their appropriate categories.

V. DISCUSSION

The TC system proposed above differs from reported schemes with the following innovations.

A. Application of context for TC:
In this paper, context driven information is utilized for keyword generation as well as for document representation.

B. Keyword extraction:
We use a three pronged approach to extracting keywords automatically, starting with only category names, using:
i. An LSA space vector similarity between documents and category names,
ii. Deriving lexical references to category names, and
iii. Utilizing the strong semantic connection between concepts in the vicinity of identified keywords.
While the first method has been tried out in [22] and the first two combined have been reported in [12], our algorithm augments both kinds of keywords by tapping distance based context. The combined approach for keyword extraction is geared towards an expanded and more cohesive keyword set.

Document to category matching:
In [12,22], the authors have used the cosine similarity metric to map documents to categories. This does not truly capture their faithfulness to a given category. The r-score used here encodes the extent to which a document represents a category. It encapsulates both parameters: the fraction of the keywords present in a document and the weighted average keyword frequency.

Weighting TF-IDF with length of lexical chain:
To allow categorization for applications that require a vertical search based on a topic, we use lexical chaining, which allows us to measure the extent of coverage of a concept. Since it is compute-intensive, we offer this feature as a user-requested option.

Tunable guiding parameters:
Parameters such as the maximum level of lexical referencing, the vicinity score threshold and the category assignment factor can be experimentally tuned. There is flexibility to adjust them for different applications and input categories.

VI. CONCLUSION

In conclusion, we have proposed a comprehensive scheme for context based TC that starts its job with only category names. A cohesive set of keywords is generated by exploiting three kinds of semantic cohesion: lexical references, implicit context and vicinity induced cohesion. Semantic cohesion is also utilized for representing documents as context oriented feature sets. While lexical references bring in generalization, an optional lexical chaining scheme allows the depth of coverage of concepts to influence the classification decision. The application of context at various levels gives a framework for more meaning-driven classification which cannot be expected from a purely bag-of-words approach.

REFERENCES

[1] F. Sebastiani, “Text categorization”, Trans. of State-of-the-art in Science and Engineering, Procs Text Mining and its Applications to Intelligence, CRM and Knowledge Management, volume 17, 2005, WIT Press.
[2] S. Pradhan et al., “Support vector learning for semantic argument classification,” Machine Learning, 60, pp 11-39, 2005.
[3] William W. Cohen and Y. Singer, “Context Sensitive Learning Methods for Text Categorization,” ACM Transaction on Information Systems, Vol 17 No. 2, Pages 141-173, 1999.
[4] I. Androutsopoulos et al., “Learning to filter spam mail: a Bayes and a memory based approach,” Procs of the workshop “Machine Learning and Textual Information Access”, 4th European Conference on Principles and Practice of Knowledge Discovery in Databases, 2000.
[5] Q. Wang et al, “SVM Based Spam Filter with Active and Online Learning”, Procs. of the TREC Conference, 2006.
[6] Jiang Su, Harry Zhang, “A Fast Decision Tree learning algorithm”, Procs of the 21st conf. on AI, Vol. 1, Boston, Pages 500-505, 2006.
[7] Stephen Bloehdorn et al, “Boosting for text classification with semantic features”, Procs. of the MSW 2004 workshop at the 10th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Aug 2004, p. 70-87.
[8] El-Sayed M. El-Alfy, Fares S. Al-Qunaieer, “A Fuzzy Similarity Approach for Automated Spam Filtering”, Procs. of the
2008 IEEE/ACS International Conference on Computer Systems and Applications - Volume 00, Pages 544-550, 2008.
[9] P.I. Nakov, P.M. Dobrikov, “Non-Parametric Spam Filtering Based on KNN and LSA”, Procs of the 33rd National Spring Conference, 2004.
[10] Yiming Yang et al, “A comparative Study on Feature Selection in Text Classification”, Proceedings of ICML-97, 14th International Conference on Machine Learning, pages 412-420, Nashville, US, Morgan Kaufmann Publishers, San Francisco, US, 1997.
[11] Marco Ernandes et al, “An Adaptive Context Based Algorithm for Term Weighting”, Proceedings of the 20th International Joint Conference on Artificial Intelligence, 2748-2753, 2007.
[12] Libby Barak et al, “Text Categorization from Category Name via Lexical Reference”, Proc. of NAACL HLT 2009: Short Papers, pages 33-36, June 2009.
[13] Diwakar Padmaraju et al, “Applying Lexical Semantics to Improve Text Classification”, http://web2py.iiit.ac.in/publications/default/download/inproceedings.pdf.9ecb6867-0fb0-48a5-8020-0310468d3275.pdf
[14] Reuters dataset: www.reuters.com
[15] OHSUMED test collection dataset: http://medir.ohsu.edu/~hersh/sigir-94-ohsumed.pdf
[16] FAODOC test collection dataset: www.tesisenxarxa.net/TESIS_UPC/AVAILABLE/TDX.../12ANNEX2.pdf
[17] Na Luo et al, “Using Co-training and semantic feature extraction for positive and unlabeled text classification”, 2008 International Seminar on Future Information Technology and Management Engineering, http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=04746478
[18] http://www.brothersoft.com/downloads/nzb-senet.html
[19] http://WordNet.princeton.edu/WordNet/download/
[20] http://en.wikipedia.org/wiki/Latent_semantic_analysis
[21] Halliday, M.A.K. and Hasan, R., Cohesion In English, Longman, 1976.
[22] Gliozzo et al, “Investigating Unsupervised Learning for Text Categorization Bootstrapping”, Proc. of Human Languages Technology Conference and Conference on Empirical Methods in Natural Languages, Pages 129-136, October 2005.
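
As an illustration of Steps 3 and 4 of Section IV, the following Python sketch computes the representative index ρ(AK(c), d) for a toy corpus and then applies the category assignment factor F. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`tfidf_tables`, `representative_index`, `assign_documents`) are hypothetical, the smoothed TF-IDF variant is our choice because the paper does not fix one, and "within a factor F of the maximum" is read here as "score at least F times the best score", with 0 < F ≤ 1.

```python
import math
from collections import Counter


def tfidf_tables(docs):
    """Return one {term: tf-idf weight} dict per tokenized document.

    Assumption: tf = count / document length and a smoothed
    idf = ln((1 + N) / (1 + df)) + 1; the paper does not specify a variant.
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    tables = []
    for doc in docs:
        counts = Counter(doc)
        tables.append({
            term: (count / len(doc))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return tables


def representative_index(keywords, tables):
    """Compute rho(AK(c), d) for every document d, given the keyword set AK(c)."""
    keywords = set(keywords)
    # Summed TF-IDF weight of the category keywords found in each document.
    weights = [
        sum(w for term, w in table.items() if term in keywords)
        for table in tables
    ]
    total = sum(weights)  # Denominator: the same sum taken over all documents.
    scores = []
    for table, weight in zip(tables, weights):
        # Factor 1: fraction of the category keywords present in the document.
        coverage = len(keywords.intersection(table)) / len(keywords) if keywords else 0.0
        # Factor 2: the document's share of the total keyword weight.
        share = weight / total if total else 0.0
        scores.append(coverage * share)
    return scores


def assign_documents(scores, factor):
    """Return indices of documents whose score is at least factor * best score."""
    best = max(scores, default=0.0)
    if best <= 0.0:
        return []  # No document mentions any keyword of this category.
    return [i for i, score in enumerate(scores) if score >= factor * best]


# Toy corpus: documents are assumed to be stemmed and stop-word free already.
demo_docs = [
    ["market", "stock", "trade", "stock"],
    ["match", "goal", "team"],
    ["stock", "bank", "goal"],
]
demo_keywords = {"stock", "market", "trade", "bank"}

demo_scores = representative_index(demo_keywords, tfidf_tables(demo_docs))
print(demo_scores)
print(assign_documents(demo_scores, factor=0.5))
```

The first document contains three of the four keywords and carries most of the keyword weight, so it scores highest; the second contains none and scores zero. Lowering `factor` admits more documents into the training set for the category, which is the trade-off the tunable category assignment factor controls.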