Text Mining : Techniques and its Application

Article · December 2014

IJETI International Journal of Engineering & Technology Innovations, Vol. 1 Issue 4, November 2014
ISSN (Online): 2348-0866
www.IJETI.com

Text Mining: Techniques and its Application
Shilpa Dang (1), Peerzada Hamid Ahmad (2)
(1) Assistant Professor, (2) Research Scholar
M.M Institute of Computer Technology & Business Management
Maharishi Markandeshwar University, Haryana, India.

Abstract
Text mining has become an exciting research field because it tries to discover valuable information from unstructured texts. Unstructured texts, which contain vast amounts of information, cannot simply be used for further processing by computers. Therefore, specific processing methods, algorithms and techniques are needed to extract this valuable information, and text mining provides them. In this paper, we discuss the general idea of text mining and compare its techniques. In addition, we briefly discuss a number of text mining applications that are in use at present and are expected to be used in the future.

KEYWORDS: Retrieval, Extraction, Categorization, Clustering, Summarization.

INTRODUCTION
Text mining has become an important research area. A very large amount of information is stored in different places in unstructured form. Approximately 80% of the world's data is in unstructured text [1]. This unstructured text cannot easily be used by computers for further processing, so techniques are needed to extract valuable information from it. The extracted information is then stored in a text database format that contains structured fields and a few unstructured fields. Text can be found in mails, chats, SMS, newspaper articles, journals, product reviews, and organization records [2]. In almost all institutions, government sectors, organizations and industries, information is stored in electronic form.

Text mining goes by a variety of names, such as text data mining, knowledge discovery [4] from textual databases, and intelligent text analysis, all of which refer to extracting or retrieving valuable information from unstructured text. It can be viewed as an extension of data mining or knowledge discovery from (structured) databases. Text mining discovers new, previously unidentified or hidden pieces of information in textual data by extracting them with different techniques. Text mining is a multidisciplinary field that involves information retrieval, text analysis, information extraction, categorization, clustering, visualization, data mining, and machine learning.

There are five basic text mining steps:
a) Collect information from unstructured data.
b) Convert the collected information into structured data.
c) Identify patterns in the structured data.
d) Analyze the patterns.
e) Extract the valuable information and store it in the database.

[Figure: flow diagram with the boxes Unstructured Text, Structured Data, Identified Data, Analysis of Data, Extraction, Database]

Fig 1: Processing of Text Mining
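
The five steps listed above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the sample reviews, the function names (`collect`, `structure`, `identify_patterns`, `analyze`, `store`), the small stop-word list, and the choice of term frequency as the "pattern" are all hypothetical, and an in-memory SQLite table stands in for the text database.

```python
import re
import sqlite3
from collections import Counter

# Assumed toy stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "is", "and", "of", "to", "in", "was", "but"}

def collect():
    """Step a: collect unstructured text (hard-coded sample reviews here)."""
    return [
        "The battery life of the phone is great.",
        "Battery drains fast, but the screen is great.",
        "Great screen and great camera.",
    ]

def structure(docs):
    """Step b: convert each text into structured data (document id, token list)."""
    rows = []
    for doc_id, text in enumerate(docs):
        tokens = [t for t in re.findall(r"[a-z]+", text.lower())
                  if t not in STOP_WORDS]
        rows.append((doc_id, tokens))
    return rows

def identify_patterns(rows):
    """Step c: identify a simple pattern, here term frequencies over all documents."""
    counts = Counter()
    for _, tokens in rows:
        counts.update(tokens)
    return counts

def analyze(counts, min_count=2):
    """Step d: analyze the pattern by keeping terms seen at least min_count times."""
    return {term: n for term, n in counts.items() if n >= min_count}

def store(frequent_terms):
    """Step e: store the extracted information in a database (in-memory SQLite)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE terms (term TEXT PRIMARY KEY, freq INTEGER)")
    conn.executemany("INSERT INTO terms VALUES (?, ?)", frequent_terms.items())
    conn.commit()
    return conn

conn = store(analyze(identify_patterns(structure(collect()))))
result = conn.execute(
    "SELECT term, freq FROM terms ORDER BY freq DESC, term"
).fetchall()
print(result)  # [('great', 4), ('battery', 2), ('screen', 2)]
```

Each function corresponds to one step, so any stage (for example, a richer pattern than term frequency) could be swapped out without touching the others.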


Basic Text Mining Technologies

Information Retrieval:
The best-known information retrieval (IR) systems are search engines such as Google, which identify the documents on the World Wide Web that are associated with a set of given words. IR is regarded as an extension of document retrieval in which the returned documents are processed to extract the information that is crucial for the user [3]. Document retrieval is thus followed by a text summarization stage that focuses on the query posed by the user, or by an information extraction stage. IR in the broader sense deals with the whole range of information processing, from information retrieval to knowledge retrieval [8]. It is a relatively old research area in which the first attempts at automatic indexing were made in 1975. It gained increased attention with the growth of the World Wide Web and the need for sophisticated search engines.

Information Extraction:
The goal of information extraction (IE) methods is the extraction of useful information from text. IE covers the extraction of entities, events and relationships from semi-structured or unstructured text. Useful information such as the names of persons, locations and organizations is extracted without a full understanding of the text [4]. IE is concerned with the extraction of semantic information from the text. It can be described as the construction of a structured representation of selected relevant pieces of information drawn from texts.

Categorization:
Text categorization is a kind of "supervised" learning in which the categories are known in advance and are fixed beforehand for each training document. Its main intended use was originally the indexing of scientific literature by means of a controlled vocabulary. It was only in the 1990s that the field fully developed, with the availability of continuously increasing numbers of text documents in digital form and the need to organize them for easier use [5]. Categorization is the assignment of natural language documents to a predefined set of topics according to their content. Given a collection of text documents, it is the process of finding the correct topic or topics for each document. Nowadays automated text categorization is applied in a variety of contexts, from the classical automatic or semiautomatic indexing of texts to personalized delivery of advertisements, spam filtering, categorization of Web pages under hierarchical catalogues, automatic metadata generation, detection of text genre, topic tracking and many others [6]. Research on automated text categorization dates back to the early 1960s. Today it is a hot topic in machine learning research.

Clustering:
Clustering is one of the most interesting and important topics in text mining. Its aim is to find intrinsic structures in information and arrange them into significant subgroups for further study and analysis. It is an unsupervised process through which objects are classified into groups called clusters. The problem is to group a given unlabeled collection into meaningful clusters without any prior information. Any labels associated with objects are obtained solely from the data. For example, document clustering assists in retrieval by creating links between related documents, which in turn allows related documents to be retrieved once one of the documents has been deemed relevant to a query [8].
Clustering is useful in many application areas such as biology, data mining, pattern recognition, document retrieval, image segmentation, pattern classification, security, business intelligence and Web search. Cluster analysis can be used as a standalone text mining tool to gain insight into the distribution of the data, or as a pre-processing step for other text mining algorithms that operate on the detected clusters.

Summarization:
Text summarization is an old challenge in text mining, but one still in dire need of researchers' attention in the areas of computational intelligence, machine learning and natural language processing. Text summarization is the process of automatically creating a compressed version of a given text that provides useful information for the user. In a big organization or company, researchers do not have time to read all documents, so they summarize documents and highlight the main points [4]. A summary is a text produced from one or more texts that contains a significant portion of their information, is reduced in length, and keeps the overall meaning of the original texts. Text summarization involves various methods that employ text categorization, such as neural networks, decision trees, semantic graphs, regression models, fuzzy logic and swarm intelligence. However, all of these methods share a common problem: the quality of the resulting classifiers is variable and highly dependent on the type of text being summarized.

Comparison of Text Mining Techniques:
Text mining uses a number of techniques, each of which plays an important role, and the techniques differ from one another. The information retrieval technique works on unstructured text, from which it retrieves valuable information, whereas the information extraction technique extracts information from a structured database. The summarization technique summarizes a document, reducing its length while keeping its meaning unchanged.
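
One simple way to picture the summarization technique described above is extractive summarization: score each sentence by how frequent its words are in the whole text, then keep the top-scoring sentences in their original order. The sketch below is only an illustration under stated assumptions; the function name `summarize`, the stop-word list and the scoring rule are hypothetical choices made here, not methods taken from the paper, and the approach is far simpler than the neural-network or fuzzy-logic methods mentioned in the Summarization section.

```python
import re
from collections import Counter

# Assumed toy stop-word list for the example.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "in", "and", "it", "from"}

def content_words(sentence):
    """Lower-cased words of a sentence with stop words removed."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower())
            if w not in STOP_WORDS]

def summarize(text, max_sentences=1):
    """Keep the max_sentences sentences whose words are most frequent overall."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        words = content_words(sentence)
        # Average word frequency, so long sentences are not favoured automatically.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])  # restore original sentence order
    return " ".join(sentences[i] for i in keep)

text = ("Text mining extracts information from text. "
        "The weather was pleasant. "
        "Mining text needs good tools.")
print(summarize(text, max_sentences=1))
# Text mining extracts information from text.
```

The output is shorter than the input and keeps the sentence that carries most of the repeated vocabulary, which is the "reduced length, same overall meaning" property in its most basic form.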


Categorization is a supervised process that assigns documents to a predefined set of categories according to their contents. The responsiveness and flexibility of a post-coordinate system effectively prohibit the establishment of meaningful relationships, because a category is created by an individual, not by the system. Clustering, in contrast, is used to find intrinsic structures in information and arrange them into related subgroups for further study and analysis. It is an unsupervised process through which objects are classified into groups called clusters. Clustering deals with high-dimensional data and finds interesting patterns associated with the data. Another feature is that a cluster is a group of similar data items together with the relationships between them.
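
The contrast between supervised categorization and unsupervised clustering can be made concrete with a small sketch. Everything here is an assumption made for illustration: the bag-of-words vectors, the cosine similarity measure, the nearest-centroid rule in `categorize`, the single-pass grouping in `cluster` and the threshold value are hypothetical choices, not the tools or algorithms named in the paper. The point is only that `categorize` needs labelled training documents, while `cluster` receives no labels at all.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words vector: word -> count."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def categorize(doc, labelled_docs):
    """Supervised: choose the predefined category whose training text is most similar."""
    centroids = {}
    for text, label in labelled_docs:
        centroids.setdefault(label, Counter()).update(vectorize(text))
    vec = vectorize(doc)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))

def cluster(docs, threshold=0.3):
    """Unsupervised: single-pass grouping; groups emerge from the data alone."""
    clusters, centroids = [], []  # clusters hold document indices
    for i, text in enumerate(docs):
        vec = vectorize(text)
        sims = [cosine(vec, c) for c in centroids]
        if sims and max(sims) >= threshold:
            best = sims.index(max(sims))
            clusters[best].append(i)
            centroids[best].update(vec)
        else:
            clusters.append([i])          # start a new cluster
            centroids.append(Counter(vec))
    return clusters

training = [
    ("stock market shares rise", "finance"),
    ("bank raises interest rates", "finance"),
    ("team wins football match", "sport"),
    ("player scores winning goal", "sport"),
]
print(categorize("football team scores goal", training))  # sport

unlabelled = ["stock market falls", "football match today",
              "stock market rises", "football match result"]
print(cluster(unlabelled))  # [[0, 2], [1, 3]]
```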

Table 1: Comparison of text mining techniques

Technique       Characteristics                                   Tools

Retrieval       Retrieves valuable information from               Intelligent Miner,
                unstructured text                                 Text Analyst

Extraction      Extracts information from structured database     Text Finder,
                                                                  Clear Forest Text

Summarization   Reduces length while keeping the main points      Topic Tracking Tool,
                and overall meaning unchanged                     Sentence Ext Tool

Categorization  Document-based categorization                     Intelligent Miner

Clustering      Clusters collections of documents; clustering,    Carrot,
                classification and analysis of text documents     Rapid Miner
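
As a concrete illustration of the Retrieval row in the comparison table, the sketch below builds a small inverted index and returns the documents associated with a set of given words. It is a minimal sketch under stated assumptions: the function names `build_index` and `search`, the sample documents and the ranking rule (number of matched query words) are hypothetical, and real search engines use far richer ranking than this.

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Ids of documents containing any query word, best match first."""
    hits = defaultdict(int)
    for word in set(query.lower().split()):
        for doc_id in index.get(word, ()):
            hits[doc_id] += 1  # one point per distinct query word matched
    # Sort by number of matched words (descending), then by id for stable output.
    return sorted(hits, key=lambda d: (-hits[d], d))

docs = {
    1: "text mining discovers patterns in text",
    2: "data mining works on structured databases",
    3: "clustering groups similar documents",
}
index = build_index(docs)
print(search(index, "text mining"))  # [1, 2]
```

Document 1 matches both query words and is ranked first; document 2 matches only "mining"; document 3 is not returned.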

Applications of text mining

Academic applications
Discovering the patterns and trends in journals and proceedings from a huge volume of papers is an essential task in the research field [1]. It matters to publishers who hold large databases of information that need indexing for retrieval. This is especially true in scientific disciplines, in which highly specific information is often contained within written text. Text mining tools are applied to discover trends on the different topics that exist in the proceedings and to show how they change over time; this is also known as topic tracking. Initiatives have therefore been taken, such as Nature's proposal for an Open Text Mining Interface (OTMI) and the National Institutes of Health's common Journal Publishing Document Type Definition (DTD), that would provide semantic cues to machines to answer specific queries contained within text without removing publisher barriers to public access.

Bioinformatics
Research in bioinformatics has grown, and the biomedical literature has become an important application area for text mining. In 2005 the first textbook on biomedical text mining appeared, which reported an industry suggestion that 90% of drug targets are derived from the literature. The motivation for this work comes primarily from biologists, who find themselves faced with a massive increase in the number of publications in their field; keeping up with the related literature is nearly impossible for many scientists [7]. The goal of text mining in this area is to allow biomedical researchers to extract knowledge from the biomedical literature and so facilitate new innovation in a more efficient manner. One online text mining application for the biomedical literature combines biomedical text mining with network visualization as an Internet service. Bio-entity recognition aims to identify and classify technical terms in the domain of molecular biology that correspond to instances of concepts that are of interest to biologists. Entity recognition is becoming increasingly
important with the massive increase in reported results due to high-throughput experimental methods. It can be used in several higher-level information access tasks such as relation extraction, summarization and question answering [10].

Copyright and Customer Profile Analysis
Copyright analysis has developed into a large application area in recent years because of the increased number of copyright applications. Supervised and unsupervised techniques are applied to analyze copyright documents and to support companies, and in some countries also the copyright office, in their work. The challenges in copyright analysis are the length of the documents, which are longer than the documents usually used in text classification, and the large number of available documents in a corpus [6].
Companies use text mining to draw out the occurrences and instances of key terms in large blocks of text such as articles, Web pages and complaint forums. The software converts the unstructured data into topic structures and semantic networks, which are important tools for drilling into information. By studying the semantic network, one can learn the general quality of the complaints and the reasons for complaining. The software also finds common words used in complaints and their relationships to other words in the text via semantic weight [9, 10].

Internet Security
The use of text mining tools in the security field has become an important matter. Many text mining software packages are marketed for security applications, particularly the monitoring and analysis of online plain-text sources such as Internet news, blogs and mail for security purposes [7]. Text mining is also involved in the study of text encryption/decryption. Government agencies are investing considerable resources in the surveillance of all kinds of communication, such as email and online chats. Email is used in many legitimate activities such as the exchange of messages and documents. Unfortunately, it can also be misused, for example to distribute unwanted junk mail or to send offensive or bullying material. The explosive growth of unsolicited e-mail, more commonly known as spam, over the last years has constantly undermined the usability of e-mail. One solution is offered by anti-spam filters. Most commercially available filters use black-lists and hand-crafted rules. Since time is crucial, and given the scale of the problem, it is infeasible to monitor emails or online chats manually. Automatic text mining tools therefore offer considerable promise in this area [10].

Conclusion
Text mining generally refers to the process of extracting valuable information from unstructured text. In this survey of text mining, several text mining techniques and their applications in various fields have been discussed. A comparison of different text mining techniques has been shown, which can be further enhanced. Text mining algorithms give us useful, structured data, which can reduce time and cost. Identifying hidden information in social network sites, bioinformatics, internet security and similar fields using text mining is a major challenge. The advancement of web technologies has led to a tremendous interest in the classification of text documents containing links or other information.

References:

[1] Vallikannu Ramanathan, T. Meyyappan, "Survey of Text Mining", International Conference on Technology and Business and Management, March 2013, pp. 508-514.
[2] Vidya K A, G Aghila, "Text Mining Process, Techniques and Tools: an Overview", International Journal of Information Technology and Knowledge Management, July-December 2010, Volume 2, No 2, pp. 613-622.
[3] R. Sagayam, S. Srinivasan, S. Roshini, "A Survey of Text Mining: Retrieval, Extraction and Indexing Techniques", International Journal of Computational Engineering Research (ijceronline.com), Vol. 2, Issue 5.
[4] Vishal Gupta and Guruprit Lehal, "A Survey of Text Mining Techniques and Applications", Journal of Emerging Technologies in Web Intelligence, Vol. 1, No. 1, August 2009.
[5] Hearst, M. A. (1997), "Text data mining: Issues, techniques, and the relationship to information access", presentation notes for UW/MS workshop on data mining, July 1997.
[6] Rashmi Agrawal, Mridula Batra, "A Detailed Study on Text Mining Techniques", IJSCE, ISSN: 2231-2307, Vol. 2, Issue-6, January 2013.
[7] Falguni N. Patel, Neha R. Soni, "Text mining: A Brief survey", International Journal of Advanced Computer Research, ISSN (Online): 2277-7970, Vol. 2, No. 4, Issue-6, Dec 2012.
[8] Rahul Patel, Gaurav Sharma, "A survey on text mining techniques", International Journal of Engineering and Computer Science, ISSN: 2319-7242, Vol 3 Issue 5, May 2014, pp. 5621-5625.
[9] Seth Grimes, "The developing text mining market", white paper, Text Mining Summit, Alta Plana Corporation, Boston, 2005, pp. 1-12.
[10] Shaidah Jusoh and Hejab M. Alfawareh, "Techniques, Applications and Challenging Issue in Text Mining", IJCSI, ISSN (Online): 1694-0814, Vol. 9, Issue-6, No. 2, November 2012.
