Research Journal of Computer and Information Technology Sciences, E-ISSN 2320 – 6527
Vol. 4(2), 1-4, February (2016) Res. J. Computer and IT Sci.
Big Data Analytics
Hemlata and Preeti Gulia
M.C.A. Department, M.D. University, Rohtak, Haryana, India
hemlatachahal@gmail.com
Available online at: www.isca.in, www.isca.me
Received 20th June 2015, revised 7th October 2015, accepted 29th January 2016
Abstract
Big data analytics refers to the process of analyzing huge volumes of data, or big data. Big data is collected from a large
assortment of sources, such as social networks, videos, digital images, and sensors. The major aim of big data analytics is
to discover new patterns and relationships that might otherwise remain invisible, and to provide new insights about the users
who created the data. A number of tools, both professional and non-professional, are available for mining and analyzing
big data. In this paper, we summarise different big data analytic methods and tools.
Keywords: Big Data, Big Data Mining, R, Rapid-I Rapid Miner, KNIME.
Introduction

Big data means datasets that cannot be recognized, obtained, managed, analyzed, and processed by present tools.
Different definitions of big data have been given by its different users and analysts, such as research scholars,
data analysts, and technical practitioners.

According to Apache Hadoop, “Big data is a dataset which could not be captured, managed, and processed by general
computers within an acceptable scope” [1].

Big data was first characterized in 2001, when Doug Laney defined the 3Vs model, i.e., Volume, Variety and
Velocity [2]. Although the 3Vs model was not originally intended as a definition of big data, Gartner and many other
organizations, like IBM [3] and Microsoft [4], still use the “3Vs” model to define big data [5]. In the “3Vs” model,
Volume means that the dataset is so large that it is very difficult to analyze; Velocity means that data must be
collected and analyzed rapidly so that it can be utilized to the maximum; Variety refers to the different types of
data, i.e., structured, semi-structured and unstructured data such as audio, video, webpages, and text. IDC
(International Data Corporation), one of the most dominant leaders in the research fields of big data, takes a
different view. According to an IDC report of 2011, “Big Data technologies describe a new generation of technologies
and architectures, designed to economically extract value from very large volumes of a wide variety of data, by
enabling the high-velocity capture, discovery, and/or analysis” [6]. According to this definition, the characteristics
of big data are the 4Vs: Volume (huge volume), Variety (various types and structures of data), Velocity (quick
creation), and Value (great value but very low density).

This 4Vs definition highlights the meaning of big data, i.e., examining the concealed values. The definition
specifies the most crucial point of big data, i.e., to explore new values from datasets [7,8].

Big Data Processing Framework [2]

Big Data processing framework: META Group Research gave a three-tier structure built around a “Big Data mining
platform” (Tier I). Tier I emphasizes low-level data accessing and computing. Tier II emphasizes information sharing
and privacy, and the domains and knowledge of big data applications. Tier III emphasizes mining algorithms.

Figure-1
Big Data Framework Processing [2]
International Science Community Association 1
Big Data Analysis [1]

Big data analysis mainly involves analytical methods for big data, the systematic architecture of big data, and big
data mining and analysis software. Data analysis is the most important step in big data, because it explores
meaningful values and supports suggestions and decisions. Potential values can be explored through data analysis [7].
However, data analysis is a wide area that is dynamic and very complex.

Traditional Data Analysis: Traditional data analysis means the proper use of statistical methods to analyze huge
volumes of data, and to explore and refine the data hidden in a complex dataset, so that the value of the data can be
maximized. Data analysis guides development plans for a country, predicts the demands of customers, and forecasts
market trends for organisations. Big data analysis may be regarded as the analysis of a special kind of data, so most
of the traditional methods are still used for big data analysis. Representative traditional data analysis methods,
drawn from statistics and computer science, are: Factor Analysis, Cluster Analysis, Correlation Analysis, Regression
Analysis, A/B Testing, Statistical Analysis, and Data Mining Algorithms.

Big Data Analytic Methods [1]

In the big data era, everybody wants to concentrate on extracting key value and information from huge datasets to
achieve the objectives of their organisation. Nowadays, the main methods of big data analysis are:

Bloom Filter: A Bloom filter consists of a collection of hash functions. The main concept of this method is that a
bit array is used to store the hash values of the data rather than the data itself. The bit array is in essence a
bitmap index that uses hash functions for lossy compressed storage of the data. Its advantages are high space
efficiency and high query speed. Its disadvantage is misidentification: a query may report a value as present
although it was never stored (a false positive).

Hashing: Hashing transforms data into shorter numeric values or index values. Hashing has advantages like fast
reading, writing, and querying, but it is very difficult to find a sound hash function.

Index: An index is an effective method for cutting the cost of disk reads and writes, and for increasing the speed
of query, insertion, deletion, and modification. The disadvantage of this method is the extra cost of storing the
index files.

Trie: A variant of the hash tree, also called a trie tree. This method is mostly used for fast retrieval. To improve
query efficiency, the common prefixes of character strings are used to reduce comparisons; for example, “data” and
“database” share the prefix “data”, which is stored only once.

Parallel Computing: In contrast to serial computing, parallel computing refers to the simultaneous utilisation of
several resources to complete a task. The main idea behind this method is to decompose a problem and allocate the
parts to different processes in order to achieve co-processing. Some parallel computing models and low-level tools
are MPI (Message Passing Interface), MapReduce, and Dryad. These low-level tools are very difficult to learn and use.
For this reason, high-level parallel programming tools have been developed on top of them, such as Sawzall, Pig, and
Hive for MapReduce, and Scope and DryadLINQ for Dryad.

Tools for Big Data Mining and Analysis [1]

Various commercial and open source software packages are available for big data mining and analysis. The five most
frequently used are:

R [1]: R is an open source environment designed for visualization, analysis and data mining. R is a collection of
software facilities for [9]: i. Reading and manipulating data, ii. Computation, iii. Conducting statistical analyses
and iv. Displaying the results.

R is an implementation of the S language, which was developed by AT&T Bell Labs for data exploration and statistical
analysis. When computationally intensive tasks are processed, modules written in C, C++ and Fortran can be called in
the R environment. R objects can also be called directly from C. According to the KDnuggets survey of 2012, R is more
popular than S. In a 2012 survey titled “Design languages you have used for data mining/analysis in the past year”,
it ranked first, above Java and SQL. After the success of R, Teradata and Oracle also launched products that
support R.

Excel [1]: Excel, part of Microsoft Office, has robust data computing and statistical analysis capabilities. Plug-ins
such as the Analysis ToolPak and the Solver Add-in are installed with Excel and provide many data analysis
capabilities. Excel is commercial software.

Rapid-I RapidMiner [1]: According to KDnuggets in 2011, RapidMiner was ranked number 1 and was used more frequently
than R. RapidMiner is open source software used for machine learning, data mining, and predictive analysis. It was
developed at the University of Dortmund in 2001 and has since been maintained by Rapid-I GmbH. Data mining programs
developed in RapidMiner follow the process of Extract, Transform and Load (ETL). Written in Java, RapidMiner
integrates WEKA's methods and can work together with R. The process flow may be viewed as a factory production line
in which data is the input and the model is the output. RapidMiner is a flexible analysis tool that offers a large
variety of methods, such as statistical analysis, correlation analysis, regression analysis, and cluster
analysis [10].

KNIME [1]: KNIME (Konstanz Information Miner) is an open-source platform for data consolidation, data processing,
analysis, and data mining [11]. KNIME creates data flows visually,
executes the analysis procedures, and provides results, models, and views. This process is carried out in a visual
environment. KNIME has a module-based architecture that can be expanded. Its processing units are not dependent on
data containers. KNIME nodes and views can also be expanded.
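
To make the idea of processing units that do not depend on data containers more concrete, the following minimal Python sketch models a data flow as a list of independent nodes, each of which takes a table and returns a new table. It is only an illustration of the modular principle and does not use KNIME or its API; the names `Node`, `drop_missing`, `keep_positive`, `add_ratio`, and `run_flow`, as well as the sample columns `clicks` and `views`, are assumptions made for this example.

```python
from typing import Callable, Dict, List, Optional

# A table is a plain list of rows; nodes never depend on how it was produced.
Row = Dict[str, Optional[float]]
Table = List[Row]
Node = Callable[[Table], Table]  # hypothetical "processing unit": table in, table out


def drop_missing(table: Table) -> Table:
    """Cleansing node: keep only rows in which no value is missing."""
    return [row for row in table if all(v is not None for v in row.values())]


def keep_positive(column: str) -> Node:
    """Filtering node factory: keep rows whose `column` value is greater than zero."""
    def node(table: Table) -> Table:
        return [row for row in table if row[column] > 0]
    return node


def add_ratio(numerator: str, denominator: str, name: str) -> Node:
    """Conversion node factory: add a derived column numerator / denominator."""
    def node(table: Table) -> Table:
        return [{**row, name: row[numerator] / row[denominator]} for row in table]
    return node


def run_flow(table: Table, nodes: List[Node]) -> Table:
    """Run the nodes in order; each node receives the previous node's output."""
    for node in nodes:
        table = node(table)
    return table


if __name__ == "__main__":
    raw = [
        {"clicks": 10, "views": 100},
        {"clicks": None, "views": 50},  # removed by drop_missing
        {"clicks": 3, "views": 0},      # removed by keep_positive("views")
    ]
    flow = [drop_missing, keep_positive("views"), add_ratio("clicks", "views", "ctr")]
    print(run_flow(raw, flow))
```

Because each node only receives a table and returns a table, a node can be added, removed, or reordered without changing the others, which is the property the modular design aims for.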
Figure-2
Flow of data in KNIME [12]

The three main principles of KNIME are [12]: i. Visual and interactive framework: Drag and drop can be used to
combine data flows from a variety of processing units. A variety of application models can be built from data
pipelines. ii. Modularity: Processing units and data containers should not depend on each other, in order to enable
easy distribution of computation and allow for independent development of different algorithms. iii. Easy
expandability: New nodes and views can be added to KNIME.

Written in Java, KNIME provides many functionalities that can be used as plug-ins. Using plug-ins, users can process
different kinds of files and pictures, and can integrate KNIME with other open source environments, like R and Weka.

Data Integration -> Data Cleansing -> Data Conversion -> Data Filtering -> Data Statistics -> Data Mining -> Data Visualization

Figure-3
Steps followed by KNIME

Weka/Pentaho [1]: The Waikato Environment for Knowledge Analysis, abbreviated as WEKA, is open-source data mining
software written in Java. Weka provides capabilities such as data processing, classification, regression, clustering,
and visualization. Pentaho is a popular open-source software suite for business intelligence. It has several tools
for analysis, data integration, and data mining.

Conclusion

Big data analytics is a hot research topic among database researchers as well as the business community. Several
methods for analysing big data currently exist, some of which we have described in this paper, but there is a lot of
scope to create new methods of analytics. Various tools and open source software are also available, some of which
we have mentioned briefly in the paper. Future research could compare these tools by applying them and determine
which is best in a particular situation. New tools can also always be sought and invented. Many more issues can be
investigated further, such as big data privacy and security, completeness, and data quality.

References

1. Min Chen, Shiwen Mao and Yunhao Liu (2014). Big Data: A Survey. Springer Science+Business Media New York,
   published online 22 January 2014.
2. Laney D (2001). 3-D data management: controlling data volume, velocity and variety. META Group Research Note,
   6 February 2001.
3. Olaiya Folorunsho (2013). Comparative Study of Different Data Mining Techniques Performance in Knowledge Discovery
   from Medical Database. International Journal of Advanced Research in Computer Science and Software Engineering,
   3(3), March 2013, ISSN: 2277 128X.
4. Zikopoulos P, Eaton C et al. (2011). Understanding big data: analytics for enterprise class hadoop and streaming
   data. McGraw-Hill Osborne Media.
5. Beyer M. Gartner says solving big data challenge involves more than just managing volumes of data. Gartner.
   http://www.gartner.com/it/page.jsp.
6. O. R. Team. Big data now: current perspectives from O'Reilly Radar. O'Reilly Media. Gantz J, Reinsel D (2011).
   Extracting value from chaos. IDC iView, 1–12.
7. Mayer-Schönberger V and Cukier K (2013). Big data: a revolution that will transform how we live, work, and think.
   Eamon Dolan/Houghton Mifflin Harcourt.
8. Duren Che, Mejdl Safran and Zhiyong Peng (2013). From Big Data to Big Data Mining: Challenges, Issues and
   Opportunities. Springer-Verlag Berlin Heidelberg.
9. Petra Kuhnert and Bill Venables (2011). An Introduction to R: Software for Statistical Modelling & Computing.
   CSIRO Mathematical and Information Sciences, Cleveland, Australia.
10. Sebastian Land and Simon Fischer (2012). RapidMiner 5: RapidMiner in academic use. 27th August.
11. Berthold MR, Cebron N, Dill F, Gabriel TR, Kötter T, Meinl T, Ohl P, Sieb C, Thiel K and Wiswedel B (2008).
    KNIME: the Konstanz information miner. Springer.
12. Michael R. Berthold et al. (2010). Knime: The Konstanz Information Miner. Technical Report, Altana Chair for
    Bioinformatics and Information Mining.
13. Raymond Gardiner Goss and Kousikan Veeramuthu (2010). Heading Towards Big Data - Building A Better Data Warehouse
    For More Data, More Speed, And More Users.
14. Xindong Wu, Xingquan Zhu, Gong-Qing Wu, Wei Ding (2014). Data Mining with Big Data. IEEE Transactions on
    Knowledge and Data Engineering, 26(1).
15. Bharti Thakur and Manish Mann (2014). Data Mining for Big Data: A Review. IJARCSSE, 4(5).
16. Avita Katal et al. (2013). Big Data: Issue, Challenge, Tools and Good Practices. IEEE.
17. Seref Sagiroglu and Duygu Sinanc (2013). Big Data: A review. IEEE, January.
18. Zaiying Liu, Ping Yang and Lixiao Zhang (2013). A Sketch of Big Data Technologies. IEEE Seventh International
    Conference on Internet Computing for Engineering and Science.
19. Wei Fan, Albert Bifet (2012). Mining Big Data: Current Status, and Forecast to the Future. SIGKDD Explorations,
    14(2).
20. Zikopoulos P, Eaton C et al. (2011). Understanding big data: analytics for enterprise class hadoop and streaming
    data. McGraw-Hill Osborne Media.
21. Mayer-Schönberger V and Cukier K (2013). Big data: a revolution that will transform how we live, work, and think.
    Eamon Dolan/Houghton Mifflin Harcourt.
22. Albert Bifet (2010). Mining Big Data in Real Time.
23. Meijer E (2011). The world according to linq. Communications of the ACM, 54(10), 45–51.
24. Manyika J, McKinsey Global Institute, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers AH (2011). Big data:
    the next frontier for innovation, competition and productivity. McKinsey Global Institute.