Big Data Processing Technologies in Distributed in
Available online at www.sciencedirect.com
ScienceDirect
Procedia Computer Science 160 (2019) 561–566
www.elsevier.com/locate/procedia
Abstract
An analysis of Big data technologies is provided. An example is given of applying the MapReduce paradigm to uploading
large volumes of data, processing and analyzing unstructured information, and distributing it into a clustered database. The
article summarizes the concept of "big data". Examples of methods for working with arrays of unstructured data are given. A
parallel system based on Resilient Distributed Datasets (RDD) is organized. A class of basic database operations is
implemented: database connection, table creation, retrieval of a row by id, returning all elements of the database, and
updating, deleting, and creating a row.
© 2019 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Peer-review under responsibility of the Conference Program Chairs.
Keywords: Big data; Web application; Modeling; Processing, analytics.
     1. Introduction
   The information technology (IT) field is a promising field of research. Until recently, big systems consisted of
several servers and terabytes of information. Nowadays, systems use a cloud cluster model, which includes a
thousand multicore processors and petabytes of data. That is why a new research area, Big data, was created.
This paradigm is already reflected in academic programs. Examples of Big data are structured and
unstructured data, media, and random processes, as they practically cannot be processed in the traditional way.
Traditional monolithic systems are being replaced with new asynchronous and parallel solutions, which
provide the ability to work with Big data [1].
   Big Data information technology is a set of methods and means for processing large, dynamic volumes of structured
(databases) and unstructured (text, stream) data for analysis and decision support. This technology is an alternative
to traditional database management systems and Business Intelligence class solutions. Besides, Big data technology
can be used for parallel (distributed) data processing [2, 3]. Such a system consists of several independent blocks
that efficiently process information under conditions of continuous growth and distribution across multiple cluster
nodes. In such systems, the volume of information increases exponentially, and unstructured data makes up the
largest part of the whole. Therefore, the proper interpretation of data flows in systems of this type becomes more
and more urgent [1].
   The subject of research is the methods and tools for building, editing and adapting the information flow in
distributed information systems.
2. State-of-art
   In [1], the concept of Big data and the criteria for its classification are given. The paper [3] considers Big data
as a revolutionary technology for innovation, competition, and productivity in the economy, and as a new resource for
business. The architecture, informational value for business, and impact of Big Data are given in [4]. The
possibilities of involving innovative Big Data in developing a business strategy are analyzed in [5].
   An analysis of data consolidation methods is given in [6]. In [7], the authenticity, integration, scalability,
and confidentiality of "open" structured (databases) and unstructured (text) data from social networks are described.
The technical aspects of Big data implementation are given in [8]. A method of intelligent data analysis is described
in [9]. The possibility of implementing Big data in medicine is analyzed in [10, 11]. An information model
of a cloud data warehouse and the possibility of implementing it as part of Big data technology are provided in
[12 – 14]. The use of big data for information analysis in a social network is given in [14]. Deep learning and
machine learning methods can process Big data that comes from different sources such as images, video, and audio
[13 – 17]. Business and e-libraries are also examples of Big data technology usage [18 – 23].
   So, the tasks in Big data processing that are only partly solved are the following: most sources are unstructured
data, and there are requirements on time complexity, so parallel data processing should be used.
3. Problem statement
   The research task is to develop the model of Big data and information technology of distributed unstructured data
processing. To gain such a result, the following tasks must be solved in the paper:
   1.    To analyze the methods and principles of Big data processing;
   2.    To analyze existing technologies of Big data processing;
   3.    To carry out a comparative analysis of productivity of Hadoop and Spark platforms for unstructured data
processing;
   4.    To test the parallelized system in Scala.
   Clustering is one of the ways to decrease the time complexity of Big data processing. Two variants of
scaling, horizontal and vertical, should be taken into account.
   Horizontal scaling divides the data set and distributes the data over multiple servers, or shards. For example, one
can create ten instances, each with a 1 TB database. Each shard is an independent database, and collectively the
shards make up a single logical database. The system should rely on asynchronous message communication to delimit
the components. The system should also provide control of loads, flows, and message queues [2 – 4].
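
The shard layout described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the names `shard_for` and `ShardedStore` are hypothetical, in-memory dictionaries stand in for the ten independent databases, and hash-based routing is assumed as the distribution rule, which the text does not specify.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a record key to a shard index using a stable hash (assumed routing rule)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    """Several independent dict-backed 'databases' exposed as one logical store."""

    def __init__(self, num_shards: int = 10):
        # Each dict stands in for one independent shard database.
        self.shards = [dict() for _ in range(num_shards)]

    def put(self, key: str, value) -> None:
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key: str):
        return self.shards[shard_for(key, len(self.shards))].get(key)

    def __len__(self) -> int:
        # The logical database size is the sum over all shards.
        return sum(len(shard) for shard in self.shards)
```

Because the routing depends only on the key, any node can locate a record without consulting the other shards, which is what allows each shard to remain an independent database.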
                                  Nataliya Shakhovska et al. / Procedia Computer Science 160 (2019) 561–566         563
   When systems of this type are used, several problems arise with the interworking of all clustered system nodes. For
example, different applications require access to data from different nodes. This makes the operation of a clustered
system more complicated, but vertical data scaling can provide access to the data of all system nodes.
   Hinchcliff divides the approaches to Big data into three groups depending on the volume:
$$\mathit{Vol}_{BD} = \{\mathit{Vol}_{FD}, \mathit{Vol}_{BA}, \mathit{Vol}_{DI}\}, \tag{1}$$
where $\mathit{Vol}_{FD}$ is Fast Data, whose volume is measured in terabytes; $\mathit{Vol}_{BA}$ is Big Analytics,
measured in petabytes; $\mathit{Vol}_{DI}$ is Deep Insight, measured in exabytes and zettabytes.
    The groups differ not only in the volumes of data they operate on but also in the quality of their
processing solutions. Big data technology must process information from sources of different expressive power,
namely structured, semi-structured, and unstructured ones. The set of information products
is divided into three blocks:
$$\mathit{Ip} = \langle \mathit{St}, \mathit{SemS}, \mathit{UnS} \rangle, \tag{2}$$
where $\mathit{St} = \langle DB, DW \rangle$ is structured data (databases, data warehouses);
$\mathit{SemS} = \langle Wb, Tb \rangle$ is semi-structured data (XML, electronic worksheets);
$\mathit{UnS} = \langle Nd \rangle$ is unstructured data (text) [10, 14].
The following technologies are used for Big data processing:
$$T_{BD} = \langle T_{NoSQL}, T_{SQL}, T_{Hadoop}, T_{V} \rangle, \tag{3}$$
where $T_{NoSQL}$ is the technology of NoSQL databases; $T_{Hadoop}$ is the technology that ensures massively parallel
processing; $T_{SQL}$ is the technology of structured data processing (SQL databases); $T_{V}$ is the technology of Big
data visualization [8, 11].
    The main technologies of Big data processing are NoSQL, MapReduce, Apache Hadoop, and Apache Spark.
    The problem of growing information volume cannot be solved using classical relational architectures. Proponents
of the NoSQL concept emphasize that it is not a complete rejection of SQL and the relational model; rather, it starts
from the observation that SQL is an essential and handy tool that nevertheless cannot be considered universal. One
problem commonly pointed out for classical relational databases is dealing with massive data and high-load projects.
The primary aim of the approach is to extend database capabilities where SQL is not flexible enough, not to displace
SQL wherever it performs its tasks well. Also, the relational approach does not support both types of scaling
(vertical and horizontal).
     There are classical approaches and paradigms for the development of data processing facilities. The MapReduce
paradigm is one of them [5]. This model of distributed data processing was proposed by Google for processing
significant volumes of data on computing clusters. A cluster is a set of several independent computers used together
and working as a single system.
     MapReduce organizes data in the form of lists that pass through three stages of processing:
1. Map stage. At this stage, the data are processed with the help of the map() function defined by the user. The
   operation is similar to the map() method in functional programming languages. The map function accepts the list
   at the input and returns several key-value pairs.
2. Shuffle stage. At this stage, the output of the map function is divided into "buckets", where each bucket
   corresponds to one key produced at the map stage. Later on, these buckets serve as input for the reduce() function.
3. Reduce stage. The reduce function computes the result for each separate "bucket".
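
The three stages above can be sketched on a single machine in plain Python. This is only an illustration of the paradigm, not Hadoop or Spark code: the function names and the word-count task are assumptions chosen for the example, and a real cluster would run the map and reduce calls on separate nodes.

```python
from collections import defaultdict


def map_stage(records, map_fn):
    """Apply the user-defined map function to every record; collect key-value pairs."""
    pairs = []
    for record in records:
        pairs.extend(map_fn(record))
    return pairs


def shuffle_stage(pairs):
    """Group the values into buckets, one bucket per key produced by the map stage."""
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[key].append(value)
    return dict(buckets)


def reduce_stage(buckets, reduce_fn):
    """Apply the user-defined reduce function to each bucket independently."""
    return {key: reduce_fn(key, values) for key, values in buckets.items()}


def map_reduce(records, map_fn, reduce_fn):
    return reduce_stage(shuffle_stage(map_stage(records, map_fn)), reduce_fn)


# Example task (hypothetical): count word occurrences in a list of lines.
def word_count_map(line):
    return [(word.lower(), 1) for word in line.split()]


def word_count_reduce(word, ones):
    return sum(ones)


counts = map_reduce(["big data", "big cluster", "data data"],
                    word_count_map, word_count_reduce)
# counts == {"big": 2, "data": 3, "cluster": 1}
```

Since each bucket is reduced without reference to any other bucket, the reduce calls can be distributed across cluster nodes in the same way as the map calls.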
   At present, the Apache Hadoop MapReduce and Apache Spark technologies are the leaders in the use of the
MapReduce paradigm and in providing a software platform for the distributed processing of
large data volumes [8, 16 – 18].
   Apache Hadoop MapReduce is a free platform for processing large data volumes (measured
in petabytes) using the MapReduce paradigm. This paradigm makes it possible to split a job into separate fragments,
each of which can be run on a separate cluster node. Hadoop includes an implementation of the Hadoop Distributed
File System (HDFS), which automatically provides data backup and is optimized for work with MapReduce. To
simplify access to the data in the Hadoop store, the SQL-like Hive language, a kind of SQL for MapReduce,
was developed. Requests in this language can be parallelized and processed by several Hadoop platforms.
   Compared to Hadoop MapReduce, Spark provides up to 100 times higher performance when the data
is processed in memory and up to 10 times higher performance when the data is located on disk. Spark runs
on Hadoop cluster nodes with the help of Hadoop YARN as well as in standalone mode. It supports data
processing in HDFS, Cassandra [13], and Hive [11] stores and in any Hadoop input format [6, 8].
   The main difference between Spark and Hadoop MapReduce is that Spark stores information in computer
memory, which gives higher platform productivity, while Hadoop stores it on disk, which gives a
higher security level [18 – 19]. In addition to the traditional features of Apache Hadoop MapReduce, namely
processing of unstructured data, the Apache Spark platform includes Spark Streaming for working with
asynchronous streams, the MLlib library for machine learning, and GraphX for graph processing.
5. Experiment
   Let us compare the productivity of both platforms in terms of execution time versus the
number of iterations (Fig. 1).
      Spark provides an API (Application Program Interface) in the Scala, Java, Python, and R programming languages.
A Spark program first creates a SparkContext object, which tells Spark how to access the cluster. To create the
SparkContext, a SparkConf object containing information about the application must be built first.
      The concept of Resilient Distributed Datasets (RDD) is the basis of Spark. An RDD is a fault-tolerant collection
(list) of elements that is processed in parallel. There are two ways to create an RDD: parallelizing a
collection (list) that already exists in the program, and referencing a dataset in an external storage system, such
as HDFS (Hadoop Distributed File System) or any other data source supported by Hadoop [5].
      Let us divide the service structure into two parts. The first one is a Web page that includes the UI (User
Interface) with a form for sending a document to the server and views for data analysis once the processed
data have been received from the server. The second one is the API of our system, which represents a
library of methods for accepting, processing, analyzing, and returning data to the client.
      We focus on the API part of the system, which uses Apache Spark. The example is provided in the Scala
language. To begin with, we set the cluster configuration and create the SparkContext. The master
URL is the cluster configuration setting: setMaster("local[*]") runs Spark locally with as many
worker threads as there are cores on the given computer, while
setMaster("spark://HOST:PORT") configures a connection to an external cluster.
      We develop a method that receives a file from the client and checks the file type (csv or xlsx). If the type
matches, the file is uploaded to the server and its name is passed to the method parseAttachment(inputFile:
String). Otherwise, the method returns a warning.
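
The file-type check described above might look as follows. This is a hedged Python sketch rather than the paper's Scala code: `receive_file` and the body of `parse_attachment` are hypothetical stand-ins, and the check is assumed to be based on the file extension only.

```python
from pathlib import Path

# Only these two types are accepted, as stated in the text.
ALLOWED_EXTENSIONS = {".csv", ".xlsx"}


def parse_attachment(input_file: str) -> str:
    """Placeholder for the paper's parseAttachment(inputFile: String) method."""
    return f"parsing {input_file}"


def receive_file(file_name: str) -> str:
    """Accept csv/xlsx files and hand them to the parser; otherwise return a warning."""
    extension = Path(file_name).suffix.lower()
    if extension in ALLOWED_EXTENSIONS:
        return parse_attachment(file_name)
    label = extension if extension else "none"
    return f"warning: unsupported file type ({label})"
```

Checking the extension is the simplest possible rule; a production service would likely also inspect the uploaded content before parsing it.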
      During the next step, each file element is passed to the CSVReader constructor and parsed,
and the raw content is returned as a Spark RDD. This allows the data processing to be parallelized. After that,
the resulting collection (list) is passed to the toTransactions(data) constructor, which
returns a collection of transactions. After completion of this process, each element of the collection is
6. Results
     At the last stage, all elements are combined to create the main class that runs the application. The Scala
library spray is used to run the server and deploy the application. Our task is to create the configuration,
connect it to the database, create the service and the actor system, and run the HTTP server. The application
interface is shown in Fig. 4; the content of the database after a csv document with the data has been uploaded to
the server is shown on the left and right sides.
7. Discussion
     A parallel method for receiving a file from the client and checking the file type (csv or xlsx) is developed.
Each file element is passed to the CSVReader constructor. The raw content after parsing is returned as a Spark
RDD. A Scala object for basic operations with the database is developed. It provides a loosely coupled interface,
isolation, and location transparency, as well as a means of delegating errors and messages.
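
The basic database operations listed in the paper (connection, table creation, retrieval by id, returning all elements, update, delete, and row creation) can be sketched with Python's standard `sqlite3` module. This is an assumption-laden illustration, not the authors' Scala object: the class name `Database`, the table `transactions`, and the column `payload` are hypothetical, and SQLite stands in for whatever database the system actually uses.

```python
import sqlite3


class Database:
    """Minimal wrapper around the basic operations named in the text."""

    def __init__(self, path: str = ":memory:"):
        # Database connection.
        self.conn = sqlite3.connect(path)
        self.conn.row_factory = sqlite3.Row

    def create_table(self) -> None:
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS transactions ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT NOT NULL)"
        )
        self.conn.commit()

    def create(self, payload: str) -> int:
        """Insert a row and return its id."""
        cur = self.conn.execute(
            "INSERT INTO transactions (payload) VALUES (?)", (payload,)
        )
        self.conn.commit()
        return cur.lastrowid

    def get_by_id(self, row_id: int):
        row = self.conn.execute(
            "SELECT id, payload FROM transactions WHERE id = ?", (row_id,)
        ).fetchone()
        return dict(row) if row else None

    def get_all(self):
        rows = self.conn.execute("SELECT id, payload FROM transactions ORDER BY id")
        return [dict(row) for row in rows]

    def update(self, row_id: int, payload: str) -> int:
        """Update a row; return the number of affected rows."""
        cur = self.conn.execute(
            "UPDATE transactions SET payload = ? WHERE id = ?", (payload, row_id)
        )
        self.conn.commit()
        return cur.rowcount

    def delete(self, row_id: int) -> int:
        """Delete a row; return the number of affected rows."""
        cur = self.conn.execute("DELETE FROM transactions WHERE id = ?", (row_id,))
        self.conn.commit()
        return cur.rowcount
```

A Scala `object` is a singleton, whereas this sketch uses an ordinary class; a single shared instance would play the same role.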
8. Conclusions
     An information technology for parallel processing of Big data is developed. An analysis of the methods and
principles of Big data processing is given. A comparative analysis of the productivity of the Hadoop and Spark
platforms for unstructured data processing is provided. An example is given of applying the MapReduce paradigm to
loading large volumes of data, processing and analyzing unstructured information, and distributing it into a cluster
database. Examples of methods for working with unstructured data arrays are given. A parallel RDD system
is organized. A singleton class is proposed for basic database operations such as database connection,
table creation, retrieval of a row by id, the return of all database elements, update, deletion, and row creation.
     The parallelized system in Scala is developed and tested. This information technology allows processing
structured, semi-structured, and unstructured data and combining vertical and horizontal data scaling.
References
[1] Janssen, M., van der Voort, H., & Wahyudi, A. (2017). “Factors influencing big data decision-making quality”. Journal of Business Research,
    70: 338-345.
[2] Shaw, J. (2014). “Why Big Data is a big deal”. Harvard Magazine, 3: 30-35.
[3] Daas, P. J., Puts, M. J., Buelens, B., & van den Hurk, P. A. (2015). “Big data as a source for official statistics”. Journal of Official Statistics,
    31(2): 249-262.
[4] Shakhovska, N., Vovk, O., Hasko, R., Kryvenchuk, Y. (2018). “The Method of Big Data Processing for Distance Educational System”. In:
    Shakhovska N., Stepashko V. (eds) Advances in Intelligent Systems and Computing II. 689: 461-473.
[5] De Mauro, A., Greco, M., & Grimaldi, M. (2016). “A formal definition of Big Data based on its essential features”. Library Review, 65(3):
    122-135.
[6] Melnykova, N., Marikutsa, U., Kryvenchuk, U. (2018). “The New Approaches of Heterogeneous Data Consolidation”. Proceedings of the
    13th International Scientific and Technical Conference on Computer Sciences and Information Technologies (CSIT), Lviv, September 2018
    (1): 408-411.
[7] Ediger, D., Jiang, K., Riedy, J., Bader, D. A., Corley, C., Farber, R., Reynolds, W. N. (2010). “Massive social network analysis: Mining
    twitter for social good”. Proceedings of the 39th International Conference on Parallel Processing (2010, September): 583-593.
[8] Chen, H., Chiang, R. H., Storey, V. C. (2012). “Business intelligence and analytics: from big data to big impact”. MIS quarterly: 1165-1188.
[9] Boyko, N. (2016). “A look trough methods of intellectual data analysis and their applying in informational systems”. Proceedings of the XIth
    International Scientific and Technical Conference “Computer Sciences and Information Technologies” (CSIT), Lviv, September 2016: 183-185.
[10] Das, N., Das, L., Rautaray, S. S., Pandey, M. (2018). “Big Data Analytics for Medical Applications”. International Journal of Modern
    Education and Computer Science, 10(2): 35.
[11] Thusoo, A., Sarma, J. S., Jain, N., Shao, Z., Chakka, P., Anthony, S., ... & Murthy, R. (2009). “Hive: a warehousing solution over a map-
    reduce framework”. Proceedings of the VLDB Endowment, 2(2): 1626-1629.
[12] Wang, C., Ren, K., Lou, W., & Li, J. (2010). “Toward publicly auditable secure cloud data storage services”. IEEE network, 24(4).
[13] Fedushko S., Shakhovska N., Syerov Yu. (2018) “Verifying the medical specialty from user profile of online community for health-related
    advices”. Proceedings of the 1st International workshop on informatics & Data-driven medicine (IDDM 2018) Lviv, November 28–30, 2018.
    2255: 301–310.
[14] Maass, W., Natschläger, T., & Markram, H. (2002). “Real-time computing without stable states: A new framework for neural computation
    based on perturbations”. Neural computation, 14(11): 2531-2560.
[15] Vitynskyi, P., Tkachenko, R., Izonin, I., Kutucu H. (2018) “Hybridization of the SGTM Neural-like Structure through Inputs Polynomial
    Extension”. In Proceedings of the Second International Conference on Data Stream Mining Processing (DSMP), 386-391.
[16] Wang, G., & Tang, J. (2012, August). “The nosql principles and basic application of cassandra model”. In Proceedings of the 2012
    International Conference Computer Science & Service System (CSSS), 1332-1335.
[17] Zaharia, M., Xin, R. S., Wendell, P., Das, T., Armbrust, M., Dave, A., & Ghodsi, A. (2016). “Apache spark: a unified engine for big data
    processing”. Communications of the ACM, 59(11): 56-65.
[18] Molnár, E., Molnár, R., Kryvinska, N., Greguš M. (2014) “Web Intelligence in practice”. The Society of Service Science, Journal of Service
    Science Research, Springer, 6(1):149-172.
[19] Kryvinska, N. (2012) “Building Consistent Formal Specification for the Service Enterprise Agility Foundation”. The Society of Service
    Science, Journal of Service Science Research, Springer, Vol. 4, No. 2, 2012, pp. 235-269.
[20] Gregus, M. Kryvinska, N. (2015) “Service Orientation of Enterprises - Aspects, Dimensions, Technologies”. Comenius University in
    Bratislava, ISBN: 9788022339780.
[21] Kaczor, S., Kryvinska, N. (2013) “It is all about Services - Fundamentals, Drivers, and Business Models”. The Society of Service Science,
    Journal of Service Science Research, Springer, 5(2): 125-154.
[22] Kryvinska, N., Gregus, M. (2014) “SOA and it's Business Value in Requirements, Features, Practices and Methodologies”. Comenius
    University in Bratislava, ISBN: 9788022337649.
[23] Rusyn, B., Vysotska, V., Pohreliuk, L. “Model and architecture for virtual library information system”. In Proceedings of the 13th
    International Scientific and Technical Conference on Computer Sciences and Information Technologies (CSIT), Lviv, September 2018 (1):
    37-41.