
eProject Guide – Processing Big Data with Hadoop Technologies


© 2024 Aptech Limited

All rights reserved.


No part of this book may be reproduced or copied in any form or by any means –
graphic, electronic, or mechanical, including photocopying, recording, taping, or storing
in an information retrieval system – or sent or transferred without the prior written
permission of the copyright owner, Aptech Limited.

All trademarks acknowledged.

APTECH LIMITED
Contact E-mail: ov-support@onlinevarsity.com

Edition 1 - 2024
Preface
The best way to learn something is to apply its principles and then test them. Similarly,
the best way to evaluate the knowledge you have gained is to test its application through
project work. This eProject Guide, prepared following the best practices in the industry,
will give you the experience of going through a live project and will teach you the
essentials of successfully developing IT projects.

The eProject Guide will help you to:


➢ Analyze a project
➢ Design the specifications of the project
➢ Develop the solution
➢ Maintain disciplined documentation for the work done
➢ Work in groups

The ability to work in a group is a vital quality for anybody desiring to join the software
industry.

Your project group will consist of 3-4 members. The Faculty will assign you to a project
group and select the Group Leader.

This eProject Guide reiterates the commitment of the eProjects Team to keeping up its
tradition of providing innovative, career-oriented professional education.

Diligently following the systematic approaches given in this book will give you real-life
experience of handling projects, as the practices listed here have been drawn from current
industry norms. Such an exercise will prepare you for joining the software development
industry.

Wishing you the very best.

Design Team
Table of Contents
1. Introduction
1.1 Team
1.2 Tips for a Good Application
1.3 Project Objectives
1.4 Project Deliverables
1.5 Project Conduct

2. Steps for a Great Project

3. Documentation Section

3.1 Problem Definition


3.2 Customer Requirement Specification (CRS)
3.3 Architecture and Design of the Project
3.4 Flow Chart
3.5 Data Flow Diagram
3.6 Entity Relationship (ER) Diagram
3.7 Database Design/Structure
3.8 Task Sheet
3.9 Checklist of Validations
3.10 Submission Checklist

Annexure A: CASE STUDY

4. User Guide
1. Introduction

1.1 Team
Implementation of any new computerized system involves a team of people. An ideal team consists
of the following members:

Project Manager: The Project Manager coordinates the project. Apart from being knowledgeable
about software, the person chosen as Project Manager must have good writing skills and
enough experience in the field to manage the implementation successfully. He/she is
involved not only in team management, but also in activities such as resource allocation,
project planning, and reporting, all of which form part of the responsibilities.

Project Leader: The Project Leader leads the project team and essentially decides which
tasks are to be performed by each team member and how much time should be allotted to
each project development phase.

Analyst: The Analyst studies the requirements of the system and defines the problem. The
Analyst determines the inputs, outputs, and processes involved in transforming those inputs
into outputs. He/she acts as the ‘Technical Expert’ and studies the technologies to be used
to develop the project.

Developer/Programmer: The Developer builds the user interface according to the
specifications prepared by the Analyst. Next, the Developer builds a prototype of the
system. After receiving client approval of the prototype, the Developer adds the necessary
code to make the prototype a full-fledged system.

Tester: The Tester tests the functionality of the application. Test data is used to check
whether the program executes without causing any errors. Test data may be live data
extracted from existing records in the system or dummy data. The Tester also verifies the
integrated application’s functionality with test data.

Implementation Engineer: The Implementation Engineer ports the final product to the
client’s computers, ensures that the installation process has been carried out accurately,
and hands over the system to the client.

Maintenance Engineer: The Maintenance Engineer is responsible for maintaining the system
that has been built. Maintenance includes extending troubleshooting support and performing
software upgrades in case of changes in the external system.

1.2 Tips for a Good Application


Seven basic steps for creating an effective application are as follows:

1. Define your target audience
2. Organize your concepts, information, and material
3. Create a directory structure
4. Implement storyboarding – prepare a sketch of contents, sequence, and layout of the forms
you intend to create
5. Build a prototype of your application
6. Test the prototype and implement the required changes
7. Deploy the application on the server, if applicable

Before designing an application, you must first conceptualize it. Goals and objectives
must be clearly defined. Consider what the application’s goal may be: to inform, promote,
educate, research and report, or simply to entertain. These goals must be defined and
should be in line with the company’s values and mission.

Annexure A shows a case study of a project.

1.3 Project Objectives


▪ Develop an application based on the problem specifications given

▪ Design a professional-looking Graphical User Interface (GUI), if applicable, for the application

▪ Integrate all the modules (if applicable) to form a complete solution

1.4 Project Deliverables


Following are the deliverables that have to be submitted on the completion of the project:

o Complete Application

o User Manual

o Installation Guide
1.5 Project Conduct
During project development, three meetings will be held within the team.

➢ First Meeting

In this meeting, students must discuss the eProject specification. They should work on:

• Problem Statement
• Table/File Structure
• Program/Code Specifications
• Reports to be generated

➢ Review of First Meeting

After the first meeting, students will review and discuss the details of the meeting. They will
check whether the structure is correct and the forms are designed as per the design specifications.
➢ Second Meeting

In this meeting, students will discuss the design. They must consider the following:

• Understanding of the problem
• Design of the forms
• Validations required

➢ Review of Second Meeting

After the second meeting, students will review and discuss the details of the meeting. They will
check whether the form designs are correct and whether proper verification and validation rules
are being implemented.

➢ Third Meeting

In this meeting, students will discuss and present the final application. They will consider the following:

• Implementing all specifications mentioned in previous meetings
• Form Design
• All required validations
• Integrating all the project modules such as forms, reports, and so forth
• User Manual and Installation Guide
2. Steps for a Great Project
➢ Analyze and understand the user requirements

➢ Carry out a detailed analysis of the project

➢ Identify the resources available

➢ Plan and maintain a schedule for the various activities to be done in the project

➢ Check and test all modules

➢ Document each detail


Documentation Section

In this section, we will look into various formats used for documenting the project.

This section contains different formats, which can also be seen applied in the case study
in Annexure A.

Students can use their own additional formats in the project documentation.

3. Documentation Section

This format can be used as the first page of your project report, duly signed by the Center
Academics Head.
3.1 Problem Definition

In this form, you can define the problem statement. You can design a pointwise list of
requirements as mentioned in eProject Specification.

Form No. 1/eProjects/PS/Ver 1.0


3.2 Customer Requirement Specification (CRS)
Client:

Business/Project Objective:

(can address organization/business overview, products, concerns, and expectations from the
system)

Inputs provided by the Client:


• Inputs to the System
• Outputs from the System
• Process Involved in the System
• Expected Delivery Dates
• List of Deliverables

Hardware Requirements:

Software Requirements:

Scope of the Work (in brief):

Form No. 2/eProjects/CRS/Ver 1.0/1 of 1


3.3 Architecture and Design of the Project

Form No. 3/eProjects/Design/Ver 1.0/1 of 4


3.4 Flowchart

Form No. 3/eProjects/Design/Ver 1.0/2 of 4


3.5 Data Flow Diagram

A Data Flow Diagram (DFD) is a graphical representation of the flow of data.

Form No. 3/eProjects/Design/Ver 1.0/3 of 4


3.6 Entity Relationship (ER) Diagram

Form No. 3/eProjects/Design/Ver 1.0/4 of 4


3.7 Database Design/Structure

Form No. 4/eProjects/DD/Ver 1.0/1 of 1


3.8 Task Sheet

Project Ref. No.:            Title:            Prepared By:            Date of Preparation of Activity Plan:

Project Activity Plan

Sr. No. | Task | Actual Start Date | Actual Days | Team Mate Names | Status

Form No. 5/eProjects/TS/Ver 1.0/1 of 1


3.9 Checklist of Validations

Option                                                                                | Validated
Do all numeric variables have a default value of zero?                                |
Does the administrator have all the rights to create and delete the records?          |
Are all the records properly fed into the appropriate database?                       |
Have all the modules been properly integrated and are they completely functional?     |
Have all the Design and Coding Standards been followed and implemented?               |
Is the GUI design consistent all over?                                                |
Is the navigation sequence correct through all the forms/screens in the application?  |
Is the exception handling mechanism implemented in all the screens?                   |
Are all the program codes working?                                                    |

Form No. 6/eProjects/VAL/Ver 1.0/1 of 1


3.10 Submission Checklist

Sr. No. | Particulars                                                                 | Yes | No | NA | Comments
1       | Are all users able to search for a particular item?                         | Yes |    |    |
2       | Are all old data properly saved and retrieved when required?                | Yes |    |    |
3       | Have all modules been properly integrated and are completely functional?    | Yes |    |    |
4       | Are GUI contents devoid of spelling mistakes?                               | Yes |    |    |
5       | Is the application user-friendly?                                           | Yes |    |    |
6       | Is the project published properly into a setup file?                        | Yes |    |    |

Form No. 7/eProjects/CKL/Ver 1.0/1 of 1


Annexure A: CASE STUDY
Acmeshel, a company based in Seattle, operates in the retail industry.

The company uses Oracle databases to store and analyze data. As the business grows, the
organization sees the need to generate and analyze event-driven data at regular intervals.
The volume of data collected in a year is 200 TB. This may include sentiment analysis of
customers based on their feedback, as well as log collections. The data comes from a
feedback form sent to clients and is collected through a Web API.

The major issue for the company is not storage, but analyzing and processing the
semi-structured data. Note that most of the data arrives as semi-structured, streaming logs.

Data retrieved from the API is in Avro format. Processing and analyzing semi-structured data
such as Avro is a complex task.

In a meeting between the IT team and the stakeholders of the company, it is decided to consider
the following:
● They will look for a tool in which developers find it easy to write operations, unlike Java.
● Investment in physical machines will be minimal.
● The data collected should be retained for two years.
● There has to be optimal use of resources, that is, RAM, processor cores, and disk space.
● The proposed application should be scheduled to run every month.
● The end results should be shown as bar charts or graphs to the stakeholders for
decision-making.
● A strong technical support system should be available for consultation.

Proposed Solution

Given these considerations, the company decides to use a Hadoop cluster, which is open
source. There are three major Hadoop vendors: Cloudera, Hortonworks, and MapR. The company
decides to go with Cloudera, since it is industry friendly and the support provided by the
Cloudera community is highly efficient.

To avoid unnecessary processing load from the semi-structured data, Spark Streaming can be
used, as it processes data in batches before storing it on HDFS, which is the storage
area. The data on HDFS is stored in the form of text files and directories. HDFS also
allows data retention over a longer time.
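As an illustration, a minimal Spark Structured Streaming sketch of this ingest step could look as follows. The paths are hypothetical placeholders, under the assumption that the Web API lands raw log files in an HDFS directory that the stream watches.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("AcmeshelLogIngest")
         .getOrCreate())

# Watch the landing directory fed by the Web API (hypothetical path).
raw_logs = (spark.readStream
            .format("text")
            .load("hdfs:///acmeshel/landing/"))

# Persist the stream to HDFS as text files, with checkpointing for
# fault tolerance.
query = (raw_logs.writeStream
         .format("text")
         .option("path", "hdfs:///acmeshel/raw_logs/")
         .option("checkpointLocation", "hdfs:///acmeshel/checkpoints/ingest/")
         .start())

query.awaitTermination()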

Spark, along with YARN, can provide resource management and job scheduling; this makes
Spark’s task much easier, as it optimizes the usage of resources. The data received consists
of semi-structured logs, which means the logs must be cleaned and brought into one readable
format so that further analysis can be done. The most reliable and fault-tolerant method
with faster processing is therefore Spark rather than MapReduce. Since Spark uses in-memory
processing, it takes approximately 30% of the CPU utilization. Note that
writing operations and code is also very easy in Spark.
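For instance, the cleaning step could take a shape like the following sketch. It assumes a simple "timestamp level message" line layout; the actual log format is not specified in the case study.

from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_extract, col

spark = SparkSession.builder.appName("AcmeshelLogClean").getOrCreate()

logs = spark.read.text("hdfs:///acmeshel/raw_logs/")

# Assumed layout: "<timestamp> <level> <message>". Adjust the pattern
# to the real log format once it is known.
pattern = r'^(\S+)\s+(\S+)\s+(.*)$'
parsed = logs.select(
    regexp_extract(col("value"), pattern, 1).alias("timestamp"),
    regexp_extract(col("value"), pattern, 2).alias("level"),
    regexp_extract(col("value"), pattern, 3).alias("message"),
)

# Drop lines that did not match, then write the cleaned, uniform data
# back to HDFS as tab-delimited text for the next stage.
cleaned = parsed.filter(col("timestamp") != "")
cleaned.write.mode("overwrite").option("sep", "\t").csv("hdfs:///acmeshel/processed_logs/")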

After the Avro data is processed, it is in a structured format and ready to be queried
and analyzed. Note that Impala works well with Parquet data, while Hive works well with
Avro data. Hive also lets you create tables on top of the processed data, which is still in
text files. If Impala is to be used, it can be embedded without hampering the cluster.
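As a sketch of this step, a Hive table can be declared over the cleaned text output. The table name, columns, and location below are illustrative assumptions matching the earlier cleaning sketch, not part of the case study.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("AcmeshelHiveTables")
         .enableHiveSupport()
         .getOrCreate())

# External table over the processed data already on HDFS; dropping
# the table later will not delete the underlying files.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS acmeshel_logs (
        event_time STRING,
        level      STRING,
        message    STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
    STORED AS TEXTFILE
    LOCATION 'hdfs:///acmeshel/processed_logs/'
""")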

HUE provides a GUI to work with Hadoop tools and also helps create visualizations.
After the Hive tables are created, the data will be sent to HUE to create charts and bar
graphs. These visualizations can then be exported and mailed to the required members of the
team for decision-making.

Oozie is a workflow management tool that assembles all the inflowing data and passes it on
to the next phase in the job sequence. Monthly scheduling of the application will be
taken care of by Oozie. The Oozie editor also comes along with HUE, so it is easy to
submit jobs from the editor already embedded in HUE.

Project Specifications
The project will be named Acmeshel Data Processing System.

By default, the Hadoop ecosystem creates three replicas of data. Going with the default value of 3,
the required storage space for data in a year will be:

200 TB * 3 = 600 TB

Note that the company wants data retention for two years. The storage requirement
then becomes 600 * 2, that is, 1200 TB.

Of this storage, 30% of the data is in container storage and 70% is in Avro format. (When
the Avro data arrives, its schema is in JSON format; this can be saved as a normal text
file and consumes 30% of the storage, while the remaining 70% is Avro binary.) Based on
this, the standard formula to compute total storage must be applied.

Following is the standard formula:

Total storage = (total storage * % in container storage) + (total storage * % in compressed format * (1 − expected compression))

Based on this, the calculation is as follows:

(1200 * 0.30) + (1200 * 0.70) * (1 − 0.70)
= 360 + 840 * 0.30
= 360 + 252 = 612 TB

612 TB is the required data storage.

Now, in addition to the data storage, space is required to process data and perform certain other
tasks, such as local operating system computation. Based on the yearly incoming data of 200 TB,
it is assumed that on an average day only 10% of the data is being processed and that a data
process creates three times its size in temporary data (replication). Therefore, around 30% of
extra storage is required.

The total storage required for data and other activities will then be:

612 + 612 * 0.30 = 612 + 183.6 = 795.6 TB

Just a Bunch Of Disks (JBOD) is recommended for the data nodes. Allocate 20% of the data
storage to the JBOD file system. Therefore, the data storage requirement will go up by
20%.

This makes the calculation as follows:

795.6 + 795.6 * 0.20 = 795.6 + 159.12 = 954.72 TB ≈ 955 TB

The total data storage thus comes to 955 TB.

To store 955 TB of data, the configuration details are as follows.

For a JBOD of 12 disks, each disk of 4 TB, the data node capacity will be 48 TB.
The number of data nodes required will be 955/48 ≈ 20.

There is no necessity to set up the whole cluster on day one. The cluster can be scaled up
as data grows from small to big. It is therefore recommended to start the cluster with 25% of
the total nodes and gradually move to 100% as data grows.

Spark is an in-memory processing tool that uses 30% of the CPU utilization, whereas
MapReduce and Hive are both disk I/O based and use 70% of the CPU utilization. 70%
of the data must be processed in batch mode with Hive.

Here, then, is the number of nodes for batch processing:

20 * 0.70 = 14 nodes

14 nodes will be assigned for batch processing, and the other six nodes will be assigned for
in-memory processing with Spark.
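The sizing arithmetic above can be re-derived with a short script whenever the inputs change; the figures below are exactly the ones used in this section.

import math

yearly_data_tb = 200
replication = 3
retention_years = 2
raw = yearly_data_tb * replication * retention_years          # 1200 TB

container_share = 0.30        # JSON schema/text portion of storage
avro_share = 0.70             # Avro binary portion of storage
expected_compression = 0.70   # expected compression on the Avro data

data_storage = raw * container_share + raw * avro_share * (1 - expected_compression)
                                                              # 360 + 252 = 612 TB

with_processing = data_storage * 1.30                         # +30% temporary data = 795.6 TB
total = with_processing * 1.20                                # +20% JBOD allowance = 954.72 TB

node_capacity_tb = 12 * 4                                     # 12 disks x 4 TB = 48 TB per node
nodes = math.ceil(total / node_capacity_tb)                   # 20 data nodes

batch_nodes = round(nodes * 0.70)                             # 14 nodes for Hive batch jobs
spark_nodes = nodes - batch_nodes                             # 6 nodes for Spark

print(round(total, 2), nodes, batch_nodes, spark_nodes)       # 954.72 20 14 6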

This calculation is for data storage. Resource management and job scheduling will be done by
YARN in order to have optimal CPU utilization. Hadoop tools will be installed using Cloudera
Manager.

For the Acmeshel Data Processing System, it is recommended to use the latest version of
Cloudera, CDH 5.13, which includes Cloudera Manager to help provision and start up the
required services. A pseudo-distributed (single-node) cluster for testing purposes can
easily be downloaded from the Cloudera Website.
Project Functionality:

Set up the cluster on version CDH 5.13 based on the given calculations. Once the cluster is
set up, connect Spark Streaming to the Java Web API from which the data is generated.
Once extracted, the data is stored on HDFS in the form of blocks. The data must be
cleaned and processed before it undergoes analysis; for this, Spark must be used.

Write a Spark application in Python, using Jupyter as the IDE. The application should
bring the Avro data into a readable format and extract only the data relevant for analysis.
While submitting the Spark application, use YARN as the resource management tool.
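A sketch of such an application follows. It assumes the spark-avro package is available on the cluster (on CDH-era Spark this is typically the com.databricks.spark.avro format); the paths and column names are hypothetical.

# Submit to the cluster with YARN, for example:
#   spark-submit --master yarn --deploy-mode cluster process_avro.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("AcmeshelAvroProcessing").getOrCreate()

# Read the Avro data landed from the Web API (hypothetical path).
events = (spark.read
          .format("com.databricks.spark.avro")
          .load("hdfs:///acmeshel/raw_avro/"))

# Keep only the fields relevant for analysis; these column names are
# illustrative, not part of the case study.
relevant = events.select("event_time", "customer_id", "feedback_text")

# Store the structured output on HDFS for the Hive stage.
relevant.write.mode("overwrite").parquet("hdfs:///acmeshel/processed_avro/")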

The output of the Spark application is stored on HDFS, from where Hive tables are
created. (The schema of the Hive tables is stored in the metastore.) Create a Hive query to
analyze the logs and obtain meaningful insights. To represent the end result, use HUE, which
provides bar graphs and charts to help visualize data.
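An illustrative Hive query of this kind is sketched below. It reuses the hypothetical acmeshel_logs table from the earlier example and counts events per day and log level, a result that HUE can render directly as a bar chart.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("AcmeshelInsights")
         .enableHiveSupport()
         .getOrCreate())

# Daily event counts per log level; the same HiveQL can be run from
# the Hive editor in HUE.
daily_counts = spark.sql("""
    SELECT to_date(event_time) AS day,
           level,
           COUNT(*)            AS events
    FROM acmeshel_logs
    GROUP BY to_date(event_time), level
    ORDER BY day
""")
daily_counts.show()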

Note that the whole workflow is managed by Oozie.


ACKNOWLEDGEMENT

I would like to acknowledge all those who have given moral support and helped me make the
project a success.

I wish to express my gratitude to the eProjects Team at the Head Office, who guided and helped
me. I would also like to express my gratitude to all the staff members of my center for not only
providing me with the opportunity to work with them on this project, but also for their support and
encouragement throughout the process.

I also express my sincere gratitude to my project guide at the organization for his/her valuable
guidance and support for the completion of this project.

Finally, I would like to offer many thanks to all my colleagues for their valuable suggestions and
constructive feedback.
Problem Definition

After reading the project specifications, the developer states the scope of the project very briefly.

The problem definition for the Acmeshel Data Processing System is to analyze streaming
semi-structured logs with optimal use of resources and convert them into meaningful
information for the stakeholders for decision-making.

Form No. 1/eProjects/PS/Ver 1.0/1 of 1


Sample Customer Requirement Specification (CRS)

Client/Project Undertaken: Acmeshel Data Processing System


______________________________

1. List of inputs to the system
________________________________________________________________________________
________________________________________________________________________________
________________________________________________________________________________

2. List of outputs expected from the system
________________________________________________________________________________
________________________________________________________________________________
________________________________________________________________________________

3. Overview of Processes involved in the system
________________________________________________________________________________
________________________________________________________________________________
________________________________________________________________________________
4. Hardware and Software required for implementing the project

For the Acmeshel Data Processing System, it is recommended to use the latest version of
Cloudera, CDH 5.13, which includes Cloudera Manager to help provision and start up the
required services. A pseudo-distributed (single-node) cluster for testing purposes can easily
be downloaded from the Cloudera Website.

Form No. 2/eProjects/CRS/Ver 1.0/1 of 1


Architecture and Design of the System

Architecture of the Application

Figure 1.1 displays the application architecture.

Figure 1.1: Application Architecture

Form No. 3/eProjects/Design/Ver 1.0/1 of 2


Figure 1.2 displays the streaming analytics with Spark Streaming.

Figure 1.2: Streaming Analytics with Spark Streaming

Form No. 3/eProjects/Design/Ver 1.0/2 of 2


Database Design/Structure

This form is not applicable to this project.

Form No. 4/eProjects/DD/Ver 1.0/1 of 1


Sample Task Sheet

Project Activity Plan                          Date of Preparation of Activity Plan:
Project Ref. No.:            Title:            Prepared By:

Sr. No. | Task                                                                                   | Actual Start Date | Actual Days | Team Mate Names | Status
1       | Identifying the team members required to complete the project                          |                   |             |                 |
2       | Data Acquisition and Filtering                                                         |                   |             |                 |
3       | Identify big data storage solutions, such as HDFS and Spark, where data can be stored  |                   |             |                 |
4       | Exploratory Data Analysis                                                              |                   |             |                 |
5       | Data Preparation for Modeling and Assessment                                           |                   |             |                 |
6       | Modeling and Implementation                                                            |                   |             |                 |

Form No. 5/eProjects/TS/Ver 1.0/1 of 1


Sample Checklist of Validations

Option                                                                              | Validated
Do all numeric variables have a default value of zero?                              | Yes
Have all the modules been properly integrated and are they completely functional?   | Yes
Is the design consistent all over?                                                  | Yes

Form No. 6/eProjects/VAL/Ver 1.0/1 of 1


Sample Submission Checklist

Sr. No. | Particulars                                                                                                     | Yes | No | NA | Comments
1       | Have all the modules been properly integrated and are they completely functional?                              | Yes |    |    |
2       | Are the codes working as per the specification?                                                                 | Yes |    |    | All the modules are properly tested
3       | Does the application’s functionality resolve the client’s problem and satisfy his/her requirements completely? | Yes |    |    |
4       | Have the hardware and software been correctly chosen?                                                           | Yes |    |    |

Form No. 7/eProjects/CKL/Ver 1.0/1 of 1


4. User Guide
A. System Requirements:

No. | Items            | Description
1   | Operating System | Microsoft Windows 10 or higher
2   | Software         | Hadoop ecosystem tools including Spark, YARN, HUE, and Oozie on Cloudera CDH 5.13
