Unit – I
Introduction to Big Data: Data, Types of Data, Big Data – 3 Vs of Big Data, Analytics, Types of Analytics, Need for
Big Data Analytics. Introduction to Apache Hadoop: Invention of Hadoop, Hadoop Architecture, Hadoop
Components, Hadoop Eco Systems, Hadoop Distributions, Benefits of Hadoop
Data: Data is a collection of raw facts and figures. It is called raw because it has not yet been
processed. We collect data from different sources. After collection, data is entered into a machine
for processing. Data may be a collection of words, numbers, pictures, sounds, etc.
Examples of Data:
Student data on admission form - A bundle of admission forms contains each student's name, father's
name, address, photograph, etc.
Student's examination data - In the examination system of a college/school, data such as the marks
obtained in different subjects by all students and the exam schedule is collected.
Census report, data of citizens - During a census, data about all citizens is collected, such as the
number of persons living in a home, whether they are literate or illiterate, number of children, caste, religion, etc.
Survey data - Companies collect data through surveys to learn whether people like or dislike their
products. They also collect data about their competitor companies in a particular area.
Information: Processed data is called information. When raw facts and figures are processed and arranged
in a proper order, they become information. Information has a proper meaning and is useful in
decision-making. In other words, information is data that has been processed in such a way as to
be meaningful to the person who receives it.
Examples of information:
Student's address labels - Stored data of students can be used to print address labels. These
address labels are used to send any intimation / information to students at their home addresses.
Student's examination results - In the examination system, the collected data (marks obtained in each subject)
is processed to get the total marks obtained by a student. The total marks are information. They are also
used to prepare the result card of a student.
Census report, total population - Census data is used to produce reports/information about the total population
of a country, the literacy rate, and the total population of males, females, children, aged persons, and persons in
different categories like caste, religion, age group, etc.
Survey report - Survey data is summarized into reports/information to present to the management of the
company. The management will take important decisions on the basis of the data collected through
surveys.
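As a small illustration of the examination example, the following Python sketch turns raw marks (data) into totals and percentages (information). The student names, subjects, marks, and the `summarize` function are hypothetical, and each subject is assumed to be out of 100.

```python
# Raw data: marks obtained per subject (hypothetical values for illustration).
raw_marks = {
    "Asha": {"Maths": 78, "Physics": 65, "English": 82},
    "Ravi": {"Maths": 55, "Physics": 71, "English": 60},
}


def summarize(marks_by_student):
    """Process raw marks into information: total and percentage per student."""
    report = {}
    for student, marks in marks_by_student.items():
        total = sum(marks.values())
        maximum = 100 * len(marks)  # assumption: every subject is out of 100
        report[student] = {
            "total": total,
            "percentage": round(100 * total / maximum, 2),
        }
    return report


# The processed result is what would appear on a result card.
result_cards = summarize(raw_marks)
for student, info in result_cards.items():
    print(student, info)
```

The individual marks alone say little, while the totals and percentages can directly support a decision such as pass/fail or ranking.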
Ex: The data collected in a survey report is: ‘HYD20M’
If we process the above data, we understand that the code carries information about a person as follows:
HYD is the city name ‘Hyderabad’, 20 is the age, and M represents ‘MALE’.
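The decoding of ‘HYD20M’ can be sketched in Python. This is only an illustration: the field layout (a three-letter city code, then the age in digits, then a one-letter gender code) is assumed from the single example, and the lookup tables and the `decode_survey_code` function are hypothetical.

```python
import re

# Hypothetical lookup tables; only HYD and M appear in the example above.
CITY_CODES = {"HYD": "Hyderabad"}
GENDER_CODES = {"M": "Male", "F": "Female"}


def decode_survey_code(code):
    """Split a code such as 'HYD20M' into city, age, and gender."""
    # Assumed layout: 3 capital letters, 1-3 digits, 1 capital letter.
    match = re.fullmatch(r"([A-Z]{3})(\d{1,3})([A-Z])", code)
    if match is None:
        raise ValueError(f"unrecognized survey code: {code!r}")
    city, age, gender = match.groups()
    return {
        "city": CITY_CODES.get(city, city),        # fall back to the raw code
        "age": int(age),
        "gender": GENDER_CODES.get(gender, gender),
    }


record = decode_survey_code("HYD20M")
print(record)
```

The raw string is data; the labelled record produced by the function is information that a person can read directly.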
Units of data: When dealing with big data, we express sizes in units such as megabytes, gigabytes,
and terabytes. The following system of units is used to represent data sizes.
International System of Units (SI)
Unit        Symbol   Size in bytes
Kilobyte    KB       10^3
Megabyte    MB       10^6
Gigabyte    GB       10^9
Terabyte    TB       10^12
Petabyte    PB       10^15
Exabyte     EB       10^18
Zettabyte   ZB       10^21
Yottabyte   YB       10^24
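To make the table concrete, here is a minimal Python sketch that expresses a byte count in the largest fitting decimal (SI) unit. The function name `format_bytes` and the two-decimal formatting are choices made for this illustration only.

```python
# Decimal (SI) units from the table, largest first: (symbol, power of ten).
SI_UNITS = [
    ("YB", 24), ("ZB", 21), ("EB", 18), ("PB", 15),
    ("TB", 12), ("GB", 9), ("MB", 6), ("KB", 3),
]


def format_bytes(num_bytes):
    """Return the byte count expressed in the largest unit that fits."""
    for symbol, exponent in SI_UNITS:
        size = 10 ** exponent
        if num_bytes >= size:
            return f"{num_bytes / size:.2f} {symbol}"
    return f"{num_bytes} B"  # smaller than one kilobyte


print(format_bytes(1500))           # 1.50 KB
print(format_bytes(2_500_000_000))  # 2.50 GB
print(format_bytes(5 * 10**12))     # 5.00 TB
```

Note that these are powers of ten, as in the table; the binary units (powers of 1024) used by some operating systems would give slightly different numbers.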
What is Big Data?
Big Data:
Big Data is a term used for a collection of data sets that are so large and complex that they are difficult
to store and process using available database management tools or traditional data processing applications.
The quantity of data on Earth is growing exponentially for many reasons. Various sources and our
day-to-day activities generate large amounts of data. With smart objects going online, the data growth rate has
increased rapidly.
In the digital world, data is increasing at a very high rate because of the growing use of the internet,
sensors, and heavy machines. The sheer volume, variety, velocity, and veracity of such data is
signified by the term “BIG DATA”.
Ex: Rolling web log data, network and system logs, and click information. What is considered “big data” varies
depending on the capabilities of the organization managing the data set, and on the capabilities of the
applications that are traditionally used to process and analyze the data set in its domain. Big data is when
the data itself becomes part of the problem.
Evolution of big data:
There are some major milestones in the evolution of big data.
1940s: Storage available for information was limited.
1960s: “Automatic Data Compression” was published. It argued that the explosion of information in the
preceding few years made it necessary to keep the storage requirements for information to a minimum.
1970s: Information flow was surveyed to track the volume of information circulating in a country.
1980s: A research project was started to measure the volume of information in bits.
1990s: Digital storage systems became more economical than paper storage.
2000s: From 2000 onwards, various methods were introduced to streamline information techniques for
controlling the volume, velocity, and variety of data, thus introducing 3D data management.
Hadoop was created by Doug Cutting and Mike Cafarella in 2005. It was originally developed to support
distribution for the Nutch search engine project. Doug, who was working at Yahoo! at the time and is
now Chief Architect of Cloudera, named the project after his son's toy elephant.