8/1/2019 6 Frequently Asked Hadoop Interview Questions and Answers - DZone Big Data
6 Frequently Asked Hadoop Interview
Questions and Answers
by Arul Kumaran · Dec. 11, 16 · Big Data Zone · Opinion
Are you preparing for an interview soon and need to have knowledge of Hadoop? DON'T PANIC! Here are
some questions you may be asked and the answers you should try to give.
Q1. What Is Hadoop?
Hadoop is an open-source software framework for storing large amounts of data and processing/querying
those data on a cluster with multiple nodes of commodity hardware (i.e. low-cost hardware). In short,
Hadoop consists of the following:
HDFS (Hadoop Distributed File System): HDFS allows you to store huge amounts of data in a distributed
and redundant manner. For example, a 1 GB (i.e. 1024 MB) text file can be split into 8 * 128 MB blocks and
stored on 8 different nodes in a Hadoop cluster. Each block can be replicated 3 times for fault tolerance so that
if 1 node goes down, you have backups. HDFS is good for sequential, write-once-and-read-many-times type
access.
https://dzone.com/articles/6-faq-hadoop-interview-questions-amp-answers-with 1/8
MapReduce: A computational framework that processes large amounts of data in a distributed and parallel
manner. When you run a query on the above 1 GB file for all users with age > 18, there will be, say, 8 “map”
functions running in parallel, each extracting users with age > 18 from its own 128 MB split, and then the “reduce”
function will run to combine all the individual outputs into a single final result.
YARN (Yet Another Resource Negotiator): A framework for job scheduling and cluster resource
management.
Hadoop ecosystem, with 15+ frameworks & tools like Sqoop, Flume, Kafka, Pig, Hive, Spark, Impala, etc., to
ingest data into HDFS, to wrangle data (i.e. transform, enrich, aggregate, etc.) within HDFS, and to query data
from HDFS for business intelligence & analytics. Some tools like Pig and Hive are abstraction layers on top of
MapReduce, whilst other tools like Spark and Impala improve on the MapReduce architecture/design
to achieve much lower latencies and support near real-time (i.e. NRT) and real-time processing.
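The HDFS and MapReduce descriptions above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not how Hadoop itself is implemented: the sample users, the helper names (block_count, split_records, map_adults, reduce_concat), and the in-memory lists standing in for blocks are all hypothetical.

```python
import math

BLOCK_SIZE_MB = 128      # block size used in the example above
REPLICATION = 3          # replicas kept per block


def block_count(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of blocks needed to hold a file of the given size."""
    return math.ceil(file_size_mb / block_size_mb)


def split_records(records, n_splits):
    """Divide records into contiguous chunks (stand-ins for blocks)."""
    chunk = math.ceil(len(records) / n_splits)
    return [records[i:i + chunk] for i in range(0, len(records), chunk)]


def map_adults(split):
    """Map step: runs independently on one split."""
    return [user for user in split if user["age"] > 18]


def reduce_concat(mapped_outputs):
    """Reduce step: combine the per-split outputs into one result."""
    result = []
    for part in mapped_outputs:
        result.extend(part)
    return result


# Hypothetical sample data: 16 users aged 10 to 25.
users = [{"name": f"user{i}", "age": 10 + i} for i in range(16)]

n_blocks = block_count(1024)               # a 1 GB file
stored_copies = n_blocks * REPLICATION     # block replicas held by the cluster
splits = split_records(users, n_blocks)
adults = reduce_concat(map_adults(s) for s in splits)
```

Here a 1024 MB file at 128 MB per block needs 8 blocks (24 stored replicas at a replication factor of 3), and each of the 8 map calls only ever sees its own split.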
Q2. Why Are Organizations Moving from Traditional
Data Warehouse Tools to Smarter Data Hubs Based
on Hadoop Ecosystems?
Organizations are investing to enhance their:
Existing data infrastructure, which:
predominantly uses “structured data” stored on high-end, expensive hardware.
predominantly processes data as ETL batch jobs that ingest it into RDBMS and data warehouse systems
for data mining, analysis & reporting to make key business decisions.
predominantly handles data volumes in gigabytes to terabytes.
Smarter data infrastructure based on Hadoop, where:
structured (e.g. RDBMS), unstructured (e.g. images, PDFs, docs), and semi-structured (e.g. logs,
XMLs) data can be stored on cheaper commodity machines in a scalable and fault-tolerant manner.
data can be ingested via batch jobs and near real-time (i.e. NRT, 200ms to 2 seconds) streaming (e.g.
Flume and Kafka).
data can be queried with low-latency (i.e. under 100ms) capabilities with tools like Spark & Impala.
larger data volumes in terabytes to petabytes can be stored.
This empowers organizations to make better business decisions with smarter and bigger data with more
powerful tools to ingest data, to wrangle stored data (e.g. aggregate, enrich, transform, etc.), and to query
the wrangled data with low-latency capabilities for reporting and business intelligence.
Q3. How Does a Smarter & Bigger Data Hub
Architecture Differ from a Traditional Data
Warehouse Architecture?
Traditional Enterprise Data Warehouse Architecture
Hadoop-based Data Hub Architecture
Q4. What Are the Benefits of Hadoop-Based Data
Hubs?
Improves the overall SLAs (i.e. Service Level Agreements) as the data volume and complexity grows.
Contributing factors include the “Shared Nothing” architecture, parallel processing, memory-intensive
processing frameworks like Spark and Impala, and resource preemption in YARN’s capacity scheduler.
Scaling data warehouses can be expensive. Adding high-end hardware capacity
and licensing data warehouse tools can cost significantly more. Hadoop-based solutions can not only be
cheaper with commodity hardware nodes and open-source tools, but can also complement the data
warehouse solution by offloading data transformations to Hadoop tools like Spark and Impala for more
efficient parallel processing of Big Data. This will also free up the data warehouse resources.
Exploration of new avenues and leads. Hadoop can provide an exploratory sandbox for the data scientists
to discover potentially valuable data from social media, log files, emails, etc., that are not normally available in
data warehouses.
Better flexibility. Often business requirements change, and this requires changes to schema and reports.
Hadoop-based solutions are not only flexible enough to handle evolving schemas, but can also handle
semi-structured and unstructured data from disparate sources like social media, application log files, images, PDFs,
and document files.
Q5. What Are Key Steps in Big Data Solutions?
Ingesting data, storing data (i.e. data modelling), and processing data (i.e. data wrangling, data
transformations, and querying data).
Ingesting Data
Extracting data from various sources such as:
1. RDBMS: Relational Database Management Systems like Oracle, MySQL, etc.
2. ERP: Enterprise Resource Planning systems like SAP.
3. CRM: Customer Relationship Management systems like Siebel, Salesforce, etc.
4. Social Media feeds and log files.
5. Flat files, docs, and images.
and storing it on a data hub based on the Hadoop Distributed File System (HDFS).
Data can be ingested via batch jobs (e.g. running every 15 minutes, once every night, etc.), streaming in
near-real-time (i.e. 100ms to 2 minutes), and streaming in real-time (i.e. under 100ms).
One common term used in Hadoop is “Schema-On-Read”. This means unprocessed (aka raw) data can be
loaded into HDFS with a structure applied at processing time based on the requirements of the processing
application. This is different from “Schema-On-Write”, which is used in RDBMSs, where the schema needs to be
defined before the data can be loaded.
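A minimal sketch of the schema-on-read idea, assuming comma-separated raw text and two hypothetical consumers: the raw lines are kept untouched, and each reader supplies its own column names and types only when it reads. The sample data and the read_with_schema helper are illustrative, not part of any Hadoop API.

```python
import csv
import io

# Raw data lands in storage untouched; no schema is declared at load time.
RAW_LINES = "1,alice,34\n2,bob,17\n3,carol,52\n"


def read_with_schema(raw_text, schema):
    """Apply (name, type) pairs to each raw line only when reading."""
    rows = []
    for fields in csv.reader(io.StringIO(raw_text)):
        # zip stops at the shorter input, so a narrower schema reads fewer fields.
        rows.append({name: cast(value) for (name, cast), value in zip(schema, fields)})
    return rows


# Two applications read the same raw bytes with different schemas.
full_schema = [("id", int), ("name", str), ("age", int)]
ids_only_schema = [("id", int)]

people = read_with_schema(RAW_LINES, full_schema)
ids = read_with_schema(RAW_LINES, ids_only_schema)
```

With schema-on-write, by contrast, the table definition would have to exist before the first row could be loaded.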
Storing Data
Data can be stored on HDFS or NoSQL databases like HBase. HDFS is optimized for sequential access and
the usage pattern of “Write-Once & Read-Many”. HDFS has high read and write rates as it can parallelize I/Os
to multiple drives. HBase sits on top of HDFS and stores data as key/value pairs in a columnar fashion.
Columns are clubbed together as column families. HBase is suited for random read/write access. Before
data can be stored in Hadoop, you need to consider the following:
1. Data Storage Formats: There are a number of file formats (e.g. CSV, JSON, sequence, Avro, Parquet,
etc.) and data compression algorithms (e.g. snappy, LZO, gzip, bzip2, etc.) that can be applied. Each has
particular strengths. Compression algorithms like LZO (once indexed) and bzip2 are splittable.
2. Data Modelling: Despite the schema-less nature of Hadoop, schema design is an important
consideration. This includes directory structures and schema of objects stored in HBase, Hive and
Impala. Hadoop often serves as a data hub for the entire organization, and the data is intended to be
shared. Hence, carefully structured and organized storage of your data is important.
3. Metadata management: Metadata related to stored data.
4. Multitenancy: Smarter data hubs host multiple users, groups, and applications, which often
results in challenges relating to governance, standardization, and management.
Processing Data
Hadoop’s processing frameworks use HDFS. They follow the “Shared Nothing” architecture, in which
each node of a distributed system is completely independent of the other nodes. There are no shared
resources like CPU, memory, and disk storage that can become a bottleneck. With Hadoop’s processing frameworks
like Spark, Pig, Hive, Impala, etc., each node processes a distinct subset of the data and there is no need to manage
access to shared data. “Shared Nothing” architectures are very scalable, as more nodes can be added
without further contention, and fault tolerant, as each node is independent, there are no single points of
failure, and the system can quickly recover from the failure of an individual node.
Q6. How Would You Go About Choosing Among the
Different File Formats for Storing and Processing
Data?
One of the key design decisions is the choice of file format, based on:
1. Usage patterns like accessing 5 columns out of 50 columns vs accessing most of the columns.
2. Splittability to be processed in parallel.
3. Block compression saving storage space vs read/write/transfer performance.
4. Schema evolution to add fields, modify fields, and rename fields.
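To make the splittability and block compression criteria concrete, here is a small sketch using Python's standard zlib module. It assumes hypothetical CSV-like records and an arbitrary block size of 250 records; real formats define their own block layout, so treat this only as an illustration of why independently compressed blocks can be read in parallel.

```python
import zlib


def compress_in_blocks(records, records_per_block):
    """Compress each group of records independently of the others."""
    blocks = []
    for i in range(0, len(records), records_per_block):
        payload = "\n".join(records[i:i + records_per_block]).encode("utf-8")
        blocks.append(zlib.compress(payload))
    return blocks


def read_block(blocks, index):
    """Decompress a single block without touching the rest of the file."""
    return zlib.decompress(blocks[index]).decode("utf-8").split("\n")


# Hypothetical sample data: 1,000 CSV-like records.
records = [f"{i},user{i},{20 + i % 50}" for i in range(1000)]
blocks = compress_in_blocks(records, records_per_block=250)

raw_size = sum(len(r) + 1 for r in records)      # bytes, counting newlines
compressed_size = sum(len(b) for b in blocks)
third_block = read_block(blocks, 2)              # records 500 to 749 only
```

If the whole file were compressed as one stream instead, a reader wanting records 500 to 749 would first have to decompress everything before them.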
CSV Files
CSV files are common for exchanging data between Hadoop & external systems. CSVs are readable and
parsable. CSVs are handy for bulk loading from databases to Hadoop or into an analytic database. When using
CSV files in Hadoop, never include header or footer lines. Each line of the file should contain a record. CSV files
have limited support for schema evolution, as new fields can only be appended to the end of a record and existing
fields can never be deleted. CSV files do not support block compression, hence compressing a CSV file comes
at a significant read performance cost.
JSON Files
A file of JSON records is different from a single JSON document; each line is its own JSON record. As JSON stores
both schema and data together for each record, it enables full schema evolution and splittability. However,
JSON files do not support block-level compression.
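A short sketch of the one-record-per-line layout, using only Python's standard json module. The two sample records and the country field are hypothetical; the point is that each line parses on its own and that a newly added field does not break readers of older records.

```python
import json

# One JSON record per line; the second record carries a field the first lacks.
JSON_LINES = (
    '{"id": 1, "name": "alice"}\n'
    '{"id": 2, "name": "bob", "country": "AU"}\n'
)


def parse_records(text):
    """Parse each non-empty line as an independent JSON record."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


records = parse_records(JSON_LINES)

# Records are self-describing, so a reader copes with the new field
# by supplying a default when it is absent.
countries = [r.get("country", "unknown") for r in records]

# Any line can be parsed on its own, which is what makes the file splittable.
second_only = json.loads(JSON_LINES.splitlines()[1])
```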
Sequence Files
Sequence files store data in binary format with a similar structure to CSV files. Like CSV, Sequence files do not
store metadata, hence the only schema evolution supported is appending new fields to the end of the record. Unlike CSV
files, Sequence files do support block compression. Sequence files are also splittable. Sequence files can be
used to solve the “small files problem” by combining smaller XML files, storing the filename as the key and the
file contents as the value. Due to the complexity of reading Sequence files, they are more suited for in-flight (i.e.
intermediate) data storage.
Note: A SequenceFile is Java-centric and cannot be used cross-platform.
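The filename-as-key, contents-as-value idea for the small files problem can be sketched as follows. This is not the actual SequenceFile on-disk format; it is a hypothetical length-prefixed container written with Python's struct module, and the file names and XML snippets are made up for illustration.

```python
import io
import struct


def pack_small_files(files):
    """Pack {filename: bytes} into one length-prefixed key/value stream."""
    out = io.BytesIO()
    for name, content in files.items():
        key = name.encode("utf-8")
        # Header: two big-endian unsigned 32-bit lengths (key, then value).
        out.write(struct.pack(">II", len(key), len(content)))
        out.write(key)
        out.write(content)
    return out.getvalue()


def unpack_small_files(blob):
    """Read the key/value pairs back into a dictionary."""
    files = {}
    stream = io.BytesIO(blob)
    while True:
        header = stream.read(8)
        if not header:
            break
        key_len, value_len = struct.unpack(">II", header)
        name = stream.read(key_len).decode("utf-8")
        files[name] = stream.read(value_len)
    return files


# Hypothetical small XML files.
small_files = {
    "order-1.xml": b"<order id='1'/>",
    "order-2.xml": b"<order id='2'/>",
}
container = pack_small_files(small_files)
restored = unpack_small_files(container)
```

Storing one container instead of many tiny files means the file system tracks a single large object rather than one entry per small file.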
Avro Files
These are suited for long-term storage with schema. Avro files store metadata with data, but also allow
specification of an independent schema for reading the file. This enables full schema evolution support,
allowing you to rename, add, and delete fields and change data types of fields by defining a new independent
schema. An Avro file defines the schema in JSON format, and the data is stored in a compact binary format. Avro files are
also splittable and support block compression. Avro is more suited to usage patterns where row-level access is
required, meaning all the columns in the row are queried. It is not suited to cases where a row has 50+ columns and the
usage pattern requires only 10 or fewer columns to be accessed. The Parquet file format is more suited for this
columnar access usage pattern.
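The reader-schema idea can be illustrated with a simplified resolver in plain Python. It does not use the Avro library and ignores type promotion; the resolve helper and the field names are hypothetical, loosely modelled on the name, aliases, and default attributes of Avro's JSON schema fields.

```python
def resolve(record, reader_fields):
    """Project a written record onto a reader schema.

    Each reader field is a dict with "name" and optional "aliases" and
    "default" keys. Fields absent from the reader schema are dropped.
    """
    resolved = {}
    for field in reader_fields:
        candidates = [field["name"]] + field.get("aliases", [])
        for key in candidates:
            if key in record:
                resolved[field["name"]] = record[key]
                break
        else:
            # No written value under any known name: fall back to the default.
            if "default" not in field:
                raise KeyError(f"no value or default for {field['name']}")
            resolved[field["name"]] = field["default"]
    return resolved


# Record as written under the old schema.
old_record = {"id": 7, "fname": "Ada", "legacy_flag": True}

# New reader schema: "fname" renamed, "country" added, "legacy_flag" deleted.
reader_fields = [
    {"name": "id"},
    {"name": "first_name", "aliases": ["fname"]},
    {"name": "country", "default": "unknown"},
]

new_view = resolve(old_record, reader_fields)
```

The stored record is never rewritten; only the schema used at read time changes, which is what makes rename, add, and delete possible.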
Columnar Formats, e.g. RCFile, ORC
RDBMSs store records in a row-oriented fashion as this is efficient for cases where many columns of a record
need to be fetched. Row-oriented writing is also efficient if all the column values are known at the time of
writing a record to the disk. But this approach would not be efficient to fetch just 10% of the columns in a
row or if all the column values are not known at the time of writing. This is where columnar files make more
sense. So the columnar format works well
for skipping I/O and decompression on columns that are not part of the query.
for queries that only access a small subset of columns.
for data-warehousing-type applications where users want to aggregate certain columns over a large
collection of records.
RC & ORC formats are specifically written for Hive and are not as general purpose as Parquet.
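The row-oriented versus columnar trade-off described above can be sketched with in-memory Python structures. The three sample records are hypothetical, and counting "fields touched" is only a rough stand-in for the I/O a real columnar reader would skip.

```python
# Hypothetical records with three columns.
rows = [
    {"id": 1, "name": "alice", "amount": 10.0},
    {"id": 2, "name": "bob", "amount": 25.5},
    {"id": 3, "name": "carol", "amount": 4.5},
]


def to_columnar(records):
    """Pivot row-oriented records into one list per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columnar(rows)

# Row layout: every field of every record is loaded to total one column.
fields_touched_row = sum(len(r) for r in rows)
total_from_rows = sum(r["amount"] for r in rows)

# Columnar layout: only the "amount" column is read.
fields_touched_col = len(columns["amount"])
total_from_columns = sum(columns["amount"])
```

Both layouts give the same total, but the columnar one touches 3 values instead of 9; the gap widens as the number of unused columns grows.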
Parquet Files
Parquet is a columnar file format like RC and ORC. Parquet files support block compression and are optimized for
query performance, as 10 or fewer columns can be selected from records with 50+ columns. Parquet file write
performance is slower than that of noncolumnar file formats. Parquet also supports limited schema evolution by
allowing new columns to be added at the end. Parquet can be read and written to with Avro APIs and Avro
schemas.
So, in summary, you should favor Sequence, Avro, and Parquet file formats over the others; Sequence files for
raw and intermediate storage, and Avro and Parquet files for processing.
Further reading: 70+ more Hadoop, spark, and BigData interview questions & answers