BIG DATA INGESTION
Practice Test
NAMAN BARTWAL
R172219036
CSE BIG DATA
❖ Write a description of Sqoop and its characteristics.
Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache
Hadoop and structured data stores such as relational databases.
Traditional application systems, in which applications interact with relational databases through an RDBMS, are one of the sources that generate Big Data. Such data is stored on relational database servers in a relational structure.
When the Big Data storage and analysis tools of the Hadoop ecosystem, such as MapReduce, Hive, HBase, Cassandra, and Pig, came into the picture, they required a tool to interact with relational database servers in order to import and export the Big Data residing in them. Sqoop occupies this place in the Hadoop ecosystem, providing practical interaction between relational database servers and Hadoop's HDFS.
Sqoop − “SQL to Hadoop and Hadoop to SQL”
Sqoop is a tool designed to transfer data between Hadoop and relational database servers. It is used to import data from relational databases such as MySQL and Oracle into Hadoop HDFS, and to export data from the Hadoop file system back into relational databases. It is provided by the Apache Software Foundation.
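As a minimal sketch of these two directions, the commands below import a table into HDFS and then export results back out. The host name, database, credentials, table names, and paths are hypothetical placeholders, not values from any real deployment:

    # Import the "employees" table from MySQL into an HDFS directory.
    sqoop import \
      --connect jdbc:mysql://dbserver.example.com/company \
      --username sqoop_user \
      --password-file /user/sqoop/.password \
      --table employees \
      --target-dir /user/hadoop/employees

    # Export processed results from HDFS back into a relational table.
    sqoop export \
      --connect jdbc:mysql://dbserver.example.com/company \
      --username sqoop_user \
      --password-file /user/sqoop/.password \
      --table employee_summary \
      --export-dir /user/hadoop/employee_summary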
Characteristics of Apache Sqoop
The various key features of Apache Sqoop are described below; hedged command-line sketches for several of them follow the list:
1. Robust: Apache Sqoop is mature and robust, with active community support and contributions, and is easy to use.
2. Full Load: Using Sqoop, we can load a whole table with a single command. Sqoop also allows us to load all the tables of a database with a single command.
3. Incremental Load: Sqoop supports incremental loads, so we can import only the rows that were added or changed since the previous import instead of re-importing the whole table.
4. Parallel import/export: Apache Sqoop runs its imports and exports as MapReduce jobs on the YARN framework, which provides fault tolerance on top of parallelism.
5. Import results of SQL query: Sqoop also allows us to import the result of an arbitrary SQL query into the Hadoop Distributed File System.
6. Compression: We can compress our data either by using the default deflate (gzip) algorithm with the --compress argument or by naming a specific codec with the --compression-codec argument. We can then load the compressed table into Apache Hive.
7. Connectors for all major RDBMSs: Sqoop provides connectors for most widely used relational databases, including MySQL, PostgreSQL, Oracle, SQL Server, and DB2, plus a generic JDBC connector for the rest.
8. Kerberos Security Integration: Kerberos is a computer network authentication protocol that works on the basis of 'tickets', allowing nodes communicating over a non-secure network to prove their identity to each other. Apache Sqoop supports Kerberos authentication.
9. Load data directly into Hive/HBase: Using Sqoop, we can load data directly into Apache Hive for data analysis. We can also write our data into HBase, a NoSQL database.
10. Support for Accumulo: We can instruct Apache Sqoop to import a table into Accumulo instead of into a directory in HDFS.
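To illustrate the full-load and incremental-load features (2 and 3), the sketches below reuse the same hypothetical connection string as earlier; the table, column, and directory names are placeholders:

    # Full load: import every table of the database with one command.
    sqoop import-all-tables \
      --connect jdbc:mysql://dbserver.example.com/company \
      --username sqoop_user \
      --warehouse-dir /user/hadoop/company

    # Incremental load: import only rows whose "id" exceeds the last
    # value recorded by the previous run.
    sqoop import \
      --connect jdbc:mysql://dbserver.example.com/company \
      --username sqoop_user \
      --table orders \
      --incremental append \
      --check-column id \
      --last-value 10000 \
      --target-dir /user/hadoop/orders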
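Features 4 and 5 can be combined in a single command: a free-form query import split across parallel map tasks. The literal token $CONDITIONS is required so that Sqoop can partition the query among the mappers; the query and column names here are hypothetical:

    # Import a query result using 4 parallel mappers.
    sqoop import \
      --connect jdbc:mysql://dbserver.example.com/company \
      --username sqoop_user \
      --query 'SELECT o.id, o.total FROM orders o WHERE $CONDITIONS' \
      --split-by o.id \
      --num-mappers 4 \
      --target-dir /user/hadoop/order_totals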
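A sketch of the compression feature (6); the codec class shown is Hadoop's standard Snappy codec, and the table and directory names are again placeholders:

    # --compress alone uses the default deflate (gzip) codec;
    # --compression-codec selects a specific codec instead.
    sqoop import \
      --connect jdbc:mysql://dbserver.example.com/company \
      --username sqoop_user \
      --table employees \
      --target-dir /user/hadoop/employees_compressed \
      --compress \
      --compression-codec org.apache.hadoop.io.compress.SnappyCodec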
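For feature 9, Sqoop can write directly into Hive or HBase rather than plain HDFS files; the Hive table, HBase table, column family, and row key below are hypothetical names:

    # Import straight into a Hive table for analysis.
    sqoop import \
      --connect jdbc:mysql://dbserver.example.com/company \
      --username sqoop_user \
      --table employees \
      --hive-import \
      --hive-table employees

    # Import into an HBase table instead of HDFS files; assumes the
    # HBase table exists (otherwise add --hbase-create-table).
    sqoop import \
      --connect jdbc:mysql://dbserver.example.com/company \
      --username sqoop_user \
      --table employees \
      --hbase-table employees \
      --column-family info \
      --hbase-row-key id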
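Finally, a sketch of the Accumulo support in feature 10; the instance, ZooKeeper quorum, credentials, and table names are placeholders for a hypothetical cluster:

    # Import into an Accumulo table instead of an HDFS directory.
    sqoop import \
      --connect jdbc:mysql://dbserver.example.com/company \
      --username sqoop_user \
      --table employees \
      --accumulo-table employees \
      --accumulo-column-family info \
      --accumulo-row-key id \
      --accumulo-user accumulo_user \
      --accumulo-password secret \
      --accumulo-instance company_instance \
      --accumulo-zookeepers zk1.example.com:2181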