Introduction to Sqoop Import
The name Sqoop is a combination of SQL and Hadoop, and the tool is essentially a bridge between the two worlds, i.e. RDBMS and HDFS. It performs bulk data imports from structured data sources such as RDBMSs and data warehouses into HDFS, and it can bring in all tables, a single table, or a portion of a table. Sqoop can also move data from HDFS back to a database, a process called exporting. It was originally created at Cloudera and later open-sourced. In this topic, we are going to learn about Sqoop Import.
Sqoop is a single client program that interacts with the Hadoop file system by generating one or more map-reduce jobs. It uses the primary key column to divide the source data across its mappers, and by default it spawns 4 mappers. Sqoop works with anything that is JDBC compliant; for databases that are not fully compliant, dedicated Sqoop connectors are available, with the generic JDBC connector serving as the most basic fallback.
Syntax
To run the Sqoop import command you need a valid login id (username) and password. Below is the basic Sqoop import command to transfer data from a database table into HDFS.
Syntax:
$ sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username cloudera \
--password secretkey \
--table department \
--target-dir /sqoopdata/departments \
--where "department_id = 1000" \
-m 1
Here, you provide the JDBC connection string containing the machine name, port number, and database name. After that, provide a valid username and password, which are confidential. Since supplying the password on the command line can be a security issue, you can instead store the password in a file and point Sqoop at it with the --password-file parameter.
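For example, one common pattern (a sketch; the path /user/cloudera/sqoop.pwd is only illustrative) is to store the password in a file on HDFS and reference it with --password-file:
# create the password file without a trailing newline, then copy it to HDFS
$ echo -n "secretkey" > sqoop.pwd
$ hdfs dfs -put sqoop.pwd /user/cloudera/sqoop.pwd
# --password-file reads the password at job submission instead of exposing it on the command line
$ sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username cloudera \
--password-file /user/cloudera/sqoop.pwd \
--table department \
--target-dir /sqoopdata/departments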
Next is the --table parameter, which names the table to import (here, the department table). Another parameter is the target directory; by default a directory with the same name as the table is created, but you can specify your own with --target-dir.
The --where parameter is used to import only a subset of the data.
-m is used to specify the number of map tasks. By default, Sqoop uses 4 mappers, but with -m you can control the degree of parallelism by setting the mapper count yourself.
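For instance, to raise the parallelism to 8 mappers you could run something like the following sketch (department_id is assumed here to be the table's numeric primary key; --split-by makes the split column explicit):
# -m sets the mapper count; --split-by names the column used to divide rows among mappers
$ sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username cloudera \
--password secretkey \
--table department \
--target-dir /sqoopdata/departments_parallel \
--split-by department_id \
-m 8
More mappers mean more parallel JDBC connections to the source database, so the right value depends on what the database can sustain.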
Import command in Sqoop
Here we discuss how the import command works in Sqoop.
- Import is simply the movement of data of any kind from a source to a destination.
- Sqoop uses the import command to move data from an RDBMS into Hadoop components like HDFS, Pig, Hive, etc. (a Hive import is sketched after this list).
- Sqoop focuses on transferring the data securely and can move it between a wide variety of sources and destinations.
- When a client submits a Sqoop command, the very first step Sqoop performs internally is fetching the metadata, that is, information about the data (the table's schema and primary key).
- It then launches a map-only job with no reduce phase. Why no reduce? Because the job only loads and transfers data, so there is no aggregation involved.
- By default it creates 4 map tasks and distributes the data evenly across the mappers for high performance; each mapper reads through a JDBC connection and writes its share of the data to HDFS.
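As an illustration of importing straight into Hive rather than a bare HDFS directory, a sketch like this could be used (the Hive table name departments_hive is hypothetical):
# --hive-import loads the data into the Hive warehouse, creating the table if it does not exist
$ sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username cloudera \
--password secretkey \
--table department \
--hive-import \
--hive-table departments_hive \
-m 1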
Import is done in two stages
- The first stage introspects the database to gather the metadata (primary key information).
- In the second stage, Sqoop divides the input dataset into splits and uses individual map tasks to push the splits into HDFS.
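To compute those splits, Sqoop runs a bounding query of the form SELECT MIN(col), MAX(col) on the split column and divides the resulting range evenly among the mappers. If the default does not suit your data, you can supply your own with --boundary-query, as in this sketch (the WHERE condition is only an example):
# --boundary-query overrides the min/max query Sqoop uses to calculate split ranges
$ sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username cloudera \
--password secretkey \
--table department \
--target-dir /sqoopdata/departments_bounded \
--boundary-query "SELECT MIN(department_id), MAX(department_id) FROM department WHERE department_id < 5000" \
-m 4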
Example
$ sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username cloudera \
--password secretkey \
--table department \
--target-dir /sqoopdata/departments \
--where "department_id = 1000" \
-m 1
If you want to import only specific columns from a table, for example the department table, use --columns "dept_id,name". If you want to import only matching rows, simply apply a condition with the --where parameter, e.g. --where "dept_id >= 1000". A minimal sketch combining both options is shown below (the column names dept_id and name are assumed for illustration):
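# dept_id and name are hypothetical column names in the department table
$ sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username cloudera \
--password secretkey \
--table department \
--columns "dept_id,name" \
--where "dept_id >= 1000" \
--target-dir /sqoopdata/departments_subset
To import all tables from the database, use import-all-tables: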
Syntax:
$ sqoop import-all-tables \
--connect jdbc:mysql://localhost:3306/retail_db \
--username cloudera \
--password secret \
--warehouse-dir /mydata
While importing all tables, make sure that every table has a primary key; tables without one cannot be split across multiple mappers, so either import them with a single mapper or exclude them, as sketched below.
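If some tables should be skipped, for example because they lack a primary key, one option is to leave them out with --exclude-tables (a sketch; the table names audit_log and temp_staging are hypothetical):
# --exclude-tables takes a comma-separated list of tables to skip
$ sqoop import-all-tables \
--connect jdbc:mysql://localhost:3306/retail_db \
--username cloudera \
--password secret \
--warehouse-dir /mydata \
--exclude-tables audit_log,temp_staging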
Advantages
- It allows the transfer of data with a variety of structured data stores, such as Teradata and other enterprise data warehouses.
- Sqoop can execute the data transfer in parallel.
- Since the data lands directly in the Hadoop cluster, Sqoop allows you to offload certain processing from the ETL pipeline into Hadoop, which is fast, low-cost, and effective.
- It handles bulk data efficiently.
Conclusion
Sqoop is, in essence, a secure and budget-friendly transport layer between relational stores and Hadoop, and it can be used efficiently and effectively almost anywhere. Because it is fast and runs its transfers in parallel, many teams adopt it at their own sites to get better results.
Recommended Articles
This is a guide to Sqoop Import. Here we discussed the import command in Sqoop along with its syntax, examples, and advantages.