Sqoop Import

By Priya Pedamkar


Introduction to Sqoop Import

The name Sqoop is a combination of SQL and Hadoop, and the tool is essentially a bridge between the two worlds, i.e. RDBMS and HDFS. It is a tool for bulk data import from structured data sources such as RDBMSs and data warehouses into HDFS. It can import all tables, a single table, or a portion of a table into HDFS. Sqoop can also export data from HDFS back to the database, and that process is called exporting. Sqoop was created at Cloudera and later open-sourced. In this topic, we are going to learn about Sqoop Import.

It is a single client program that interacts with the Hadoop file system and creates one or more map-only MapReduce jobs. It uses the table's primary key column to divide the source data across its mappers, and by default Sqoop spawns 4 mappers. Sqoop works with anything that is JDBC compliant, and for databases that need special handling there are dedicated Sqoop connectors; the generic JDBC connector is the most basic of them and serves as the fallback.
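For a database without a dedicated connector, you can point Sqoop at the generic JDBC connector by naming the JDBC driver class explicitly with the --driver option. The sketch below is only an illustration; the driver class, host, and target directory are assumptions, not values from this guide:

$ sqoop import \
--driver org.postgresql.Driver \
--connect jdbc:postgresql://dbhost:5432/retail_db \
--username cloudera \
-P \
--table department \
--target-dir /sqoopdata/departments_pg
# --driver forces the generic JDBC connector; -P prompts for the password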


Syntax

To run the Sqoop import command you need a valid database login (username) and password. Below is the basic Sqoop import command for transferring data from an RDBMS table into HDFS.

Syntax:

$ sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username cloudera \
--password secretkey \
--table department \
--target-dir /sqoopdata/departments \
--where "department_id = 1000" \
-m 1

Here, in the --connect string you provide the database type, the machine name, the port number, and the database name. After that you supply a valid username and password, which are confidential. Providing the password on the command line can be a security issue, so you can also store the password in a file and have Sqoop read it from there.
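As a safer alternative (a minimal sketch; the file path is an assumption), you can keep the password in a restricted file and pass it with --password-file, or use -P to be prompted at run time:

# store the password in a file that only you can read (example path)
$ echo -n "secretkey" > /home/cloudera/.sqoop-pass
$ chmod 400 /home/cloudera/.sqoop-pass

$ sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username cloudera \
--password-file file:///home/cloudera/.sqoop-pass \
--table department \
--target-dir /sqoopdata/departments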

Next is the --table parameter, which names the table you want to import (to "sqoop" the data from); here we use the department table. Another parameter is --target-dir, the HDFS directory the data is written to. By default Sqoop creates a directory named after the table in your HDFS home directory, but you can specify your own path.
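If you prefer to keep several imported tables under one parent directory, --warehouse-dir can be used instead of --target-dir; Sqoop then creates one sub-directory per table. A short sketch (the paths are assumptions):

$ sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username cloudera \
-P \
--table department \
--warehouse-dir /sqoopdata
# the data lands in /sqoopdata/department/ instead of a directory you name yourself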

The --where parameter imports only a subset of the rows; the condition is applied to the query Sqoop runs against the source table.

The -m (or --num-mappers) parameter sets the number of map tasks. By default Sqoop uses 4 mappers, but with -m we can control the degree of parallelism by specifying how many mappers to run.

Import command in Sqoop

Here is what the Sqoop import command does, step by step.

  1. Import is simply moving data of any kind from a source to a destination.
  2. Sqoop uses the import command to move data from an RDBMS into Hadoop components such as HDFS, Pig, and Hive.
  3. Sqoop transfers the data securely and can deliver it to any of these destinations.
  4. When the client submits a Sqoop command, the very first step Sqoop takes is to fetch the metadata, that is, information about the data (data about the data is called metadata).
  5. It then submits, or launches, a map-only job with no reduce phase. Why is there no reduce phase? Because we are only loading and fetching the data, there is no aggregation, so nothing needs to be reduced.
  6. By default it creates 4 map tasks, distributes the data equally across the mappers for high performance, and transfers it to HDFS over the JDBC connection (see the sketch just below this list for how to control the split).
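If the table has no primary key, Sqoop cannot decide how to divide the rows between its mappers. In that case you can name a split column yourself with --split-by, or fall back to a single mapper. A sketch, with the column name assumed:

$ sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username cloudera \
-P \
--table department \
--target-dir /sqoopdata/departments_split \
--split-by department_id \
-m 8
# for a table with no usable split column, fall back to: -m 1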

Import is done in two stages

  • The first step is to introspect the database and gather the metadata (the primary key information).
  • In the second stage, Sqoop divides the input dataset into splits and uses individual map tasks to push the splits to HDFS.
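Once the job finishes, each map task has written its split as a separate file in the target directory, which you can confirm with a plain HDFS listing (the output below is illustrative):

$ hdfs dfs -ls /sqoopdata/departments
# _SUCCESS
# part-m-00000
# part-m-00001   (one part-m file per mapper; with -m 1 there is only one)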

Example

$ sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username cloudera \
--password secretkey \
--table department \
--target-dir /sqoopdata/departments \
--where "department_id = 1000" \
-m 1

This command imports only the rows of the department table where department_id is 1000, writes them to /sqoopdata/departments in HDFS, and uses a single mapper.



If you want to import only certain columns of a table, for example from the department table, use --columns "dept_id,name". If you want to import only the matching rows, simply apply the condition with the --where parameter, for example --where "dept_id >= 1000".
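Putting the two options together (the column names follow the example above and are assumptions about the demo schema):

$ sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username cloudera \
-P \
--table department \
--columns "dept_id,name" \
--where "dept_id >= 1000" \
--target-dir /sqoopdata/departments_subset \
-m 1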

To import all the tables from a database, use the following syntax:

$ sqoop import-all-tables \
--connect jdbc:mysql://localhost:3306/retail_db \
--username cloudera \
--password secret \
--warehouse-dir /mydata

While importing all tables, make sure that every table has a primary key, because Sqoop needs a column to split on when it parallelizes each import.
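If one of the tables has no primary key, the all-tables import will stop at that table. You can either skip it with --exclude-tables or let Sqoop fall back to a single mapper for such tables with --autoreset-to-one-mapper; both are standard Sqoop options, and the table name below is only a placeholder:

$ sqoop import-all-tables \
--connect jdbc:mysql://localhost:3306/retail_db \
--username cloudera \
-P \
--warehouse-dir /mydata \
--autoreset-to-one-mapper
# alternatively: --exclude-tables some_table_without_pk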

Advantages

It allows data transfer between Hadoop and a variety of structured data stores such as Teradata and other enterprise data warehouses.

  • Sqoop can execute the data transfer in parallel.
  • Because the data lands directly in the Hadoop cluster, Sqoop lets you offload part of the processing done in the ETL pipeline onto Hadoop, which is fast and cost-effective.
  • It handles bulk data well.

Conclusion

Sqoop is essentially a secure, low-cost transport layer between relational databases and Hadoop, and it can be used efficiently and effectively in many environments. Because it moves data quickly and in parallel, many teams adopt it on their own clusters to get better results.

Recommended Articles

This is a guide to Sqoop Import. Here we discuss the import command in Sqoop along with the advantages, syntax, and examples. You may also have a look at the following articles to learn more –

  1. Sqoop Commands
  2. Sqoop Interview Questions
  3. Sqoop
  4. Big Data Analytics
