Introduction to Pig Commands
Apache Pig is a tool/platform used to analyze large datasets and perform long series of data operations. Pig is used with Hadoop. All Pig scripts are internally converted into MapReduce tasks and then executed. It can handle structured, semi-structured, and unstructured data. Pig stores its results in HDFS.
Here are some characteristics of Pig:
- Self-Optimizing: Pig can optimize the execution of jobs, so the user is free to focus on semantics.
- Ease of Programming: Pig provides a high-level language known as Pig Latin, which is easy to write. Pig Latin provides many operators that programmers can use to process data. Programmers also have the flexibility to write their own functions.
- Extensible: Pig supports custom functions, called UDFs (user-defined functions), which let programmers implement processing requirements that the built-in operators do not cover.
Pig statements run in a shell known as Grunt.
Why Pig Commands?
Programmers who are not comfortable with Java usually struggle to write programs for Hadoop, i.e. MapReduce tasks. For them, Pig Latin, which is quite similar to SQL, is a boon. Its multi-query approach reduces the length of the code.
So overall, it is a concise and effective way of programming. Pig can invoke code written in many languages, such as JRuby, Jython, and Java.
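As a minimal sketch of invoking code in another language, the statements below register a Jython file and call one of its functions. The file "string_udfs.py", its function "to_upper", and the input path and schema are all hypothetical names chosen for illustration; the Jython file is assumed to declare the function's output schema.

```pig
-- Hypothetical UDF file: assumed to define a Jython function "to_upper"
-- that takes a chararray and returns a chararray.
REGISTER 'string_udfs.py' USING jython AS string_udfs;

-- Hypothetical input file and schema.
names = LOAD 'hdfs://localhost:9000/pig_data/names.txt'
        USING PigStorage(',')
        AS (id:int, firstname:chararray);

-- The registered function is called through its namespace, like a built-in.
upper_names = FOREACH names
              GENERATE id, string_udfs.to_upper(firstname) AS firstname_upper;

DUMP upper_names;
```

The namespace given after AS keeps function names from different UDF files from colliding.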
The architecture of Pig Commands
All scripts written in Pig Latin in the Grunt shell go to the parser, which checks the syntax and performs other miscellaneous checks. The output of the parser is a DAG (directed acyclic graph). This DAG is then passed to the optimizer, which performs logical optimizations such as projection and pushdown. The compiler then compiles the logical plan into MapReduce jobs. Finally, these MapReduce jobs are submitted to Hadoop in sorted order. The jobs are executed and produce the desired results.
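As a rough illustration of these stages, the sketch below writes a filter after a join and then asks Pig to print its plans. The relation names, paths, and fields are hypothetical, and which optimizations are actually applied depends on the Pig version and its configuration.

```pig
-- Hypothetical inputs; paths and schemas are assumptions for illustration.
customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt'
            USING PigStorage(',')
            AS (id:int, name:chararray, city:chararray);
orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt'
         USING PigStorage(',')
         AS (order_id:int, customer_id:int, amount:double);

-- The filter is written after the join but uses only a field of
-- "customers", so the logical optimizer may move it ahead of the join.
joined = JOIN customers BY id, orders BY customer_id;
chennai_orders = FILTER joined BY customers::city == 'Chennai';

-- Only two fields are kept, so the unused columns are candidates for pruning.
result = FOREACH chennai_orders GENERATE customers::name, orders::amount;

-- Prints the logical, physical, and MapReduce plans for "result".
EXPLAIN result;
```

Comparing the printed logical plan with the order of the statements is one way to see what the optimizer changed.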
The Pig Latin data model is fully nested, and it allows complex data types such as map and tuple.
Any single value in Pig Latin, irrespective of its data type, is known as an atom.
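A small sketch of the nested data model follows. It also uses a bag (a collection of tuples), another of Pig's complex types. The file name and schema are hypothetical.

```pig
-- Hypothetical file and schema, for illustration only.
-- name    : an atom (a single chararray value)
-- scores  : a bag of (subject, mark) tuples
-- contact : a map from chararray keys to values
students = LOAD 'hdfs://localhost:9000/pig_data/student_records.txt'
           AS (name:chararray,
               scores:bag{t:tuple(subject:chararray, mark:int)},
               contact:map[]);

-- '#' looks up a key in a map; FLATTEN un-nests the bag, producing one
-- output row per (subject, mark) tuple.
flat = FOREACH students
       GENERATE name, contact#'email' AS email, FLATTEN(scores);

DESCRIBE flat;
```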
Basic Pig Commands
Let’s take a look at some of the basic Pig commands:
Fs: This lists the files in HDFS.
grunt> fs -ls
Clear: This clears the interactive Grunt shell.
grunt> clear
History: This command shows the commands executed so far.
grunt> history
Reading Data: Assuming the data resides in HDFS, we need to read it into Pig.
grunt> college_students = LOAD 'hdfs://localhost:9000/pig_data/college_data.txt'
USING PigStorage(',')
AS (id:int, firstname:chararray, lastname:chararray, phone:chararray,
city:chararray, age:int);
PigStorage() is the function that loads and stores data as structured text files.
Storing Data: The STORE operator is used to store the processed/loaded data.
grunt> STORE college_students INTO 'hdfs://localhost:9000/pig_Output/' USING PigStorage(',');
Here, “/pig_Output/” is the directory where the relation is stored.
Dump Operator: This command is used to display the results on screen. It usually helps in debugging.
grunt> Dump college_students;
Describe Operator: It helps the programmer to view the schema of the relation.
grunt> describe college_students;
Explain: This command helps to review the logical, physical and map-reduce execution plans.
grunt> explain college_students;
Illustrate operator: This shows the step-by-step execution of a sequence of statements.
grunt> illustrate college_students;
Intermediate Pig Commands
- Group: This Pig command groups data that has the same key.
grunt> group_data = GROUP college_students BY firstname;
- COGROUP: It works similarly to the GROUP operator. The main difference is that GROUP is usually used with one relation, while COGROUP is used with more than one relation.
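A minimal COGROUP sketch, assuming two hypothetical relations that share an "age" field:

```pig
-- Hypothetical inputs; paths and schemas are assumptions for illustration.
student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt'
                  USING PigStorage(',')
                  AS (id:int, firstname:chararray, age:int, city:chararray);
employee_details = LOAD 'hdfs://localhost:9000/pig_data/employee_details.txt'
                   USING PigStorage(',')
                   AS (id:int, name:chararray, age:int, city:chararray);

-- Each output tuple has the form
--   (group, {student_details tuples}, {employee_details tuples})
-- and a bag is empty when that relation has no tuples for the key.
cogroup_data = COGROUP student_details BY age, employee_details BY age;

DUMP cogroup_data;
```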
- Join: This is used to combine two or more relations.
Example: To perform a self-join, let’s say the relation “customer” is loaded from HDFS into Pig as two relations, customers1 and customers2.
grunt> customers3 = JOIN customers1 BY id, customers2 BY id;
A join can be a self-join, an inner join, or an outer join.
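The join variants can be sketched as follows. The relations, paths, and the "customer_id" field are hypothetical.

```pig
-- Hypothetical inputs; outer joins need a schema on the relevant sides.
customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt'
            USING PigStorage(',')
            AS (id:int, name:chararray);
orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt'
         USING PigStorage(',')
         AS (order_id:int, customer_id:int, amount:double);

-- Inner join: only customers that have at least one order.
inner_join = JOIN customers BY id, orders BY customer_id;

-- Left outer join: every customer, with nulls where no order matches.
left_join = JOIN customers BY id LEFT OUTER, orders BY customer_id;

-- Right outer join: every order, with nulls where no customer matches.
right_join = JOIN customers BY id RIGHT OUTER, orders BY customer_id;

-- Full outer join: all tuples from both sides.
full_join = JOIN customers BY id FULL OUTER, orders BY customer_id;
```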
- Cross: This Pig command calculates the cross product of two or more relations.
grunt> cross_data = CROSS customers, orders;
- Union: It merges two relations. Pig does not enforce matching schemas, but for the result to be meaningful both relations’ columns and domains should be identical.
grunt> student = UNION student1, student2;
Advanced Pig Commands
Let’s take a look at some of the advanced Pig commands:
- Filter: This selects the tuples of a relation that satisfy a given condition.
grunt> filter_data = FILTER college_students BY city == 'Chennai';
- Distinct: This removes duplicate tuples from the relation.
grunt> distinct_data = DISTINCT college_students;
This creates a new relation named “distinct_data”.
- Foreach: This generates data transformations based on column data.
grunt> foreach_data = FOREACH student_details GENERATE id,age,city;
This gets the id, age, and city values of each student from the relation student_details and stores them in another relation named foreach_data.
- Order by: This command sorts a relation by one or more fields.
grunt> order_by_data = ORDER college_students BY age DESC;
This will sort the relation “college_students” in descending order by age.
- Limit: This command gets a limited number of tuples from the relation.
grunt> limit_data = LIMIT student_details 4;
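These operators are usually chained. The sketch below combines them to find the cities with the most adult students. The input path, the schema, and the age threshold are assumptions for illustration.

```pig
-- Hypothetical input; path and schema are assumptions.
student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt'
                  USING PigStorage(',')
                  AS (id:int, firstname:chararray, age:int, city:chararray);

-- Keep only students aged 18 or older.
adults = FILTER student_details BY age >= 18;

-- Group by city, then count the tuples in each group's bag.
by_city = GROUP adults BY city;
city_counts = FOREACH by_city
              GENERATE group AS city, COUNT(adults) AS num_students;

-- Sort by count, highest first, and keep the first four rows.
sorted_counts = ORDER city_counts BY num_students DESC;
top_cities = LIMIT sorted_counts 4;

DUMP top_cities;
```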
Tips and Tricks to Use Pig commands
Below are some tips and tricks for Pig commands:
Enable Compression on your input and output:
set input.compression.enabled true;
set output.compression.enabled true;
The lines above must be at the beginning of the script so that Pig can read compressed files or generate compressed files as output.
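Separately from these settings, the PigStorage documentation describes choosing gzip or bzip2 handling from the file extension. A minimal sketch, assuming hypothetical paths and a hypothetical two-field schema (the behavior may vary between Pig versions):

```pig
-- Hypothetical gzip-compressed, tab-delimited input; the .gz extension
-- tells PigStorage to decompress it on load.
logs = LOAD 'hdfs://localhost:9000/pig_data/access_log.gz'
       USING PigStorage('\t')
       AS (ip:chararray, url:chararray);

-- The .bz2 extension on the output path asks for bzip2-compressed output.
STORE logs INTO 'hdfs://localhost:9000/pig_Output/access_log_copy.bz2'
      USING PigStorage('\t');
```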
Join multiple relations:
To perform a left join on, say, three relations (input1, input2, input3) in a single statement, one would need to opt for SQL, because Pig does not support outer joins on more than two tables.
In Pig, you instead perform the left join in two steps:
data1 = JOIN input1 BY key LEFT, input2 BY key;
data2 = JOIN data1 BY input1::key LEFT, input3 BY key;
This means two MapReduce jobs.
To perform the above task more efficiently, one can opt for COGROUP. COGROUP can join multiple relations, and by default it behaves like an outer join.
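A minimal sketch of the COGROUP approach for three relations follows. The paths and the "value" fields are hypothetical. Note that the result stays nested, with one bag per input relation.

```pig
-- Hypothetical inputs sharing a "key" field.
input1 = LOAD 'hdfs://localhost:9000/pig_data/input1.txt'
         USING PigStorage(',') AS (key:int, value1:chararray);
input2 = LOAD 'hdfs://localhost:9000/pig_data/input2.txt'
         USING PigStorage(',') AS (key:int, value2:chararray);
input3 = LOAD 'hdfs://localhost:9000/pig_data/input3.txt'
         USING PigStorage(',') AS (key:int, value3:chararray);

-- One grouping step over all three relations. Keys missing from a
-- relation get an empty bag for it, which gives the outer-join behavior.
grouped = COGROUP input1 BY key, input2 BY key, input3 BY key;

-- Keep only keys present in input1, mirroring the left side of a left join.
left_rows = FILTER grouped BY NOT IsEmpty(input1);

DESCRIBE left_rows;
```

Flattening an empty bag produces no rows, so turning this nested result into flat left-join output needs extra handling for the empty input2 and input3 bags.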
Conclusion – Pig Commands
Pig is a procedural language, generally used by data scientists for ad-hoc processing and quick prototyping. It’s a great ETL and big data processing tool. Pig scripts can be invoked from other languages and vice versa, so Pig can be used to build larger and more complex applications.
This has been a guide to Pig commands. Here we have discussed basic, intermediate, and advanced Pig commands.