APACHE PIG
- Apache Pig is a tool used to analyze large amounts of data by representing them as data flows.
- Using the Pig Latin scripting language, operations like ETL (Extract, Transform and Load), ad-hoc data analysis and iterative processing can be easily achieved.
- Pig is an abstraction over MapReduce. In other words, all Pig scripts internally are converted into Map and Reduce tasks to get the task done.
Dataset :
The dataset is a simple text file (movies_data.csv) that lists movie names and details like release year, rating and runtime.
A sample of the dataset is as follows:
1,The Nightmare Before Christmas,1993,3.9,4568
2,The Mummy,1932,3.5,4388
3,Orphans of the Storm,1921,3.2,9062
4,The Object of Beauty,1991,2.8,6150
5,Night Tide,1963,2.8,5126
6,One Magic Christmas,1985,3.8,5333
7,Muriel's Wedding,1994,3.5,6323
8,Mother's Boys,1994,3.4,5733
9,Nosferatu: Original Version,1929,3.5,5651
10,Nick of Time,1995,3.4,5333
Pig can be started in one of the following two modes:
- Local Mode (in local mode, Pig can access files on the local file system)
- Cluster Mode (in cluster mode, Pig can access files on HDFS)
To start in Local Mode:
$ pig -x local
To start in Cluster Mode:
$ pig
This command presents you with a grunt shell. The grunt shell allows you
to execute PigLatin statements to quickly test out data flows on your
data step by step without having to execute complete scripts.
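Once a data flow has been tested in the grunt shell, the same statements can be saved into a script file and run as a batch job. A minimal sketch (the script name movies.pig and the output path are hypothetical):

```pig
-- movies.pig (hypothetical script name)
-- Load the dataset, filter it, and store the result in one batch run
movies = LOAD 'movies_data.csv' USING PigStorage(',')
         AS (id:int, name:chararray, year:int, rating:double, duration:int);
highly_rated = FILTER movies BY rating > 4.0;
STORE highly_rated INTO 'highly_rated_output';
```

To run it in local mode: $ pig -x local movies.pig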
Pig Latin Program :
To LOAD the data :
grunt> movies = LOAD 'movies_data.csv' USING PigStorage(',') AS (id,name,year,rating,duration);
Note: When this statement is executed, no MapReduce task is executed.
grunt> DUMP movies;
- It is only after the DUMP statement that a MapReduce job is initiated.
- The DUMP command is only used to display the results on standard output.
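Because Pig only builds a logical plan until output is requested, the EXPLAIN command can be used to inspect the logical, physical and MapReduce plans for an alias without actually launching a job:

```pig
-- Show the execution plans for the movies alias; no MapReduce job is run
EXPLAIN movies;
```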
List the movies that have a rating greater than 4 :
grunt> movies_greater_than_four = FILTER movies BY (float)rating>4.0;
grunt> DUMP movies_greater_than_four;
To STORE the data to a file :
grunt> STORE movies_greater_than_four INTO '/user/hduser/movies_greater_than_four';
To include the data type of the columns :
grunt> movies = LOAD 'movies_data.csv' USING PigStorage(',') as
(id:int,name:chararray,year:int,rating:double,duration:int);
FILTER command :
grunt> movies_greater_than_four = FILTER movies BY rating>4.0;
List the movies that were released between 1950 and 1960 :
grunt> movies_between_50_60 = FILTER movies by year>1950 and year<1960;
List the movies that start with the alphabet 'A' :
grunt> movies_starting_with_A = FILTER movies by name matches 'A.*';
List the movies that have a duration greater than 2 hours (7200 seconds) :
grunt> movies_duration_2_hrs = FILTER movies by duration > 7200;
List the movies that have rating between 3 and 4 :
grunt> movies_rating_3_4 = FILTER movies BY rating>3.0 and rating<4.0;
DESCRIBE Command :
The schema of a relation/alias can be viewed using the DESCRIBE command:
grunt> DESCRIBE movies;
movies: {id: int,name: chararray,year: int,rating: double,duration: int}
ILLUSTRATE Command :
To view the step-by-step execution of a sequence of statements you can use the ILLUSTRATE command:
grunt> ILLUSTRATE movies_duration_2_hrs;
Note: DESCRIBE & ILLUSTRATE are really useful for debugging.
FOREACH : FOREACH gives a simple way to apply transformations based on columns.
List the movie names and their duration in minutes :
grunt> movie_duration = FOREACH movies GENERATE name, (double)duration/60.0;
The above statement generates a new alias that has the list of movies and their duration in minutes. (Casting duration to double before dividing avoids integer division, which would truncate the fractional minutes.)
You can check the results using the DUMP command.
GROUP : The GROUP keyword is used to group fields in a relation.
List the years and the number of movies released each year.
grunt> grouped_by_year = GROUP movies BY year;
grunt> count_by_year = FOREACH grouped_by_year GENERATE group, COUNT(movies);
Total number of movies in the dataset is 49590.
To check whether our GROUP operation is correct, we can verify the total of the COUNT field.
grunt> group_all = GROUP count_by_year ALL;
grunt> sum_all = FOREACH group_all GENERATE SUM(count_by_year.$1);
grunt> DUMP sum_all;
From the above three statements, the first statement, GROUP ALL, groups all the tuples to one group. This is very useful when we need to perform aggregation operations on the entire set.
The next statement performs a FOREACH on the grouped relation group_all and applies the SUM function to the field in position 1 (positions start from 0).
Here, the field in position 1 holds the counts of movies for each year.
(49590)
The above value matches our known fact that the dataset has 49590 movies, so we can conclude that our GROUP operation worked successfully.
ORDER BY : Let us query the data to illustrate the ORDER BY operation.
List all the movies in the ascending order of year.
grunt> asc_movies_by_year = ORDER movies BY year ASC;
grunt> DUMP asc_movies_by_year;
List all the movies in the descending order of year :
grunt> desc_movies_by_year = ORDER movies BY year DESC;
grunt> DUMP desc_movies_by_year;
DISTINCT : The DISTINCT statement is used to remove duplicated records.
It works only on entire records, not on individual fields.
grunt> movies_with_dups = LOAD 'movies_with_duplicates.csv' USING PigStorage(',') as (id:int,name:chararray,year:int,rating:double,duration:int);
grunt> DUMP movies_with_dups;
You can see that there are duplicates in this data set.
grunt> no_dups = DISTINCT movies_with_dups;
grunt> DUMP no_dups;
LIMIT : Use the LIMIT keyword to get only a limited number of results from a relation.
grunt> top_10_movies = LIMIT movies 10;
grunt> DUMP top_10_movies;
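Note that LIMIT by itself returns an arbitrary subset of tuples, since no ordering is guaranteed. To get a meaningful "top 10", combine ORDER BY with LIMIT (the alias names below are only illustrative):

```pig
-- Ten highest-rated movies: sort first, then limit
highest_rated = ORDER movies BY rating DESC;
top_10_by_rating = LIMIT highest_rated 10;
DUMP top_10_by_rating;
```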
SAMPLE : Use the SAMPLE keyword to get a sample set from your data.
grunt> sample_10_percent = SAMPLE movies 0.1;
grunt> DUMP sample_10_percent;
Here, 0.1 = 10%
As we already know, the file has 49590 records.
We can check the count of records in the relation.
grunt> sample_group_all = GROUP sample_10_percent ALL;
grunt> sample_count = FOREACH sample_group_all GENERATE COUNT(sample_10_percent.$0);
grunt> dump sample_count;
The output is (4937), which is approximately 10% of 49590.
Complex Types :
Pig supports three different complex types to handle data.
Tuples : A tuple is just like a row in a table.
(49539,'The Magic Crystal',2013,3.7,4561)
The above tuple has five fields. A tuple is surrounded by parentheses.
Bags : A bag is an unordered collection of tuples.
{ (49382, 'Final Offer'), (49385, 'Delete') }
The above bag has two tuples. Each tuple has two fields, id and movie name.
Maps : A map is a set of key-value pairs. The key and value are joined together using #.
['name'#'The Magic Crystal', 'year'#2013]
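As a sketch of how a map can be used in practice, the # operator also looks up a value by key inside a FOREACH (the file name movies_map.txt is hypothetical):

```pig
-- Load one map per line and look up values by key with #
details = LOAD 'movies_map.txt' AS (m:map[chararray]);
names = FOREACH details GENERATE m#'name' AS movie_name;
DUMP names;
```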