Friday, March 25, 2016

Apache FLUME

What is Flume?

Apache Flume is a data ingestion tool (service) for collecting, aggregating, and transporting large amounts of streaming data, such as log files and events, from various sources to a centralized data store.

Advantages of Flume

Here are the advantages of using Flume −

Using Apache Flume, we can store data in any of the centralized stores (for example HBase or HDFS).
When the rate of incoming data exceeds the rate at which data can be written to the destination, Flume acts as a mediator between data producers and the centralized stores and provides a steady flow of data between them.
Flume provides the feature of contextual routing.
The transactions in Flume are channel-based where two transactions (one sender and one receiver) are maintained for each message. It guarantees reliable message delivery.
Flume is reliable, fault tolerant, scalable, manageable, and customizable.

Apache Flume - Architecture

Flume Event

An event is the basic unit of data transported inside Flume. It contains a byte-array payload that is to be transported from the source to the destination, accompanied by optional headers.

A typical Flume event would have the following structure −
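As a rough illustration of the headers-plus-payload layout described above, the sketch below models an event in Python. The class name FlumeEvent and the example header keys are hypothetical and chosen only for this illustration; Flume itself is a Java system and this is not its API.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class FlumeEvent:
    """Illustrative model of an event: optional string headers plus a byte-array body."""
    body: bytes
    headers: Dict[str, str] = field(default_factory=dict)


# Hypothetical example: one log line wrapped as an event.
event = FlumeEvent(
    body=b"127.0.0.1 - - GET /index.html 200",
    headers={"host": "web01", "timestamp": "1458864000000"},
)
print(event.headers["host"], len(event.body))
```

The headers are optional, so an event built with only a body simply carries an empty header map.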

Flume Agent
An agent is an independent daemon process (JVM) in Flume. It receives data (events) from clients or other agents and forwards it to its next destination (a sink or another agent). A Flume deployment may have more than one agent. The following diagram represents a Flume agent.

A Flume Agent contains three main components namely, source, channel, & sink.

Source
A source is the component of an agent which receives data from the data generators and transfers it to one or more channels in the form of Flume events.
Example − Avro source, Thrift source, Twitter 1% source etc.
Channel
A channel is a transient store which receives the events from the source and buffers them till they are consumed by sinks. It acts as a bridge between the sources and the sinks.
Example − JDBC channel, File channel, Memory channel, etc.
Sink
A sink stores the data into centralized stores like HBase and HDFS. It consumes the data (events) from the channels and delivers it to the destination.
Example − HDFS sink
Note :- A flume agent can have multiple sources, sinks and channels.

Additional Components of Flume Agent
A few more components that play a vital role in transferring the events from the data generator to the centralized stores.

Interceptors
Interceptors are used to alter/inspect flume events which are transferred between source and channel.
Channel Selectors
These determine which channel is chosen to transfer the data when there are multiple channels. There are two types of channel selectors −
Default channel selectors − Also known as replicating channel selectors, these replicate every event to each channel.
Multiplexing channel selectors − These decide which channel to send an event to based on a value in the header of that event.
Sink Processors
These are used to invoke a particular sink from the selected group of sinks. These are used to create fail over paths for your sinks or load balance events across multiple sinks from a channel.

Multi-hop Flow

Within Flume, there can be multiple agents and before reaching the final destination, an event may travel through more than one agent. This is known as multi-hop flow.

Fan-out Flow

The data flow from one source to multiple channels is known as fan-out flow. It is of two types −

Replicating − The data flow where the data will be replicated in all the configured channels.
Multiplexing − The data flow where the data will be sent to a selected channel which is mentioned in the header of the event.
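The two fan-out modes can be sketched in a few lines of Python. This is an illustrative model, not Flume code: the function names, the "type" header key, and the channel names are assumptions made for the example.

```python
from typing import Dict, List


def replicating_select(headers: Dict[str, str], channels: List[str]) -> List[str]:
    """Replicating: every event goes to all configured channels."""
    return list(channels)


def multiplexing_select(headers: Dict[str, str], header_key: str,
                        mapping: Dict[str, List[str]],
                        default: List[str]) -> List[str]:
    """Multiplexing: pick the channels from the value of one header."""
    return mapping.get(headers.get(header_key, ""), default)


channels = ["c1", "c2"]
mapping = {"error": ["c1"], "access": ["c2"]}  # hypothetical routing table

print(replicating_select({"type": "error"}, channels))
print(multiplexing_select({"type": "access"}, "type", mapping, default=["c1"]))
print(multiplexing_select({}, "type", mapping, default=["c1"]))
```

In the last call the event has no "type" header, so the sketch falls back to the default channel list.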

Fan-in Flow

The data flow in which the data will be transferred from many sources to one channel is known as fan-in flow.

Flume example using netcat(source) and logger(sink):

# START example.conf file : A single-node Flume configuration
# Name the components on this AGENT
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# Configure the SOURCE
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Use a CHANNEL which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Configure the SINK
a1.sinks.k1.type = logger
# Bind the SOURCE and SINK to the CHANNEL
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
# END example.conf file

########## RUNNING FLUME AGENT ##########
# flume-ng agent --conf conf --conf-file example.conf --name a1
######## RUNNING DATA GENERATOR #########
# $ telnet localhost 44444
# Hello World!

Sunday, March 20, 2016

Sqoop Vs Flume

Apache Sqoop and Apache Flume work with different kinds of data sources.
Apache Flume works well with streaming data sources that are generated continuously in a Hadoop environment, such as log files from multiple servers, whereas
Apache Sqoop is designed to work with any relational database system that has JDBC connectivity. Sqoop can also import data from NoSQL databases like MongoDB or Cassandra, and it allows direct data transfer to Hive or HDFS. When transferring data to Hive with Sqoop, a table has to be created whose schema is taken from the source database itself.
In Apache Flume, data loading is event driven, whereas in
Apache Sqoop, data loading is not driven by events.
Apache Flume is a better choice when moving bulk streaming data from sources like JMS or a spooling directory, whereas
Apache Sqoop is an ideal fit when the data sits in a database such as Teradata, Oracle, MySQL, Postgres, or any other JDBC-compatible database.
In Apache Flume, data flows to HDFS through multiple channels, whereas in
Apache Sqoop, HDFS is the destination for imported data.
Apache Flume agents are designed to fetch streaming data, like tweets from Twitter or log files from a web server, whereas
Apache Sqoop connectors are designed to work only with structured data sources and fetch data from them.
Apache Flume has an agent-based architecture, i.e. the code written in Flume is known as an agent, which is responsible for fetching data, whereas in
Apache Sqoop, the architecture is based on connectors. The connectors in Sqoop know how to connect to the various data sources and fetch data accordingly.
Apache Sqoop is mainly used for parallel data transfers and data imports, as it copies data quickly, whereas
Apache Flume is used for collecting and aggregating data because of its distributed, reliable nature and highly available backup routes.

Thursday, March 17, 2016

APACHE HIVE

What is Hive ?

Hive is SQL for Hadoop cluster.
It is an open source data warehouse system on top of HDFS that adds structure to the data.
It provides SQL like interface which is known as "Hive Query Language (HQL)".
We write the query in HQL, which is translated into MapReduce code and run on the cluster.

The main components of Hive are:

Metastore: It stores all the metadata of Hive, that is, data about the databases, tables, columns, etc.
Driver: It includes the compiler, optimizer and executor used to break down Hive query language statements.
Query compiler: It compiles HiveQL into a DAG of MapReduce tasks.
Execution engine: It executes the tasks produced by the compiler.
Thrift server: It provides an interface for connecting Hive with other applications like MySQL, Oracle, Excel, etc. through JDBC/ODBC drivers.
Command line interface: Also called the Hive shell. It is used for working with data either interactively or in batch mode.
Web interface: It is a visual interface on top of Hive used for interacting with data.
SerDe: The Serializer/Deserializer gives instructions to Hive on how to process a record.

Data Storage in Hive:
Hive has different forms of storage options and they include:

Metastore: The Metastore keeps track of all the metadata of databases, tables, columns, datatypes, etc. in Hive. It also keeps track of the HDFS mapping. The default Metastore is a Derby database.
Tables: There can be 2 types of tables in Hive.
First, normal tables (managed/internal tables) like any other table in database.
Second, external tables (un-managed tables) which are like normal tables except for the deletion part. HDFS mappings are used to create external tables which are pointers to table in HDFS.
The difference between the two types of tables is that when the external table is deleted its data is not deleted. Its data is stored in the HDFS whereas in case of normal table the data also gets deleted on deleting the table.
Partitions: A partition is a slice of a table that is stored in its own subdirectory within the table’s directory. Partitioning enhances query performance, especially for SELECT statements with a “WHERE” clause on the partition column.
Buckets: Buckets are hashed partitions and they speed up joins and sampling of data.

Hive vs. RDBMS (Relational database)
Hive and RDBMS are very similar but they have different applications and different schemas that they are based on.

RDBMS are built for OLTP (online transaction processing), that is, real-time reads and writes in the database. They also perform a small part of OLAP (online analytical processing).
Hive is built for OLAP, that is, analytical reporting on data. Hive traditionally does not support inserting into an existing table or updating table data like an RDBMS, which is an important part of the OLTP process (newer Hive releases add limited support for this).
Data is typically either inserted into a new table or overwritten in an existing table.
RDBMS is based on schema on write, which means that when data is entered into a table it is checked against the table's schema to ensure that it meets the requirements. Thus loading data into an RDBMS is slower, but reading is very fast.
Hive is based on schema on read, which means data is not checked when it is loaded, so data loading is fast but reading is slower.
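The trade-off between checking data at load time and checking it at query time can be sketched in Python. This is a conceptual illustration only: the two-column schema and the helper names are hypothetical, and neither function reflects how an RDBMS or Hive is actually implemented.

```python
from typing import List, Optional, Tuple

SCHEMA = (int, str)  # hypothetical table: (id INT, name STRING)


def load_schema_on_write(table: List[Tuple], raw_line: str) -> None:
    """Validate and convert at load time; bad rows are rejected immediately."""
    fields = raw_line.split(",")
    if len(fields) != len(SCHEMA):
        raise ValueError(f"expected {len(SCHEMA)} fields: {raw_line!r}")
    table.append(tuple(cast(value) for cast, value in zip(SCHEMA, fields)))


def load_schema_on_read(table: List[str], raw_line: str) -> None:
    """Store the raw line untouched; loading is just an append."""
    table.append(raw_line)


def read_schema_on_read(table: List[str]) -> List[Tuple[Optional[int], Optional[str]]]:
    """Apply the schema at query time; values that do not fit become None."""
    rows = []
    for line in table:
        fields = line.split(",")
        fields += [None] * (len(SCHEMA) - len(fields))  # pad short rows
        try:
            row_id = int(fields[0])
        except (TypeError, ValueError):
            row_id = None
        rows.append((row_id, fields[1]))
    return rows


raw = []
load_schema_on_read(raw, "1,The Mummy")
load_schema_on_read(raw, "oops,Night Tide")  # accepted at load time
print(read_schema_on_read(raw))
```

The malformed id is only noticed when the data is read, where it surfaces as a missing value. The same line passed to load_schema_on_write would be rejected up front.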

Hive Query Language (HQL)
HQL is very similar to the SQL of a traditional database. Data is stored in tables, where each table consists of columns.

Data Definition statements (DDL) like create table, alter table, drop table are supported.
All these DDL statements can be used on Database, tables, partitions, views, functions, Index, etc.
Data Manipulation statements (DML) like load, insert, select and explain are supported.
Load is used for taking data from HDFS and moving it into Hive.
Insert is used for moving data from one Hive table to another.
Select is used for querying data. Explain shows the execution plan of a query.

Hive Commands :

Data Definition Language (DDL) :

Example : CREATE, DROP, TRUNCATE, ALTER, SHOW, DESCRIBE Statements.

Go to the Hive shell by giving the command sudo hive, then
enter the command ‘create database’ followed by a database name to create a new database in Hive.

Create Hive database using Hive Commands

To list out the databases in Hive warehouse, enter the command ‘show databases’.

The database is created in the default location of the Hive warehouse.
In Cloudera, Hive databases are stored under /user/hive/warehouse.

The command to use a database is USE, followed by the database name.

Copy the input data from the local file system to HDFS by using the copyFromLocal command.

Data Manipulation Language (DML) : Retrieving Information

Function	MySQL	Hive
Retrieving Information (General)	`SELECT from_columns FROM table WHERE conditions;`	`SELECT from_columns FROM table WHERE conditions;`
Retrieving All Values	`SELECT * FROM table;`	`SELECT * FROM table;`
Retrieving Some Values	`SELECT * FROM table WHERE rec_name = "value";`	`SELECT * FROM table WHERE rec_name = "value";`
Retrieving With Multiple Criteria	`SELECT * FROM TABLE WHERE rec1 = "value1" AND rec2 = "value2";`	`SELECT * FROM TABLE WHERE rec1 = "value1" AND rec2 = "value2";`
Retrieving Specific Columns	`SELECT column_name FROM table;`	`SELECT column_name FROM table;`
Retrieving Unique Output	`SELECT DISTINCT column_name FROM table;`	`SELECT DISTINCT column_name FROM table;`
Sorting	`SELECT col1, col2 FROM table ORDER BY col2;`	`SELECT col1, col2 FROM table ORDER BY col2;`
Sorting Reverse	`SELECT col1, col2 FROM table ORDER BY col2 DESC;`	`SELECT col1, col2 FROM table ORDER BY col2 DESC;`
Counting Rows	`SELECT COUNT(*) FROM table;`	`SELECT COUNT(*) FROM table;`
Grouping With Counting	`SELECT owner, COUNT(*) FROM table GROUP BY owner;`	`SELECT owner, COUNT(*) FROM table GROUP BY owner;`
Maximum Value	`SELECT MAX(col_name) AS label FROM table;`	`SELECT MAX(col_name) AS label FROM table;`
Selecting from multiple tables (Join same table using alias w/”AS”)	`SELECT pet.name, comment FROM pet, event WHERE pet.name = event.name;`	`SELECT pet.name, comment FROM pet JOIN event ON (pet.name = event.name)`

Using Metadata :

Function	MySQL	Hive
Selecting a database	`USE database;`	`USE database;`
Listing databases	`SHOW DATABASES;`	`SHOW DATABASES;`
Listing tables in a database	`SHOW TABLES;`	`SHOW TABLES;`
Describing the format of a table	`DESCRIBE table;`	`DESCRIBE (FORMATTED|EXTENDED) table;`
Creating a database	`CREATE DATABASE db_name;`	`CREATE DATABASE db_name;`
Dropping a database	`DROP DATABASE db_name;`	`DROP DATABASE db_name (CASCADE);`

Current SQL Compatibility

Hive Command Line :

Function	Hive
Run Query	`hive -e 'select a.col from tab1 a'`
Run Query Silent Mode	`hive -S -e 'select a.col from tab1 a'`
Set Hive Config Variables	`hive -e 'select a.col from tab1 a' -hiveconf hive.root.logger=DEBUG,console`
Use Initialization Script	`hive -i initialize.sql`
Run Non-Interactive Script	`hive -f script.sql`

The .hiverc file :
What is .hiverc file?
It is a file that is executed when you launch the hive shell - making it an ideal place for adding any hive configuration/customization you want set, on start of the hive shell. This could be:
- Setting column headers to be visible in query results
- Making the current database name part of the hive prompt
- Adding any jars or files
- Registering UDFs

.hiverc file location
The file is loaded from the hive conf directory.
If the file does not exist, you can create it.
It needs to be deployed to every node from where you might launch the Hive shell.

Sample .hiverc
add jar /home/airawat/hadoop-lib/hive-contrib-0.10.0-cdh4.2.0.jar;
set hive.exec.mode.local.auto=true;
set hive.cli.print.header=true;
set hive.cli.print.current.db=true;
set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=30000000;

Sunday, March 13, 2016

Apache PIG


Apache Pig is a tool used to analyze large amounts of data by representing them as data flows.
Using the PigLatin scripting language, operations like ETL (Extract, Transform and Load), ad hoc data analysis and iterative processing can be easily achieved.
Pig is an abstraction over MapReduce. In other words, all Pig scripts internally are converted into Map and Reduce tasks to get the task done.

Dataset :

The dataset is a simple text file (movies_data.csv) that lists movie names and their details, like
release year, rating and runtime.

A sample of the dataset is as follows:
1,The Nightmare Before Christmas,1993,3.9,4568
2,The Mummy,1932,3.5,4388
3,Orphans of the Storm,1921,3.2,9062
4,The Object of Beauty,1991,2.8,6150
5,Night Tide,1963,2.8,5126
6,One Magic Christmas,1985,3.8,5333
7,Muriel's Wedding,1994,3.5,6323
8,Mother's Boys,1994,3.4,5733
9,Nosferatu: Original Version,1929,3.5,5651
10,Nick of Time,1995,3.4,5333

Pig can be started in one of the following two modes:

Local Mode (In local mode, pig can access files on the local file system. )
Cluster Mode (In cluster mode, pig can access files on HDFS.)

Restart your terminal and execute the pig command as follows:
To start in Local Mode:
$ pig -x local
To start in Cluster Mode:
$ pig

This command presents you with a grunt shell. The grunt shell allows you
to execute PigLatin statements to quickly test out data flows on your 
data step by step without having to execute complete scripts.

Pig Latin Program :

To LOAD the data :
grunt> movies = LOAD 'movies_data.csv' USING PigStorage(',') as (id,name,year,rating,duration);
Note: When this statement is executed, no MapReduce task is executed.
grunt> DUMP movies;
- It is only after the DUMP statement that a MapReduce job is initiated.
- The DUMP command is only used to display information onto the standard output.

List the movies that have a rating greater than 4 :
grunt> movies_greater_than_four = FILTER movies BY (float)rating>4.0;
grunt> DUMP movies_greater_than_four;

To STORE the data to a file :
grunt> STORE movies_greater_than_four INTO '/user/hduser/movies_greater_than_four';

To include the data type of the columns :
grunt> movies = LOAD 'movies_data.csv' USING PigStorage(',') as
(id:int,name:chararray,year:int,rating:double,duration:int);

FILTER command :
grunt> movies_greater_than_four = FILTER movies BY rating>4.0;

List the movies that were released between 1950 and 1960 :
grunt> movies_between_50_60 = FILTER movies by year>1950 and year<1960;

List the movies that start with the letter A :
grunt> movies_starting_with_A = FILTER movies by name matches 'A.*';

List the movies that have duration greater than 2 hours :
grunt> movies_duration_2_hrs = FILTER movies by duration > 7200;

List the movies that have rating between 3 and 4 :
grunt> movies_rating_3_4 = FILTER movies BY rating>3.0 and rating<4.0;

DESCRIBE Command :

The schema of a relation/alias can be viewed using the DESCRIBE command:

grunt> DESCRIBE movies;
movies: {id: int,name: chararray,year: int,rating: double,duration: int}

ILLUSTRATE Command :

To view the step-by-step execution of a sequence of statements you can use the ILLUSTRATE command:

grunt> ILLUSTRATE movies_duration_2_hrs;

Note: DESCRIBE & ILLUSTRATE are really useful for debugging.

FOREACH : FOREACH gives a simple way to apply transformations based on columns.
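List the movie names and their durations in minutes :
grunt> movie_duration = FOREACH movies GENERATE name, (double)duration/60;
The above statement generates a new alias that has the list of movies and their durations in minutes. The cast is applied before the division because duration is an int, and dividing two ints would drop the fractional minutes.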
You can check the results using the DUMP command.

GROUP : The GROUP keyword is used to group fields in a relation.
List the years and the number of movies released each year.
grunt> grouped_by_year = group movies by year;
grunt> count_by_year = FOREACH grouped_by_year GENERATE group, COUNT(movies);

Total number of movies in the dataset is 49590.
To check that our GROUP operation is correct, we can verify the total of the COUNT field.

grunt> group_all = GROUP count_by_year ALL;
grunt> sum_all = FOREACH group_all GENERATE SUM(count_by_year.$1);
grunt> DUMP sum_all;

From the above three statements, the first statement, GROUP ALL, groups all the tuples to one group. This is very useful when we need to perform aggregation operations on the entire set.

The next statement performs a FOREACH on the grouped relation group_all and applies the SUM function to the field in position 1 (positions start from 0).
Here, the field in position 1 holds the count of movies for each year.
(49590)
The above value matches our known fact that the dataset has 49590 movies.
So we can conclude that our GROUP operation worked successfully.
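The same group, count, then sum check can be mirrored in plain Python on the ten sample rows listed near the top of this post. This is only an illustration of the logic; it does not use Pig, and the variable names are chosen to echo the aliases above.

```python
from collections import Counter

# The ten sample rows from the dataset: (id, name, year, rating, duration)
movies = [
    (1, "The Nightmare Before Christmas", 1993, 3.9, 4568),
    (2, "The Mummy", 1932, 3.5, 4388),
    (3, "Orphans of the Storm", 1921, 3.2, 9062),
    (4, "The Object of Beauty", 1991, 2.8, 6150),
    (5, "Night Tide", 1963, 2.8, 5126),
    (6, "One Magic Christmas", 1985, 3.8, 5333),
    (7, "Muriel's Wedding", 1994, 3.5, 6323),
    (8, "Mother's Boys", 1994, 3.4, 5733),
    (9, "Nosferatu: Original Version", 1929, 3.5, 5651),
    (10, "Nick of Time", 1995, 3.4, 5333),
]

# Equivalent of GROUP movies BY year followed by COUNT(movies)
count_by_year = Counter(year for _, _, year, _, _ in movies)

# Equivalent of GROUP ... ALL followed by SUM over the counts
sum_all = sum(count_by_year.values())

print(sorted(count_by_year.items()))
print(sum_all == len(movies))
```

If the per-year counts add up to the number of input rows, no row was lost or double counted by the grouping.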

ORDER BY : Let us question the data to illustrate the ORDER BY operation.
List all the movies in the ascending order of year :
grunt> asc_movies_by_year = ORDER movies BY year ASC;
grunt> DUMP asc_movies_by_year;

List all the movies in the descending order of year :
grunt> desc_movies_by_year = ORDER movies BY year DESC;
grunt> DUMP desc_movies_by_year;

DISTINCT : The DISTINCT statement is used to remove duplicated records.

It works only on entire records, not on individual fields.

grunt> movies_with_dups = LOAD 'movies_with_duplicates.csv' USING PigStorage(',') as (id:int,name:chararray,year:int,rating:double,duration:int);
grunt> DUMP movies_with_dups;

You see that there are duplicates in this data set.

List the distinct records present in movies_with_dups :
grunt> no_dups = DISTINCT movies_with_dups;
grunt> DUMP no_dups;

LIMIT : Use the LIMIT keyword to get only a limited number of results from a relation.

grunt> top_10_movies = LIMIT movies 10;
grunt> DUMP top_10_movies;

SAMPLE : Use the sample keyword to get sample set from your data.

grunt> sample_10_percent = sample movies 0.1;
grunt> dump sample_10_percent;

Here, 0.1 = 10%

We already know that the file has 49590 records.
We can check to see the count of records in the relation.

grunt> sample_group_all = GROUP sample_10_percent ALL;
grunt> sample_count = FOREACH sample_group_all GENERATE COUNT(sample_10_percent.$0);
grunt> dump sample_count;
The output is (4937), which is approximately 10% of 49590.
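The count is only approximately 10% because sampling of this kind is probabilistic. A minimal Python sketch of the idea follows, under the assumption that each record is kept independently with the given probability; the function name and the seed are hypothetical and exist only to make the example repeatable.

```python
import random


def sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if fraction > rng.random()]


rows = list(range(49590))            # stand-in for the 49590 movie records
picked = sample(rows, 0.1, seed=42)  # seed only makes the run repeatable
print(len(picked))
```

Running it with different seeds gives slightly different counts, all close to one tenth of the input size.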

Complex Types :
Pig supports three different complex types to handle data.
Tuples : A tuple is just like a row in a table.
(49539,'The Magic Crystal',2013,3.7,4561)
The above tuple has five fields. A tuple is surrounded by parentheses.
Bags : A bag is an unordered collection of tuples.
{ (49382, 'Final Offer'), (49385, 'Delete') }
The above bag has two tuples. Each tuple has two fields, id and movie name.
Maps : A map is a key/value store. The key and value are joined together using #.
['name'#'The Magic Crystal', 'year'#2013]
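For readers more familiar with Python, the three complex types have rough analogues in built-in Python types. This is only an analogy, not Pig code, and it is imperfect: a Python list is ordered whereas a Pig bag is not.

```python
# Tuple: one row with five fields
movie_tuple = (49539, "The Magic Crystal", 2013, 3.7, 4561)

# Bag: a collection of tuples (a list is used here only for convenience)
movie_bag = [(49382, "Final Offer"), (49385, "Delete")]

# Map: keys associated with values
movie_map = {"name": "The Magic Crystal", "year": 2013}

print(len(movie_tuple), len(movie_bag), movie_map["name"])
```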

Sunday, March 6, 2016

ELK

LOGSTASH :

An agent which normally runs on each server you wish to harvest logs from.
Its job is to read the logs (e.g. from the filesystem), normalise them (e.g. common timestamp format), optionally extract structured data from them (e.g. session IDs, resource paths, etc.) and finally push them into elasticsearch.

ELASTICSEARCH :

Elasticsearch is a search engine with a focus on real-time search and analysis of the data it holds.
It is document-oriented and you can store everything you want as JSON. This makes it powerful, simple and flexible.
It is built on top of Apache Lucene, and by default it runs on port 9200, incremented by 1 for each additional node on the same machine.
PLUGIN :
Note :- Install the following plugins by executing the following commands to get GUIs for ES.
.\bin\plugin install mobz/elasticsearch-head
.\bin\plugin install lukas-vlcek/bigdesk
.\bin\plugin install royrusso/elasticsearch-HQ
.\bin\plugin install lmenezes/elasticsearch-kopf
Hit http://localhost:9200/_plugin/head/ to see Elastic GUI.

Hit http://localhost:9200/_plugin/bigdesk/ to see Elasticsearch Health.

KIBANA :

A browser-based interface served up from a web server.
Its job is to allow you to build tabular and graphical visualizations of the log data based on Elasticsearch queries. Typically these are based on simple text queries, time ranges or even far more complex aggregations.
A server would get started and you could see the GUI at http://localhost:5601/

SHIPPERS:

Filebeat is for shipping log files to Logstash.
Packetbeat is for analyzing your network data.
Topbeat is for getting infrastructure information such as cpu and memory usage.
Winlogbeat is for shipping windows event logs.

Service manager:
NSSM: https://nssm.cc/release/nssm-2.24.zip

LOGSTASH CONF FILE:

input {
     file {
         type => "apache-access"
         path => "D:/access.log"
     }
     file {
         type => "apache-error"
         path => "D:/error.log"
     }
}
output {
     # Emit events to stdout for easy debugging of what is going through
     # logstash.
     stdout { }

     # This elasticsearch output will try to autodiscover a near-by
     # elasticsearch cluster using multicast discovery.
     # If multicast doesn't work, you'll need to set a 'host' setting.
     elasticsearch { }
}

SAMPLE ERROR LOG FILE :

[Fri Dec 16 01:46:23 2005] [error] [client 1.2.3.4] Directory index forbidden by rule: /home/test/
[Fri Dec 16 01:54:34 2005] [error] [client 1.2.3.4] Directory index forbidden by rule: /apache/web-data/test2
[Fri Dec 16 02:25:55 2005] [error] [client 1.2.3.4] Client sent malformed Host header
[Mon Dec 19 23:02:01 2005] [error] [client 1.2.3.4] user test: authentication failure for "/~dcid/test1": Password Mismatch
[Sat Aug 12 04:05:51 2006] [notice] Apache/1.3.11 (Unix) mod_perl/1.21 configured -- resuming normal operations
[Thu Jun 22 14:20:55 2006] [notice] Digest: generating secret for digest authentication ...
[Thu Jun 22 14:20:55 2006] [notice] Digest: done
[Thu Jun 22 14:20:55 2006] [notice] Apache/2.0.46 (Red Hat) DAV/2 configured -- resuming normal operations
[Sat Aug 12 04:05:49 2006] [notice] SIGHUP received. Attempting to restart
[Sat Aug 12 04:05:51 2006] [notice] suEXEC mechanism enabled (wrapper: /usr/local/apache/sbin/suexec)
[Sat Jun 24 09:06:22 2006] [warn] pid file /opt/CA/BrightStorARCserve/httpd/logs/httpd.pid overwritten -- Unclean shutdown of previous Apache run?
[Sat Jun 24 09:06:23 2006] [notice] Apache/2.0.46 (Red Hat) DAV/2 configured -- resuming normal operations
[Sat Jun 24 09:06:22 2006] [notice] Digest: generating secret for digest authentication ...
[Sat Jun 24 09:06:22 2006] [notice] Digest: done
[Thu Jun 22 11:35:48 2006] [notice] caught SIGTERM, shutting down
[Tue Mar 08 10:34:21 2005] [error] (11)Resource temporarily unavailable: fork: Unable to fork new process
[Tue Mar 08 10:34:31 2005] [error] (11)Resource temporarily unavailable: fork: Unable to fork new process

SAMPLE ACCESS LOG FILE :

192.168.2.20 - - [28/Jul/2006:10:27:10 -0300] "GET /cgi-bin/try/ HTTP/1.0" 200 3395
127.0.0.1 - - [28/Jul/2006:10:22:04 -0300] "GET / HTTP/1.0" 200 2216
127.0.0.1 - - [28/Jul/2006:10:27:32 -0300] "GET /hidden/ HTTP/1.0" 404 7218
x.x.x.90 - - [13/Sep/2006:07:01:53 -0700] "PROPFIND /svn/[xxxx]/Extranet/branches/SOW-101 HTTP/1.1" 401 587
x.x.x.90 - - [13/Sep/2006:07:01:51 -0700] "PROPFIND /svn/[xxxx]/[xxxx]/trunk HTTP/1.1" 401 587
x.x.x.90 - - [13/Sep/2006:07:00:53 -0700] "PROPFIND /svn/[xxxx]/[xxxx]/2.5 HTTP/1.1" 401 587
x.x.x.90 - - [13/Sep/2006:07:00:53 -0700] "PROPFIND /svn/[xxxx]/Extranet/branches/SOW-101 HTTP/1.1" 401 587
x.x.x.90 - - [13/Sep/2006:07:00:21 -0700] "PROPFIND /svn/[xxxx]/[xxxx]/trunk HTTP/1.1" 401 587
x.x.x.90 - - [13/Sep/2006:06:59:53 -0700] "PROPFIND /svn/[xxxx]/[xxxx]/2.5 HTTP/1.1" 401 587
x.x.x.90 - - [13/Sep/2006:06:59:50 -0700] "PROPFIND /svn/[xxxx]/[xxxx]/trunk HTTP/1.1" 401 587
x.x.x.90 - - [13/Sep/2006:06:58:52 -0700] "PROPFIND /svn/[xxxx]/[xxxx]/trunk HTTP/1.1" 401 587
x.x.x.90 - - [13/Sep/2006:06:58:52 -0700] "PROPFIND /svn/[xxxx]/Extranet/branches/SOW-101 HTTP/1.1" 401 587
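To make the "normalise" and "extract structured data" steps described for Logstash more concrete, here is a small Python sketch that pulls fields out of one of the access log lines above with a regular expression. It is not Logstash configuration and not how Logstash is implemented; the pattern and the field names are assumptions chosen for illustration.

```python
import re
from datetime import datetime

# Hypothetical pattern for the access log format shown above.
LINE_RE = re.compile(
    r'(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) ([^"]+)" (\d{3}) (\d+)'
)
FIELDS = ("client", "time", "method", "path", "protocol", "status", "size")


def parse_access_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    fields = dict(zip(FIELDS, match.groups()))
    fields["status"] = int(fields["status"])
    fields["size"] = int(fields["size"])
    # Normalise the timestamp to ISO 8601.
    fields["time"] = datetime.strptime(
        fields["time"], "%d/%b/%Y:%H:%M:%S %z"
    ).isoformat()
    return fields


line = '127.0.0.1 - - [28/Jul/2006:10:27:32 -0300] "GET /hidden/ HTTP/1.0" 404 7218'
print(parse_access_line(line))
```

Lines that do not fit the pattern return None instead of raising, so a caller can decide whether to skip or log them.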

START ELASTICSEARCH:

.\bin\elasticsearch.bat
Checking Elasticsearch => http://localhost:9200/ =>> http://localhost:9200/_plugin/head/

START LOGSTASH AGENT:

.\bin\logstash agent -f logstash.conf

START KIBANA:

.\bin\kibana.bat

Checking Kibana=> http://localhost:5601

NOTE: Copy & Paste log files (access & error log) in (D:\) directory.

Wednesday, February 3, 2016

Hadoop / HDFS Commands

HDFS Command Syntax Overview:
hadoop fs -COMMAND [ARGS]
Ex.: hadoop fs -ls
hadoop version : check that hadoop is installed properly

HELP:
-help [cmd] : show help for a command

Inspect files:
-ls/-lsr PATH : list all files in PATH (e.g. hadoop fs -ls /)
-cat FILE : print FILE on stdout
-tail [-f] FILE : output the last part of FILE
-test -[ezd] PATH : test whether PATH exists, has zero length, or is a directory
-touchz FILE : create a new empty file of size 0
-du/-dus PATH : show space utilization of PATH
-count PATH : no. of directories, files, and bytes under PATH
-setrep [-R] REP PATH : change the replication factor of a file/directory
-stat PATH : info about the specified path

Create/remove files:
-mkdir : create a directory
-mv  : move (rename) files
-cp  : copy files
-rm/rmr : remove files

Copy/Put files from remote m/c into the HADOOP cluster:
-copyFromLocal  : copy a local file to the HDFS
-copyToLocal  : copy a file on the HDFS to the local disk

-cp : copies one or more files
-get : copies files to the local file system
-put : copies files from the local file system
-mv : moves one or more files

Hadoop Namenode Commands:
hadoop namenode -format: Format HDFS filesystem from Namenode
hadoop namenode -upgrade: Upgrade the NameNode
start-dfs.sh : Start HDFS daemons
stop-dfs.sh : Stop HDFS daemons
start-mapred.sh : Start MapReduce daemons
stop-mapred.sh : Stop MapReduce daemons
hadoop namenode -recover -force: Recover namenode metadata after a cluster failure (may lose data)

Hadoop Configuration Files:
core-site.xml : Parameters for entire Hadoop cluster
hdfs-site.xml : Parameters for HDFS and its clients
mapred-site.xml : Parameters for MapReduce and its clients

yarn-site.xml : Parameters for nodemanager and resource manager

masters : Host machines for secondary Namenode
slaves : List of slave hosts

hadoop-env.sh : Sets ENV variables for Hadoop
set JAVA_HOME=%JAVA_HOME% set HADOOP_PREFIX=D:\Hadoop

Hadoop Job Commands
hadoop job -submit  : Submit the job
hadoop job -status  : Print job status completion percentage
hadoop job -list all : List all jobs
hadoop job -list-active-trackers : List all available TaskTrackers
hadoop job -set-priority   : Set priority for a job. Valid priorities : VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW
hadoop job -kill-task  : Kill a task
hadoop job -history : Display job history including job details, failed and killed jobs

Hadoop mradmin Commands
hadoop mradmin -safemode get : Check Job tracker status
hadoop mradmin -refreshQueues : Reload mapreduce configuration
hadoop mradmin -refreshNodes : Reload active TaskTrackers
hadoop mradmin -refreshServiceAcl : Force Jobtracker to reload service ACL
hadoop mradmin -refreshUserToGroupsMappings : Force jobtracker to reload user group mappings

Hadoop fsck Commands
hadoop fsck / : Filesystem check on HDFS
hadoop fsck / -files : Display files during check
hadoop fsck / -files -blocks : Display files and blocks during check
hadoop fsck / -files -blocks -locations : Display files, blocks and their locations
hadoop fsck / -files -blocks -locations -racks : Display network topology for data-node locations
hadoop fsck / -delete : Delete corrupted files
hadoop fsck / -move : Move corrupted files to /lost+found directory

Hadoop Balancer Commands
start-balancer.sh : Balance the cluster
hadoop dfsadmin -setBalancerBandwidth  : Adjust bandwidth used by the balancer
hadoop balancer -threshold 20 : Balance until each DataNode's disk usage is within 20% of the cluster average

Hadoop Safe Mode (Maintenance Mode) Commands
The following dfsadmin commands help the cluster enter or leave safe mode, which is also called maintenance mode.
In this mode, the Namenode does not accept any changes to the name space, and it does not replicate or delete blocks.
hadoop dfsadmin -safemode enter : Enter safe mode
hadoop dfsadmin -safemode leave : Leave safe mode
hadoop dfsadmin -safemode get : Get the status of mode
hadoop dfsadmin -safemode wait : Wait until HDFS finishes data block replication
hadoop dfsadmin -report : total usage on the cluster

Launching Hadoop Jobs:
hadoop jar JAR [mainClass] args... : Launch a job via a jar file
hadoop jar JAR com.twitter.scalding.Tool [mainClass] args : Launch a Scalding job
mapred job -kill JOB_ID : Kill a map-reduce job

Commonly Used Administration Commands:
Format the namenode: hadoop namenode -format
Starting Secondary namenode: hadoop secondarynamenode
Run namenode : hadoop namenode
Run data node: hadoop datanode
Cluster Balancing: hadoop balancer
Run MapReduce job tracker node: hadoop jobtracker
Run MapReduce task tracker node: hadoop tasktracker

Start/Stop Yarn (starts resourcemanager and nodemanager)and DFS (Starts namenode and data node) from sbin directory:
start-yarn, stop-yarn
start-dfs, stop-dfs

Start and Stop ALL daemon from sbin directory:
start-all, stop-all

Check that all 5 daemons (NameNode, Secondary NameNode, JobTracker, DataNode, TaskTracker) are up using:
jps

Hadoop2x--Eclipse-plugin :
Download => https://github.com/winghc/hadoop2x-eclipse-plugin/tree/master/release

Monday, February 1, 2016

UBUNTU

Installing Java 8 on Ubuntu
$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java8-installer
Verify Installed Java Version :
$ java -version
Configuring Java Environment
Install this package using following command.
$ sudo apt-get install oracle-java8-set-default

Install "Guest Additions" on Oracle VirtualBox
- Select from the top menu: VirtualBox -> Devices -> Insert Guest Additions CD image
- In Ubuntu open a terminal, navigate to cd folder (usually /media/VBOXADDITIONS*) and run
$ sudo sh ./VBoxLinuxAdditions.run
$ sudo usermod -a -G vboxsf darren
(and replace ‘darren’ with your username)
* NOTE: the change will only be visible once you logout and login again!
https://www.virtualbox.org/manual/ch04.html

Expanding virtual harddisk to 20GB
VBoxManage.exe modifyhd "C:\Users\jini\VirtualBox VMs\Ubuntu14\Ubuntu14.vdi" --resize 20480

Set Environment Variable @ home directory
gedit .bashrc (at home directory) or gedit /etc/profile
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_PREFIX=/usr/local/hadoop
export PATH=$PATH:$HOME/bin:$JAVA_HOME/bin:$HADOOP_HOME/bin
export CLASSPATH=.:$JAVA_HOME/jre/lib:$JAVA_HOME/lib:$JAVA_HOME/lib/tools.jar

SSH Configuration
$ sudo apt-get install openssh-server
$ ssh-keygen -t rsa -P ""
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
Verify => $ ssh localhost / $ ssh -v localhost / $ sudo restart ssh

Open CL Terminal =>> Use Ctrl+Alt+t
Clear CL Screen => $ clear
View Env variable => $ echo $JAVA_HOME
Remove directory recursively as administrator => $ sudo rm -R [Directory name]
List contents => $ ls -al
Copy file to current directory => $ sudo cp /media/sf_mywork/jdk-8u71-linux-i586.tar.gz .
Untar compressed file => $ sudo tar -zxf jdk-8u71-linux-i586.tar.gz
Rename file => $ sudo mv jdk1.8.0_71 java
Change permission recursively => $ sudo chmod -R 777 /usr/local/hadoop

BIPIN GUPTA