#onenote# hadoop


1. Copy from Local
The copyFromLocal command uploads directories recursively by default.
hdfs dfs -copyFromLocal <src> <target>
2. Run job in testlab
$hadoop jar test-1.0-SNAPSHOT.jar com.forest.sandbox.bigdata.WordCount hdfs://wdctestlab0040-fe1.systems.uk.aaaa:8020/tmp/forest/README.md hdfs://wdctestlab0040-fe1.systems.uk.aaaa:8020/tmp/forest/README.md.wc


hadoop jar ch02-mr-intro-4.0.jar MaxTemperature /tmp/forest/input/defguide/ncdc/sample.txt /tmp/forest/output/defguide/ncdc/out_no_combiner

2. Copy file
hdfs dfs -cp /tmp/forest/MapNoReduce/_SUCCESS /tmp/forest/input/

hadoop distcp file1 file2

hadoop distcp dir1 dir2
If dir2 does not exist, it will be created, and the contents of the dir1 directory will be copied there.

hadoop distcp -update dir1 dir2
The -update option copies only the files that have changed.

distcp is implemented as a MapReduce job where the work of copying is done by maps that run in parallel across the cluster. There are no reducers. Each file is copied by a single map.

A very common use case for distcp is for transferring data between two HDFS clusters. For example, the following
creates a backup of the first cluster’s /foo directory on the second:

hadoop distcp -update -delete -p hdfs://namenode1/foo hdfs://namenode2/foo

If the two clusters are running incompatible versions of HDFS, then you can use the webhdfs protocol
to distcp between them:

hadoop distcp webhdfs://namenode1:50070/foo webhdfs://namenode2:50070/foo

3. Displaying a SequenceFile with the command-line interface
% hadoop fs -text numbers.seq | head
100 One, two, buckle my shoe
99 Three, four, shut the door

4. Another way to run a Hadoop job
export HADOOP_CLASSPATH=target/classes/
% hadoop ConfigurationPrinter
5. Remove a dir
hdfs dfs -rm -r -f /user/ss-nano-uat/widgets
6. hadoop fs -ls /user/ss-nano-uat
= hdfs dfs -ls /user/ss-nano-uat


How Hadoop runs a MapReduce job using the classic framework

For very large clusters in the region of 4000 nodes and higher, the MapReduce system begins to hit scalability bottlenecks


YARN separates these two roles into two independent daemons: a resource manager to manage the use of resources across the cluster, and an application master to manage the lifecycle of applications running on the cluster. The idea is that an application master negotiates with the resource manager for cluster resources, described in terms of a number of containers, each with a certain memory limit.


Furthermore, it is even possible for users to run different versions of MapReduce on the same YARN cluster


How Hadoop runs a MapReduce job using YARN


Resource Requests

YARN has a flexible model for making resource requests. A request for a set of containers can express the amount of compute resources required for each container (memory and CPU), as well as locality constraints for the containers in that request.


Application Lifespan

The lifespan of a YARN application can vary dramatically: from a short-lived application of a few seconds to a long-running application that runs for days or even months. Rather than look at how long the application runs for, it’s useful to categorize applications in terms of how they map to the jobs that users run. The simplest case is one application per user job, which is the approach that MapReduce takes.

The second model is to run one application per workflow or user session of (possibly unrelated) jobs. This approach can be more efficient than the first, since containers can be reused between jobs, and there is also the potential to cache intermediate data between jobs. Spark is an example that uses this model.

The third model is a long-running application that is shared by different users. Such an application often acts in some kind of coordination role. For example, Apache Slider has a long-running application master for launching other applications on the cluster. This approach is also used by Impala (see SQL-on-Hadoop Alternatives) to provide a proxy application that the Impala daemons communicate with to request cluster resources. The “always on” application master means that users have very low-latency responses to their queries, since the overhead of starting a new application master is avoided.[37]


Scheduler Options

Three schedulers are available in YARN: the FIFO, Capacity, and Fair Schedulers



Apache Oozie is a system for running workflows of dependent jobs. It is composed of two main parts: a workflow engine that stores and runs workflows composed of different types of Hadoop jobs (MapReduce, Pig, Hive, and so on), and a coordinator engine that runs workflow jobs based on predefined schedules and data availability. Oozie has been designed to scale, and it can manage the timely execution of thousands of workflows in a Hadoop cluster, each composed of possibly dozens of constituent jobs.

Oozie makes rerunning failed workflows more tractable, since no time is wasted running successful parts of a workflow. Anyone who has managed a complex batch system knows how difficult it can be to catch up from jobs missed due to downtime or failure, and will appreciate this feature


1 Introduction

Avro is a subproject of Hadoop and also a standalone Apache project: high-performance middleware based on binary data transfer. Other Hadoop projects, such as HBase and Hive, also use it for client/server data transfer. Avro is a data serialization system; it converts data structures or objects into a format convenient for storage or transmission. It was designed from the start for data-intensive applications, and suits large-scale storage and exchange of data, both remote and local.

2 Features

• Rich data structure types;
• A fast, compressible binary data format; binary serialization saves storage space and network transfer bandwidth;
• A container file for storing persistent data;
• Remote procedure call (RPC);
• Simple integration with dynamic languages.

Avro has implementations across programming languages (C, C++, C#, Java, Python, Ruby, PHP), similar to Thrift, but its distinguishing feature is that it depends on schemas and loads the schema for the data dynamically. Avro data is read and written frequently, and every such operation uses the schema, which reduces the overhead written into each data file and makes serialization fast and compact. This self-description of data plus schema also makes it convenient to use from dynamic scripting languages. When Avro data is stored in a file, its schema is stored with it, so any program can process the file later. If the schema used for reading differs from the schema used for writing, that is easily resolved, because both the reading and writing schemas are known.

Schema resolution with a new schema (reader vs. writer):

Added field

• Reader’s schema has the new field, writer’s does not: the reader uses the default value of the new field, since it is not written by the writer.
• Writer’s schema has the new field, reader’s does not: the reader does not know about the new field written by the writer, so it is ignored.

Removed field

• Reader’s schema lacks the removed field, writer’s still has it: the reader ignores the removed field (projection).
• Writer’s schema lacks the removed field, reader’s still has it: the removed field is not written by the writer. If the old schema had a default defined for the field, the reader uses this; otherwise, it gets an error. In this case, it is best to update the reader’s schema, either at the same time as or before the writer’s.



Pasted from <http://www.open-open.com/lib/view/open1369363962228.html>
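The resolution rules above can be sketched in a few lines of Python (a toy illustration only, not the Avro library’s API): the reader’s schema drives the output, defaults fill in fields the writer didn’t know about, and writer-only fields are projected away.

```python
# Toy illustration of Avro-style schema resolution (not the Avro library).
def resolve(record, reader_schema):
    """reader_schema: list of (field_name, default) pairs; default=None means
    'no default', which is an error if the writer did not supply the field."""
    out = {}
    for name, default in reader_schema:
        if name in record:
            out[name] = record[name]       # field written by the writer
        elif default is not None:
            out[name] = default            # added field: use reader's default
        else:
            raise ValueError("no value and no default for field: " + name)
    return out                             # writer-only fields are ignored

# Added field: old writer record, new reader schema with a default.
assert resolve({"id": 1}, [("id", None), ("counter", 0)]) == {"id": 1, "counter": 0}

# Removed field: the writer still writes it; a reader without it ignores it.
assert resolve({"id": 2, "legacy": "x"}, [("id", None)]) == {"id": 2}
```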




  • Just store the data from the source DB into Hive as an internal table (in dev and in prod; since we adopt encryption zones, the data won’t be dropped even if we drop the table)
  • Don’t use HCat; instead use the source metadata info for Pig and HBase


Sqoop 2 has a server component that runs jobs, as well as a range of clients: a command-line interface (CLI), a web UI, a REST API, and a Java API. Sqoop 2 also will be able to use alternative execution engines, such as Spark. Note that Sqoop 2’s CLI is not compatible with Sqoop 1’s CLI.


  • Generated Code

Sqoop can use generated code to handle the deserialization of table-specific data from the database source before writing it to HDFS.


If you’re working with records imported to SequenceFiles, it is inevitable that you’ll need to use the generated classes (to deserialize data from the SequenceFile storage). You can work with text-file-based records without using generated code, but as we’ll see in Working with Imported Data, Sqoop’s generated code can handle some tedious aspects of data processing for you.



  • Controlling the Import

– Specify a --query argument, e.g. “select xxx from xx where …”



  • Imports and Consistency

The best way to do this is to ensure that any processes that update existing rows of a table are disabled during the import.



  • Incremental Imports

Sqoop will import rows that have a column value (for the column specified with --check-column) that is greater than some specified value (set via --last-value).



  • Imported Data and Hive

sqoop create-hive-table --connect jdbc:mysql://localhost/hadoopguide --table widgets --fields-terminated-by ','


Sqoop Sample

sqoop import --connect jdbc:mysql://localhost/test --table widgets -m 1


: Access denied for user 'forest'@'localhost' (using password: YES)

mysql -u root mysql

mysql -uforest -pabc123


sqoop import --connect jdbc:mysql://localhost/test --table widgets -m 1 --username forest --password abc123

-- doesn’t work: the job runs on the cluster, which doesn’t know what “localhost” refers to


hdfs dfs -cat /user/ss-nano-uat/SOURCE_DATABASE/part-m-00000

hdfs dfs -ls /user/ss-nano-uat/SOURCE_DATABASE

sqoop import --connect jdbc:mysql://wdctestlab0044-fe1.systems.uk.hsbc/test --table widgets -m 1 --username forest --password abc123 --delete-target-dir


sqoop import --connect 'jdbc:oracle:thin:@hkl103276.hk.hsbc:2001/DHKBBU01.hk.hsbc' --table SOURCE_DATABASE -m 1 --username EN_BABARSIT --password pwd4smithj


sqoop import --connect 'jdbc:oracle:thin:@hkl103276.hk.hsbc:2001/DHKBBU01.hk.hsbc' --table SOURCE_DATABASE -m 1 --username EN_BABARSIT --password pwd4smithj --append

Appends to a new file, part-m-00001


sqoop import --connect 'jdbc:oracle:thin:@hkl103276.hk.hsbc:2001/DHKBBU01.hk.hsbc' --table SOURCE_DATABASE -m 1 --username EN_BABARSIT --password pwd4smithj --append --columns "SOURCE_ID,SOURCE_SYSTEM_NAME"


Delete the target dir before copying

sqoop import --connect 'jdbc:oracle:thin:@hkl103276.hk.hsbc:2001/DHKBBU01.hk.hsbc' --table SOURCE_DATABASE -m 1 --username EN_BABARSIT --password pwd4smithj --delete-target-dir


sqoop import --connect 'jdbc:oracle:thin:@hkl103276.hk.hsbc:2001/DHKBBU01.hk.hsbc' -m 1 --username EN_BABARSIT --password pwd4smithj --delete-target-dir --target-dir /user/ss-nano-uat/SOURCE_DATABASE_1 --query "select * from source_database where source_system_name = 'TRS_SITDS' and \$CONDITIONS"


If the actual values for the primary key are not uniformly distributed across their range, this can result in unbalanced tasks. You should explicitly choose a different column with the --split-by argument.

sqoop import --connect 'jdbc:oracle:thin:@hkl103276.hk.hsbc:2001/DHKBBU01.hk.hsbc' -m 5 --username EN_BABARSIT --password pwd4smithj --delete-target-dir --target-dir /user/ss-nano-uat/SOURCE_DATABASE_1 --query "select * from source_database where \$CONDITIONS" --split-by source_id


sqoop import --connect 'jdbc:oracle:thin:@hkl103276.hk.hsbc:2001/DHKBBU01.hk.hsbc' -m 5 --username EN_BABARSIT --password pwd4smithj --target-dir /user/ss-nano-uat/sqoopjoblog1 --query "select * from sqoopjoblog where \$CONDITIONS" --split-by id --compression-codec 'org.apache.hadoop.io.compress.SnappyCodec' --incremental append --check-column id
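The unbalanced-task point is easy to see from how the split column’s range is divided: a Sqoop-style import cuts [min, max] into roughly equal sub-ranges, one per mapper, so a skewed value distribution piles most rows into one sub-range. A rough sketch of that splitting (illustrative only, not Sqoop’s actual code):

```python
# Sketch of splitting a split-by column's inclusive range across mappers.
def make_splits(lo, hi, num_mappers):
    """Return one (lower, upper) bound pair per mapper over [lo, hi]."""
    step = (hi - lo + 1) / num_mappers
    splits = []
    for i in range(num_mappers):
        lower = lo + round(i * step)
        upper = lo + round((i + 1) * step) - 1
        splits.append((lower, upper))
    splits[-1] = (splits[-1][0], hi)   # last split absorbs rounding slack
    return splits

print(make_splits(1, 100, 5))   # [(1, 20), (21, 40), (41, 60), (61, 80), (81, 100)]
# The sub-ranges are equal-width, but if most rows cluster near one end of the
# key range, the mapper owning that sub-range does most of the work; hence
# --split-by should name a column whose values are spread uniformly.
```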



sqoop import-all-tables --connect jdbc:mysql://wdctestlab0044-fe1.systems.uk.hsbc/test -m 1 --username forest --password abc123


sqoop list-tables --connect jdbc:mysql://wdctestlab0044-fe1.systems.uk.hsbc/test --username forest --password abc123


sqoop list-databases --connect jdbc:mysql://wdctestlab0044-fe1.systems.uk.hsbc --username forest --password abc123



If you do use --escaped-by, --enclosed-by, or --optionally-enclosed-by when importing data into Hive, Sqoop will print a warning message.


Hive will have problems using Sqoop-imported data if your database’s rows contain string fields that have Hive’s default row delimiters (\n and \r characters) or column delimiters (\01 characters) present in them. You can use the --hive-drop-import-delims option to drop those characters on import to give Hive-compatible text data. Alternatively, you can use the --hive-delims-replacement option to replace those characters with a user-defined string on import to give Hive-compatible text data.


Sqoop will pass the field and record delimiters through to Hive. If you do not set any delimiters and do use --hive-import, the field delimiter will be set to ^A and the record delimiter will be set to \n to be consistent with Hive’s defaults.


Sqoop will by default import NULL values as the string null. Hive, however, uses the string \N to denote NULL values, so predicates dealing with NULL (like IS NULL) will not work correctly. You should append the parameters --null-string and --null-non-string for an import job, or --input-null-string and --input-null-non-string for an export job, if you wish to properly preserve NULL values. Because Sqoop uses those parameters in generated code, you need to properly escape the value \N to \\N:


$ sqoop import … --null-string '\\N' --null-non-string '\\N'


sqoop import --connect 'jdbc:oracle:thin:@hkl103276.hk.hsbc:2001/DHKBBU01.hk.hsbc' -m 5 --username EN_BABARSIT --password pwd4smithj --target-dir /user/ss-nano-uat/sqoopjoblog --query "select * from sqoopjoblog where \$CONDITIONS" --split-by id --compression-codec 'org.apache.hadoop.io.compress.SnappyCodec' --incremental append --check-column id --null-string '\\N' --null-non-string '\\N' --hive-import --hive-table 'sqoopjoblog_20160425' --hive-delims-replacement ' '





-- works only if the table widgets is empty

sqoop export --connect jdbc:mysql://wdctestlab0044-fe1.systems.uk.hsbc/test --table widgets -m 1 --username forest --password abc123 --export-dir /user/ss-nano-uat/widgets --fields-terminated-by ',' --lines-terminated-by '\n'


-- works if the table isn’t empty

-- if the table has 1 record but the file has two records, only the matching key is updated; new records are not appended

sqoop export --connect jdbc:mysql://wdctestlab0044-fe1.systems.uk.hsbc/test --table widgets -m 1 --username forest --password abc123 --export-dir /user/ss-nano-uat/widgets --fields-terminated-by ',' --lines-terminated-by '\n' --update-key id


-- update and insert new records with a new key

sqoop export --connect jdbc:mysql://wdctestlab0044-fe1.systems.uk.hsbc/test --table widgets -m 1 --username forest --password abc123 --export-dir /user/ss-nano-uat/widgets --fields-terminated-by ',' --lines-terminated-by '\n' --update-key id --update-mode allowinsert









(1) Reliability

When a node fails, logs can be delivered to other nodes without being lost. Flume offers three levels of reliability guarantee, from strongest to weakest: end-to-end (on receiving data, the agent first writes the event to disk and deletes it only after the transfer succeeds; if sending fails, the data can be resent), store-on-failure (the strategy scribe also adopts: when the receiver crashes, data is written locally and sending resumes after recovery), and best-effort (no acknowledgment after the data is sent to the receiver).

(2) Scalability


(3) Manageability

All agents and collectors are managed centrally by the master, which makes the system easy to maintain. With multiple masters, Flume uses ZooKeeper and gossip to keep dynamic configuration data consistent. On the master, users can inspect each data source or data flow, and can configure and dynamically reload each data source. Flume provides both a web UI and shell script commands for managing data flows.

(4) Functional extensibility

Users can add their own agents, collectors, or storage as needed. Flume also ships with many components, including various agents (file, syslog, etc.), collectors, and storage (file, HDFS, etc.).

Flume is specifically designed to push data from a massive number of sources to the various storage systems in the Hadoop ecosystem, like HDFS and HBase

Each Flume agent has three components: the source, the channel, and the sink. The source is responsible for getting events into the Flume agent, while the sink is responsible for removing the events from the agent and forwarding them to the next agent in the topology, or to HDFS, HBase, Solr, etc

How a source’s channel processor handles events:

1. Receive events
2. Process the events
3. Pass events to the interceptor chain
4. Pass each event to the channel selector
5. Return the list of channels the event is to be written to
6. Write all events that need to go to each required channel; only one transaction is opened with each channel, and all events to a channel are written as part of that transaction
7. Repeat the same with optional channels

Figure 2-1: Sinks, sink runners, sink groups, and sink processors. The sink processor asks one of the sinks in the group to read events from the channel.

flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console

A source instance can specify multiple channels, but a sink instance can only specify one channel. The format is as follows

# list the sources, sinks and channels for the agent
agent_foo.sources = avro-appserver-src-1
agent_foo.sinks = hdfs-sink-1
agent_foo.channels = mem-channel-1

# set channel for source
agent_foo.sources.avro-appserver-src-1.channels = mem-channel-1

# set channel for sink
agent_foo.sinks.hdfs-sink-1.channel = mem-channel-1

multiple flows in an agent

# list the sources, sinks and channels in the agent
agent_foo.sources = avro-AppSrv-source1 exec-tail-source2
agent_foo.sinks = hdfs-Cluster1-sink1 avro-forward-sink2
agent_foo.channels = mem-channel-1 file-channel-2

# flow #1 configuration
agent_foo.sources.avro-AppSrv-source1.channels = mem-channel-1
agent_foo.sinks.hdfs-Cluster1-sink1.channel = mem-channel-1

# flow #2 configuration
agent_foo.sources.exec-tail-source2.channels = file-channel-2
agent_foo.sinks.avro-forward-sink2.channel = file-channel-2

Fan out flow

# list the sources, sinks and channels in the agent
agent_foo.sources = avro-AppSrv-source1
agent_foo.sinks = hdfs-Cluster1-sink1 avro-forward-sink2
agent_foo.channels = mem-channel-1 file-channel-2

# set channels for source
agent_foo.sources.avro-AppSrv-source1.channels = mem-channel-1 file-channel-2

# set channel for sinks
agent_foo.sinks.hdfs-Cluster1-sink1.channel = mem-channel-1
agent_foo.sinks.avro-forward-sink2.channel = file-channel-2

# channel selector configuration
agent_foo.sources.avro-AppSrv-source1.selector.type = multiplexing
agent_foo.sources.avro-AppSrv-source1.selector.header = State
agent_foo.sources.avro-AppSrv-source1.selector.mapping.CA = mem-channel-1
agent_foo.sources.avro-AppSrv-source1.selector.mapping.AZ = file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.mapping.NY = mem-channel-1 file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.default = mem-channel-1
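The multiplexing selector configured above routes each event on its State header. A toy version of that routing logic (illustrative only, not Flume’s ChannelSelector API):

```python
# Toy multiplexing channel selector: pick channels by header value,
# falling back to a default channel list when the value is unmapped.
def select_channels(event_headers, header, mapping, default):
    value = event_headers.get(header)
    return mapping.get(value, default)

# Mirrors the agent_foo configuration above.
mapping = {
    "CA": ["mem-channel-1"],
    "AZ": ["file-channel-2"],
    "NY": ["mem-channel-1", "file-channel-2"],
}
default = ["mem-channel-1"]

assert select_channels({"State": "AZ"}, "State", mapping, default) == ["file-channel-2"]
assert select_channels({"State": "NY"}, "State", mapping, default) == ["mem-channel-1", "file-channel-2"]
assert select_channels({"State": "TX"}, "State", mapping, default) == ["mem-channel-1"]  # unmapped -> default
```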


Running Pig

You can run Pig (execute Pig Latin statements and Pig commands) using various modes.

                   Local Mode   Tez Local Mode   Mapreduce Mode   Tez Mode
Interactive Mode   yes          experimental     yes              yes
Batch Mode         yes          experimental     yes              yes

Execution Modes

Pig has two execution modes or exectypes:

      • Local Mode – To run Pig in local mode, you need access to a single machine; all files are installed and run using your local host and file system. Specify local mode using the -x flag (pig -x local).
      • Tez Local Mode – To run Pig in Tez local mode: it is similar to local mode, except that internally Pig invokes the Tez runtime engine. Specify Tez local mode using the -x flag (pig -x tez_local).
        Note: Tez local mode is experimental. Some queries simply error out on bigger data in local mode.
      • Mapreduce Mode – To run Pig in mapreduce mode, you need access to a Hadoop cluster and HDFS installation. Mapreduce mode is the default mode; you can, but don’t need to, specify it using the -x flag (pig OR pig -x mapreduce).
      • Tez Mode – To run Pig in Tez mode, you need access to a Hadoop cluster and HDFS installation. Specify Tez mode using the -x flag (-x tez).


Pig Latin statements are generally organized as follows:

      • A LOAD statement to read data from the file system.
      • A series of “transformation” statements to process the data.
      • A DUMP statement to view results or a STORE statement to save the results.

Note that a DUMP or STORE statement is required to generate output.


Shell and Utility Commands


fs
Invokes any FsShell command from within a Pig script or the Grunt shell.

sh
Invokes any sh shell command from within a Pig script or the Grunt shell.

grunt> sh ls

exec / run
Run a Pig script.

kill
Kills a job.

quit
Quits from the Pig grunt shell.

set
Assigns values to keys used in Pig.


Pasted from <https://pig.apache.org/docs/r0.9.1/cmds.html>




Relations, Bags, Tuples, Fields

Pig Latin statements work with relations. A relation can be defined as follows:

      • A relation is a bag (more specifically, an outer bag).
      • A bag is a collection of tuples.


A bag is a collection of tuples.

Syntax: Inner bag

{ tuple [, tuple …] }


{  } An inner bag is enclosed in curly brackets { }.
tuple A tuple.


In this example A is a relation or bag of tuples. You can think of this bag as an outer bag.

A = LOAD 'data' AS (f1:int, f2:int, f3:int);



      • A tuple is an ordered set of fields.


A tuple is an ordered set of fields.


( field [, field …] )


(  ) A tuple is enclosed in parentheses ( ).
field A piece of data. A field can be any data type (including tuple and bag).






      • A field is a piece of data.

A Pig relation is a bag of tuples. A Pig relation is similar to a table in a relational database, where the tuples in the bag correspond to the rows in a table. Unlike a relational table, however, Pig relations don’t require that every tuple contain the same number of fields or that the fields in the same position (column) have the same type


Referencing Relations

Relations are referred to by name (or alias). Names are assigned by you as part of the Pig Latin statement. In this example the name (alias) of the relation is A.

A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);




A map is a set of key/value pairs.

Syntax (<> denotes optional)

[ key#value <, key#value …> ]


[ ]  Maps are enclosed in straight brackets [ ].
#  Key/value pairs are separated by the pound sign #.
key  Must be a chararray data type. Must be a unique value.
value  Any data type (defaults to bytearray).


Key values within a relation must be unique.

Also see map schemas.


In this example the map includes two key/value pairs:

[name#John,phone#5551212]



Star Expressions

Star expressions ( * ) can be used to represent all the fields of a tuple


Project-Range Expressions

Project-range ( .. ) expressions can be used to project a range of columns from input. For example:

      • .. $x : projects columns $0 through $x, inclusive
      • $x .. : projects columns $x through the end, inclusive
      • $x .. $y : projects columns $x through $y, inclusive


Schemas for Simple Data Types

Simple data types include int, long, float, double, chararray, bytearray, boolean, datetime, biginteger and bigdecimal.


( alias[:type] [, alias[:type] …] )


Schemas for Complex Data Types

Complex data types include tuples, bags, and maps.

Tuple Schemas

A tuple is an ordered set of fields.


alias[:tuple] ( alias[:type] [, alias[:type] …] )


Pasted from <http://pig.apache.org/docs/r0.14.0/basic.html>



Loads data from the file system.


LOAD 'data' [USING function] [AS schema];


'data'  The name of the file or directory, in single quotes.

If you specify a directory name, all the files in the directory are loaded.

You can use Hadoop globbing to specify files at the file system or directory levels (see Hadoop globStatus for details on globbing syntax).

Note: Pig uses Hadoop globbing, so the functionality is IDENTICAL. However, when you run from the command line using the Hadoop fs command (rather than the Pig LOAD operator), the Unix shell may do some of the substitutions; this could alter the outcome, giving the impression that globbing works differently for Pig and Hadoop. For example:

    • This works:
      hadoop fs -ls /mydata/20110423{00,01,02,03,04,05,06,07,08,09,{10..23}}00//part
    • This does not work:
      LOAD '/mydata/20110423{00,01,02,03,04,05,06,07,08,09,{10..23}}00//part'

USING  Keyword.

If the USING clause is omitted, the default load function PigStorage is used.

function  The load function.

    • You can use a built-in function (see Load/Store Functions). PigStorage is the default load function and does not need to be specified (simply omit the USING clause).
    • You can write your own load function if your data is in a format that cannot be processed by the built-in functions (see User Defined Functions).

AS  Keyword.
schema  A schema using the AS keyword, enclosed in parentheses (see Schemas).

The loader produces the data of the type specified by the schema. If the data does not conform to the schema, depending on the loader, either a null value or an error is generated.

Note: For performance reasons the loader may not immediately convert the data to the specified format; however, you can still operate on the data assuming the specified type.


Use the Parallel Features

You can set the number of reduce tasks for the MapReduce jobs generated by Pig using two parallel features: the PARALLEL clause on individual operators, and the default_parallel script property (SET default_parallel). (The parallel features only affect the number of reduce tasks. Map parallelism is determined by the input file, one map for each HDFS block.)



The -useHCatalog Flag

To bring in the appropriate jars for working with HCatalog, simply include the following flag:

pig -useHCatalog


Store Examples

You can write to a non-partitioned table simply by using HCatStorer. The contents of the table will be overwritten:

store z into 'web_data' using org.apache.hive.hcatalog.pig.HCatStorer();

To add one new partition to a partitioned table, specify the partition value in the store function. Pay careful attention to the quoting, as the whole string must be single quoted and separated with an equals sign:

store z into 'web_data' using org.apache.hive.hcatalog.pig.HCatStorer('datestamp=20110924');

To write into multiple partitions at once, make sure that the partition column is present in your data, then call HCatStorer with no argument:

store z into 'web_data' using org.apache.hive.hcatalog.pig.HCatStorer();
-- datestamp must be a field in the relation z






pig -x local script1-local.pig

-- no schema used

A = LOAD 'myfile.txt';  -- defaults to PigStorage with tab as the delimiter

A = LOAD 'myfile.txt' USING PigStorage('\t');

-- schema used

A = LOAD 'myfile.txt' AS (f1:int, f2:int, f3:int);

pig -useHCatalog

B = LOAD 'trs_uat_20160315.tbl_ts_reg_exc' USING org.apache.hive.hcatalog.pig.HCatLoader();

store B into 'trs_uat_20160315.tbl_ts_reg_exc' using org.apache.hive.hcatalog.pig.HCatStorer(); -- table must exist? This appends, producing duplicate data; how to de-dup?

B = Load 'student' USING PigStorage(',') as (name:chararray,age:int,score:float);


X = Foreach B generate name,age;


Y = FILTER X BY age < 20;




input_file = Load 'group.txt' USING PigStorage(',') as (class:long,name:chararray,age:long,rate:long,salary:long);

ranked = rank input_file;

NoHeader = Filter ranked by (rank_input_file > 1); -- remove header

Ordered = Order NoHeader by class;

New_input_file = foreach Ordered Generate class, name, age, rate, salary;

B = Group New_input_file by class;

C = Order New_input_file by name;

D = Distinct C;

E = Group New_input_file by class PARALLEL 3;


A = Load 'group.txt' USING PigStorage(',');

B = Filter A by $2 is null;


Group_file = Load 'group.txt' USING PigStorage(',') as (class:long,name:chararray,age:long,rate:long,salary:long);

Class_file = Load 'class.txt' USING PigStorage(',') as (class:long,name:chararray);

Inner_join = Join Group_file by class, Class_file by class;

Inner_skewed_join = Join Group_file by class, Class_file by class USING 'skewed';

Inner_merge_join = Join Group_file by class, Class_file by class USING 'merge';  -- joins the data in the map phase of a MapReduce job

Left_join = Join Group_file by class LEFT, Class_file by class;

Left_rep_join = Join Group_file by class LEFT, Class_file by class USING 'replicated'; -- Class_file is loaded into memory on the map side, so it should be the small file; the large file comes first


grunt> describe Left_rep_join

Left_rep_join: {Group_file::class: long,Group_file::name: chararray,Group_file::age: long,Group_file::rate: long,Group_file::salary: long,Class_file::class: long,Class_file::name: chararray}

opp_3 = FOREACH Left_rep_join  GENERATE Group_file::class as class, Class_file::name as teacher;

grunt> describe opp_3

opp_3: {class: long,teacher: chararray}


use the :: operator to identify which column x to use – either relation A column x (A::x) or relation B column x (B::x)




Right_join = Join Group_file by class RIGHT, Class_file by class;


STORE Left_join INTO 'join_result.txt' USING PigStorage('|');





DEFINE getClientInfo () RETURNS client_info {

client = LOAD '$srcHiveSchema.client' USING ${maven_HCatLoader};


$client_info = FOREACH client_join_cps generate

clientID as clientID ,

mailingAddress as mailingAddress,

clientName as clientName,

industry as industry,

industryGroup as industryGroup,

sector as sector,

isDeleted as isDeleted,

isActive as isActive,

country as country,

region as region,

clientStatus as clientStatus,

clientPriority as clientPriority,

clientModifiedTime as clientModifiedTime,

clientCreationTime as clientCreationTime,

clientObjectClass as clientObjectClass,

mastergroup as mastergroup,

crimClient as crimClient,

clientPlanningSummary as clientPlanningSummary;





IMPORT '$cv_macroDir/etl/macro_client.pig';


client = getClientInfo();



'site' is a property (tag) of the XML.


REGISTER 'piggybank-0.15.0.jar';

AC418WP = LOAD 'xml/AC418WP-master.xml' USING org.apache.pig.piggybank.storage.XMLLoader('site');


Hadoop Certified Developer


Load data into a Pig relation without a schema

A = LOAD 'text.txt' USING PigStorage(','); -- as CSV

A = LOAD 'text.txt' USING TextLoader(); -- as string

Load data into a Pig relation with a schema

A = LOAD 'text.txt' USING PigStorage(',') AS (lname:chararray, fname:chararray, …

Load data from a Hive table into a Pig relation

pig -useHCatalog
A = LOAD 'hive_table' USING org.apache.hive.hcatalog.pig.HCatLoader();

Use Pig to transform data into a specified format

B = FOREACH A GENERATE $0..$9;

B = FOREACH A GENERATE
    (chararray) $0 as lname:chararray,
    (chararray) $1 as fname:chararray,
    (chararray) $6 as arrival_time:chararray;

Transform data to match a given Hive schema

A = FOREACH A generate field1, field2, misnamedField as rightName;

Group the data of one or more Pig relations

B = GROUP A ALL; B = GROUP A BY fieldname; B = GROUP A BY (name1, name2);

Use Pig to remove null values from a relation

B = FILTER A BY (fieldname is not NULL); 

Store pig data in HDFS folder

STORE relation INTO 'folder_name' USING PigStorage(',');

Store pig data in HDFS as Json

STORE relation INTO 'folder_name' USING JsonStorage();

Store pig data into a Hive Table

(in hive) create table table_name (field1 type, field2 type, field3 type);

pig -useHCatalog
STORE A into 'table_name' USING org.apache.hive.hcatalog.pig.HCatStorer();

Sort a pig relation’s output

B = ORDER A by (field1, field2) ASC;

Remove duplicates from a pig relation

B = DISTINCT A;

Specify the number of reduce tasks for a Pig job

SET default_parallel 20;

Join two datasets using pig

C = JOIN A by field1, B by field1; 

Run pig on Tez

To run Pig in Tez mode, simply add "-x tez" to the pig command line.

Register a JAR file of UDFs in Pig

REGISTER jarname.jar;

define an alias for a UDF in Pig

DEFINE Alias path.to.UDF(values, to, pass);

Invoke a UDF in Pig 

B = FOREACH A GENERATE Alias(field_name) AS a, b, c;

Input a local file into HDFS using the Hadoop file system shell

hadoop fs -put ./local_file /user/hadoop/hadoopfile 

hadoop fs -moveFromLocal <localsrc> <dst> (deletes local source)

Make a new directory in HDFS using the Hadoop file system shell

hadoop fs -mkdir hadoopdir 

-p option makes parents along the path

Import data from a table in a relational database into HDFS

sqoop import --connect jdbc:mysql://host/nyse \
  --table StockPrices \
  --target-dir /data/stockprice \
  --as-text-file

How do you provide a password for Sqoop

sqoop import --connect jdbc:mysql://database.example.com/employees \
  --username venkatesh --password 1234

sqoop import --connect jdbc:mysql://database.example.com/employees \
  --username venkatesh -P   (enter from command prompt)

Import data from an RDBMS into HDFS using a query

sqoop import --connect jdbc:mysql://database.example.com/employees --query \
  'SELECT a.*, b.* FROM a JOIN b on (a.id == b.id) WHERE $CONDITIONS' \
  --split-by a.id --target-dir /user/foo/joinresults

Import a table from a relational database into a new or existing Hive table

add --hive-import to the Sqoop command line

sqoop import --connect jdbc:mysql://database.example.com/employees --hive-import

Insert or update data from HDFS into a table in a relational database

sqoop export --table --update-key --call (apply a stored procedure for each record) --export-dir --input-fields-terminated-by --update-mode

What are the four pieces of information needed to import from RDBS into HDFS?

  • connection: --connect jdbc:mysql://database.example.com/employees
  • source: --table to choose a table, or --query to specify a query
  • credentials: --username --password
  • destination: --target-dir
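As a sketch, the four pieces can be assembled into a single command; every value below is a hypothetical placeholder:

```shell
#!/bin/sh
# Sketch: assembling the four required pieces into one import command.
# All values below are hypothetical placeholders.
CONNECT="jdbc:mysql://database.example.com/employees"   # connection
TABLE="employees"                                       # source
USER="venkatesh"                                        # credentials (-P prompts for the password)
TARGET="/user/foo/employees"                            # destination

CMD="sqoop import --connect $CONNECT --table $TABLE --username $USER -P --target-dir $TARGET"
echo "$CMD"
```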

How do you limit the import of data from RDBS to HDFS to a specific set of rows/columns

sqoop import --table with --columns to specify columns and --where to control rows

Update a table using data from HDFS

sqoop export --table --update-key --update-mode (updateonly/allowinsert) --export-dir --input-fields-terminated-by

Import the Table salaries from the test database in MySQL into HDFS

mysql test < salaries.sql

sqoop import --connect jdbc:mysql://sandbox/test?user=root --table salaries

Import just the columns Age and Salary from the table salaries in the database test into HDFS using 3 mappers to the HDFS filesystem in the directory salaries2

sqoop import --connect jdbc:mysql://sandbox/test?user=root --table salaries --columns "age,salary" -m 3 --target-dir salaries2

Write a Sqoop import command that imports the rows from salaries in MySQL whose salary column is greater than 90,000.00. Use gender as the –split-by value, specify only two mappers, and import the data into the salaries3 folder in HDFS.

sqoop import --connect jdbc:mysql://sandbox/test?user=root --query "select * from salaries s where s.salary > 90000.00 and \$CONDITIONS" --split-by gender -m 2 --target-dir salaries3
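The backslash before $CONDITIONS matters when the query is double-quoted: the shell would otherwise expand $CONDITIONS (normally unset, so it vanishes) before Sqoop ever sees it. A quick local sketch of the quoting behavior:

```shell
#!/bin/sh
# Escaping the dollar sign passes the literal $CONDITIONS token through,
# so Sqoop can substitute its own split predicate at run time.
ESCAPED="select * from salaries s where s.salary > 90000.00 and \$CONDITIONS"
echo "$ESCAPED"
```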

First put the salarydata.txt file (located in /root/devph/labs/Lab3.2) into a new directory (named salarydata) and then export this directory into the salaries2 table in MySQL. At the end of the MapReduce output, you should see a log event stating that 10,000 records were exported.

hadoop fs -mkdir salarydata
hadoop fs -put salarydata.txt salarydata
sqoop export --connect jdbc:mysql://sandbox/test?user=root --export-dir salarydata --table salaries2 --input-fields-terminated-by ","

Update the salaries2 table in MySQL using the directory salarydata. 

sqoop export --connect jdbc:mysql://sandbox/test?user=root --export-dir salarydata --table salaries2 --input-fields-terminated-by "," --update-key <key> --update-mode <updateonly/allowinsert>

Given a Flume configuration file, start a Flume agent

flume-ng agent -n my_agent -c conf -f myagent.conf

Given a configured sink and source, configure a Flume memory channel with a specified capacity

Add to the agent configuration file:

my_agent.channels = memoryChannel
my_agent.sources.webserver.channels = memoryChannel
my_agent.channels.memoryChannel.type = memory
my_agent.channels.memoryChannel.capacity = 10000
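A complete minimal agent file tying the memory channel to a source and sink might look like this; the netcat source and logger sink are stand-ins for whatever source and sink are already configured:

```
my_agent.sources = webserver
my_agent.channels = memoryChannel
my_agent.sinks = logSink

my_agent.sources.webserver.type = netcat
my_agent.sources.webserver.bind = localhost
my_agent.sources.webserver.port = 44444
my_agent.sources.webserver.channels = memoryChannel

my_agent.channels.memoryChannel.type = memory
my_agent.channels.memoryChannel.capacity = 10000
my_agent.channels.memoryChannel.transactionCapacity = 1000

my_agent.sinks.logSink.type = logger
my_agent.sinks.logSink.channel = memoryChannel
```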

How do you copy all the partitions of a reduce job to the local filesystem?

hadoop fs -getmerge <hadoopfolder> <local_filename>
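What -getmerge does can be sketched locally: it concatenates the job's part-* files, in name order, into one local file. The paths below are illustrative:

```shell
#!/bin/sh
# Local simulation of hadoop fs -getmerge using ordinary files.
mkdir -p /tmp/getmerge_demo
printf 'alpha\n' > /tmp/getmerge_demo/part-r-00000
printf 'beta\n'  > /tmp/getmerge_demo/part-r-00001
# The shell glob expands in sorted order, matching getmerge's behavior.
cat /tmp/getmerge_demo/part-r-* > /tmp/getmerge_demo/merged.txt
cat /tmp/getmerge_demo/merged.txt
```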

How do I find out the user name, host name and port for mysql to configure a JDBC connection?

in mysql: 

username: select user();
hostname and port: show variables;

How do I find the host and user for mysql to use for sqoop

use mysql;

select host, user from user;

** The user must have an entry (host, user) in the mysql.user table for each Hadoop node, or Sqoop will throw an exception.

Pasted from <https://www.studyblue.com/printFlashcardDeck?deckId=15586022&note=true>


HDFS backup

The following are some of the important data sources that need to be protected against data loss:

  • The namenode metadata: The namenode metadata contains the locations of all the files in HDFS.
  • The Hive metastore: The Hive metastore contains the metadata for all Hive tables and partitions.
  • HBase RegionServer data: This contains the data stored in all the HBase regions.
  • Application configuration files: This comprises the important configuration files required to configure Apache Hadoop. For example, core-site.xml, yarn-site.xml, and hdfs-site.xml.

Using the distributed copy (DistCp) to backup

