Sqoop

Sqoop 2 has a server component that runs jobs, as well as a range of clients: a command-line interface (CLI), a web UI, a REST API, and a Java API. Sqoop 2 will also be able to use alternative execution engines, such as Spark. Note that Sqoop 2's CLI is not compatible with Sqoop 1's CLI.

Generated Code

Sqoop can use generated code to handle the deserialization of table-specific data from the database source before writing it to HDFS.

If you’re working with records imported to SequenceFiles, it is inevitable that you’ll need to use the generated classes (to deserialize data from the SequenceFile storage). You can work with text-file-based records without using generated code, but as we’ll see in Working with Imported Data, Sqoop’s generated code can handle some tedious aspects of data processing for you.
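
For example, the record class can be regenerated on its own with the codegen tool. A minimal sketch, assuming the same local MySQL test database used later in these notes (the class name is just illustrative):

sqoop codegen --connect jdbc:mysql://localhost/test --table widgets --username forest --password abc123 --class-name Widget

This writes Widget.java (plus the compiled class and jar), which you can reuse when processing the imported records.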

Controlling the Import

- specify a --query argument with a free-form SQL statement, e.g. "select xxx from xx where ... and \$CONDITIONS" (Sqoop requires the \$CONDITIONS token in the WHERE clause so it can split the query across mappers)
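
A minimal sketch of a free-form query import, assuming the local MySQL test database used later in these notes (the filter and target directory are illustrative):

sqoop import --connect jdbc:mysql://localhost/test --username forest --password abc123 --query "select * from widgets where id > 3 and \$CONDITIONS" --target-dir /user/user1/widgets_filtered -m 1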

Imports and Consistency

The best way to guarantee a consistent snapshot is to ensure that any processes that update existing rows of the table are disabled during the import.

Incremental Imports

Sqoop will import rows that have a column value (for the column specified with --check-column) that is greater than some specified value (set via --last-value).
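
A hedged sketch against the sample MySQL table used later in these notes (the --last-value of 5 is just an illustrative starting point):

sqoop import --connect jdbc:mysql://localhost/test --username forest --password abc123 --table widgets --incremental append --check-column id --last-value 5 -m 1

At the end of the run Sqoop prints the new --last-value to use for the next incremental import.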

Imported Data and Hive

sqoop create-hive-table --connect jdbc:mysql://localhost/hadoopguide --table widgets --fields-terminated-by ','

Sqoop Sample

sqoop import --connect jdbc:mysql://localhost/test --table widgets -m 1

-- fails with: Access denied for user 'forest'@'localhost' (using password: YES)
mysql -u root mysql
mysql -uforest -pabc123

sqoop import --connect jdbc:mysql://localhost/test --table widgets -m 1 --username forest --password abc123
-- doesn't work because the job runs on the cluster, so "localhost" does not resolve to the database host

hdfs dfs -cat /user/user1/SOURCE_DATABASE/part-m-00000
hdfs dfs -ls /user/user1/SOURCE_DATABASE
sqoop import --connect jdbc:mysql://abcdlab123-fe1.systems.ku.xsxc/test --table widgets -m 1 --username forest --password abc123 --delete-target-dir

sqoop import --connect 'jdbc:oracle:thin:@xyl103276.xx.abcd:2001/DxyBBU01.xx.abcd' --table SOURCE_DATABASE -m 1 --username EN_XXXXXSIT --password pwd123

sqoop import --connect 'jdbc:oracle:thin:@xyl103276.xx.abcd:2001/DxyBBU01.xx.abcd' --table SOURCE_DATABASE -m 1 --username EN_XXXXXSIT --password pwd123 --append
-- appends the new rows as an additional file, part-m-00001

sqoop import --connect 'jdbc:oracle:thin:@xyl103276.xx.abcd:2001/DxyBBU01.xx.abcd' --table SOURCE_DATABASE -m 1 --username EN_XXXXXSIT --password pwd123 --append --columns "SOURCE_ID,SOURCE_SYSTEM_NAME"

Delete target dir before copying
sqoop import --connect 'jdbc:oracle:thin:@xyl103276.xx.abcd:2001/DxyBBU01.xx.abcd' --table SOURCE_DATABASE -m 1 --username EN_XXXXXSIT --password pwd123 --delete-target-dir
sqoop import --connect 'jdbc:oracle:thin:@xyl103276.xx.abcd:2001/DxyBBU01.xx.abcd' -m 1 --username EN_XXXXXSIT --password pwd123 --delete-target-dir --target-dir /user/user1/SOURCE_DATABASE_1 --query "select * from source_database where source_system_name = 'TRS_SITDS' and \$CONDITIONS"

If the actual values for the primary key are not uniformly distributed across its range, then this can result in unbalanced tasks. You should explicitly choose a different column with the --split-by argument.
sqoop import --connect 'jdbc:oracle:thin:@xyl103276.xx.abcd:2001/DxyBBU01.xx.abcd' -m 5 --username EN_XXXXXSIT --password pwd123 --delete-target-dir --target-dir /user/user1/SOURCE_DATABASE_1 --query "select * from source_database where \$CONDITIONS" --split-by source_id

sqoop import --connect 'jdbc:oracle:thin:@xyl103276.xx.abcd:2001/DxyBBU01.xx.abcd' -m 5 --username EN_XXXXXSIT --password pwd123 --target-dir /user/user1/sqoopjoblog1 --query "select * from sqoopjoblog where \$CONDITIONS" --split-by id --compression-codec 'org.apache.hadoop.io.compress.SnappyCodec' --incremental append --check-column id
sqoop import-all-tables --connect jdbc:mysql://abcdlab123-fe1.systems.ku.xsxc/test -m 1 --username forest --password abc123

sqoop list-tables --connect jdbc:mysql://abcdlab123-fe1.systems.ku.xsxc/test --username forest --password abc123

sqoop list-databases --connect jdbc:mysql://abcdlab123-fe1.systems.ku.xsxc --username forest --password abc123
If you do use --escaped-by, --enclosed-by, or --optionally-enclosed-by when importing data into Hive, Sqoop will print a warning message.

Hive will have problems using Sqoop-imported data if your database's rows contain string fields that have Hive's default row delimiters (\n and \r characters) or column delimiters (\01 characters) present in them. You can use the --hive-drop-import-delims option to drop those characters on import to give Hive-compatible text data. Alternatively, you can use the --hive-delims-replacement option to replace those characters with a user-defined string on import to give Hive-compatible text data.
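
A minimal sketch of dropping the delimiters on import, reusing the sample MySQL table from above (with --hive-import and no --hive-table, the Hive table takes the source table's name):

sqoop import --connect jdbc:mysql://localhost/test --username forest --password abc123 --table widgets -m 1 --hive-import --hive-drop-import-delims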

Sqoop will pass the field and record delimiters through to Hive. If you do not set any delimiters and do use --hive-import, the field delimiter will be set to ^A and the record delimiter will be set to \n to be consistent with Hive's defaults.
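
To override the defaults you can set the delimiters explicitly; a hedged example against the same sample table, with the tab delimiter chosen purely for illustration:

sqoop import --connect jdbc:mysql://localhost/test --username forest --password abc123 --table widgets -m 1 --hive-import --fields-terminated-by '\t' --lines-terminated-by '\n'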

Sqoop will by default import NULL values as the string null. Hive, however, uses the string \N to denote NULL values, so predicates dealing with NULL (like IS NULL) will not work correctly. You should add the parameters --null-string and --null-non-string for an import job, or --input-null-string and --input-null-non-string for an export job, if you wish to properly preserve NULL values. Because Sqoop uses those parameters in generated code, you need to properly escape the value \N to \\N:

$ sqoop import … --null-string '\\N' --null-non-string '\\N'

sqoop import --connect 'jdbc:oracle:thin:@xyl103276.xx.abcd:2001/DxyBBU01.xx.abcd' -m 5 --username EN_XXXXXSIT --password pwd123 --target-dir /user/user1/sqoopjoblog --query "select * from sqoopjoblog where \$CONDITIONS" --split-by id --compression-codec 'org.apache.hadoop.io.compress.SnappyCodec' --incremental append --check-column id --null-string '\\N' --null-non-string '\\N' --hive-import --hive-table 'sqoopjoblog_20160425' --hive-delims-replacement ' '
Export

-- works only if the table widgets is empty
sqoop export --connect jdbc:mysql://abcdlab123-fe1.systems.ku.xsxc/test --table widgets -m 1 --username forest --password abc123 --export-dir /user/user1/widgets --fields-terminated-by ',' --lines-terminated-by '\n'

-- works even if the table isn't empty
-- if the table has 1 record but the file has 2 records, only the row whose key matches is updated; new records are not appended
sqoop export --connect jdbc:mysql://abcdlab123-fe1.systems.ku.xsxc/test --table widgets -m 1 --username forest --password abc123 --export-dir /user/user1/widgets --fields-terminated-by ',' --lines-terminated-by '\n' --update-key id

-- updates matching rows and inserts new records with new keys

sqoop export --connect jdbc:mysql://abcdlab123-fe1.systems.ku.xsxc/test --table widgets -m 1 --username forest --password abc123 --export-dir /user/user1/widgets --fields-terminated-by ',' --lines-terminated-by '\n' --update-key id --update-mode allowinsert
