Phoenix Study Notes --- Bulk Data Loading: Importing CSV Files into Phoenix


Phoenix provides two methods for bulk loading data into Phoenix tables:

  • Single-threaded client loading tool for CSV formatted data via the psql command
  • MapReduce-based bulk load tool for CSV and JSON formatted data

The psql tool is typically appropriate for tens of megabytes, while the MapReduce-based loader is typically better for larger load volumes.

Use of both loaders is described below.

Sample data

For the following examples, we will assume that we have a CSV file named “data.csv” with the following content:

12345,John,Doe
67890,Mary,Poppins

We will use a table with the following structure:

CREATE TABLE example (
    my_pk bigint not null,
    m.first_name varchar(50),
    m.last_name varchar(50)
    CONSTRAINT pk PRIMARY KEY (my_pk))

Loading via PSQL

The psql command is invoked via psql.py in the Phoenix bin directory. To use it to load CSV data, provide the connection information for your HBase cluster, the name of the table to load data into, and the path to the CSV file or files. Note that all CSV files to be loaded must have the '.csv' file extension (this is because arbitrary SQL scripts with the '.sql' file extension can also be supplied on the PSQL command line).
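Before loading data, the table itself can also be created through psql.py, since .sql scripts and .csv files may be mixed on the same command line. As a minimal sketch (create_table.sql is a hypothetical file holding the CREATE TABLE statement above):

# create_table.sql is assumed to contain the CREATE TABLE example (...) statement from the previous section
bin/psql.py localhost create_table.sql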

To load the example data outlined above into HBase running on the local machine, run the following command:

bin/psql.py -t EXAMPLE localhost data.csv
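After the load completes, the rows can be checked by running a small query script through the same tool. This is only an illustrative sketch; verify.sql is a hypothetical file name:

# write a simple query and execute it against the local cluster; psql.py prints the result set
echo 'SELECT * FROM example;' > verify.sql
bin/psql.py localhost verify.sql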

The following parameters can be used for loading data with PSQL:

Parameter   Description
-t          Provide the name of the table in which to load data. By default, the name of the table is taken from the name of the CSV file. This parameter is case-sensitive.
-h          Overrides the column names to which the CSV data maps; this is case-sensitive. The special value in-line indicates that the first line of the CSV file determines the columns to which the data maps.
-s          Run in strict mode, throwing an error on CSV parsing errors.
-d          Supply a custom delimiter or delimiters for CSV parsing.
-q          Supply a custom phrase delimiter; defaults to the double quote character.
-e          Supply a custom escape character; the default is a backslash.
-a          Supply an array delimiter (explained in more detail below).
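For example, a semicolon-delimited file could be loaded in strict mode as follows. This is an illustrative sketch, and data_semicolon.csv is a hypothetical file name:

# -d sets the field delimiter and -s aborts on the first CSV parsing error
bin/psql.py -t EXAMPLE -d ';' -s localhost data_semicolon.csv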

Loading via MapReduce

For higher-throughput loading distributed over the cluster, the MapReduce loader can be used. This loader first converts all data into HFiles, and then provides the created HFiles to HBase after the HFile creation is complete.

The CSV MapReduce loader is launched using the hadoop command with the Phoenix client jar, as follows:

hadoop jar phoenix-<version>-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool --table EXAMPLE --input /data/example.csv

When using Phoenix 4.0 and above, there is a known HBase issue (see "Notice to Mapreduce users of HBase 0.96.1 and above" at https://hbase.apache.org/book.html), so you should use one of the following commands:

HADOOP_CLASSPATH=$(hbase mapredcp):/path/to/hbase/conf hadoop jar phoenix-<version>-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool --table EXAMPLE --input /data/example.csv

OR

HADOOP_CLASSPATH=/path/to/hbase-protocol.jar:/path/to/hbase/conf hadoop jar phoenix-<version>-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool --table EXAMPLE --input /data/example.csv
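If you are unsure where the hbase-protocol jar or the HBase configuration directory lives, something like the following may help. The paths below are assumptions and vary by distribution:

# locate the hbase-protocol jar and the HBase configuration directory (paths differ between distributions)
find /usr/lib/hbase /usr/hdp -name 'hbase-protocol*.jar' 2>/dev/null
echo "$HBASE_CONF_DIR"   # commonly /etc/hbase/conf when set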

The JSON MapReduce loader is launched using the hadoop command with the Phoenix client jar, as follows:

hadoop jar phoenix-<version>-client.jar org.apache.phoenix.mapreduce.JsonBulkLoadTool --table EXAMPLE --input /data/example.json

The input file must be present on HDFS (not the local filesystem where the command is being run).
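For example, the sample file could be copied from the local filesystem to HDFS before launching the job. This sketch assumes the standard hdfs command-line client and the /data target directory used above:

# create the target directory on HDFS and upload the local CSV file
hdfs dfs -mkdir -p /data
hdfs dfs -put data.csv /data/example.csv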

The following parameters can be used with the MapReduce loader.

Parameter               Description
-i,--input              Input CSV path (mandatory)
-t,--table              Phoenix table name (mandatory)
-a,--array-delimiter    Array element delimiter (optional)
-c,--import-columns     Comma-separated list of columns to be imported
-d,--delimiter          Input delimiter, defaults to comma
-g,--ignore-errors      Ignore input errors
-o,--output             Output path for temporary HFiles (optional)
-s,--schema             Phoenix schema name (optional)
-z,--zookeeper          Zookeeper quorum to connect to (optional)
-it,--index-table       Index table name to load (optional)
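As an illustrative sketch combining several of these options (the ZooKeeper quorum and output path below are hypothetical values), a pipe-delimited file could be loaded against an explicit quorum while writing the temporary HFiles to a custom location:

# -z points at the ZooKeeper quorum, -d sets the input delimiter, -o overrides the temporary HFile path
hadoop jar phoenix-<version>-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool \
    -z zk1,zk2,zk3:2181 \
    -d '|' \
    -o /tmp/example_hfiles \
    --table EXAMPLE --input /data/example.csv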

Notes on the MapReduce importer

The current MR-based bulk loader will run one MR job to load your data table and one MR per index table to populate your indexes. Use the -it option to only load one of your index tables.
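For example, to restrict index loading to a single index table (EXAMPLE_IDX is a hypothetical index name, used here purely for illustration):

# -it / --index-table limits index loading to the named index table, per the note above
hadoop jar phoenix-<version>-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool \
    --table EXAMPLE --index-table EXAMPLE_IDX --input /data/example.csv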

Permissions issues when uploading HFiles

There can be issues due to file permissions on the created HFiles in the final stage of a bulk load, when the created HFiles are handed over to HBase. HBase needs to be able to move the created HFiles, which means that it needs to have write access to the directories where the files have been written. If this is not the case, the uploading of HFiles will hang for a very long time before finally failing.

There are two main workarounds for this issue: running the bulk load process as the hbase user, or creating the output files as readable by all users.

The first option can be done by simply starting the hadoop command with sudo -u hbase, i.e.

sudo -u hbase hadoop jar phoenix-<version>-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool --table EXAMPLE --input /data/example.csv

Creating the output files as readable by all can be done by setting the fs.permissions.umask-mode configuration setting to “000”. This can be set in the hadoop configuration on the machine being used to submit the job, or can be set for the job only during submission on the command line as follows:

hadoop jar phoenix-<version>-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool -Dfs.permissions.umask-mode=000 --table EXAMPLE --input /data/example.csv

Loading array data

Both the PSQL loader and MapReduce loader support loading array values with the -a flag. Arrays in a CSV file are represented by a field that uses a different delimiter than the main CSV delimiter. For example, the following file would represent an id field and an array of integers:

1,2:3:4
2,3:4:5

To load this file, the default delimiter (comma) would be used, and the array delimiter (colon) would be supplied with the parameter -a ':'.
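As a sketch, assuming a table with a BIGINT id column and an INTEGER ARRAY column (ARRAY_EXAMPLE and array_data.csv are hypothetical names), the file above could be loaded with psql.py like this:

# assumed schema: CREATE TABLE ARRAY_EXAMPLE (id BIGINT NOT NULL PRIMARY KEY, vals INTEGER ARRAY)
# ':' separates the elements inside the array field, while ',' remains the field delimiter
bin/psql.py -t ARRAY_EXAMPLE -a ':' localhost array_data.csv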

A note on separator characters

The default separator character for both loaders is a comma (,). A common separator for input files is the tab character, which can be tricky to supply on the command line. A common mistake is trying to supply a tab as the separator by typing the following:

-d '\t'

This will not work, as the shell will supply this value as two characters (a backslash and a ‘t’) to Phoenix.

Two ways in which you can supply a special character such as a tab on the command line are as follows:

  1. By preceding the string representation of a tab with a dollar sign:

    -d $'\t'

  2. By typing Ctrl+V and then pressing the Tab key, so that a single literal tab character ends up between the quotes (shown below as <tab>):

    -d '<tab>'
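Putting this together, a tab-separated file could be loaded with psql.py as follows. This is an illustrative sketch: data_tab.csv is a hypothetical file name, and the $'\t' form assumes a bash-like shell:

# the shell expands $'\t' into a literal tab character before it reaches Phoenix
bin/psql.py -t EXAMPLE -d $'\t' localhost data_tab.csv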


==============================================================================================


Translated, the above can be understood as follows.

Three operations:

Create the table (borrowing a TPC-DS statement here):

CREATE TABLE web_sales(
    "ws_sold_date_sk" "INTEGER", "ws_sold_time_sk" "INTEGER", "ws_ship_date_sk" "INTEGER", "ws_item_sk" "INTEGER",
    "ws_bill_customer_sk" "INTEGER", "ws_bill_cdemo_sk" "INTEGER", "ws_bill_hdemo_sk" "INTEGER", "ws_bill_addr_sk" "INTEGER",
    "ws_ship_customer_sk" "INTEGER", "ws_ship_cdemo_sk" "INTEGER", "ws_ship_hdemo_sk" "INTEGER", "ws_ship_addr_sk" "INTEGER",
    "ws_web_page_sk" "INTEGER", "ws_web_site_sk" "INTEGER", "ws_ship_mode_sk" "INTEGER", "ws_warehouse_sk" "INTEGER",
    "ws_promo_sk" "INTEGER", "ws_order_number" "INTEGER", "ws_quantity" "INTEGER", "ws_wholesale_cost" "DECIMAL",
    "ws_list_price" "DECIMAL", "ws_sales_price" "DECIMAL", "ws_ext_discount_amt" "DECIMAL", "ws_ext_sales_price" "DECIMAL",
    "ws_ext_wholesale_cost" "DECIMAL", "ws_ext_list_price" "DECIMAL", "ws_ext_tax" "DECIMAL", "ws_coupon_amt" "DECIMAL",
    "ws_ext_ship_cost" "DECIMAL", "ws_net_paid" "DECIMAL", "ws_net_paid_inc_tax" "DECIMAL", "ws_net_paid_inc_ship" "DECIMAL",
    "ws_net_paid_inc_ship_tax" "DECIMAL", "ws_net_profit" "DECIMAL"
    CONSTRAINT pk PRIMARY KEY("ws_item_sk"))


First, a primary key must be specified. Second, a table name enclosed in double quotes is stored and displayed in lowercase, while an unquoted table name is stored in uppercase.
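A minimal sketch of that behavior (the statements and file name below are hypothetical, not part of the original notes):

# unquoted identifiers are folded to upper case, quoted identifiers keep their case
echo 'CREATE TABLE case_demo (id BIGINT NOT NULL PRIMARY KEY);'      >  case_demo.sql   # stored as CASE_DEMO
echo 'CREATE TABLE "case_demo_lc" (id BIGINT NOT NULL PRIMARY KEY);' >> case_demo.sql   # stored as case_demo_lc
bin/psql.py localhost case_demo.sql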


Importing comes in two modes.

Local mode:

The basic import command is:

bin/psql.py -t tablename ip:2181 -d '<delimiter>' importfile.csv



HDFS mode:


hadoop jar phoenix-4.4.0.2.3.4.0-3485-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool -z ip:2181:/hbase-unsecure -d '<delimiter>' --table tablename --input /hdfs/filename.csv




[Note]: In the import command above, "/hbase-unsecure" is the znode used by the HBase that ships with HDP; if you are using stock Apache HBase, write "hbase" there instead. The -z option specifies the ZooKeeper IP and port together with the HBase znode parent, and -d specifies the delimiter.

The command supports other parameters as well; see the original documentation above.
