Phoenix Study Notes --- Bulk Data Loading: Importing CSV Files into Phoenix
Phoenix provides two methods for bulk loading data into Phoenix tables:
- Single-threaded client loading tool for CSV formatted data via the psql command
- MapReduce-based bulk load tool for CSV and JSON formatted data
The psql tool is typically appropriate for tens of megabytes, while the MapReduce-based loader is typically better for larger load volumes.
Use of both loaders is described below.
Sample data
For the following examples, we will assume that we have a CSV file named “data.csv” with the following content:
12345,John,Doe
67890,Mary,Poppins
We will use a table with the following structure:
CREATE TABLE example (
    my_pk bigint not null,
    m.first_name varchar(50),
    m.last_name varchar(50)
    CONSTRAINT pk PRIMARY KEY (my_pk))
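To follow along locally, the sample file can be created like this (a sketch; on a real cluster the file just needs to be readable by the loader):

```shell
# Create the sample data.csv shown above
cat > data.csv <<'EOF'
12345,John,Doe
67890,Mary,Poppins
EOF
wc -l < data.csv   # 2 rows
```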
Loading via PSQL
The psql command is invoked via psql.py in the Phoenix bin directory. In order to use it to load CSV data, it is invoked by providing the connection information for your HBase cluster, the name of the table to load data into, and the path to the CSV file or files. Note that all CSV files to be loaded must have the ‘.csv’ file extension (this is because arbitrary SQL scripts with the ‘.sql’ file extension can also be supplied on the PSQL command line).
To load the example data outlined above into HBase running on the local machine, run the following command:
bin/psql.py -t EXAMPLE localhost data.csv
PSQL accepts a number of parameters, including -t (the target table name, as used above), -d (the field delimiter), and -a (the array element separator); see the Phoenix documentation for the full parameter table.
Loading via MapReduce
For higher-throughput loading distributed over the cluster, the MapReduce loader can be used. This loader first converts all data into HFiles, and then hands the created HFiles over to HBase once creation is complete.
The CSV MapReduce loader is launched using the hadoop command with the Phoenix client jar, as follows:
hadoop jar phoenix-<version>-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool --table EXAMPLE --input /data/example.csv
When using Phoenix 4.0 and above, there is a known HBase issue (see "Notice to Mapreduce users of HBase 0.96.1 and above" at https://hbase.apache.org/book.html); you should use one of the following commands instead:
HADOOP_CLASSPATH=$(hbase mapredcp):/path/to/hbase/conf hadoop jar phoenix-<version>-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool --table EXAMPLE --input /data/example.csv
OR
HADOOP_CLASSPATH=/path/to/hbase-protocol.jar:/path/to/hbase/conf hadoop jar phoenix-<version>-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool --table EXAMPLE --input /data/example.csv
The JSON MapReduce loader is launched using the hadoop command with the Phoenix client jar, as follows:
hadoop jar phoenix-<version>-client.jar org.apache.phoenix.mapreduce.JsonBulkLoadTool --table EXAMPLE --input /data/example.json
The input file must be present on HDFS (not the local filesystem where the command is being run).
The MapReduce loader accepts a number of parameters, including --table and --input (shown above), -z (the ZooKeeper quorum), -d (the field delimiter), and -it (index table name); see the Phoenix documentation for the full parameter table.
Notes on the MapReduce importer
The current MapReduce-based bulk loader runs one MapReduce job to load your data table, plus one additional job per index table to populate your indexes. Use the -it option to load only one of your index tables.
Permissions issues when uploading HFiles
There can be issues due to file permissions on the created HFiles in the final stage of a bulk load, when the created HFiles are handed over to HBase. HBase needs to be able to move the created HFiles, which means that it needs to have write access to the directories where the files have been written. If this is not the case, the uploading of HFiles will hang for a very long time before finally failing.
There are two main workarounds for this issue: running the bulk load process as the hbase user, or creating the output files so that they are readable by all users.
The first option can be done by simply starting the hadoop command with sudo -u hbase, i.e.
sudo -u hbase hadoop jar phoenix-<version>-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool --table EXAMPLE --input /data/example.csv
Creating the output files as readable by all can be done by setting the fs.permissions.umask-mode configuration setting to “000”. This can be set in the hadoop configuration on the machine being used to submit the job, or can be set for the job only during submission on the command line as follows:
hadoop jar phoenix-<version>-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool -Dfs.permissions.umask-mode=000 --table EXAMPLE --input /data/example.csv
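The effect of a 000 umask can be seen with a local-filesystem analogy (assuming GNU stat; HDFS applies fs.permissions.umask-mode the same way when the job creates its output files):

```shell
# With a 000 umask, newly created files come out readable and writable
# by everyone -- which is what lets the hbase user move the HFiles
rm -f /tmp/umask_demo
(umask 000 && touch /tmp/umask_demo)
stat -c '%a' /tmp/umask_demo   # -> 666
```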
Loading array data
Both the PSQL loader and MapReduce loader support loading array values with the -a flag. Arrays in a CSV file are represented by a field that uses a different delimiter than the main CSV delimiter. For example, the following file would represent an id field and an array of integers:
1,2:3:4
2,3:4:5
To load this file, the default delimiter (comma) would be used, and the array delimiter (colon) would be supplied with the parameter -a ':'.
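The two-level splitting can be sketched in plain shell (this only illustrates the file format, it is not Phoenix code): split on the main delimiter first, then split the array field on the array delimiter.

```shell
line='1,2:3:4'
id=${line%%,*}      # part before the first comma -> "1"
arr=${line#*,}      # the array field -> "2:3:4"
IFS=':'; set -- $arr; unset IFS   # split the array field on ':'
n=$#
echo "id=$id, array elements=$n"   # id=1, array elements=3
```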
A note on separator characters
The default separator character for both loaders is a comma (,). A common separator for input files is the tab character, which can be tricky to supply on the command line. A common mistake is trying to supply a tab as the separator by typing the following:
-d '\t'
This will not work, as the shell will supply this value as two characters (a backslash and a ‘t’) to Phoenix.
Two ways in which you can supply a special character such as a tab on the command line are as follows:
By preceding the string representation of a tab with a dollar sign:
-d $'\t'
By typing Ctrl+V and then pressing the Tab key inside the quotes, which inserts a literal tab character (it will appear as blank space, not as \t):
-d '	'
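The difference is easy to verify in bash: '\t' in single quotes is two characters (a backslash and a t), while $'\t' (ANSI-C quoting) is a single real tab, which is what the loader needs to see.

```shell
printf '%s' '\t' | wc -c    # -> 2
printf '%s' $'\t' | wc -c   # -> 1
```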
A larger example table (the column names are double-quoted to keep them lowercase and case-sensitive; note that the type names must not be quoted):
CREATE TABLE web_sales(
    "ws_sold_date_sk" INTEGER,
    "ws_sold_time_sk" INTEGER,
    "ws_ship_date_sk" INTEGER,
    "ws_item_sk" INTEGER,
    "ws_bill_customer_sk" INTEGER,
    "ws_bill_cdemo_sk" INTEGER,
    "ws_bill_hdemo_sk" INTEGER,
    "ws_bill_addr_sk" INTEGER,
    "ws_ship_customer_sk" INTEGER,
    "ws_ship_cdemo_sk" INTEGER,
    "ws_ship_hdemo_sk" INTEGER,
    "ws_ship_addr_sk" INTEGER,
    "ws_web_page_sk" INTEGER,
    "ws_web_site_sk" INTEGER,
    "ws_ship_mode_sk" INTEGER,
    "ws_warehouse_sk" INTEGER,
    "ws_promo_sk" INTEGER,
    "ws_order_number" INTEGER,
    "ws_quantity" INTEGER,
    "ws_wholesale_cost" DECIMAL,
    "ws_list_price" DECIMAL,
    "ws_sales_price" DECIMAL,
    "ws_ext_discount_amt" DECIMAL,
    "ws_ext_sales_price" DECIMAL,
    "ws_ext_wholesale_cost" DECIMAL,
    "ws_ext_list_price" DECIMAL,
    "ws_ext_tax" DECIMAL,
    "ws_coupon_amt" DECIMAL,
    "ws_ext_ship_cost" DECIMAL,
    "ws_net_paid" DECIMAL,
    "ws_net_paid_inc_tax" DECIMAL,
    "ws_net_paid_inc_ship" DECIMAL,
    "ws_net_paid_inc_ship_tax" DECIMAL,
    "ws_net_profit" DECIMAL
    CONSTRAINT pk PRIMARY KEY("ws_item_sk"))
bin/psql.py -t TABLENAME zookeeper-ip:2181 -d '<delimiter>' importfile.csv
hadoop jar phoenix-4.4.0.2.3.4.0-3485-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool -z ip:2181:/hbase-unsecure -d '<delimiter>' --table TABLENAME --input /hdfs/filename.csv