Demystifying the Skip Scan in Phoenix
Today the Phoenix blog is brought to you by my esteemed colleague and man of many hats, Mujtaba Chohan, who today is wearing his performance engineer hat.
SKIP SCAN
Phoenix 1.2 uses a Skip Scan for intra-row scanning, which allows for a significant performance improvement over Multi Gets and Range Scans when rows are retrieved based on a given set of keys.
The Skip Scan leverages SEEK_NEXT_USING_HINT of the HBase Filter. It stores information about which set of keys or ranges of keys is being searched for in each column. It then takes a key (passed to it during filter evaluation) and determines whether it falls within one of those key combinations or ranges. If not, it works out the next highest key that the scan should jump to.
Input to the SkipScanFilter is a List<List<KeyRange>> where the top level list represents each column in the row key (i.e. each primary key part), and the inner list represents ORed together byte array boundaries.
Consider the following query:
SELECT * FROM T WHERE ((KEY1 >= 'a' AND KEY1 <= 'b') OR (KEY1 > 'c' AND KEY1 <= 'e')) AND KEY2 IN (1, 2)
List<List<KeyRange>> for SkipScanFilter for the above query would be:
- [[[a - b], [d - e]], [1, 2]]
where [[a - b], [d - e]] are the ranges for KEY1 and [1, 2] are the keys for KEY2. Consider this running on the following data.
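The filter logic described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not Phoenix's SkipScanFilter: key parts are small integers instead of byte arrays (so "the next key" is just value + 1), KEY1 values a..e are encoded as 1..5, every range is closed, and the names `skip_scan_hint` and `skip_scan` are hypothetical.

```python
from bisect import bisect_left

def first_range_at_or_after(ranges, value):
    """Return the first (lo, hi) range with hi >= value, or None if value is past them all."""
    for lo, hi in ranges:
        if value <= hi:
            return (lo, hi)
    return None

def skip_scan_hint(key, slots):
    """Classify a row key against one sorted list of closed ranges per key column.

    Returns ("INCLUDE", key), ("SEEK", next_key) or ("DONE", None).
    Assumes integer key parts, so the successor of a value is value + 1.
    """
    candidate = list(key)
    i = 0
    while i < len(slots):
        rng = first_range_at_or_after(slots[i], candidate[i])
        if rng is None:
            # Column i is past its last range: advance the previous column.
            if i == 0:
                return ("DONE", None)
            candidate[i - 1] += 1
            for j in range(i, len(slots)):
                candidate[j] = slots[j][0][0]  # smallest allowed value
            i -= 1
            continue
        lo, _hi = rng
        if candidate[i] < lo:
            # In a gap: jump to the start of the next range, reset later columns.
            candidate[i] = lo
            for j in range(i + 1, len(slots)):
                candidate[j] = slots[j][0][0]
        i += 1
    candidate = tuple(candidate)
    if candidate == tuple(key):
        return ("INCLUDE", candidate)
    return ("SEEK", candidate)

def skip_scan(rows, slots):
    """Walk sorted rows, seeking forward (via bisect) whenever the filter returns a hint."""
    out, pos = [], 0
    while pos < len(rows):
        action, target = skip_scan_hint(rows[pos], slots)
        if action == "DONE":
            break
        if action == "INCLUDE":
            out.append(rows[pos])
            pos += 1
        else:
            pos = bisect_left(rows, target, pos)
    return out

# [[[a - b], [d - e]], [1, 2]] with a..e encoded as 1..5 (an assumption of this sketch).
slots = [[(1, 2), (4, 5)], [(1, 1), (2, 2)]]
rows = [(k1, k2) for k1 in range(7) for k2 in range(4)]
matches = skip_scan(rows, slots)
```

In this toy data set, a row such as (1, 3) is not filtered one row at a time; the hint sends the scan straight to (2, 1), which is the behaviour that SEEK_NEXT_USING_HINT makes possible on the HBase side.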
PERFORMANCE
For this performance comparison, we are using simulated data for a real use case outlined on the HBase user mailing list.
- Number of rows: 1 billion.
- The key consists of 50 million OBJECTIDs and 20 FIELDTYPEs. Each key has 10 ATTRIBIDs, and VALUE is a random integer.
Phoenix Create Table DDL
CREATE TABLE T(OBJECTID INTEGER NOT NULL, FIELDTYPE CHAR(2) NOT NULL, CF.ATTRIBID INTEGER, CF.VAL INTEGER CONSTRAINT PK PRIMARY KEY (OBJECTID, FIELDTYPE)) COMPRESSION='GZ', BLOCKSIZE='4096'
Query
SELECT AVG(VAL) FROM T WHERE OBJECTID IN (250K RANDOM OBJECTIDs) AND FIELDTYPE = 'F1' AND ATTRIBID='A1'
IN-MEMORY TEST
Time taken to run the query when rows are fetched from the HBase Block Cache.
Test | Time
Phoenix | 1.7 sec
Batched Gets | 4.0 sec
DISK READ TEST
Time taken to run the query when data is fetched from disk.
Test | Time
Phoenix | 37 sec
Batched Gets | 82 sec
Range Scan | 12 mins
Hive over HBase | 20+ mins
SERIAL TEST
To further illustrate the performance gain from the Skip Scan, we compare Phoenix Serial Skip Scan performance (phoenix.query.threadPoolSize=1) against Serial Batched Gets and Scan. The total number of rows is 8M (all rows fit in the HBase block cache). The percentage of random keys passed in the IN clause is varied on the X axis.
Phoenix Create Table DDL
CREATE TABLE T(KEY VARCHAR NOT NULL AS KEY,CF.A BIGINT,CF.B BIGINT, CF2.C BIGINT
Query
SELECT A FROM T WHERE KEY IN (?,?,?...)
Figure: Comparison of Serial Skip Scan vs Serial Batched Gets and Scan, varying the percentage of keys passed in the IN clause
CONCLUSION
Because the Skip Scan uses reseek, it is about 3 times faster than Batched Gets. The Skip Scan can be 20x faster than scans over large data sets that cannot all fit into memory, and it is 8x faster even if the data is in memory (when 1% of the rows are selected). This comes in addition to Phoenix's already fast performance from server-side coprocessors for aggregation and from query parallelization, which is yet another reason to use the latest Phoenix release!
CONFIGURATION
HBase 0.94.7
Hadoop 1.04
Region Servers (RS): 4 (6 Core 3GHz, 12GB RAM with 8GB set as HBase heap on each RS)
Total number of regions: 20
Note: All the keys passed in the IN clause are present; therefore Bloom Filters were not used.
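As a quick sanity check on the in-memory and disk read timings reported above, the ratios can be computed directly. The sketch below uses only the times quoted in this post; the dictionary layout and the `speedups` helper are illustrative names, and the Hive result is left out because "20+ mins" is only a lower bound.

```python
# Times in seconds, copied from the in-memory and disk read tests in this post.
timings_sec = {
    "in_memory": {"Phoenix": 1.7, "Batched Gets": 4.0},
    "disk": {"Phoenix": 37, "Batched Gets": 82, "Range Scan": 12 * 60},
}

def speedups(times, baseline="Phoenix"):
    """Return how many times slower each approach is than the baseline, rounded to 0.1."""
    base = times[baseline]
    return {name: round(t / base, 1) for name, t in times.items() if name != baseline}

in_memory = speedups(timings_sec["in_memory"])
disk = speedups(timings_sec["disk"])
print(in_memory)
print(disk)
```

On these two tests alone, Batched Gets come out roughly 2.2x to 2.4x slower than Phoenix, and the Range Scan about 19.5x slower, which is in line with the "20x faster than scans" figure quoted in the conclusion.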
http://phoenix-hbase.blogspot.com/2013/05/demystifying-skip-scan-in-phoenix.html