How-to: use Spark to support queries across MySQL tables and HBase tables
It would be great for a data analyst to simply run one "big SQL" statement to process tables from MySQL, HBase, or anywhere else. The other important concern is performance: we should avoid running such "big SQL" directly against MySQL, especially when it joins many tables. One of our data analysts once ran a big SQL statement that joined 33 tables, and the result was that our MySQL server was run to "death". Another case is much simpler: a "select count(*)" on a table with 3 million rows, which our data analyst said would take more than half an hour.
To resolve this, a good choice is Spark, whose Parquet support and DataFrame API address both problems; Parquet in particular is a good choice for performance. Here are the steps:
- With sqlContext, the big MySQL tables can be loaded and saved as parquet files in HDFS. Design this as a daily job. The code could look like the following:
Please notice that this code is based on Spark 1.3. From Spark 1.4 on, use sqlContext.read instead (a sketch of the 1.4+ API follows the steps below).
```scala
val partitions = (upper_bound - lower_bound) / lines_each_part
var options: HashMap[String, String] = new HashMap
options.put("driver", "com.mysql.jdbc.Driver")
options.put("url", url)
options.put("dbtable", table)
options.put("lowerBound", lower_bound.toString())
options.put("upperBound", upper_bound.toString())
// partitions are based on lower_bound and upper_bound
options.put("numPartitions", partitions.toString())
options.put("partitionColumn", id)
val jdbcDF = sqlContext.load("jdbc", options)
jdbcDF.save(output)
```
- Load the MySQL data back from the HDFS parquet files like the following:

```scala
sqlContext.parquetFile(base_dir + "/" + table).toDF().registerTempTable(table)
```

- Load an HBase table as a DataFrame and register it as a table:

```scala
var config = HBaseConfiguration.create()
config.addResource(new Path(System.getenv("HBASE_HOME") + "/conf/hbase-site.xml"))
try {
  HBaseAdmin.checkHBaseAvailable(config)
  System.out.println("Detected HBase is running")
} catch {
  case e => e.printStackTrace
}
config.set(TableInputFormat.INPUT_TABLE, hbase_table)
config.set(TableInputFormat.SCAN_COLUMN_FAMILY, columnF)
......
sqlContext.createDataFrame(hc.toRowRDD(hc.createPairRDD(jsc, config)), hc.schema())
  .toDF().registerTempTable(table)
```
The helper methods hc.createPairRDD, hc.toRowRDD, and hc.schema() used above are implemented in Java:

```java
// hc.createPairRDD: scan the HBase table into an RDD of (rowkey, Result) pairs
public JavaPairRDD<ImmutableBytesWritable, Result> createPairRDD(JavaSparkContext jsc, Configuration conf) {
    return jsc.newAPIHadoopRDD(conf,
            TableInputFormat.class,
            ImmutableBytesWritable.class,
            Result.class).cache();
}

// hc.toRowRDD: convert each HBase Result into a Spark SQL Row
public JavaRDD<Row> toRowRDD(JavaPairRDD<ImmutableBytesWritable, Result> pairRDD) {
    return pairRDD.map(new Function<Tuple2<ImmutableBytesWritable, Result>, Row>() {
        private static final long serialVersionUID = -4887770613787757684L;

        public Row call(Tuple2<ImmutableBytesWritable, Result> re) throws Exception {
            Result result = re._2();
            // no columns configured: take every cell of the column family
            if (schema.getColumns().length == 0) {
                return getAll(result);
            }
            return get(result);
        }

        // read only the configured columns, defaulting missing cells to "0"
        public Row get(Result result) throws Exception {
            List<Object> values = new ArrayList<Object>();
            for (String col : schema.getColumns()) {
                byte[] b = result.getValue(schema.getFamily().getBytes(), col.getBytes());
                if (b == null) {
                    values.add("0");
                    continue;
                }
                values.add(new String(b));
            }
            return RowFactory.create(values.toArray(new Object[values.size()]));
        }

        // read every cell of the configured column family
        public Row getAll(Result result) throws Exception {
            NavigableMap<byte[], byte[]> map = result.getFamilyMap(schema.getFamily().getBytes());
            List<Object> values = new ArrayList<Object>();
            for (byte[] key : map.keySet()) {
                values.add(new String(map.get(key)));
            }
            return RowFactory.create(values.toArray(new Object[values.size()]));
        }
    });
}

// hc.schema(): build a StructType with one nullable string field per HBase column
public StructType schema() {
    final List<StructField> keyFields = new ArrayList<StructField>();
    for (String fieldName : this.hbase_columns) { // hbase_columns is String[]
        keyFields.add(DataTypes.createStructField(fieldName, DataTypes.StringType, true));
    }
    return DataTypes.createStructType(keyFields);
}
```

- Run the SQL and save the result in HDFS (see the cross-source join sketch after the steps):

```scala
val rdd_parquet = sqlContext.sql(sql)
rdd_parquet.rdd.saveAsTextFile(output)
```
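As noted in the first step, sqlContext.load and DataFrame.save are Spark 1.3 APIs. Here is a minimal sketch of the same JDBC-to-parquet job on Spark 1.4+, assuming the same placeholder variables (url, table, id, lower_bound, upper_bound, partitions, output, base_dir) as in the steps above:

```scala
// Spark 1.4+ equivalent of the Spark 1.3 snippet in the first step.
// All lowercase identifiers are the same placeholders used above.
val jdbcDF = sqlContext.read.format("jdbc")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("url", url)                            // e.g. "jdbc:mysql://host:3306/db"
  .option("dbtable", table)
  .option("partitionColumn", id)                 // numeric column to split the scan on
  .option("lowerBound", lower_bound.toString)
  .option("upperBound", upper_bound.toString)
  .option("numPartitions", partitions.toString)
  .load()

jdbcDF.write.parquet(output)                     // replaces jdbcDF.save(output)

// reading the parquet files back (replaces sqlContext.parquetFile):
sqlContext.read.parquet(base_dir + "/" + table).registerTempTable(table)
```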
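To show the payoff, here is a sketch of a cross-source query once everything is registered. The table names ("orders" from the MySQL parquet dump, "user_profile" from HBase) and columns ("uid", "city") are hypothetical, made up for illustration:

```scala
// Hypothetical schema: "orders" came from the MySQL-to-parquet daily job,
// "user_profile" was registered from HBase as shown above.
val result = sqlContext.sql(
  """SELECT o.uid, COUNT(*) AS order_cnt, MAX(p.city) AS city
    |FROM orders o
    |JOIN user_profile p ON o.uid = p.uid
    |GROUP BY o.uid""".stripMargin)

// Spark executes the join, so MySQL is never touched at query time
result.rdd.saveAsTextFile(output)
```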