How-to: use spark to suport query across mysql tables and hbase tables

来源：互联网发布：无间道天台对白知乎编辑：程序博客网时间：2024/05/17 21:53

It wil be good for data analyst that he just run "big sql" to process tables from mysql, hbase or something else. And one more importance thing is the performance thing. We should avoid running "big sql" directely on mysql, espacially join many tables. Our data analyst once run a bid sql which joined 33 tables, the result is our mysql is run to "death". Another case is very simple, that is "select count(*)" a 3 million lines table. Our data analyst said that it will need more than half an hour.

To resolve this, one good choice is spark whose parquet support and dataframe resolved this problem. Parquet is a good choice for performance consideration. Here is the steps:

With sqlContext, the mysql big tables could be loaded and saved as parquet files in hdfs. Design this as daily job. The code could be like following:
Please notcie that this code is based on spark-1.3. From Spark 1.4, please use sqlContext.read
partitions = ( upper_bound - lower_bound ) / lines_each_part
var options: HashMap[String, String] = new HashMap
options.put( "driver" , "com.mysql.jdbc.Driver" )
options.put( "url" , url )
options.put( "dbtable" , table )
options.put( "lowerBound" , lower_bound .toString())
options.put( "upperBound" , upper_bound .toString())
//partitions are base don lower_bound and upper_bound
options.put( "numPartitions" , partitions .toString())
options.put( "partitionColumn" , id );

val jdbcDF = sqlContext.load( "jdbc" , options )
jdbcDF.save( output)
Load mysql data from hdfs parquet files like following:
sqlContext.parquetFile(base_dir + "/" + table).toDF().registerTempTable(table)
Load hbase table as DF and register as a table:
var config = HBaseConfiguration.create()
config.addResource( new Path(System.getenv("HBASE_HOME" ) + "/conf/hbase-site.xml" ))
try {
HBaseAdmin.checkHBaseAvailable( config)
System. out.println( "Detected HBase is running" )
} catch {
  case e => e .printStackTrace
}
config.set(TableInputFormat. INPUT_TABLE , hbase_table )
config.set(TableInputFormat. SCAN_COLUMN_FAMILY , columnF )
......
sqlContext.createDataFrame( hc .toRowRDD(hc .createPairRDD(jsc, config)),hc.schema()).toDF().registerTempTable(table)

//hc.createPaireRDD:
public JavaPairRDD<ImmutableBytesWritable, Result> createPairRDD (
JavaSparkContext jsc, Configuration conf) {

  return jsc .newAPIHadoopRDD(
   conf,
TableInputFormat. class ,
ImmutableBytesWritable. class ,
Result. class
).cache();
}

//hc.toRowRDD:
public JavaRDD<Row> toRowRDD(JavaPairRDD<ImmutableBytesWritable, Result> pairRDD ) {
  return pairRDD .map( new Function<Tuple2<ImmutableBytesWritable, Result>, Row>() {

  private static final long serialVersionUID = -4887770613787757684L;

  public Row call(Tuple2<ImmutableBytesWritable, Result> re)
   throws Exception {
Result result = re._2();
Row row = null ;
  if (schema .getColumns(). length == 0) {
   row = getAll( result);
} else {
   row = get( result);
}
  return row ;
}
  public Row get(Result result ) throws Exception {
List<Object> values = new ArrayList<Object>();
  for (String col : schema .getColumns()) {
   byte [] b = result .getValue(schema .getFamily().getBytes(), col.getBytes());
   if (b == null) {
  values.add( "0" );
  continue ;
}
   values.add( new String(b ));
}
Row row = RowFactory. create( values.toArray( new Object[values .size()]));
   return row ;
}
  public Row getAll(Result result ) throws Exception {
NavigableMap< byte [], byte []> map = result .getFamilyMap(schema .getFamily().getBytes());
List<Object> values = new ArrayList<Object>();
  for (byte [] key : map .keySet()) {
   values.add( new String(map .get(key )));
}
Row row = RowFactory. create( values.toArray( new Object[values .size()]));
  return row ;
}
});
}
//hc.schema():
public StructType schema() {
   final List<StructField> keyFields = new ArrayList<StructField>();
   for (String fieldName : this.hbase_columns) {//hbase_columns is String[]
  keyFields .add(DataTypes.createStructField( fieldName , DataTypes.StringType , true));
}
   return DataTypes.createStructType( keyFields );
       }
run sql as following and save result in hdfs:
val rdd_parquet = sqlContext .sql(sql )
rdd_parquet.rdd.saveAsTextFile( output)

0 0