Pig: Join Optimization

Source: Internet. Editor: 程序博客网. Time: 2024/05/16 10:08

Specialized Joins

Pig Latin includes three "specialized" joins: replicated joins, skewed joins, and merge joins.

  • Replicated, skewed, and merge joins can be performed using inner joins.
  • Replicated and skewed joins can also be performed using outer joins.

Replicated Joins (similar to Hive MapJoin)

Fragment replicate join is a special type of join that works well if one or more relations are small enough to fit into main memory. In such cases, Pig can perform a very efficient join because all of the Hadoop work is done on the map side. In this type of join the large relation is listed first, followed by one or more small relations. The small relations must be small enough to fit into main memory; if they don't, the process fails and an error is generated.
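The map-side idea described above can be illustrated outside of Pig. The following is a minimal Python sketch, not Pig's implementation: the small relations are loaded into in-memory hash tables, and the large relation is streamed past them once. The function names and sample rows are hypothetical.

```python
from collections import defaultdict


def build_hash_table(rows, key_index):
    """Index a small relation in memory: join key -> list of rows."""
    table = defaultdict(list)
    for row in rows:
        table[row[key_index]].append(row)
    return table


def replicated_join(big_rows, big_key, small_tables):
    """Stream the large relation once and probe each in-memory table.

    small_tables is a list of hash tables built by build_hash_table.
    Yields concatenated rows for keys present in every relation (inner join).
    """
    for big_row in big_rows:
        key = big_row[big_key]
        partial = [big_row]
        for table in small_tables:
            matches = table.get(key, [])
            partial = [left + right for left in partial for right in matches]
            if not partial:
                break  # no match in this small relation, drop the row
        yield from partial


# Hypothetical sample data: one large relation, two small ones.
big = [(1, "a"), (2, "b"), (3, "c"), (1, "d")]
tiny = [(1, "x"), (2, "y")]
mini = [(1, "p"), (1, "q")]

joined = list(replicated_join(
    big, 0,
    [build_hash_table(tiny, 0), build_hash_table(mini, 0)],
))
```

Because only the small relations are held in memory, the cost of the sketch is one pass over the large relation, which mirrors why the small relations must fit into memory.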

Usage

Perform a replicated join with the USING clause (see inner joins and outer joins). In this example, a large relation is joined with two smaller relations. Note that the large relation comes first, followed by the smaller relations, and that all small relations together must fit into main memory; otherwise an error is generated.

big = LOAD 'big_data' AS (b1,b2,b3);
tiny = LOAD 'tiny_data' AS (t1,t2,t3);
mini = LOAD 'mini_data' AS (m1,m2,m3);
C = JOIN big BY b1, tiny BY t1, mini BY m1 USING 'replicated';

Conditions

Fragment replicate joins are experimental; we don't have a strong sense of how small the small relation must be to fit into memory. In our tests with a simple query that involves just a JOIN, a relation of up to 100 MB can be used if the process overall gets 1 GB of memory. Please share your observations and experience with us.

Skewed Joins

Parallel joins are vulnerable to the presence of skew in the underlying data. If the underlying data is sufficiently skewed, load imbalances will swamp any of the parallelism gains. In order to counteract this problem, skewed join computes a histogram of the key space and uses this data to allocate reducers for a given key. Skewed join does not place a restriction on the size of the input keys. It accomplishes this by splitting the left input on the join predicate and streaming the right input. The left input is sampled to create the histogram.
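To make the histogram idea concrete, here is a small Python sketch under simplifying assumptions. It is not Pig's code: the function names, the `max_rows_per_reducer` threshold, and the CRC-based default partitioner are all hypothetical. Sampled left-input keys are counted, heavy keys are given several reducers, left rows of a heavy key are spread over those reducers, and right rows of that key are sent to all of them.

```python
import math
import zlib
from collections import Counter


def default_partition(key, num_reducers):
    """Deterministic stand-in for an ordinary hash partitioner."""
    return zlib.crc32(repr(key).encode()) % num_reducers


def allocate_reducers(sampled_keys, num_reducers, max_rows_per_reducer):
    """Return {key: [reducer ids]} for keys too heavy for a single reducer."""
    histogram = Counter(sampled_keys)
    allocation = {}
    next_reducer = 0
    for key, n in histogram.most_common():
        needed = math.ceil(n / max_rows_per_reducer)
        if needed <= 1:
            continue  # light key: ordinary partitioning is enough
        needed = min(needed, num_reducers)
        allocation[key] = [(next_reducer + i) % num_reducers
                           for i in range(needed)]
        next_reducer = (next_reducer + needed) % num_reducers
    return allocation


def route_left(key, row_number, allocation, num_reducers):
    """Left rows of a heavy key are spread round-robin over its reducers."""
    if key in allocation:
        targets = allocation[key]
        return targets[row_number % len(targets)]
    return default_partition(key, num_reducers)


def route_right(key, allocation, num_reducers):
    """Right rows of a heavy key are copied to every reducer holding it."""
    if key in allocation:
        return list(allocation[key])
    return [default_partition(key, num_reducers)]


# Hypothetical sample of the left input: key 7 dominates.
sample = [7] * 90 + [1, 2, 3, 4, 5] * 2
allocation = allocate_reducers(sample, num_reducers=4, max_rows_per_reducer=30)
```

In this sketch only key 7 is treated as skewed; every other key still goes to exactly one reducer, so the extra copying of right-input rows is limited to the heavy keys.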

Skewed join can be used when the underlying data is sufficiently skewed and you need finer control over the allocation of reducers to counteract the skew. It should also be used when the data associated with a given key is too large to fit in memory.

Usage

Perform a skewed join with the USING clause (see inner joins and outer joins).

big = LOAD 'big_data' AS (b1,b2,b3);
massive = LOAD 'massive_data' AS (m1,m2,m3);
C = JOIN big BY b1, massive BY m1 USING 'skewed';

Conditions

Skewed join will only work under these conditions:

  • Skewed join works only with two-table joins. Currently we do not support more than two tables for skewed join. Specifying three-way (or more) joins will fail validation. For such joins, we rely on you to break them up into two-way joins.
  • The pig.skewedjoin.reduce.memusage Java parameter specifies the fraction of heap available for the reducer to perform the join. A low fraction forces Pig to use more reducers but increases copying cost. We have seen good performance when we set this value in the range 0.1 - 0.4. However, note that this is hardly an accurate range. Its value depends on the amount of heap available for the operation, the number of columns in the input, and the skew. An appropriate value is best obtained by conducting experiments to achieve good performance. The default value is 0.5.
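The trade-off behind pig.skewedjoin.reduce.memusage can be pictured with a rough back-of-the-envelope model. The Python sketch below is an assumption for illustration only, not Pig's actual formula: it supposes that a reducer can hold `heap * memusage / average tuple size` tuples of one key, and that a key needs enough reducers to cover all of its tuples. The function name and the sample numbers are hypothetical.

```python
import math


def reducers_for_key(key_tuple_count, heap_bytes, memusage, avg_tuple_bytes):
    """Estimate how many reducers one heavy key needs (illustrative model)."""
    # Tuples of this key that one reducer can keep in its share of the heap.
    tuples_in_memory = max(1, int(heap_bytes * memusage / avg_tuple_bytes))
    return math.ceil(key_tuple_count / tuples_in_memory)


heap = 1024 * 1024 * 1024  # assume 1 GiB of reducer heap
for fraction in (0.5, 0.3, 0.1):
    # A key with ten million tuples of about 200 bytes each (made-up numbers).
    print(fraction, reducers_for_key(10_000_000, heap, fraction, 200))
```

Under this model, lowering the fraction raises the reducer count for the same key, which matches the statement that a low fraction uses more reducers at the price of more copying.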

Merge Joins

Often user data is stored such that both inputs are already sorted on the join key. In this case, it is possible to join the data in the map phase of a MapReduce job. This provides a significant performance improvement compared to passing all of the data through unneeded sort and shuffle phases.

Pig has implemented a merge join algorithm, or sort-merge join, although in this case the sort is already assumed to have been done (see the Conditions, below). Pig implements the merge join algorithm by selecting the left input of the join to be the input file for the map phase, and the right input of the join to be the side file. It then samples records from the right input to build an index that contains, for each sampled record, the key(s), the filename, and the offset into the file at which the record begins. This sampling is done in an initial map-only job. A second MapReduce job is then initiated, with the left input as its input. Each map uses the index to seek to the appropriate record in the right input and begin doing the join.
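The index-and-seek procedure described above can be sketched in a few lines of Python. This is a simplified illustration, not Pig's implementation: the right input is an in-memory list rather than a file, an "offset" is a list position, and the function names and sample rows are hypothetical.

```python
import bisect


def build_index(right_rows, key_index, every):
    """Sample every `every`-th record of the sorted right input as (key, offset)."""
    return [(right_rows[i][key_index], i)
            for i in range(0, len(right_rows), every)]


def seek_offset(index, first_left_key):
    """Offset in the right input from which a map task can start scanning.

    Uses the last sampled entry whose key is strictly smaller than the first
    left key, so that no matching right record is skipped.
    """
    keys = [k for k, _ in index]
    pos = bisect.bisect_left(keys, first_left_key) - 1
    return index[pos][1] if pos >= 0 else 0


def merge_join_split(left_split, right_rows, index, left_key=0, right_key=0):
    """Join one sorted split of the left input against the sorted right input."""
    if not left_split:
        return []
    out = []
    r = seek_offset(index, left_split[0][left_key])
    i = 0
    while i < len(left_split) and r < len(right_rows):
        lk = left_split[i][left_key]
        rk = right_rows[r][right_key]
        if lk < rk:
            i += 1
        elif lk > rk:
            r += 1
        else:
            # Find the run of right rows sharing this key.
            run_end = r
            while run_end < len(right_rows) and right_rows[run_end][right_key] == lk:
                run_end += 1
            # Pair every left row with this key against the whole run.
            while i < len(left_split) and left_split[i][left_key] == lk:
                out.extend(left_split[i] + right_rows[j] for j in range(r, run_end))
                i += 1
            r = run_end
    return out


# Hypothetical sorted inputs.
right = [(1, "r1"), (2, "r2"), (2, "r3"), (4, "r4"), (5, "r5"), (7, "r7")]
index = build_index(right, 0, every=2)
split = [(4, "l4"), (5, "l5"), (5, "l5b"), (6, "l6")]
result = merge_join_split(split, right, index)
```

The sketch only works because both inputs are already sorted on the join key; with unsorted input the seek would skip matching records.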

Usage

Perform a merge join with the USING clause (see inner joins).

C = JOIN A BY a1, B BY b1 USING 'merge';

Conditions

Merge join will only work under these conditions:

  • Both inputs are sorted in *ascending* order of join keys. If an input consists of many files, there should be a total ordering across the files in the *ascending order of file name*. So for example if one of the inputs to the join is a directory called input1 with files a and b under it, the data should be sorted in ascending order of join key when read starting at a and ending in b. Likewise if an input directory has part files part-00000, part-00001, part-00002 and part-00003, the data should be sorted if the files are read in the sequence part-00000, part-00001, part-00002 and part-00003.
  • The merge join has exactly two inputs.
  • The loadfunc for the right input of the join should implement the OrderedLoadFunc interface (PigStorage does implement the OrderedLoadFunc interface).
  • Only inner joins are supported.
  • Between the load of the sorted input and the merge join statement there can only be filter statements and foreach statements, where each foreach statement should meet the following conditions:
    • There should be no UDFs in the foreach statement.
    • The foreach statement should not change the position of the join keys.
    • There should be no transformation on the join keys that changes the sort order.
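The first condition, a total ascending order across part files read in file-name order, is easy to get wrong when each file is sorted on its own. The following Python sketch shows one way to check it before running a merge join. It is only an illustration: a dict of in-memory rows stands in for a directory of part files, plain lexicographic sorting of names is assumed, and the function name and sample data are hypothetical.

```python
def first_unsorted_position(files, key_index=0):
    """Return (file_name, row_number) of the first row breaking ascending order.

    files maps a file name to its rows. Files are visited in ascending order
    of file name. Returns None when the keys are globally ascending.
    """
    previous = None
    for name in sorted(files):
        for row_number, row in enumerate(files[name]):
            key = row[key_index]
            if previous is not None and key < previous:
                return (name, row_number)
            previous = key
    return None


# Globally sorted across files (dict order does not matter, names are sorted).
good = {"part-00001": [(5,), (9,)], "part-00000": [(1,), (5,)]}

# Each file is sorted on its own, but part-00001 starts below the end of part-00000.
bad = {"part-00000": [(1,), (8,)], "part-00001": [(5,), (9,)]}

good_check = first_unsorted_position(good)
bad_check = first_unsorted_position(bad)
```

The second data set is the case to watch for: every part file passes a per-file check, yet the input as a whole violates the merge join condition.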

For optimal performance, each part file of the left (sorted) input of the join should have a size of at least one HDFS block (for example, if the HDFS block size is 128 MB, each part file should be at least 128 MB). If the total input size (including all part files) is greater than the block size, then the part files should be uniform in size (without large skews in sizes). The main idea is to eliminate skew in the amount of input the final map job performing the merge join will process.

In local mode, merge join will revert to regular join.
