Spark Learning 20: Understanding sample in Spark

Source: Internet  Editor: 程序博客网  Time: 2024/05/21 08:36

1. Syntax (Java):

JavaPairRDD<K,V> sample(boolean withReplacement, double fraction)
JavaPairRDD<K,V> sample(boolean withReplacement, double fraction, long seed)

2. Description:

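sample draws a random sample from an RDD. When withReplacement is true, sampling is done with replacement, so the same element can be drawn more than once; false means sampling without replacement, so each element appears at most once. fraction is the sampling fraction: the expected size of the sample relative to the RDD, not an exact count. seed is the random number seed, for example the current timestamp.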
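To make the roles of fraction and seed concrete, here is a minimal plain-Java sketch of sampling without replacement. It does not call Spark: sampleWithoutReplacement is a hypothetical helper written for illustration, and it assumes the usual Bernoulli scheme in which each element is kept independently with probability fraction. Under that assumption a fixed seed always reproduces the same sample, and no element can appear twice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

public class BernoulliSampleSketch {

    // Hypothetical helper, not part of Spark: keeps each element independently
    // with probability `fraction`, so an element appears at most once.
    static <T> List<T> sampleWithoutReplacement(List<T> data, double fraction, long seed) {
        Random rng = new Random(seed);
        List<T> out = new ArrayList<>();
        for (T item : data) {
            if (rng.nextDouble() < fraction) {
                out.add(item);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> datas = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        List<Integer> first = sampleWithoutReplacement(datas, 0.5, 42L);
        List<Integer> second = sampleWithoutReplacement(datas, 0.5, 42L);
        System.out.println(first);
        // The same seed reproduces the same sample.
        System.out.println(first.equals(second));
    }
}
```

Passing System.currentTimeMillis() as the seed, as the demo in this post does, gives a different sample on each run; a constant seed is the better choice when a result has to be repeatable.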

3. Demo program

package mysample;

import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

public class Sample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("lcc_java_read_hbase_register_to_table")
                .master("local[*]")
                .getOrCreate();
        JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());

        List<Integer> datas = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        JavaRDD<Integer> dataRDD = sc.parallelize(datas);

        // Without replacement: each element appears at most once.
        JavaRDD<Integer> sampleRDD = dataRDD.sample(false, 0.5, System.currentTimeMillis());
        System.out.println("==========sampleRDD=====1==========");
        sampleRDD.foreach(v -> System.out.println(v));

        // With replacement: the same element may appear several times.
        JavaRDD<Integer> sampleRDD2 = dataRDD.sample(true, 0.5, System.currentTimeMillis());
        System.out.println("==========sampleRDD=====2==========");
        sampleRDD2.foreach(v -> System.out.println(v));

        sc.close();
    }
}

Output

==========sampleRDD=====1==========
5
8
3
==========sampleRDD=====2==========
7
7
5
8

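Summary: because the seed is the current time, each run prints a different result, much like a random function in Java. The scenario resembles drawing red and white balls from a black box, and there are two ways to draw:
Put the ball back after drawing it, so it can be drawn again and the same element may appear more than once: dataRDD.sample(true, 0.5, System.currentTimeMillis());
Keep the ball out after drawing it, so the same element never appears twice: dataRDD.sample(false, 0.5, System.currentTimeMillis());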
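The "put the ball back" case (withReplacement = true) can be sketched in plain Java as well. The sketch below is an illustration rather than Spark's implementation: sampleWithReplacement and poisson are hypothetical helpers, and the sketch assumes the common approach of drawing, for every element, a Poisson-distributed number of copies whose mean is fraction. That is why one element can show up two or more times in the result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

public class PoissonSampleSketch {

    // Knuth's multiplication method for a Poisson(mean) draw; fine for small means.
    static int poisson(double mean, Random rng) {
        double limit = Math.exp(-mean);
        double product = rng.nextDouble();
        int count = 0;
        while (product > limit) {
            count++;
            product *= rng.nextDouble();
        }
        return count;
    }

    // Hypothetical helper, not part of Spark: emits each element 0, 1, 2, ...
    // times, where the number of copies is Poisson-distributed with mean `fraction`.
    static <T> List<T> sampleWithReplacement(List<T> data, double fraction, long seed) {
        Random rng = new Random(seed);
        List<T> out = new ArrayList<>();
        for (T item : data) {
            int copies = poisson(fraction, rng);
            for (int i = 0; i < copies; i++) {
                out.add(item);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> datas = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        // Print a few samples; repeated values are possible but not guaranteed.
        Random seeds = new Random(2024L);
        for (int run = 1; run <= 3; run++) {
            System.out.println("run " + run + ": "
                    + sampleWithReplacement(datas, 0.5, seeds.nextLong()));
        }
    }
}
```

With this scheme the sample can also be larger than fraction times the input size, since the number of copies per element has no upper bound.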

4. Testing the second parameter

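JavaRDD<Integer> sampleRDD = dataRDD.sample(false, 0.1, System.currentTimeMillis());
JavaRDD<Integer> sampleRDD2 = dataRDD.sample(true, 0.1, System.currentTimeMillis());

First run
==========sampleRDD=====1==========
9
==========sampleRDD=====2==========
Second run
==========sampleRDD=====1==========
9
10
==========sampleRDD=====2==========
1
10

JavaRDD<Integer> sampleRDD = dataRDD.sample(false, 0.6, System.currentTimeMillis());
JavaRDD<Integer> sampleRDD2 = dataRDD.sample(true, 0.6, System.currentTimeMillis());

First run
==========sampleRDD=====1==========
6
8
9
10
1
3
4
Second run
==========sampleRDD=====2==========
2
4
4
6
7
8
10
1
3
4
5
==========sampleRDD=====2==========
1
2
4
5
6
8
8
8
9

With fraction 0.1 a sample can even be empty (sampleRDD 2 in the first run), while 0.6 returns noticeably more elements. In both cases the number of elements changes from run to run, because fraction sets the expected size of the sample rather than an exact count.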
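The run-to-run variation in sample size can be checked with a small plain-Java simulation. This is a sketch under the same assumption as before (without replacement, each of the 10 elements is kept independently with probability fraction); sampleSize and summarize are hypothetical helpers, and no Spark call is involved. Under that assumption the average size should come out near 10 × fraction, and for fraction 0.1 roughly 0.9^10 ≈ 35% of the samples should be empty.

```java
import java.util.Random;

public class SampleSizeSketch {

    // Hypothetical helper: size of one sample of n elements when each element
    // is kept independently with probability `fraction`.
    static int sampleSize(int n, double fraction, Random rng) {
        int kept = 0;
        for (int i = 0; i < n; i++) {
            if (rng.nextDouble() < fraction) {
                kept++;
            }
        }
        return kept;
    }

    // Returns { average sample size, share of empty samples } over `trials` samples.
    static double[] summarize(int n, double fraction, int trials, long seed) {
        Random rng = new Random(seed); // one generator shared by all trials
        long total = 0;
        int empty = 0;
        for (int t = 0; t < trials; t++) {
            int size = sampleSize(n, fraction, rng);
            total += size;
            if (size == 0) {
                empty++;
            }
        }
        return new double[] { (double) total / trials, (double) empty / trials };
    }

    public static void main(String[] args) {
        for (double fraction : new double[] { 0.1, 0.6 }) {
            double[] stats = summarize(10, fraction, 100_000, 7L);
            System.out.printf("fraction=%.1f average size=%.2f empty share=%.3f%n",
                    fraction, stats[0], stats[1]);
        }
    }
}
```

When a fixed number of elements is required instead of an expected proportion, Spark's takeSample(withReplacement, num, seed) action is the usual alternative; it returns the result to the driver as a list, so it suits small samples only.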