Second Spark SQL analysis of the 寻医问药 (Xunyiwenyao) medical Q&A data


1. Crawler and data import


miaofu@master:~/healthQA$ ls -l -h
总用量 3.7G
-rw-r--r-- 1 miaofu miaofu  80M  9月 24 13:22 2016-05-01content.txt
-rw-r--r-- 1 miaofu miaofu  90M  9月 24 13:22 2016-05-02content.txt
-rw-r--r-- 1 miaofu miaofu  82M  9月 24 13:22 2016-05-03content.txt
-rw-r--r-- 1 miaofu miaofu  96M  9月 24 13:22 2016-05-04content.txt
-rw-r--r-- 1 miaofu miaofu 102M  9月 24 13:22 2016-05-05content.txt
-rw-r--r-- 1 miaofu miaofu 101M  9月 24 13:22 2016-05-06content.txt
-rw-r--r-- 1 miaofu miaofu  97M  9月 24 13:22 2016-05-07content.txt
-rw-r--r-- 1 miaofu miaofu  96M  9月 24 13:22 2016-05-08content.txt
-rw-r--r-- 1 miaofu miaofu  99M  9月 24 13:22 2016-05-09content.txt
-rw-r--r-- 1 miaofu miaofu  97M  9月 24 13:22 2016-05-10content.txt
-rw-r--r-- 1 miaofu miaofu  96M  9月 24 13:22 2016-05-11content.txt
-rw-r--r-- 1 miaofu miaofu  96M  9月 24 13:22 2016-05-12content.txt
-rw-r--r-- 1 miaofu miaofu  94M  9月 24 13:22 2016-05-13content.txt
-rw-r--r-- 1 miaofu miaofu  97M  9月 24 13:22 2016-05-14content.txt
-rw-r--r-- 1 miaofu miaofu  94M  9月 24 13:22 2016-05-15content.txt
-rw-r--r-- 1 miaofu miaofu  93M  9月 24 13:22 2016-05-16content.txt
-rw-r--r-- 1 miaofu miaofu  96M  9月 24 13:22 2016-05-17content.txt
-rw-r--r-- 1 miaofu miaofu 111M  9月 24 13:22 2016-05-18content.txt
-rw-r--r-- 1 miaofu miaofu  95M  9月 24 13:22 2016-05-19content.txt
-rw-r--r-- 1 miaofu miaofu  92M  9月 24 13:22 2016-05-20content.txt
-rw-r--r-- 1 miaofu miaofu  91M  9月 24 13:22 2016-05-21content.txt
-rw-r--r-- 1 miaofu miaofu  93M  9月 24 13:22 2016-05-22content.txt
-rw-r--r-- 1 miaofu miaofu  95M  9月 24 13:22 2016-05-23content.txt
-rw-r--r-- 1 miaofu miaofu  90M  9月 24 13:22 2016-05-24content.txt
-rw-r--r-- 1 miaofu miaofu  95M  9月 24 13:22 2016-05-25content.txt
-rw-r--r-- 1 miaofu miaofu  92M  9月 24 13:22 2016-05-26content.txt
-rw-r--r-- 1 miaofu miaofu  92M  9月 24 13:22 2016-05-27content.txt
-rw-r--r-- 1 miaofu miaofu  89M  9月 24 13:22 2016-05-28content.txt
-rw-r--r-- 1 miaofu miaofu  86M  9月 24 13:22 2016-05-29content.txt
-rw-r--r-- 1 miaofu miaofu  90M  9月 24 13:22 2016-05-30content.txt
-rw-r--r-- 1 miaofu miaofu  84M  9月 24 13:22 2016-05-31content.txt
-rw-r--r-- 1 miaofu miaofu  90M  9月 24 13:22 2016-06-01content.txt
-rw-r--r-- 1 miaofu miaofu 120M  9月 24 13:22 2016-06-02content.txt
-rw-r--r-- 1 miaofu miaofu 108M  9月 24 13:22 2016-06-03content.txt
-rw-r--r-- 1 miaofu miaofu 117M  9月 24 13:23 2016-06-04content.txt
-rw-rw-r-- 1 miaofu miaofu 115M  9月 24 13:23 2016-06-05content.txt
-rw-rw-r-- 1 miaofu miaofu 118M  9月 24 13:23 2016-06-06content.txt
-rw-rw-r-- 1 miaofu miaofu 100M  9月 24 13:23 2016-06-07content.txt
-rw-rw-r-- 1 miaofu miaofu  94M  9月 24 13:23 2016-06-08content.txt
-rw-rw-r-- 1 miaofu miaofu  12M  9月 24 13:23 2016-06-09content.txt
miaofu@master:~/healthQA$ hadoop fs -mkdir /user
miaofu@master:~/healthQA$ hadoop fs -mkdir /user/miaofu/
miaofu@master:~/healthQA$ hadoop fs -mkdir /user/miaofu/healthQA
miaofu@master:~/healthQA$ hadoop fs -put * healthQA

The import took about 2 minutes. Once it had finished, I checked the state of the HDFS file system:

miaofu@slave02:~/hadoop-2.6.4$ hadoop dfsadmin -report
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

Configured Capacity: 320759169024 (298.73 GB)
Present Capacity: 223364142776 (208.02 GB)
DFS Remaining: 215425167360 (200.63 GB)
DFS Used: 7938975416 (7.39 GB)
DFS Used%: 3.55%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0

-------------------------------------------------
Live datanodes (4):

Name: 192.168.202.107:50010 (192.168.202.107)
Hostname: slave04
Decommission Status : Normal
Configured Capacity: 80189792256 (74.68 GB)
DFS Used: 1313626737 (1.22 GB)
Non DFS Used: 22749664655 (21.19 GB)
DFS Remaining: 56126500864 (52.27 GB)
DFS Used%: 1.64%
DFS Remaining%: 69.99%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Sat Sep 24 13:28:04 CST 2016

Name: 192.168.202.104:50010 (master)
Hostname: master
Decommission Status : Normal
Configured Capacity: 80189792256 (74.68 GB)
DFS Used: 3969463132 (3.70 GB)
Non DFS Used: 29238848676 (27.23 GB)
DFS Remaining: 46981480448 (43.75 GB)
DFS Used%: 4.95%
DFS Remaining%: 58.59%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Sat Sep 24 13:28:05 CST 2016

Name: 192.168.202.106:50010 (slave03)
Hostname: slave03
Decommission Status : Normal
Configured Capacity: 80189792256 (74.68 GB)
DFS Used: 1549846117 (1.44 GB)
Non DFS Used: 22752852379 (21.19 GB)
DFS Remaining: 55887093760 (52.05 GB)
DFS Used%: 1.93%
DFS Remaining%: 69.69%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Sat Sep 24 13:28:03 CST 2016

Name: 192.168.202.105:50010 (slave02)
Hostname: slave02
Decommission Status : Normal
Configured Capacity: 80189792256 (74.68 GB)
DFS Used: 1106039430 (1.03 GB)
Non DFS Used: 22653660538 (21.10 GB)
DFS Remaining: 56430092288 (52.55 GB)
DFS Used%: 1.38%
DFS Remaining%: 70.37%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Sat Sep 24 13:28:05 CST 2016
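
As a quick sanity check on the report, the per-node "DFS Used" figures can be added up and compared with the cluster-wide total. The sketch below uses only byte counts copied from the report; the variable names are made up for illustration. The result that master holds just under half of the stored bytes would fit a replication factor of 2, with the first replica written on the node where `hadoop fs -put` was run. The replication setting is not shown anywhere in the output, so that reading is an assumption rather than something the report confirms.

```shell
#!/usr/bin/env bash
# Per-datanode "DFS Used" values in bytes, copied from the dfsadmin report.
# Variable names are illustrative only.
slave04=1313626737
master=3969463132
slave03=1549846117
slave02=1106039430

# Cluster-wide "DFS Used" in bytes, from the summary at the top of the report.
reported_total=7938975416

# Add up the four datanodes (64-bit shell arithmetic).
node_sum=$(( slave04 + master + slave03 + slave02 ))
echo "sum of per-node DFS Used: ${node_sum} bytes"

if [ "${node_sum}" -eq "${reported_total}" ]; then
  echo "per-node figures add up to the cluster total"
else
  echo "mismatch: cluster total is ${reported_total} bytes"
fi

# Share of the stored bytes held by master, in percent (integer division, rounded down).
master_share=$(( master * 100 / reported_total ))
echo "master holds about ${master_share}% of the stored bytes"
```

The 3.70 GB stored on master is also close to the 3.7G that `ls -l -h` reported for the local directory, which is what one full copy of the data would look like.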


2. Querying the distributed unstructured database with Spark SQL


scala> val in = sqlContext.jsonFile("hdfs:///user/miaofu/healthQA")
warning: there were 1 deprecation warning(s); re-run with -deprecation for details
16/09/24 14:12:30 INFO JSONRelation: Listing hdfs://master:8020/user/miaofu/healthQA on driver
16/09/24 14:12:32 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 234.9 KB, free 234.9 KB)
16/09/24 14:12:32 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 19.9 KB, free 254.7 KB)
16/09/24 14:12:32 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.202.104:51148 (size: 19.9 KB, free: 511.5 MB)
16/09/24 14:12:32 INFO SparkContext: Created broadcast 0 from jsonFile at <console>:25
16/09/24 14:12:33 INFO FileInputFormat: Total input paths to process : 40
16/09/24 14:12:34 INFO SparkContext: Starting job: jsonFile at <console>:25
16/09/24 14:12:34 INFO DAGScheduler: Registering RDD 4 (jsonFile at <console>:25)
16/09/24 14:12:34 INFO DAGScheduler: Got job 0 (jsonFile at <console>:25) with 5 output partitions
16/09/24 14:12:34 INFO DAGScheduler: Final stage: ResultStage 1 (jsonFile at <console>:25)
16/09/24 14:12:34 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
16/09/24 14:12:34 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)
16/09/24 14:12:34 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[4] at jsonFile at <console>:25), which has no missing parents
16/09/24 14:12:34 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 5.4 KB, free 260.1 KB)
16/09/24 14:12:34 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 3.0 KB, free 263.1 KB)
16/09/24 14:12:34 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.202.104:51148 (size: 3.0 KB, free: 511.5 MB)
16/09/24 14:12:34 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
16/09/24 14:12:34 INFO DAGScheduler: Submitting 40 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[4] at jsonFile at <console>:25)
16/09/24 14:12:34 INFO YarnScheduler: Adding task set 0.0 with 40 tasks
16/09/24 14:12:34 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, master, partition 0,NODE_LOCAL, 2152 bytes)
16/09/24 14:12:34 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, slave03, partition 1,NODE_LOCAL, 2152 bytes)
16/09/24 14:12:35 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on slave03:57242 (size: 3.0 KB, free: 1247.6 MB)
16/09/24 14:12:35 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on slave03:57242 (size: 19.9 KB, free: 1247.6 MB)
16/09/24 14:12:40 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on master:47865 (size: 3.0 KB, free: 1247.6 MB)
16/09/24 14:12:42 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on master:47865 (size: 19.9 KB, free: 1247.6 MB)
16/09/24 14:12:45 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, slave03, partition 2,NODE_LOCAL, 2152 bytes)
16/09/24 14:12:46 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 11416 ms on slave03 (1/40)
16/09/24 14:12:47 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, slave03, partition 3,NODE_LOCAL, 2152 bytes)
16/09/24 14:12:47 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 2063 ms on slave03 (2/40)
16/09/24 14:12:49 INFO TaskSetManager: Starting task 8.0 in stage 0.0 (TID 4, slave03, partition 8,NODE_LOCAL, 2152 bytes)
16/09/24 14:12:49 INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 1470 ms on slave03 (3/40)
16/09/24 14:12:50 INFO TaskSetManager: Starting task 11.0 in stage 0.0 (TID 5, slave03, partition 11,NODE_LOCAL, 2152 bytes)
16/09/24 14:12:50 INFO TaskSetManager: Finished task 8.0 in stage 0.0 (TID 4) in 1485 ms on slave03 (4/40)
16/09/24 14:12:52 INFO TaskSetManager: Starting task 15.0 in stage 0.0 (TID 6, slave03, partition 15,NODE_LOCAL, 2152 bytes)
16/09/24 14:12:52 INFO TaskSetManager: Finished task 11.0 in stage 0.0 (TID 5) in 1632 ms on slave03 (5/40)
16/09/24 14:12:53 INFO TaskSetManager: Starting task 16.0 in stage 0.0 (TID 7, slave03, partition 16,NODE_LOCAL, 2152 bytes)
16/09/24 14:12:53 INFO TaskSetManager: Finished task 15.0 in stage 0.0 (TID 6) in 1328 ms on slave03 (6/40)
16/09/24 14:12:57 INFO TaskSetManager: Starting task 17.0 in stage 0.0 (TID 8, slave03, partition 17,NODE_LOCAL, 2152 bytes)
16/09/24 14:12:57 INFO TaskSetManager: Finished task 16.0 in stage 0.0 (TID 7) in 3248 ms on slave03 (7/40)
16/09/24 14:12:58 INFO TaskSetManager: Starting task 21.0 in stage 0.0 (TID 9, slave03, partition 21,NODE_LOCAL, 2152 bytes)
16/09/24 14:12:58 INFO TaskSetManager: Finished task 17.0 in stage 0.0 (TID 8) in 1687 ms on slave03 (8/40)
16/09/24 14:12:59 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID 10, master, partition 4,NODE_LOCAL, 2152 bytes)
16/09/24 14:12:59 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 24625 ms on master (9/40)
16/09/24 14:13:01 INFO TaskSetManager: Starting task 25.0 in stage 0.0 (TID 11, slave03, partition 25,NODE_LOCAL, 2152 bytes)
16/09/24 14:13:01 INFO TaskSetManager: Finished task 21.0 in stage 0.0 (TID 9) in 2319 ms on slave03 (10/40)
16/09/24 14:13:02 INFO TaskSetManager: Starting task 5.0 in stage 0.0 (TID 12, master, partition 5,NODE_LOCAL, 2152 bytes)
16/09/24 14:13:02 INFO TaskSetManager: Finished task 4.0 in stage 0.0 (TID 10) in 3168 ms on master (11/40)
16/09/24 14:13:02 INFO TaskSetManager: Starting task 29.0 in stage 0.0 (TID 13, slave03, partition 29,NODE_LOCAL, 2152 bytes)
16/09/24 14:13:02 INFO TaskSetManager: Finished task 25.0 in stage 0.0 (TID 11) in 1511 ms on slave03 (12/40)
16/09/24 14:13:04 INFO TaskSetManager: Starting task 6.0 in stage 0.0 (TID 14, master, partition 6,NODE_LOCAL, 2152 bytes)
16/09/24 14:13:04 INFO TaskSetManager: Finished task 5.0 in stage 0.0 (TID 12) in 2434 ms on master (13/40)
16/09/24 14:13:05 INFO TaskSetManager: Starting task 31.0 in stage 0.0 (TID 15, slave03, partition 31,NODE_LOCAL, 2152 bytes)
16/09/24 14:13:05 INFO TaskSetManager: Finished task 29.0 in stage 0.0 (TID 13) in 2625 ms on slave03 (14/40)
16/09/24 14:13:06 INFO TaskSetManager: Starting task 32.0 in stage 0.0 (TID 16, slave03, partition 32,NODE_LOCAL, 2152 bytes)
16/09/24 14:13:06 INFO TaskSetManager: Finished task 31.0 in stage 0.0 (TID 15) in 1786 ms on slave03 (15/40)
16/09/24 14:13:07 INFO TaskSetManager: Starting task 7.0 in stage 0.0 (TID 17, master, partition 7,NODE_LOCAL, 2152 bytes)
16/09/24 14:13:07 INFO TaskSetManager: Finished task 6.0 in stage 0.0 (TID 14) in 2640 ms on master (16/40)
16/09/24 14:13:10 INFO TaskSetManager: Starting task 9.0 in stage 0.0 (TID 18, master, partition 9,NODE_LOCAL, 2152 bytes)
16/09/24 14:13:10 INFO TaskSetManager: Finished task 7.0 in stage 0.0 (TID 17) in 3121 ms on master (17/40)
16/09/24 14:13:10 INFO TaskSetManager: Starting task 33.0 in stage 0.0 (TID 19, slave03, partition 33,NODE_LOCAL, 2152 bytes)
16/09/24 14:13:10 INFO TaskSetManager: Finished task 32.0 in stage 0.0 (TID 16) in 3807 ms on slave03 (18/40)
16/09/24 14:13:12 INFO TaskSetManager: Starting task 10.0 in stage 0.0 (TID 20, master, partition 10,NODE_LOCAL, 2152 bytes)
16/09/24 14:13:12 INFO TaskSetManager: Finished task 9.0 in stage 0.0 (TID 18) in 1617 ms on master (19/40)
16/09/24 14:13:13 INFO TaskSetManager: Starting task 34.0 in stage 0.0 (TID 21, slave03, partition 34,NODE_LOCAL, 2152 bytes)
16/09/24 14:13:13 INFO TaskSetManager: Finished task 33.0 in stage 0.0 (TID 19) in 2640 ms on slave03 (20/40)
16/09/24 14:13:13 INFO TaskSetManager: Starting task 12.0 in stage 0.0 (TID 22, master, partition 12,NODE_LOCAL, 2152 bytes)
16/09/24 14:13:13 INFO TaskSetManager: Finished task 10.0 in stage 0.0 (TID 20) in 1635 ms on master (21/40)
16/09/24 14:13:15 INFO TaskSetManager: Finished task 34.0 in stage 0.0 (TID 21) in 2263 ms on slave03 (22/40)
16/09/24 14:13:15 INFO TaskSetManager: Starting task 13.0 in stage 0.0 (TID 23, master, partition 13,NODE_LOCAL, 2152 bytes)
16/09/24 14:13:15 INFO TaskSetManager: Finished task 12.0 in stage 0.0 (TID 22) in 2107 ms on master (23/40)
16/09/24 14:13:17 INFO TaskSetManager: Starting task 14.0 in stage 0.0 (TID 24, master, partition 14,NODE_LOCAL, 2152 bytes)
16/09/24 14:13:17 INFO TaskSetManager: Finished task 13.0 in stage 0.0 (TID 23) in 1695 ms on master (24/40)
16/09/24 14:13:19 INFO TaskSetManager: Starting task 18.0 in stage 0.0 (TID 25, master, partition 18,NODE_LOCAL, 2152 bytes)
16/09/24 14:13:19 INFO TaskSetManager: Finished task 14.0 in stage 0.0 (TID 24) in 1585 ms on master (25/40)
16/09/24 14:13:20 INFO TaskSetManager: Starting task 19.0 in stage 0.0 (TID 26, master, partition 19,NODE_LOCAL, 2152 bytes)
16/09/24 14:13:20 INFO TaskSetManager: Finished task 18.0 in stage 0.0 (TID 25) in 1401 ms on master (26/40)
16/09/24 14:13:22 INFO TaskSetManager: Starting task 20.0 in stage 0.0 (TID 27, master, partition 20,NODE_LOCAL, 2152 bytes)
16/09/24 14:13:22 INFO TaskSetManager: Finished task 19.0 in stage 0.0 (TID 26) in 2136 ms on master (27/40)
16/09/24 14:13:24 INFO TaskSetManager: Starting task 22.0 in stage 0.0 (TID 28, master, partition 22,NODE_LOCAL, 2152 bytes)
16/09/24 14:13:24 INFO TaskSetManager: Finished task 20.0 in stage 0.0 (TID 27) in 1877 ms on master (28/40)
16/09/24 14:13:27 INFO TaskSetManager: Starting task 23.0 in stage 0.0 (TID 29, master, partition 23,NODE_LOCAL, 2152 bytes)
16/09/24 14:13:27 INFO TaskSetManager: Finished task 22.0 in stage 0.0 (TID 28) in 2821 ms on master (29/40)
16/09/24 14:13:28 INFO TaskSetManager: Starting task 24.0 in stage 0.0 (TID 30, master, partition 24,NODE_LOCAL, 2152 bytes)
16/09/24 14:13:28 INFO TaskSetManager: Finished task 23.0 in stage 0.0 (TID 29) in 1447 ms on master (30/40)
16/09/24 14:13:30 INFO TaskSetManager: Starting task 26.0 in stage 0.0 (TID 31, master, partition 26,NODE_LOCAL, 2152 bytes)
16/09/24 14:13:30 INFO TaskSetManager: Finished task 24.0 in stage 0.0 (TID 30) in 1549 ms on master (31/40)
16/09/24 14:13:31 INFO TaskSetManager: Starting task 27.0 in stage 0.0 (TID 32, master, partition 27,NODE_LOCAL, 2152 bytes)
16/09/24 14:13:31 INFO TaskSetManager: Finished task 26.0 in stage 0.0 (TID 31) in 1272 ms on master (32/40)
16/09/24 14:13:32 INFO TaskSetManager: Starting task 28.0 in stage 0.0 (TID 33, master, partition 28,NODE_LOCAL, 2152 bytes)
16/09/24 14:13:32 INFO TaskSetManager: Finished task 27.0 in stage 0.0 (TID 32) in 1189 ms on master (33/40)
16/09/24 14:13:34 INFO TaskSetManager: Starting task 30.0 in stage 0.0 (TID 34, master, partition 30,NODE_LOCAL, 2152 bytes)
16/09/24 14:13:34 INFO TaskSetManager: Finished task 28.0 in stage 0.0 (TID 33) in 1198 ms on master (34/40)
16/09/24 14:13:35 INFO TaskSetManager: Starting task 35.0 in stage 0.0 (TID 35, master, partition 35,NODE_LOCAL, 2152 bytes)
16/09/24 14:13:35 INFO TaskSetManager: Finished task 30.0 in stage 0.0 (TID 34) in 1152 ms on master (35/40)
16/09/24 14:13:36 INFO TaskSetManager: Starting task 36.0 in stage 0.0 (TID 36, master, partition 36,NODE_LOCAL, 2152 bytes)
16/09/24 14:13:36 INFO TaskSetManager: Finished task 35.0 in stage 0.0 (TID 35) in 1553 ms on master (36/40)
16/09/24 14:13:38 INFO TaskSetManager: Starting task 37.0 in stage 0.0 (TID 37, master, partition 37,NODE_LOCAL, 2152 bytes)
16/09/24 14:13:38 INFO TaskSetManager: Finished task 36.0 in stage 0.0 (TID 36) in 1652 ms on master (37/40)
16/09/24 14:13:39 INFO TaskSetManager: Starting task 38.0 in stage 0.0 (TID 38, master, partition 38,NODE_LOCAL, 2152 bytes)
16/09/24 14:13:39 INFO TaskSetManager: Finished task 37.0 in stage 0.0 (TID 37) in 1616 ms on master (38/40)
16/09/24 14:13:41 INFO TaskSetManager: Starting task 39.0 in stage 0.0 (TID 39, master, partition 39,NODE_LOCAL, 2152 bytes)
16/09/24 14:13:41 INFO TaskSetManager: Finished task 38.0 in stage 0.0 (TID 38) in 1441 ms on master (39/40)
16/09/24 14:13:41 INFO TaskSetManager: Finished task 39.0 in stage 0.0 (TID 39) in 315 ms on master (40/40)
16/09/24 14:13:41 INFO DAGScheduler: ShuffleMapStage 0 (jsonFile at <console>:25) finished in 67.196 s
16/09/24 14:13:41 INFO DAGScheduler: looking for newly runnable stages
16/09/24 14:13:41 INFO DAGScheduler: running: Set()
16/09/24 14:13:41 INFO DAGScheduler: waiting: Set(ResultStage 1)
16/09/24 14:13:41 INFO DAGScheduler: failed: Set()
16/09/24 14:13:41 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[6] at jsonFile at <console>:25), which has no missing parents
16/09/24 14:13:41 INFO YarnScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/09/24 14:13:41 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 3.1 KB, free 266.2 KB)
16/09/24 14:13:42 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1885.0 B, free 268.0 KB)
16/09/24 14:13:42 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.202.104:51148 (size: 1885.0 B, free: 511.5 MB)
16/09/24 14:13:42 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1006
16/09/24 14:13:42 INFO DAGScheduler: Submitting 5 missing tasks from ResultStage 1 (MapPartitionsRDD[6] at jsonFile at <console>:25)
16/09/24 14:13:42 INFO YarnScheduler: Adding task set 1.0 with 5 tasks
16/09/24 14:13:42 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 40, slave03, partition 0,NODE_LOCAL, 1894 bytes)
16/09/24 14:13:42 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 41, master, partition 1,NODE_LOCAL, 1894 bytes)
16/09/24 14:13:42 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on slave03:57242 (size: 1885.0 B, free: 1247.6 MB)
16/09/24 14:13:42 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to slave03:48747
16/09/24 14:13:42 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on master:47865 (size: 1885.0 B, free: 1247.6 MB)
16/09/24 14:13:42 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 239 bytes
16/09/24 14:13:42 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to master:57297
16/09/24 14:13:43 INFO TaskSetManager: Starting task 2.0 in stage 1.0 (TID 42, master, partition 2,NODE_LOCAL, 1894 bytes)
16/09/24 14:13:43 INFO TaskSetManager: Starting task 3.0 in stage 1.0 (TID 43, slave03, partition 3,NODE_LOCAL, 1894 bytes)
16/09/24 14:13:43 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 40) in 1524 ms on slave03 (1/5)
16/09/24 14:13:43 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 41) in 1522 ms on master (2/5)
16/09/24 14:13:43 INFO TaskSetManager: Starting task 4.0 in stage 1.0 (TID 44, master, partition 4,NODE_LOCAL, 1894 bytes)
16/09/24 14:13:43 INFO TaskSetManager: Finished task 2.0 in stage 1.0 (TID 42) in 166 ms on master (3/5)
16/09/24 14:13:43 INFO TaskSetManager: Finished task 3.0 in stage 1.0 (TID 43) in 167 ms on slave03 (4/5)
16/09/24 14:13:43 INFO TaskSetManager: Finished task 4.0 in stage 1.0 (TID 44) in 109 ms on master (5/5)
16/09/24 14:13:43 INFO YarnScheduler: Removed TaskSet 1.0, whose tasks have all completed, from pool
16/09/24 14:13:44 INFO DAGScheduler: ResultStage 1 (jsonFile at <console>:25) finished in 2.050 s
16/09/24 14:13:44 INFO DAGScheduler: Job 0 finished: jsonFile at <console>:25, took 70.200793 s
in: org.apache.spark.sql.DataFrame = [answer: array<struct<answer_text:string,answer_time:string,mainpage:string,major:string,name:string,position:string>>, question: struct<question:string,time:string>, userinfo: struct<age:string,gender:string,name:string>]

scala> in.count()
16/09/24 14:13:55 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 86.5 KB, free 354.5 KB)
16/09/24 14:13:55 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 19.5 KB, free 374.0 KB)
16/09/24 14:13:55 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on 192.168.202.104:51148 (size: 19.5 KB, free: 511.5 MB)
16/09/24 14:13:55 INFO SparkContext: Created broadcast 3 from count at <console>:28
16/09/24 14:13:55 INFO MemoryStore: Block broadcast_4 stored as values in memory (estimated size 234.9 KB, free 608.8 KB)
16/09/24 14:13:55 INFO MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 19.9 KB, free 628.7 KB)
16/09/24 14:13:55 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on 192.168.202.104:51148 (size: 19.9 KB, free: 511.4 MB)
16/09/24 14:13:55 INFO SparkContext: Created broadcast 4 from count at <console>:28
16/09/24 14:13:56 INFO FileInputFormat: Total input paths to process : 40
16/09/24 14:13:57 INFO BlockManagerInfo: Removed broadcast_3_piece0 on 192.168.202.104:51148 in memory (size: 19.5 KB, free: 511.5 MB)
16/09/24 14:13:57 INFO BlockManagerInfo: Removed broadcast_2_piece0 on 192.168.202.104:51148 in memory (size: 1885.0 B, free: 511.5 MB)
16/09/24 14:13:57 INFO SparkContext: Starting job: count at <console>:28
16/09/24 14:13:57 INFO DAGScheduler: Registering RDD 12 (count at <console>:28)
16/09/24 14:13:57 INFO DAGScheduler: Got job 1 (count at <console>:28) with 1 output partitions
16/09/24 14:13:57 INFO DAGScheduler: Final stage: ResultStage 3 (count at <console>:28)
16/09/24 14:13:57 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 2)
16/09/24 14:13:57 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 2)
16/09/24 14:13:57 INFO DAGScheduler: Submitting ShuffleMapStage 2 (MapPartitionsRDD[12] at count at <console>:28), which has no missing parents
16/09/24 14:13:57 INFO BlockManagerInfo: Removed broadcast_2_piece0 on slave03:57242 in memory (size: 1885.0 B, free: 1247.6 MB)
16/09/24 14:13:57 INFO MemoryStore: Block broadcast_5 stored as values in memory (estimated size 10.4 KB, free 528.2 KB)
16/09/24 14:13:57 INFO MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 5.3 KB, free 533.5 KB)
16/09/24 14:13:57 INFO BlockManagerInfo: Added broadcast_5_piece0 in memory on 192.168.202.104:51148 (size: 5.3 KB, free: 511.5 MB)
16/09/24 14:13:57 INFO SparkContext: Created broadcast 5 from broadcast at DAGScheduler.scala:1006
16/09/24 14:13:57 INFO DAGScheduler: Submitting 40 missing tasks from ShuffleMapStage 2 (MapPartitionsRDD[12] at count at <console>:28)
16/09/24 14:13:57 INFO YarnScheduler: Adding task set 2.0 with 40 tasks
16/09/24 14:13:57 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 45, master, partition 0,NODE_LOCAL, 2152 bytes)
16/09/24 14:13:57 INFO TaskSetManager: Starting task 1.0 in stage 2.0 (TID 46, slave03, partition 1,NODE_LOCAL, 2152 bytes)
16/09/24 14:13:57 INFO BlockManagerInfo: Removed broadcast_2_piece0 on master:47865 in memory (size: 1885.0 B, free: 1247.6 MB)
16/09/24 14:13:57 INFO BlockManagerInfo: Added broadcast_5_piece0 in memory on master:47865 (size: 5.3 KB, free: 1247.6 MB)
16/09/24 14:13:57 INFO ContextCleaner: Cleaned accumulator 2
16/09/24 14:13:57 INFO BlockManagerInfo: Added broadcast_5_piece0 in memory on slave03:57242 (size: 5.3 KB, free: 1247.6 MB)
16/09/24 14:13:57 INFO BlockManagerInfo: Removed broadcast_1_piece0 on master:47865 in memory (size: 3.0 KB, free: 1247.6 MB)
16/09/24 14:13:57 INFO BlockManagerInfo: Removed broadcast_1_piece0 on slave03:57242 in memory (size: 3.0 KB, free: 1247.6 MB)
16/09/24 14:13:57 INFO BlockManagerInfo: Removed broadcast_1_piece0 on 192.168.202.104:51148 in memory (size: 3.0 KB, free: 511.5 MB)
16/09/24 14:13:57 INFO ContextCleaner: Cleaned accumulator 1
16/09/24 14:13:57 INFO ContextCleaner: Cleaned shuffle 0
16/09/24 14:13:57 INFO BlockManagerInfo: Removed broadcast_0_piece0 on master:47865 in memory (size: 19.9 KB, free: 1247.6 MB)
16/09/24 14:13:57 INFO BlockManagerInfo: Removed broadcast_0_piece0 on 192.168.202.104:51148 in memory (size: 19.9 KB, free: 511.5 MB)
16/09/24 14:13:57 INFO BlockManagerInfo: Removed broadcast_0_piece0 on slave03:57242 in memory (size: 19.9 KB, free: 1247.6 MB)
16/09/24 14:13:57 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on slave03:57242 (size: 19.9 KB, free: 1247.6 MB)
16/09/24 14:13:58 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on master:47865 (size: 19.9 KB, free: 1247.6 MB)
16/09/24 14:14:00 INFO TaskSetManager: Starting task 2.0 in stage 2.0 (TID 47, slave03, partition 2,NODE_LOCAL, 2152 bytes)
16/09/24 14:14:00 INFO TaskSetManager: Finished task 1.0 in stage 2.0 (TID 46) in 2792 ms on slave03 (1/40)
16/09/24 14:14:01 INFO TaskSetManager: Starting task 3.0 in stage 2.0 (TID 48, master, partition 3,NODE_LOCAL, 2152 bytes)
16/09/24 14:14:01 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 45) in 4017 ms on master (2/40)
16/09/24 14:14:01 INFO TaskSetManager: Starting task 8.0 in stage 2.0 (TID 49, slave03, partition 8,NODE_LOCAL, 2152 bytes)
16/09/24 14:14:01 INFO TaskSetManager: Finished task 2.0 in stage 2.0 (TID 47) in 1427 ms on slave03 (3/40)
16/09/24 14:14:03 INFO TaskSetManager: Starting task 11.0 in stage 2.0 (TID 50, slave03, partition 11,NODE_LOCAL, 2152 bytes)
16/09/24 14:14:03 INFO TaskSetManager: Finished task 8.0 in stage 2.0 (TID 49) in 1305 ms on slave03 (4/40)
16/09/24 14:14:03 INFO TaskSetManager: Starting task 4.0 in stage 2.0 (TID 51, master, partition 4,NODE_LOCAL, 2152 bytes)
16/09/24 14:14:03 INFO TaskSetManager: Finished task 3.0 in stage 2.0 (TID 48) in 1558 ms on master (5/40)
16/09/24 14:14:04 INFO TaskSetManager: Starting task 15.0 in stage 2.0 (TID 52, slave03, partition 15,NODE_LOCAL, 2152 bytes)
16/09/24 14:14:04 INFO TaskSetManager: Finished task 11.0 in stage 2.0 (TID 50) in 1197 ms on slave03 (6/40)
16/09/24 14:14:04 INFO TaskSetManager: Starting task 5.0 in stage 2.0 (TID 53, master, partition 5,NODE_LOCAL, 2152 bytes)
16/09/24 14:14:04 INFO TaskSetManager: Finished task 4.0 in stage 2.0 (TID 51) in 1672 ms on master (7/40)
16/09/24 14:14:05 INFO TaskSetManager: Starting task 16.0 in stage 2.0 (TID 54, slave03, partition 16,NODE_LOCAL, 2152 bytes)
16/09/24 14:14:05 INFO TaskSetManager: Finished task 15.0 in stage 2.0 (TID 52) in 1164 ms on slave03 (8/40)
16/09/24 14:14:05 INFO TaskSetManager: Starting task 6.0 in stage 2.0 (TID 55, master, partition 6,NODE_LOCAL, 2152 bytes)
16/09/24 14:14:05 INFO TaskSetManager: Finished task 5.0 in stage 2.0 (TID 53) in 997 ms on master (9/40)
16/09/24 14:14:06 INFO TaskSetManager: Starting task 17.0 in stage 2.0 (TID 56, slave03, partition 17,NODE_LOCAL, 2152 bytes)
16/09/24 14:14:06 INFO TaskSetManager: Finished task 16.0 in stage 2.0 (TID 54) in 1056 ms on slave03 (10/40)
16/09/24 14:14:06 INFO TaskSetManager: Starting task 7.0 in stage 2.0 (TID 57, master, partition 7,NODE_LOCAL, 2152 bytes)
16/09/24 14:14:06 INFO TaskSetManager: Finished task 6.0 in stage 2.0 (TID 55) in 932 ms on master (11/40)
16/09/24 14:14:07 INFO TaskSetManager: Starting task 9.0 in stage 2.0 (TID 58, master, partition 9,NODE_LOCAL, 2152 bytes)
16/09/24 14:14:07 INFO TaskSetManager: Finished task 7.0 in stage 2.0 (TID 57) in 932 ms on master (12/40)
16/09/24 14:14:07 INFO TaskSetManager: Starting task 21.0 in stage 2.0 (TID 59, slave03, partition 21,NODE_LOCAL, 2152 bytes)
16/09/24 14:14:07 INFO TaskSetManager: Finished task 17.0 in stage 2.0 (TID 56) in 1398 ms on slave03 (13/40)
16/09/24 14:14:09 INFO TaskSetManager: Starting task 10.0 in stage 2.0 (TID 60, master, partition 10,NODE_LOCAL, 2152 bytes)
16/09/24 14:14:09 INFO TaskSetManager: Finished task 9.0 in stage 2.0 (TID 58) in 1337 ms on master (14/40)
16/09/24 14:14:09 INFO TaskSetManager: Starting task 25.0 in stage 2.0 (TID 61, slave03, partition 25,NODE_LOCAL, 2152 bytes)
16/09/24 14:14:09 INFO TaskSetManager: Finished task 21.0 in stage 2.0 (TID 59) in 1232 ms on slave03 (15/40)
16/09/24 14:14:10 INFO TaskSetManager: Starting task 29.0 in stage 2.0 (TID 62, slave03, partition 29,NODE_LOCAL, 2152 bytes)
16/09/24 14:14:10 INFO TaskSetManager: Finished task 25.0 in stage 2.0 (TID 61) in 1415 ms on slave03 (16/40)
16/09/24 14:14:10 INFO TaskSetManager: Starting task 12.0 in stage 2.0 (TID 63, master, partition 12,NODE_LOCAL, 2152 bytes)
16/09/24 14:14:10 INFO TaskSetManager: Finished task 10.0 in stage 2.0 (TID 60) in 1529 ms on master (17/40)
16/09/24 14:14:11 INFO TaskSetManager: Starting task 31.0 in stage 2.0 (TID 64, slave03, partition 31,NODE_LOCAL, 2152 bytes)
16/09/24 14:14:11 INFO TaskSetManager: Finished task 29.0 in stage 2.0 (TID 62) in 1025 ms on slave03 (18/40)
16/09/24 14:14:12 INFO TaskSetManager: Starting task 13.0 in stage 2.0 (TID 65, master, partition 13,NODE_LOCAL, 2152 bytes)
16/09/24 14:14:12 INFO TaskSetManager: Finished task 12.0 in stage 2.0 (TID 63) in 1648 ms on master (19/40)
16/09/24 14:14:12 INFO TaskSetManager: Starting task 32.0 in stage 2.0 (TID 66, slave03, partition 32,NODE_LOCAL, 2152 bytes)
16/09/24 14:14:12 INFO TaskSetManager: Finished task 31.0 in stage 2.0 (TID 64) in 1128 ms on slave03 (20/40)
16/09/24 14:14:13 INFO TaskSetManager: Starting task 14.0 in stage 2.0 (TID 67, master, partition 14,NODE_LOCAL, 2152 bytes)
16/09/24 14:14:13 INFO TaskSetManager: Finished task 13.0 in stage 2.0 (TID 65) in 1392 ms on master (21/40)
16/09/24 14:14:14 INFO TaskSetManager: Starting task 33.0 in stage 2.0 (TID 68, slave03, partition 33,NODE_LOCAL, 2152 bytes)
16/09/24 14:14:14 INFO TaskSetManager: Finished task 32.0 in stage 2.0 (TID 66) in 1381 ms on slave03 (22/40)
16/09/24 14:14:14 INFO TaskSetManager: Starting task 18.0 in stage 2.0 (TID 69, master, partition 18,NODE_LOCAL, 2152 bytes)
16/09/24 14:14:14 INFO TaskSetManager: Finished task 14.0 in stage 2.0 (TID 67) in 1354 ms on master (23/40)
16/09/24 14:14:15 INFO TaskSetManager: Starting task 34.0 in stage 2.0 (TID 70, slave03, partition 34,NODE_LOCAL, 2152 bytes)
16/09/24 14:14:15 INFO TaskSetManager: Finished task 33.0 in stage 2.0 (TID 68) in 1279 ms on slave03 (24/40)
16/09/24 14:14:16 INFO TaskSetManager: Starting task 19.0 in stage 2.0 (TID 71, master, partition 19,NODE_LOCAL, 2152 bytes)
16/09/24 14:14:16 INFO TaskSetManager: Finished task 18.0 in stage 2.0 (TID 69) in 1109 ms on master (25/40)
16/09/24 14:14:16 INFO TaskSetManager: Finished task 34.0 in stage 2.0 (TID 70) in 1361 ms on slave03 (26/40)
16/09/24 14:14:17 INFO TaskSetManager: Starting task 20.0 in stage 2.0 (TID 72, master, partition 20,NODE_LOCAL, 2152 bytes)
16/09/24 14:14:17 INFO TaskSetManager: Finished task 19.0 in stage 2.0 (TID 71) in 1280 ms on master (27/40)
16/09/24 14:14:18 INFO TaskSetManager: Starting task 22.0 in stage 2.0 (TID 73, master, partition 22,NODE_LOCAL, 2152 bytes)
16/09/24 14:14:18 INFO TaskSetManager: Finished task 20.0 in stage 2.0 (TID 72) in 1067 ms on master (28/40)
16/09/24 14:14:19 INFO TaskSetManager: Starting task 23.0 in stage 2.0 (TID 74, master, partition 23,NODE_LOCAL, 2152 bytes)
16/09/24 14:14:19 INFO TaskSetManager: Finished task 22.0 in stage 2.0 (TID 73) in 1068 ms on master (29/40)
16/09/24 14:14:20 INFO TaskSetManager: Starting task 24.0 in stage 2.0 (TID 75, master, partition 24,NODE_LOCAL, 2152 bytes)
16/09/24 14:14:20 INFO TaskSetManager: Finished task 23.0 in stage 2.0 (TID 74) in 929 ms on master (30/40)
16/09/24 14:14:21 INFO TaskSetManager: Starting task 26.0 in stage 2.0 (TID 76, master, partition 26,NODE_LOCAL, 2152 bytes)
16/09/24 14:14:21 INFO TaskSetManager: Finished task 24.0 in stage 2.0 (TID 75) in 873 ms on master (31/40)
16/09/24 14:14:22 INFO TaskSetManager: Starting task 27.0 in stage 2.0 (TID 77, master, partition 27,NODE_LOCAL, 2152 bytes)
16/09/24 14:14:22 INFO TaskSetManager: Finished task 26.0 in stage 2.0 (TID 76) in 1021 ms on master (32/40)
16/09/24 14:14:23 INFO TaskSetManager: Starting task 28.0 in stage 2.0 (TID 78, master, partition 28,NODE_LOCAL, 2152 bytes)
16/09/24 14:14:23 INFO TaskSetManager: Finished task 27.0 in stage 2.0 (TID 77) in 1255 ms on master (33/40)
16/09/24 14:14:24 INFO TaskSetManager: Starting task 30.0 in stage 2.0 (TID 79, master, partition 30,NODE_LOCAL, 2152 bytes)
16/09/24 14:14:24 INFO TaskSetManager: Finished task 28.0 in stage 2.0 (TID 78) in 897 ms on master (34/40)
16/09/24 14:14:25 INFO TaskSetManager: Starting task 35.0 in stage 2.0 (TID 80, master, partition 35,NODE_LOCAL, 2152 bytes)
16/09/24 14:14:25 INFO TaskSetManager: Finished task 30.0 in stage 2.0 (TID 79) in 1017 ms on master (35/40)
16/09/24 14:14:26 INFO TaskSetManager: Starting task 36.0 in stage 2.0 (TID 81, master, partition 36,NODE_LOCAL, 2152 bytes)
16/09/24 14:14:26 INFO TaskSetManager: Finished task 35.0 in stage 2.0 (TID 80) in 1402 ms on master (36/40)
16/09/24 14:14:28 INFO TaskSetManager: Starting task 37.0 in stage 2.0 (TID 82, master, partition 37,NODE_LOCAL, 2152 bytes)
16/09/24 14:14:28 INFO TaskSetManager: Finished task 36.0 in stage 2.0 (TID 81) in 1231 ms on master (37/40)
16/09/24 14:14:29 INFO TaskSetManager: Starting task 38.0 in stage 2.0 (TID 83, master, partition 38,NODE_LOCAL, 2152 bytes)
16/09/24 14:14:29 INFO TaskSetManager: Finished task 37.0 in stage 2.0 (TID 82) in 1319 ms on master (38/40)
16/09/24 14:14:30 INFO TaskSetManager: Starting task 39.0 in stage 2.0 (TID 84, master, partition 39,NODE_LOCAL, 2152 bytes)
16/09/24 14:14:30 INFO TaskSetManager: Finished task 38.0 in stage 2.0 (TID 83) in 1076 ms on master (39/40)
16/09/24 14:14:30 INFO TaskSetManager: Finished task 39.0 in stage 2.0 (TID 84) in 186 ms on master (40/40)
16/09/24 14:14:30 INFO YarnScheduler: Removed TaskSet 2.0, whose tasks have all completed, from pool
16/09/24 14:14:30 INFO DAGScheduler: ShuffleMapStage 2 (count at <console>:28) finished in 33.059 s
16/09/24 14:14:30 INFO DAGScheduler: looking for newly runnable stages
16/09/24 14:14:30 INFO DAGScheduler: running: Set()
16/09/24 14:14:30 INFO DAGScheduler: waiting: Set(ResultStage 3)
16/09/24 14:14:30 INFO DAGScheduler: failed: Set()
16/09/24 14:14:30 INFO DAGScheduler: Submitting ResultStage 3 (MapPartitionsRDD[15] at count at <console>:28), which has no missing parents
16/09/24 14:14:30 INFO MemoryStore: Block broadcast_6 stored as values in memory (estimated size 11.5 KB, free 282.0 KB)
16/09/24 14:14:30 INFO MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 5.8 KB, free 287.8 KB)
16/09/24 14:14:30 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on 192.168.202.104:51148 (size: 5.8 KB, free: 511.5 MB)
16/09/24 14:14:30 INFO SparkContext: 
Created broadcast 6 from broadcast at DAGScheduler.scala:100616/09/24 14:14:30 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 3 (MapPartitionsRDD[15] at count at <console>:28)16/09/24 14:14:30 INFO YarnScheduler: Adding task set 3.0 with 1 tasks16/09/24 14:14:30 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID 85, master, partition 0,NODE_LOCAL, 1999 bytes)16/09/24 14:14:30 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on master:47865 (size: 5.8 KB, free: 1247.6 MB)16/09/24 14:14:31 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 1 to master:5729716/09/24 14:14:31 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 1 is 187 bytes16/09/24 14:14:32 INFO DAGScheduler: ResultStage 3 (count at <console>:28) finished in 1.753 s16/09/24 14:14:32 INFO DAGScheduler: Job 1 finished: count at <console>:28, took 34.948111 s16/09/24 14:14:32 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID 85) in 1752 ms on master (1/1)16/09/24 14:14:32 INFO YarnScheduler: Removed TaskSet 3.0, whose tasks have all completed, from pool res0: Long = 1342920scala> in.cache()16/09/24 14:14:46 INFO MemoryStore: Block broadcast_7 stored as values in memory (estimated size 229.7 KB, free 517.5 KB)16/09/24 14:14:46 INFO MemoryStore: Block broadcast_7_piece0 stored as bytes in memory (estimated size 19.5 KB, free 537.0 KB)16/09/24 14:14:46 INFO BlockManagerInfo: Added broadcast_7_piece0 in memory on 192.168.202.104:51148 (size: 19.5 KB, free: 511.5 MB)16/09/24 14:14:46 INFO SparkContext: Created broadcast 7 from cache at <console>:2816/09/24 14:14:46 INFO MemoryStore: Block broadcast_8 stored as values in memory (estimated size 234.9 KB, free 771.9 KB)16/09/24 14:14:46 INFO MemoryStore: Block broadcast_8_piece0 stored as bytes in memory (estimated size 19.9 KB, free 791.7 KB)16/09/24 14:14:46 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on 192.168.202.104:51148 (size: 19.9 KB, free: 511.4 
MB)16/09/24 14:14:46 INFO SparkContext: Created broadcast 8 from cache at <console>:28res1: in.type = [answer: array<struct<answer_text:string,answer_time:string,mainpage:string,major:string,name:string,position:string>>, question: struct<question:string,time:string>, userinfo: struct<age:string,gender:string,name:string>]scala> in.groupBy("userinfo.gender").count().show()16/09/24 14:15:12 INFO FileInputFormat: Total input paths to process : 4016/09/24 14:15:12 INFO SparkContext: Starting job: show at <console>:2816/09/24 14:15:12 INFO DAGScheduler: Registering RDD 23 (show at <console>:28)16/09/24 14:15:12 INFO DAGScheduler: Got job 2 (show at <console>:28) with 1 output partitions16/09/24 14:15:12 INFO DAGScheduler: Final stage: ResultStage 5 (show at <console>:28)16/09/24 14:15:12 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 4)16/09/24 14:15:12 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 4)16/09/24 14:15:13 INFO DAGScheduler: Submitting ShuffleMapStage 4 (MapPartitionsRDD[23] at show at <console>:28), which has no missing parents16/09/24 14:15:13 INFO MemoryStore: Block broadcast_9 stored as values in memory (estimated size 14.6 KB, free 806.3 KB)16/09/24 14:15:13 INFO MemoryStore: Block broadcast_9_piece0 stored as bytes in memory (estimated size 7.0 KB, free 813.3 KB)......16/09/24 14:16:38 INFO TaskSetManager: Starting task 197.0 in stage 7.0 (TID 335, slave03, partition 198,NODE_LOCAL, 1999 bytes)16/09/24 14:16:38 INFO TaskSetManager: Finished task 194.0 in stage 7.0 (TID 332) in 49 ms on slave03 (196/199)16/09/24 14:16:38 INFO TaskSetManager: Starting task 198.0 in stage 7.0 (TID 336, slave03, partition 199,NODE_LOCAL, 1999 bytes)16/09/24 14:16:38 INFO TaskSetManager: Finished task 196.0 in stage 7.0 (TID 334) in 30 ms on slave03 (197/199)16/09/24 14:16:38 INFO TaskSetManager: Finished task 198.0 in stage 7.0 (TID 336) in 33 ms on slave03 (198/199)16/09/24 14:16:38 INFO TaskSetManager: Finished task 197.0 in stage 7.0 (TID 
335) in 64 ms on slave03 (199/199)
16/09/24 14:16:38 INFO YarnScheduler: Removed TaskSet 7.0, whose tasks have all completed, from pool
16/09/24 14:16:38 INFO DAGScheduler: ResultStage 7 (show at <console>:28) finished in 4.624 s
16/09/24 14:16:38 INFO DAGScheduler: Job 3 finished: show at <console>:28, took 4.747212 s
+------+------+
|gender| count|
+------+------+
|     男|521920|
|     女|821000|
+------+------+
scala> in.groupBy("userinfo.age").count().show()
16/09/24 14:19:50 INFO SparkContext: Starting job: show at <console>:28
16/09/24 14:19:50 INFO DAGScheduler: Registering RDD 30 (show at <console>:28)
16/09/24 14:19:50 INFO DAGScheduler: Got job 4 (show at <console>:28) with 1 output partitions
16/09/24 14:19:50 INFO DAGScheduler: Final stage: ResultStage 9 (show at <console>:28)
16/09/24 14:19:50 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 8)
16/09/24 14:19:50 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 8)
16/09/24 14:19:50 INFO DAGScheduler: Submitting ShuffleMapStage 8 (MapPartitionsRDD[30] at show at <console>:28), which has no missing parents
16/09/24 14:19:50 INFO MemoryStore: Block broadcast_12 stored as values in memory (estimated size 14.5 KB, free 325.4 KB)
16/09/24 14:19:50 INFO MemoryStore: Block broadcast_12_piece0 stored as bytes in memory (estimated size 7.0 KB, free 321.1 KB)
16/09/24 14:19:50 INFO BlockManagerInfo: Removed broadcast_11_piece0 on 192.168.202.104:51148 in memory (size: 6.0 KB, free: 511.5 MB)
16/09/24 14:19:50 INFO BlockManagerInfo: Added broadcast_12_piece0 in memory on 192.168.202.104:51148 (size: 7.0 KB, free: 511.5 MB)
16/09/24 14:19:50 INFO SparkContext: Created broadcast 12 from broadcast at DAGScheduler.scala:1006
16/09/24 14:19:50 INFO DAGScheduler: Submitting 40 missing tasks from ShuffleMapStage 8 (MapPartitionsRDD[30] at show at <console>:28)
16/09/24 14:19:50 INFO YarnScheduler: Adding task set 8.0 with 40 tasks
16/09/24 14:19:50 INFO BlockManagerInfo: Removed broadcast_11_piece0 on slave03:57242 in memory (size: 6.0 KB, free: 100.8 MB)
16/09/24 14:19:50 INFO BlockManagerInfo: Removed broadcast_11_piece0 on slave03:59667 in memory (size: 6.0 KB, free: 886.1 MB)
16/09/24 14:19:50 INFO ContextCleaner: Cleaned accumulator 25
16/09/24 14:19:50 INFO BlockManagerInfo: Removed broadcast_10_piece0 on 192.168.202.104:51148 in memory (size: 6.0 KB, free: 511.5 MB)
16/09/24 14:19:50 INFO BlockManagerInfo: Removed broadcast_10_piece0 on slave03:57242 in memory (size: 6.0 KB, free: 100.8 MB)
16/09/24 14:19:50 INFO TaskSetManager: Starting task 0.0 in stage 8.0 (TID 337, slave03, partition 0,PROCESS_LOCAL, 2152 bytes)
16/09/24 14:19:50 INFO TaskSetManager: Starting task 4.0 in stage 8.0 (TID 338, slave03, partition 4,PROCESS_LOCAL, 2152 bytes)
....
16/09/24 14:20:09 INFO TaskSetManager: Starting task 7.0 in stage 11.0 (TID 385, slave03, partition 8,NODE_LOCAL, 1999 bytes)
16/09/24 14:20:09 INFO TaskSetManager: Finished task 5.0 in stage 11.0 (TID 383) in 31 ms on slave03 (6/10)
16/09/24 14:20:09 INFO TaskSetManager: Starting task 8.0 in stage 11.0 (TID 386, slave03, partition 9,NODE_LOCAL, 1999 bytes)
16/09/24 14:20:09 INFO TaskSetManager: Finished task 7.0 in stage 11.0 (TID 385) in 29 ms on slave03 (7/10)
16/09/24 14:20:09 INFO TaskSetManager: Starting task 9.0 in stage 11.0 (TID 387, slave03, partition 10,NODE_LOCAL, 1999 bytes)
16/09/24 14:20:09 INFO TaskSetManager: Finished task 6.0 in stage 11.0 (TID 384) in 55 ms on slave03 (8/10)
16/09/24 14:20:09 INFO TaskSetManager: Finished task 8.0 in stage 11.0 (TID 386) in 30 ms on slave03 (9/10)
16/09/24 14:20:09 INFO TaskSetManager: Finished task 9.0 in stage 11.0 (TID 387) in 47 ms on slave03 (10/10)
16/09/24 14:20:09 INFO YarnScheduler: Removed TaskSet 11.0, whose tasks have all completed, from pool
16/09/24 14:20:09 INFO DAGScheduler: ResultStage 11 (show at <console>:28) finished in 0.237 s
16/09/24 14:20:09 INFO DAGScheduler: Job 5 finished: show at <console>:28, took 0.250153 s
+--------------+-----+
|           age|count|
+--------------+-----+
|           51岁| 5188|
|           3o岁|    1|
|          13个月|    5|
|           25岁|48233|
|          104岁|    4|
|          50个月|    1|
|          25个月|    4|
|         如:68岁|    1|
|           11月| 1148|
|           26岁|    1|
|           五个月|    6|
|          62个月|    1|
|         如:42岁|    1|
|           鞠小琼|    1|
|           76天|   78|
|          4.5岁|   21|
|           00岁|   23|
|          31 岁|    1|
|           45岁|    3|
|            7岁|    1|
+--------------+-----+
only showing top 20 rows



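The session broke off once at this point. The commands below come from a restarted shell (note that the result counter starts again at res0).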



scala> in.cache()
16/09/24 14:32:21 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 86.5 KB, free 354.5 KB)
16/09/24 14:32:21 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 19.5 KB, free 374.0 KB)
16/09/24 14:32:21 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on 192.168.202.104:56842 (size: 19.5 KB, free: 511.5 MB)
16/09/24 14:32:21 INFO SparkContext: Created broadcast 3 from cache at <console>:28
16/09/24 14:32:21 INFO MemoryStore: Block broadcast_4 stored as values in memory (estimated size 234.9 KB, free 608.8 KB)
16/09/24 14:32:22 INFO MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 19.9 KB, free 628.7 KB)
16/09/24 14:32:22 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on 192.168.202.104:56842 (size: 19.9 KB, free: 511.4 MB)
16/09/24 14:32:22 INFO SparkContext: Created broadcast 4 from cache at <console>:28
res0: in.type = [answer: array<struct<answer_text:string,answer_time:string,mainpage:string,major:string,name:string,position:string>>, question: struct<question:string,time:string>, userinfo: struct<age:string,gender:string,name:string>]
scala> val age = in.groupBy("userinfo.age").count()
age: org.apache.spark.sql.DataFrame = [age: string, count: bigint]
scala> age.sort($"count".desc).show(50)
16/09/24 14:43:05 INFO SparkContext: Starting job: show at <console>:30
16/09/24 14:43:05 INFO DAGScheduler: Registering RDD 24 (cache at <console>:30)
16/09/24 14:43:05 INFO DAGScheduler: Got job 2 (show at <console>:30) with 200 output partitions
16/09/24 14:43:05 INFO DAGScheduler: Final stage: ResultStage 6 (show at <console>:30)
16/09/24 14:43:05 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 5)
16/09/24 14:43:05 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 5)
.....
16/09/24 14:43:15 INFO BlockManagerInfo: Added rdd_27_197 in memory on slave04:52825 (size: 520.0 B, free: 819.2 MB)
16/09/24 14:43:15 INFO BlockManagerInfo: Added rdd_27_196 in memory on master:50166 (size: 496.0 B, free: 212.1 MB)
16/09/24 14:43:15 INFO TaskSetManager: Starting task 198.0 in stage 6.0 (TID 524, slave04, partition 198,NODE_LOCAL, 1999 bytes)
16/09/24 14:43:15 INFO TaskSetManager: Finished task 197.0 in stage 6.0 (TID 523) in 26 ms on slave04 (197/200)
16/09/24 14:43:15 INFO TaskSetManager: Starting task 199.0 in stage 6.0 (TID 525, master, partition 199,NODE_LOCAL, 1999 bytes)
16/09/24 14:43:15 INFO TaskSetManager: Finished task 196.0 in stage 6.0 (TID 522) in 36 ms on master (198/200)
16/09/24 14:43:15 INFO BlockManagerInfo: Added rdd_27_198 in memory on slave04:52825 (size: 584.0 B, free: 819.2 MB)
16/09/24 14:43:15 INFO TaskSetManager: Finished task 198.0 in stage 6.0 (TID 524) in 32 ms on slave04 (199/200)
16/09/24 14:43:15 INFO BlockManagerInfo: Added rdd_27_199 in memory on master:50166 (size: 496.0 B, free: 212.1 MB)
16/09/24 14:43:15 INFO TaskSetManager: Finished task 199.0 in stage 6.0 (TID 525) in 34 ms on master (200/200)
16/09/24 14:43:15 INFO YarnScheduler: Removed TaskSet 6.0, whose tasks have all completed, from pool
16/09/24 14:43:15 INFO DAGScheduler: ResultStage 6 (show at <console>:30) finished in 4.572 s
16/09/24 14:43:15 INFO DAGScheduler: Job 2 finished: show at <console>:30, took 10.332326 s
+---+-----+
|age|count|
+---+-----+
|20岁|62901|
|22岁|50805|
|21岁|49951|
|23岁|49046|
|25岁|48233|
|26岁|47529|
|24岁|45865|
|28岁|42684|
|27岁|42272|
|18岁|41792|
|19岁|37736|
|35岁|35438|
|30岁|35065|
|29岁|31617|
|17岁|31285|
|16岁|23708|
|31岁|20297|
|32岁|18843|
| 1岁|17713|
|33岁|17172|
|15岁|13881|
|40岁|13490|
|34岁|13276|
|  0|12059|
|38岁|11793|
| 2岁|10973|
|45岁|10890|
|37岁|10728|
| 27|10448|
|36岁|10441|
| 23|10250|
| 25|10215|
|42岁| 9933|
|43岁| 9853|
| 3岁| 9828|
|39岁| 9558|
|50岁| 9428|
| 20| 9418|
|14岁| 9390|
| 22| 9226|
|46岁| 9100|
| 24| 8967|
| 4岁| 8653|
|41岁| 8632|
|48岁| 8561|
| 26| 8290|
| 21| 8087|
|44岁| 7979|
| 5岁| 7601|
|47岁| 7532|
+---+-----+
only showing top 50 rows
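Most users asking questions on 寻医问药网 (Xunyiwenyao) are in their early twenties: 20岁 is the most common age value (62,901 records), followed by 22岁, 21岁 and 23岁. This age group probably both cares about health and spends a lot of time online.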
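The age field is free text, so the same age appears under several spellings: the sorted result lists 27岁 and 27 as separate groups, and the earlier 20-row sample contains values such as 13个月, 76天 and 如:68岁. Below is a minimal sketch of one way to normalize such strings before grouping, written in plain Scala so it can be tried outside the cluster. `AgeNormalizer` and `toYears` are hypothetical names that do not appear in the original session, and treating a bare number as years is an assumption.

```scala
object AgeNormalizer {
  // Optional non-digit prefix (e.g. "如:"), a number, optional whitespace,
  // and an optional unit: 岁 (years), 个月 or 月 (months), 天 (days).
  private val AgePattern = """^\D*?(\d+(?:\.\d+)?)\s*(岁|个月|月|天)?$""".r

  /** Converts a raw age string to years; returns None if it does not fit the pattern. */
  def toYears(raw: String): Option[Double] = raw.trim match {
    case AgePattern(num, unit) =>
      val value = num.toDouble
      // `unit` is null when the string has no unit, hence the Option wrapper.
      Some(Option(unit) match {
        case Some("个月") | Some("月") => value / 12.0
        case Some("天")               => value / 365.0
        case _                        => value // "岁" or no unit: assumed to be years
      })
    case _ => None
  }
}
```

A function like this could be wrapped in a Spark UDF and applied to userinfo.age before the groupBy. Values it cannot parse, such as 五个月 or 3o岁, come back as None and would need separate handling.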

scala> in.printSchema()
root
 |-- answer: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- answer_text: string (nullable = true)
 |    |    |-- answer_time: string (nullable = true)
 |    |    |-- mainpage: string (nullable = true)
 |    |    |-- major: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- position: string (nullable = true)
 |-- question: struct (nullable = true)
 |    |-- question: string (nullable = true)
 |    |-- time: string (nullable = true)
 |-- userinfo: struct (nullable = true)
 |    |-- age: string (nullable = true)
 |    |-- gender: string (nullable = true)
 |    |-- name: string (nullable = true)
scala> val time = in.groupBy("question.time").count()
time: org.apache.spark.sql.DataFrame = [time: string, count: bigint]
scala> time.cache()
res6: time.type = [time: string, count: bigint]
scala> val time1 = time.withColumn("hour",time("time").substr(11,3))
scala> val tmp = time1.groupBy("hour")
tmp: org.apache.spark.sql.GroupedData = org.apache.spark.sql.GroupedData@623835cb
scala> tmp.sum("count")
res26: org.apache.spark.sql.DataFrame = [hour: string, sum(count): bigint]
scala> tmp.sum("count").sort($"sum(count)".desc).show(50)
16/09/24 15:30:46 INFO SparkContext: Starting job: show at <console>:34
16/09/24 15:30:46 INFO DAGScheduler: Registering RDD 59 (show at <console>:34)
16/09/24 15:30:46 INFO DAGScheduler: Got job 7 (show at <console>:34) with 200 output partitions
16/09/24 15:30:46 INFO DAGScheduler: Final stage: ResultStage 17 (show at <console>:34)
16/09/24 15:30:46 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 16)
16/09/24 15:30:46 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 16)
16/09/24 15:30:46 INFO DAGScheduler: Submitting ShuffleMapStage 16 (MapPartitionsRDD[59] at show at <console>:34), which has no missing parents
16/09/24 15:30:46 INFO MemoryStore: Block broadcast_15 stored as values in memory (estimated size 15.9 KB, free 270.6 KB)
16/09/24 15:30:46 INFO MemoryStore: Block broadcast_15_piece0 stored as bytes in memory (estimated size 7.7 KB, free 278.3 KB)
16/09/24 15:30:46 INFO BlockManagerInfo: Added broadcast_15_piece0 in memory on 192.168.202.104:56842 (size: 7.7 KB, free: 511.5 MB)
16/09/24 15:30:46 INFO SparkContext: Created broadcast 15 from broadcast at DAGScheduler.scala:1006
16/09/24 15:30:46 INFO DAGScheduler: Submitting 200 missing tasks from ShuffleMapStage 16 (MapPartitionsRDD[59] at show at <console>:34)
16/09/24 15:30:46 INFO YarnScheduler: Adding task set 16.0 with 200 tasks
16/09/24 15:30:46 INFO TaskSetManager: Starting task 1.0 in stage 16.0 (TID 769, slave04, partition 1,PROCESS_LOCAL, 1988 bytes)
16/09/24 15:30:46 INFO TaskSetManager: Starting task 0.0 in stage 16.0 (TID 770, master, partition 0,PROCESS_LOCAL, 1988 bytes)
16/09/24 15:30:46 INFO BlockManagerInfo: Added broadcast_15_piece0 in memory on slave04:52825 (size: 7.7 KB, free: 1095.9 MB)
16/09/24 15:30:46 INFO BlockManagerInfo: Added broadcast_15_piece0 in memory on master:50166 (size: 7.7 KB, free: 199.6 MB)
....
16/09/24 15:33:18 INFO TaskSetManager: Starting task 199.0 in stage 17.0 (TID 1168, slave04, partition 199,NODE_LOCAL, 1999 bytes)
16/09/24 15:33:18 INFO TaskSetManager: Finished task 198.0 in stage 17.0 (TID 1167) in 30 ms on slave04 (198/200)
16/09/24 15:33:18 INFO TaskSetManager: Finished task 197.0 in stage 17.0 (TID 1166) in 40 ms on master (199/200)
16/09/24 15:33:18 INFO TaskSetManager: Finished task 199.0 in stage 17.0 (TID 1168) in 169 ms on slave04 (200/200)
16/09/24 15:33:18 INFO DAGScheduler: ResultStage 17 (show at <console>:34) finished in 7.057 s
16/09/24 15:33:18 INFO YarnScheduler: Removed TaskSet 17.0, whose tasks have all completed, from pool
16/09/24 15:33:18 INFO DAGScheduler: Job 7 finished: show at <console>:34, took 152.605098 s
+----+----------+
|hour|sum(count)|
+----+----------+
|  21|     95955|
|  22|     87067|
|  20|     86400|
|  19|     76345|
|  16|     75729|
|  17|     75022|
|  18|     73725|
|  13|     73447|
|  15|     73315|
|  10|     73023|
|  12|     71875|
|  14|     71470|
|  11|     70371|
|  09|     68254|
|  08|     60260|
|  23|     59491|
|  07|     46269|
|  06|     28989|
|  00|     28390|
|  01|     13992|
|  05|     11878|
|  02|      8715|
|  03|      6556|
|  04|      6382|
+----+----------+
scala> time.explain(true)
== Parsed Logical Plan ==
Aggregate [question#1.time AS time#79], [question#1.time AS time#79,(count(1),mode=Complete,isDistinct=false) AS count#80L]
+- Relation[answer#0,question#1,userinfo#2] JSONRelation
== Analyzed Logical Plan ==
time: string, count: bigint
Aggregate [question#1.time], [question#1.time AS time#79,(count(1),mode=Complete,isDistinct=false) AS count#80L]
+- Relation[answer#0,question#1,userinfo#2] JSONRelation
== Optimized Logical Plan ==
InMemoryRelation [time#79,count#80L], true, 10000, StorageLevel(true, true, false, true, 1), TungstenAggregate(key=[question#1.time#97], functions=[(count(1),mode=Final,isDistinct=false)], output=[time#79,count#80L]), None
== Physical Plan ==
InMemoryColumnarTableScan [time#79,count#80L], InMemoryRelation [time#79,count#80L], true, 10000, StorageLevel(true, true, false, true, 1), TungstenAggregate(key=[question#1.time#97], functions=[(count(1),mode=Final,isDistinct=false)], output=[time#79,count#80L]), None
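Most questions are asked around 21:00: that hour has the highest count (95,955), followed by 22:00 (87,067) and 20:00 (86,400). Between 09:00 and 19:00 the volume stays fairly flat at roughly 68,000 to 76,000 questions per hour, and it falls off sharply after midnight, so the curve is an evening peak rather than a true normal distribution. Users are most active, and most likely to have time to ask, at around 9 p.m.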
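The hour in the session is cut out of the timestamp string with `substr(11,3)`. If `substr` counts positions from 1, as Spark SQL substring functions generally do, that slice starts at the space before the hour, so the grouping key would carry a leading blank. A minimal sketch of a stricter alternative is to parse the timestamp and read the hour as a number. `HourOfQuestion`, `hourOf` and `countByHour` are hypothetical names, and the pattern assumes every timestamp looks like the sample value 2016-05-01 23:59:52.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import scala.util.Try

object HourOfQuestion {
  // Assumed layout of question.time, taken from the sample row in the session.
  private val Format = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

  /** Returns the hour (0 to 23) of a timestamp, or None if it cannot be parsed. */
  def hourOf(timestamp: String): Option[Int] =
    Try(LocalDateTime.parse(timestamp.trim, Format).getHour).toOption

  /** Counts timestamps per hour, busiest hour first; unparseable values are skipped. */
  def countByHour(timestamps: Seq[String]): Seq[(Int, Int)] =
    timestamps
      .flatMap(t => hourOf(t))
      .groupBy(identity)
      .map { case (hour, hits) => (hour, hits.size) }
      .toSeq
      .sortBy { case (hour, n) => (-n, hour) }
}
```

Parsing also surfaces malformed timestamps as None instead of silently grouping them under an arbitrary three-character key.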


--- Analyzing the question text

scala> ques
res52: org.apache.spark.sql.DataFrame = [question: struct<question:string,time:string>]
scala> val line =ques.take(10)(2)
16/09/24 16:03:00 INFO SparkContext: Starting job: take at <console>:29
16/09/24 16:03:00 INFO DAGScheduler: Got job 13 (take at <console>:29) with 1 output partitions
16/09/24 16:03:00 INFO DAGScheduler: Final stage: ResultStage 24 (take at <console>:29)
16/09/24 16:03:00 INFO DAGScheduler: Parents of final stage: List()
16/09/24 16:03:00 INFO DAGScheduler: Missing parents: List()
16/09/24 16:03:00 INFO DAGScheduler: Submitting ResultStage 24 (MapPartitionsRDD[81] at take at <console>:29), which has no missing parents
16/09/24 16:03:00 INFO MemoryStore: Block broadcast_23 stored as values in memory (estimated size 9.6 KB, free 357.0 KB)
16/09/24 16:03:00 INFO MemoryStore: Block broadcast_23_piece0 stored as bytes in memory (estimated size 4.9 KB, free 361.8 KB)
16/09/24 16:03:00 INFO BlockManagerInfo: Added broadcast_23_piece0 in memory on 192.168.202.104:56842 (size: 4.9 KB, free: 511.4 MB)
16/09/24 16:03:00 INFO SparkContext: Created broadcast 23 from broadcast at DAGScheduler.scala:1006
16/09/24 16:03:00 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 24 (MapPartitionsRDD[81] at take at <console>:29)
16/09/24 16:03:00 INFO YarnScheduler: Adding task set 24.0 with 1 tasks
16/09/24 16:03:00 INFO TaskSetManager: Starting task 0.0 in stage 24.0 (TID 1228, master, partition 0,PROCESS_LOCAL, 2163 bytes)
16/09/24 16:03:00 INFO BlockManagerInfo: Added broadcast_23_piece0 in memory on master:50166 (size: 4.9 KB, free: 199.6 MB)
16/09/24 16:03:00 INFO TaskSetManager: Finished task 0.0 in stage 24.0 (TID 1228) in 74 ms on master (1/1)
16/09/24 16:03:00 INFO YarnScheduler: Removed TaskSet 24.0, whose tasks have all completed, from pool
16/09/24 16:03:00 INFO DAGScheduler: ResultStage 24 (take at <console>:29) finished in 0.073 s
16/09/24 16:03:00 INFO DAGScheduler: Job 13 finished: take at <console>:29, took 0.089934 s
line: org.apache.spark.sql.Row = [[跖骨骨折髓腔可以打钢钉吗,2016-05-01 23:59:52]]
scala> line
res56: org.apache.spark.sql.Row = [[跖骨骨折髓腔可以打钢钉吗,2016-05-01 23:59:52]]
scala> line.
anyNull        apply          asInstanceOf   copy           fieldIndex     get            getAs          getBoolean     getByte        getDate        getDecimal     getDouble      getFloat       getInt         getJavaMap     getList        getLong        getMap         getSeq         getShort       getString      getStruct      getTimestamp   getValuesMap   isInstanceOf   isNullAt       length         mkString       schema         size           toSeq          toString
scala> line.toSeq
res57: Seq[Any] = WrappedArray([跖骨骨折髓腔可以打钢钉吗,2016-05-01 23:59:52])
scala> line.toSeq(0)
res58: Any = [跖骨骨折髓腔可以打钢钉吗,2016-05-01 23:59:52]
scala> val line1 = line.toSeq(0)
line1: Any = [跖骨骨折髓腔可以打钢钉吗,2016-05-01 23:59:52]
scala> line1
res61: Any = [跖骨骨折髓腔可以打钢钉吗,2016-05-01 23:59:52]
scala> line1.
asInstanceOf   isInstanceOf   toString
scala> line1.toString
res62: String = [跖骨骨折髓腔可以打钢钉吗,2016-05-01 23:59:52]
scala> line1.toString()
res63: String = [跖骨骨折髓腔可以打钢钉吗,2016-05-01 23:59:52]
scala> line1.
asInstanceOf   isInstanceOf   toString
scala> line1.
asInstanceOf   isInstanceOf   toString
scala> line1.toString().split(",")(0)
res64: String = [跖骨骨折髓腔可以打钢钉吗
scala> line1.toString().split(",")(0).substring(1)
res65: String = 跖骨骨折髓腔可以打钢钉吗
scala> val t ="心胀病"
t: String = 心胀病
scala> t
res66: String = 心胀病
scala> val ques01 = ques.map{ r =>
     | r.toSeq(0).toString().split(",")(0).substring(1)
     | }
ques01: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[82] at map at <console>:29
scala> ques01.count()
16/09/24 16:15:33 INFO SparkContext: Starting job: count at <console>:32
16/09/24 16:15:33 INFO DAGScheduler: Got job 14 (count at <console>:32) with 40 output partitions
16/09/24 16:15:33 INFO DAGScheduler: Final stage: ResultStage 25 (count at <console>:32)
16/09/24 16:15:33 INFO DAGScheduler: Parents of final stage: List()
16/09/24 16:15:33 INFO 
DAGScheduler: Missing parents: List()16/09/24 16:15:33 INFO DAGScheduler: Submitting ResultStage 25 (MapPartitionsRDD[82] at map at <console>:29), which has no missing parents16/09/24 16:15:33 INFO MemoryStore: Block broadcast_24 stored as values in memory (estimated size 9.5 KB, free 278.8 KB)16/09/24 16:15:33 INFO MemoryStore: Block broadcast_24_piece0 stored as bytes in memory (estimated size 4.8 KB, free 283.7 KB)16/09/24 16:15:33 INFO BlockManagerInfo: Added broadcast_24_piece0 in memory on 192.168.202.104:56842 (size: 4.8 KB, free: 511.5 MB)16/09/24 16:15:33 INFO SparkContext: Created broadcast 24 from broadcast at DAGScheduler.scala:100616/09/24 16:15:33 INFO DAGScheduler: Submitting 40 missing tasks from ResultStage 25 (MapPartitionsRDD[82] at map at <console>:29)16/09/24 16:15:33 INFO YarnScheduler: Adding task set 25.0 with 40 tasks16/09/24 16:15:33 INFO TaskSetManager: Starting task 4.0 in stage 25.0 (TID 1229, slave04, partition 4,PROCESS_LOCAL, 2163 bytes)16/09/24 16:15:33 INFO TaskSetManager: Starting task 0.0 in stage 25.0 (TID 1230, master, partition 0,PROCESS_LOCAL, 2163 bytes)16/09/24 16:15:33 INFO BlockManagerInfo: Added broadcast_24_piece0 in memory on slave04:52825 (size: 4.8 KB, free: 1165.9 MB).....16/09/24 16:15:53 INFO TaskSetManager: Starting task 22.0 in stage 25.0 (TID 1264, master, partition 22,PROCESS_LOCAL, 2163 bytes)16/09/24 16:15:53 INFO TaskSetManager: Finished task 21.0 in stage 25.0 (TID 1263) in 2206 ms on master (35/40)16/09/24 16:15:55 INFO TaskSetManager: Starting task 23.0 in stage 25.0 (TID 1265, master, partition 23,PROCESS_LOCAL, 2163 bytes)16/09/24 16:15:55 INFO TaskSetManager: Finished task 22.0 in stage 25.0 (TID 1264) in 2399 ms on master (36/40)16/09/24 16:16:02 INFO TaskSetManager: Starting task 29.0 in stage 25.0 (TID 1266, slave04, partition 29,RACK_LOCAL, 2163 bytes)16/09/24 16:16:03 INFO TaskSetManager: Starting task 33.0 in stage 25.0 (TID 1267, master, partition 33,PROCESS_LOCAL, 2163 bytes)16/09/24 16:16:03 
INFO TaskSetManager: Finished task 23.0 in stage 25.0 (TID 1265) in 7499 ms on master (37/40)16/09/24 16:16:07 INFO TaskSetManager: Starting task 35.0 in stage 25.0 (TID 1268, master, partition 35,PROCESS_LOCAL, 2163 bytes)16/09/24 16:16:07 INFO TaskSetManager: Finished task 33.0 in stage 25.0 (TID 1267) in 4134 ms on master (38/40)16/09/24 16:16:09 INFO TaskSetManager: Finished task 35.0 in stage 25.0 (TID 1268) in 2554 ms on master (39/40)16/09/24 16:16:39 INFO TaskSetManager: Finished task 29.0 in stage 25.0 (TID 1266) in 37028 ms on slave04 (40/40)16/09/24 16:16:39 INFO YarnScheduler: Removed TaskSet 25.0, whose tasks have all completed, from pool 16/09/24 16:16:39 INFO DAGScheduler: ResultStage 25 (count at <console>:32) finished in 66.124 s16/09/24 16:16:39 INFO DAGScheduler: Job 14 finished: count at <console>:32, took 66.169305 sres68: Long = 1342920scala> ques01res69: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[82] at map at <console>:29scala> ques01.filter{ l =>     | l.contains(t)     | }.count()16/09/24 16:20:34 INFO SparkContext: Starting job: count at <console>:3616/09/24 16:20:34 INFO DAGScheduler: Got job 15 (count at <console>:36) with 40 output partitions16/09/24 16:20:34 INFO DAGScheduler: Final stage: ResultStage 26 (count at <console>:36)16/09/24 16:20:34 INFO DAGScheduler: Parents of final stage: List()16/09/24 16:20:34 INFO DAGScheduler: Missing parents: List()16/09/24 16:20:34 INFO DAGScheduler: Submitting ResultStage 26 (MapPartitionsRDD[83] at filter at <console>:34), which has no missing parents16/09/24 16:20:34 INFO MemoryStore: Block broadcast_25 stored as values in memory (estimated size 15.2 KB, free 298.9 KB)16/09/24 16:20:34 INFO MemoryStore: Block broadcast_25_piece0 stored as bytes in memory (estimated size 6.8 KB, free 305.7 KB)16/09/24 16:20:34 INFO BlockManagerInfo: Added broadcast_25_piece0 in memory on 192.168.202.104:56842 (size: 6.8 KB, free: 511.5 MB)16/09/24 16:20:34 INFO SparkContext: Created broadcast 25 from 
broadcast at DAGScheduler.scala:100616/09/24 16:20:34 INFO DAGScheduler: Submitting 40 missing tasks from ResultStage 26 (MapPartitionsRDD[83] at filter at <console>:34)16/09/24 16:20:34 INFO YarnScheduler: Adding task set 26.0 with 40 tasks16/09/24 16:20:34 INFO TaskSetManager: Starting task 0.0 in stage 26.0 (TID 1269, master, partition 0,PROCESS_LOCAL, 2163 bytes)16/09/24 16:20:34 INFO TaskSetManager: Starting task 4.0 in stage 26.0 (TID 1270, slave04, partition 4,PROCESS_LOCAL, 2163 bytes)16/09/24 16:20:34 INFO BlockManagerInfo: Added broadcast_25_piece0 in memory on slave04:52825 (size: 6.8 KB, free: 1165.9 MB)16/09/24 16:20:34 INFO BlockManagerInfo: Added broadcast_25_piece0 in memory on master:50166 (size: 6.8 KB, free: 199.6 MB)16/09/24 16:20:35 INFO TaskSetManager: Starting task 5.0 in stage 26.0 (TID 1271, slave04, partition 5,PROCESS_LOCAL, 2163 bytes)16/09/24 16:20:35 INFO TaskSetManager: Finished task 4.0 in stage 26.0 (TID 1270) in 691 ms on slave04 (1/40)16/09/24 16:20:35 INFO TaskSetManager: Starting task 6.0 in stage 26.0 (TID 1272, slave04, partition 6,PROCESS_LOCAL, 2163 bytes)16/09/24 16:20:35 INFO TaskSetManager: Finished task 5.0 in stage 26.0 (TID 1271) in 124 ms on slave04 (2/40)....16/09/24 16:20:39 INFO TaskSetManager: Starting task 35.0 in stage 26.0 (TID 1308, master, partition 35,PROCESS_LOCAL, 2163 bytes)16/09/24 16:20:39 INFO TaskSetManager: Finished task 33.0 in stage 26.0 (TID 1307) in 115 ms on master (39/40)16/09/24 16:20:39 INFO TaskSetManager: Finished task 35.0 in stage 26.0 (TID 1308) in 128 ms on master (40/40)16/09/24 16:20:39 INFO YarnScheduler: Removed TaskSet 26.0, whose tasks have all completed, from pool 16/09/24 16:20:39 INFO DAGScheduler: ResultStage 26 (count at <console>:36) finished in 4.734 s16/09/24 16:20:39 INFO DAGScheduler: Job 15 finished: count at <console>:36, took 4.853073 sres70: Long = 8


scala> ques01.filter{ l =>
     | l.contains(t)
     | }.collect().foreach(println)
16/09/24 16:28:17 INFO SparkContext: Starting job: collect at <console>:36
16/09/24 16:28:17 INFO DAGScheduler: Got job 16 (collect at <console>:36) with 40 output partitions
16/09/24 16:28:17 INFO DAGScheduler: Final stage: ResultStage 27 (collect at <console>:36)
16/09/24 16:28:17 INFO DAGScheduler: Parents of final stage: List()
16/09/24 16:28:17 INFO DAGScheduler: Missing parents: List()
16/09/24 16:28:17 INFO DAGScheduler: Submitting ResultStage 27 (MapPartitionsRDD[84] at filter at <console>:34), which has no missing parents
16/09/24 16:28:17 INFO MemoryStore: Block broadcast_26 stored as values in memory (estimated size 15.4 KB, free 270.1 KB)
16/09/24 16:28:17 INFO MemoryStore: Block broadcast_26_piece0 stored as bytes in memory (estimated size 6.9 KB, free 277.0 KB)
16/09/24 16:28:17 INFO BlockManagerInfo: Added broadcast_26_piece0 in memory on 192.168.202.104:56842 (size: 6.9 KB, free: 511.5 MB)
....
16/09/24 16:28:29 INFO TaskSetManager: Starting task 33.0 in stage 27.0 (TID 1347, master, partition 33,PROCESS_LOCAL, 2163 bytes)
16/09/24 16:28:29 INFO TaskSetManager: Finished task 29.0 in stage 27.0 (TID 1346) in 84 ms on master (38/40)
16/09/24 16:28:29 INFO TaskSetManager: Starting task 35.0 in stage 27.0 (TID 1348, master, partition 35,PROCESS_LOCAL, 2163 bytes)
16/09/24 16:28:29 INFO TaskSetManager: Finished task 33.0 in stage 27.0 (TID 1347) in 122 ms on master (39/40)
16/09/24 16:28:29 INFO TaskSetManager: Finished task 35.0 in stage 27.0 (TID 1348) in 442 ms on master (40/40)
16/09/24 16:28:29 INFO YarnScheduler: Removed TaskSet 27.0, whose tasks have all completed, from pool
16/09/24 16:28:29 INFO DAGScheduler: ResultStage 27 (collect at <console>:36) finished in 12.920 s
16/09/24 16:28:29 INFO DAGScheduler: Job 16 finished: collect at <console>:36, took 12.948956 s
我妈便秘十来年了,最近6.7年,老是觉得大肠有东西堵,拉不出,还痛,7.8年前做过两次大肠镜,说没事,就是炎症,开些药,这些年什么办法都试过,没大的效果,去医院看了几次,也建议做肠镜,我妈有心胀病,医院不敢做。特咨询您。谢谢
病情描述(发病时间、主要症状、症状变化等):72岁的肺癌病人现在肿块已经长大了,有心胀病,装了一个支架,能不能做肺穿刺检查?曾经治疗情况和效果:在医院已经确诊是肺癌想得到怎样的帮助:医生建议做个穿刺检查,有心胀病还装了支架的,我怕老人做穿刺受不了
您好主任:宝宝两个月今天体检说宝宝心胀有杂音,应该怎么办呀?会不会是先天性心胀病?
您好,宝宝两个月今天去扎疫苗医护人员听诊说宝宝心胀有杂音,去医大二院医生听了也说有轻微的杂音,我想问问应该怎么办?会不会是先天性心胀病?
你好,我想问问我爸他这段时间偶尔会感觉心闷乏力有时还蹲一会才能站起来,他年青的时候就有这种症状,不过没有现在平凡他这样是心胀病吗
先天性心胀病:房间隔缺失,心动过速
病情描述(发病时间、主要症状、症状变化等):请问先天性心胀病通过小时候7岁是做的手术,到现在18岁了,去年肺动脉高压查出。从去年开始吃的波生坦片享受中华慈善机构提供的波生坦片,今年因为低保办不到了,现在只能自己买药吃每个月4000元的药费。父母都是农民负担太大承受不了。但是他们也没放弃我。我想问的是咱们国内的这个药太贵了所以通过朋友得知可以到印度去买这个药,请问医生可以吗?曾经治疗情况和效果:自从吃了波生坦片也没有副着用 也稳定了 但是每次照片都看不到不知是怎么回事想得到怎样的帮助:请问医生可以到印度去买波生坦片吗?因为国内这个实在是太贵了。
我小孩有6个多月了,医院确认是复杂性先天性心胀病,我该怎么办
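The session recovers the question text by rendering the struct as a string, splitting on "," and keeping the first piece. That works for the sample row, but a question that itself contains an ASCII comma would be cut short. The keyword 心胀病 is also probably a misspelling of 心脏病 (heart disease), which may be why only 8 of the 1,342,920 questions match. The sketch below shows one hedged way around both points in plain Scala: cut at the last comma, since the timestamp contains none, and search for several spellings at once. `QuestionText`, `fromRendered` and `containsAny` are hypothetical names. Inside Spark, reading the nested string field directly would avoid the string parsing altogether.

```scala
object QuestionText {
  /** Extracts the question from a rendered struct such as "[text,2016-05-01 23:59:52]". */
  def fromRendered(rendered: String): String = {
    val inner = rendered.stripPrefix("[").stripSuffix("]")
    // The timestamp has no comma, so the last comma separates text from time.
    val cut = inner.lastIndexOf(',')
    if (cut >= 0) inner.substring(0, cut) else inner
  }

  /** True if the question contains at least one of the keywords. */
  def containsAny(question: String, keywords: Seq[String]): Boolean =
    keywords.exists(k => question.contains(k))
}
```

For example, `questions.filter(q => QuestionText.containsAny(q, Seq("心胀病", "心脏病")))` would keep questions written with either spelling.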


To be continued

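For the questions:

Build dictionaries of diseases, symptoms, examinations, drugs, body parts, sensation descriptors, and so on.

Analyze how these terms are distributed, for example the distribution of drugs and of diseases -> aggregate knowledge

Combine this with the user demographics to analyze the gender and age distribution of each disease -> aggregate knowledge

Explore which diseases, symptoms, examinations, drugs, body parts and sensation descriptors appear together in the same question. This co-occurrence analysis yields only coarse knowledge -> rough fine-grained knowledge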
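The co-occurrence step in the plan above can be sketched as follows: find which dictionary terms occur in each question, then count every unordered pair of terms that share a question. This is only an illustration in plain Scala under simple assumptions (substring matching, no word segmentation). `CoOccurrence`, `termsIn` and `pairCounts` are hypothetical names, and the dictionary is whatever term set gets built in the first step.

```scala
object CoOccurrence {
  /** Returns the dictionary terms that occur in the question (plain substring match). */
  def termsIn(question: String, dictionary: Set[String]): Set[String] =
    dictionary.filter(term => question.contains(term))

  /** Counts how often each unordered pair of terms appears in the same question. */
  def pairCounts(questions: Seq[String], dictionary: Set[String]): Map[(String, String), Int] =
    questions
      .flatMap { q =>
        // Sort so that each pair has one canonical order.
        val found = termsIn(q, dictionary).toSeq.sorted
        for {
          i <- found.indices
          j <- (i + 1) until found.size
        } yield (found(i), found(j))
      }
      .groupBy(identity)
      .map { case (pair, hits) => (pair, hits.size) }
}
```

Substring matching will overcount terms that are contained in longer terms, so a real run would probably need a proper Chinese word segmenter in front of it.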

For the answers:

.......................


Updated on 2016.9.24 (second analysis)



