Optimize map performance with mapPartitions
Source: Internet  Editor: 程序博客网  Time: 2024/05/16 12:43
As we saw in the previous article, "CSV Parser", we may need to create a new object for each record of an RDD, as in:
    def mLine(line: String) = {
      val parser = new CSVParser('\t')
      parser.parseLine(line)
    }
    ...
    ...myRDD.map(mLine(_).size)...
The mLine function is used in the map method of an RDD. In this case a new parser object is created for every record, even though all of these objects are identical.
In fact, whenever we apply a complicated operation to each record, there is a good chance we end up creating helper objects inside map. By combining mapPartitions with Scala's own map on the partition iterator, we can avoid this unnecessary object creation: the helper is built once per partition instead of once per record. Let's rewrite the above example with mapPartitions:
    def pLines(lines: Iterator[String]) = {
      val parser = new CSVParser('\t')
      lines.map(parser.parseLine(_).size)
    }
    ...
    myRDD.mapPartitions(pLines)
On my single-box test machine, the execution time of the same task dropped from 65 seconds to 35 seconds. Surprisingly, the opencsv parser with the mapPartitions optimization is significantly faster than map(_.split('\t')).
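To see where the saving comes from without a Spark cluster, here is a minimal sketch in plain Scala. It rests on several assumptions: TabParser is a hypothetical stand-in for CSVParser (it is not part of opencsv) that simply counts how often it is constructed, and the "partitions" are ordinary Scala collections rather than a real RDD. It only illustrates how many helper objects each style creates and says nothing about actual run times.

```scala
// Hypothetical stand-in for an expensive helper such as CSVParser.
// The companion object counts how many instances have been constructed.
object TabParser {
  var created = 0
}

class TabParser {
  TabParser.created += 1
  def parseLine(line: String): Array[String] = line.split("\t", -1)
}

object MapPartitionsSketch {
  // Two "partitions", modelled as plain collections (assumption: not a Spark RDD).
  val partitions: Seq[Seq[String]] = Seq(
    Seq("a\tb\tc", "d\te"),
    Seq("f", "g\th\ti\tj")
  )

  // Per-record style, as in mLine: a new parser for every line.
  def perRecord(): (Seq[Int], Int) = {
    TabParser.created = 0
    val sizes = partitions.flatMap(_.map(line => new TabParser().parseLine(line).size))
    (sizes, TabParser.created)
  }

  // Per-partition style, as in pLines: one parser shared by all lines of an iterator.
  def perPartition(): (Seq[Int], Int) = {
    TabParser.created = 0
    val sizes = partitions.flatMap { part =>
      val parser = new TabParser()
      part.iterator.map(parser.parseLine(_).size).toList
    }
    (sizes, TabParser.created)
  }

  def main(args: Array[String]): Unit = {
    println(perRecord())
    println(perPartition())
  }
}
```

Both functions return the same field counts for the four sample lines, but the per-record version constructs four parsers while the per-partition version constructs two, one for each partition. On a real RDD the gap grows with the number of records per partition.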