Processing Large Data Files with Scripts

Source: Internet. Editor: 程序博客网. Time: 2024/05/19 10:34

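While handling a production issue at XX Bank, I once ran into a file tens of GB in size in which tens of thousands of records were wrong, and there was no time to ship an emergency release. We ended up fixing the file with scripts, which took roughly a few minutes.
The process is described in detail below.
The core of the approach is to use sed to replace the bad data in the problem file with the correct data.
1. Because the file holds so much content, first split it by column into multiple files to guard against accidental changes (every line in the problem file has the same fixed length).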
cutfile.sh

#!/bin/ksh
# Split a fixed-width file into column blocks so that each date field can be fixed on its own.
if [ $# -ne 1 ]
then
    echo "FORMAT: cutfile.sh inputfile !"
    exit 1
fi
filename=$1
# Keep the header line apart; only the body is cut.
head -n 1  "$filename" > .filehead.txt
tail -n +2 "$filename" > .filebody.txt
cut -c1-44     .filebody.txt  > .fileblock_1
cut -c45-52    .filebody.txt  > fileblock_2_gmt_date.txt        # gmt_date
cut -c53-68    .filebody.txt  > .fileblock_3
cut -c69-76    .filebody.txt  > fileblock_4_local_date.txt      # local_date
cut -c77-82    .filebody.txt  > .fileblock_5
cut -c83-90    .filebody.txt  > fileblock_6_settlement_date.txt # settlement_date
cut -c91-98    .filebody.txt  > fileblock_7_capture_date.txt    # capture_date
cut -c99-294   .filebody.txt  > .fileblock_8
cut -c295-302  .filebody.txt  > fileblock_9_orig_date.txt       # orig_date
cut -c303-     .filebody.txt  > .fileblock_10
# Marker file: tells the later scripts that the split has been done.
touch .CUT.OK
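Cutting by column only works if every body line really has the same length. The following is a minimal sketch of a check that could be run before cutfile.sh, assuming awk is available; the sample file name `sample.txt` and its contents are made up for illustration and are not part of the original scripts.

```shell
#!/bin/sh
# Hypothetical sample: one header line followed by fixed-width body lines.
workdir=$(mktemp -d)
printf 'HEADER\nAAAA20240101\nBBBB20240102\nCCCC20240103\n' > "$workdir/sample.txt"

# Count how many distinct line lengths occur in the body (line 2 onward).
# A result of 1 means every body line has the same length.
distinct_lengths=$(tail -n +2 "$workdir/sample.txt" |
    awk '{ seen[length($0)] = 1 } END { n = 0; for (k in seen) n++; print n }')

if [ "$distinct_lengths" -eq 1 ]; then
    echo "body is fixed-width: safe to cut by column"
else
    echo "body has $distinct_lengths different line lengths: do not cut by column" >&2
fi
rm -rf "$workdir"
```

On a file of tens of GB this is one extra sequential pass, which is cheap compared with repairing a file that was cut at the wrong offsets.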

2. Tally the values in each split file that contains errors, to find out how many times each bad value occurs.
uniqfile.sh

#!/bin/ksh
# Print every distinct line of the given file together with its occurrence count.
if [ $# -ne 1 ]
then
    echo "FORMAT: uniqfile.sh inputfile !"
    exit 1
fi
filename=$1
sort "$filename" | uniq -c
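To show what this tally looks like on a date column, here is a small sketch. The sample dates and the choice of `20230101` as the bad value are assumptions for illustration only; the extra `sort -rn` simply puts the most frequent value first.

```shell
#!/bin/sh
# Hypothetical contents of one split date column.
workdir=$(mktemp -d)
datefile="$workdir/fileblock_2_gmt_date.txt"
printf '20240101\n20240101\n20230101\n20240101\n20230101\n' > "$datefile"

# Each distinct value with its count, most frequent first.
summary=$(sort "$datefile" | uniq -c | sort -rn)
echo "$summary"

# Count the lines holding one specific suspicious value (assumed bad here).
bad_count=$(grep -c '^20230101$' "$datefile")
echo "lines with the suspicious value: $bad_count"
rm -rf "$workdir"
```

In a date column most lines usually share a few values, so the rare values at the bottom of the list are the natural candidates to inspect.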

3. Run the replacement on each split file that contains errors.
sedfile.sh

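#!/bin/ksh
# Replace the first match of replacebefore with replaceafter on every line of a split file.
if [ $# -ne 3 ]
then
    echo "FORMAT: sedfile.sh filename replacebefore replaceafter!"
    exit 1
fi
if [ ! -f .CUT.OK ]
then
    echo "please cut file,first!"
    exit 1
fi
filename=$1
repstr1=$2
repstr2=$3
# Write to a temporary file and replace the original only if sed succeeded.
sed "s/${repstr1}/${repstr2}/" "$filename" > .sedtmpfile && mv .sedtmpfile "$filename"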
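Because each line of a split date file holds exactly one field, the pattern can also be anchored to the whole line so that a partial match is never rewritten. This is a sketch of that variant, not the script above; the values `20230101` and `20240101` are hypothetical. It also checks that the file size is unchanged, which must hold when the old and new values have the same length.

```shell
#!/bin/sh
# Hypothetical split date column with two bad lines.
workdir=$(mktemp -d)
datefile="$workdir/fileblock_2_gmt_date.txt"
printf '20240101\n20230101\n20240102\n20230101\n' > "$datefile"

old='20230101'   # assumed bad value
new='20240101'   # assumed correct value, same length as the bad one

before_bytes=$(wc -c < "$datefile")
# ^ and $ anchor the match to the entire line.
sed "s/^${old}\$/${new}/" "$datefile" > "$datefile.tmp" && mv "$datefile.tmp" "$datefile"
after_bytes=$(wc -c < "$datefile")

# grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
remaining=$(grep -c "^${old}\$" "$datefile" || true)
echo "bad lines left: $remaining"
fixed_content=$(cat "$datefile")
rm -rf "$workdir"
```

An unchanged byte count matters here because the later merge step assumes every column keeps its fixed width.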

4. Merge the processed files to produce the corrected file.
pastefile.sh

#!/bin/ksh
# Merge the column blocks back into one file and restore the header line.
if [ $# -ne 1 ]
then
    echo "FORMAT: pastefile.sh inputfile !"
    exit 1
fi
if [ ! -f .CUT.OK ]
then
    echo "please cut file,first!"
    exit 1
fi
filename=$1
TAB=$(printf '\t')
# paste joins the columns with a tab by default; the tabs are stripped afterwards.
# If the file itself contains tabs, give paste another delimiter such as @ instead,
# so check before running with: grep "$TAB" filename | wc -l
#paste -d @ .fileblock_1                \
paste .fileblock_1                     \
      fileblock_2_gmt_date.txt         \
      .fileblock_3                     \
      fileblock_4_local_date.txt       \
      .fileblock_5                     \
      fileblock_6_settlement_date.txt  \
      fileblock_7_capture_date.txt     \
      .fileblock_8                     \
      fileblock_9_orig_date.txt        \
      .fileblock_10 > .filebodytmp.txt
#sed -e 's/@//g' .filebodytmp.txt >  .filenewbody.txt
sed -e "s/${TAB}//g" .filebodytmp.txt >  .filenewbody.txt
cat .filehead.txt .filenewbody.txt > "${filename}.new"
# Clean up the intermediate files and the marker file.
rm -f .fileblock_1                     \
      fileblock_2_gmt_date.txt        \
      .fileblock_3                     \
      fileblock_4_local_date.txt      \
      .fileblock_5                     \
      fileblock_6_settlement_date.txt \
      fileblock_7_capture_date.txt    \
      .fileblock_8                     \
      fileblock_9_orig_date.txt       \
      .fileblock_10
rm -f .filehead.txt .filebody.txt .filenewbody.txt .filebodytmp.txt .CUT.OK
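Before the rebuilt file replaces the original, it is worth confirming that only the intended bytes changed. The sketch below runs a miniature split, fix, and merge on a made-up three-column file and then compares the result with the input. The file names, column offsets, and data are all assumptions, and `tr -d '\t'` stands in for the sed tab removal used above.

```shell
#!/bin/sh
# Hypothetical input: header plus two fixed-width lines (4 + 8 + 2 characters).
w=$(mktemp -d)
printf 'HEADER\nAAAA20230101ZZ\nBBBB20240102YY\n' > "$w/input.txt"

# Miniature version of the split, fix, and merge steps.
head -n 1  "$w/input.txt" > "$w/head.txt"
tail -n +2 "$w/input.txt" > "$w/body.txt"
cut -c1-4  "$w/body.txt" > "$w/block_1"
cut -c5-12 "$w/body.txt" > "$w/block_2"
cut -c13-  "$w/body.txt" > "$w/block_3"
sed 's/^20230101$/20240101/' "$w/block_2" > "$w/block_2.tmp" && mv "$w/block_2.tmp" "$w/block_2"
paste "$w/block_1" "$w/block_2" "$w/block_3" | tr -d '\t' > "$w/newbody.txt"
cat "$w/head.txt" "$w/newbody.txt" > "$w/input.txt.new"

# Same-length replacements must leave line count and byte count unchanged.
orig_lines=$(wc -l < "$w/input.txt")
new_lines=$(wc -l < "$w/input.txt.new")
orig_bytes=$(wc -c < "$w/input.txt")
new_bytes=$(wc -c < "$w/input.txt.new")

# cmp -l prints one line per differing byte, so this counts the changed bytes.
changed_bytes=$(cmp -l "$w/input.txt" "$w/input.txt.new" | wc -l)
echo "lines: $orig_lines -> $new_lines, bytes: $orig_bytes -> $new_bytes, changed bytes: $changed_bytes"
rm -rf "$w"
```

If the line or byte counts differ, a column was probably lost or a replacement changed a field's width, and the `.new` file should not be used.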
