Removing Duplicates from a Large Dataset

Source: Internet | Published by: 淘宝信誉小号查询 | Editor: 程序博客网 | Time: 2024/05/15 10:30

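There is a well-known exercise that asks how to eliminate duplicates from a large amount of data.
For example, given 10,000 records, how do you remove the duplicates quickly?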

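Some solutions sort the data first and then delete the duplicates one by one.
With quicksort, the sort costs O(n log n) on average.
After that you still need one more pass over the data, in which duplicates sit next to each other and can be dropped.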
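The sort-then-scan idea can be sketched in a few lines with the standard library. This is only an illustration, not the code from this post: the function name `dedupe_sorted` is made up here, the data is assumed to be integers that fit in memory, and a C++11 compiler is assumed. Note that `std::sort` is not required to be a plain quicksort, but it has the same O(n log n) cost.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Sort-based deduplication: O(n log n) for the sort plus one O(n) pass.
// The original input order is lost; the result is in ascending order.
// dedupe_sorted is a hypothetical name used only for this sketch.
std::vector<int> dedupe_sorted(std::vector<int> values) {
    std::sort(values.begin(), values.end());
    // After sorting, equal values are adjacent. std::unique keeps the first
    // element of each run and returns the end of the deduplicated range.
    values.erase(std::unique(values.begin(), values.end()), values.end());
    // Postcondition: no two neighbouring elements are equal.
    assert(std::adjacent_find(values.begin(), values.end()) == values.end());
    return values;
}
```

Calling `dedupe_sorted({3, 1, 3, 2, 1})` should give back the three values 1, 2, 3. The price of this approach is that the first-seen order of the input is not preserved.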

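Hashing seems to do better, since each lookup takes expected constant time and the whole job becomes a single pass.
The rough approach is:
Use a modulo hash function.
Pick the hash function, map every item to a bucket with it, and resolve collisions by chaining.
If A hashes to the same value as B, C, D, ..., compare A with each of them as strings. If A differs from all of B, C, D, ..., append A to the chain; if it equals one of them, discard A and move on to the next item.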
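To make the chaining steps concrete, here is a minimal sketch for string data. Everything in it is an assumption made for illustration: the names `bucket_of` and `dedupe_chained`, the multiply-by-31 string hash, and the default bucket count of 1021 are not from the original post, and a C++11 compiler is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <vector>

// Hypothetical helper: a simple polynomial string hash reduced modulo the
// number of buckets (the "modulo hash function" step).
std::size_t bucket_of(const std::string& s, std::size_t bucket_count) {
    std::size_t h = 0;
    for (unsigned char c : s) h = h * 31 + c;
    return h % bucket_count;
}

// Returns the items with duplicates removed, keeping the first occurrence
// of each item in its original position.
std::vector<std::string> dedupe_chained(const std::vector<std::string>& items,
                                        std::size_t bucket_count = 1021) {
    assert(bucket_count > 0);
    std::vector<std::list<std::string>> buckets(bucket_count);
    std::vector<std::string> unique_items;
    for (const std::string& item : items) {
        std::list<std::string>& chain = buckets[bucket_of(item, bucket_count)];
        // Same bucket does not mean same string: compare against every entry
        // already stored in this chain.
        bool seen = false;
        for (const std::string& other : chain) {
            if (other == item) { seen = true; break; }
        }
        if (!seen) {
            chain.push_back(item);         // new item: add it to the chain
            unique_items.push_back(item);  // and keep it in the result
        }
        // Otherwise the item is a duplicate and is simply skipped.
    }
    return unique_items;
}
```

The result stays correct even with a single bucket, where every item collides; the chains just get long and the run time degrades toward O(n²). A bucket count in the same range as the number of distinct items keeps the chains short.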

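Below I implemented it with map and set. It is still very fast; my test data had 100,000 records.
The map holds (hashed number, set) pairs.
Each set holds the test values that fall into that bucket.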

The code is below; it was tested with VS2005:

#include <cstdlib>
#include <fstream>
#include <iostream>
#include <map>
#include <set>

using namespace std;

int main()
{
    ifstream inf("aa.txt");   // input: whitespace-separated integers
    ofstream outf("bb.txt");  // output: each distinct integer, in first-seen order
    if (!inf || !outf)
    {
        cerr << "Cannot open files" << endl;
        return EXIT_FAILURE;
    }

    typedef set<int> collision_data;
    // Maps a bucket number (value % Ihash) to the set of values seen in that bucket.
    // Storing the sets by value avoids the manual new and the resulting leak.
    map<int, collision_data> hash_data;
    const int Ihash = 323;
    int itemp;

    while (inf >> itemp)
    {
        int imod = itemp % Ihash;
        // operator[] creates an empty set the first time a bucket is used.
        collision_data &bucket = hash_data[imod];
        // insert().second is true only if the value was not already in the set,
        // so each number is written to the output file exactly once.
        if (bucket.insert(itemp).second)
        {
            outf << itemp << '\n';
        }
    }

    return 0;
}
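One possible alternative, if a C++11 compiler is available (VS2005 predates it): `std::map` and `std::set` are ordered, tree-based containers, so each operation costs O(log n), while `std::unordered_set` is a real hash table with expected O(1) insertion. The sketch below swaps in that container; the function name `dedupe_keep_order` is made up here, and it works on an in-memory vector rather than on the files used above.

```cpp
#include <cassert>
#include <unordered_set>
#include <vector>

// Single-pass deduplication with a hash set, keeping first-seen order.
// dedupe_keep_order is a hypothetical name used only for this sketch.
std::vector<int> dedupe_keep_order(const std::vector<int>& values) {
    std::unordered_set<int> seen;
    std::vector<int> result;
    for (int v : values) {
        // insert().second is true only when v was not already in the set.
        if (seen.insert(v).second) result.push_back(v);
    }
    // Every distinct value was appended exactly once.
    assert(result.size() == seen.size());
    return result;
}
```

For the input 5, 3, 5, -2, 3, 5 this should return 5, 3, -2. The container handles bucketing and collisions internally, so the hand-picked modulus 323 is no longer needed.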

If you have a better idea, please let me know.
