Multithreaded Programming Study Notes: Summing a Massive Data Set

Source: Internet · Editor: 程序博客网 · Date: 2024/06/05 15:26

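Background

There are roughly 1,000,000 tasks to process. Running them all on a single thread takes far too long, so the idea is to have the main thread create several worker threads and process the tasks in parallel. This article uses a long summation loop as the example.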

Single thread

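#include <iostream>
#include <stdint.h>                                     // for uint64_t
#include <boost/date_time/posix_time/posix_time.hpp>    // for ptime, microsec_clock

const int max_sum_item = 1000000000;

int main()
{
    boost::posix_time::ptime start = boost::posix_time::microsec_clock::local_time();

    // Sum 0 + 1 + ... + (max_sum_item - 1) on the main thread.
    uint64_t result = 0;
    for (int i = 0; i < max_sum_item; i++)
        result += i;
    std::cout << "sum=" << result << std::endl;

    boost::posix_time::ptime end = boost::posix_time::microsec_clock::local_time();
    boost::posix_time::time_duration timeTaken = end - start;
    std::cout << "cost time:" << timeTaken.total_milliseconds() << std::endl;
}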

The output is as follows (time in milliseconds):
sum=499999999500000000
cost time:4061
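The printed sum can be checked independently of the loop. A minimal sketch, assuming the standard closed form n(n-1)/2 for 0 + 1 + ... + (n-1); the helper name closed_form_sum is hypothetical and not part of the original code:

```cpp
#include <cassert>
#include <cstdint>

// Closed-form value of 0 + 1 + ... + (n - 1).
// n * (n - 1) must fit in 64 bits; for n = 1,000,000,000 the product is
// about 1e18, which does.
constexpr std::uint64_t closed_form_sum(std::uint64_t n)
{
    return n * (n - 1) / 2;
}

// Compile-time check against the value printed by the program above.
static_assert(closed_form_sum(1000000000ULL) == 499999999500000000ULL,
              "closed form matches the printed sum");
```

Any of the later multithreaded variants can be compared against this value to confirm that no part of the range was skipped or counted twice.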
What happens if the same task is handed to a newly created thread instead?
Code:

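#include <iostream>
#include <stdint.h>                                     // for uint64_t
#include <boost/thread.hpp>                             // for boost::thread
#include <boost/date_time/posix_time/posix_time.hpp>    // for ptime, microsec_clock

const int max_sum_item = 1000000000;

// Worker: writes the running sum through the pointer on every iteration.
void do_sum(uint64_t *total)
{
    *total = 0;
    for (int i = 0; i < max_sum_item; i++)
        *total += i;
}

int main()
{
    boost::posix_time::ptime start = boost::posix_time::microsec_clock::local_time();

    uint64_t result = 0;
    boost::thread worker(do_sum, &result);   // run the sum on a new thread
    worker.join();                           // wait for it to finish
    std::cout << "sum=" << result << std::endl;

    boost::posix_time::ptime end = boost::posix_time::microsec_clock::local_time();
    boost::posix_time::time_duration timeTaken = end - start;
    std::cout << "cost time:" << timeTaken.total_milliseconds() << std::endl;
}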

Output:
sum=499999999500000000
cost time:4346
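As the numbers show, doing the computation on a newly created thread instead of on main's thread takes slightly longer, since creating and destroying the thread has a cost. But nearly 300 ms seems like a lot of overhead for that.
The do_sum function can be optimized as follows: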

void do_sum(uint64_t *total)
{
    uint64_t localTotal = 0;              // accumulate in a local variable
    for (int i = 0; i < max_sum_item; i++)
        localTotal += i;
    *total = localTotal;                  // write through the pointer only once
}

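With the optimized do_sum, the timing is:
sum=499999999500000000
cost time:4068
The difference comes from the loop body, not from thread creation. The unoptimized do_sum updates the result through the pointer on every iteration (*total += i;), and that repeated indirect memory access costs more than the addition itself. A local variable such as localTotal can typically be kept in a register, so the better approach is to accumulate locally inside the function and write the result through the pointer total only once, at the end.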

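Multiple threads

Note that C++11 lambda expressions require GCC/G++ 4.5 or later. G++ 4.4 does not support them and fails at compile time, so keep this in mind. See http://gcc.gnu.org/projects/cxx0x.html.
With a compiler that does support them, the partial sums can be added up with a lambda: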

std::for_each(part_sums.begin(), part_sums.end(), [&result] (uint64_t *subtotal) { result += *subtotal; });
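For compilers without lambda support, the same reduction can be written with std::accumulate and an ordinary function. This is a sketch, not the original author's code; the names add_pointee and sum_parts are hypothetical. The initial value is spelled as a 64-bit zero on purpose: passing a plain 0 would make std::accumulate add everything up in an int.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical helper: adds the value behind one pointer to a running total.
static std::uint64_t add_pointee(std::uint64_t running, const std::uint64_t *part)
{
    return running + *part;
}

// Sums the values that the pointers in `parts` refer to.
std::uint64_t sum_parts(const std::vector<std::uint64_t *> &parts)
{
    // The type of the third argument decides the accumulator type,
    // so it has to be a 64-bit zero rather than a plain int 0.
    return std::accumulate(parts.begin(), parts.end(),
                           std::uint64_t(0), add_pointee);
}
```

With the global vector used in this article, the call would be result = sum_parts(part_sums);.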

The code is as follows:

// Uses the headers and max_sum_item from the previous listing, plus <vector>.
std::vector<uint64_t *> part_sums;
const int threads_to_use = 2;

// Worker: sums sums_to_do consecutive integers starting at start_val.
void do_partial_sum(uint64_t *final, int start_val, int sums_to_do)
{
    uint64_t sub_result = 0;
    for (int i = start_val; i < start_val + sums_to_do; i++)
        sub_result += i;
    *final = sub_result;
}

int main()
{
    boost::posix_time::ptime start = boost::posix_time::microsec_clock::local_time();

    // One heap-allocated result slot per thread.
    part_sums.clear();
    for (int i = 0; i < threads_to_use; i++)
    {
        part_sums.push_back(new uint64_t(0));
    }

    // Start one thread per equally sized slice of the range.
    std::vector<boost::thread *> t;
    int sums_per_thread = max_sum_item / threads_to_use;
    for (int start_val = 0, i = 0; start_val < max_sum_item; start_val += sums_per_thread, i++)
    {
        t.push_back(new boost::thread(do_partial_sum, part_sums[i], start_val, sums_per_thread));
    }

    for (int i = 0; i < threads_to_use; i++)
        t[i]->join();

    // Add up the partial results stored in the vector.
    uint64_t result = 0;
    for (int i = 0; i < threads_to_use; i++)
    {
        uint64_t *temp = part_sums[i];
        result += *temp;   // the vector holds pointers, so dereference
    }

    for (int i = 0; i < threads_to_use; i++)
    {
        delete t[i];
        delete part_sums[i];
    }

    std::cout << "sum=" << result << std::endl;
    boost::posix_time::ptime end = boost::posix_time::microsec_clock::local_time();
    boost::posix_time::time_duration timeTaken = end - start;
    std::cout << "cost time:" << timeTaken.total_milliseconds() << std::endl;
}

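With two threads, the output is:
sum=499999999500000000
cost time:1907
The speedup is very noticeable.
Note that the summation over the vector above can also be written more compactly as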

for (std::vector<uint64_t *>::iterator it = part_sums.begin(); it != part_sums.end(); ++it)
    result += **it;   // *it is a pointer, **it is the partial sum
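The listing above manages threads and result slots with new and delete. As a hedged alternative, the same pattern can be sketched in C++11 with std::thread objects and partial sums stored by value, so nothing has to be freed by hand. The function name parallel_sum and its parameters are assumptions made for this illustration; it uses std::thread rather than boost::thread and gives any leftover items to the last worker.

```cpp
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Sums the integers in [0, n) using `threads` worker threads (threads >= 1).
std::uint64_t parallel_sum(int n, int threads)
{
    std::vector<std::uint64_t> partial(threads, 0);   // one slot per worker
    std::vector<std::thread> workers;
    workers.reserve(threads);
    const int per_thread = n / threads;

    for (int t = 0; t < threads; ++t) {
        const int begin = t * per_thread;
        // The last worker also takes the remainder of the division.
        const int end = (t == threads - 1) ? n : begin + per_thread;
        workers.emplace_back([&partial, t, begin, end] {
            std::uint64_t local = 0;        // accumulate locally, write once
            for (int i = begin; i < end; ++i)
                local += i;
            partial[t] = local;
        });
    }

    for (std::thread &w : workers)
        w.join();

    // All workers have finished, so reading the slots is safe here.
    std::uint64_t total = 0;
    for (std::uint64_t p : partial)
        total += p;
    return total;
}
```

Calling parallel_sum(1000000000, 2) corresponds to the two-thread run above. Each worker writes only its own slot and only once, so no mutex is needed for the results.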

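Dividing the work among the threads

Take const int max_sum_item = 1000000000; from above. With 7 threads, each thread would be responsible for 142,857,142.8 items. We round this down to 142,857,142, so the 7 threads together cover 999,999,994 items. The remaining 6 items at the tail are assigned to the last thread.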

int sums_per_thread = max_sum_item / threads_to_use;
for (int start_val = 0, i = 0; start_val < max_sum_item; start_val += sums_per_thread, i++)
{
    // Lump extra bits onto last thread if work items is not equally divisible by number of threads
    int sums_to_do = sums_per_thread;
    // Tail handling: more than one slice but less than two slices remain.
    if (start_val + sums_per_thread < max_sum_item && start_val + sums_per_thread * 2 > max_sum_item)
        sums_to_do = max_sum_item - start_val;
    t.push_back(new std::thread(do_partial_sum, part_sums[i], start_val, sums_to_do));
    // Stop as soon as the first non-standard workload has been handed out.
    // The last thread's workload is larger than one standard slice; without
    // this break the loop would run once more with start_val = 999,999,994
    // and create an unnecessary, incorrect extra thread.
    if (sums_to_do != sums_per_thread)
        break;
}
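The same partitioning rule can also be computed up front, separately from thread creation. The sketch below is an illustration under that assumption; split_work is a hypothetical helper, not something the original code defines. It always returns exactly one (start, count) pair per thread, with the remainder added to the last pair.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Returns one (start, count) pair per thread for `total` work items.
// Every thread gets total / threads items; the last one also gets the
// remainder, so the counts always add up to `total`.
std::vector<std::pair<int, int> > split_work(int total, int threads)
{
    std::vector<std::pair<int, int> > chunks;
    const int per_thread = total / threads;
    for (int t = 0; t < threads; ++t) {
        const int start = t * per_thread;
        const int count = (t == threads - 1) ? total - start : per_thread;
        chunks.push_back(std::make_pair(start, count));
    }
    return chunks;
}
```

For 1,000,000,000 items and 7 threads, the last pair is (857142852, 142857148), which matches the workload of the last thread in the trace output later in this article.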

The complete code is as follows (using 7 threads):

#include <iostream>
#include <vector>
#include <stdint.h>                                     // for uint64_t
#include <boost/thread.hpp>                             // for boost::thread
#include <boost/date_time/posix_time/posix_time.hpp>    // for ptime, microsec_clock

const int max_sum_item = 1000000000;
std::vector<uint64_t *> part_sums;
const int threads_to_use = 7;

// Worker: sums sums_to_do consecutive integers starting at start_val.
void do_partial_sum(uint64_t *final, int start_val, int sums_to_do)
{
    uint64_t sub_result = 0;
    for (int i = start_val; i < start_val + sums_to_do; i++)
        sub_result += i;
    *final = sub_result;
}

int main()
{
    boost::posix_time::ptime start = boost::posix_time::microsec_clock::local_time();
    uint64_t result = 0;

    part_sums.clear();
    for (int i = 0; i < threads_to_use; i++)
    {
        part_sums.push_back(new uint64_t(0));
    }

    std::vector<boost::thread *> t;
    int sums_per_thread = max_sum_item / threads_to_use;
    for (int start_val = 0, i = 0; start_val < max_sum_item; start_val += sums_per_thread, i++)
    {
        // Lump extra bits onto last thread if work items is not equally divisible by number of threads
        int sums_to_do = sums_per_thread;
        // Tail handling: more than one slice but less than two slices remain.
        if (start_val + sums_per_thread < max_sum_item && start_val + sums_per_thread * 2 > max_sum_item)
            sums_to_do = max_sum_item - start_val;
        t.push_back(new boost::thread(do_partial_sum, part_sums[i], start_val, sums_to_do));
        if (sums_to_do != sums_per_thread)
            break;
    }

    for (int i = 0; i < threads_to_use; i++)
        t[i]->join();

    // Add up the partial results stored in the vector.
    for (int i = 0; i < threads_to_use; i++)
    {
        uint64_t *temp = part_sums[i];
        result += *temp;
    }

    for (int i = 0; i < threads_to_use; i++)
    {
        delete t[i];
        delete part_sums[i];
    }

    std::cout << "sum=" << result << std::endl;
    boost::posix_time::ptime end = boost::posix_time::microsec_clock::local_time();
    boost::posix_time::time_duration timeTaken = end - start;
    std::cout << "cost time:" << timeTaken.total_milliseconds() << std::endl;
    return 0;
}

The output is as follows:

sum=499999999500000000
cost time:546

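The speedup turns out to be more than 7x (4061 ms / 546 ms ≈ 7.4). Why is that? Could it be that, with fewer items per thread, the processing time does not scale linearly with the number of items? Note that the single-thread run in the benchmark loop later in this article took 3874 ms, against which the 7-thread time of 552 ms is almost exactly 7x, so run-to-run variation in the baseline may be part of the answer. Additions and discussion are welcome.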

Recording the time taken by each thread

Main code:

std::vector<uint64_t *> part_sums;
boost::mutex coutmutex;   // synchronization object guarding std::cout
const int threads_to_use = 7;

void do_partial_sum(uint64_t *final, int start_val, int sums_to_do)
{
    // You can simply output text to cout or a file stream, but stream
    // operations in C++ are not atomic, so you must wrap their use in a
    // synchronization object. All uses of std::cout must be wrapped in mutex
    // locks as provided by the lock() method of std::mutex (or boost::mutex).
    coutmutex.lock();
    std::cout << "Start: TID " << boost::this_thread::get_id() << " starting at " << start_val
              << ", workload of " << sums_to_do << " items" << std::endl;
    coutmutex.unlock();

    boost::posix_time::ptime start = boost::posix_time::microsec_clock::local_time();
    uint64_t sub_result = 0;
    for (int i = start_val; i < start_val + sums_to_do; i++)
        sub_result += i;
    *final = sub_result;
    boost::posix_time::ptime end = boost::posix_time::microsec_clock::local_time();
    boost::posix_time::time_duration timeTaken = end - start;

    coutmutex.lock();
    std::cout << "End  : TID " << boost::this_thread::get_id() << " with result " << sub_result
              << ", time taken " << timeTaken.total_milliseconds() << std::endl;
    coutmutex.unlock();   // without this unlock, the other threads would wait forever
}

The main function is the same as in the previous example.

The output is as follows:
Start: TID 7f7a85a6a700 starting at 142857142, workload of 142857142 items
Start: TID 7f7a84668700 starting at 428571426, workload of 142857142 items
Start: TID 7f7a83266700 starting at 714285710, workload of 142857142 items
Start: TID 7f7a82865700 starting at 857142852, workload of 142857148 items
Start: TID 7f7a8646b700 starting at 0, workload of 142857142 items
Start: TID 7f7a85069700 starting at 285714284, workload of 142857142 items
Start: TID 7f7a83c67700 starting at 571428568, workload of 142857142 items
End : TID 7f7a85a6a700 with result 30612244459183675, time taken 542
End : TID 7f7a82865700 with result 132653065561224474, time taken 543
End : TID 7f7a84668700 with result 71428570500000003, time taken 543
End : TID 7f7a8646b700 with result 10204081438775511, time taken 544
End : TID 7f7a83266700 with result 112244896540816331, time taken 544
End : TID 7f7a83c67700 with result 91836733520408167, time taken 582
End : TID 7f7a85069700 with result 51020407479591839, time taken 583
sum=499999999500000000
cost time:583
As mentioned above, stream output is not an atomic operation, so remember to take the lock around it.
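The explicit lock() and unlock() pairs above depend on every code path reaching the unlock. A common alternative is a scoped lock, sketched here with std::lock_guard from C++11 (boost::lock_guard is the Boost counterpart). The names output_mutex and print_line are hypothetical, and the output stream is passed in as a parameter only to keep the example easy to test.

```cpp
#include <cassert>
#include <mutex>
#include <ostream>
#include <sstream>
#include <string>

std::mutex output_mutex;   // hypothetical name; plays the role of coutmutex

// Writes one complete line to `out` while holding the mutex, so lines from
// different threads that share this function are not interleaved.
void print_line(std::ostream &out, const std::string &text)
{
    std::lock_guard<std::mutex> guard(output_mutex);   // locks here
    out << text << '\n';
}   // guard's destructor unlocks here, even if the insertion throws
```

A worker would call print_line(std::cout, "...") in place of each lock, print, unlock sequence. Because the unlock happens in the destructor, a forgotten unlock() can no longer leave the other threads waiting.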
Other parts remain to be added.

Complete code

C++11 version

#include <iostream>       // for std::cout
#include <cstdint>        // for uint64_t
#include <chrono>         // for std::chrono::high_resolution_clock
#include <thread>         // for std::thread
#include <vector>         // for std::vector
#include <algorithm>      // for std::for_each
#include <cassert>        // for assert

#define TRACE

#ifdef TRACE
#include <mutex>          // for std::mutex
std::mutex coutmutex;
#endif

std::vector<uint64_t *> part_sums;
const int max_sum_item = 1000000000;
const int threads_to_use = 7;

void do_partial_sum(uint64_t *final, int start_val, int sums_to_do)
{
#ifdef TRACE
    coutmutex.lock();
    std::cout << "Start: TID " << std::this_thread::get_id() << " starting at " << start_val
              << ", workload of " << sums_to_do << " items" << std::endl;
    coutmutex.unlock();
    auto start = std::chrono::high_resolution_clock::now();
#endif

    uint64_t sub_result = 0;
    for (int i = start_val; i < start_val + sums_to_do; i++)
        sub_result += i;
    *final = sub_result;

#ifdef TRACE
    auto end = std::chrono::high_resolution_clock::now();
    coutmutex.lock();
    std::cout << "End  : TID " << std::this_thread::get_id() << " with result " << sub_result << ", time taken "
              << (end - start).count() * ((double) std::chrono::high_resolution_clock::period::num / std::chrono::high_resolution_clock::period::den) << std::endl;
    coutmutex.unlock();
#endif
}

int main()
{
    part_sums.clear();
    for (int i = 0; i < threads_to_use; i++)
        part_sums.push_back(new uint64_t(0));

    std::vector<std::thread *> t;
    int sums_per_thread = max_sum_item / threads_to_use;

    auto start = std::chrono::high_resolution_clock::now();

    for (int start_val = 0, i = 0; start_val < max_sum_item; start_val += sums_per_thread, i++)
    {
        // Lump extra bits onto last thread if work items is not equally divisible by number of threads
        int sums_to_do = sums_per_thread;
        if (start_val + sums_per_thread < max_sum_item && start_val + sums_per_thread * 2 > max_sum_item)
            sums_to_do = max_sum_item - start_val;
        t.push_back(new std::thread(do_partial_sum, part_sums[i], start_val, sums_to_do));
        if (sums_to_do != sums_per_thread)
            break;
    }

    for (int i = 0; i < threads_to_use; i++)
        t[i]->join();

    uint64_t result = 0;
    std::for_each(part_sums.begin(), part_sums.end(), [&result] (uint64_t *subtotal) { result += *subtotal; });

    auto end = std::chrono::high_resolution_clock::now();

    for (int i = 0; i < threads_to_use; i++)
    {
        delete t[i];
        delete part_sums[i];
    }

    assert(result == uint64_t(499999999500000000));
    std::cout << "Result is correct" << std::endl;
    std::cout << "Time taken: " << (end - start).count() * ((double) std::chrono::high_resolution_clock::period::num / std::chrono::high_resolution_clock::period::den) << std::endl;
}
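The listing above converts clock ticks to seconds by multiplying count() by period::num / period::den. As a sketch of an equivalent, shorter spelling, std::chrono can do the conversion itself through a floating-point duration. The helper names to_seconds and time_seconds are hypothetical, and steady_clock is used here as an assumption because it never jumps backwards; the original uses high_resolution_clock.

```cpp
#include <cassert>
#include <chrono>

// Converts any std::chrono duration to seconds as a double.
template <class Rep, class Period>
double to_seconds(std::chrono::duration<Rep, Period> d)
{
    return std::chrono::duration<double>(d).count();
}

// Runs `work` once and returns the elapsed wall-clock time in seconds.
template <class Work>
double time_seconds(Work work)
{
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto end = std::chrono::steady_clock::now();
    return to_seconds(end - start);
}
```

In the listing, the final print could then read std::cout << "Time taken: " << to_seconds(end - start) << std::endl;.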

Boost version

#include <iostream>               // for std::cout
#include <boost/cstdint.hpp>      // for boost::uint64_t
#include <boost/chrono.hpp>       // for boost::chrono::high_resolution_clock
#include <boost/thread.hpp>       // for boost::thread and boost::mutex
#include <vector>                 // for std::vector
#include <cassert>                // for assert

#define TRACE

#ifdef TRACE
boost::mutex coutmutex;
#endif

std::vector<boost::uint64_t *> part_sums;
const int max_sum_item = 1000000000;
const int threads_to_use = 7;

void do_partial_sum(boost::uint64_t *final, int start_val, int sums_to_do)
{
#ifdef TRACE
    coutmutex.lock();
    std::cout << "Start: TID " << boost::this_thread::get_id() << " starting at " << start_val
              << ", workload of " << sums_to_do << " items" << std::endl;
    coutmutex.unlock();
    boost::chrono::high_resolution_clock::time_point start = boost::chrono::high_resolution_clock::now();
#endif

    boost::uint64_t sub_result = 0;
    for (int i = start_val; i < start_val + sums_to_do; i++)
        sub_result += i;
    *final = sub_result;

#ifdef TRACE
    boost::chrono::high_resolution_clock::time_point end = boost::chrono::high_resolution_clock::now();
    coutmutex.lock();
    std::cout << "End  : TID " << boost::this_thread::get_id() << " with result " << sub_result << ", time taken "
              << (end - start).count() * ((double) boost::chrono::high_resolution_clock::period::num / boost::chrono::high_resolution_clock::period::den) << std::endl;
    coutmutex.unlock();
#endif
}

int main()
{
    part_sums.clear();
    for (int i = 0; i < threads_to_use; i++)
        part_sums.push_back(new boost::uint64_t(0));

    std::vector<boost::thread *> t;
    int sums_per_thread = max_sum_item / threads_to_use;

    boost::chrono::high_resolution_clock::time_point start = boost::chrono::high_resolution_clock::now();

    for (int start_val = 0, i = 0; start_val < max_sum_item; start_val += sums_per_thread, i++)
    {
        // Lump extra bits onto last thread if work items is not equally divisible by number of threads
        int sums_to_do = sums_per_thread;
        if (start_val + sums_per_thread < max_sum_item && start_val + sums_per_thread * 2 > max_sum_item)
            sums_to_do = max_sum_item - start_val;
        t.push_back(new boost::thread(do_partial_sum, part_sums[i], start_val, sums_to_do));
        if (sums_to_do != sums_per_thread)
            break;
    }

    for (int i = 0; i < threads_to_use; i++)
        t[i]->join();

    boost::uint64_t result = 0;
    for (std::vector<boost::uint64_t *>::iterator it = part_sums.begin(); it != part_sums.end(); ++it)
        result += **it;

    boost::chrono::high_resolution_clock::time_point end = boost::chrono::high_resolution_clock::now();

    for (int i = 0; i < threads_to_use; i++)
    {
        delete t[i];
        delete part_sums[i];
    }

    assert(result == boost::uint64_t(499999999500000000));
    std::cout << "Result is correct" << std::endl;
    std::cout << "Time taken: " << (end - start).count() * ((double) boost::chrono::high_resolution_clock::period::num / boost::chrono::high_resolution_clock::period::den) << std::endl;
}

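Multi-core processors

Running on multiple processors is true parallelism, as opposed to time slicing arranged by the system scheduler. So how do you decide how many threads to start on a multi-core machine?
std::thread::hardware_concurrency() (or boost::thread::hardware_concurrency())
reports the number of hardware threads the system can run concurrently. Note that this is the number of logical cores the system can detect. For example, a 4-core i7 test machine with hyper-threading reports 8.
It is used as follows: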

for (int threads_to_use = 1; threads_to_use <= static_cast<int>(std::thread::hardware_concurrency()); threads_to_use++)
{
    // original code
    std::cout << "Time taken with " << threads_to_use << " core" << (threads_to_use != 1 ? "s" : "") << ": "
              << (end - start).count() * ((double) std::chrono::high_resolution_clock::period::num / std::chrono::high_resolution_clock::period::den) << std::endl;
}
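One detail worth guarding against: std::thread::hardware_concurrency() is only a hint and is allowed to return 0 when the value cannot be determined, in which case the loop above would not run at all. A minimal sketch of a fallback follows; the helper names pick_thread_count and default_thread_count, and the fallback value of 1, are assumptions made for this illustration.

```cpp
#include <cassert>
#include <thread>

// Returns the reported hardware thread count, or `fallback` when the
// runtime reports 0 (meaning "unknown").
unsigned pick_thread_count(unsigned reported, unsigned fallback = 1)
{
    return reported != 0 ? reported : fallback;
}

// Thread count to use on this machine; never less than 1.
unsigned default_thread_count()
{
    return pick_thread_count(std::thread::hardware_concurrency());
}
```

The loop bound could then be static_cast<int>(default_thread_count()), which guarantees at least one iteration.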

The Boost version uses boost::thread::hardware_concurrency().
To set the number of threads dynamically, remove the const qualifier from the original threads_to_use so that it becomes a variable, and wrap the body of the original main function in one more loop:
for (int threads_to_use = 1; threads_to_use <= static_cast<int>(boost::thread::hardware_concurrency()); threads_to_use++)
{
    // original code
    std::cout << "Time taken with " << threads_to_use << " core" << (threads_to_use != 1 ? "s" : "") << ": " << timeTaken.total_milliseconds() << std::endl;
}
In full:

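int main()
{
    for (int threads_to_use = 1; threads_to_use <= static_cast<int>(boost::thread::hardware_concurrency()); threads_to_use++)
    {
        // The original code goes here. The number of threads, threads_to_use,
        // now varies with the loop. Note that the test machine used for this
        // article has 24 logical cores.

        // Print inside the loop: threads_to_use and timeTaken are only in scope here.
        std::cout << "Time taken with " << threads_to_use << " core" << (threads_to_use != 1 ? "s" : "") << ": "
                  << timeTaken.total_milliseconds() << std::endl;
    }
    return 0;
}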

The output is as follows (times in milliseconds):
Time taken with 1 core: 3874
Time taken with 2 cores: 1927
Time taken with 3 cores: 1289
Time taken with 4 cores: 965
Time taken with 5 cores: 773
Time taken with 6 cores: 643
Time taken with 7 cores: 552
Time taken with 8 cores: 482
Time taken with 9 cores: 429
Time taken with 10 cores: 386
Time taken with 11 cores: 358
Time taken with 12 cores: 327
Time taken with 13 cores: 406
Time taken with 14 cores: 387
Time taken with 15 cores: 374
Time taken with 16 cores: 394
Time taken with 17 cores: 337
Time taken with 18 cores: 304
Time taken with 19 cores: 314
Time taken with 20 cores: 303
Time taken with 21 cores: 296
Time taken with 22 cores: 285
Time taken with 23 cores: 279
Time taken with 24 cores: 267
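As the figure below shows, the time drops steadily up to 12 threads (327 ms), then rebounds at 13 threads (406 ms) and fluctuates from there on. Doubling the thread count from 12 to 24 only improves the time from 327 ms to 267 ms. So choosing half of the available logical cores as the thread count captures most of the benefit on this machine.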
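The measurements above can be turned into speedup and efficiency figures to make the bend at 12 threads easier to see. This is a small sketch using the usual definitions, speedup = T(1) / T(n) and efficiency = speedup / n; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// times_ms[i] is the measured time with i + 1 threads.
// Returns the speedup relative to the one-thread time: T(1) / T(n).
std::vector<double> speedups(const std::vector<int> &times_ms)
{
    std::vector<double> result;
    for (std::size_t i = 0; i < times_ms.size(); ++i)
        result.push_back(static_cast<double>(times_ms[0]) / times_ms[i]);
    return result;
}

// Parallel efficiency: speedup divided by the number of threads used.
double efficiency(double speedup, int threads)
{
    return speedup / threads;
}
```

With the times listed above, 12 threads give a speedup of about 11.8 (efficiency about 0.99), while 24 threads give about 14.5 (efficiency about 0.60).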

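[Figure: cost time in milliseconds versus number of threads, 1 to 24]
Checking the number of CPUs and cores on the machine shows that it has 2 physical CPUs (cat /proc/cpuinfo | grep "physical id" | sort | uniq | wc -l), 24 logical CPUs (cat /proc/cpuinfo | grep "processor" | wc -l), and 6 cores per CPU (cat /proc/cpuinfo | grep "cores" | uniq). There are 24 logical processors because the CPUs support hyper-threading: 2 CPUs x 6 cores x 2 hardware threads.
[Figure: output of the commands above]
Python code used for the plot:

import matplotlib.pyplot as plt
import numpy as np

# Measured times in milliseconds for 1 to 24 threads.
y = [3874, 1927, 1289, 965, 773, 643, 552, 482, 429, 386, 358, 327,
     406, 387, 374, 394, 337, 304, 314, 303, 296, 285, 279, 267]
x = np.arange(1, 25)
x1 = x.tolist()
print(type(x1))
print(len(x1))
print(len(y))
print(y)
plt.plot(x1, y, 'r--')
plt.axis([1, 24, 0, 4000])
plt.title('cost time of cores')
plt.xlabel('number of cores')
plt.ylabel('cost time/milliseconds')
plt.show()

Thread synchronization

Although using multiple threads can improve an application's performance, it also adds complexity. When threads execute several functions at the same time, access to shared resources must be synchronized accordingly. Once an application reaches a certain size, this involves a considerable amount of work. This section introduces the classes that Boost.Thread provides for synchronizing threads.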

Reference:
https://katyscode.wordpress.com/2013/08/15/c11-boost-multi-threading-the-parallel-aggregation-pattern/
