Multithreaded Programming Study Notes: Summing a Massive Data Set

Source: Internet. Published by: adult online education and training institution. Editor: 程序博客网. Time: 2024/06/05 15:26

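Background

There are roughly 1,000,000 tasks to process. Running all of them on a single thread is very slow, so the idea is to have the main thread create several worker threads and process the tasks in parallel. This article uses a repeated summation loop as the example.

Single thread: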

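#include <boost/date_time/posix_time/posix_time.hpp>  // ptime, microsec_clock
#include <cstdint>                                    // uint64_t
#include <iostream>                                   // std::cout

const int max_sum_item = 1000000000;

int main()
{
    boost::posix_time::ptime start = boost::posix_time::microsec_clock::local_time();

    // Sum the integers 0 .. max_sum_item - 1 on the main thread.
    uint64_t result = 0;
    for (int i = 0; i < max_sum_item; i++)
        result += i;
    std::cout << "sum=" << result << std::endl;

    boost::posix_time::ptime end = boost::posix_time::microsec_clock::local_time();
    boost::posix_time::time_duration timeTaken = end - start;
    std::cout << "cost time:" << timeTaken.total_milliseconds() << std::endl;
}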

Output:
sum=499999999500000000
cost time:4061
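As a quick sanity check on the printed value, and assuming N = max_sum_item = 1000000000 (the constant defined in the next listing), the loop adds the integers 0 through N - 1, so the closed form for an arithmetic series should give the same number:

```latex
\sum_{i=0}^{N-1} i \;=\; \frac{N(N-1)}{2}
\;=\; \frac{10^{9}\,(10^{9}-1)}{2}
\;=\; 499\,999\,999\,500\,000\,000
```

This value is far larger than a 32-bit integer can hold, which is why the accumulator is a uint64_t.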
What happens if the same task is handled by a newly created thread instead?
Code:

#include <boost/date_time/posix_time/posix_time.hpp>
#include <boost/thread.hpp>
#include <cstdint>
#include <iostream>

const int max_sum_item = 1000000000;

// Runs on the worker thread and accumulates directly into *total.
void do_sum(uint64_t *total)
{
    *total = 0;
    for (int i = 0; i < max_sum_item; i++)
        *total += i;
}

int main()
{
    boost::posix_time::ptime start = boost::posix_time::microsec_clock::local_time();

    uint64_t result = 0;
    boost::thread worker(do_sum, &result);
    worker.join();  // wait for the worker to finish before reading result
    std::cout << "sum=" << result << std::endl;

    boost::posix_time::ptime end = boost::posix_time::microsec_clock::local_time();
    boost::posix_time::time_duration timeTaken = end - start;
    std::cout << "cost time:" << timeTaken.total_milliseconds() << std::endl;
}

Output:
sum=499999999500000000
cost time:4346
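As the numbers show, doing the computation on a newly created thread rather than on the main thread takes slightly longer, since creating and destroying the thread has a cost. But this overhead (about 285 ms) seems rather large.
Optimizing do_sum a little gives the following code: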

void do_sum(uint64_t *total)
{
    // Accumulate into a local variable and write through the pointer only once.
    uint64_t localTotal = 0;
    for (int i = 0; i < max_sum_item; i++)
        localTotal += i;
    *total = localTotal;
}

With the optimized do_sum, the timing is:
sum=499999999500000000
cost time:4068
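The difference arises because the unoptimized do_sum updates the total through the pointer on every iteration (*total += i;), and that repeated memory access costs more than the addition itself. A local variable such as localTotal can typically be kept in a register, so the better approach is to accumulate into it inside the function and write the result through the pointer total only once, as the last step.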

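Multiple threads:

Note: C++11 lambda expressions require GCC/G++ 4.5 or later. G++ 4.4 does not support them and reports an error at compile time, so check your compiler version. See http://gcc.gnu.org/projects/cxx0x.html.
With a compiler that does support lambdas, the partial results can be summed with one: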

std::for_each(part_sums.begin(), part_sums.end(), [&result] (uint64_t *subtotal) { result += *subtotal; });

The code is as follows:

#include <boost/date_time/posix_time/posix_time.hpp>
#include <boost/thread.hpp>
#include <cstdint>
#include <iostream>
#include <vector>

const int max_sum_item = 1000000000;
const int threads_to_use = 2;
std::vector<uint64_t *> part_sums;  // one result slot per thread

// Sums the integers in [start_val, start_val + sums_to_do) and stores the result in *final.
void do_partial_sum(uint64_t *final, int start_val, int sums_to_do)
{
    uint64_t sub_result = 0;
    for (int i = start_val; i < start_val + sums_to_do; i++)
        sub_result += i;
    *final = sub_result;
}

int main()
{
    boost::posix_time::ptime start = boost::posix_time::microsec_clock::local_time();

    part_sums.clear();
    for (int i = 0; i < threads_to_use; i++)
    {
        part_sums.push_back(new uint64_t(0));
    }

    // Start one thread per equally sized slice of the range.
    std::vector<boost::thread *> t;
    int sums_per_thread = max_sum_item / threads_to_use;
    for (int start_val = 0, i = 0; start_val < max_sum_item; start_val += sums_per_thread, i++)
    {
        t.push_back(new boost::thread(do_partial_sum, part_sums[i], start_val, sums_per_thread));
    }
    for (int i = 0; i < threads_to_use; i++)
        t[i]->join();

    // Add up the partial sums stored in the vector.
    uint64_t result = 0;
    for (int i = 0; i < threads_to_use; i++)
    {
        uint64_t *temp = part_sums[i];
        result += *temp;  // note: the vector holds pointers, so dereference to get the value
    }

    for (int i = 0; i < threads_to_use; i++)
    {
        delete t[i];
        delete part_sums[i];
    }

    std::cout << "sum=" << result << std::endl;
    boost::posix_time::ptime end = boost::posix_time::microsec_clock::local_time();
    boost::posix_time::time_duration timeTaken = end - start;
    std::cout << "cost time:" << timeTaken.total_milliseconds() << std::endl;
}

With two threads, the output is:
sum=499999999500000000
cost time:1907
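The speedup is substantial: 1907 ms against 4061 ms for the single-threaded run.
Note that the summation over the vector above can also be written more compactly as: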

for (std::vector<uint64_t *>::iterator it = part_sums.begin(); it != part_sums.end(); ++it)
    result += **it;
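The same reduction can probably also be expressed with std::accumulate. The sketch below is only an illustration: sum_partials is a hypothetical helper name that does not appear in the original article, and it assumes a compiler with C++11 lambda support. The point worth noticing is the initial value, which has to be a 64-bit zero.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>   // std::accumulate
#include <vector>

// Hypothetical helper (not in the original article): adds up the values that
// a vector of pointers refers to.
std::uint64_t sum_partials(const std::vector<std::uint64_t *> &parts)
{
    // The initial value is a 64-bit zero on purpose: std::accumulate keeps its
    // running total in the type of this argument, so a plain `0` (an int)
    // could not hold sums of this size.
    return std::accumulate(parts.begin(), parts.end(), std::uint64_t(0),
        [](std::uint64_t acc, const std::uint64_t *p) {
            assert(p != nullptr);   // every slot is expected to be allocated
            return acc + *p;
        });
}
```

In the listing above, the manual loop could then be replaced by result = sum_partials(part_sums);.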

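Distributing work items across threads

Take const int max_sum_item = 1000000000; from above. With 7 threads, each thread would be responsible for about 142,857,142.86 items, so we round down to 142,857,142. The 7 threads then cover 999,999,994 items in total, and the 6 leftover items at the tail are assigned to the last thread.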

int sums_per_thread = max_sum_item / threads_to_use;
for (int start_val = 0, i = 0; start_val < max_sum_item; start_val += sums_per_thread, i++)
{
    // Lump extra bits onto last thread if work items is not equally divisible by number of threads
    int sums_to_do = sums_per_thread;
    // Tail handling: more than one share but less than two shares remain.
    if (start_val + sums_per_thread < max_sum_item && start_val + sums_per_thread * 2 > max_sum_item)
        sums_to_do = max_sum_item - start_val;
    t.push_back(new std::thread(do_partial_sum, part_sums[i], start_val, sums_to_do));
    // Stop as soon as the first non-standard workload has been handed out. The tail
    // thread takes more than one standard share, so without this break the loop would
    // run once more with start_val = 999,999,994 and create an unnecessary, erroneous
    // extra thread.
    if (sums_to_do != sums_per_thread)
        break;
}
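A different way to deal with the remainder, shown here only as a sketch and not as the article's method, is to give one extra item to each of the first few chunks instead of lumping the whole remainder onto the last thread. The helper name split_range and its (start, count) return format are assumptions made for this illustration.

```cpp
#include <cassert>
#include <utility>   // std::pair, std::make_pair
#include <vector>

// Hypothetical helper (not in the original article): splits [0, total) into
// `parts` consecutive chunks, returned as (start, count) pairs.
std::vector<std::pair<int, int>> split_range(int total, int parts)
{
    assert(total >= 0 && parts > 0);
    std::vector<std::pair<int, int>> chunks;
    const int base = total / parts;    // minimum size of every chunk
    const int extra = total % parts;   // this many chunks get one more item
    int start = 0;
    for (int i = 0; i < parts; i++)
    {
        const int count = base + (i < extra ? 1 : 0);
        chunks.push_back(std::make_pair(start, count));
        start += count;
    }
    return chunks;
}
```

Each thread i would then receive chunks[i].first as start_val and chunks[i].second as sums_to_do. Because exactly `parts` chunks are always produced, the early break in the loop above would not be needed with this scheme.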

The complete code (using 7 threads) is as follows:

#include <boost/date_time/posix_time/posix_time.hpp>
#include <boost/thread.hpp>
#include <cstdint>
#include <iostream>
#include <vector>

const int max_sum_item = 1000000000;
const int threads_to_use = 7;
std::vector<uint64_t *> part_sums;  // one result slot per thread

void do_partial_sum(uint64_t *final, int start_val, int sums_to_do)
{
    uint64_t sub_result = 0;
    for (int i = start_val; i < start_val + sums_to_do; i++)
        sub_result += i;
    *final = sub_result;
}

int main()
{
    boost::posix_time::ptime start = boost::posix_time::microsec_clock::local_time();
    uint64_t result = 0;

    part_sums.clear();
    for (int i = 0; i < threads_to_use; i++)
    {
        part_sums.push_back(new uint64_t(0));
    }

    std::vector<boost::thread *> t;
    int sums_per_thread = max_sum_item / threads_to_use;
    for (int start_val = 0, i = 0; start_val < max_sum_item; start_val += sums_per_thread, i++)
    {
        // Lump extra bits onto last thread if work items is not equally divisible by number of threads
        int sums_to_do = sums_per_thread;
        // Tail handling: more than one share but less than two shares remain.
        if (start_val + sums_per_thread < max_sum_item && start_val + sums_per_thread * 2 > max_sum_item)
            sums_to_do = max_sum_item - start_val;
        t.push_back(new boost::thread(do_partial_sum, part_sums[i], start_val, sums_to_do));
        if (sums_to_do != sums_per_thread)
            break;
    }
    for (int i = 0; i < threads_to_use; i++)
        t[i]->join();

    // Add up the partial sums stored in the vector.
    for (int i = 0; i < threads_to_use; i++)
    {
        uint64_t *temp = part_sums[i];
        result += *temp;
    }

    for (int i = 0; i < threads_to_use; i++)
    {
        delete t[i];
        delete part_sums[i];
    }

    std::cout << "sum=" << result << std::endl;
    boost::posix_time::ptime end = boost::posix_time::microsec_clock::local_time();
    boost::posix_time::time_duration timeTaken = end - start;
    std::cout << "cost time:" << timeTaken.total_milliseconds() << std::endl;
    // end of the multithreading test
    return 0;
}

Output:

sum=499999999500000000
cost time:546

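The speedup appears to be more than 7x: 4061 ms for the single-threaded run against 546 ms here, roughly 7.4x. What is the reason? Could it be that each thread now has fewer items and the processing time does not scale linearly with the item count? One thing to keep in mind is that the two figures come from different programs and runs; in the core-count table later in this article, measured within a single program, 1 thread takes 3874 ms and 7 threads take 552 ms, which is about 7.0x. Additions and discussion are welcome.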

Accurately recording the time taken by each thread

Main code:

std::vector<uint64_t *> part_sums;
boost::mutex coutmutex;  // synchronization object guarding std::cout
const int threads_to_use = 7;

void do_partial_sum(uint64_t *final, int start_val, int sums_to_do)
{
    coutmutex.lock();
    std::cout << "Start: TID " << boost::this_thread::get_id() << " starting at " << start_val
              << ", workload of " << sums_to_do << " items" << std::endl;
    coutmutex.unlock();
    // You can simply output text to cout or a file stream, but as discussed in the first part of
    // this series, stream operations in C++ are not atomic so you must wrap their use in a
    // synchronization object.
    // Notice that all uses of std::cout must be wrapped in mutex locks as provided by the lock()
    // method of std::mutex (or boost::mutex).

    boost::posix_time::ptime start = boost::posix_time::microsec_clock::local_time();
    uint64_t sub_result = 0;
    for (int i = start_val; i < start_val + sums_to_do; i++)
        sub_result += i;
    *final = sub_result;
    boost::posix_time::ptime end = boost::posix_time::microsec_clock::local_time();
    boost::posix_time::time_duration timeTaken = end - start;

    coutmutex.lock();
    std::cout << "End  : TID " << boost::this_thread::get_id() << " with result " << sub_result
              << ", time taken " << timeTaken.total_milliseconds() << std::endl;
    coutmutex.unlock();  // without this unlock, the other threads would be left waiting forever
}

The main function is the same as in the previous example.

Output:
Start: TID 7f7a85a6a700 starting at 142857142, workload of 142857142 items
Start: TID 7f7a84668700 starting at 428571426, workload of 142857142 items
Start: TID 7f7a83266700 starting at 714285710, workload of 142857142 items
Start: TID 7f7a82865700 starting at 857142852, workload of 142857148 items
Start: TID 7f7a8646b700 starting at 0, workload of 142857142 items
Start: TID 7f7a85069700 starting at 285714284, workload of 142857142 items
Start: TID 7f7a83c67700 starting at 571428568, workload of 142857142 items
End : TID 7f7a85a6a700 with result 30612244459183675, time taken 542
End : TID 7f7a82865700 with result 132653065561224474, time taken 543
End : TID 7f7a84668700 with result 71428570500000003, time taken 543
End : TID 7f7a8646b700 with result 10204081438775511, time taken 544
End : TID 7f7a83266700 with result 112244896540816331, time taken 544
End : TID 7f7a83c67700 with result 91836733520408167, time taken 582
End : TID 7f7a85069700 with result 51020407479591839, time taken 583
sum=499999999500000000
cost time:583
As mentioned earlier, output operations are not atomic, so remember to lock around them.
Other parts remain to be added.
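Pairing lock() with unlock() by hand makes it easy to leave the mutex locked on some code path. A common alternative is std::lock_guard, which releases the mutex automatically when it goes out of scope. The following is a minimal sketch using the C++11 standard library; out_mutex, log_line and count_logged_lines are hypothetical names introduced for this illustration and are not part of the original article.

```cpp
#include <cassert>
#include <iostream>
#include <mutex>     // std::mutex, std::lock_guard
#include <sstream>
#include <string>
#include <thread>
#include <vector>

std::mutex out_mutex;   // hypothetical; plays the role of coutmutex above

// Writes one line to `out` while holding the mutex. The lock_guard releases
// the mutex when it goes out of scope, even if the stream insertion throws.
void log_line(std::ostream &out, const std::string &text)
{
    std::lock_guard<std::mutex> guard(out_mutex);
    out << text << '\n';
}

// Starts `threads` workers that each log `lines_each` lines into one shared
// string stream, then returns how many complete lines were written.
int count_logged_lines(int threads, int lines_each)
{
    assert(threads > 0 && lines_each >= 0);
    std::ostringstream sink;
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; t++)
        workers.emplace_back([&sink, lines_each] {
            for (int i = 0; i < lines_each; i++)
                log_line(sink, "worker line");
        });
    for (std::thread &w : workers)
        w.join();

    int lines = 0;
    const std::string text = sink.str();
    for (char c : text)
        if (c == '\n')
            lines++;
    return lines;
}
```

Calling log_line(std::cout, "...") from do_partial_sum would take the place of the lock()/unlock() pairs. Boost.Thread offers the corresponding boost::lock_guard<boost::mutex>.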

Complete code

C++11 version

#include <iostream>   // for std::cout
#include <cstdint>    // for uint64_t
#include <chrono>     // for std::chrono::high_resolution_clock
#include <thread>     // for std::thread
#include <vector>     // for std::vector
#include <algorithm>  // for std::for_each
#include <cassert>    // for assert

#define TRACE

#ifdef TRACE
#include <mutex>      // for std::mutex
std::mutex coutmutex;
#endif

std::vector<uint64_t *> part_sums;
const int max_sum_item = 1000000000;
const int threads_to_use = 7;

void do_partial_sum(uint64_t *final, int start_val, int sums_to_do)
{
#ifdef TRACE
    coutmutex.lock();
    std::cout << "Start: TID " << std::this_thread::get_id() << " starting at " << start_val
              << ", workload of " << sums_to_do << " items" << std::endl;
    coutmutex.unlock();
    auto start = std::chrono::high_resolution_clock::now();
#endif

    uint64_t sub_result = 0;
    for (int i = start_val; i < start_val + sums_to_do; i++)
        sub_result += i;
    *final = sub_result;

#ifdef TRACE
    auto end = std::chrono::high_resolution_clock::now();
    coutmutex.lock();
    std::cout << "End  : TID " << std::this_thread::get_id() << " with result " << sub_result << ", time taken "
              << (end - start).count() * ((double) std::chrono::high_resolution_clock::period::num / std::chrono::high_resolution_clock::period::den) << std::endl;
    coutmutex.unlock();
#endif
}

int main()
{
    part_sums.clear();
    for (int i = 0; i < threads_to_use; i++)
        part_sums.push_back(new uint64_t(0));

    std::vector<std::thread *> t;
    int sums_per_thread = max_sum_item / threads_to_use;

    auto start = std::chrono::high_resolution_clock::now();
    for (int start_val = 0, i = 0; start_val < max_sum_item; start_val += sums_per_thread, i++)
    {
        // Lump extra bits onto last thread if work items is not equally divisible by number of threads
        int sums_to_do = sums_per_thread;
        if (start_val + sums_per_thread < max_sum_item && start_val + sums_per_thread * 2 > max_sum_item)
            sums_to_do = max_sum_item - start_val;
        t.push_back(new std::thread(do_partial_sum, part_sums[i], start_val, sums_to_do));
        if (sums_to_do != sums_per_thread)
            break;
    }
    for (int i = 0; i < threads_to_use; i++)
        t[i]->join();

    uint64_t result = 0;
    std::for_each(part_sums.begin(), part_sums.end(), [&result] (uint64_t *subtotal) { result += *subtotal; });
    auto end = std::chrono::high_resolution_clock::now();

    for (int i = 0; i < threads_to_use; i++)
    {
        delete t[i];
        delete part_sums[i];
    }

    assert(result == uint64_t(499999999500000000));
    std::cout << "Result is correct" << std::endl;
    std::cout << "Time taken: " << (end - start).count() * ((double) std::chrono::high_resolution_clock::period::num / std::chrono::high_resolution_clock::period::den) << std::endl;
}
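The C++11 listing manages its threads and result slots with new and delete. As a possible alternative, and not the approach taken by the article, std::async can return each partial sum by value through a std::future, which removes the manual bookkeeping. The names partial_sum and parallel_sum below are hypothetical, and for simplicity the last task takes whatever items are left over.

```cpp
#include <cassert>
#include <cstdint>
#include <future>    // std::async, std::future
#include <vector>

// Sums the integers in [start_val, start_val + sums_to_do).
std::uint64_t partial_sum(int start_val, int sums_to_do)
{
    std::uint64_t sub_result = 0;
    for (int i = start_val; i < start_val + sums_to_do; i++)
        sub_result += i;
    return sub_result;
}

// Hypothetical variant of main()'s logic: sums [0, item_count) using
// `threads` asynchronous tasks; the last task takes the leftover items.
std::uint64_t parallel_sum(int item_count, int threads)
{
    assert(item_count >= 0 && threads > 0);
    const int per_task = item_count / threads;
    std::vector<std::future<std::uint64_t>> parts;
    for (int i = 0; i < threads; i++)
    {
        const int start_val = i * per_task;
        const int sums_to_do = (i == threads - 1) ? item_count - start_val : per_task;
        // std::launch::async asks for the task to run on its own thread.
        parts.push_back(std::async(std::launch::async, partial_sum, start_val, sums_to_do));
    }
    std::uint64_t result = 0;
    for (std::future<std::uint64_t> &part : parts)
        result += part.get();   // get() waits for the task and returns its value
    return result;
}
```

With the article's constants, the call would be parallel_sum(max_sum_item, threads_to_use).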

Boost version

#include <iostream>           // for std::cout
#include <boost/cstdint.hpp>  // for boost::uint64_t
#include <boost/chrono.hpp>   // for boost::chrono::high_resolution_clock
#include <boost/thread.hpp>   // for boost::thread and boost::mutex
#include <vector>             // for std::vector
#include <cassert>            // for assert

#define TRACE

#ifdef TRACE
boost::mutex coutmutex;
#endif

std::vector<boost::uint64_t *> part_sums;
const int max_sum_item = 1000000000;
const int threads_to_use = 7;

void do_partial_sum(boost::uint64_t *final, int start_val, int sums_to_do)
{
#ifdef TRACE
    coutmutex.lock();
    std::cout << "Start: TID " << boost::this_thread::get_id() << " starting at " << start_val
              << ", workload of " << sums_to_do << " items" << std::endl;
    coutmutex.unlock();
    boost::chrono::high_resolution_clock::time_point start = boost::chrono::high_resolution_clock::now();
#endif

    boost::uint64_t sub_result = 0;
    for (int i = start_val; i < start_val + sums_to_do; i++)
        sub_result += i;
    *final = sub_result;

#ifdef TRACE
    boost::chrono::high_resolution_clock::time_point end = boost::chrono::high_resolution_clock::now();
    coutmutex.lock();
    std::cout << "End  : TID " << boost::this_thread::get_id() << " with result " << sub_result << ", time taken "
              << (end - start).count() * ((double) boost::chrono::high_resolution_clock::period::num / boost::chrono::high_resolution_clock::period::den) << std::endl;
    coutmutex.unlock();
#endif
}

int main()
{
    part_sums.clear();
    for (int i = 0; i < threads_to_use; i++)
        part_sums.push_back(new boost::uint64_t(0));

    std::vector<boost::thread *> t;
    int sums_per_thread = max_sum_item / threads_to_use;

    boost::chrono::high_resolution_clock::time_point start = boost::chrono::high_resolution_clock::now();
    for (int start_val = 0, i = 0; start_val < max_sum_item; start_val += sums_per_thread, i++)
    {
        // Lump extra bits onto last thread if work items is not equally divisible by number of threads
        int sums_to_do = sums_per_thread;
        if (start_val + sums_per_thread < max_sum_item && start_val + sums_per_thread * 2 > max_sum_item)
            sums_to_do = max_sum_item - start_val;
        t.push_back(new boost::thread(do_partial_sum, part_sums[i], start_val, sums_to_do));
        if (sums_to_do != sums_per_thread)
            break;
    }
    for (int i = 0; i < threads_to_use; i++)
        t[i]->join();

    boost::uint64_t result = 0;
    for (std::vector<boost::uint64_t *>::iterator it = part_sums.begin(); it != part_sums.end(); ++it)
        result += **it;
    boost::chrono::high_resolution_clock::time_point end = boost::chrono::high_resolution_clock::now();

    for (int i = 0; i < threads_to_use; i++)
    {
        delete t[i];
        delete part_sums[i];
    }

    assert(result == boost::uint64_t(499999999500000000));
    std::cout << "Result is correct" << std::endl;
    std::cout << "Time taken: " << (end - start).count() * ((double) boost::chrono::high_resolution_clock::period::num / boost::chrono::high_resolution_clock::period::den) << std::endl;
}

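Multi-core processors

On a multi-processor (multi-core) machine the threads run truly in parallel, rather than being time-sliced by the operating system's scheduler. How, then, do we decide how many threads to start on a multi-core machine?
std::thread::hardware_concurrency() (or boost::thread::hardware_concurrency())
reports the number of processor cores available on the machine. Note that the result is the number of logical cores the system can detect. For example, an i7 benchmark machine with a 4-core processor and Hyper-Threading reports 8.
It is used as follows: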

for (int threads_to_use = 1; threads_to_use <= static_cast<int>(std::thread::hardware_concurrency()); threads_to_use++)
{
    // original code
    std::cout << "Time taken with " << threads_to_use << " core" << (threads_to_use != 1 ? "s" : "") << ": "
              << (end - start).count() * ((double) std::chrono::high_resolution_clock::period::num / std::chrono::high_resolution_clock::period::den) << std::endl;
}
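One detail to be aware of: std::thread::hardware_concurrency() is only a hint and may return 0 when the value cannot be determined, in which case a loop like the one above would not execute at all. A small guard, sketched here with the hypothetical helper name max_threads_to_use, falls back to a caller-supplied default.

```cpp
#include <cassert>
#include <thread>   // std::thread::hardware_concurrency

// Hypothetical helper: returns the largest thread count to try.
// std::thread::hardware_concurrency() may return 0 when the value cannot be
// determined, so `fallback` is used in that case.
int max_threads_to_use(int fallback)
{
    assert(fallback > 0);
    const unsigned int detected = std::thread::hardware_concurrency();
    return detected != 0 ? static_cast<int>(detected) : fallback;
}
```

The loop bound could then be written as threads_to_use <= max_threads_to_use(2), where the fallback value 2 is an arbitrary choice for this example.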

The Boost version uses boost::thread::hardware_concurrency() instead.
To set the thread count dynamically, remove the const qualifier from threads_to_use so that it becomes an ordinary variable, and wrap the body of the original main function in one more loop:
for (int threads_to_use = 1; threads_to_use <= static_cast<int>(boost::thread::hardware_concurrency()); threads_to_use++)
{
    // original code
    std::cout << "Time taken with " << threads_to_use << " core" << (threads_to_use != 1 ? "s" : "") << ": " << timeTaken.total_milliseconds() << std::endl;
}
In full:

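int main()
{
    for (int threads_to_use = 1; threads_to_use <= static_cast<int>(boost::thread::hardware_concurrency()); threads_to_use++)
    {
        // The original code goes here. The number of threads started, threads_to_use,
        // now changes with each pass of the loop.
        // Note: the test machine used in this article has 24 cores.

        // Print inside the loop, where threads_to_use and timeTaken are in scope.
        std::cout << "Time taken with " << threads_to_use << " core" << (threads_to_use != 1 ? "s" : "") << ": "
                  << timeTaken.total_milliseconds() << std::endl;
    }
    return 0;
}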

Output:
Time taken with 1 core: 3874
Time taken with 2 cores: 1927
Time taken with 3 cores: 1289
Time taken with 4 cores: 965
Time taken with 5 cores: 773
Time taken with 6 cores: 643
Time taken with 7 cores: 552
Time taken with 8 cores: 482
Time taken with 9 cores: 429
Time taken with 10 cores: 386
Time taken with 11 cores: 358
Time taken with 12 cores: 327
Time taken with 13 cores: 406
Time taken with 14 cores: 387
Time taken with 15 cores: 374
Time taken with 16 cores: 394
Time taken with 17 cores: 337
Time taken with 18 cores: 304
Time taken with 19 cores: 314
Time taken with 20 cores: 303
Time taken with 21 cores: 296
Time taken with 22 cores: 285
Time taken with 23 cores: 279
Time taken with 24 cores: 267
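Plotting these timings shows that around 12 threads the time rebounds (327 ms at 12 threads, 406 ms at 13) and then fluctuates. The lowest time in this run is still reached at 24 threads (267 ms), but each thread beyond 12 adds little, so about half of the available cores is a reasonable choice for the thread count.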
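To make the trade-off concrete, speedup T(1)/T(n) and parallel efficiency (speedup divided by n) can be computed from the measured times. The sketch below copies the 24 timings from the output above; the function names are hypothetical and the code is only a way of reading the numbers, not part of the original benchmark.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Measured times in milliseconds, copied from the output above;
// element k holds the time for k + 1 threads.
const std::vector<int> measured_ms = {
    3874, 1927, 1289, 965, 773, 643, 552, 482, 429, 386, 358, 327,
    406, 387, 374, 394, 337, 304, 314, 303, 296, 285, 279, 267};

// Speedup relative to the single-thread run: T(1) / T(n).
double speedup(int threads)
{
    assert(threads >= 1 && static_cast<std::size_t>(threads) <= measured_ms.size());
    return static_cast<double>(measured_ms[0]) / measured_ms[threads - 1];
}

// Parallel efficiency: speedup divided by the number of threads.
double efficiency(int threads)
{
    return speedup(threads) / threads;
}

// Thread count with the lowest measured time.
int fastest_thread_count()
{
    std::size_t best = 0;
    for (std::size_t k = 1; k < measured_ms.size(); k++)
        if (measured_ms[k] < measured_ms[best])
            best = k;
    return static_cast<int>(best) + 1;
}
```

With these numbers, efficiency is about 0.99 at 12 threads and about 0.60 at 24 threads, even though 24 threads give the lowest absolute time.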

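Thread synchronization

Although multithreading can improve an application's performance, it also adds complexity. When several functions run at the same time on different threads, access to shared resources must be synchronized accordingly. Once an application reaches a certain size, this takes a considerable amount of work. This section introduces the classes that Boost.Thread provides for synchronizing threads.
Code (the matplotlib script that plots the timings listed above):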
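The listings in this article avoid shared writes by giving every thread its own result slot. As a small illustration of synchronized access to one shared variable, here is a sketch in which each worker adds its chunk total to a single std::atomic counter. It uses the C++11 standard library rather than Boost.Thread, and the function name atomic_parallel_sum is hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical variant: every worker sums its own chunk in a local variable
// and then adds that chunk total to one shared atomic counter.
std::uint64_t atomic_parallel_sum(int item_count, int threads)
{
    assert(item_count >= 0 && threads > 0);
    std::atomic<std::uint64_t> total(0);
    const int per_thread = item_count / threads;
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; t++)
    {
        const int start_val = t * per_thread;
        // The last worker also takes the leftover items.
        const int end_val = (t == threads - 1) ? item_count : start_val + per_thread;
        workers.emplace_back([&total, start_val, end_val] {
            std::uint64_t local = 0;
            for (int i = start_val; i < end_val; i++)
                local += i;
            total.fetch_add(local);   // one synchronized update per thread
        });
    }
    for (std::thread &w : workers)
        w.join();
    return total.load();
}
```

Keeping the loop on a local variable and touching the shared counter once per thread follows the same reasoning as the localTotal optimization earlier: synchronized operations are comparatively expensive, so they are best kept out of the inner loop.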

import matplotlib.pyplot as plt
import numpy as np

# Measured times in milliseconds for 1 to 24 threads
y = [3874, 1927, 1289, 965, 773, 643, 552, 482, 429, 386, 358, 327,
     406, 387, 374, 394, 337, 304, 314, 303, 296, 285, 279, 267]
x = np.arange(1, 25)
x1 = x.tolist()
print(type(x1))
print(len(x1))
print(len(y))
print(y)
plt.plot(x1, y, 'r--')
plt.axis([1, 24, 0, 4000])
plt.title('cost time of cores')
plt.xlabel('number of cores')
plt.ylabel('cost time/milliseconds')
plt.show()

Reference:
https://katyscode.wordpress.com/2013/08/15/c11-boost-multi-threading-the-parallel-aggregation-pattern/

0 0