Processing Massive Data Sets

Source: Internet | Published by: 内涵段子知乎 | Editor: 程序博客网 | Date: 2024/05/16 01:31

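Requirement:
Given two tables in two different database instances, find the account_id values from the last 5 months that appear in both tables.
One table has on the order of millions of rows; the other has tens of millions.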

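Analysis:
Because the tables live in different instances, the comparison cannot be done directly inside the database; it has to go through a script. PHP is used here.
First, the required data has to be pulled out of both tables. This should be done with native PDO and its query method. Why native PDO? Because some frameworks wrap PDO in a way that stores the whole result in an array, and holding a massive data set in an array is not feasible. query, by contrast, returns a PDOStatement, which is traversable and can be walked row by row with foreach. Also be sure to turn off result set buffering with $pdo->setAttribute(PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, false), so that rows are streamed from the server instead of being copied into PHP memory all at once; this greatly reduces memory usage. Two articles cover these points:
Processing big data with PHP
Exporting large MySQL data sets with PHP
One catch: with an unbuffered query, a connection can have only one statement in progress. Its rows must be fully fetched, or the cursor released, before the next query can run. The workaround is to open two PDO connections.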
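
As a rough sketch of the approach described above: each database gets its own PDO connection with buffering turned off, and a generator hands rows to the caller one at a time. The names `openUnbuffered` and `streamColumn` are assumptions made for illustration and do not come from the original script; the DSN and credentials are left to the caller.

```php
<?php
/**
 * Open a MySQL connection with result buffering turned off.
 * (Hypothetical helper; DSN and credentials are supplied by the caller.)
 */
function openUnbuffered(string $dsn, string $user, string $password): PDO
{
    $pdo = new PDO($dsn, $user, $password);
    $pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    // Rows stay on the server until fetched, so PHP memory use stays flat.
    $pdo->setAttribute(PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, false);
    return $pdo;
}

/**
 * Yield one column of a result set, row by row.
 * Works with any PDO driver; only the buffering attribute is MySQL-specific.
 */
function streamColumn(PDO $pdo, string $sql, string $column): Generator
{
    $stmt = $pdo->query($sql);
    while (($row = $stmt->fetch(PDO::FETCH_ASSOC)) !== false) {
        yield $row[$column];
    }
    // Release the result set so the connection can run another query.
    $stmt->closeCursor();
}
```

With one connection per instance, both result sets can be read independently, since neither connection has to wait for the other's cursor to be drained.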

The returned result sets are then split across different files by a hashing algorithm.

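    ini_set('max_execution_time', '0');   // no time limit for this long-running job
    $sqlWeb = "select statement";          // placeholder for the actual query
    $pdoWeb = new \PDO('mysql:host=*;dbname=*', '*', '*');
    // Stream rows from the server instead of buffering the whole result set
    $pdoWeb->setAttribute(\PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, false);
    $rowsWeb = $pdoWeb->query($sqlWeb);
    foreach ($rowsWeb as $row) {
        // Pick the bucket file from the hash of account_id
        $cat = $this->myHash($row['account_id']);
        file_put_contents(base_path() . '/datafile/web/temp_pc' . $cat . '.txt', $row['account_id'] . PHP_EOL, FILE_APPEND);
    }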
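
Calling `file_put_contents` with `FILE_APPEND` opens and closes a file for every single row. One possible variation, sketched below, keeps one handle per bucket open for the whole run. It assumes the operating system's open-file limit allows as many handles as there are buckets. The function name `partitionToFiles` and its parameters are illustrative and not part of the original script.

```php
<?php
/**
 * Write each value to the bucket file chosen by $bucketOf, reusing file handles.
 *
 * @param iterable $values   values to partition (e.g. account_id strings)
 * @param callable $bucketOf maps a value to an integer bucket number
 * @param string   $prefix   path prefix; bucket N goes to "{$prefix}{N}.txt"
 * @return int number of values written
 */
function partitionToFiles(iterable $values, callable $bucketOf, string $prefix): int
{
    $handles = [];
    $written = 0;
    try {
        foreach ($values as $value) {
            $bucket = $bucketOf($value);
            if (!isset($handles[$bucket])) {
                // Open the bucket file once, in append mode, on first use.
                $handle = fopen($prefix . $bucket . '.txt', 'ab');
                if ($handle === false) {
                    throw new RuntimeException("Cannot open bucket file $bucket");
                }
                $handles[$bucket] = $handle;
            }
            fwrite($handles[$bucket], $value . PHP_EOL);
            $written++;
        }
    } finally {
        // Close every handle, even if an exception interrupted the loop.
        foreach ($handles as $handle) {
            fclose($handle);
        }
    }
    return $written;
}
```

Any iterable works as input, so the traversable statement returned by an unbuffered query could be mapped to its account_id values and passed in directly.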

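The file partitioning algorithm used here works as follows: for each value, take the ASCII code of every character plus 5000, add these up, then take the total modulo 1000. The result decides which file the value is written to. Because the same account_id always hashes to the same number, matching values from the two tables always end up in files with the same number, so only same-numbered files need to be compared.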
    public function myHash($key)
    {
        $len = strlen($key);
        // Sum of the ASCII value of each character in the key
        $asciiTotal = 0;
        for ($i = 0; $i < $len; $i++) {
            $asciiTotal += ord($key[$i]) + 5000;
        }
        return $asciiTotal % 1000;
    }
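
To make the arithmetic concrete, here is a standalone version of the same computation applied to two sample keys. The free function name `bucketOf` and the sample keys are introduced here for illustration only. Since 5000 is a multiple of 1000, the `+ 5000` term does not change the remainder, so the bucket is simply the sum of the ASCII codes modulo 1000.

```php
<?php
/**
 * Same computation as myHash(): sum of (ASCII code + 5000) per character, mod 1000.
 */
function bucketOf(string $key): int
{
    $total = 0;
    $len = strlen($key);
    for ($i = 0; $i < $len; $i++) {
        $total += ord($key[$i]) + 5000;
    }
    return $total % 1000;
}

// 'abc': (97 + 98 + 99) + 3 * 5000 = 15294, and 15294 % 1000 = 294
echo bucketOf('abc'), PHP_EOL;   // 294
// '12345': (49 + 50 + 51 + 52 + 53) + 5 * 5000 = 25255, and 25255 % 1000 = 255
echo bucketOf('12345'), PHP_EOL; // 255
```

One consequence worth keeping in mind: a key made of d decimal digits has an ASCII sum between 48d and 57d, so if the IDs happen to be purely numeric and of similar length, they land in a narrow band of buckets rather than spreading evenly over all 1000.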

Finally, the data in the files is traversed with a generator-based iterator.

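    public function fileWebAll($file)
    {
        ini_set('max_execution_time', '0');
        $handle = @fopen($file, 'r');
        if (!$handle) {
            throw new \Exception("Can not read file!");
        }
        // Yield one line at a time so the whole file is never held in memory
        while (($line = fgets($handle)) !== false) {
            yield $line;
        }
        // fgets() returned false before end of file: a read error occurred
        if (!feof($handle)) {
            fclose($handle);
            throw new \Exception('Error: unexpected fgets() fail');
        }
        fclose($handle);
    }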
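
A generator like the one above is consumed with a plain `foreach`. As a possible variation, the sketch below (with a hypothetical function name, `readLines`) puts `fclose` in a `finally` block so the handle is released even when the consumer stops early, for example with `break`. It also strips the trailing newline, which is a change from the original behavior, so that values compare cleanly.

```php
<?php
/**
 * Yield the lines of a text file one at a time, without the trailing newline.
 */
function readLines(string $file): Generator
{
    $handle = @fopen($file, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot read file: $file");
    }
    try {
        while (($line = fgets($handle)) !== false) {
            yield rtrim($line, "\r\n");
        }
    } finally {
        // Runs on normal completion and when the generator is destroyed early.
        fclose($handle);
    }
}
```

Because the function body only starts running on the first iteration, the "cannot read" exception is thrown when the loop begins, not when `readLines()` is called.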


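    public function fileHandle()
    {
        ini_set('max_execution_time', '0');
        for ($i = 105; $i < 444; $i++) {
            // Only files with the same bucket number can contain equal values
            $filePc = base_path() . "/datafile/pc/temp_pc" . $i . ".txt";
            $fileWeb = base_path() . "/datafile/web/temp_pc" . $i . ".txt";
            if (file_exists($filePc) && file_exists($fileWeb)) {
                // filePcAll() is not shown in this article; presumably the pc counterpart of fileWebAll()
                foreach ($this->filePcAll($filePc) as $rowPc) {
                    foreach ($this->fileWebAll($fileWeb) as $rowWeb) {
                        if ($rowPc === $rowWeb) {
                            // Record the match and move on to the next pc row
                            file_put_contents(base_path() . '/datafile/same.txt', $rowWeb, FILE_APPEND);
                            break;
                        }
                    }
                }
            }
        }
    }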
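
The nested loops above re-read the web file once for every line of the pc file. Since each bucket file holds only a fraction of the full data set, one alternative is to load one file of each pair into an array used as a hash set and stream the other file past it. The sketch below assumes the smaller file of each pair fits in memory; the function name `intersectFiles` is hypothetical. Unlike the original loop, it reports each common value only once.

```php
<?php
/**
 * Return the values that occur in both files (one value per line), without duplicates.
 * Assumes the contents of $smallFile fit in memory.
 */
function intersectFiles(string $smallFile, string $largeFile): array
{
    // Load the smaller file into array keys for constant-time lookups.
    $seen = [];
    $handle = fopen($smallFile, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot read file: $smallFile");
    }
    while (($line = fgets($handle)) !== false) {
        $key = rtrim($line, "\r\n");
        if ($key !== '') {
            $seen[$key] = true;
        }
    }
    fclose($handle);

    // Stream the larger file and keep every value that was seen before.
    $common = [];
    $handle = fopen($largeFile, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot read file: $largeFile");
    }
    while (($line = fgets($handle)) !== false) {
        $key = rtrim($line, "\r\n");
        if (isset($seen[$key])) {
            $common[$key] = true;
        }
    }
    fclose($handle);

    // PHP turns numeric string keys into integers; convert them back to strings.
    return array_map('strval', array_keys($common));
}
```

Each file is read exactly once, so the work per bucket grows with the combined size of the two files rather than with their product.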

Some related articles
Processing massive data with PHP
PHP generators
PHP maximum execution time
How query avoids consuming memory
