PHP Multi-Process Crawler: SSL in Curl and pcntl_fork

Source: Internet  Editor: 程序博客网  Time: 2024/06/06 08:59
  • PHP Multi-Process Crawler: SSL in Curl and pcntl_fork
    • Background
    • Cause
    • Solutions
    • References


Background

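   While writing a crawler with multiple PHP processes recently, I ran into a strange problem. In a multi-process PHP program, if the parent process makes an HTTPS request to some domain (for example https://www.jd.com) before forking, then an HTTPS request to the same site from the child process fails.

For example: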

<?php
$ch = curl_init('https://www.jd.com/');
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$success = curl_exec($ch);
var_dump($success !== false); // true
curl_close($ch);

$pid = pcntl_fork();
if ($pid === 0) {
    $ch = curl_init('https://www.jd.com/');
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $success = curl_exec($ch);
    var_dump($success !== false); // false
    $errno = curl_errno($ch); // 35
    $error = curl_error($ch); // SSL connect error
    curl_close($ch);
} else if ($pid > 0) {
    // wait for child process
    pcntl_wait($status);
}

Output:

bool(true)
bool(false)

Turning on curl's verbose output produces the following debug information.

* Trying 183.56.147.1...
* TCP_NODELAY set
* Connected to www.jd.com (183.56.147.1) port 443 (#0)
* Initializing NSS with certpath: none
* skipping SSL peer certificate verification
* ALPN, server accepted to use http/1.1
* SSL connection using TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
* Server certificate:
* subject: CN=*.jd.com,O="BEIJING JINGDONG SHANGKE INFORMATION TECHNOLOGY CO., LTD.",L=beijing,ST=beijing,C=CN
* start date: Jul 04 05:47:07 2017 GMT
* expire date: Aug 28 09:42:54 2018 GMT
* common name: *.jd.com
* issuer: CN=GlobalSign Organization Validation CA - SHA256 - G2,O=GlobalSign nv-sa,C=BE
> GET / HTTP/1.1
Host: www.jd.com
Accept: */*

< HTTP/1.1 200 OK
< Server: JDWS/2.0
< Date: Sat, 18 Nov 2017 05:45:52 GMT
< Content-Type: text/html; charset=utf-8
< Content-Length: 124343
< Connection: keep-alive
< Vary: Accept-Encoding
< Vary: Accept-Encoding
< Expires: Sat, 18 Nov 2017 05:45:52 GMT
< Cache-Control: max-age=30
< ser: 101.115
< Via: BJ-M-YZ-NX-76(HIT), http/1.1 GZ-CT-1-JCS-24 ( [cRs f ])
< Age: 24
< Strict-Transport-Security: max-age=360
<
* Curl_http_done: called premature == 0
* Connection #0 to host www.jd.com left intact
bool(true)
* Trying 183.56.147.1...
* TCP_NODELAY set
* Connected to www.jd.com (183.56.147.1) port 443 (#0)
* NSS error -8023 (SEC_ERROR_PKCS11_DEVICE_ERROR)
* A PKCS #11 module returned CKR_DEVICE_ERROR, indicating that a problem has occurred with the token or slot.
* Curl_http_done: called premature == 0
* Closing connection 0

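From the debug output we can see that the child's request fails right after the TCP connection is established:

* Trying 183.56.147.1...
* TCP_NODELAY set
* Connected to www.jd.com (183.56.147.1) port 443 (#0)
* NSS error -8023 (SEC_ERROR_PKCS11_DEVICE_ERROR)
* A PKCS #11 module returned CKR_DEVICE_ERROR, indicating that a problem has occurred with the token or slot.
* Curl_http_done: called premature == 0
* Closing connection 0

The HTTPS request in the child process hits an NSS error. NSS (Network Security Services) is the TLS library that this libcurl build uses for SSL.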
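
A trace like the one above can be collected from PHP by enabling CURLOPT_VERBOSE and pointing CURLOPT_STDERR at a stream. The following is only a sketch of one way to do that; the helper names fetch_with_trace() and find_nss_error() are made up for this illustration and are not part of the original program.

```php
<?php
// Hypothetical helper: run one request and return the outcome together
// with libcurl's verbose trace.
function fetch_with_trace(string $url): array
{
    $log = fopen('php://temp', 'w+');          // in-memory stream that receives the trace
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_VERBOSE, true);   // emit the "* Trying ..." style lines
    curl_setopt($ch, CURLOPT_STDERR, $log);    // write them to $log instead of STDERR
    $body  = curl_exec($ch);
    $errno = curl_errno($ch);
    $error = curl_error($ch);
    unset($ch);                                // release the handle before closing the log

    rewind($log);
    $trace = stream_get_contents($log);
    fclose($log);

    return ['ok' => $body !== false, 'errno' => $errno, 'error' => $error, 'trace' => $trace];
}

// Hypothetical helper: return the first "NSS error <code> (<name>)" fragment
// found in a trace, or null if the trace contains none.
function find_nss_error(string $trace): ?string
{
    return preg_match('/NSS error (-?\d+) \((\w+)\)/', $trace, $m) ? $m[0] : null;
}
```

Calling fetch_with_trace('https://www.jd.com/') once in the parent and once in the child, then passing each trace to find_nss_error(), shows which side runs into the NSS failure.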

Cause

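   After searching online, the cause appears to lie in libcurl, the library behind PHP's curl extension. An HTTPS request adds certificate verification and symmetric encryption of the traffic on top of plain HTTP. One explanation found online, which is unconfirmed, is that libcurl's SSL layer ties some of its cryptographic state to the process ID: once the parent makes an HTTPS request, that state is created for the parent, and when the child (which has a different pid) later makes its own HTTPS request to the same site, the inherited state no longer fits and the handshake fails with the error above. The trace itself points at the NSS backend: the NSS/PKCS #11 state initialized in the parent is inherited across the fork, and the child's attempt to use it ends in CKR_DEVICE_ERROR.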
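
Because the trace contains "Initializing NSS", the libcurl in this environment is built against NSS. Whether another PHP installation is in the same situation can be read from curl_version(). This is a minimal sketch: curl_uses_nss() is a hypothetical helper, and it rests on the assumption that the failure described here only concerns NSS-backed builds.

```php
<?php
// Hypothetical helper: report whether the TLS backend string looks like NSS.
// With no argument it inspects the libcurl that PHP is linked against;
// curl_version()['ssl_version'] holds strings such as "NSS/3.28.4" or
// "OpenSSL/1.0.2k".
function curl_uses_nss(?string $sslVersion = null): bool
{
    if ($sslVersion === null) {
        $info = curl_version();
        $sslVersion = $info['ssl_version'] ?? '';
    }
    // The backend name is the part before the first slash.
    $backend = explode('/', $sslVersion, 2)[0];
    return strcasecmp($backend, 'NSS') === 0;
}
```

Running var_dump(curl_uses_nss()); before forking any workers gives an early hint about whether the workarounds in the next section are likely to be needed.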

Solutions

  1. Use plain HTTP in the parent process, or use plain HTTP in all of the child processes
<?php
// Parent process: plain HTTP, so no SSL state is created before the fork
$ch = curl_init('http://www.jd.com/');
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$success = curl_exec($ch);
var_dump($success !== false); // true
curl_close($ch);

$pid = pcntl_fork();
if ($pid === 0) {
    // Child process: HTTPS now succeeds
    $ch = curl_init('https://www.jd.com/');
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $success = curl_exec($ch);
    var_dump($success !== false); // true
    $errno = curl_errno($ch); // 0
    $error = curl_error($ch); // empty string
    curl_close($ch);
} else if ($pid > 0) {
    // wait for child process
    pcntl_wait($status);
}

Output:

bool(true)
bool(true)
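  2. Use sockets instead of curl

  3. Use pcntl_exec() instead of pcntl_fork(), so that the worker runs as a fresh process image rather than as a forked copy of the parent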
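
The socket-based workaround could look roughly like the sketch below, which fetches a page through PHP's ssl:// stream transport instead of curl. It assumes the openssl extension is loaded (the ssl:// transport depends on it); build_get_request() and fetch_via_socket() are hypothetical names, and certificate verification is switched off only to mirror the curl options used in this article.

```php
<?php
// Build a minimal HTTP/1.1 GET request. "Connection: close" makes the
// server close the socket after the response, so we can read until EOF.
function build_get_request(string $host, string $path = '/'): string
{
    return "GET {$path} HTTP/1.1\r\n"
         . "Host: {$host}\r\n"
         . "Accept: */*\r\n"
         . "Connection: close\r\n\r\n";
}

// Fetch a page over TLS with PHP streams rather than curl.
// Returns the raw response (status line, headers and body, not decoded)
// or false if the connection could not be established.
function fetch_via_socket(string $host, string $path = '/', float $timeout = 10.0)
{
    $context = stream_context_create(['ssl' => [
        'verify_peer'      => false,   // mirrors CURLOPT_SSL_VERIFYPEER = 0
        'verify_peer_name' => false,   // mirrors CURLOPT_SSL_VERIFYHOST = 0
    ]]);
    $fp = @stream_socket_client(
        "ssl://{$host}:443", $errno, $errstr, $timeout,
        STREAM_CLIENT_CONNECT, $context
    );
    if ($fp === false) {
        return false;
    }
    fwrite($fp, build_get_request($host, $path));
    $response = stream_get_contents($fp);
    fclose($fp);
    return $response;
}
```

In the child process, fetch_via_socket('www.jd.com') would take the place of the curl_init()/curl_exec() sequence. The response is returned raw, so a chunked body would still need decoding.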

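For the pcntl_exec() workaround, one possible shape is to fork and then immediately replace the child with a new PHP process, so that nothing libcurl set up in the parent survives. This is a sketch under assumptions: worker.php is a hypothetical script that reads the URL from $argv[1], the helper names are invented here, and PHP_BINARY is assumed to point at the CLI binary.

```php
<?php
// Build the argument list handed to pcntl_exec(); $script is a
// hypothetical worker file that receives the URL as its first argument.
function worker_args(string $script, string $url): array
{
    return [$script, $url];
}

// Fork, then immediately replace the child with a fresh PHP process.
// Returns the child's PID in the parent; does not return in the child.
function spawn_worker(string $script, string $url): int
{
    $pid = pcntl_fork();
    if ($pid === -1) {
        throw new RuntimeException('pcntl_fork() failed');
    }
    if ($pid === 0) {
        // exec discards the inherited memory image, including any SSL
        // state that libcurl created in the parent.
        pcntl_exec(PHP_BINARY, worker_args($script, $url));
        exit(127); // reached only if pcntl_exec() itself failed
    }
    return $pid;
}
```

The parent would call $pid = spawn_worker(__DIR__ . '/worker.php', 'https://www.jd.com/'); and later pcntl_waitpid($pid, $status); to collect the worker.
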
References

  1. https://stackoverflow.com/questions/26285311/ssl-requests-made-with-curl-fail-after-process-fork

  2. https://stackoverflow.com/questions/15466809/libcurl-ssl-error-after-fork

  3. https://stackoverflow.com/questions/34901910/curl-and-pcntl-fork?lq=1
