Comparing ways to process large XML files in PHP

XMLReader only

Pros: fast, uses little memory

Cons: excessively hard to write and debug, requires lots of userland code to do anything useful. Userland code is slow and prone to error. Plus, it leaves you with more lines of code to maintain

XMLReader + SimpleXML

Pros: doesn’t use much memory (only the memory needed to process one node) and SimpleXML is, as the name implies, really easy to use.

Cons: creating a SimpleXMLElement object for each node is not very fast. You really have to benchmark it to understand whether it’s a problem for you. Even a modest machine would be able to process a thousand nodes per second, though.

XMLReader + DOM

Pros: uses about as much memory as SimpleXML, and XMLReader::expand() is faster than creating a new SimpleXMLElement. I wish it was possible to use simplexml_import_dom() but it doesn’t seem to work in that case

Cons: DOM is annoying to work with. It’s halfway between XMLReader and SimpleXML. Not as complicated and awkward as XMLReader, but light years away from working with SimpleXML.
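
On the simplexml_import_dom() remark above: it most likely refers to passing the expanded node directly. A commonly used workaround is to import the expanded node into a DOMDocument first and hand that copy to simplexml_import_dom(). The sketch below assumes a tiny inline feed (the sample string and variable names are made up for illustration) and says nothing about relative speed.

```php
<?php
// Hypothetical sample data, standing in for a large feed.
$xmlString = '<feed><prod id="1"><name>A</name></prod>'
           . '<prod id="2"><name>B</name></prod></feed>';

$reader = new XMLReader();
$reader->XML($xmlString);

// A DOMDocument that owns the imported copies of each expanded node.
$doc = new DOMDocument();
$names = [];

// Advance to the first <prod> element.
while ($reader->read() && $reader->name !== 'prod') {
    // nothing to do, just skipping
}

while ($reader->name === 'prod') {
    // expand() builds a DOM node for the current element only;
    // importNode() attaches a deep copy to $doc so SimpleXML can wrap it.
    $node = simplexml_import_dom($doc->importNode($reader->expand(), true));
    $names[] = (string) $node->name;
    $reader->next('prod');
}

$reader->close();
```

Whether this beats creating a SimpleXMLElement from readOuterXML() depends on the data, so it is worth benchmarking on a real file.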

My advice: write a prototype with SimpleXML, see if it works for you. If performance is paramount, try DOM. Stay as far away from XMLReader as possible. Remember that the more code you write, the higher the possibility of you introducing bugs or introducing performance regressions.

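Processing large XML files in PHP

The simplest way to parse an XML file is simplexml_load_file, which converts the XML into an object. The problem with simplexml_load_file is that it parses the entire file into memory, which is far from ideal for large XML documents.

XMLReader offers a memory-efficient way to read XML files. It is a streaming pull parser: it is very low-level and fetches the next piece of the document only when told to. That makes XMLReader very memory-efficient but not very programmer-friendly. Fortunately, XMLReader and SimpleXML can be combined.

Test
Large XML file: feed_big.xml.gz. It contains about 40,000 nodes and is 109 MB on disk uncompressed. The XML is very simple: a long run of <prod>…</prod> nodes.

A sample of the file: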

<cafProductFeed>
<datafeed id="xxxx" merchantId="xxxx"
merchantName="xxxxxxxxxxxxxxxxxxxxx">
<prod id="750924782" in_stock="no" is_for_sale="yes" lang="en"
pre_order="no" stock_quantity="0" web_offer="no">
<brand>
<brandName>Maxxis</brandName>
</brand>
<cat>
<awCatId>252</awCatId>
<awCat>Cycling</awCat>
<mCat>Wheels &amp; Tyres > Tyres</mCat>
</cat>
<price curr="GBP">
<buynow>43.99</buynow>
<delivery>0.00</delivery>
<rrp>53.99</rrp>
<store>0.00</store>
</price>
<text>
<name>Maxxis Crossmark Tyre - LUST</name>
<desc>Maxxis Crossmark Tyre - LUSTDesigned with World
Champion Christoph Sauser, the CrossMark is the dramatic
evolution of the Cross Country racing tire. The nearly
continuous center ridge flies on hardpack, yet has enough
spacing to grab wet roots and rocks The slightly raised
ridge of side knobs offers cornering precision never before
seen on a tire this fast Features:LUST TechnologyFast
rolling center ridgeRaised side knobs for better
corneringSize: 26" x 2.1"TPI: 120Max PSI: 60Durometer:
70aBuy Maxxis Tyres from xxxxx, the World’s Largest Online
Bike Store.</desc>
</text>
<uri>
<awTrack>
http://www.awin1.com/pclick.php?p=750924782&a=181769&m=2698</awTrack>
<awImage>
http://images2.productserve.com/?w=200&h=200&bg=white&trim=5&t=letterbox&url=media.xxxxxxxxxxxx.com%2Fis%2Fimage%2Fxxxxxxxxxxxx%2Fprod17336_Black_NE_01%3F%24productfeedlarge%24&feedId=2698&k=0a98fd83cd569b80406b92333bd3ad46e49ccb50</awImage>
<awThumb>
http://images2.productserve.com/?w=70&h=70&bg=white&trim=5&t=letterbox&url=media.xxxxxxxxxxxx.com%2Fis%2Fimage%2Fxxxxxxxxxxxx%2Fprod17336_Black_NE_01%3F%24productfeedlarge%24&feedId=2698&k=0a98fd83cd569b80406b92333bd3ad46e49ccb50</awThumb>
<mImage>
http://media.xxxxxxxxxxxx.com/is/image/xxxxxxxxxxxx/prod17336_Black_NE_01?$productfeedlarge$</mImage>
<mLink>
http://click.pump.to/fm-d0151/NY49D4IwEED~SnODUwsiKtLBxcn4sahxYWlKDY1Am7YGiPG~e2C8pcnL6717w8vVwKEKwfIiLuKu6yJZCd06JWTQppWDrJWPpGmKuBF9rz2TznjfCPdkYXCK1S8fithZZp0pkyxN10DhATxZJRQ08GydUbDAFxsKEjGFFvgSkduZUmE8meOktwOM7CyakZ2mFNn9U-SKKcLI8Xa5Tp6WqC3TKM-nTSLgp3ulVO3JbJI92f7e8RqLwr5EBT5f</mLink>
</uri>
<vertical />
<pId>100003UK</pId>
<colour>Black</colour>
<delTime>UK Free Standard Delivery - 3-4 working
days</delTime>
<lastUpdated>2017-09-18 20:16:31</lastUpdated>
<mpn>TB72545000</mpn>
</prod>
<prod id="750924792" in_stock="yes" is_for_sale="yes" lang="en"
pre_order="no" stock_quantity="16" web_offer="no">
<brand>
<brandName>DMR</brandName>
</brand>
<cat>
<awCatId>252</awCatId>
<awCat>Cycling</awCat>
<mCat>Components > Derailleurs</mCat>
</cat>
<price curr="GBP">
<buynow>12.49</buynow>
<delivery>0.00</delivery>
<rrp>17.99</rrp>
<store>0.00</store>
</price>
<text>
<name>DMR Chain Tugs</name>
<desc>DMR Chain Tugs Available for single speed rear wheels
– BMX or MTB- Invaluable asset for anyone who stretches
chains or knocks their rear wheel out of alignment - CNC
machined to fit 10mm axles - Made in the UKBuy DMR Frames
&amp; Forks from xxxxx, the World's Largest Online Bike
Store.</desc>
</text>
<uri>
<awTrack>
http://www.awin1.com/pclick.php?p=750924792&a=181769&m=2698</awTrack>
<awImage>
http://images2.productserve.com/?w=200&h=200&bg=white&trim=5&t=letterbox&url=media.xxxxxxxxxxxx.com%2Fis%2Fimage%2Fxxxxxxxxxxxx%2Fprod216_Black_NE_01%3F%24productfeedlarge%24&feedId=2698&k=2566b3a761af626a65afe7524356054963064fc8</awImage>
<awThumb>
http://images2.productserve.com/?w=70&h=70&bg=white&trim=5&t=letterbox&url=media.xxxxxxxxxxxx.com%2Fis%2Fimage%2Fxxxxxxxxxxxx%2Fprod216_Black_NE_01%3F%24productfeedlarge%24&feedId=2698&k=2566b3a761af626a65afe7524356054963064fc8</awThumb>
<mImage>
http://media.xxxxxxxxxxxx.com/is/image/xxxxxxxxxxxx/prod216_Black_NE_01?$productfeedlarge$</mImage>
<mLink>
http://click.pump.to/fm-d0151/HY09D4IwFEX~SvPmApYgagcXWIzRwejWpSlFmtCPlBJijP~dB28899z7vjDHETgMKQUuClEsy5KrQRoXtVTJeKc-atRTrrwVRWdjtoVZmt-TKGLIQvRdyWqg0ANne0bBAD~UBwoBeHmkoBBTcMArRLHxncZ3bIf3usKK7tKuqL09SLNukydub4lRGLAyr05bVSbUGm-Dd9qliZxJq6M046jnuBb6gMqlQwl-fw__</mLink>
</uri>
<vertical />
<pId>10000UK</pId>
<colour>Black</colour>
<delTime>UK Free Standard Delivery - 3-4 working
days</delTime>
<lastUpdated>2017-09-18 20:17:47</lastUpdated>
<mpn>DMR-CT-K</mpn>
</prod>

...

Approach 1: simplexml_load_file

test01.php

<?php

if(empty($argv[1]))
{
    die("Please specify xml file to parse.\n");
}

$countIx = 0;

$xml = simplexml_load_file('compress.zlib://'.$argv[1]);

if($xml === false)
{
    die('Unable to load and parse the xml file: ' . error_get_last()['message'] );
}

foreach($xml->datafeed->prod as $element)
{
    $prod = array(
        'name' => strval($element->text->name),
        'price' => strval($element->price->buynow),
        'currency' => strval($element->price->attributes()->curr)
    );

    print_r($prod);
    echo "\n";
    $countIx++;
}

print "Number of items=$countIx\n";
print "memory_get_usage() =" . memory_get_usage()/1024 . "kb\n";
print "memory_get_usage(true) =" . memory_get_usage(true)/1024 . "kb\n";
print "memory_get_peak_usage() =" . memory_get_peak_usage()/1024 . "kb\n";
print "memory_get_peak_usage(true) =" . memory_get_peak_usage(true)/1024 . "kb\n";

print "custom memory_get_process_usage() =" . memory_get_process_usage() . "kb\n";


/**
 * Returns memory usage read from /proc/<PID>/status, in kilobytes.
 *
 * @return int|bool sum of VmRSS and VmSwap in kB. On error returns false.
 */
function memory_get_process_usage()
{
    $status = file_get_contents('/proc/' . getmypid() . '/status');

    $matchArr = array();
    preg_match_all('~^(VmRSS|VmSwap):\s*([0-9]+).*$~im', $status, $matchArr);

    if(!isset($matchArr[2][0]) || !isset($matchArr[2][1]))
    {
        return false;
    }

    return intval($matchArr[2][0]) + intval($matchArr[2][1]);
}

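Approach 2: XMLReader and SimpleXMLElement

The right way to process a large XML file is to combine XMLReader with SimpleXMLElement: XMLReader streams the document, and SimpleXMLElement keeps the per-node code programmer-friendly.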

test02.php

<?php

if(empty($argv[1]))
{
    die("Please specify xml file to parse.\n");
}

$countIx = 0;

$xml = new XMLReader();
$xml->open('compress.zlib://'.$argv[1]);

while($xml->read() && $xml->name != 'prod')
{
    ;
}

while($xml->name == 'prod')
{
    $element = new SimpleXMLElement($xml->readOuterXML());

    $prod = array(
        'name' => strval($element->text->name),
        'price' => strval($element->price->buynow),
        'currency' => strval($element->price->attributes()->curr)
    );

    print_r($prod);
    print "\n";
    $countIx++;

    $xml->next('prod');
    unset($element);
}

print "Number of items=$countIx\n";
print "memory_get_usage() =" . memory_get_usage()/1024 . "kb\n";
print "memory_get_usage(true) =" . memory_get_usage(true)/1024 . "kb\n";
print "memory_get_peak_usage() =" . memory_get_peak_usage()/1024 . "kb\n";
print "memory_get_peak_usage(true) =" . memory_get_peak_usage(true)/1024 . "kb\n";

print "custom memory_get_process_usage() =" . memory_get_process_usage() . "kb\n";


$xml->close();

/**
 * Returns memory usage read from /proc/<PID>/status, in kilobytes.
 *
 * @return int|bool sum of VmRSS and VmSwap in kB. On error returns false.
 */
function memory_get_process_usage()
{
    $status = file_get_contents('/proc/' . getmypid() . '/status');

    $matchArr = array();
    preg_match_all('~^(VmRSS|VmSwap):\s*([0-9]+).*$~im', $status, $matchArr);

    if(!isset($matchArr[2][0]) || !isset($matchArr[2][1]))
    {
        return false;
    }

    return intval($matchArr[2][0]) + intval($matchArr[2][1]);
}

Open the XML document through the compress.zlib:// stream wrapper, because the file is gzip-compressed:

$xml->open('compress.zlib://'.$argv[1]);

Skip all nodes until the first product is reached:

while($xml->read() && $xml->name != 'prod'){;}

When this loop finishes, XMLReader has either reached the first product or hit the end of the file. If it reached the first product, the stream cursor sits on that prod node and execution enters the while loop below.

while($xml->name == 'prod')
{
    $element = new SimpleXMLElement($xml->readOuterXML());
    ...
    $xml->next('prod');
    unset($element);
}

XMLReader::readOuterXML() returns the current node, including its own markup, as a string, so only one node is parsed at a time. When we are finished with the node, unset() releases the SimpleXMLElement so that PHP can reclaim its memory.

XMLReader::next('prod') jumps to the next prod node, skipping the subtree of the current one.

Finally, close the input that XMLReader is parsing:

$xml->close();
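
The read/next loop walked through above can be wrapped in a generator so that calling code only sees SimpleXMLElement objects. This is a sketch under assumptions: iterateElements() is a hypothetical helper name, and the caller is expected to open and close the reader.

```php
<?php
/**
 * Yields one SimpleXMLElement per element named $name.
 * Hypothetical helper; the reader must already be opened by the caller.
 */
function iterateElements(XMLReader $reader, string $name): Generator
{
    // Skip everything until the first matching element (or end of input).
    while ($reader->read() && $reader->name !== $name) {
        // nothing to do, just skipping
    }

    // Parse one node at a time; only the current node is held in memory.
    while ($reader->name === $name) {
        yield new SimpleXMLElement($reader->readOuterXml());
        $reader->next($name);
    }
}
```

A caller could then write foreach (iterateElements($xml, 'prod') as $element) { ... } and keep the parsing details out of the business logic.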

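Multidimensional arrays inside double-quoted strings in PHP

Today, while processing a JSON file with PHP in VSCode, I ran into this problem. VSCode flagged it, but I did not pay attention.

I worked around it another way, but I still wanted to know the cause. I found it on Stack Overflow:

https://stackoverflow.com/questions/38054476/how-to-put-multidimensional-arrays-double-quoted-strings

Printing a one-dimensional array element inside double quotes works fine: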

<?php echo "my name is: $arr[name]"; ?>

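But printing an element of a two-dimensional (or deeper) array goes wrong:

<?php echo "he is $twoDimArr[family][1]"; ?>

The output is:

he is Array[1]

The PHP manual has the answer:

http://php.net/manual/en/language.types.string.php#language.types.string.parsing

A single-level array access is covered by the simple interpolation syntax, which stops after the first pair of brackets. Two-dimensional and deeper accesses need the complex syntax, which wraps the expression in curly braces: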

echo "he is {$twoDimArr['family'][1]}";
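
A small self-contained comparison of the two syntaxes, with made-up sample arrays ($arr and $twoDimArr hold illustrative values only):

```php
<?php
// Illustrative data, not taken from the original JSON file.
$arr = ['name' => 'Tom'];
$twoDimArr = ['family' => ['Alice', 'Bob']];

// Simple syntax: one level of indexing, key written without quotes.
$simple = "my name is: $arr[name]";

// Complex (curly) syntax: any expression depth, keys quoted as in normal code.
$complex = "he is {$twoDimArr['family'][1]}";

echo $simple, "\n", $complex, "\n";
```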

 

What ob_start() is for in PHP

A comment on the PHP website explains it very well:

These are two usages of ob_start():

1-Well, you have more control over the output. Trivial example: 
say you want to show the user an error message, but the script has 
already sent some HTML to the browser. It'll look ugly, with a 
half-rendered page and then an error message. Using the output buffering 
functions, you can simply delete the buffer and send only 
the error message, which means it looks all nice and neat.

2-The reason output buffering was invented was to create a seamless 
transfer, from: php engine -> apache -> operating system -> web user

If you make sure each of those use the same buffer size, the system 
will use less writes, use less system resources and be able
to handle more traffic.
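
The first usage from the comment can be sketched as follows. renderPage() and its $fail flag are hypothetical, introduced only to simulate an error that occurs after some HTML has already been produced.

```php
<?php
/**
 * Renders a page into a buffer; on failure the partial output is thrown away
 * and only the error message is returned. Hypothetical example function.
 */
function renderPage(bool $fail): string
{
    ob_start();                     // start capturing everything that is echoed
    echo "<h1>Report</h1>";

    try {
        if ($fail) {
            throw new RuntimeException('data source unavailable');
        }
        echo "<p>All good.</p>";
        return ob_get_clean();      // hand back the complete page
    } catch (RuntimeException $e) {
        ob_end_clean();             // discard the half-rendered page
        return "Error: " . $e->getMessage();
    }
}
```

Without the buffer, the heading would already be on its way to the browser by the time the error is detected.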

 

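Things to check after installing Matomo

Matomo was formerly known as Piwik. After installing it by following the official instructions, a few things need tuning:

  1. Auto-archiving

The details are in the official guide:

How to Set up Auto-Archiving of Your Reports

If you use the LNMP one-click install package by 军哥 (Junge) and your domain is example.com, the crontab entry can be written like this:

5 * * * * /usr/bin/php /home/wwwroot/example.com/console core:archive --url=https://example.com > /home/wwwlogs/matomo-archive.log

  2. Force SSL: under [General] in config/config.ini.php, add force_ssl = 1

  3. MaxMind's free database now requires registering an account before you can use it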

 

 

In the end, the test failed. Matomo's reporting could not keep up: at more than a million page views a day it buckled.

Gave up!

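Checking PHP 7 compatibility

PHP 7 is already at 7.4, so it is time to move my Discuz forum to PHP 7. After weighing compatibility concerns for a while, I decided to upgrade to PHP 7.3.

The Discuz core is already on the latest release, Discuz 20191201, which should be fine on PHP 7.3. What remains is the compatibility of the installed plugins.

To check PHP 7 compatibility we use the mainstream combination of PHPCompatibility and PHP_CodeSniffer:

https://github.com/squizlabs/PHP_CodeSniffer
https://github.com/PHPCompatibility/PHPCompatibility

PHPCompatibility is a plugin (a coding standard) for PHP_CodeSniffer, so PHP_CodeSniffer has to be installed first. GitHub lists many installation methods; here we take the most direct one: download the phar files with curl or wget, then move them to a global location.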

# Download using curl
curl -OL https://squizlabs.github.io/PHP_CodeSniffer/phpcs.phar
curl -OL https://squizlabs.github.io/PHP_CodeSniffer/phpcbf.phar

# Or download using wget
wget https://squizlabs.github.io/PHP_CodeSniffer/phpcs.phar
wget https://squizlabs.github.io/PHP_CodeSniffer/phpcbf.phar

# register as global commands
mv phpcs.phar phpcs
mv phpcbf.phar phpcbf
chmod 755 phpcs
chmod 755 phpcbf
mv phpcs /usr/local/bin/
mv phpcbf /usr/local/bin/

That completes the phpcs installation. Next, download the PHPCompatibility plugin and tell phpcs to use it:

# Download PHPCompatibility
cd ~
wget https://github.com/PHPCompatibility/PHPCompatibility/archive/9.3.5.zip
unzip 9.3.5.zip
# the archive unpacks to /root/PHPCompatibility-9.3.5 (when run as root)
# configure phpcs to use PHPCompatibility
phpcs --config-set installed_paths /root/PHPCompatibility-9.3.5

phpcs can now use PHPCompatibility 9.3.5. You can also fetch it with git instead, which makes updating PHPCompatibility easier.

Suppose the folder to check is /home/plugin. Then we can run:

phpcs -p --standard=PHPCompatibility --runtime-set testVersion 7.3 --report-full=/home/php.log /home/plugin

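-p: print progress to the console

--standard: which coding standard to use

--runtime-set testVersion 7.3: check against PHP 7.3

--report-full: write the full report to the given file

With the check done, I removed the plugins that are incompatible with PHP 7.3 and then started preparing to move Discuz to PHP 7.3.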

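The PHP SimpleXMLElement object

Today, while writing a database interface with RedBeanPHP, I found that values taken out of a SimpleXMLElement object, which by rights should all be strings, could not be assigned to RedBeanPHP. A var_dump showed something interesting: the nested children and properties of a SimpleXMLElement are all objects, and they are all SimpleXMLElement objects themselves. A Google search showed that plenty of people ask about this.

https://stackoverflow.com/questions/416548/forcing-a-simplexml-object-to-a-string-regardless-of-context

Suppose the XML looks like this: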

<channel>
<item>
<title>This is title 1</title>
</item>
</channel>

This does print a string (the root element <channel> is $xml itself, so the path starts at item):

$xml = simplexml_load_string($xmlstring);
echo $xml->item->title;

But outside of echo, the value is not treated as a string:

$foo = array( $xml->item->title );

That is because $xml->item->title is still a SimpleXMLElement object.

A typecast solves the problem:

$foo = array( (string) $xml->item->title );

The cast internally calls __toString() on the SimpleXMLElement, which returns the text content of the element. echo works for the same reason: it converts its argument to a string.

I have also seen people write it this way, which works too:

$foo = array( $xml->item->title.'' );

gettype confirms that the result is a string. The concatenation operator converts both operands to strings, so it triggers the same conversion as the cast.
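
The three cases can be checked side by side. In this sketch the sample XML has <channel> as its root element, so $xml is the channel itself and the path starts at item; the variable names are illustrative.

```php
<?php
$xmlstring = '<channel><item><title>This is title 1</title></item></channel>';
$xml = simplexml_load_string($xmlstring);

// Property access returns another SimpleXMLElement, not a string.
$raw = $xml->item->title;

// An explicit cast converts it to its text content.
$cast = (string) $xml->item->title;

// Concatenation converts the operand to a string as well.
$concat = $xml->item->title . '';

echo gettype($raw), ' ', gettype($cast), ' ', gettype($concat), "\n";
```

The explicit cast is usually the clearer choice, since it states the intent directly.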

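nginx and CGI, FastCGI, php-cgi, and php-fpm

In the era when Apache ruled, CGI was everywhere. With the arrival of nginx and the passage of time, CGI programs have become increasingly rare.

I recently needed to install SmokePing, so CGI came out of the drawer again.

CGI stands for Common Gateway Interface. Unfortunately, the name does not tell you much about what it does.

A web server by itself serves static content. To handle dynamic content it relies on web applications written in PHP, JSP, Python, Perl, and so on. But how does the web server pass a dynamic request to these applications? That is what the CGI protocol is for. Yes, it is a protocol: the convention the web server and the web application follow when they talk to each other. In other words, with the CGI protocol plus a deployed web application, the web server can "handle" dynamic requests (put differently, a request for a particular resource triggers a web application that performs some function). The quotation marks around "handle" are there because the server only delegates the work. A CGI program can be any executable: a shell script, a binary, or another kind of script (Python, Ruby, and so on).

In the simplest CGI workflow, the web server receives a request, starts the CGI program as a new process, passes it the request data, and sends the program's output back to the client.

There are several ways to execute a CGI program. On the HTTP side, CGI scripts are typically invoked with GET or POST. The two differ in how the parameters reach the program: with GET they travel in the URL and arrive in the QUERY_STRING environment variable, while with POST they travel in the request body and arrive on standard input.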
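
To make the GET/POST difference concrete, here is a minimal sketch of what a CGI program does with its inputs. handleCgiRequest() is a hypothetical function; a real CGI script would read getenv() and php://stdin and echo the result, whereas here the environment and body are passed in as arguments so the logic can be run in isolation.

```php
<?php
/**
 * Builds a CGI-style response (headers, blank line, body) from the
 * environment variables and standard input that a web server would supply.
 * Hypothetical helper for illustration only.
 */
function handleCgiRequest(array $env, string $stdin): string
{
    $method = $env['REQUEST_METHOD'] ?? 'GET';

    // GET parameters arrive in QUERY_STRING, POST parameters on stdin.
    $raw = $method === 'POST' ? $stdin : ($env['QUERY_STRING'] ?? '');
    parse_str($raw, $params);

    $name = $params['name'] ?? 'world';

    // The server forwards this output to the client.
    return "Content-Type: text/plain\r\n\r\n" . "Hello, $name\n";
}
```

Under plain CGI the server starts a fresh process for every such request, which is the cost that FastCGI avoids by keeping the process alive.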

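CGI programs can be written in any language, though some languages suit it well and others are tedious. With bash, for example, you end up using echo and similar statements to wrap results in an enormous amount of HTML tags for the client. Perl, PHP, and Python are commonly used for CGI. Java can do it too, but Java servlets cover everything CGI does and are better optimized and easier to develop with.

Broadly speaking, CGI programs are short-lived applications and FastCGI programs are long-lived ones. FastCGI is like a resident, long-lived CGI: the process keeps running, so no time is spent forking a new one for every request.

CGI and FastCGI define their own input and output conventions, such as HTTP headers, CGI environment variables, GET and POST handling, and so on.

CGI and FastCGI are protocols. CGI is the slower one; FastCGI is much faster.

CGI programs can be written in Python, Perl, shell, C, C++, and other languages. There is no firm rule, but CGI programs written in C usually get the .cgi suffix and Perl ones the .pl suffix, although .cgi works for all of them. Because Perl is portable across operating systems and easy to modify, it became the dominant CGI language, to the point that "a CGI program" often simply meant a Perl program.

nginx + FastCGI: nginx can only serve static files itself. For dynamic content, FastCGI is generally the protocol used to "talk" to the application. FastCGI processes are managed by a FastCGI process manager rather than by nginx, so one is needed; spawn-fcgi can fill that role.

nginx + CGI: nginx cannot execute external programs directly, so it has no native CGI support. It does support FastCGI, though, so a CGI program can be wrapped in FastCGI to get CGI support indirectly. fcgiwrap is a common FastCGI wrapper.

A CGI program can also be treated as a FastCGI program, so nginx + spawn-fcgi can run CGI as well.

spawn-fcgi is a general-purpose FastCGI process manager that is part of lighttpd. Many people use lighttpd's spawn-fcgi to manage FastCGI processes, but it has quite a few shortcomings. The arrival of PHP-FPM eased some of them.

FastCGI is a protocol, and php-fpm implements it.

There are generally two ways to run PHP behind a web server: php-cgi and php-fpm (PHP FastCGI Process Manager). Apache's mod_php, which runs PHP as a module, as well as the CLI and ISAPI modes, are out of scope here; in total PHP has five execution modes.

php-fpm is faster than the traditional CGI approach (suPHP).

php-cgi is the FastCGI process manager that ships with PHP. After changing php.ini you must restart php-cgi for the new configuration to take effect; it cannot reload gracefully. Also, if the php-cgi process is killed, PHP stops working. php-fpm and spawn-fcgi do not have these problems.

php-fpm is a FastCGI process manager that supports PHP only. CGI programs written in other languages, such as Perl, Python, or C, cannot use it.

mod_php and php-fpm are two ways of running PHP; mod_php runs PHP as an Apache module.

I came across a very good explanation on SegmentFault: https://segmentfault.com/q/1010000000256516

You (PHP) go to do business with an Eskimo (the web server, such as Apache or nginx).

You speak Chinese (PHP code) and he speaks Eskimo (C code). Neither understands the other, so what to do? Both of you translate what you say into English (the FastCGI protocol).

How do you translate? You use a translation device (PHP-FPM).
(The other side has one as well, built in.)

Ours is the latest model; the old one (PHP-CGI) has been retired. However, only young people (Linux systems) know how to use the new one (PHP-FPM). The old folks (Windows systems) cannot operate it and have to keep using the old model.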