Comparing several ways to process large XML files in PHP

XMLReader only

Pros: fast, uses little memory

Cons: excessively hard to write and debug, and it requires a lot of userland code to do anything useful. Userland code is slow and error-prone, and it leaves you with more lines of code to maintain.

XMLReader + SimpleXML

Pros: doesn’t use much memory (only the memory needed to process one node) and SimpleXML is, as the name implies, really easy to use.

Cons: creating a SimpleXMLElement object for each node is not very fast. You really have to benchmark it to understand whether it’s a problem for you. Even a modest machine would be able to process a thousand nodes per second, though.

XMLReader + DOM

Pros: uses about as much memory as SimpleXML, and XMLReader::expand() is faster than creating a new SimpleXMLElement. I wish it were possible to pass the expanded node straight to simplexml_import_dom(), but that doesn’t seem to work; a commonly cited workaround is to import the node into a DOMDocument first.

Cons: DOM is annoying to work with. It’s halfway between XMLReader and SimpleXML. Not as complicated and awkward as XMLReader, but light years away from working with SimpleXML.
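As a rough illustration of the XMLReader + DOM combination, the sketch below expands each prod node with XMLReader::expand(), imports it into a throwaway DOMDocument, and hands it to simplexml_import_dom(). The inline XML string and the $products array are assumptions made up for this example; the importNode() step is a commonly cited workaround, not something the original text confirms, so benchmark it before relying on it.

```php
<?php

// Hypothetical inline sample standing in for the real product feed.
$xmlString = '<cafProductFeed><datafeed id="1">'
    . '<prod id="1"><price curr="GBP"><buynow>43.99</buynow></price>'
    . '<text><name>Tyre</name></text></prod>'
    . '<prod id="2"><price curr="GBP"><buynow>12.49</buynow></price>'
    . '<text><name>Chain Tugs</name></text></prod>'
    . '</datafeed></cafProductFeed>';

$reader = new XMLReader();
$reader->XML($xmlString);

// The expanded nodes are imported into this document so that
// simplexml_import_dom() receives a node with an owner document.
$doc = new DOMDocument();
$products = array();

// Skip everything until the first prod node.
while ($reader->read() && $reader->name !== 'prod')
{
    ;
}

while ($reader->name === 'prod')
{
    // expand() builds a DOM subtree for the current node only.
    $node = $reader->expand();
    $element = simplexml_import_dom($doc->importNode($node, true));

    $products[] = array(
        'name'     => (string) $element->text->name,
        'price'    => (string) $element->price->buynow,
        'currency' => (string) $element->price['curr'],
    );

    $reader->next('prod');
}

$reader->close();
print_r($products);
```

If the SimpleXML layer is not needed, the DOMNode returned by expand() can be queried directly with the usual DOM methods instead.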

My advice: write a prototype with SimpleXML and see if it works for you. If performance is paramount, try DOM. Stay as far away from pure XMLReader code as possible. Remember that the more code you write, the more likely you are to introduce bugs or performance regressions.

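Processing large XML files in PHP

The simplest way to parse an XML file is simplexml_load_file, which converts the XML into an object. The problem with simplexml_load_file is that it parses the entire file into memory, which is not ideal for large XML documents.

XMLReader offers a memory-efficient way to read XML files. It is a streaming pull parser: it is quite low-level and fetches the next fragment of the document only when told to. That makes XMLReader very memory-efficient but not very programmer-friendly. Fortunately, XMLReader and SimpleXML can be combined.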
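To make the "pull" idea concrete, here is a minimal sketch in which nothing is parsed until read() is called, and each call advances the cursor by exactly one node. The tiny inline document and the $seen array are made up for illustration.

```php
<?php

$reader = new XMLReader();
// Hypothetical miniature document; a real feed would be opened with open().
$reader->XML('<feed><prod id="1"><name>A</name></prod><prod id="2"><name>B</name></prod></feed>');

$seen = array();

// Each read() call pulls exactly one node from the stream.
while ($reader->read())
{
    // Only record opening element tags; skip text and closing tags.
    if ($reader->nodeType === XMLReader::ELEMENT)
    {
        // depth is 0 for the root element and grows with nesting.
        $seen[] = str_repeat('  ', $reader->depth) . $reader->name;
    }
}

$reader->close();
echo implode("\n", $seen), "\n";
```

Because the caller drives the loop, only the current node needs to be held in memory, which is where the low memory footprint comes from.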

Test
Large XML file: feed_big.xml.gz. It contains about 40,000 nodes and is 109 MB uncompressed on disk. The XML is very simple: a long run of <prod>…</prod> nodes.

The XML looks like this:

<cafProductFeed>
<datafeed id="xxxx" merchantId="xxxx"
merchantName="xxxxxxxxxxxxxxxxxxxxx">
<prod id="750924782" in_stock="no" is_for_sale="yes" lang="en"
pre_order="no" stock_quantity="0" web_offer="no">
<brand>
<brandName>Maxxis</brandName>
</brand>
<cat>
<awCatId>252</awCatId>
<awCat>Cycling</awCat>
<mCat>Wheels &amp; Tyres > Tyres</mCat>
</cat>
<price curr="GBP">
<buynow>43.99</buynow>
<delivery>0.00</delivery>
<rrp>53.99</rrp>
<store>0.00</store>
</price>
<text>
<name>Maxxis Crossmark Tyre - LUST</name>
<desc>Maxxis Crossmark Tyre - LUSTDesigned with World
Champion Christoph Sauser, the CrossMark is the dramatic
evolution of the Cross Country racing tire. The nearly
continuous center ridge flies on hardpack, yet has enough
spacing to grab wet roots and rocks The slightly raised
ridge of side knobs offers cornering precision never before
seen on a tire this fast Features:LUST TechnologyFast
rolling center ridgeRaised side knobs for better
corneringSize: 26" x 2.1"TPI: 120Max PSI: 60Durometer:
70aBuy Maxxis Tyres from xxxxx, the World’s Largest Online
Bike Store.</desc>
</text>
<uri>
<awTrack>
http://www.awin1.com/pclick.php?p=750924782&a=181769&m=2698</awTrack>
<awImage>
http://images2.productserve.com/?w=200&h=200&bg=white&trim=5&t=letterbox&url=media.xxxxxxxxxxxx.com%2Fis%2Fimage%2Fxxxxxxxxxxxx%2Fprod17336_Black_NE_01%3F%24productfeedlarge%24&feedId=2698&k=0a98fd83cd569b80406b92333bd3ad46e49ccb50</awImage>
<awThumb>
http://images2.productserve.com/?w=70&h=70&bg=white&trim=5&t=letterbox&url=media.xxxxxxxxxxxx.com%2Fis%2Fimage%2Fxxxxxxxxxxxx%2Fprod17336_Black_NE_01%3F%24productfeedlarge%24&feedId=2698&k=0a98fd83cd569b80406b92333bd3ad46e49ccb50</awThumb>
<mImage>
http://media.xxxxxxxxxxxx.com/is/image/xxxxxxxxxxxx/prod17336_Black_NE_01?$productfeedlarge$</mImage>
<mLink>
http://click.pump.to/fm-d0151/NY49D4IwEED~SnODUwsiKtLBxcn4sahxYWlKDY1Am7YGiPG~e2C8pcnL6717w8vVwKEKwfIiLuKu6yJZCd06JWTQppWDrJWPpGmKuBF9rz2TznjfCPdkYXCK1S8fithZZp0pkyxN10DhATxZJRQ08GydUbDAFxsKEjGFFvgSkduZUmE8meOktwOM7CyakZ2mFNn9U-SKKcLI8Xa5Tp6WqC3TKM-nTSLgp3ulVO3JbJI92f7e8RqLwr5EBT5f</mLink>
</uri>
<vertical />
<pId>100003UK</pId>
<colour>Black</colour>
<delTime>UK Free Standard Delivery - 3-4 working
days</delTime>
<lastUpdated>2017-09-18 20:16:31</lastUpdated>
<mpn>TB72545000</mpn>
</prod>
<prod id="750924792" in_stock="yes" is_for_sale="yes" lang="en"
pre_order="no" stock_quantity="16" web_offer="no">
<brand>
<brandName>DMR</brandName>
</brand>
<cat>
<awCatId>252</awCatId>
<awCat>Cycling</awCat>
<mCat>Components > Derailleurs</mCat>
</cat>
<price curr="GBP">
<buynow>12.49</buynow>
<delivery>0.00</delivery>
<rrp>17.99</rrp>
<store>0.00</store>
</price>
<text>
<name>DMR Chain Tugs</name>
<desc>DMR Chain Tugs Available for single speed rear wheels
– BMX or MTB- Invaluable asset for anyone who stretches
chains or knocks their rear wheel out of alignment - CNC
machined to fit 10mm axles - Made in the UKBuy DMR Frames
&amp; Forks from xxxxx, the World's Largest Online Bike
Store.</desc>
</text>
<uri>
<awTrack>
http://www.awin1.com/pclick.php?p=750924792&a=181769&m=2698</awTrack>
<awImage>
http://images2.productserve.com/?w=200&h=200&bg=white&trim=5&t=letterbox&url=media.xxxxxxxxxxxx.com%2Fis%2Fimage%2Fxxxxxxxxxxxx%2Fprod216_Black_NE_01%3F%24productfeedlarge%24&feedId=2698&k=2566b3a761af626a65afe7524356054963064fc8</awImage>
<awThumb>
http://images2.productserve.com/?w=70&h=70&bg=white&trim=5&t=letterbox&url=media.xxxxxxxxxxxx.com%2Fis%2Fimage%2Fxxxxxxxxxxxx%2Fprod216_Black_NE_01%3F%24productfeedlarge%24&feedId=2698&k=2566b3a761af626a65afe7524356054963064fc8</awThumb>
<mImage>
http://media.xxxxxxxxxxxx.com/is/image/xxxxxxxxxxxx/prod216_Black_NE_01?$productfeedlarge$</mImage>
<mLink>
http://click.pump.to/fm-d0151/HY09D4IwFEX~SvPmApYgagcXWIzRwejWpSlFmtCPlBJijP~dB28899z7vjDHETgMKQUuClEsy5KrQRoXtVTJeKc-atRTrrwVRWdjtoVZmt-TKGLIQvRdyWqg0ANne0bBAD~UBwoBeHmkoBBTcMArRLHxncZ3bIf3usKK7tKuqL09SLNukydub4lRGLAyr05bVSbUGm-Dd9qliZxJq6M046jnuBb6gMqlQwl-fw__</mLink>
</uri>
<vertical />
<pId>10000UK</pId>
<colour>Black</colour>
<delTime>UK Free Standard Delivery - 3-4 working
days</delTime>
<lastUpdated>2017-09-18 20:17:47</lastUpdated>
<mpn>DMR-CT-K</mpn>
</prod>

...

Approach 1: simplexml_load_file

test01.php

<?php

// Usage: php test01.php feed_big.xml.gz
if(empty($argv[1]))
{
    die("Please specify xml file to parse.\n");
}

$countIx = 0;

// simplexml_load_file() parses the whole document into memory at once.
// The compress.zlib:// wrapper decompresses the gzip file on the fly.
$xml = simplexml_load_file('compress.zlib://'.$argv[1]);

if($xml === false)
{
    die('Unable to load and parse the xml file: ' . error_get_last()['message'] );
}

foreach($xml->datafeed->prod as $element)
{
    $prod = array(
        'name' => strval($element->text->name),
        'price' => strval($element->price->buynow),
        'currency' => strval($element->price->attributes()->curr)
    );

    print_r($prod);
    echo "\n";
    $countIx++;
}

print "Number of items=$countIx\n";
print "memory_get_usage() =" . memory_get_usage()/1024 . "kb\n";
print "memory_get_usage(true) =" . memory_get_usage(true)/1024 . "kb\n";
print "memory_get_peak_usage() =" . memory_get_peak_usage()/1024 . "kb\n";
print "memory_get_peak_usage(true) =" . memory_get_peak_usage(true)/1024 . "kb\n";

print "custom memory_get_process_usage() =" . memory_get_process_usage() . "kb\n";


/**
 * Returns memory usage from /proc/<PID>/status in kB (Linux only).
 *
 * @return int|bool sum of VmRSS and VmSwap in kB. On error returns false.
 */
function memory_get_process_usage()
{
    $status = file_get_contents('/proc/' . getmypid() . '/status');

    if($status === false)
    {
        return false;
    }

    // The kernel reports both values in kB, e.g. "VmRSS:   123456 kB".
    $matchArr = array();
    preg_match_all('~^(VmRSS|VmSwap):\s*([0-9]+).*$~im', $status, $matchArr);

    if(!isset($matchArr[2][0]) || !isset($matchArr[2][1]))
    {
        return false;
    }

    return intval($matchArr[2][0]) + intval($matchArr[2][1]);
}

Approach 2: XMLReader and SimpleXMLElement

The right way to process large XML files is to combine XMLReader with SimpleXMLElement, which is a bit friendlier to the programmer.

test02.php

<?php

// Usage: php test02.php feed_big.xml.gz
if(empty($argv[1]))
{
    die("Please specify xml file to parse.\n");
}

$countIx = 0;

$xml = new XMLReader();

// open() returns false when the file cannot be opened.
if(!$xml->open('compress.zlib://'.$argv[1]))
{
    die("Unable to open the xml file.\n");
}

// Skip everything until the first prod node (or the end of the file).
while($xml->read() && $xml->name != 'prod')
{
    ;
}

while($xml->name == 'prod')
{
    // Parse only the current prod node into a SimpleXMLElement.
    $element = new SimpleXMLElement($xml->readOuterXML());

    $prod = array(
        'name' => strval($element->text->name),
        'price' => strval($element->price->buynow),
        'currency' => strval($element->price->attributes()->curr)
    );

    print_r($prod);
    print "\n";
    $countIx++;

    // Jump to the next prod node and release the current element.
    $xml->next('prod');
    unset($element);
}

print "Number of items=$countIx\n";
print "memory_get_usage() =" . memory_get_usage()/1024 . "kb\n";
print "memory_get_usage(true) =" . memory_get_usage(true)/1024 . "kb\n";
print "memory_get_peak_usage() =" . memory_get_peak_usage()/1024 . "kb\n";
print "memory_get_peak_usage(true) =" . memory_get_peak_usage(true)/1024 . "kb\n";

print "custom memory_get_process_usage() =" . memory_get_process_usage() . "kb\n";


$xml->close();

/**
 * Returns memory usage from /proc/<PID>/status in kB (Linux only).
 *
 * @return int|bool sum of VmRSS and VmSwap in kB. On error returns false.
 */
function memory_get_process_usage()
{
    $status = file_get_contents('/proc/' . getmypid() . '/status');

    if($status === false)
    {
        return false;
    }

    // The kernel reports both values in kB, e.g. "VmRSS:   123456 kB".
    $matchArr = array();
    preg_match_all('~^(VmRSS|VmSwap):\s*([0-9]+).*$~im', $status, $matchArr);

    if(!isset($matchArr[2][0]) || !isset($matchArr[2][1]))
    {
        return false;
    }

    return intval($matchArr[2][0]) + intval($matchArr[2][1]);
}

Open the XML document. The compress.zlib:// wrapper is used because the file is gzip-compressed:

$xml->open('compress.zlib://'.$argv[1]);

Skip all nodes until the first product is reached:

while($xml->read() && $xml->name != 'prod'){;}

When the loop above finishes, XMLReader has either reached the first product or hit the end of the file. If it reached the first product, the stream cursor sits on the first prod node of the document and the while loop below is entered.

while($xml->name == 'prod')
{
    $element = new SimpleXMLElement($xml->readOuterXML());
    ...
    $xml->next('prod');
    unset($element);
}

XMLReader::readOuterXML() returns the current node, including its markup, as a string, so only one prod node is parsed into a SimpleXMLElement at a time. When we are done with the node, unset() destroys the element so that PHP can free its memory.

XMLReader::next('prod') then jumps to the next prod node, skipping the subtree of the current one.

Finally, close the input that XMLReader is parsing:

$xml->close();
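The steps above can also be wrapped in a generator so that calling code simply loops over products. This is only a sketch: the function name iterate_products() and the temporary sample file are hypothetical, and for a gzip-compressed feed the path would be prefixed with compress.zlib:// as in test02.php.

```php
<?php

/**
 * Yields one SimpleXMLElement per prod node (hypothetical helper).
 */
function iterate_products($uri)
{
    $reader = new XMLReader();

    if (!$reader->open($uri))
    {
        throw new RuntimeException("Unable to open $uri");
    }

    try
    {
        // Skip everything until the first prod node.
        while ($reader->read() && $reader->name !== 'prod')
        {
            ;
        }

        while ($reader->name === 'prod')
        {
            // Only the current prod node is parsed into an object.
            yield new SimpleXMLElement($reader->readOuterXml());
            $reader->next('prod');
        }
    }
    finally
    {
        // Runs when the loop ends or the generator is destroyed early.
        $reader->close();
    }
}

// Write a tiny made-up feed to a temporary file for the demonstration.
$path = tempnam(sys_get_temp_dir(), 'feed');
file_put_contents(
    $path,
    '<cafProductFeed><datafeed id="1">'
    . '<prod id="1"><text><name>Tyre</name></text></prod>'
    . '<prod id="2"><text><name>Chain Tugs</name></text></prod>'
    . '</datafeed></cafProductFeed>'
);

$names = array();
foreach (iterate_products($path) as $prod)
{
    $names[] = (string) $prod->text->name;
}

unlink($path);
print_r($names);
```

The generator keeps the one-node-at-a-time memory profile, because each SimpleXMLElement can be released as soon as the loop body moves on to the next product.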

PHP xdebug

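For debugging PHP, Xdebug is the obvious choice; it works well with both PhpStorm and VS Code.

Installation

Installation is simple: the Xdebug project provides an official wizard.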

https://xdebug.org/wizard

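These days you will almost always be installing version 3.x.

The wizard helps you set up the Xdebug settings in PHP's ini file. For most people the following configuration is sufficient: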

zend_extension = xdebug
xdebug.mode = debug,develop,profile
xdebug.start_with_request = trigger

In Xdebug 3.0 the default port is 9003 (Xdebug 2 used 9000).