Processing Large XML Files in PHP

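The simplest way to parse an XML file is simplexml_load_file, which converts the XML into an object. The problem with simplexml_load_file is that it parses the entire file into memory, which is far from ideal when processing large XML documents.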

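XMLReader provides a memory-efficient way to read XML files. XMLReader is a streaming pull parser: it is very low level and fetches the next piece of the document only when told to. This makes XMLReader very memory efficient, but not very programmer friendly. Fortunately, XMLReader and SimpleXML can be used together.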
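To make the "pull" idea concrete, here is a minimal sketch of the low-level XMLReader loop on its own. The tiny in-memory feed and the variable names are made up for illustration; only the element names mirror the feed used later in this article.

```php
<?php
// Hypothetical miniature feed, just large enough to show the cursor moving.
$xmlString = '<datafeed><prod id="1"><name>Tyre</name></prod>'
           . '<prod id="2"><name>Chain Tugs</name></prod></datafeed>';

$reader = new XMLReader();
$reader->XML($xmlString);

$ids = [];
// Each read() call advances the cursor by exactly one node; nothing is
// fetched from the document until we ask for it.
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'prod') {
        $ids[] = $reader->getAttribute('id');
    }
}
$reader->close();

print implode(',', $ids) . "\n";
```

Every node type (elements, text, whitespace, end tags) passes through the loop, which is why working with XMLReader alone quickly becomes tedious.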

Test
Large XML file: feed_big.xml.gz. It contains about 40,000 nodes and is 109MB uncompressed on disk. The XML is very simple and consists of many <prod>…</prod> nodes.

The XML looks like this:

<cafProductFeed>
<datafeed id="xxxx" merchantId="xxxx"
merchantName="xxxxxxxxxxxxxxxxxxxxx">
<prod id="750924782" in_stock="no" is_for_sale="yes" lang="en"
pre_order="no" stock_quantity="0" web_offer="no">
<brand>
<brandName>Maxxis</brandName>
</brand>
<cat>
<awCatId>252</awCatId>
<awCat>Cycling</awCat>
<mCat>Wheels &amp; Tyres > Tyres</mCat>
</cat>
<price curr="GBP">
<buynow>43.99</buynow>
<delivery>0.00</delivery>
<rrp>53.99</rrp>
<store>0.00</store>
</price>
<text>
<name>Maxxis Crossmark Tyre - LUST</name>
<desc>Maxxis Crossmark Tyre - LUSTDesigned with World
Champion Christoph Sauser, the CrossMark is the dramatic
evolution of the Cross Country racing tire. The nearly
continuous center ridge flies on hardpack, yet has enough
spacing to grab wet roots and rocks The slightly raised
ridge of side knobs offers cornering precision never before
seen on a tire this fast Features:LUST TechnologyFast
rolling center ridgeRaised side knobs for better
corneringSize: 26" x 2.1"TPI: 120Max PSI: 60Durometer:
70aBuy Maxxis Tyres from xxxxx, the World’s Largest Online
Bike Store.</desc>
</text>
<uri>
<awTrack>
http://www.awin1.com/pclick.php?p=750924782&amp;a=181769&amp;m=2698</awTrack>
<awImage>
http://images2.productserve.com/?w=200&amp;h=200&amp;bg=white&amp;trim=5&amp;t=letterbox&amp;url=media.xxxxxxxxxxxx.com%2Fis%2Fimage%2Fxxxxxxxxxxxx%2Fprod17336_Black_NE_01%3F%24productfeedlarge%24&amp;feedId=2698&amp;k=0a98fd83cd569b80406b92333bd3ad46e49ccb50</awImage>
<awThumb>
http://images2.productserve.com/?w=70&amp;h=70&amp;bg=white&amp;trim=5&amp;t=letterbox&amp;url=media.xxxxxxxxxxxx.com%2Fis%2Fimage%2Fxxxxxxxxxxxx%2Fprod17336_Black_NE_01%3F%24productfeedlarge%24&amp;feedId=2698&amp;k=0a98fd83cd569b80406b92333bd3ad46e49ccb50</awThumb>
<mImage>
http://media.xxxxxxxxxxxx.com/is/image/xxxxxxxxxxxx/prod17336_Black_NE_01?$productfeedlarge$</mImage>
<mLink>
http://click.pump.to/fm-d0151/NY49D4IwEED~SnODUwsiKtLBxcn4sahxYWlKDY1Am7YGiPG~e2C8pcnL6717w8vVwKEKwfIiLuKu6yJZCd06JWTQppWDrJWPpGmKuBF9rz2TznjfCPdkYXCK1S8fithZZp0pkyxN10DhATxZJRQ08GydUbDAFxsKEjGFFvgSkduZUmE8meOktwOM7CyakZ2mFNn9U-SKKcLI8Xa5Tp6WqC3TKM-nTSLgp3ulVO3JbJI92f7e8RqLwr5EBT5f</mLink>
</uri>
<vertical />
<pId>100003UK</pId>
<colour>Black</colour>
<delTime>UK Free Standard Delivery - 3-4 working
days</delTime>
<lastUpdated>2017-09-18 20:16:31</lastUpdated>
<mpn>TB72545000</mpn>
</prod>
<prod id="750924792" in_stock="yes" is_for_sale="yes" lang="en"
pre_order="no" stock_quantity="16" web_offer="no">
<brand>
<brandName>DMR</brandName>
</brand>
<cat>
<awCatId>252</awCatId>
<awCat>Cycling</awCat>
<mCat>Components > Derailleurs</mCat>
</cat>
<price curr="GBP">
<buynow>12.49</buynow>
<delivery>0.00</delivery>
<rrp>17.99</rrp>
<store>0.00</store>
</price>
<text>
<name>DMR Chain Tugs</name>
<desc>DMR Chain Tugs Available for single speed rear wheels
– BMX or MTB- Invaluable asset for anyone who stretches
chains or knocks their rear wheel out of alignment - CNC
machined to fit 10mm axles - Made in the UKBuy DMR Frames
&amp; Forks from xxxxx, the World's Largest Online Bike
Store.</desc>
</text>
<uri>
<awTrack>
http://www.awin1.com/pclick.php?p=750924792&amp;a=181769&amp;m=2698</awTrack>
<awImage>
http://images2.productserve.com/?w=200&amp;h=200&amp;bg=white&amp;trim=5&amp;t=letterbox&amp;url=media.xxxxxxxxxxxx.com%2Fis%2Fimage%2Fxxxxxxxxxxxx%2Fprod216_Black_NE_01%3F%24productfeedlarge%24&amp;feedId=2698&amp;k=2566b3a761af626a65afe7524356054963064fc8</awImage>
<awThumb>
http://images2.productserve.com/?w=70&amp;h=70&amp;bg=white&amp;trim=5&amp;t=letterbox&amp;url=media.xxxxxxxxxxxx.com%2Fis%2Fimage%2Fxxxxxxxxxxxx%2Fprod216_Black_NE_01%3F%24productfeedlarge%24&amp;feedId=2698&amp;k=2566b3a761af626a65afe7524356054963064fc8</awThumb>
<mImage>
http://media.xxxxxxxxxxxx.com/is/image/xxxxxxxxxxxx/prod216_Black_NE_01?$productfeedlarge$</mImage>
<mLink>
http://click.pump.to/fm-d0151/HY09D4IwFEX~SvPmApYgagcXWIzRwejWpSlFmtCPlBJijP~dB28899z7vjDHETgMKQUuClEsy5KrQRoXtVTJeKc-atRTrrwVRWdjtoVZmt-TKGLIQvRdyWqg0ANne0bBAD~UBwoBeHmkoBBTcMArRLHxncZ3bIf3usKK7tKuqL09SLNukydub4lRGLAyr05bVSbUGm-Dd9qliZxJq6M046jnuBb6gMqlQwl-fw__</mLink>
</uri>
<vertical />
<pId>10000UK</pId>
<colour>Black</colour>
<delTime>UK Free Standard Delivery - 3-4 working
days</delTime>
<lastUpdated>2017-09-18 20:17:47</lastUpdated>
<mpn>DMR-CT-K</mpn>
</prod>

...

Approach 1: simplexml_load_file

test01.php

<?php

if(empty($argv[1]))
{
    die("Please specify xml file to parse.\n");
}

$countIx = 0;

// compress.zlib:// decompresses the gzip file transparently while reading.
// simplexml_load_file still builds the whole document tree in memory.
$xml = simplexml_load_file('compress.zlib://'.$argv[1]);

if($xml === false)
{
    die('Unable to load and parse the xml file: ' . (error_get_last()['message'] ?? 'unknown error') . "\n");
}

foreach($xml->datafeed->prod as $element)
{
    $prod = array(
        'name' => strval($element->text->name),
        'price' => strval($element->price->buynow),
        'currency' => strval($element->price->attributes()->curr)
    );

    print_r($prod);
    echo "\n";
    $countIx++;
}

print "Number of items=$countIx\n";
print "memory_get_usage() =" . memory_get_usage()/1024 . "kb\n";
print "memory_get_usage(true) =" . memory_get_usage(true)/1024 . "kb\n";
print "memory_get_peak_usage() =" . memory_get_peak_usage()/1024 . "kb\n";
print "memory_get_peak_usage(true) =" . memory_get_peak_usage(true)/1024 . "kb\n";

print "custom memory_get_process_usage() =" . memory_get_process_usage() . "kb\n";


/**
 * Returns memory usage from /proc/<PID>/status in kB (Linux only).
 *
 * @return int|bool sum of VmRSS and VmSwap in kB. On error returns false.
 */
function memory_get_process_usage()
{
    $status = file_get_contents('/proc/' . getmypid() . '/status');

    // Capture the numeric value of the VmRSS and VmSwap lines, in that order.
    $matchArr = array();
    preg_match_all('~^(VmRSS|VmSwap):\s*([0-9]+).*$~im', $status, $matchArr);

    if(!isset($matchArr[2][0]) || !isset($matchArr[2][1]))
    {
        return false;
    }

    return intval($matchArr[2][0]) + intval($matchArr[2][1]);
}

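Approach 2: XMLReader and SimpleXMLElement

The right way to process a large XML file is to combine XMLReader with SimpleXMLElement, which makes the low-level reader a little friendlier to work with.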

test02.php

<?php

if(empty($argv[1]))
{
    die("Please specify xml file to parse.\n");
}

$countIx = 0;

$xml = new XMLReader();
if(!$xml->open('compress.zlib://'.$argv[1]))
{
    die("Unable to open the xml file.\n");
}

// Skip ahead to the first <prod> element (or to the end of the file).
while($xml->read() && $xml->name != 'prod')
{
    ;
}

// Parse one <prod> subtree at a time.
while($xml->name == 'prod')
{
    $element = new SimpleXMLElement($xml->readOuterXML());

    $prod = array(
        'name' => strval($element->text->name),
        'price' => strval($element->price->buynow),
        'currency' => strval($element->price->attributes()->curr)
    );

    print_r($prod);
    print "\n";
    $countIx++;

    $xml->next('prod');
    unset($element);
}

print "Number of items=$countIx\n";
print "memory_get_usage() =" . memory_get_usage()/1024 . "kb\n";
print "memory_get_usage(true) =" . memory_get_usage(true)/1024 . "kb\n";
print "memory_get_peak_usage() =" . memory_get_peak_usage()/1024 . "kb\n";
print "memory_get_peak_usage(true) =" . memory_get_peak_usage(true)/1024 . "kb\n";

print "custom memory_get_process_usage() =" . memory_get_process_usage() . "kb\n";


$xml->close();

/**
 * Returns memory usage from /proc/<PID>/status in kB (Linux only).
 *
 * @return int|bool sum of VmRSS and VmSwap in kB. On error returns false.
 */
function memory_get_process_usage()
{
    $status = file_get_contents('/proc/' . getmypid() . '/status');

    // Capture the numeric value of the VmRSS and VmSwap lines, in that order.
    $matchArr = array();
    preg_match_all('~^(VmRSS|VmSwap):\s*([0-9]+).*$~im', $status, $matchArr);

    if(!isset($matchArr[2][0]) || !isset($matchArr[2][1]))
    {
        return false;
    }

    return intval($matchArr[2][0]) + intval($matchArr[2][1]);
}

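Open the XML document. Because the file is gzip-compressed, it is opened through the compress.zlib:// stream wrapper: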

$xml->open('compress.zlib://'.$argv[1]);
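As a quick way to try the compress.zlib:// wrapper without the real feed, the sketch below writes a tiny gzip file and reads it back with XMLReader. The temporary file name and the one-product document are assumptions made for this demonstration, and the zlib extension must be enabled.

```php
<?php
// Hypothetical temporary file, used only for this demonstration.
$path = sys_get_temp_dir() . '/feed_demo_' . getmypid() . '.xml.gz';
file_put_contents($path, gzencode('<datafeed><prod id="1"/></datafeed>'));

$reader = new XMLReader();
if (!$reader->open('compress.zlib://' . $path)) {
    die("Unable to open $path\n");
}

// The wrapper decompresses on the fly, so the reader sees plain XML.
$found = false;
while ($reader->read()) {
    if ($reader->name === 'prod') {
        $found = true;
        break;
    }
}

$reader->close();
unlink($path);

print $found ? "found a prod node\n" : "no prod node\n";
```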

Skip every node until the first product is reached:

while($xml->read() && $xml->name != 'prod'){;}

When this loop finishes, XMLReader has either reached the first product or hit the end of the file. If it reached the first product, the cursor is positioned on that first prod node, and we enter the loop below.

while($xml->name == 'prod')
{
    $element = new SimpleXMLElement($xml->readOuterXML());
    ...
    $xml->next('prod');
    unset($element);
}

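XMLReader::readOuterXML() returns the current node, including its own markup and all of its children, as a string, so only one product node is parsed into a SimpleXMLElement at a time. When we are finished with the element, unset() drops the last reference to it so that PHP can free its memory.

XMLReader::next('prod') skips the rest of the current node's subtree and moves the cursor to the next prod node. When no further prod node exists, the cursor runs off the end of the document, $xml->name is no longer 'prod', and the loop ends.

Finally, close the input that XMLReader is parsing: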

$xml->close();
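The steps above can also be wrapped in a generator, so that calling code simply iterates over SimpleXMLElement objects. This is only a sketch: streamElements() is a hypothetical helper name, not part of PHP, and the temporary file and its contents are made up for the demonstration.

```php
<?php
/**
 * Yields each element named $name from $uri as a SimpleXMLElement.
 * streamElements() is a hypothetical helper, not a built-in function.
 */
function streamElements(string $uri, string $name): Generator
{
    $reader = new XMLReader();
    if (!$reader->open($uri)) {
        throw new RuntimeException("Unable to open $uri");
    }
    try {
        // Advance to the first matching element (or to the end of input).
        while ($reader->read() && $reader->name !== $name) {
        }
        while ($reader->name === $name) {
            // Only this one subtree is turned into a SimpleXMLElement.
            yield new SimpleXMLElement($reader->readOuterXml());
            // Skip the subtree just handled and move to the next match.
            $reader->next($name);
        }
    } finally {
        $reader->close();
    }
}

// Demonstration on a small, made-up file.
$path = sys_get_temp_dir() . '/stream_elements_demo_' . getmypid() . '.xml';
file_put_contents(
    $path,
    '<datafeed><prod id="1"><name>Tyre</name></prod>'
    . '<prod id="2"><name>Chain Tugs</name></prod></datafeed>'
);

$names = [];
foreach (streamElements($path, 'prod') as $prod) {
    $names[] = (string) $prod->name;
}
unlink($path);

print implode(', ', $names) . "\n";
```

For the gzip feed used in this article, the same helper could be called with 'compress.zlib://' prepended to the file name. The try/finally block closes the reader even if the caller stops iterating early.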
