PHP处理大型XML文件的几种方式比较

XMLReader only

Pros: fast, uses little memory

Cons: excessively hard to write and debug, requires lots of userland code to do anything useful. Userland code is slow and prone to error. Plus, it leaves you with more lines of code to maintain

XMLReader + SimpleXML

Pros: doesn’t use much memory (only the memory needed to process one node) and SimpleXML is, as the name implies, really easy to use.

Cons: creating a SimpleXMLElement object for each node is not very fast. You really have to benchmark it to understand whether it’s a problem for you. Even a modest machine would be able to process a thousand nodes per second, though.

XMLReader + DOM

Pros: uses about as much memory as SimpleXML, and XMLReader::expand() is faster than creating a new SimpleXMLElement. I wish it was possible to use simplexml_import_dom() but it doesn’t seem to work in that case

Cons: DOM is annoying to work with. It’s halfway between XMLReader and SimpleXML. Not as complicated and awkward as XMLReader, but light years away from working with SimpleXML.

My advice: write a prototype with SimpleXML, see if it works for you. If performance is paramount, try DOM. Stay as far away from XMLReader as possible. Remember that the more code you write, the higher the possibility of you introducing bugs or introducing performance regressions.

PHP处理大的XML文件

最简单的解析XML文件的方法是使用simplexml_load_file,它会将XML转换为对象。simplexml_load_file的问题在于它会将整个文件解析到内存中,当处理大型XML文档时,这并不理想。

XMLReader提供了一种以内存高效的方式读取XML文件的方法。XMLReader是一种stream拉取XML解析器——这意味着它是非常底层的,只有在告诉它这样做时,它才会获取文档的下一个片段。这使得XMLReader非常内存高效,但是对程序员不太友好。幸运的是,XMLReader和SimpleXML可以结合使用。

测试
大型XML文件:feed_big.xml.gz。约有40000个节点,磁盘上未压缩的大小为109MB。这个XML非常简单,有很多<prod>…</prod>节点。

代码如下:

<cafProductFeed>
<datafeed id="xxxx" merchantId="xxxx"
merchantName="xxxxxxxxxxxxxxxxxxxxx">
<prod id="750924782" in_stock="no" is_for_sale="yes" lang="en"
pre_order="no" stock_quantity="0" web_offer="no">
<brand>
<brandName>Maxxis</brandName>
</brand>
<cat>
<awCatId>252</awCatId>
<awCat>Cycling</awCat>
<mCat>Wheels &amp; Tyres > Tyres</mCat>
</cat>
<price curr="GBP">
<buynow>43.99</buynow>
<delivery>0.00</delivery>
<rrp>53.99</rrp>
<store>0.00</store>
</price>
<text>
<name>Maxxis Crossmark Tyre - LUST</name>
<desc>Maxxis Crossmark Tyre - LUSTDesigned with World
Champion Christoph Sauser, the CrossMark is the dramatic
evolution of the Cross Country racing tire. The nearly
continuous center ridge flies on hardpack, yet has enough
spacing to grab wet roots and rocks The slightly raised
ridge of side knobs offers cornering precision never before
seen on a tire this fast Features:LUST TechnologyFast
rolling center ridgeRaised side knobs for better
corneringSize: 26" x 2.1"TPI: 120Max PSI: 60Durometer:
70aBuy Maxxis Tyres from xxxxx, the World’s Largest Online
Bike Store.</desc>
</text>
<uri>
<awTrack>
http://www.awin1.com/pclick.php?p=750924782&a=181769&m=2698</awTrack>
<awImage>
http://images2.productserve.com/?w=200&h=200&bg=white&trim=5&t=letterbox&url=media.xxxxxxxxxxxx.com%2Fis%2Fimage%2Fxxxxxxxxxxxx%2Fprod17336_Black_NE_01%3F%24productfeedlarge%24&feedId=2698&k=0a98fd83cd569b80406b92333bd3ad46e49ccb50</awImage>
<awThumb>
http://images2.productserve.com/?w=70&h=70&bg=white&trim=5&t=letterbox&url=media.xxxxxxxxxxxx.com%2Fis%2Fimage%2Fxxxxxxxxxxxx%2Fprod17336_Black_NE_01%3F%24productfeedlarge%24&feedId=2698&k=0a98fd83cd569b80406b92333bd3ad46e49ccb50</awThumb>
<mImage>
http://media.xxxxxxxxxxxx.com/is/image/xxxxxxxxxxxx/prod17336_Black_NE_01?$productfeedlarge$</mImage>
<mLink>
http://click.pump.to/fm-d0151/NY49D4IwEED~SnODUwsiKtLBxcn4sahxYWlKDY1Am7YGiPG~e2C8pcnL6717w8vVwKEKwfIiLuKu6yJZCd06JWTQppWDrJWPpGmKuBF9rz2TznjfCPdkYXCK1S8fithZZp0pkyxN10DhATxZJRQ08GydUbDAFxsKEjGFFvgSkduZUmE8meOktwOM7CyakZ2mFNn9U-SKKcLI8Xa5Tp6WqC3TKM-nTSLgp3ulVO3JbJI92f7e8RqLwr5EBT5f</mLink>
</uri>
<vertical />
<pId>100003UK</pId>
<colour>Black</colour>
<delTime>UK Free Standard Delivery - 3-4 working
days</delTime>
<lastUpdated>2017-09-18 20:16:31</lastUpdated>
<mpn>TB72545000</mpn>
</prod>
<prod id="750924792" in_stock="yes" is_for_sale="yes" lang="en"
pre_order="no" stock_quantity="16" web_offer="no">
<brand>
<brandName>DMR</brandName>
</brand>
<cat>
<awCatId>252</awCatId>
<awCat>Cycling</awCat>
<mCat>Components > Derailleurs</mCat>
</cat>
<price curr="GBP">
<buynow>12.49</buynow>
<delivery>0.00</delivery>
<rrp>17.99</rrp>
<store>0.00</store>
</price>
<text>
<name>DMR Chain Tugs</name>
<desc>DMR Chain Tugs Available for single speed rear wheels
– BMX or MTB- Invaluable asset for anyone who stretches
chains or knocks their rear wheel out of alignment - CNC
machined to fit 10mm axles - Made in the UKBuy DMR Frames
&amp; Forks from xxxxx, the World's Largest Online Bike
Store.</desc>
</text>
<uri>
<awTrack>
http://www.awin1.com/pclick.php?p=750924792&a=181769&m=2698</awTrack>
<awImage>
http://images2.productserve.com/?w=200&h=200&bg=white&trim=5&t=letterbox&url=media.xxxxxxxxxxxx.com%2Fis%2Fimage%2Fxxxxxxxxxxxx%2Fprod216_Black_NE_01%3F%24productfeedlarge%24&feedId=2698&k=2566b3a761af626a65afe7524356054963064fc8</awImage>
<awThumb>
http://images2.productserve.com/?w=70&h=70&bg=white&trim=5&t=letterbox&url=media.xxxxxxxxxxxx.com%2Fis%2Fimage%2Fxxxxxxxxxxxx%2Fprod216_Black_NE_01%3F%24productfeedlarge%24&feedId=2698&k=2566b3a761af626a65afe7524356054963064fc8</awThumb>
<mImage>
http://media.xxxxxxxxxxxx.com/is/image/xxxxxxxxxxxx/prod216_Black_NE_01?$productfeedlarge$</mImage>
<mLink>
http://click.pump.to/fm-d0151/HY09D4IwFEX~SvPmApYgagcXWIzRwejWpSlFmtCPlBJijP~dB28899z7vjDHETgMKQUuClEsy5KrQRoXtVTJeKc-atRTrrwVRWdjtoVZmt-TKGLIQvRdyWqg0ANne0bBAD~UBwoBeHmkoBBTcMArRLHxncZ3bIf3usKK7tKuqL09SLNukydub4lRGLAyr05bVSbUGm-Dd9qliZxJq6M046jnuBb6gMqlQwl-fw__</mLink>
</uri>
<vertical />
<pId>10000UK</pId>
<colour>Black</colour>
<delTime>UK Free Standard Delivery - 3-4 working
days</delTime>
<lastUpdated>2017-09-18 20:17:47</lastUpdated>
<mpn>DMR-CT-K</mpn>
</prod>

...

处理方式1:simple_load_file

test01.php

<?php

if(empty($argv[1]))
{
die("Please specify xml file to parse.\n");
}

$countIx = 0;

$xml = simplexml_load_file('compress.zlib://'.$argv[1]);

if($xml === false)
{
die('Unable to load and parse the xml file: ' . error_get_last()['message'] );
}

foreach($xml->datafeed->prod as $element)
{
$prod = array(
'name' => strval($element->text->name),
'price' => strval($element->price->buynow),
'currency' => strval($element->price->attributes()->curr)
);

print_r($prod);
echo "\n";
$countIx++;
}

print "Number of items=$countIx\n";
print "memory_get_usage() =" . memory_get_usage()/1024 . "kb\n";
print "memory_get_usage(true) =" . memory_get_usage(true)/1024 . "kb\n";
print "memory_get_peak_usage() =" . memory_get_peak_usage()/1024 . "kb\n";
print "memory_get_peak_usage(true) =" . memory_get_peak_usage(true)/1024 . "kb\n";

print "custom memory_get_process_usage() =" . memory_get_process_usage() . "kb\n";


/**
* Returns memory usage from /proc<PID>/status in bytes.
*
* @return int|bool sum of VmRSS and VmSwap in bytes. On error returns false.
*/
function memory_get_process_usage()
{
$status = file_get_contents('/proc/' . getmypid() . '/status');

$matchArr = array();
preg_match_all('~^(VmRSS|VmSwap):\s*([0-9]+).*$~im', $status, $matchArr);

if(!isset($matchArr[2][0]) || !isset($matchArr[2][1]))
{
return false;
}

return intval($matchArr[2][0]) + intval($matchArr[2][1]);
}

处理方式2:XMLReader 和 SimpleXMLElement

处理大型XML文件的正确方式是使用XMLReader和SimpleXMLElement的组合,这样对程序员更友好一点。

test02.php

<?php

if(empty($argv[1]))
{
die("Please specify xml file to parse.\n");
}

$countIx = 0;

$xml = new XMLReader();
$xml->open('compress.zlib://'.$argv[1]);

while($xml->read() && $xml->name != 'prod')
{
;
}

while($xml->name == 'prod')
{
$element = new SimpleXMLElement($xml->readOuterXML());

$prod = array(
'name' => strval($element->text->name),
'price' => strval($element->price->buynow),
'currency' => strval($element->price->attributes()->curr)
);

print_r($prod);
print "\n";
$countIx++;

$xml->next('prod');
unset($element);
}

print "Number of items=$countIx\n";
print "memory_get_usage() =" . memory_get_usage()/1024 . "kb\n";
print "memory_get_usage(true) =" . memory_get_usage(true)/1024 . "kb\n";
print "memory_get_peak_usage() =" . memory_get_peak_usage()/1024 . "kb\n";
print "memory_get_peak_usage(true) =" . memory_get_peak_usage(true)/1024 . "kb\n";

print "custom memory_get_process_usage() =" . memory_get_process_usage() . "kb\n";


$xml->close();

/**
* Returns memory usage from /proc<PID>/status in bytes.
*
* @return int|bool sum of VmRSS and VmSwap in bytes. On error returns false.
*/
function memory_get_process_usage()
{
$status = file_get_contents('/proc/' . getmypid() . '/status');

$matchArr = array();
preg_match_all('~^(VmRSS|VmSwap):\s*([0-9]+).*$~im', $status, $matchArr);

if(!isset($matchArr[2][0]) || !isset($matchArr[2][1]))
{
return false;
}

return intval($matchArr[2][0]) + intval($matchArr[2][1]);
}

打开XML文档,因为文档是压缩的

$xml->open('compress.zlib://'.$argv[1]);

Skips all the nodes until the first product is reached:

while($xml->read() && $xml->name != 'prod'){;}

When the above while loop finishes – that means that XMLReader has either reached the first product, or the end of file is reached. In case the first product is reached document stream cursor will be at the first product node in the XML document, and we will enter the while loop below.

while($xml->name == 'prod')
{
$element = new SimpleXMLElement($xml->readOuterXML());
...
$xml->next('prod');
unset($element);
}

The XMLReader::readOuterXML() returns the contents of the current node as a string, only one node at the time will be parsed. When we are finished with this node, it is destroyed with unset so that PHP garbage collection can free it.

XMLReader::next() will jump to the next product node.

And at the end close the input which XMLReader is parsing:

$xml->close();

PHP xdebug

PHP的debug 工具肯定是要选xdebug了,xdebug可以完美配合phpstorm 和vscode

安装

很简单,xdebug官方提供wizard

https://xdebug.org/wizard

现在基本都是安装3.*

Wizard 会帮你配置好xdebug在php中的ini文件,对于大多数人来讲可以使用下面的配置文件

zend_extension = xdebug
xdebug.mode = debug,develop,profile
xdebug.start_with_request = trigger

在xdebug 3.0中,port默认是9003

 

股息收税的问题

大陆A股、B股分红不扣税(只考虑长期持股)。

港股通买港股,分红扣税20%。

境外券商账户买注册在内地的港股公司,分红扣税10%。

境外券商账户买注册在香港、开曼、维京、百慕大的港股公司,分红不扣税。

境外券商账户买美股填了w8表,分红扣税10%。

香港银行账户买美股即使填了w8表,基本上分红扣税也是30%。

the row size is xx which is greater than maximum allowed size (8126 bytes) for a record on index leaf page

一个数据库DDL在MariaDB 10.3上没有任何问题,在MariaDB 10.6上就一般报这个问题:

[ERROR] InnoDB: Cannot add field `footer_color` in table `xxxx`.`theme` because after adding it, the row size is 8474 which is greater than maximum allowed size (8126 bytes) for a record on index leaf page.

这个Google一下很容易就知道是Row Size Too Large Error. 根据下面的文档很容易解决办法.

https://mariadb.com/kb/en/troubleshooting-row-size-too-large-errors-with-innodb/

但是按理说MariaDB 10.3 和 MariaDB 10.6上在这方面变动不大,为什么会出现问题呢?

搜索了半天,在MariaDB的文档上找到了说明:

https://mariadb.com/kb/en/innodb-row-formats-overview/#maximum-row-size

原来在MariaDB 10.2.26, 10.3.17, 10.4.7之前的版本上,InnoDB计算row size的方式是错误的,在后面的版本里修复了。

而我的MariaDB 10.3的版本正好是10.3.15.

困扰了一个上午的问题解决了!

另附解决办法,如果不想修改表的结构,可以在my.cnf添加下面的命令:

innodb_strict_mode = OFF

JavaScript setInterval 和 ‘this’ 的解决方案

By default, setInterval中的this 是被绑定到global context, 如果你想绑定到current context, 有如下几种解决方案:

(本文来自于stackoverflow: https://stackoverflow.com/questions/2749244/javascript-setinterval-and-this-solution)

Code:

prefs: null,
startup : function()
{
// init prefs
...
this.retrieve_rate();
this.intervalID = setInterval(this.retrieve_rate, this.INTERVAL);
}

如果我们想在setInterval里面access this,可以:

  1. 使用.bind
this.intervalID = setInterval(this.retrieve_rate.bind(this), this.INTERVAL);

2. 在函数外,把this 赋值给其他变量,常见是that 和 self

var self = this;
this.intervalID = setInterval(
function() { self.retrieve_rate(); },
this.INTERVAL);

3. 使用arrow function, 箭头函数

startup : function()
{
// init prefs
...
this.retrieve_rate();
this.intervalID = setInterval( () => this.retrieve_rate(), this.INTERVAL);
}

US 大型服务商wifi calling 所需的port和 domain(ePDG)

Wifi Calling 一般是通过IPSEC tunnels

Port 这个很简单,就是在UDP下需要打开如下端口:

UDP – port 500 (IPsec – IKE)
UDP – port 4500 (IPsec – NAT traversal)

MTU一般推荐设置为1500.

ePDG: Evolved Packet Data Gateway

ePDG其实就是一个FQDN, 很多服务商都发布了他们的ePDG. 国内一般用的比较多的是t-mobile 的wifi calling. 所以可以多关注一个t-mobile 的 ePDG

AT&T:

epdg.epc.att.net
sentitlement2.mobile.att.net
vvm.mobile.att.net
mmsc.mobile.att.net

Verizon:

wo.vzwwo.com
sg.vzwfemto.com

T-Mobile:

epdg.epc.mnc260.mcc310.pub.3gppnetwork.org
ss.epdg.epc.mnc260.mcc310.pub.3gppnetwork.org
ss.epdg.epc.geo.mnc260.mcc310.pub.3gppnetwork.org

在Android/iPhone上抓包

因为要debug一个app为什么无法启动,就研究了一下抓包.  目前主流的在android 和iphone上都可以抓包的是fiddler, 使用简单,不用root, 操作简便,UI简明. 但是缺点就是只能抓包http/https协议,无法抓包第四层tcp/udp协议. 在android上也可以抓包使用tcpdump. 此篇文章只简单记录一下fiddler.

总体流程分为这么几步: 在电脑上安装fiddler; 简单配置fiddler; 手机上安装证书; 手机设置proxy, 让所有http/https流量经过fiddler; 抓包. 

前提条件是电脑和手机在一个局域网内.

下面详细说一下:

  1. 下载fiddler. 只需要下载免费的fiddler classic就可以了. 下载地址如下
https://www.telerik.com/fiddler/fiddler-classic

2. 简单配置fiddler远程连接

Fiddler 如果抓取 https 协议会话需要进一步配置,在 Tools ->Options 菜单下,选择HTTPS标签并配置如下:

手机抓取需要配置远程连接,在 Tools ->Options 菜单下,选择Connections标签并配置如下:

Fiddler listens on port是手机连接fiddler时的代理端口号,默认8888即可;

Allow remote computers to connect是允许远程发送请求,需要勾上.

防火墙需要开放8888端口,注意如果你使用的是windows的话,这个防火墙已经自动配置好了.

3. 手机安装fiddler证书, 这里需要手机和电脑在同一个局域网之内. 下面以iphone为例说明,安卓的步骤差不多.

在电脑上通过cmd敲命令ipconfig查看本机的ipv4地址.

打开手机浏览器,输入 http://【fiddler电脑IP地址】:【fiddler设置的端口号】,例如 http://192.168.123.100:8888 可以下载证书并安装。

在打开的页面中,点击 FiddlerRoot certificate 下载证书,点击允许

在Settings系统设置中,点击 Profile Downloaded(已下载的配置文件):

点击install, 安装证书

不同系统手机的下载路径不一样,例如有的是: 设置->通用->关于本机->证书信任设置

4 手机配置代理

手机设置 -> WLAN -> 选择无线网络 -> HTTP Proxy,选择 Manual,Server 为 Fiddler 的电脑 ip 地址,端口号为 Fiddler 的端口号:

 

此时操作浏览器或APP,在 fiddler 中可以看到完成的请求和响应数据:

 

PHP双引号下的多维数组

今天在VSCode上用PHP处理JSON文件的时候遇到了,VSCode 给出了提示,但是没有注意到.

虽然用其他的方式绕过了,但是总是想知道原因. 在stackoverflow上发现了原因:

https://stackoverflow.com/questions/38054476/how-to-put-multidimensional-arrays-double-quoted-strings

PHP在双引号下输出一维数组,没有问题:

<?php echo "my name is: $arr[name]"; ?>

但是输出二维甚至多维数组的时候就出现了问题:

<?php echo "he is $twoDimArr[family][1]"; ?>

输出为:

he is Array[1]

在php官网的文档中发现了答案:

http://php.net/manual/en/language.types.string.php#language.types.string.parsing

一维数组属于simple variable,二维甚至多维数组属于complex variable,需要使用curly braces(大括号)包围起来

echo "he is {$twoDimArr['family'][1]}";