Simple HTML DOM Parser - あられねこのめも

Simple HTML DOM Parser is an HTML DOM parser for PHP.
It gives jQuery-style access to elements and is easy to use.

Official site

http://simplehtmldom.sourceforge.net/

Download

http://sourceforge.net/projects/simplehtmldom/files/simplehtmldom/1.11/simplehtmldom_1_11.zip/download

Usage

Getting elements

The source HTML can be loaded from a URL, a file, or a string.

require_once("simple_html_dom.php");

// Build a DOM from a URL
$html = file_get_html('http://www.google.com/');

// img tags: print each src
foreach($html->find('img') as $element) {
    echo $element->src . '<br>';
}

// a tags: print each href
foreach($html->find('a') as $element) {
    echo $element->href . '<br>';
}
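
The snippet above covers only the URL case. As a minimal sketch of the other two sources, the following loads the same kind of DOM from a string and from a local file. It assumes simple_html_dom.php (version 1.11, as linked above) is on the include path; the sample markup and the temporary file are made up for illustration.

```php
<?php
require_once 'simple_html_dom.php';

// From a string (sample markup is an assumption for this sketch)
$fromString = str_get_html('<p>Hello <a href="/next">next</a></p>');
echo $fromString->find('a', 0)->href . "\n";

// From a local file: write a temporary file first so the sketch runs on its own
$path = tempnam(sys_get_temp_dir(), 'shd');
file_put_contents($path, '<ul><li>one</li><li>two</li></ul>');
$fromFile = file_get_html($path);   // same function as for URLs
echo count($fromFile->find('li')) . "\n";
unlink($path);

// Release both DOMs once they are no longer needed
$fromString->clear();
$fromFile->clear();
```

In this version file_get_html() passes its arguments on to file_get_contents(), which is why a local path should work as well as a URL.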

find(): jQuery-style selectors

// Find all anchors, returns an array of element objects
$ret = $html->find('a');

// Find (N)th anchor, returns element object or null if not found (zero based)
$ret = $html->find('a', 0);

// Find all <div> whose id attribute is foo
$ret = $html->find('div[id=foo]');

// Find all <div> with the id attribute
$ret = $html->find('div[id]');

// Find all elements that have an id attribute
$ret = $html->find('[id]');
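
Besides attribute selectors, the library's manual also lists id, class, descendant, and comma-separated selectors. The following is a small sketch of those forms against made-up markup, again assuming simple_html_dom.php is on the include path.

```php
<?php
require_once 'simple_html_dom.php';

// Sample markup, invented for this sketch
$html = str_get_html(
    '<div id="menu"><ul><li class="on"><a href="/a">A</a></li>'
    . '<li><a href="/b">B</a></li></ul></div><img src="logo.png">'
);

$menu  = $html->find('#menu', 0);   // by id, first match only
$on    = $html->find('.on');        // by class
$links = $html->find('ul li a');    // descendant selector
$mixed = $html->find('a, img');     // several selectors at once

// Print each link found through the descendant selector
foreach ($links as $a) {
    echo $a->href . ' => ' . $a->plaintext . "\n";
}

$html->clear();
```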

Modifying elements

// Build from a string
$html = str_get_html('<div id="hoge">HOGE</div><div id="fuga">FUGA</div>');
// Add class="bar" to the div at index 1
$html->find('div', 1)->class = 'bar';
// Set the text of the div at index 0 to foo
// (select by index: no div has id=foo, so 'div[id=foo]' would return null)
$html->find('div', 0)->innertext = 'foo';

$str = $html->save();
// <div id="hoge">foo</div><div id="fuga" class="bar">FUGA</div>
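
Because find() with an index returns null when nothing matches, assigning to a property on the result fails if the selector is wrong. One way to guard against that is sketched below; set_text_if_found() is a hypothetical helper written for this example, not part of the library.

```php
<?php
require_once 'simple_html_dom.php';

// Hypothetical helper: set the inner text of the first match, if there is one
function set_text_if_found($dom, $selector, $text)
{
    $node = $dom->find($selector, 0);
    if ($node === null) {
        return false;               // nothing matched, leave the DOM untouched
    }
    $node->innertext = $text;
    return true;
}

$html = str_get_html('<div id="hoge">HOGE</div><div id="fuga">FUGA</div>');

$missed  = set_text_if_found($html, 'div[id=foo]', 'foo');   // no div has id=foo
$changed = set_text_if_found($html, 'div[id=hoge]', 'foo');  // matches the first div

echo $html->save() . "\n";
$html->clear();
```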

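Memory leaks

The DOM objects it builds contain circular references, so reference counting alone does not free them. Release them explicitly with clear() when you are done.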

// Build from an HTML string
$html = str_get_html($contents);

// jQuery-style selector; here it collects the a tags
$cts = $html->find('div[class=productTitle] a');

$req_result = array();
foreach($cts as $c) {
    $req_value['url'] = $c->attr['href'];

    $req_result[] = $req_value;
}

$html->clear();	// circular references are not freed automatically, so call clear()

var_dump($req_result);
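
When many pages are parsed in one script, the clear() call matters most inside the loop. The sketch below wraps the steps above in a function so each DOM is released before the next one is built. extract_links() and the sample pages are hypothetical; the false check is only there in case a library version returns false for empty input.

```php
<?php
require_once 'simple_html_dom.php';

// Hypothetical helper: collect the hrefs and release the DOM before returning
function extract_links($contents)
{
    $html = str_get_html($contents);
    if (!$html) {
        return array();             // nothing to parse
    }

    $urls = array();
    foreach ($html->find('div[class=productTitle] a') as $a) {
        $urls[] = $a->href;         // copy the plain string out of the DOM
    }

    $html->clear();                 // break the circular references
    unset($html);
    return $urls;
}

// Sample pages, invented for this sketch
$pages = array(
    '<div class="productTitle"><a href="/p/1">One</a></div>',
    '<div class="productTitle"><a href="/p/2">Two</a></div>',
);

$all = array();
foreach ($pages as $contents) {
    $all = array_merge($all, extract_links($contents));
}
var_dump($all);
```

Only plain strings leave the function, so nothing keeps a reference to the DOM nodes after clear() runs.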