Parsing HTML content in PHP

Index

  • Quick Start
  • How to create HTML DOM object?
  • How to find HTML elements?
  • How to access the HTML element's attributes?
  • How to traverse the DOM tree?
  • How to dump contents of DOM object?
  • How to customize the parsing behavior?
  • API Reference
  • FAQ

Quick Start

  • Get HTML elements
  • Modify HTML elements
  • Extract contents from HTML
  • Scraping Slashdot!


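// Create a DOM object from a URL or file
$html = file_get_html('http://www.google.com/');

// Find all images and print their sources
foreach($html->find('img') as $element)
       echo $element->src . '<br>';

// Find all links and print their targets
foreach($html->find('a') as $element)
       echo $element->href . '<br>';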


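// Create a DOM object from a string
$html = str_get_html('<div id="hello">Hello</div><div id="world">World</div>');

// Set the class of the second div
$html->find('div', 1)->class = 'bar';

// Replace the inner HTML of the div whose id is "hello"
$html->find('div[id=hello]', 0)->innertext = 'foo';

// Output: <div id="hello">foo</div><div id="world" class="bar">World</div>
echo $html;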


// Dump the page contents without HTML tags
echo file_get_html('http://www.google.com/')->plaintext;


// Create a DOM object from a URL
$html = file_get_html('http://slashdot.org/');

// Collect the title, intro and details of every article
$articles = array();
foreach($html->find('div.article') as $article) {
    $item['title']   = $article->find('div.title', 0)->plaintext;
    $item['intro']   = $article->find('div.intro', 0)->plaintext;
    $item['details'] = $article->find('div.details', 0)->plaintext;
    $articles[] = $item;
}

print_r($articles);

How to create HTML DOM object?

  • Quick way
  • Object-oriented way


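// Create a DOM object from a string
$html = str_get_html('Hello!');

// Create a DOM object from a URL
$html = file_get_html('http://www.google.com/');

// Create a DOM object from an HTML file
$html = file_get_html('test.htm');

// Create an empty DOM object
$html = new simple_html_dom();

// Load HTML from a string
$html->load('Hello!');

// Load HTML from a URL
$html->load_file('http://www.google.com/');

// Load HTML from an HTML file
$html->load_file('test.htm');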

How to find HTML elements?

  • Basics
  • Advanced
  • Descendant selectors
  • Nested selectors
  • Attribute Filters
  • Text & Comments


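// Find all anchors; returns an array of element objects
$ret = $html->find('a');

// Find the first anchor (index is zero-based); returns an element object, or null if not found
$ret = $html->find('a', 0);

// Find the last anchor; returns an element object, or null if not found
$ret = $html->find('a', -1);

// Find all <div> elements that have an id attribute
$ret = $html->find('div[id]');

// Find all <div> elements whose id attribute is "foo"
$ret = $html->find('div[id=foo]');

// Find all elements whose id is "foo"
$ret = $html->find('#foo');

// Find all elements whose class is "foo"
$ret = $html->find('.foo');

// Find all elements that have an id attribute
$ret = $html->find('*[id]');

// Find all anchors and images
$ret = $html->find('a, img');

// Find all anchors and images that have a title attribute
$ret = $html->find('a[title], img[title]');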

The following operators are supported in attribute selectors:

Filter               Description
[attribute]          Matches elements that have the specified attribute.
[!attribute]         Matches elements that don't have the specified attribute.
[attribute=value]    Matches elements whose specified attribute has a certain value.
[attribute!=value]   Matches elements whose specified attribute does not have a certain value.
[attribute^=value]   Matches elements whose specified attribute starts with a certain value.
[attribute$=value]   Matches elements whose specified attribute ends with a certain value.
[attribute*=value]   Matches elements whose specified attribute contains a certain value.
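
To make the operator semantics in the table above concrete, here is a minimal sketch in plain PHP (8.0 or later). It is not the library's implementation: the function name attribute_matches, its parameters, and the sample attribute array are hypothetical. It also assumes case-sensitive comparison, and that every filter except [!attribute] requires the attribute to be present.

```php
<?php
// Hypothetical helper illustrating the attribute filter semantics.
// $attrs is an associative array (attribute name => value) for one element.
function attribute_matches(array $attrs, string $name, string $op = '', string $value = ''): bool
{
    $has = array_key_exists($name, $attrs);

    if ($op === '!') {
        return !$has;                 // [!attribute]
    }
    if (!$has) {
        return false;                 // assumption: all other filters need the attribute
    }

    $actual = (string) $attrs[$name];

    switch ($op) {
        case '':                      // [attribute]
            return true;
        case '=':                     // [attribute=value]
            return $actual === $value;
        case '!=':                    // [attribute!=value]
            return $actual !== $value;
        case '^=':                    // [attribute^=value]
            return str_starts_with($actual, $value);
        case '$=':                    // [attribute$=value]
            return str_ends_with($actual, $value);
        case '*=':                    // [attribute*=value]
            return str_contains($actual, $value);
        default:
            throw new InvalidArgumentException('Unknown operator: ' . $op);
    }
}

// Invented sample element: <img src="logo.png" title="Site logo">
$img = ['src' => 'logo.png', 'title' => 'Site logo'];

var_dump(attribute_matches($img, 'title'));               // bool(true)
var_dump(attribute_matches($img, 'alt', '!'));            // bool(true)
var_dump(attribute_matches($img, 'src', '$=', '.png'));   // bool(true)
var_dump(attribute_matches($img, 'src', '^=', 'http'));   // bool(false)
var_dump(attribute_matches($img, 'title', '*=', 'logo')); // bool(true)
```

In real use the selector string passed to find() does this filtering for you; the sketch only shows which elements each operator is meant to keep.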


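// Find all <li> elements inside <ul> elements
$es = $html->find('ul li');

// Find <div> elements nested three levels deep
$es = $html->find('div div div');

// Find all <td> elements inside <table> elements with class "hello"
$es = $html->find('table.hello td');

// Find all <td> elements with align="center" inside <table> elements
$es = $html->find('table td[align=center]');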


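// Find all <li> elements inside each <ul>
foreach($html->find('ul') as $ul)
{
       foreach($ul->find('li') as $li)
       {
            // do something with $li...
       }
}

// Find the first <li> in the first <ul>
$e = $html->find('ul', 0)->find('li', 0);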

How to access the HTML element's attributes?

  • Get, Set and Remove attributes
  • Magic attributes
  • Tips


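// Get an attribute
$value = $e->href;

// Set an attribute
$e->href = 'my link';

// Remove an attribute by setting its value to null
$e->href = null;

// Check whether an attribute exists
if(isset($e->href))
        echo 'href exists!';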
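
Property-style access like this maps naturally onto PHP's magic methods. The following is a hypothetical sketch of that idea, not the library's code: the class name Element and its private $attr array are invented for illustration, and returning null for a missing attribute is an assumption of the sketch.

```php
<?php
// Hypothetical class: shows how property-style attribute access can be
// built on PHP's __get, __set and __isset magic methods (PHP 7.4+).
final class Element
{
    /** @var array<string, string> attribute name => value */
    private array $attr = [];

    public function __get(string $name)
    {
        // Assumption of this sketch: a missing attribute reads as null.
        return $this->attr[$name] ?? null;
    }

    public function __set(string $name, $value): void
    {
        if ($value === null) {
            // Assigning null removes the attribute.
            unset($this->attr[$name]);
        } else {
            $this->attr[$name] = (string) $value;
        }
    }

    public function __isset(string $name): bool
    {
        return array_key_exists($name, $this->attr);
    }
}

$e = new Element();

$e->href = 'my link';
var_dump(isset($e->href));   // bool(true)

$e->href = null;
var_dump(isset($e->href));   // bool(false)
```

Routing every read and write through one array keeps "set to null" and "remove" as a single operation, which is the behaviour the snippets above rely on.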


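// Example
$html = str_get_html("<div>foo <b>bar</b></div>");
$e = $html->find("div", 0);

echo $e->tag;       // Returns: "div"
echo $e->outertext; // Returns: "<div>foo <b>bar</b></div>"
echo $e->innertext; // Returns: "foo <b>bar</b>"
echo $e->plaintext; // Returns: "foo bar"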

Attribute Name   Usage
$e->tag          Read or write the tag name of element.
$e->outertext    Read or write the outer HTML text of element.
$e->innertext    Read or write the inner HTML text of element.
$e->plaintext    Read or write the plain text of element.


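// Extract the contents of the HTML as plain text
echo $html->plaintext;

// Wrap an element
$e->outertext = '<div class="wrap">' . $e->outertext . '</div>';

// Remove an element by setting its outertext to an empty string
$e->outertext = '';

// Append an element after this one
$e->outertext = $e->outertext . '<div>foo</div>';

// Insert an element before this one
$e->outertext = '<div>foo</div>' . $e->outertext;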

How to traverse the DOM tree?

  • Background Knowledge
  • Traverse the DOM tree


// Example
echo $html->find("#div1", 0)->children(1)->children(1)->children(2)->id;

// or, equivalently
echo $html->getElementById("div1")->childNodes(1)->childNodes(1)->childNodes(2)->getAttribute('id');

You can also call these methods using camelCase naming conventions.

Method                                 Description
mixed $e->children([int $index])       Returns the Nth child object if index is set; otherwise returns an array of children.
element $e->parent()                   Returns the parent of the element.
element $e->first_child()              Returns the first child of the element, or null if not found.
element $e->last_child()               Returns the last child of the element, or null if not found.
element $e->next_sibling()             Returns the next sibling of the element, or null if not found.
element $e->prev_sibling()             Returns the previous sibling of the element, or null if not found.
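
The traversal methods listed above can be tried on a small document. This is a sketch under stated assumptions: it assumes the library file simple_html_dom.php sits in the same directory as the script, and the sample markup and its ids are invented for illustration.

```php
<?php
// Assumption: the library file sits next to this script.
require_once __DIR__ . '/simple_html_dom.php';

// Invented sample markup, written without whitespace between tags.
$html = str_get_html('<ul id="list"><li id="a">A</li><li id="b">B</li><li id="c">C</li></ul>');

$ul     = $html->find('#list', 0);
$middle = $ul->children(1);               // second child, index is zero-based

echo $ul->first_child()->id, "\n";        // a
echo $ul->last_child()->id, "\n";         // c
echo $middle->id, "\n";                   // b
echo $middle->prev_sibling()->id, "\n";   // a
echo $middle->next_sibling()->id, "\n";   // c
echo $middle->parent()->id, "\n";         // list
echo count($ul->children()), "\n";        // 3

// At the end of the sibling list the method returns null,
// so check the result before chaining further calls on it.
var_dump($ul->last_child()->next_sibling());   // NULL
```

Because the sibling and child methods return null when nothing is found, long chains such as children(1)->children(1) are only safe when the document structure is known in advance.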

How to dump contents of DOM object?
