
A Good HTML Parser In PHP

Monday, 05 May 2008 17:50  

First, let me explain what I want to do.

As you may have noticed, most blogging sites and content management systems (CMS) allow users to enter HTML content. Such systems pass the user's input string through one or more filters. I won't discuss security filtering here. What I want to show is how to repair mismatched HTML tags in user input.

Suppose a user submits the following input:

This is a begin of a test case. <p>Here is a paragraph. 
<p>This is a new paragraph. <br />
<a href="http://phparch.cn">phparch.cn <br /> Here is a new line. </p>
<div>A div <span class="aspan">It begins with ABC</span>
<ul><li>A list begins. <li>Another option</li> This maybe incorrect

What should we do with HTML like this? At the very least, we should make sure the tags are nested and matched correctly. Otherwise, the unbalanced markup can break our site. I have been through that nightmare once, and I don't want to go through it a second time.

I searched Google without much luck; the so-called solutions I found were not what I wanted. The solution I settled on relies on PHP's tidy extension:

<?php

// Assign the user input to $content
$content = 'This is a begin of a test case. <p>Here is a paragraph.
<p>This is a new paragraph. <br />
<a href="http://phparch.cn">phparch.cn <br /> Here is a new line. </p>
<div>A div <span class="aspan">It begins with ABC</span>
<ul><li>A list begins. <li>Another option</li> This maybe incorrect';

// Tidy extension configuration
$config = array(
    'indent'         => false, // do not indent the output
    'output-xhtml'   => true,  // emit well-formed XHTML
    'show-body-only' => true,  // return only the contents of <body>
);

// Parse the input as UTF-8; tidy closes and re-nests tags while parsing
$filtered = tidy_parse_string($content, $config, 'utf8');

// The tidy object converts to the repaired markup when used as a string
echo $filtered;

?>

This produces the following output:

This is a begin of a test case.
<p>Here is a paragraph.</p>
<p>This is a new paragraph.<br />
<a href="http://phparch.cn">phparch.cn<br />
Here is a new line.</a></p>
<div><a href="http://phparch.cn">A div <span class="aspan">It
begins with ABC</span></a>
<ul>
<li>A list begins.</li>
<li>Another option</li>
<li style="list-style: none">This maybe incorrect</li>
</ul>
</div>
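
The same steps can be wrapped in a small reusable function. The following is only a minimal sketch, not part of the original solution: the function name `repair_html_fragment` and the shape of its return value are hypothetical. It assumes the tidy extension is installed, and it also returns tidy's diagnostics (the `errorBuffer` property) so that you can log what was repaired.

```php
<?php

/**
 * Hypothetical helper: repairs an HTML fragment with tidy and returns
 * the repaired markup together with tidy's warnings and errors.
 */
function repair_html_fragment(string $html): array
{
    if (!extension_loaded('tidy')) {
        throw new RuntimeException('The tidy extension is not loaded.');
    }

    // Same options as in the example above
    $config = array(
        'indent'         => false,
        'output-xhtml'   => true,
        'show-body-only' => true,
    );

    $tidy = new tidy();
    $tidy->parseString($html, $config, 'utf8');
    $tidy->cleanRepair();

    return array(
        'html'        => tidy_get_output($tidy),
        'diagnostics' => (string) $tidy->errorBuffer,
    );
}

// Usage: an unclosed list, similar to the one in the test input
$result = repair_html_fragment('<ul><li>A list begins. <li>Another option</li>');

echo $result['html'], "\n";
```

In a real filter you would probably write `$result['diagnostics']` to a log rather than show it to the user.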

For more information about tidy_parse_string(), see the PHP manual: http://php.net/manual/en/function.tidy-parse-string.php

For the full list of tidy configuration options, have a look at this page:
http://tidy.sourceforge.net/docs/quickref.html
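
As one example of these options: in the output shown earlier, "It begins with ABC" is split across two lines because tidy wraps long lines by default. The sketch below assumes you would rather keep such text on one line, and sets the `wrap` option to 0 to turn wrapping off. The input string and the variable names are only for illustration.

```php
<?php

$config = array(
    'output-xhtml'   => true,
    'show-body-only' => true,
    'wrap'           => 0, // 0 disables line wrapping
);

// A well-formed line that is longer than tidy's default wrap margin
$html = '<div><a href="http://phparch.cn">A div <span class="aspan">It begins with ABC</span></a></div>';

$tidy = tidy_parse_string($html, $config, 'utf8');
$tidy->cleanRepair();

$output = tidy_get_output($tidy);
echo $output, "\n";
```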