Parse XML with PHP

Chroder

New Member
Introduction
This post is meant to teach you how to parse XML into workable data using PHP4. In this post, we are going to work with an XML file that defines the layout of a book. The book will contain the book details and chapters. Each chapter records the page number it starts on and a short description. How to create an XML document is beyond the scope of this post, but be sure to check out W3 Schools.

The XML Document
Create a new XML file (that is, a file with a .xml extension) with the following contents. We will be using this XML file throughout this post to demonstrate.
Code:
<?xml version="1.0" encoding="ISO-8859-1" ?>
<!DOCTYPE book [
    <!ELEMENT book      (chapter+)>
    <!ELEMENT chapter   (desc+)>
    <!ELEMENT desc      (#PCDATA)>
    
    <!ATTLIST book      name    CDATA   "Unknown">
    <!ATTLIST book      author  CDATA   "Unknown">
    <!ATTLIST book      isbn    CDATA   "Unknown">
    <!ATTLIST chapter   name    CDATA   "Unknown">
    <!ATTLIST chapter   page    CDATA   "Unknown">
]>
<book name="The WindowsXP OS" author="Me" isbn="1234567890">
    <chapter name="Introduction" page="1">
        <desc>Book introduction</desc>
    </chapter>

    <chapter name="Who Should Read" page="3">
        <desc>About who the book is aimed to</desc>
    </chapter>

    <chapter name="Getting It" page="5">
        <desc>How to get the WindowsXP CD</desc>
    </chapter>

    <chapter name="Home or Pro" page="6">
        <desc>Which one is best for you?</desc>
    </chapter>

    <chapter name="Installing" page="7">
        <desc>The installation process</desc>
    </chapter>

    <chapter name="Making Users" page="10">
        <desc>How to add users to your system</desc>
    </chapter>

    <chapter name="Viruses" page="13">
        <desc>All about viruses and how to avoid them</desc>
    </chapter>

    <chapter name="Conclusion" page="15">
        <desc>Good-byes and further reading</desc>
    </chapter>

    <chapter name="Terms" page="16">
        <desc>Terms that you might not understand</desc>
    </chapter>

    <chapter name="Index" page="19">
        <desc>Index</desc>
    </chapter>
</book>

The BookParser Class
Whenever I parse XML, I like to create a nice custom class. You could create some sort of base class and just extend off of it for different types of XML documents, but for this post we'll be creating just a simple BookParser. So here's the code; we'll talk about it after.
PHP:
class BookParser
{
    var $book;
    var $xml;
    var $chapter;
    var $tag;


    //==========================================================================
    // BookParser
    // ----------
    // Constructor, set up parser
    //==========================================================================
    function BookParser()
    {
        $this->xml = xml_parser_create();

        xml_set_object($this->xml, $this);
        xml_set_element_handler($this->xml, 'elementStart', 'elementEnd');
        xml_set_character_data_handler($this->xml, 'characterData');
        xml_parser_set_option($this->xml, XML_OPTION_CASE_FOLDING, false);
    }



    //==========================================================================
    // parse
    // -----
    // Parse the document
    //==========================================================================
    function parse($path)
    {
        $fh = fopen($path, 'r') or die('Cannot open file `' . $path . '`');

        while($data = fread($fh, 4096))
        {
            xml_parse($this->xml, $data, feof($fh)) or die('Cannot parse file `' . $path . '`');
        }

        @fclose($fh);
    }



    //==========================================================================
    // elementStart
    // ------------
    // Handle opening tag
    //==========================================================================
    function elementStart($parser, $tag, $attr)
    {
        if($tag == 'book')
        {
            $this->book = array('name' => $attr['name'], 'author' => $attr['author'], 'isbn' => $attr['isbn']);
            $this->display();
        }

        elseif(empty($this->chapter) && $tag == 'chapter')
            $this->chapter = new ItemChapter($attr['name'], $attr['page']);

        else
            $this->tag = $tag;
    }



    //==========================================================================
    // elementEnd
    // ----------
    // Handle closing tag
    //==========================================================================
    function elementEnd($parser, $tag)
    {
        if($tag == 'chapter')
        {
            $this->chapter->display();
            unset($this->chapter);
        }

        $this->tag = '';
    }



    //==========================================================================
    // characterData
    // -------------
    // Handle element character data
    //==========================================================================
    function characterData($parser, $data)
    {
        if(!empty($this->chapter) && $this->tag == 'desc')
            $this->chapter->setDesc($data);
    }



    //==========================================================================
    // display
    // -------
    // Display book information
    //==========================================================================
    function display()
    {
        echo "<h1>{$this->book['name']}</h1>\n";
        echo "By <strong>{$this->book['author']}</strong> <small><em>(ISBN: {$this->book['isbn']})</em></small><br /><br /><br />\n\n\n";
    }
}

Continued ...
 

Chroder

New Member
BookParser Method
The constructor (the BookParser() method) does some pretty basic XML set up. First we create a new XML parser with the xml_parser_create() function. To handle the XML, we assign callback functions to certain events. For example, when the parser sees a new tag opened, it looks for the callback function to work with it. This is what xml_set_element_handler() and xml_set_character_data_handler() do. xml_set_element_handler() sets the handlers for open and close tags, and xml_set_character_data_handler() sets the handler for when data is encountered (meaning the text between the tags: <desc>this text</desc>). But since we're working with an object, we have to 'assign' the parser to the object; this is what xml_set_object() does. The parser now knows to call the method of the object instead of simply calling a function. And last but not least, we use xml_parser_set_option() to set XML_OPTION_CASE_FOLDING to false, thus disabling it. By default the parser makes all tag names uppercase. I think that looks messy, so I disable it.
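
To see what the case folding option actually changes, here is a minimal sketch that registers handlers on a parser and records the tag names it reports. It assumes PHP 5.3 or later (it uses closures instead of xml_set_object()), and the helper name collect_tags() is made up for illustration; it is not part of the BookParser class.

```php
<?php
// Hypothetical helper: returns the tag names passed to the start handler.
function collect_tags($xml, $caseFolding)
{
    $tags = array();
    $parser = xml_parser_create();

    // 1 = fold tag names to uppercase (the default), 0 = leave them as written.
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, $caseFolding);

    xml_set_element_handler(
        $parser,
        function ($parser, $tag, $attr) use (&$tags) {
            $tags[] = $tag; // fired for every opening tag
        },
        function ($parser, $tag) {
            // fired for every closing tag; nothing to do in this sketch
        }
    );

    xml_parse($parser, $xml, true); // true: this is the only (final) chunk
    return $tags;
}

$xml = '<book><chapter><desc>Index</desc></chapter></book>';
echo implode(', ', collect_tags($xml, 1)), "\n"; // BOOK, CHAPTER, DESC
echo implode(', ', collect_tags($xml, 0)), "\n"; // book, chapter, desc
```

With folding left on, the comparisons in the handlers would have to test for 'BOOK' and 'CHAPTER' instead.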

parse Method
The parse method is relatively simple. It just opens the desired file for reading and reads 4KB at a time. Each chunk that it reads is passed to xml_parse(), which in turn fires the events. The third argument, feof($fh), tells the parser whether this chunk is the last one in the document.
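
If you would rather get an error message back than die(), the same chunked loop can be sketched as a standalone function. This is only one way to do it: parse_file_in_chunks() is a hypothetical name, while xml_get_error_code(), xml_error_string() and xml_get_current_line_number() are the standard functions for asking the parser what went wrong.

```php
<?php
// Hypothetical helper: feeds a file to an existing parser in small chunks.
// Returns null on success or an error message on failure.
function parse_file_in_chunks($parser, $path, $chunkSize = 4096)
{
    $fh = fopen($path, 'r');
    if ($fh === false) {
        return 'Cannot open file `' . $path . '`';
    }

    $error = null;
    while (!feof($fh)) {
        $data = fread($fh, $chunkSize);
        if ($data === false) {
            $error = 'Cannot read file `' . $path . '`';
            break;
        }

        // The third argument is true only once the end of the file is reached.
        if (!xml_parse($parser, $data, feof($fh))) {
            $error = sprintf(
                '%s at line %d',
                xml_error_string(xml_get_error_code($parser)),
                xml_get_current_line_number($parser)
            );
            break;
        }
    }

    fclose($fh);
    return $error;
}
```

Calling parse_file_in_chunks($parser, 'book.xml') on a parser that already has its handlers set fires the same events as the parse() method above, and leaves it to the caller to decide what to do with an error.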

elementStart Method
The elementStart() method is the method fired when the parser finds an opening tag. With our XML document, that means the parser will fire it upon <book>, <chapter> and <desc>. The $tag variable holds the name of the tag it found, so it's easy for us to handle it. We just have to test it with some conditionals and then do what we want with it. The $attr variable is an array holding the attributes (like book has 'name', 'author' and 'isbn').

As you can see, when the tag is 'book', we set some book details in a member array and then run the display() method to echo out the book details. Normally, you'd wait until you found the closing tag to output things, but with <book> being the root element, that would be after all the chapters were finished. We don't want the book details at the bottom!

If we come across a chapter tag, then we want to get ready and store some details. I've made a separate class called ItemChapter (which we'll cover in a minute) that is simply a 'holder' class that holds the data and takes care of the display.

And lastly, for any other tag (in our document that can only be <desc>), we store the tag name in the class member variable $tag, so when the characterData() method is fired we can be sure that it is the contents of <desc>.
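
To make the $attr array a bit more concrete, here is a small sketch that collects the attributes of every <chapter> tag. It assumes PHP 5.3 or later for the closures, and chapter_attributes() is a made-up helper name, not part of the classes in this post.

```php
<?php
// Hypothetical helper: returns the attribute array of every <chapter> tag.
function chapter_attributes($xml)
{
    $found = array();
    $parser = xml_parser_create();

    // Keep names as written; with case folding on, the attribute
    // names would be uppercased along with the tag names.
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

    xml_set_element_handler(
        $parser,
        function ($parser, $tag, $attr) use (&$found) {
            if ($tag == 'chapter') {
                $found[] = $attr; // e.g. array('name' => ..., 'page' => ...)
            }
        },
        function ($parser, $tag) {
            // closing tags are not needed here
        }
    );

    xml_parse($parser, $xml, true);
    return $found;
}

$chapters = chapter_attributes(
    '<book name="The WindowsXP OS"><chapter name="Introduction" page="1"/></book>'
);
echo $chapters[0]['name'], ' starts on page ', $chapters[0]['page'], "\n";
// Introduction starts on page 1
```

Note that every attribute value arrives as a string, so the page number is '1', not the integer 1.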

elementEnd Method
The elementEnd function is fired when the parser comes across an end tag. In our case, that is </book>, </chapter>, and </desc>. The $tag variable tells us which one it is. We check if the tag is 'chapter', and if it is then we call the chapter display() method and unset the chapter object, ready to start a new one. And we always set the member variable $tag to an empty string.

characterData Method
As stated above, the characterData method is fired when the parser comes across, well, data. The stuff between tags is data. In our XML document, the only tag that contains data is <desc>, so we test to see if the current tag is 'desc' (remember we set the member variable $tag whenever elementStart is fired), and if it is, then we set the chapter description with the chapter object's setDesc() method.
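
One thing to keep in mind: the parser may hand over the text of a single element in several pieces (for example around an entity like &amp;amp; or at the edge of a 4KB chunk), so the data handler can fire more than once for the same <desc>. Since setDesc() overwrites the description each time, a safer pattern is to append. Here is a minimal sketch of that idea, assuming PHP 5.3 or later for the closures; read_desc() is a made-up helper, not part of the classes in this post.

```php
<?php
// Hypothetical helper: returns the text found inside <desc> in $xml.
function read_desc($xml)
{
    $inDesc = false;
    $desc = '';

    $parser = xml_parser_create();
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

    xml_set_element_handler(
        $parser,
        function ($parser, $tag, $attr) use (&$inDesc) {
            if ($tag == 'desc') {
                $inDesc = true;
            }
        },
        function ($parser, $tag) use (&$inDesc) {
            if ($tag == 'desc') {
                $inDesc = false;
            }
        }
    );

    xml_set_character_data_handler(
        $parser,
        function ($parser, $data) use (&$inDesc, &$desc) {
            if ($inDesc) {
                $desc .= $data; // append each piece instead of overwriting
            }
        }
    );

    xml_parse($parser, $xml, true);
    return trim($desc); // trim once, after all pieces have arrived
}

echo read_desc('<chapter><desc> Home &amp; Pro </desc></chapter>'), "\n";
// Home & Pro
```

Trimming only at the end also keeps the spaces between pieces intact.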

The ItemChapter Class
The ItemChapter class is really simple. It basically holds chapter data and displays it. Here's the code:
PHP:
class ItemChapter
{
    var $chapter;


    //==========================================================================
    // ItemChapter
    // -----------
    // Constructor, store the chapter name and page
    //==========================================================================
    function ItemChapter($name, $page)
    {
        $this->chapter = array('name' => $name, 'page' => $page);
    }



    //==========================================================================
    // display
    // -------
    // Display the chapter
    //==========================================================================
    function display()
    {
        echo "<strong>{$this->chapter['name']}</strong> <small><em>(p.{$this->chapter['page']})</em></small><br />\n";
        echo "{$this->chapter['desc']}<br /><br />\n\n";
    }


    //==========================================================================
    // setDesc
    // -------
    // Set the chapter description
    //==========================================================================
    function setDesc($desc)
    {
        $this->chapter['desc'] = trim($desc);
    }
}
The constructor just sets up some chapter details including the name and the page number. The setDesc() method sets the description and the display() method displays the chapter. Simple eh?

Let's Use It
Now it's time to use it! All the hard work is done; all we have to do is stick in some code. I put the two classes, BookParser and ItemChapter, in a file called 'parse.php' and included it. All you have to do is create a new instance of BookParser and then pass the XML file to the parse() method, easy as pie!
PHP:
<?php require_once('parse.php'); ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <title>Book Overview</title>
    <meta http-equiv="content-type" content="text/html; charset=iso-8859-1" /> 
    <style type="text/css" media="all">
        body, table, tbody, tr, th, td, p, div, span {
            font: normal 8pt Verdana, Arial, sans-serif;
            color: #000;
            line-height: 150%;
        }

        body {
            background-color: #E0E5ED;
            margin: 40px;
        }

        #wrap {
            background-color: #FFF;
            padding: 25px;
            border: 1px solid #405779;
        }
    </style>
</head>
<body>

<div id="wrap">

<?php

$book = new BookParser();
$book->parse('book.xml');

?>

</div>
</body>
</html>
And the result is a nice book overview seen in the attached image.

Anyway, hope you learned something :cool:
 

Attachments

  • screen.gif

Chroder

New Member
I learned a lot through just posting it :p The XML file took me a while, I needed to learn how to write the DTD ;)
 

zkiller

Super Moderator
Staff member
guess it was a great learning by doing experience for you then. i find it is most often easiest to teach someone how to do something by simply having them do it and assisting whenever they get stuck.
 

Chroder

New Member
Yeah, me too. Maybe tell them the how-to as in the routines and ideas, but have them actually do it. They learn better and teaching is easier :D
 