I recently needed to write a quick script to extract some information from HTML documents. Poking around, I found the machine already had Tcllib installed, so I thought it would be a good opportunity to try out the htmlparse library. The library is extremely easy to use except for one annoying pitfall: the handling of attributes. The parser invokes a callback with the tag name, a slash marker for closing tags, the attributes, and the text that follows the tag. A default callback, ::htmlparse::debugCallback, is provided. From the documentation of its param argument:
The fourth argument, param, contains the un-interpreted list of parameters to the tag.
I thought for sure they must be kidding. Do they really expect me to parse the attribute data myself? The reason I'm using a library is so I don't have to worry about all the intricacies of HTML parsing. I decided to give it a try with the example below:
#!/usr/bin/env tclsh
package require htmlparse 1.2
set html {
<html>
<head><title>Test HTML Page</title></head>
<body>
<p>This is some test <a target = "_blank"
href="http://w3.org/html" >HTML</a> content.</p>
</body>
</html>
}
::htmlparse::parse $html
The output of running this example:
$ ./htmlparse.tcl
==> hmstart {} {} {
}
==> html {} {} {
}
==> head {} {} {}
==> title {} {} {Test HTML Page}
==> title / {} {}
==> head / {} {
}
==> body {} {} {
}
==> p {} {} {This is some test }
==> a {} {target = "_blank"
href="http://w3.org/html" } HTML
==> a / {} { content.}
==> p / {} {
}
==> body / {} {
}
==> html / {} {
}
==> hmstart / {} {}
Sure enough, the attributes are all in one big string just as the documentation stated. This is one of those times I was hoping the documentation was wrong.
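If you do end up splitting the attribute string yourself, something along these lines may be enough for simple pages. This is only a rough sketch, not anything htmlparse provides: parseAttrs and onTag are hypothetical names of my own, the regular expression only handles name=value pairs (quoted or bare) and skips valueless attributes such as "disabled", and the dict command assumes Tcl 8.5 or later. The callback is called by hand here with the arguments the parser reported for the anchor tag in the output above.

```tcl
#!/usr/bin/env tclsh

# Hypothetical helper (not part of htmlparse): turn the raw attribute
# string into a dict mapping lowercased attribute names to values.
proc parseAttrs {param} {
    set attrs [dict create]
    # A name, optional whitespace around "=", then a double-quoted,
    # single-quoted, or bare value. Valueless attributes are ignored.
    set re {([^\s=]+)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"']+))}
    foreach {match name dq sq bare} [regexp -all -inline -- $re $param] {
        # Only one of dq, sq, bare is non-empty for any given match.
        dict set attrs [string tolower $name] $dq$sq$bare
    }
    return $attrs
}

# Hypothetical callback taking the same four arguments the default
# callback printed above: tag, slash, param, and the text after the tag.
proc onTag {tag slash param text} {
    # Only look at opening <a> tags.
    if {$slash eq "" && [string tolower $tag] eq "a"} {
        set attrs [parseAttrs $param]
        if {[dict exists $attrs href]} {
            puts [dict get $attrs href]
        }
    }
}

# Simulate the call the parser made for the anchor tag in the example.
onTag a {} "target = \"_blank\"\nhref=\"http://w3.org/html\" " HTML
```

To use it with the real parser, the callback can be passed through the -cmd option, as in "::htmlparse::parse -cmd onTag $html", after a "package require htmlparse". A regular expression like this will not cope with every corner of real-world HTML, so treat it as a starting point rather than a replacement for proper attribute parsing.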