
Jul 07

Using the Python HTMLParser library

When I was writing a script to download files off a site, I figured there would be an easy Python library to do that. Well, sort of. I chose to use the HTMLParser library. The documentation is not the best, so I thought I would add a bit of what I found. If I had to do it again, I might just use regular expressions to do it all.

First, if all the information you need is in the tag itself (its name and attributes), life is a lot easier. If not, you need a way to know where you are in the document. To do this, I suggest keeping a list in the class and using it as a stack: append each tag at the start of the handle_starttag() function, and take it off again in the handle_endtag() function. This way, when handle_data() is called, the last item on the stack tells you which tag you are inside.
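
As a minimal sketch of the easier case, here is a parser that needs nothing but the tag and its attributes, so no stack is required. It assumes Python 3, where the module is named html.parser; the class name LinkCollector and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser  # Python 3 name of the module


class LinkCollector(HTMLParser):
    """Hypothetical example class: gathers href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, so dict() gives easy lookup
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


collector = LinkCollector()
collector.feed('<p><a href="/a.zip">A</a> <a name="x">no link</a> <a href="/b.zip">B</a></p>')
print(collector.links)  # ['/a.zip', '/b.zip']
```

Everything happens in handle_starttag(), so handle_data() and handle_endtag() can be left alone.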

Below is an example template you can adapt for your own use.

 

import HTMLParser
class MyParse(HTMLParser.HTMLParser):
    def __init__(self):
        #super() does not work here: in Python 2, HTMLParser is an old-style class
        HTMLParser.HTMLParser.__init__(self)
        self.tag_stack = []
        self.attr_stack = []

    def handle_endtag(self, tag):
        #take the tag off the stack if it matches the next close tag
        #if you are expecting unmatched tags, then this needs to be more robust
        #the first check avoids an IndexError when the stack is empty
        if self.tag_stack and self.tag_stack[-1][0] == tag:
            self.tag_stack.pop()

    def handle_data(self, data):
        #'data' is the text between tags, not necessarily
        #matching tags
        #this gives you the last open tag, or None for text outside any tag
        tstack = self.tag_stack[-1] if self.tag_stack else None
        #do something with the text
           
    def handle_starttag(self, tag, attrs):
        #add tag to the stack
        self.tag_stack.append([tag, attrs])
        #if this tag is a link
        if tag == "a":
            #find the position of the href attribute, if the tag has one
            attr_loc = None
            for i, attr in enumerate(attrs):
                if attr[0] == 'href':
                    attr_loc = i
                    break
            #attr_loc is only set if we found a hyperlink
            if attr_loc is not None:
                #append to the last item in the stack the location of the hyperlink
                #note, this does not increase the length of the stack
                #as we are putting it inside the last item on the stack
                self.tag_stack[-1].append(attr_loc)
               
                #now we can do what we need with the hyperlink
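
The comment in handle_endtag() says that unmatched tags need something more robust. One possible approach, sketched here for Python 3 (html.parser) rather than the Python 2 module used in the template, is to keep tags that never close off the stack and to search down the stack for the matching open tag. The names StackParser and VOID_TAGS and the sample HTML are assumptions for illustration, not part of the library.

```python
from html.parser import HTMLParser  # Python 3 name of the module

# Elements that never get a closing tag, so they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class StackParser(HTMLParser):
    """Hypothetical example: records (enclosing tag, text) pairs."""

    def __init__(self):
        super().__init__()
        self.tag_stack = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.tag_stack.append((tag, attrs))

    def handle_endtag(self, tag):
        # Search from the top for the matching open tag; ignore stray end tags.
        for i in range(len(self.tag_stack) - 1, -1, -1):
            if self.tag_stack[i][0] == tag:
                # Drop the match and any unclosed tags opened after it.
                del self.tag_stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            current = self.tag_stack[-1][0] if self.tag_stack else None
            self.text.append((current, data.strip()))


parser = StackParser()
parser.feed("<div><p>one<br>two<li>three</p>four</div></span>five")
parser.close()  # flush any text still held in the buffer
print(parser.text)
```

Here the unclosed li is discarded when its enclosing p closes, the stray closing span is ignored, and text outside every tag is reported with None as its enclosing tag.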

 
How I would use this to go through a webpage (assuming MyParse is in the same file):
 

if __name__=="__main__":
    import httplib
    site = "curioussystem.com"
    file_loc = r"/index.php"
    conn = httplib.HTTPConnection(site)
    conn.request("GET", file_loc)
    r1 = conn.getresponse()
    #read the body into a variable; the response can only be read once
    data = r1.read()
    conn.close()
    t = MyParse()
    t.feed(data)     #where the action happens

 
One other note: when I actually had to download something, I used the subprocess module to call wget. Doing the download itself in Python was too much work for what I wanted.
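
A rough sketch of that wget call, assuming Python 3 and that wget is on the PATH. The helper names, the example URL, and the choice of flags (-q for quiet, -P for the target directory) are my assumptions here, not a record of the exact command used.

```python
import subprocess


def build_wget_command(url, dest_dir="."):
    # -q silences progress output; -P sets the directory to save into
    return ["wget", "-q", "-P", dest_dir, url]


def download(url, dest_dir="."):
    # check=True raises CalledProcessError if wget exits with an error
    subprocess.run(build_wget_command(url, dest_dir), check=True)


print(build_wget_command("http://example.com/file.zip", "downloads"))
```

Passing the command as a list rather than a single string keeps the URL from being interpreted by a shell.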

Permanent link to this article: http://blog.curioussystem.com/2011/07/using-the-python-htmlparser-library/
