
COMP118: Procedural Programming (F'05)
Week 14 Lecture Notes: Finishing the Web Crawling Program

When last we met...

The design for the crawler.py program looked like this:

# Ask the user for a starting URL.
# Add the starting URL to a list of unvisited URLs.
# While there are still unvisited URLs:
    # Get the first unvisited URL and remove it from the list.
    # Add this URL to the list of visited URLs.
    # Fetch the HTML found at this URL.
    # Extract all local site links from the HTML and "normalize" them.
    # For each of the link URLs:
        # If this link URL hasn't already been visited and isn't in the unvisited list:
            # Add this link URL to the unvisited list.
# Display the list of visited URLs.

We had written the code for everything up to the "Extract all local site links from HTML and 'normalize' them" task:

import re
import urllib

def validate_url(candidate_url):
    """Return (True, Normalized URL) if the given URL is valid."""
    # Only match URLs that end with .htm or .html (http:// is optional).
    matches = re.findall(r"^(http://)?(.+/.+\.html?)$", candidate_url)

    if matches:
        # Pull out the URL from the second group in the pattern.
        normalized_url = matches[0][1]
    else:
        # Invalid URL, so normalized form is irrelevant.
        normalized_url = ""

    # Return a (Boolean, String) tuple to the calling code.
    return (len(matches) == 1, normalized_url)


# Ask the user for a starting URL.
while True:
    start_url = raw_input("Enter the starting URL: ")
    (is_valid_url, start_url) = validate_url(start_url)
    
    if is_valid_url:
        break
    
    print "Please enter a VALID starting URL."

# URLs that have yet to be crawled.
unvisited_urls = []

# URLs that have already been crawled.
visited_urls = []

# Add the starting URL to a list of unvisited URLs.
unvisited_urls.append(start_url)

# While there are still unvisited URLs:
while unvisited_urls:
    # Get the first unvisited URL and remove it from the list.
    page_url = unvisited_urls[0]
    unvisited_urls = unvisited_urls[1:]

    # Add this URL to the list of visited URLs.
    visited_urls.append(page_url)

    # Fetch the HTML found at this URL.
    html = urllib.urlopen("http://" + page_url).read()

    # Extract all local site links from the HTML and "normalize" them.
    link_urls = extract_link_urls(html)
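
Before moving on, a quick reminder of how validate_url() behaves. With the function as defined above, a few made-up URLs (www.example.com is invented just for illustration) give:

print validate_url("http://www.example.com/index.html")  # -> (True, 'www.example.com/index.html')
print validate_url("www.example.com/notes/week1.htm")     # -> (True, 'www.example.com/notes/week1.htm')
print validate_url("www.example.com/")                    # -> (False, '')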

Extracting and Normalizing the Link URLs

The next step was to define the extract_link_urls() function using a regular expression to find all <a>nchor tags in the given HTML:

def extract_link_urls(html):
    """Return normalized intra-site link URLs found in the HTML."""
    pattern = """
        <a            # beginning of anchor tag
        .+?           # stuff before href attribute
        href="        # beginning of href attribute
        (.+?\.html?)  # contents of href attribute (HTML only)
        "             # end of href attribute
        .+?           # stuff after href attribute
        >             # end of the anchor tag
    """

    matches = re.findall(pattern, html, re.VERBOSE)

Then we had to eliminate any links that lead to other sites, and normalize the relative links found by our regular expression:

    # Found link URLs that don't leave the site.
    link_urls = []

    # Only add intra-site link URLs to the returned list.
    for match in matches:
        if match[:7] != "http://":
            link_urls.append(normalize_url(match))

    return link_urls
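
To see what that pattern actually captures, here's a quick test of just the regular expression on a made-up scrap of HTML (the file names are invented for the example; the pattern is the same one as above, written compactly without the verbose comments):

import re

pattern = r'<a.+?href="(.+?\.html?)".+?>'
html = '<a class="x" href="notes/week1.html">Week 1</a> <a class="x" href="week2.html">Week 2</a>'

print re.findall(pattern, html)   # -> ['notes/week1.html', 'week2.html']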

Excluding every link that starts with http:// may be over-aggressive: it guarantees that we never follow a link to another site, although we may miss some absolute links back to this one. Now it's a matter of normalizing each intra-site link. After a moment's reflection, we realized that normalizing a link means converting a relative URL into an absolute one. Since every found link URL is relative to the current page, normalize_url() needs the current page's URL (and therefore extract_link_urls() does too). Passing the current page's URL along requires a few tweaks:

def extract_link_urls(html, page_url):
...
    # Only add intra-site link URLs to the returned list.
    for match in matches:
        if match[:7] != "http://":
            link_urls.append(normalize_url(match, page_url))
...
    # Extract all local site links from the HTML and "normalize" them.
    link_urls = extract_link_urls(html, page_url)

And now for the normalize_url() function definition:

import os
...
def normalize_url(link_url, page_url):
    """Return the normalized form of the given link (relative to the page URL)."""
    # Strip off the file name from the current page's URL.
    page_path = os.path.dirname(page_url)
    # Join (concatenate) the current page's URL path to the new link.
    joined_url = os.path.join(page_path, link_url)
    # Normalize the resulting path (deal with relative folder references).
    normalized_url = os.path.normpath(joined_url)
    # Return the result, replacing backslashes with slashes.
    return normalized_url.replace("\\", "/")

That one's not too obvious, but the gist of it is to let Python's built-in os.path module do the heavy lifting. The module is really meant for file system paths, but the same logic for resolving a relative path against a base applies to URL paths. One problem we discovered later is that it doesn't properly handle absolute-path link URLs (those that begin with a slash); they're easy enough to deal with, but we'll just ignore them for now.
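
To make that concrete, here is what normalize_url() returns for a few made-up inputs (the URLs are invented for the example):

page = "www.example.com/notes/week14.html"

print normalize_url("summary.html", page)     # -> www.example.com/notes/summary.html
print normalize_url("../index.html", page)    # -> www.example.com/index.html
print normalize_url("./sg/sg1.html", page)    # -> www.example.com/notes/sg/sg1.html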

Now that our link URLs are normalized, we can finish off the main program:

    # Extract all local site links from the HTML and "normalize" them.
    link_urls = extract_link_urls(html, page_url)

    # For each of the link URLs:
    for link_url in link_urls:
        # If this link URL hasn't already been visited and isn't in the unvisited
        # list:
        if link_url not in visited_urls and link_url not in unvisited_urls:
            # Add this link URL to the unvisited list.
            unvisited_urls.append(link_url)

    print "Found %d links in %s, %d unvisited, %d visited" % (len(link_urls),
                                                              page_url,
                                                              len(unvisited_urls),
                                                              len(visited_urls))
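
One small design note on that duplicate check: testing membership with the in operator scans the whole list each time, which is perfectly fine for a site this size. On a much larger crawl, the visited URLs could be kept in a set instead, where membership tests are much faster. A minimal sketch of that variation (with made-up URLs, and not part of the course program):

visited = set()
unvisited = ["www.example.com/index.html"]

while unvisited:
    url = unvisited.pop(0)    # same first-come, first-served order as the slicing above
    visited.add(url)

    # Stand-in for fetching the page and extracting its links.
    for link in ["www.example.com/a.html", "www.example.com/index.html"]:
        if link not in visited and link not in unvisited:
            unvisited.append(link)

print visited   # both URLs end up visited exactly once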

The program is complete, and we can test it out on a certain instructor's site:

Enter the starting URL: http://netlab2.yukoncollege.yk.ca/Brian_Butler/index.html
Found 3 links in netlab2.yukoncollege.yk.ca/Brian_Butler/index.html, 3 unvisited, 1 visited
Found 7 links in netlab2.yukoncollege.yk.ca/Brian_Butler/comp225/index.html, 9 unvisited, 2 visited
Found 1 links in netlab2.yukoncollege.yk.ca/Brian_Butler/office.html, 8 unvisited, 3 visited
Found 4 links in netlab2.yukoncollege.yk.ca/Brian_Butler/pers_index.html, 9 unvisited, 4 visited
Found 1 links in netlab2.yukoncollege.yk.ca/index.html, 9 unvisited, 5 visited
Found 1 links in netlab2.yukoncollege.yk.ca/Brian_Butler/comp225/assignments.html, 8 unvisited, 6 visited
Found 0 links in netlab2.yukoncollege.yk.ca/Brian_Butler/comp225/c225_04_marks.html, 7 unvisited, 7 visited
Found 2 links in netlab2.yukoncollege.yk.ca/Brian_Butler/comp225/comp225_news.html, 6 unvisited, 8 visited
Found 29 links in netlab2.yukoncollege.yk.ca/Brian_Butler/comp225/notes.html, 32 unvisited, 9 visited
Found 1 links in netlab2.yukoncollege.yk.ca/Brian_Butler/comp225/225out05.html, 31 unvisited, 10 visited
Found 1 links in netlab2.yukoncollege.yk.ca/Brian_Butler/comp225/resources.html, 30 unvisited, 11 visited
Found 1 links in netlab2.yukoncollege.yk.ca/Brian_Butler/contact.html, 29 unvisited, 12 visited
Found 4 links in netlab2.yukoncollege.yk.ca/Brian_Butler/vy1yc.html, 31 unvisited, 13 visited
Found 0 links in netlab2.yukoncollege.yk.ca/usage/index.html, 30 unvisited, 14 visited
Found 2 links in netlab2.yukoncollege.yk.ca/Brian_Butler/comp225/notes/apache.html, 29 unvisited, 15 visited
Found 3 links in netlab2.yukoncollege.yk.ca/Brian_Butler/comp225/notes/week1.html, 29 unvisited, 16 visited
Found 0 links in netlab2.yukoncollege.yk.ca/Brian_Butler/comp225/sg/sg1.html, 28 unvisited, 17 visited
Found 2 links in netlab2.yukoncollege.yk.ca/Brian_Butler/comp225/notes/week2.html, 27 unvisited, 18 visited
Found 0 links in netlab2.yukoncollege.yk.ca/Brian_Butler/comp225/sg/sg2.html, 26 unvisited, 19 visited
Found 4 links in netlab2.yukoncollege.yk.ca/Brian_Butler/comp225/notes/week3.html, 27 unvisited, 20 visited
Found 2 links in netlab2.yukoncollege.yk.ca/Brian_Butler/comp225/notes/week4.html, 26 unvisited, 21 visited
Found 0 links in netlab2.yukoncollege.yk.ca/Brian_Butler/comp225/sg/sg3.html, 25 unvisited, 22 visited
Found 2 links in netlab2.yukoncollege.yk.ca/Brian_Butler/comp225/notes/week5.html, 24 unvisited, 23 visited
Found 0 links in netlab2.yukoncollege.yk.ca/Brian_Butler/comp225/sg/sg4.html, 23 unvisited, 24 visited
Found 2 links in netlab2.yukoncollege.yk.ca/Brian_Butler/comp225/notes/week6.html, 22 unvisited, 25 visited
Found 0 links in netlab2.yukoncollege.yk.ca/Brian_Butler/comp225/sg/sg5.html, 21 unvisited, 26 visited
Found 4 links in netlab2.yukoncollege.yk.ca/Brian_Butler/comp225/notes/week7.html, 22 unvisited, 27 visited
Found 0 links in netlab2.yukoncollege.yk.ca/Brian_Butler/comp225/sg/sg6.html, 21 unvisited, 28 visited
Found 2 links in netlab2.yukoncollege.yk.ca/Brian_Butler/comp225/notes/week8.html, 20 unvisited, 29 visited
Found 0 links in netlab2.yukoncollege.yk.ca/Brian_Butler/comp225/sg/sg7.html, 19 unvisited, 30 visited
Found 2 links in netlab2.yukoncollege.yk.ca/Brian_Butler/comp225/notes/week9.html, 18 unvisited, 31 visited
Found 0 links in netlab2.yukoncollege.yk.ca/Brian_Butler/comp225/sg/sg8.html, 17 unvisited, 32 visited
Found 2 links in netlab2.yukoncollege.yk.ca/Brian_Butler/comp225/notes/week10.html, 16 unvisited, 33 visited
Found 0 links in netlab2.yukoncollege.yk.ca/Brian_Butler/comp225/sg/sg9.html, 15 unvisited, 34 visited
Found 2 links in netlab2.yukoncollege.yk.ca/Brian_Butler/comp225/notes/week11.html, 14 unvisited, 35 visited
Found 0 links in netlab2.yukoncollege.yk.ca/Brian_Butler/comp225/sg/sg10.html, 13 unvisited, 36 visited
Found 2 links in netlab2.yukoncollege.yk.ca/Brian_Butler/comp225/notes/week12.html, 12 unvisited, 37 visited
Found 0 links in netlab2.yukoncollege.yk.ca/Brian_Butler/comp225/sg/sg11.html, 11 unvisited, 38 visited
Found 2 links in netlab2.yukoncollege.yk.ca/Brian_Butler/comp225/notes/week13.html, 10 unvisited, 39 visited
Found 0 links in netlab2.yukoncollege.yk.ca/Brian_Butler/comp225/sg/sg12.html, 9 unvisited, 40 visited
Found 3 links in netlab2.yukoncollege.yk.ca/Brian_Butler/comp225/notes/week14.html, 8 unvisited, 41 visited
Found 2 links in netlab2.yukoncollege.yk.ca/Brian_Butler/radiopics.html, 7 unvisited, 42 visited
Found 2 links in netlab2.yukoncollege.yk.ca/Brian_Butler/vy1atv.html, 7 unvisited, 43 visited
Found 1 links in netlab2.yukoncollege.yk.ca/Brian_Butler/VHF+_beacons.html, 6 unvisited, 44 visited
Found 3 links in netlab2.yukoncollege.yk.ca/Brian_Butler/comp225/notes/week1a.html, 5 unvisited, 45 visited
Found 2 links in netlab2.yukoncollege.yk.ca/Brian_Butler/comp225/notes/vi.html, 4 unvisited, 46 visited
Found 2 links in netlab2.yukoncollege.yk.ca/Brian_Butler/comp225/notes/emacs.html, 3 unvisited, 47 visited
Found 4 links in netlab2.yukoncollege.yk.ca/Brian_Butler/comp225/notes/shellpgm.html, 2 unvisited, 48 visited
Found 4 links in netlab2.yukoncollege.yk.ca/Brian_Butler/comp225/notes/testing.html, 1 unvisited, 49 visited
Found 1 links in netlab2.yukoncollege.yk.ca/Brian_Butler/ATV_images.html, 0 unvisited, 50 visited

The Complete Program

import os
import re
import urllib

def extract_link_urls(html, page_url):
    """Return normalized intra-site link URLs found in the HTML."""
    pattern = """
        <a            # beginning of anchor tag
        .+?           # stuff before href attribute
        href="        # beginning of href attribute
        (.+?\.html?)  # contents of href attribute (HTML only)
        "             # end of href attribute
        .+?           # stuff after href attribute
        >             # end of the anchor tag
    """

    matches = re.findall(pattern, html, re.VERBOSE)

    # Found link URLs that don't leave the site.
    link_urls = []

    # Only add intra-site link URLs to the returned list.
    for match in matches:
        if match[:7] != "http://":
            link_urls.append(normalize_url(match, page_url))

    return link_urls

def normalize_url(link_url, page_url):
    """Return the normalized form of the given link (relative to the page URL)."""
    # Strip off the file name from the current page's URL.
    page_path = os.path.dirname(page_url)
    # Join (concatenate) the current page's URL path to the new link.
    joined_url = os.path.join(page_path, link_url)
    # Normalize the resulting path (deal with relative folder references).
    normalized_url = os.path.normpath(joined_url)
    # Return the result, replacing backslashes with slashes.
    return normalized_url.replace("\\", "/")
   
def validate_url(candidate_url):
    """Return (True, Normalized URL) if the given URL is valid."""
    # Only match URLs that end with .htm or .html (http:// is optional).
    matches = re.findall(r"^(http://)?(.+/.+\.html?)$", candidate_url)

    if matches:
        # Pull out the URL from the second group in the pattern.
        normalized_url = matches[0][1]
    else:
        # Invalid URL, so normalized form is irrelevant.
        normalized_url = ""

    # Return a (Boolean, String) tuple to the calling code.
    return (len(matches) == 1, normalized_url)


# Ask the user for a starting URL.
while True:
    start_url = raw_input("Enter the starting URL: ")
    (is_valid_url, start_url) = validate_url(start_url)
    
    if is_valid_url:
        break
    
    print "Please enter a VALID starting URL."

# URLs that have yet to be crawled.
unvisited_urls = []

# URLs that have already been crawled.
visited_urls = []

# Add the starting URL to a list of unvisited URLs.
unvisited_urls.append(start_url)

# While there are still unvisited URLs:
while unvisited_urls:
    # Get the first unvisited URL and remove it from the list.
    page_url = unvisited_urls[0]
    unvisited_urls = unvisited_urls[1:]

    # Add this URL to the list of visited URLs.
    visited_urls.append(page_url)

    # Fetch the HTML found at this URL.
    html = urllib.urlopen("http://" + page_url).read()

    # Extract all local site links from the HTML and "normalize" them.
    link_urls = extract_link_urls(html, page_url)

    # For each of the link URLs:
    for link_url in link_urls:
        # If this link URL hasn't already been visited and isn't in the unvisited
        # list:
        if link_url not in visited_urls and link_url not in unvisited_urls:
            # Add this link URL to the unvisited list.
            unvisited_urls.append(link_url)

    print "Found %d links in %s, %d unvisited, %d visited" % (len(link_urls),
                                                              page_url,
                                                              len(unvisited_urls),
                                                              len(visited_urls))