Coursetree: the Epilogue and Return

Aug 23 2011

The sign that an adventure is over is always a return to the starting point after traveling to the other side of the world. In the early days, when I wrote the crawler for courses, I gathered the faculty links from a table with a small scraper. I no longer remember the exact method, and it could have been done in numerous other ways. When I came back to the project several months later, I found only the links themselves. One reason I may not have recorded the method is that the links contained duplicates, such as the 100s and 200s referring to the same page.

At the end of this journey, I have many pages of documentation on the system. However, the system still lacked a way of handling new course pages. Now it is complete, with only a few lines of code that would have been impossible without the journey:

# Python 2, BeautifulSoup 3
import re, cgi
from urllib import urlopen
from urlparse import urlparse
from BeautifulSoup import BeautifulSoup, SoupStrainer

html = urlopen("http://ugradcalendar.uwaterloo.ca/page/Course-Descriptions-Index").read()
# Parse only the <a> tags whose href points at courses.aspx
linksToCourses = SoupStrainer('a', href=re.compile(r'courses\.aspx'))
links = [tag['href'] for tag in BeautifulSoup(html, parseOnlyThese=linksToCourses)]
# Take the Code query parameter of each link; set() drops the duplicates
faculties = list(set([cgi.parse_qs(urlparse(link)[4])['Code'][0] for link in links]))
faculties

[u'CIVE',
u'ARBUS',
u'ECON',
u'INTEG',
u'JS',
u'REES',
u'WS',
u'DUTCH',
u'CHEM',
u'AVIA',
u'PSYCH',
u'SPD',
u'SPCOM',
u'CROAT',
u'CHE',
u'HRM',
u'ENBUS',
u'SCBUS',
u'PACS',
u'SYDE',
u'KIN',
u'LAT',
u'STAT',
u'INDEV',
u'SMF',
u'CMW',
u'FINE',
u'PORT',
u'GER',
u'KOREA',
u'SCI',
u'BUS',
u'SWREN',
u'HIST',
u'AMATH',
u'PD',
u'RUSS',
u'OPTOM',
u'AFM',
u'COOP',
u'ECE',
u'MSCI',
u'NATST',
u'GRK',
u'ME',
u'INTTS',
u'RS',
u'GERON',
u'ITALST',
u'HLTH',
u'JAPAN',
u'MATH',
u'PLAN',
u'FR',
u'PHIL',
u'ENGL',
u'ISS',
u'ITAL',
u'PDENG',
u'SOCWK',
u'REC',
u'ARTS',
u'MTHEL',
u'NE',
u'BIOL',
u'APPLS',
u'EARTH',
u'CLAS',
u'CO',
u'CM',
u'ACTSC',
u'POLSH',
u'DRAMA',
u'COMM',
u'CS',
u'SPAN',
u'SI',
u'PSCI',
u'CHINA',
u'WKRPT',
u'SE',
u'HUMSC',
u'ERS',
u'ARCH',
u'EASIA',
u'DAC',
u'PMATH',
u'LS',
u'GEOE',
u'GEOG',
u'PHYS',
u'PDPHRM',
u'IS',
u'SOC',
u'STV',
u'MUSIC',
u'ANTH',
u'ESL',
u'MTE',
u'ENVS',
u'INTST',
u'PHARM',
u'GENE',
u'ENVE']

This gets the list of faculties from the table: it takes the Code query parameter of every course link and passes the results through a set to remove duplicates.
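
To make the deduplication step concrete, here is a minimal sketch in Python 3 (the code above is Python 2), run on a few made-up hrefs rather than the live page. The sample links, the `Level` parameter, and the helper name `faculty_codes` are assumptions for illustration; only the `Code` query parameter comes from the snippet above.

```python
from urllib.parse import urlparse, parse_qs

def faculty_codes(links):
    """Return the sorted, de-duplicated Code query values found in links."""
    codes = set()
    for link in links:
        query = parse_qs(urlparse(link).query)
        # Skip links that carry no Code parameter instead of raising KeyError.
        if 'Code' in query:
            codes.add(query['Code'][0])
    return sorted(codes)

# Hypothetical hrefs: two levels of the same subject share one code.
sample_links = [
    'courses.aspx?Code=CS&Level=100',
    'courses.aspx?Code=CS&Level=200',
    'courses.aspx?Code=MATH&Level=100',
]
print(faculty_codes(sample_links))
```

The two CS links collapse into a single entry, which is the same effect the set has on the duplicated 100s and 200s links.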


import MySQLdb

# Fill in the connection details for your own database
db = MySQLdb.connect(user='', db='', passwd='', host='')
cursor = db.cursor()
cursor.execute('SELECT faculty FROM faculties ORDER BY faculty')
db_faculties = [row[0] for row in cursor.fetchall()]
db.close()

set(faculties) - (set(faculties) & set(db_faculties))

set([u'BUS', u'COOP', u'PD', u'PDPHRM'])

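Subtracting the intersection of the two sets from the faculties in the table leaves the set of new items: those listed on the page but not yet in the database. (The result is the same as the plain set difference, set(faculties) - set(db_faculties).) The new items may then be added to the database; I leave that to the reader.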
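
As one possible way to finish the step left to the reader, the sketch below uses Python 3 and an in-memory SQLite database as a stand-in for the MySQL one, so it runs without a server. The `faculties` table and `faculty` column come from the query above; the helper name `add_new_faculties` and the sample codes are assumptions. It also checks that subtracting the intersection gives the same result as a plain set difference.

```python
import sqlite3

def add_new_faculties(db, scraped):
    """Insert scraped codes missing from the faculties table; return them sorted."""
    cursor = db.cursor()
    cursor.execute('SELECT faculty FROM faculties')
    known = {row[0] for row in cursor.fetchall()}
    new = set(scraped) - known
    # Same result as the longer "subtract the intersection" form.
    assert new == set(scraped) - (set(scraped) & known)
    cursor.executemany('INSERT INTO faculties (faculty) VALUES (?)',
                       [(code,) for code in sorted(new)])
    db.commit()
    return sorted(new)

# In-memory SQLite stands in for MySQL here (assumption for illustration).
db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE faculties (faculty TEXT PRIMARY KEY)')
db.executemany('INSERT INTO faculties (faculty) VALUES (?)', [('CS',), ('MATH',)])

added = add_new_faculties(db, ['CS', 'MATH', 'BUS', 'COOP'])
print(added)
```

With MySQLdb the same idea should carry over, with `%s` in place of `?` as the query placeholder. Running the function a second time with the same input adds nothing, since every code is already known.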

Not only is this the first step when updating the database, it is also a miniature model of the entire application. It wraps everything up in a way that serves a purpose.

Tags: python, web scraping
