Coursetree: the Epilogue and Return
The sign the adventure is over is always a return to the starting point after traveling to the other side of the world. In the early days when I wrote the crawler for courses, I gathered faculty links from a table by writing a small scraper for the links. The exact method is not clear, but it could have also been done in numerous other ways. Later, I found only those links when I came back to the project after several months. One reason I may not have recorded the method may be that there were duplicates in them, such as 100’s and 200’s referring to the same page.
At the end of this journey, I have many pages of documentation on the system. However, the system still lacked a way of handling new course pages. Now the system is complete, with only a few lines of code, which would have been impossible without the journey:
html = urlopen("http://ugradcalendar.uwaterloo.ca/page/Course-Descriptions-Index").read()
from BeautifulSoup import BeautifulSoup, SoupStrainer
import re, cgi
from urlparse import urlparse
linksToCourses = SoupStrainer('a', href=re.compile('courses.aspx'))
links = [tag['href'] for tag in BeautifulSoup(html, parseOnlyThese=linksToCourses)]
faculties = list(set([cgi.parse_qs(urlparse(link)[4])['Code'][0] for link in links]))
faculties
u'ARBUS',
u'ECON',
u'INTEG',
u'JS',
u'REES',
u'WS',
u'DUTCH',
u'CHEM',
u'AVIA',
u'PSYCH',
u'SPD',
u'SPCOM',
u'CROAT',
u'CHE',
u'HRM',
u'ENBUS',
u'SCBUS',
u'PACS',
u'SYDE',
u'KIN',
u'LAT',
u'STAT',
u'INDEV',
u'SMF',
u'CMW',
u'FINE',
u'PORT',
u'GER',
u'KOREA',
u'SCI',
u'BUS',
u'SWREN',
u'HIST',
u'AMATH',
u'PD',
u'RUSS',
u'OPTOM',
u'AFM',
u'COOP',
u'ECE',
u'MSCI',
u'NATST',
u'GRK',
u'ME',
u'INTTS',
u'RS',
u'GERON',
u'ITALST',
u'HLTH',
u'JAPAN',
u'MATH',
u'PLAN',
u'FR',
u'PHIL',
u'ENGL',
u'ISS',
u'ITAL',
u'PDENG',
u'SOCWK',
u'REC',
u'ARTS',
u'MTHEL',
u'NE',
u'BIOL',
u'APPLS',
u'EARTH',
u'CLAS',
u'CO',
u'CM',
u'ACTSC',
u'POLSH',
u'DRAMA',
u'COMM',
u'CS',
u'SPAN',
u'SI',
u'PSCI',
u'CHINA',
u'WKRPT',
u'SE',
u'HUMSC',
u'ERS',
u'ARCH',
u'EASIA',
u'DAC',
u'PMATH',
u'LS',
u'GEOE',
u'GEOG',
u'PHYS',
u'PDPHRM',
u'IS',
u'SOC',
u'STV',
u'MUSIC',
u'ANTH',
u'ESL',
u'MTE',
u'ENVS',
u'INTST',
u'PHARM',
u'GENE',
u'ENVE']
This gets a list of faculties from the table by returning a set.
import MySQLdb
db = MySQLdb.connect(user='', db='', passwd='', host='')
cursor = db.cursor()
cursor.execute('SELECT faculty FROM faculties ORDER BY faculty')
db_faculties = [row[0] for row in cursor.fetchall()]
db.close()
set(faculties) - (set(faculties) & set(db_faculties))
By subtracting the intersection of faculties in the database from the faculties in the table, the set of new items are found. The new items may be added to the database. I leave it to the reader.
Not only is this the first step when updating the database, it is a miniature model of the entire application. Thus it wraps everything up in a way that serves a purpose.