Archive for the 'Programming' Category

Functional Javascript Programming using Underscore

Aug 10 2011 Published by under Programming

The Scenario

Suppose a javascript function is needed to parse text and execute another function for certain lines of text. Each line of the text is ended with a line break. If there is a keyword ‘get’, then the word following it needs to be checked if it is in the list of files on the server, and get it, using a GET request. So ‘get this’ would get the file ‘this’ from the server.

Object Oriented Approach

First, I tried doing this using the traditional way, with array indexes of letters.

if(input.value.split(' ').indexOf('get')==-1) return;
var lines = input.value.split('\n');
...

I stopped right there the next line is supposed to be a for each loop. That satisfied the requirements for using a functional library called underscore.js.

Mixed Approach

Not my favorite one, as it makes the painting look like a collage, and I think it made an error in the code harder to spot. See if you can spot it:

_(input.value.split(' ')).chain()
    .select(function (line) {
        var words = line.split(' ');
        return words.length > 1 && words[0] == 'get' && _.include(files, words[1]);
    })
    .each(function (line) {
        executeGet(line);
    });

Aha! The space should actually be newline.

Functional Approach
This time, I had to modify the code to use a single word from the word array, the last one.
The main line that violated the functional approach was

var words = line.split(' ');

along with the array accesses, which should be replaced with first and rest. It turns out there is a purely functional way:

_(input.value.split('\n')).chain()
    .map(function (line) {
        return line.replace(/get\s*/, '');
    })
    .select(function (module) {
        return module in files;
    })
    .each(function (module) {
        $.get(files[module]);
    })

The custom executeGet function has been replaced by jQuery’s $.get.

Conclusions
Which approach is better? Using the OO approach, a local index variable would be needed to be created for the for loop, along with other locals, which adds irrelevant lines to the code. On the other hand, each line on its own makes sense. However, functional code cannot be understood except as a whole, because there are no variables. Statelessness has its own beauty in programming. It’s similar to running a quantum computer, not caring how bits disappear or time warp, but only for the result of the computation.

No responses yet

Coursetree 2.0: An Intelligent Backend Coming Soon

Jun 24 2011 Published by under CourseTree,Singularitarian

The goal of coursetree 2.0 is to leverage the current cloud infrastructure to deliver semantic applications that help users find the information they are looking for.

Features currently planned:

  1. Course search that understands what the user wants
  2. Filtering of irrelevant links
  3. Pattern based degree data mining

Draft implementation strategy:

  1. Let Google search index Wikipedia and video links
  2. Bayesian classifier will be used to categorize link content into subjects
  3. Template induction and template scraping

Features under consideration:

  1. Adaptable prerequisite semantic analysis
  2. Fully automated template learning and template extraction
  3. Relevant course links/suggested courses

Tenative ideas:

  1. Genetic algorithm for grammar rule generation with fitness score assigned according to the total number of parse errors
  2. Use hashing algorithms to detect similarity in sections of a page, feed similar sections using wrapper induction to generate template
  3. Build map of courses using anti-requisites and display nearest neighbors

Unsuccessful incubation features:

  1. Using genetic algorithms to generate templates for wrapper induction
  2. Switch to parse trees extract noun phrases for Wikipedia link candidates
  3. YQL for video link scraping

Lessons learned and salvaged:

  1. Don’t use genetic programming methods where scores cannot be assigned to each individual “program”, as many of the templates were simply fails with zero scores
  2. Although in some cases successful (with a comma separated list of noun phrases), in other cases single words were marked as noun phrases in the parse tree instead of a more desirable longer phrase
  3. Due to frequent changes in video sites, nested JavaScript callbacks with closures to glue previews to links made the code a target to be recycled

No responses yet

JavaScript URL Parsing

Apr 24 2011 Published by under Programming

Many web frameworks have URLs in the format of domain/category/page, but what if there’s a need for that outside of the framework, like in client-side scripting? Two lines of code is enough for getting those URL parameters:

var url = /(\w+):\/\/([\w.]+(?:\:8000)?)\/(\S*)\/(\S*)\//;
var result = "http://127.0.0.1:8000/category/page/".match(url);

One special feature of JavaScript regular expressions is (?…) is a group that does not show in the result array.

If you wanted to map the result to different actions:

if (result != null) {
    switch(result[4]){ //page
        case 'page1':
            //code block1
            break;
    }
}

No responses yet

Course Explorer Refresh, with Inspiration

Dec 23 2010 Published by under CourseTree

Coursetree 1.5

screenshot.21

New Features:

  • Successor relationships
  • Textbooks by ISBN from the bookstore
  • Department requirements
  • 2 clicks at most to look up information on a course
  • Intelligent back button behavior
  • Links to Wikipedia
  • HTML5 semantic markup

Bugfixes:

  • All issues listed in the previous release notes
  • Phrase “one of” being missed by the scraper
  • Parser can handle phrases such as “taken prior to”
  • M(y)isunderstanding regarding commas corrected
    • CS 136 or 145, MATH 135
    • now becomes (CS 136 or 145) and MATH 135
    • should see more prerequisites listed as a result
  • Courses being overwritten by labs

Bugs:

  • A couple of quirks showing up as errors in the parser
    • EARTH 121, 121L or 122, 122L should be interpreted as EARTH 121/EARTH 122
    • CHEM 264 or 28/262 should be interpreted as CHEM 264/CHEM 282/CHEM262
  • Courses overlapping in the prerequisite graph

Waterloo Course Planner has been tested and works on IE8, Firefox 3.6, Chrome 9, Safari 5, and Opera 11.

No responses yet

ASAP Loading with LABjs

Dec 02 2010 Published by under Programming

Plan 1

Let’s start off with some dull code to introduce the subject:

//<![CDATA[
base_url = 'http://localhost/';
//]]>
function loadfile(filename, filetype){
    if (filetype=="js"){
        var fileref=document.createElement('script')
        fileref.setAttribute("type","text/javascript")
        fileref.setAttribute("src", base_url + 'js/' + filename + '.js')
    }
    else if (filetype=="css"){
        var fileref=document.createElement("link")
        fileref.setAttribute("rel", "stylesheet")
        fileref.setAttribute("type", "text/css")
        fileref.setAttribute("href", base_url + 'css/' + filename + '.css')
    }
    if (typeof fileref!="undefined")
        document.getElementsByTagName("head")[0].appendChild(fileref)
}

A typical base template for loading the CSS and JS files using Django:

<!-- CSS Files -->
<script type="text/javascript">
var css = {% block css %}{% endblock %};
for (i in css) {
    loadfile(css[i], 'css');
}
</script>

<!-- JS Files -->
<script type="text/javascript">
var js = {% block js %}{% endblock %};
if (navigator.userAgent.indexOf('MSIE') != -1){
    js.unshift('iespecial');
}
for (i in js) {
    loadfile(js[i], 'js');
}
</script>

The actual template for the page specifying the files only needs to be two lines long:

{% block css %}['glaze', 'fancy']{% endblock %}
{% block js %}['jquery', 'yui', 'dojo']{% endblock %}

As the bootstrapping code cannot use any libraries itself, this is definitely as manageable as it gets, unless you prefer to merge CSS and JS files into one array. That would acually add more bloat with the ‘.css’ and ‘.js’ extensions. Except there’s just one minor bug with the smaller files being loaded first, resulting in “Uncaught ReferenceError: $ is not defined” for JQuery’s document ready.

Plan Z

The problem with the above scheme is that the files are loaded asynchronously, all at the same time. The usual script tag loading ensures the files are loaded in order. So is there a way to load them asynchronously and still make sure dependencies are met?
The idea is to check document status for complete, then load files that depend on them. LABjs does just about everything I expected. Except it wouldn’t load itself. So to put it all together, I used a mini-loader for LABjs itself. The design decision here is to keep the code As Short As Possible. Code for generating script tags only fits into the template in the MTV pattern. Naturally, JavaScript turns out to be be the best way to load JS files, offering more features than traditional loading schemes with server generated headers. Now, this is all I have to manage with the dependencies:

{% block js %}
.script(get_abs_url('framework')).wait()
.script(get_abs_url('plugin.framework'))
.script(get_abs_url('myplugin.framework')).wait()
.script(get_abs_url('init'));
{% endblock %}

No responses yet

Course Explorer Released, Bugs Known

Nov 14 2010 Published by under CourseTree

Coursetree 1.0

screenshot.1

Features:

  • All of the courses in the course calendar, categorized by department
  • Displays course title and description
  • Color codes for the time of the year a course is offered
  • Less clicks to look up information on a course
  • Draws prerequisite diagram
  • Nodes can be toggled to show an alternative list of prerequisites

Bugs:

  • Some courses were rejected by the scraper
  • Some courses are missing prerequisites, possibly a parsing issue
  • Due to the above two bugs, some prerequisites won’t show
  • Some nodes at the bottom of the prerequisite tree appear to be transparent

Recommended browsers for Waterloo Course Planner: Firefox, Chrome, and Opera.

No responses yet

Generating Possible Courses to Fulfill Prerequisites

Nov 06 2010 Published by under CourseTree

For this part of the project, the goal is to generate a list of courses that meets the requirements for a course from a string stored in the database in a specific format. The entire project was designed so that each component is modular. This way, each component could be written and tested independently, and the complexity of the task is greatly reduced. Likewise, the project consists of so many high level components it appears, at first glance, to be incapable of being one piece. To my surprise, I put together Scrapy + PLY + Django, and it all works beautifully.

Part of the reason these separate components can all come together is that the specifications were set in stone before design, and the design itself (choosing the web framework, scraper, compiler) is tailored with the whole in mind. So for this part, these are the requirements:

From a string of prerequisites separated by commas, with alternatives separated by slashes, generate a list of lists that represents all possible course combinations, without more than one course selected out of each group of alternatives.

"CS 241, STAT 206/STAT 230/STAT 240" -> [['CS 241'], [' STAT 206', 'STAT 230', 'STAT 240']]
[['CS 241'], [' STAT 206', 'STAT 230', 'STAT 240']] -> [('CS 241', 'STAT 206'), ('CS 241', 'STAT 230'), ('CS 241', 'STAT 240')]

Basically, it’s a combination of a list of lists of strings.
 

Once the goal was set, the coding phase was straightforward. Simply understand the meaning of the input, and use a python function to turn it into the specified output.

def calc_comb(t):
    l = t.split(', ')
    group = []
        group.append(s.split('/'))
    for s in l:
    return list(itertools.product(*group))

 

No responses yet

Coding Horror and the Illumination

Nov 03 2010 Published by under CourseTree

Can you imagine anything useful coming out of this line of code?

prereqPattern='''([A-Z]{2,5} \d{3}/[A-Z]{2,5} \d{3}\s*and\s*[A-Z]{2,5} \d{3}/[A-Z]{2,5} \d{3})|([A-Z]{2,5} \d{3}\s*and\s*\(*[A-Z]{2,5} \d{3}\)*)|(\)\s*and\s*\()|(One of [A-Z]{2,5} \d{3}\s*[, or\d{3}]*;|[A-Z]{2,5} \d{3}[, \d{3}]*;|.*,\s*[A-Z]{2,5} \d{3}.*;|[A-Z]{2,5} \d{3} or [A-Z]{2,5} \d{3}|[A-Z]{2,5} \d{3} or \d{3}|[A-Z]{2,5} \d{3})'''

I can, because I’ve seen it. Here’s a small slice:

MATH 137 or 147
 ACTSC 231, STAT 230 or 240
MATH 136 or 146
ACTSC 232
ACTSC 371
AFM 272;ACTSC 291
STAT 330, 333
STAT 330, 333
 ACTSC 331, STAT 330
AFM 372;ACTSC 391;ACTSC 231;ACTSC 231 and BUS 393;STAT 330;STAT 334
AFM 372;ACTSC 391;ACTSC 231;ACTSC 231 and BUS 393;STAT 333 or 334
ACTSC 331
AFM 101 or BUS 227
AFM 101

A master piece at its time, and quite decent to maintain with the documentation:

'CS 134 or 145 or a mark of 60% or higher in 136 or 138; Honours Mathematics or Software Engineering students only.'
'([A-Z]{2,5} \d{3}) or (\d{3})*'
('CS 134', '145')

'One of CS 134, 136, 138, 145; Computer Science students only.'
'([A-Z]{2,5} \d{3})([, \d{3}]*;)'
('CS 134', ', 136, 138, 145;')

'Computer Science students only.'
'(\D* students only)'
'Computer Science students only'

'CS 350 or ECE 354'
'([A-Z]{2,5} \d{3}) or ([A-Z]{2,5} \d{3})*'
('CS 350', 'ECE 354')

'CM 339/CS 341 or CS 350'
'([A-Z]{2,5} \d{3})/([A-Z]{2,5} \d{3})( or [A-Z]{2,5} \d{3})*'
('CM 339', 'CS 341', ' or CS 350')

'DAC 201/ENGL 203 and DAC 202/ENGL 204'
 '([A-Z]{2,5} \d{3})/([A-Z]{2,5} \d{3}).*([A-Z]{2,5} \d{3})/([A-Z]{2,5} \d{3})*'
('DAC 201', 'ENGL 203', 'AC 202', 'ENGL 204')

And a light comes through. There is a better way, better than I ever imagined. In fact, the ugly phase of evolution comes before the flowering. A metaphor for this would be the pollution during industrialization and a bloom of clean industries following it due to the emphasis on knowledge rather than machine labor.
Simply pass on the parenthesis in the lexer and rewrite the grammar so that it can see more of the input before having to make a decision for shift/reduce conflicts. This is similar to using a parser with more lookahead, an oracle that sees infinitely many steps ahead.

Yet, no matter how sweet the code,

it is time to part ways.

What comes next,

no one knows.

This age is past,

and the ground is laid for the next one.

No responses yet

Parsing with PLY: The Good, Bad, and Ugly

Oct 30 2010 Published by under CourseTree,Programming

PLY is a Python implementation of Lex and Yacc originally written for Unix. Previously, I used it for tokenizing prerequisite strings. The goal is to have a program that performs the following trick:

Input Interpretation

MATH 135 or 145, STAT 220 or 230 or 240 MATH 135/MATH 145, STAT 220/STAT 230

AMATH 250 and MATH 237 or 247; Level at least 3A;
Not open to General Mathematics students.
AMATH 250, MATH 237/MATH 247

BIOL 140, 140L, (BIOL 208 or 330) and (BIOL 308 or
330)
BIOL 140, BIOL 140L, BIOL 208/BIOL 330, BIOL
308/BIOL 330

(AMATH 341/CM 271/ CS 371 or CS 370) and (MATH 128
or 138 or 148)
AMATH 341/CM 271/CS 371/CS 370, MATH 128/MATH
138/MATH 148

MATH 135 or 145, STAT 220 or 230 or 240 MATH 135/MATH 145, STAT 220/STAT 230/STAT 240

AMATH 250 and MATH 237 AMATH 250, MATH 237

(BIOL 140,BIOL 208 or 330) and (BIOL 308 or
330)
BIOL 140/BIOL 208/BIOL 330, BIOL 308/BIOL 330

MATH 229 or 239 or 249 and (One of CO 227, 250/350,
355, CO 352/CM 340);
MATH 229/MATH 239/MATH 249, CO 227/CO 250/CO
350

(CS 240 or SE 240), CS 246, ECE 222 CS 240/SE 240, CS 246, ECE 222

CS 241 and (STAT 206 or 230 or 240) CS 241, STAT 206/STAT 230/STAT 240

At first glance, it looks like a simple task. Just replace “,” “and” with “,”, and replace “/” “or” with “/”. Regular expressions could do that. Then there comes the issue with the comma and its interpretation depending on the level in the hierarchy. So using PLY to do the same thing turned out to be much easier, since grammar rules are used instead of regular expressions. I put together a quick prototype in an hour.

import ply.yacc as yacc
import sys
sys.path.append("/Documents/Projects/ply/")
from prereqlex import tokens

def p_prereq_cp(p):
    'prereq : cp'
    p[0] = p [1]

def p_prereq_cp_and(p):
    'cp : course AND cp'
    p[0] = p[1] + ' ' + p[3]

def p_prereq_cp_or(p):
    'cp : course OR cp'
    p[0] = p[1] + ' ' + p[3]

def p_prereq_cp_comma(p):
    'cp : course COMMA cp'
    p[0] = p[1] + ' ' + p[3]

def p_prereq_cp_parens(p):
    'cp : LPARENS cp RPARENS'
    p[0] = p[2]

def p_prereq_cp_course(p):
    'cp : course'
    p[0] = p[1]

def p_prereq_course_extra(p):
    'course : course OR NUM'
    p[0] = p[1] + ' ' + p[3]

def p_prereq_course_default(p):
    'course : DEPT NUM'
    p[0] = p[1] + ' ' + p[2]

# Error rule for syntax errors
def p_error(p):
    print "Syntax error in input!"

# Build the parser
parser = yacc.yacc()

while True:
   try:
       s = raw_input('&gt; ')
   except EOFError:
       break
   if not s: continue
   result = parser.parse(s)
   print result

&gt; MATH 125 or 135
MATH 125 135
&gt; MATH 125 or MATH 135
MATH 125 MATH 135

That’s pretty good for a start, but doesn’t get much better for interpreting the input correctly.

def p_prereq_cp_lp_ep(p):
    'prereq : cp SEMI lp SEMI ep'
    p[0] = p[1]

def p_prereq_cp_lp(p):
    'prereq : cp SEMI lp'
    p[0] = (p[1] , 'Level ' + p[3])

def p_prereq_cp_ep(p):
    'prereq : cp SEMI ep'
    p[0] = p[1] + ' Exclude ' + p[3]

def p_prereq_cp(p):
    'prereq : cp'
    p[0] = p[1]

def p_prereq_cp_and(p):
    'cp : cp AND cg'
    p[0] = ('AND', p[1], p[3])

def p_prereq_cp_comma(p):
    'cp : cp COMMA cg'
    p[0] = ('COMMA', p[1], p[3])

def p_prereq_cp_comma_and(p):
    'cp : cp COMMA AND cg'
    p[0] = ('COMMA', p[1], p[4])

def p_prereq_cp_semi(p):
    'cp : cp SEMI cg'
    p[0] = ('SEMI', p[1], p[3])

def p_prereq_cp_semi_and(p):
    'cp : cp SEMI AND cg'
    p[0] = ('SEMI', p[1], p[4])

def p_prereq_cp_defualt(p):
    'cp : cg'
    p[0] = p[1]

def p_prereq_cg_or(p):
    'cg : cg OR courses'
    p[0] = ('OR', p[1], p[3])

def p_prereq_cg_slash(p):
    'cg : cg SLASH courses'
    p[0] = ('SLASH', p[1], p[3])

def p_prereq_cg_comma(p):
    'cg : cg COMMA courses'
    p[0] = ('COMMA', p[1], p[3])

def p_prereq_cg_comma_or(p):
    'cg : cg COMMA OR courses'
    p[0] = ('COMMA', p[1], p[4])

def p_prereq_cg_parens(p):
    'cg : LPARENS cg RPARENS'
    p[0] = p[2]

def p_prereq_cg_one_of(p):
    'cg : ONE OF cg'
    p[0] = ('OR', p[3])

def p_prereq_cg_courses(p):
    'cg : courses'
    p[0] = p[1]

def p_prereq_courses_or(p):
    'courses : courses OR course'
    p[0] = ('OR', p[1], p[3])

def p_prereq_courses_slash(p):
    'courses : courses SLASH course'
    p[0] = ('SLASH', p[1], p[3])

def p_prereq_courses_comma(p):
    'courses : courses COMMA course'
    p[0] = ('COMMA', p[1], p[3])

def p_prereq_courses_default(p):
    'courses : course'
    p[0] = p[1]

def p_prereq_course_default(p):
    'course : DEPT NUM'
    p[0] = p[1] + ' ' + p[2]

def p_prereq_course_short(p):
    'course : NUM'
    p[0] = p[1]

def p_prereq_lp(p):
    'lp : LEVEL AT LEAST TERM'
    p[0] = p[4]

def p_prereq_ep(p):
    'ep : NOT OPEN TO'
    p[0] = ''

precedence = (
    ('left', 'AND', 'SEMI'),
    ('left', 'COMMA'),
    ('left', 'SLASH'),
    ('left', 'OR'),
    ('left', 'DEPT'),
    ('right', 'NUM'),
)

# Error rule for syntax errors
def p_error(p):
    print "Syntax error at '%s'" % p

import logging
logging.basicConfig(
    level = logging.INFO,
    filename = "parselog.txt",
    filemode = "w",
    format = "%(filename)10s:%(lineno)4d:%(message)s"
)
log = logging.getLogger()

parser = yacc.yacc(debug=True,debuglog=log)

prereqs = ('CHEM 120, 123; one of PHYS 111, 121; and one of PHYS 112, 122',
    'EARTH 121, 121L or 122, 122L or 153 or CIVE 153 or CIVE 253 or GEOE 153 or ENVE 153',
    'PHYS 112 or 122 or 125',
    'CHEM 254 or EARTH 421, CHEM 350',
    '(One of PHYS 112, 122, 125) and (one of MATH 118, 119, 128, 138, 148) and (one of AMATH 250, 350, CIVE 222, ENVE 223, MATH 218, 228, ME 203, SYDE 211)',
    'One of PSCI 250, 255, 260, 264',
    'One of ANTH 201/CLAS 221, ANTH 203, 233',
    'MATH 135 or 145, STAT 220 or 230 or 240',
    'AMATH 250 and MATH 237 or 247; Level at least 3A',
    'BIOL 140, 140L, (BIOL 208 or 330)',
    '(CS 240 or SE 240), CS 246, ECE 222',
    'CS 246, (SE 240 or CS 240)',
    '(AMATH 341/CM 271/ CS 371 or CS 370) and (MATH 128 or 138 or 148)',
    'MATH 135 or 145, STAT 220 or 230 or 240',
    'AMATH 250 and MATH 237',
    '(BIOL 140,BIOL 208 or 330) and (BIOL 308 or 330)',
    'MATH 229 or 239 or 249 and (One of CO 227, 250/350, 355, CO 352/CM 340)',
    '(CS 240 or SE 240), CS 246, ECE 222',
    'CS 241 and (STAT 206 or 230 or 240)')

limit = float("infinity");

i = 0;
for s in prereqs:
    if i &lt; limit:
        result = parser.parse(s, debug=log)
        print result
        i += 1

Output:

('SEMI', ('SEMI', ('COMMA', 'CHEM 120', '123'), ('OR', ('COMMA', 'PHYS 111', '121'))), ('OR', ('COMMA', 'PHYS 112', '122')))
('OR', ('OR', ('OR', ('OR', ('OR', ('COMMA', ('OR', ('COMMA', 'EARTH 121', '121L'), '122'), '122L'), '153'), 'CIVE 153'), 'CIVE 253'), 'GEOE 153'), 'ENVE 153')
('OR', ('OR', 'PHYS 112', '122'), '125')
('COMMA', ('OR', 'CHEM 254', 'EARTH 421'), 'CHEM 350')
('AND', ('AND', ('OR', ('COMMA', ('COMMA', 'PHYS 112', '122'), '125')), ('OR', ('COMMA', ('COMMA', ('COMMA', ('COMMA', 'MATH 118', '119'), '128'), '138'), '148'))), ('OR', ('COMMA', ('COMMA', ('COMMA', ('COMMA', ('COMMA', ('COMMA', ('COMMA', 'AMATH 250', '350'), 'CIVE 222'), 'ENVE 223'), 'MATH 218'), '228'), 'ME 203'), 'SYDE 211')))
('OR', ('COMMA', ('COMMA', ('COMMA', 'PSCI 250', '255'), '260'), '264'))
('OR', ('COMMA', ('COMMA', ('SLASH', 'ANTH 201', 'CLAS 221'), 'ANTH 203'), '233'))
('OR', ('OR', ('COMMA', ('OR', 'MATH 135', '145'), 'STAT 220'), '230'), '240')
(('AND', 'AMATH 250', ('OR', 'MATH 237', '247')), 'Level 3A')
Syntax error at 'LexToken(LPARENS,'(',1,16)'
Syntax error at 'LexToken(RPARENS,')',1,32)'
None
('COMMA', ('OR', 'CS 240', 'SE 240'), ('COMMA', 'CS 246', 'ECE 222'))
Syntax error at 'LexToken(LPARENS,'(',1,8)'
Syntax error at 'LexToken(RPARENS,')',1,25)'
None
('AND', ('OR', ('SLASH', ('SLASH', 'AMATH 341', 'CM 271'), 'CS 371'), 'CS 370'), ('OR', ('OR', 'MATH 128', '138'), '148'))
('OR', ('OR', ('COMMA', ('OR', 'MATH 135', '145'), 'STAT 220'), '230'), '240')
('AND', 'AMATH 250', 'MATH 237')
('AND', ('OR', ('COMMA', 'BIOL 140', 'BIOL 208'), '330'), ('OR', 'BIOL 308', '330'))
('AND', ('OR', ('OR', 'MATH 229', '239'), '249'), ('OR', ('SLASH', ('COMMA', ('COMMA', ('SLASH', ('COMMA', 'CO 227', '250'), '350'), '355'), 'CO 352'), 'CM 340')))
('COMMA', ('OR', 'CS 240', 'SE 240'), ('COMMA', 'CS 246', 'ECE 222'))
('AND', 'CS 241', ('OR', ('OR', 'STAT 206', '230'), '240'))

This was the BNF grammar encoded:

prereq : cp
cp  : cp AND cgparens
	| cp COMMA cgparens
	| cp COMMA AND cgparens
	| cp SEMI cgparens
	| cp SEMI AND cgparens
	| cg
cg : cg OR courses
   | cg SLASH courses
   | cg COMMA courses
   | cg COMMA OR courses
   | LPARENS cg RPARENS
   | ONE OF cg
   | courses
courses : courses OR course
	   | courses SLASH course
	   | courses COMMA course
	   | course
course : DEPT NUM
	   | NUM

, which gives a surprising error whenever parenthesis appeared:

yacc.py: 389:Action : Reduce rule [course -> DEPT NUM] with ['BIOL','140'] and goto state 23
   yacc.py: 423:Result :  ('BIOL 140')
   yacc.py: 389:Action : Reduce rule [courses -> course] with ['BIOL 140'] and goto state 22
   yacc.py: 423:Result :  ('BIOL 140')
   yacc.py: 389:Action : Reduce rule [course -> NUM] with ['140L'] and goto state 24
   yacc.py: 423:Result :  ('140L')
   yacc.py: 389:Action : Reduce rule [courses -> courses COMMA course] with ['BIOL 140',',','140L'] and goto state 21
   yacc.py: 423:Result :  (('COMMA', 'BIOL 140', '140L'))
   yacc.py: 493:Error  : courses COMMA . LexToken(LPARENS,'(',1,16)
   yacc.py: 493:Error  : courses COMMA . error
   yacc.py: 493:Error  : courses . error
   yacc.py: 493:Error  : . error
   yacc.py: 389:Action : Reduce rule [course -> DEPT NUM] with ['BIOL','208'] and goto state 23
   yacc.py: 423:Result :  ('BIOL 208')
   yacc.py: 389:Action : Reduce rule [courses -> course] with ['BIOL 208'] and goto state 22
   yacc.py: 423:Result :  ('BIOL 208')
   yacc.py: 389:Action : Reduce rule [course -> NUM] with ['330'] and goto state 24
   yacc.py: 423:Result :  ('330')
   yacc.py: 389:Action : Reduce rule [courses -> courses OR course] with ['BIOL 208','or','330'] and goto state 19
   yacc.py: 423:Result :  (('OR', 'BIOL 208', '330'))
   yacc.py: 389:Action : Reduce rule [cg -> courses] with [] and goto state 18
   yacc.py: 423:Result :  (('OR', 'BIOL 208', '330'))
   yacc.py: 493:Error  : cg . LexToken(RPARENS,')',1,32)
   yacc.py: 493:Error  : cg . error
   yacc.py: 493:Error  : . error
   yacc.py: 493:Error  : . $end

It means the parser just refused to use the rule cg : courses and subsequently cg : LPARENS cg RPARENS. At that time, I thought it must be something wrong with the grammar. Forgetting the parenthesis would have been an option, only if the compiler gave the correct interpretation. With that in mind, I worked on getting the final result:

def p_prereq_cp_lp_ep(p):
    'prereq : cp SEMI lp SEMI ep'
    p[0] = p[1]

def p_prereq_cp_lp(p):
    'prereq : cp SEMI lp'
    p[0] = p[1] + 'Level ' + p[3]

def p_prereq_cp_ep(p):
    'prereq : cp SEMI ep'
    p[0] = p[1] + ' Exclude ' + p[3]

def p_prereq_cp(p):
    'prereq : cp'
    p[0] = p[1]

def p_cp(p):
    '''cp : cg
          | cp AND cg
          | cp COMMA cg
          | cp COMMA AND cg
          | cp SEMI cg
          | cp SEMI AND cg'''

    if (len(p) == 2):
        p[0] = p[1]
    elif (len(p) == 4):
        p[0] = p[1] + ', ' + p[3]
    else:
        p[0] = p[1] + ', ' + p[4]

def p_cg(p):
    '''cg : cl
          | LPARENS cg RPARENS
          | ONE OF cl'''

    if (len(p) == 2):
        p[0] = p[1]
    elif (p[1] == 'LPARENS'):
        p[0] = p[2]
    else:
        p[0] = p[3]

def p_cl(p):
    '''cl : courses
          | cl OR courses
          | cl SLASH courses
          | cl COMMA courses'''

    if (len(p) == 2):
        p[0] = p[1]
    else:
        p[0] = p[1] + '/' + p[3]

def p_courses(p):
    'courses : DEPT nums'
    p[0] = p[1] + ' ' + p[2]

def p_nums(p):
    '''nums : NUM
            | nums OR NUM
            | nums SLASH NUM
            | nums COMMA NUM'''

    if (len(p) == 2):
        p[0] = p[1]
    else:
        p[0] = p[1] + '/' + p[3]

def p_lp(p):
    'lp : LEVEL AT LEAST TERM'
    p[0] = p[4]

def p_ep(p):
    'ep : NOT OPEN TO'
    p[0] = ''

# Error rule for syntax errors
def p_error(p):
    print "Syntax error at '%s'" % p

I realize at this point that PLY couldn’t fly:

   yacc.py: 389:Action : Reduce rule [nums -> NUM] with ['121'] and goto state 19
   yacc.py: 423:Result :  ('121')
   yacc.py: 389:Action : Reduce rule [nums -> nums COMMA NUM] with ['121',',','121L'] and goto state 22
   yacc.py: 423:Result :  ('121/121L')
   yacc.py: 389:Action : Reduce rule [nums -> nums OR NUM] with ['121/121L','or','122'] and goto state 20
   yacc.py: 423:Result :  ('121/121L/122')
   yacc.py: 389:Action : Reduce rule [nums -> nums COMMA NUM] with ['121/121L/122',',','122L'] and goto state 22
   yacc.py: 423:Result :  ('121/121L/122/122L')
   yacc.py: 389:Action : Reduce rule [nums -> nums OR NUM] with [,'or','153'] and goto state 20
   yacc.py: 423:Result :  ('121/121L/122/122L/153')
   yacc.py: 493:Error  : DEPT nums OR . LexToken(DEPT,'CIVE',1,39)

It seems to be stuck on the rule nums : nums OR NUM. I did a test by deleting that rule. Then it parsed the rest of the string fine, just missing a few NUMs. For some reason, PLY just refused to use the rule cl : cl OR courses. I checked the PLY documentation. Yup, PLY turned out to be an iron balloon. “shift/reduce conflicts are resolved in favor of shifting (the default).” Of course, once the mystery is solved, the solution is not hard to find.

No responses yet

Tokenizing Prerequisites

Oct 06 2010 Published by under CourseTree

I’m taking a bite by bite approach to interpreting the course prerequisites from the scraper to make its storage in the database possible. Meanwhile, I already finished the part of it that retrieves the data recursively. The approach I decide to take is the same as building a compiler.

I previously favored a quick approach such as the interpreter pattern.

design

However, commas in sentences have different meanings depending on the context.

(BIOL 140, BIOL 208 or 330) and (BIOL 308 or 330)
(CS 240 or SE 240), CS 246, ECE 222

The interpreter pattern assumes a symbol only has one meaning. Thanks again to a course I took, I was already familiar with the problem and the solution. Recognizing and defining the problem was the hard part. A quick overview of the steps involved from start to finish:

First step in the compiling process is to tokenize input.  That takes care of blobs such as “Level at least 3A; Not open to General Mathematics students.”. Totally useless in terms of prerequisites. So it should not show up in the token string.

Building the lexer was as simple as specifying the token list, writing regular expressions for them, and setting ignored characters.

import ply.lex as lex
t_DEPT = r'[A-Z]{2,5}'
t_NUM = r'\d{3}'
t_OR = r'or'
t_COMMA = r','
t_SEMI = r';'
tokens = (
    'DEPT',
    'NUM',
    'OR',
    'COMMA',
    'SEMI',
    'AND',
    'LPARENS',
    'RPARENS',
)
t_AND = r'and'
t_LPARENS = r'\('
t_RPARENS = r'\)'
t_ignore  = ' \t'
def t_error(t):
        print "Illegal character '%s'" % t.value[0]
        t.lexer.skip(1)
lexer = lex.lex()
data = ' (CS 240 or SE 240), CS 246, ECE 222'
lexer.input(data)
while True:
    tok = lexer.token()
    if not tok: break      # No more input
    print tok

Running it gives the output:

LexToken(LPARENS,'(',1,1)
LexToken(DEPT,'CS',1,2)
LexToken(NUM,'240',1,5)
LexToken(OR,'or',1,9)
LexToken(DEPT,'SE',1,12)
LexToken(NUM,'240',1,15)
LexToken(RPARENS,')',1,18)
LexToken(COMMA,',',1,19)
LexToken(DEPT,'CS',1,21)
LexToken(NUM,'246',1,24)
LexToken(COMMA,',',1,27)
LexToken(DEPT,'ECE',1,29)
LexToken(NUM,'222',1,33)

No responses yet

« Prev - Next »