Coursetree: the Epilogue and Return

Aug 23 2011

The sign that an adventure is over is always a return to the starting point after traveling to the other side of the world. In the early days, when I wrote the crawler for courses, I gathered the faculty links from a table with a small scraper. I no longer know the exact method, and it could have been done in numerous other ways. When I came back to the project after several months, I found only the links themselves. One reason I may not have recorded the method is that the links contained duplicates, such as the 100’s and 200’s referring to the same page.

At the end of this journey, I have many pages of documentation on the system. However, the system still lacked a way of handling new course pages. Now the system is complete, with only a few lines of code, which would have been impossible without the journey:

from urllib import urlopen
html = urlopen("http://ugradcalendar.uwaterloo.ca/page/Course-Descriptions-Index").read()
from BeautifulSoup import BeautifulSoup, SoupStrainer
import re, cgi
from urlparse import urlparse
linksToCourses = SoupStrainer('a', href=re.compile('courses.aspx'))
links = [tag['href'] for tag in BeautifulSoup(html, parseOnlyThese=linksToCourses)]
faculties = list(set([cgi.parse_qs(urlparse(link)[4])['Code'][0] for link in links]))
faculties
[u'CIVE',
u'ARBUS',
u'ECON',
u'INTEG',
u'JS',
u'REES',
u'WS',
u'DUTCH',
u'CHEM',
u'AVIA',
u'PSYCH',
u'SPD',
u'SPCOM',
u'CROAT',
u'CHE',
u'HRM',
u'ENBUS',
u'SCBUS',
u'PACS',
u'SYDE',
u'KIN',
u'LAT',
u'STAT',
u'INDEV',
u'SMF',
u'CMW',
u'FINE',
u'PORT',
u'GER',
u'KOREA',
u'SCI',
u'BUS',
u'SWREN',
u'HIST',
u'AMATH',
u'PD',
u'RUSS',
u'OPTOM',
u'AFM',
u'COOP',
u'ECE',
u'MSCI',
u'NATST',
u'GRK',
u'ME',
u'INTTS',
u'RS',
u'GERON',
u'ITALST',
u'HLTH',
u'JAPAN',
u'MATH',
u'PLAN',
u'FR',
u'PHIL',
u'ENGL',
u'ISS',
u'ITAL',
u'PDENG',
u'SOCWK',
u'REC',
u'ARTS',
u'MTHEL',
u'NE',
u'BIOL',
u'APPLS',
u'EARTH',
u'CLAS',
u'CO',
u'CM',
u'ACTSC',
u'POLSH',
u'DRAMA',
u'COMM',
u'CS',
u'SPAN',
u'SI',
u'PSCI',
u'CHINA',
u'WKRPT',
u'SE',
u'HUMSC',
u'ERS',
u'ARCH',
u'EASIA',
u'DAC',
u'PMATH',
u'LS',
u'GEOE',
u'GEOG',
u'PHYS',
u'PDPHRM',
u'IS',
u'SOC',
u'STV',
u'MUSIC',
u'ANTH',
u'ESL',
u'MTE',
u'ENVS',
u'INTST',
u'PHARM',
u'GENE',
u'ENVE']

This collects the faculty codes from the links in the table; passing them through a set removes the duplicates.


import MySQLdb
db = MySQLdb.connect(user='', db='', passwd='', host='')
cursor = db.cursor()
cursor.execute('SELECT faculty FROM faculties ORDER BY faculty')
db_faculties = [row[0] for row in cursor.fetchall()]
db.close()
set(faculties) - (set(faculties) & set(db_faculties))
set([u'BUS', u'COOP', u'PD', u'PDPHRM'])

By subtracting the intersection of the two lists from the faculties in the table, the set of new items is found (the result is the same as set(faculties) - set(db_faculties)). The new items may then be added to the database, which I leave to the reader.

Not only is this the first step when updating the database, it is a miniature model of the entire application. Thus it wraps everything up in a way that serves a purpose.

Javascript LZMA Decompression

Aug 13 2011

In modern browsers, gzip compression is a standard feature. The typical compression ratio for a plain text file is 30%, reducing the download time of web content by 70% and making it load roughly 3 times faster. In spite of the speedup, gzip is an old format whose DEFLATE algorithm is based on LZ77. Since then, newer algorithms have been invented, with LZMA being a common choice. On Linux, LZMA typically produces files about half the size of those produced by gzip. This tutorial will show you how to use an LZMA compressed file, produced by the standard lzma command on Unix machines, directly in a client-side web application. The rest of the post assumes you have the JavaScript libraries for LZMA and binary AJAX set up.

First, Make a Compressed File

echo "Hello, world." | lzma -3 > hello.lzma

Next, Read Binary Data

<script src="../src/jquery-1.4.4-binary-ajax.js"></script>
<script src="../src/jdataview.js"></script>
<script>
function unzip(data) {
    // Make a view on the binary data
    var view = new jDataView(data);

    // Copy the file into an array of byte values (0-255)
    var int_arr = [];
    while (view.tell() < view.length) {
        // getUint8 moves the view's pointer past the byte it reads
        int_arr.push(view.getUint8(view.tell()));
    }
    console.log(int_arr.length);
    console.log(int_arr);
}

// Download the file
$.get('hello.lzma', unzip, 'binary');
</script>

This is a pretty simple step, except that the while loop counter may be unintuitive. getUint8 increments the file pointer, though this wasn’t documented in the API specification. I spent an hour or so comparing the output in hex. One of the problems was that

5d 00 00 08 00 0d 00 00 00 00 00 00 00 00

is the same as

5d 00 00 08 00 0d ff ff ff ff ff ff ff ff

in little-endian. You can try it in the decompressor: just replace the bytes in the hello world lzma file made at compression level 3. However, I figured out the real problem as soon as I compared view.length and int_arr.length. One was twice the other! A factor of 2 always has significance in computing; in this case it meant I was reading every other byte. After correcting the while loop, I moved on to decoding the binary.
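To make the header bytes above easier to read, here is a minimal sketch of a header parser. It assumes the classic 13-byte .lzma layout: one properties byte, a 4-byte dictionary size, and an 8-byte uncompressed size, both little-endian, where a size made entirely of ff bytes means "unknown, read until the end marker". parseLzmaHeader is a hypothetical helper written for this illustration; it is not part of jDataView or the LZMA library.

```javascript
// Hypothetical helper (not from any library used in this post).
// Assumes the classic 13-byte .lzma header layout.
function parseLzmaHeader(bytes) {
    if (bytes.length < 13) {
        throw new Error('An .lzma header needs at least 13 bytes');
    }

    // Byte 0 packs three parameters as (pb * 5 + lp) * 9 + lc.
    var props = bytes[0];
    var lc = props % 9;
    var lp = Math.floor(props / 9) % 5;
    var pb = Math.floor(props / 45);

    // Bytes 1-4: dictionary size, least significant byte first.
    var dictSize = 0;
    for (var i = 4; i >= 1; i--) {
        dictSize = dictSize * 256 + bytes[i];
    }

    // Bytes 5-12: uncompressed size, least significant byte first.
    // Eight ff bytes mean the size is not stored in the header.
    // (Sizes above 2^53 would lose precision in a JavaScript number.)
    var sizeKnown = false;
    var size = 0;
    for (var j = 12; j >= 5; j--) {
        if (bytes[j] !== 0xff) {
            sizeKnown = true;
        }
        size = size * 256 + bytes[j];
    }

    return {
        lc: lc,
        lp: lp,
        pb: pb,
        dictSize: dictSize,
        uncompressedSize: sizeKnown ? size : null
    };
}

var withSize = parseLzmaHeader(
    [0x5d, 0x00, 0x00, 0x08, 0x00, 0x0d, 0, 0, 0, 0, 0, 0, 0, 0]);
// withSize.dictSize is 524288 and withSize.uncompressedSize is 13

var withoutSize = parseLzmaHeader(
    [0x5d, 0x00, 0x00, 0x08, 0x00, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff]);
// withoutSize.uncompressedSize is null (size unknown)
```

Under that assumption, the two byte sequences differ only in whether the header states the size up front, which may be why a decompressor accepts either one.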

Third, Enjoy the Decoding

Yes, this is a rather boring thing to do while waiting, but do enjoy it.

    lzma.decompress(int_arr, function(result) {
        $('body').append($('<textarea></textarea>').val(result));
    })

Benefits

Using LZMA compression rather than gzip, I was able to reduce a gzipped file to 2/3 of its size, reducing the download time by 33%. The LZMA decompression code could be improved by storing results in an array and joining them at the end, rather than appending to a string. This method is not recommended unless you have large files, since the libraries themselves take up about 50 KB after gzip. Furthermore, it is unsuitable for downloads where files are sent directly to the user without being used by the application, since the user would need their own decompression utilities.
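The array-and-join idea can be sketched as follows. This is not the LZMA library's code, only an illustration of the technique; bytesToString and the chunk size are hypothetical, and the sketch assumes every byte is a single-byte character code.

```javascript
// Hypothetical helper: turn decoded byte values into a string by
// collecting chunks in an array and joining once at the end,
// instead of growing one string a character at a time.
function bytesToString(bytes) {
    var CHUNK = 8192; // assumed chunk size, small enough for apply()
    var chunks = [];
    for (var i = 0; i < bytes.length; i += CHUNK) {
        var slice = bytes.slice(i, i + CHUNK);
        chunks.push(String.fromCharCode.apply(null, slice));
    }
    return chunks.join('');
}

var text = bytesToString([72, 101, 108, 108, 111]);
// text is 'Hello'
```

Whether this is actually faster than string appends depends on the browser's JavaScript engine, so it is worth measuring before adopting it.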

Functional Javascript Programming using Underscore

Aug 10 2011

The Scenario

Suppose a JavaScript function is needed to parse text and execute another function for certain lines of it. Each line of the text ends with a line break. If a line contains the keyword ‘get’, the word following it needs to be checked against the list of files on the server and, if it is there, fetched with a GET request. So ‘get this’ would get the file ‘this’ from the server.

Object Oriented Approach

First, I tried doing this using the traditional way, with array indexes of letters.

if(input.value.split(' ').indexOf('get')==-1) return;
var lines = input.value.split('\n');
...

I stopped right there, because the next line was supposed to be a for-each loop. That was reason enough to use a functional library called underscore.js.

Mixed Approach

Not my favorite one, as it makes the painting look like a collage, and I think it made an error in the code harder to spot. See if you can spot it:

_(input.value.split(' ')).chain()
    .select(function (line) {
        var words = line.split(' ');
        return words.length > 1 && words[0] == 'get' && _.include(files, words[1]);
    })
    .each(function (line) {
        executeGet(line);
    });

Aha! The space in the first split should actually be a newline ('\n').

Functional Approach

This time, I had to modify the code to use a single word from the word array, the last one. The main line that violated the functional approach was

var words = line.split(' ');

along with the array accesses, which should be replaced with first and rest. It turns out there is a purely functional way:

_(input.value.split('\n')).chain()
    .map(function (line) {
        return line.replace(/get\s*/, '');
    })
    .select(function (module) {
        return module in files;
    })
    .each(function (module) {
        $.get(files[module]);
    })

The custom executeGet function has been replaced by jQuery’s $.get, and files is now an object mapping each module name to its URL rather than a list.
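For comparison, the same pipeline can be sketched with the built-in array methods instead of Underscore. This is an illustration under assumptions, not the code from the project: modulesToGet and the sample files object are hypothetical, and the function returns the URLs rather than calling $.get so that it can run without jQuery. Unlike the version above, it only accepts lines that actually start with ‘get’.

```javascript
// Hypothetical helper: returns the URLs that would be requested.
// `files` is assumed to map module names to URLs.
function modulesToGet(text, files) {
    return text.split('\n')
        // Keep the word after a leading 'get', or null if there is none
        .map(function (line) {
            return line.match(/^get\s+(\S+)/);
        })
        // Drop lines without 'get' and modules the server does not have
        .filter(function (match) {
            return match !== null &&
                Object.prototype.hasOwnProperty.call(files, match[1]);
        })
        // Look up the URL for each remaining module
        .map(function (match) {
            return files[match[1]];
        });
}

var sampleFiles = { 'this': '/modules/this.js', 'that': '/modules/that.js' };
var urls = modulesToGet('get this\nhello\nget missing\nthat', sampleFiles);
// urls is ['/modules/this.js']
```

Each URL in the result could then be passed to $.get in a final forEach.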

Conclusions

Which approach is better? With the OO approach, a local index variable would need to be created for the for loop, along with other locals, which adds irrelevant lines to the code. On the other hand, each line makes sense on its own. Functional code, however, cannot be understood except as a whole, because there are no variables. Statelessness has its own beauty in programming. It’s similar to running a quantum computer: not caring how bits disappear or time warps, but only about the result of the computation.

Opera 11.5 Minimalist Icons

Jul 10 2011

Minimalist design enables users to accomplish their tasks faster. The new version of Opera gets out of my way with non-distracting icons.

In Opera 11.5, the status icons are all the same color, allowing attention to be drawn to other areas of the browsing experience. Making one icon red and another green invokes the traffic light analogy, yet these areas and the colors used as signals don’t require any user action. So in this version, the Opera Unite red, which is Opera’s favorite color for branding, was dropped in favor of light blue for all icons.

Stoicism and Confucianism: the Thread of Society

Jul 09 2011

The Meditations of Marcus Aurelius is the indisputable classic of Stoicism. Similarly, the I Ching has a comparable history in Confucianism, with the Ten Wings being modified by each generation of scholars. The I Ching begins with two symbols, representing Heaven and Earth. Understanding this is the key to the philosophy. The Will of Heaven is done on Earth. One Will, One Universal Cause, and One Purpose that unites all. The Meditations expounds a similar philosophy: reason governs society, which rules men.

Now the good for the reasonable animal is society; for that we are made for society has been shown above. Is it not plain that the inferior exist for the sake of the superior? But the things which have life are superior to those which have not life, and of those which have life the superior are those which have reason.

Heaven is represented as reason and Earth as society; thus humans are to follow society. When Heaven and Earth come into being, human life begins. The third symbol in the I Ching is appropriately named Beginning. From these three symbols, the ten thousand things follow.

Changes

The I Ching has another name, the Book of Changes. The symbols alternate, with lines changing between yin and yang, representing the interplay of energies. The entire book represents a sequence of changes with each symbol. Within each symbol, the lines show how the situation develops. Thus, the I Ching is a suitable simulation, or conceptual model, of the Universe, as Marcus writes,

Now the universe is preserved, as by the changes of the elements so by the changes of things compounded of the elements.

The Role of Philosophy

Philosophy is not logic. It is a metaphysic of quality. The I Ching is not a book, because it encompasses all things in existence and all that have been or will ever be. Indeed, if one truly knows the essence of the I Ching, all things past and future are not beyond his grasp.

Without going outside his door, one understands (all that takes place) under the sky; without looking out from his window, one sees the Tao of Heaven. The farther that one goes out (from himself), the less he knows.

The sage knows everything, yet perceives nothing. Perception belongs to the realm of time, but Heaven lasts forever. In the realm of time there is philosophy, for without it society has no way to speak of Heaven. There is no justification for the existence of humans other than philosophy, which is why philosophers have always asked for the meaning of life,

Of human life the time is a point, and the substance is in a flux, and the perception dull . . . What then is that which is able to conduct a man? One thing and only one, philosophy. . .

The Nature of Things

The I Ching, as an oracle, reveals the true nature of the situation being asked about. Each symbol, composed of yin and yang lines, shows how the situation develops. When taken together, they form a picture showing the nature of the situation.

Marcus asks the same questions of himself,

This thing, what is it in itself, in its own constitution? What is its substance and material? And what its causal nature (or form)? And what is it doing in the world? And how long does it subsist?

The Mind and the Machine: the Book

Jul 07 2011

A book has been written by the title of this blog, The Mind and the Machine: What It Means to Be Human and Why It Matters. This topic becomes more and more relevant as computers take over human tasks. Each human is partly machine, partly something else, and it is that part that distinguishes humans from the machines. This book explores each aspect of the human. Let the machines take care of themselves; they neither perceive nor feel. The rest of life will be more meaningful.

Coursetree 2.0: An Intelligent Backend Coming Soon

Jun 24 2011

The goal of coursetree 2.0 is to leverage the current cloud infrastructure to deliver semantic applications that help users find the information they are looking for.

Features currently planned:

  1. Course search that understands what the user wants
  2. Filtering of irrelevant links
  3. Pattern based degree data mining

Draft implementation strategy:

  1. Let Google search index Wikipedia and video links
  2. Bayesian classifier will be used to categorize link content into subjects
  3. Template induction and template scraping

Features under consideration:

  1. Adaptable prerequisite semantic analysis
  2. Fully automated template learning and template extraction
  3. Relevant course links/suggested courses

Tentative ideas:

  1. Genetic algorithm for grammar rule generation with fitness score assigned according to the total number of parse errors
  2. Use hashing algorithms to detect similarity in sections of a page, feed similar sections using wrapper induction to generate template
  3. Build map of courses using anti-requisites and display nearest neighbors

Unsuccessful incubation features:

  1. Using genetic algorithms to generate templates for wrapper induction
  2. Switching to parse trees to extract noun phrases for Wikipedia link candidates
  3. YQL for video link scraping

Lessons learned and salvaged:

  1. Don’t use genetic programming methods where scores cannot be assigned to each individual “program”, as many of the templates were simply fails with zero scores
  2. Although in some cases successful (with a comma separated list of noun phrases), in other cases single words were marked as noun phrases in the parse tree instead of a more desirable longer phrase
  3. Due to frequent changes in video sites, nested JavaScript callbacks with closures to glue previews to links made the code a target to be recycled

Do you remember grade 7?

May 27 2011

I still remember grade 7, better than grade 1 or grade 12. Grade 1 is somewhere out on the ocean, on a foggy island. Grade 12 seems to be covered by a layer of snow. All I see is white, among the trees and bushes. I can’t name ten significant events there, compared to fifty from grade 7. Let’s not forget, I loved school (in grade 7)! Now grade 12 is a different issue. After learning much math and a lot of science & societal issues, grade 12 just fit into the fabric of our modern society. It’s just one step in the production. But I still love the idea of grade 7, like a grassy knoll frozen in crystal clear ice. What is so memorable and what do I remember about it?

  1. Missing school for about a month in Canada. Cool!
  2. Getting brand new chairs during Christmas (school chairs were replaced)
  3. The math teacher taught the wrong lesson at the start of September which showed up as the last lesson
  4. Humans have evolved from monkeys (specifically referencing Geography teacher, who had more body hair)
  5. Geography teacher plays catch against the wall
  6. Someone wore a loose shirt while picking up a book
  7. Spelling Islam as “I slam” during spelling test
  8. Not to boast about knowledge that other people can’t understand after discussing biorhythms in English class
  9. Math teacher mentions the Kennedys
  10. Math teacher talks about spanking in schools
  11. Kelley showed me a drawing which he thought was funny
  12. Having a starred conversation sitting in front of two people on the bus
  13. The road beside the bus station getting a ditch
  14. Walking home across town after missing the bus
  15. Getting whiplashed by long hair standing behind the lines of someone turning
  16. Oh yuck! Found a piece of black paper in the corn in the cafe
  17. Making an airplane that flew straight and never sank (paper airplane)
  18. Summer nights playing games on the grass
  19. The hat as it flew off while I was running
  20. Watching comedy with a man named Josh, who had the same name as my other friend Josh, who had the same name as my old best friend Josh
  21. Maybe Joshua means “one Jesus”, as I thought, reading the Bible
  22. Reading an encyclopedic book of short stories for children
  23. At the end of the year, various animals as a collection of books each day
  24. Exhaustion, after not sleeping well and taking a test during the summer
  25. Clarity, reading a poetry book of experiences on the sea
  26. Spending some days in spring break walking mazes in NeoPets
  27. Fall break, reading a book on vocabulary
  28. The worst flu of my life!
  29. A commercial tower appeared in SimCity in the first year
  30. Winning an Easter bunny machine from a community event
  31. Dissecting cow hearts with the surgeon of the class
  32. Don’t tell the truth, Ben will be disappointed by the literal translation of his name to Chinese
  33. Eating seaweed, the joke being other people mistaking it for grass
  34. Teacher says I should give animal crackers
  35. Three interpretations of “Favorite Bird” (Iraq): duck (bombardment), turkey (nearby country), chicken (playing chicken)
  36. Guy named Thomas dreams about harvesting resources on other planets
  37. Visiting the Secret Cove with Kelley
  38. Kelley’s fish tank
  39. Playing an alien shooter game on the cellphone on the way to Niagara Falls
  40. Stephen Hawking’s book on the tape on a cross continent journey
  41. Tree in front of the house being split by lightning
  42. Indian gives a penny for Halloween trick o’ treat
  43. The math teacher, who was also a coach, talks to Richie about his recent performance in class
  44. I still remember trying to answer Frank’s question in English class
  45. The meetings with the advisor to decide whether or not, “to be, or not to be”, and I decided to switch
  46. Getting skin on my feet scratched off after walking a distance in the shoes
  47. Feeling despair and hopelessness one night, Fall
  48. The voice, a military commander through the morning routine
  49. Shower music, to avoid the monsters
  50. Drinking tea before the afternoon jog
  51. Playing the piano after dinner at Mrs. Ting’s house, the start of lessons
  52. Oversleeping by the breezy window, arrived at class late
  53. See You at the Pole (911 event), which I never understood

Maybe there is an infinite pool of memory in which I can remember every minute detail. Just how much digital memory is it? A dot, a speck, in the world of Mona Lisa Overdrive.

Game Over: Sabayon?

May 15 2011

Time to update Sabayon, about a week after installing the wireless driver over a wired connection. Following the routine: entering several commands and letting them finish in a couple of hours. It’s been about 8 months since I last updated, mostly because of the lack of a wired connection. I expected everything to go smoothly, except it got stuck at

equo install entropy sulfur equo  --relaxed

which means the package manager is broken. As the console suggested, “you’re in deep shit”. Normally, on any Linux system, if the package manager is broken, the system cannot be repaired. If it happens in the middle of an upgrade, the half-upgraded libraries break the system. By circular cause and effect, a broken package manager is stuck in a loop.

However, Sabayon is redundant (hint). Entropy does the same thing as Portage. My plan at this point was to run

emerge entropy sulfur equo

and continue with the upgrade…

JavaScript URL Parsing

Apr 24 2011

Many web frameworks have URLs in the format of domain/category/page, but what if there’s a need for that outside of the framework, like in client-side scripting? Two lines of code are enough for getting those URL parts:

var url = /(\w+):\/\/([\w.]+(?:\:8000)?)\/(\S*)\/(\S*)\//;
var result = "http://127.0.0.1:8000/category/page/".match(url);

One useful feature of JavaScript regular expressions is that (?:…) is a non-capturing group: it groups part of the pattern without adding an entry to the result array.

If you wanted to map the result to different actions:

if (result != null) {
    switch(result[4]){ //page
        case 'page1':
            //code block1
            break;
    }
}
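As a variation, the match can be wrapped in a small function that names the captured groups. This is a sketch, not the code from the original project: parseUrl is a hypothetical helper, and its pattern accepts any numeric port rather than only 8000 and requires the URL to end right after the page segment.

```javascript
// Hypothetical helper: returns null when the URL does not have the
// scheme://host[:port]/category/page/ shape.
function parseUrl(url) {
    // (?::\d+)? is a non-capturing group for an optional port, so the
    // captured groups are: scheme, host (with port), category, page.
    var pattern = /^(\w+):\/\/([\w.]+(?::\d+)?)\/([^\/\s]+)\/([^\/\s]+)\/$/;
    var match = url.match(pattern);
    if (match === null) {
        return null;
    }
    return {
        scheme: match[1],
        host: match[2],
        category: match[3],
        page: match[4]
    };
}

var parsed = parseUrl('http://127.0.0.1:8000/category/page/');
// parsed.host is '127.0.0.1:8000', parsed.category is 'category',
// parsed.page is 'page'
```

Returning an object with named fields avoids having to remember that the page lives at index 4 of the result array.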
