Archive for August, 2011

Several Thousand Visitors Second Day Launch

Aug 30 2011 Published by under Python Fiddle,Singularitarian

A recent site I launched received a lot of attention. It may have been the only way to get the momentum going, since Google wouldn’t index a site with no content.

Fortunately, I spent a week before launch optimizing the page loading, serving static files from Amazon, and fixing usability bugs. So the result is a very smooth launch, even when serving many visitors per second. Some users who experienced slow loading issues may have been waiting for the browser to download a 1.3 or 2.0 MB file, which could have caused a traffic jam on a static file server. The technique used here was to serve files that are already compressed with lzma and gzip, respectively. Due to htaccess configuration not being available on Amazon, it was decided to serve these from another server.

The most surprising effect was that Google seemed to have picked up the link as soon as it appeared, along with other sites that mirror content.

No responses yet

Coursetree: the Epilogue and Return

Aug 23 2011 Published by under Programming

The sign the adventure is over is always a return to the starting point after traveling to the other side of the world. In the early days when I wrote the crawler for courses, I gathered faculty links from a table by writing a small scraper for the links. The exact method is not clear, but it could have also been done in numerous other ways. Later, I found only those links when I came back to the project after several months. One reason I may not have recorded the method may be that there were duplicates in them, such as 100’s and 200’s referring to the same page.

At the end of this journey, I have many pages of documentation on the system. However, the system still lacked a way of handling new course pages. Now the system is complete, with only a few lines of code, which would have been impossible without the journey:

from urllib import urlopen
html = urlopen("http://ugradcalendar.uwaterloo.ca/page/Course-Descriptions-Index").read()
from BeautifulSoup import BeautifulSoup, SoupStrainer
import re, cgi
from urlparse import urlparse
linksToCourses = SoupStrainer('a', href=re.compile('courses.aspx'))
links = [tag['href'] for tag in BeautifulSoup(html, parseOnlyThese=linksToCourses)]
faculties = list(set([cgi.parse_qs(urlparse(link)[4])['Code'][0] for link in links]))
faculties
[u'CIVE',
u'ARBUS',
u'ECON',
u'INTEG',
u'JS',
u'REES',
u'WS',
u'DUTCH',
u'CHEM',
u'AVIA',
u'PSYCH',
u'SPD',
u'SPCOM',
u'CROAT',
u'CHE',
u'HRM',
u'ENBUS',
u'SCBUS',
u'PACS',
u'SYDE',
u'KIN',
u'LAT',
u'STAT',
u'INDEV',
u'SMF',
u'CMW',
u'FINE',
u'PORT',
u'GER',
u'KOREA',
u'SCI',
u'BUS',
u'SWREN',
u'HIST',
u'AMATH',
u'PD',
u'RUSS',
u'OPTOM',
u'AFM',
u'COOP',
u'ECE',
u'MSCI',
u'NATST',
u'GRK',
u'ME',
u'INTTS',
u'RS',
u'GERON',
u'ITALST',
u'HLTH',
u'JAPAN',
u'MATH',
u'PLAN',
u'FR',
u'PHIL',
u'ENGL',
u'ISS',
u'ITAL',
u'PDENG',
u'SOCWK',
u'REC',
u'ARTS',
u'MTHEL',
u'NE',
u'BIOL',
u'APPLS',
u'EARTH',
u'CLAS',
u'CO',
u'CM',
u'ACTSC',
u'POLSH',
u'DRAMA',
u'COMM',
u'CS',
u'SPAN',
u'SI',
u'PSCI',
u'CHINA',
u'WKRPT',
u'SE',
u'HUMSC',
u'ERS',
u'ARCH',
u'EASIA',
u'DAC',
u'PMATH',
u'LS',
u'GEOE',
u'GEOG',
u'PHYS',
u'PDPHRM',
u'IS',
u'SOC',
u'STV',
u'MUSIC',
u'ANTH',
u'ESL',
u'MTE',
u'ENVS',
u'INTST',
u'PHARM',
u'GENE',
u'ENVE']

This gets a list of faculties from the table by returning a set.


import MySQLdb
db = MySQLdb.connect(user='', db='', passwd='', host='')
cursor = db.cursor()
cursor.execute('SELECT faculty FROM faculties ORDER BY faculty')
db_faculties = [row[0] for row in cursor.fetchall()]
db.close()
set(faculties) - (set(faculties) & set(db_faculties))
set([u'BUS', u'COOP', u'PD', u'PDPHRM'])

By subtracting the intersection of faculties in the database from the faculties in the table, the set of new items are found. The new items may be added to the database. I leave it to the reader.

Not only is this the first step when updating the database, it is a miniature model of the entire application. Thus it wraps everything up in a way that serves a purpose.

No responses yet

Javascript LZMA Decompression

Aug 13 2011 Published by under Linux,Programming

In modern browsers, g-zip compression is a standard feature. The typical compression ratio for a plain text file is 30%, reducing the download time of web content by 70% and making it load 2-3 times faster. In spite of the speed up, g-zip is an old algorithm based on LZ77. Since then, newer algorithms have been invented, with LZMA being the standard. On Linux, LZMA typically produces files half the size compared to g-zip. This tutorial will show you how to use an LZMA compressed file produced by the standard lzma command on Unix machines directly in a client side web application. The rest of the post assumes you have the JavaScript libraries for LZMA and binary AJAX set up.

First, Make a Compressed File

echo "Hello, world." | lzma -3 > hello.lzma

Next,  Read Binary Data

<script src="../src/jquery-1.4.4-binary-ajax.js"></script>
<script src="../src/jdataview.js"></script>
<script>
function unzip(data) {
    // Make a view on the data
    var view = new jDataView(data);

    var int_arr = new Array;

    while (view.tell() < view.length) {

       int_arr.push(view.getUint8(view.tell()));

   }
   console.log(int_arr.length);
   console.log(int_arr);

}

// Download the file
$.get('hello.lzma', unzip, 'binary');
</script>

This is a pretty simple step, except the while loop counter may be unintuitive. getUint8 increments the file pointer, though it wasn’t documented in the API specification. I spent an hour or so comparing the output in hex. One of the problem was that

5d 00 00 08 00 0d 00 00 00 00 00 00 00 00

is the same as

5d 00 00 08 00 0d ff ff ff ff ff ff ff ff

in little Endian. You can try it in the decompressor, just replace the bytes in the hello world lzma on compression level 3. However, I figured out the problem as soon as I compared view.length and int_arr.length. They were multiples of 2! That always has significance in computing, in this case it meant I was reading every other byte. After correcting the while loop, I moved onto decoding the binary.

Third, Enjoy the Decoding

Yes, this is a rather boring thing to do while waiting, but do enjoy it.

    lzma.decompress(int_arr, function(result) {
        $('body').append($('<textarea></textarea>').val(result));
    })

Benefits

Using LZMA compression rather than g-zip, I was able to reduce a g-zipped file to 2/3 of its size, reducing the download time by 33%. The LZMA decompression algorithm could be improved to use an array to store results, joining them at the end, rather than appending to a string. It is not recommended to use this method unless you have large files. The libraries themselves take up about 50kb with g-zip. Furthermore, it is unsuitable for downloads where files are sent directly to the user, without being used by the application, since the user would have the decompression utilities.

One response so far

Functional Javascript Programming using Underscore

Aug 10 2011 Published by under Programming

The Scenario

Suppose a javascript function is needed to parse text and execute another function for certain lines of text. Each line of the text is ended with a line break. If there is a keyword ‘get’, then the word following it needs to be checked if it is in the list of files on the server, and get it, using a GET request. So ‘get this’ would get the file ‘this’ from the server.

Object Oriented Approach

First, I tried doing this using the traditional way, with array indexes of letters.

if(input.value.split(' ').indexOf('get')==-1) return;
var lines = input.value.split('\n');
...

I stopped right there the next line is supposed to be a for each loop. That satisfied the requirements for using a functional library called underscore.js.

Mixed Approach

Not my favorite one, as it makes the painting look like a collage, and I think it made an error in the code harder to spot. See if you can spot it:

_(input.value.split(' ')).chain()
    .select(function (line) {
        var words = line.split(' ');
        return words.length > 1 && words[0] == 'get' && _.include(files, words[1]);
    })
    .each(function (line) {
        executeGet(line);
    });

Aha! The space should actually be newline.

Functional Approach
This time, I had to modify the code to use a single word from the word array, the last one.
The main line that violated the functional approach was

var words = line.split(' ');

along with the array accesses, which should be replaced with first and rest. It turns out there is a purely functional way:

_(input.value.split('\n')).chain()
    .map(function (line) {
        return line.replace(/get\s*/, '');
    })
    .select(function (module) {
        return module in files;
    })
    .each(function (module) {
        $.get(files[module]);
    })

The custom executeGet function has been replaced by jQuery’s $.get.

Conclusions
Which approach is better? Using the OO approach, a local index variable would be needed to be created for the for loop, along with other locals, which adds irrelevant lines to the code. On the other hand, each line on its own makes sense. However, functional code cannot be understood except as a whole, because there are no variables. Statelessness has its own beauty in programming. It’s similar to running a quantum computer, not caring how bits disappear or time warp, but only for the result of the computation.

No responses yet