Tuesday, August 24, 2010

Who's who in Upstream Caching

Well, once again there is too much other stuff to work on, and my last blog project didn't turn out quite the way I hoped, so there hasn't been a recent post. Look for a few posts about my latest project in the coming weeks, but for now I'll be putting up some shorter or less technical ones about various subjects I'd like to brain dump. Today, that subject is upstream caching: the hints in HTTP headers that developers can give to the proxies and content delivery systems sitting between client and server.

Given a particular website or web service, the original developer generally has the best idea of how often different sorts of information are updated on the site. The content might change in a matter of minutes, days, or maybe once in a blue moon. At any rate, it can be pretty costly in bandwidth if the server constantly has to send the client information the client has already seen, so at any point along the path between server and client, whether at a proxy or a router, some caching may be done to save bandwidth. But for a website serving private information, the possibility of a shared cache presents a clear risk to the privacy and security of the service. To handle this, services can add directives to the HTTP headers of a response that say how the response should be cached.
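
For example (the values here are illustrative), a response that any shared cache may keep for ten minutes could carry a header like Cache-Control: public, max-age=600, while a page with per-user data would send Cache-Control: private, and Cache-Control: no-store forbids storing the response at all.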

I was curious about these particular headers and how often they are used in the wild. Thanks to Alexa, I was able to download a CSV of the top 1 million websites accessed by their users. As a side note, it is really awesome that Alexa makes this data available in a developer-friendly format; way to go, guys! At any rate, it was time to test my theory that websites operating at greater scale (more popular) would be more likely to have customized their upstream caching, particularly by including values for a stale timeout. To do this, I created a simple script to return the Cache-Control header from a request:

import urllib2

def check_cache(url):
    req = urllib2.Request(url)
    # Spoof a real browser's user agent; some sites block Python's default one
    req.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-US) AppleWebKit/533.4 (KHTML, like Gecko) Chrome/5.0.375.126 Safari/533.4')
    opener = urllib2.build_opener()
    try:
        resp = opener.open(req)
        if 'cache-control' in resp.info().dict:
            # Return the individual directives, e.g. ['private', 'max-age=3600']
            return resp.info().dict['cache-control'].replace(" ", "").split(",")
    except Exception:
        # Unreachable site (or no public homepage): assume nothing
        return []
    return ['none']

The reason for the spoofed user agent header is that some popular websites block Python's default user agent. Some websites might also not have a public-facing homepage, so for those we won't assume anything and just return an empty list. For everything else that responds without a Cache-Control header, we'll return a list containing 'none'.

Well, we have our list of top websites from Alexa and a method to get the individual Cache-Control keys for a website. I broke up the findings by taking the websites 50 at a time and simply adding together their key counts. Admittedly there are a few caveats to this method; one is that websites operated by a single company under different country domains show up as separate sites, even though they likely share the same caching policy.

To run the script, I used a run method that gathers results for each group of 50 websites:
from collections import defaultdict

CHUNK_SIZE = 50
CHUNKS = 20

def get_next_chunk(f):
    cache_freq = defaultdict(int)
    for i in range(CHUNK_SIZE):
        # Each line of the CSV looks like "rank,domain"
        s = f.readline()
        url = s.split(",")[1].strip()
        print "Reading:", url
        for key in check_cache("http://" + url):
            # Count only the directive name, ignoring any "=value" part
            cache_freq[key.split("=")[0]] += 1
    return cache_freq


def run(skip=0):
    f = open('/Users/Calvin/Downloads/top-1m.csv', 'r')
    for i in range(skip):
        f.readline()
    results = []
    for i in range(CHUNKS):
        freq = get_next_chunk(f)
        results.append(freq)
    return results

Add in a nice simple plotting script and we're ready to see the results:

import matplotlib.pyplot as plt
import numpy, random

def plot(results):
    ind = numpy.arange(CHUNKS)
    width = 0.1
    fig = plt.figure()
    p = fig.add_subplot(111)

    # Keep only the keys that appear in every chunk so each line spans the graph
    key_map = lambda x: x.keys()
    key_reduce = lambda x, y: set(x) & set(y)
    unique_keys = reduce(key_reduce, map(key_map, results))

    # Give each key's line a random color
    colors = dict([(key, (random.random(),
                          random.random(),
                          random.random())) for key in unique_keys])
    legend = dict()

    for key in unique_keys:
        key_counts = [result[key] for result in results]
        legend[key] = p.plot(key_counts, color=colors[key], lw=4.0)

    p.legend(tuple([legend[key][0] for key in unique_keys]),
             tuple([key for key in unique_keys]))
    p.set_title('Cache-Control Keys by Top Sites')
    p.set_ylabel('Frequency of Key')
    p.set_xlabel('Website Groupings')
    p.set_xticks(ind + len(unique_keys) * width / 2.0)
    p.set_xticklabels(tuple([str(i) for i in range(50, 1050, 50)]))
    plt.show()
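
With everything defined, collecting and graphing the data is just a couple of calls (a minimal driver, assuming the CSV path used above):

results = run()
plot(results)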

There we have it; let's check out the caching results for the top 1000 websites:

A few things become clear here, although the granularity is a bit rough; if I wanted to sit through querying 1000 websites again, I might have saved more data per chunk. At any rate, from this graph we can see a distinct downward trend in the max-age and private keys, the keys that regulate how long items remain in the cache and whether they can be shared among connections. The number of sites sending any keys at all also drops, as the 'none' count steadily rises. Clearly the sites needing the most performance take the most care with their upstream caching, while those that can afford to be more lax skip these measures. It's also interesting to see breakdowns for individual sites. Amazon is a notable exception to the rule, a site without an outbound caching policy despite its size and scale. Perhaps they are doing something more interesting, such as specifying a no-cache policy only for authenticated users while intentionally leaving anonymous users a cache-optimized site experience.

Wednesday, July 7, 2010

Luhn's Auto-Abstract Algorithm

Well folks, I currently have two or three significant blog entries whizzing around my head, and I have to post one of them, even though I've already started writing another, or else I might explode. I guess this can be a little tidbit to keep the posts going until some of my more major endeavors arrive.

The inspiration for this post came from a Hack Day at the company where I'm working for the summer. One group presented an auto-summarization feature dubbed "talking points" which essentially took group content and displayed the major bullet points nicely on the group's page. The presenters portrayed this effort as a more impartial view of groups. After all, I for one trust algorithms a lot more than I trust self-promotion intended to get you to join a group. At any rate, I didn't catch the name of the algorithm until it was sent out in an email a week later.

I was pretty interested in how it worked, so I checked out the paper online, and reading it gave me a few insights. It turns out that German computer scientist H.P. Luhn came up with a method for automatically generating abstracts from scientific papers, along with many cool information-theoretic ideas. It also turns out that this idea is old, as in punch-cards-and-published-in-1958 old. It's not even his most famous work, either: the Luhn algorithm on Wikipedia points to a checksum-like algorithm that validates identification numbers. Given its age, the method is described in pretty straightforward English, and it seemed easy to implement, so I went ahead and did it.

First I'd like to give you a brief rundown of how the algorithm works. As Luhn states in his paper, the algorithm really doesn't do anything fancy in terms of NLP; it pretty much relies on frequency analysis and word spacing:
The justification of measuring word significance by use frequency is based on the fact that a writer normally repeats certain words as he advances or varies his arguments and as he elaborates on an aspect of a subject. This means of emphasis is taken as an indicator of significance. The more often certain words are found in each other’s company within a sentence, the more significance may be attributed to each of these words. - H.P. Luhn
Makes sense to me: certain significant words get repeated, and the denser their clusters within a sentence, the more valuable that sentence becomes.

The algorithm itself runs in two phases. In the first phase, we must determine which words are significant. Luhn states that this is done by running a frequency analysis and then keeping the frequent words that are not merely common English filler. I made a sweeping generalization here by taking the 100 most common English words and counting those as unimportant. Wikipedia tells me that these words account for roughly half of all written English, so I'm going to use only that set as the unimportant words, though nothing in my code forces this choice. The first thing to do is build a hash set from these words, which is pretty straightforward:

def load_common_words():
    f = open("words.txt", "r")
    common_words = set()
    for word in f:
        common_words.add(word.strip('\n').lower())
    f.close()
    return common_words

Next I have to get the most common words from my document, whatever it is, and then take the subset of those that are not among the most common English words but are still important. I did this by reading through the file and counting word frequencies, pushing the common English words to the bottom of the ranking, and then sorting the rest in descending order of frequency. I decided to keep the top 10% of these, which I think falls pretty squarely into the 'important' category Luhn was looking for.

def top_words(file_name):
    f = open(file_name, "r")
    record = {}
    common_words = load_common_words()
    for line in f:
        words = line.split()
        for word in words:
            w = word.strip('.!?,()\n').lower()
            record[w] = record.get(w, 0) + 1
    f.close()

    # Push the common English words to the bottom of the ranking
    for word in record.keys():
        if word in common_words:
            record[word] = -1

    # Sort by descending frequency and keep the top 10%
    occur = record.keys()
    occur.sort(reverse=True, key=lambda x: record[x])
    return set(occur[: len(occur) / 10])

Moving on to the second phase, we must give a score to each sentence or chunk of text. This really generalizes to any body of text, but I was curious how it would pick out important sentences, so I just used sentences. This function calculates our score based upon how tightly the important words are clustered inside the sentence. Luhn called this the "significance factor," calculated for a sentence by bracketing the significant words in the sentence, squaring the number of significant words, and then dividing by the total number of words inside the bracket. So if we had a 100-word string made up entirely of significant words, the sentence would receive a score of 100^2 / 100 = 100. The calculate_score method is a little hacky-looking, but that's to save space and make things a little more interesting for you.

def calculate_score(sentence, metric):
    words = sentence.split()
    imp_words, total_words, end, begin = 0, 0, 0, 0
    for word in words:
        w = word.strip('.!?,();').lower()
        end += 1
        if w in metric:
            imp_words += 1
            if imp_words == 1:
                begin = total_words  # position of the first significant word
            end = 0                  # reset the count of trailing words
        total_words += 1
    # Length of the bracket spanning the first to the last significant word
    bracket = total_words - begin - end
    if bracket != 0:
        return float(imp_words ** 2) / float(bracket)
    return 0.0
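
As a quick sanity check of the scoring (the sentence and the significant-word set here are made up):

# 3 significant hits ('data' twice, 'bandwidth' once) span a 5-word bracket,
# so the score is 3**2 / 5 = 1.8
print calculate_score("data compression saves data bandwidth today",
                      set(["data", "bandwidth"]))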

All right, and with that we are pretty much ready to put everything together. The summarize function runs both phases, then calculates a score for each sentence. The top sentences (in this case I chose the top 5% by score) are returned to give a decent summary:

ABSTRACT_SIZE = .05

def summarize(file_name):
    f = open(file_name, "r")
    text = ""
    for line in f:
        text += line.replace('!','.').replace('?','.').replace('\n',' ')
    f.close()
    sentences = text.split(".")
    metric = top_words(file_name)
    scores = {}
    for sentence in sentences:
        scores[sentence] = calculate_score(sentence, metric)
    top_sentences = list(sentences)                                # make a copy
    top_sentences.sort(key=lambda x: scores[x], reverse=True)      # sort by score
    top_sentences = top_sentences[:int(len(scores)*ABSTRACT_SIZE)] # get top 5%
    top_sentences.sort(key=lambda x: sentences.index(x))           # sort by occurrence
    return '. '.join(top_sentences)                            
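
To generate an abstract, you just point the function at a plain-text file (the filename here is hypothetical):

print summarize("bittyrant.txt")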

And wham! Summarization completed. Now this does depend upon a couple of key assumptions that Luhn made. He states that technical writers tend to refer to the same thing over and over with the same words, and that even if they use alternate terms for their readers, they will eventually use very specific terms to describe their points. Well, I wanted to give it a test, so I ran it on one of the papers we read in our systems class: the BitTyrant paper, an excellent read in itself that describes an extremely opportunistic and greedy BitTorrent client.

Here are the beginning and ending snippets of what it came up with. You can check out the whole file/summary and setup with the attached code samples.

 We evaluate BitTyrant performance on real swarms, establishing that all peers, regardless of upload capacity, can significantly improve download performance while reducing upload contributions. First, although modeling BitTorrent has seen a large body of work (see Section 6), our model is simpler and still suffices to capture the correlation between upload and download rates for real swarms. [... Calvin's editorial cut ...] Unfortunately, our work shows that BitTorrent fails to attain such an equilibrium for typical file sizes in swarms with realistic bandwidth distributions and churn, which BitTyrant exploits through strategic peer and rate selection. In contrast, BitTyrant maximizes download per unit of upload bandwidth and can drastically reduce its upload contribution by varying the active set size and not sharing its upload bandwidth uniformly with active peers.

I could limit the text further, but then it tends to deteriorate given the sheer amount of information in the paper. Other than the references to figures, which are out of place without the accompanying text, the introduction and conclusion are pretty good. As promised, here is the code. Next week I should have a more in-depth project to demonstrate, but until then, have fun summarizing and abstracting.

Monday, June 28, 2010

Content is Key

Earlier this weekend, I took time to redo my website. The old site was primarily Flash, didn't have a whole lot of information about me, and was on the whole pretty useless. Looking it over and figuring out the new layout got me thinking about what sorts of things a website, or any consumable really, needs to succeed. So this week there won't be any code or quick tutorials on data structures, just some of my thoughts on products and the web: what has staying power and what does not. For now I'm primarily going to talk about digital and print media because these are the things I'm most familiar with, but I think these principles can be applied to almost anything that people buy.

In my mind, there are three basic parts to selling a product, whether it's a website, software, a movie, or a book: the content, the presentation, and the spread. The content of the product relates to the purpose the product serves. For a website, it's the information on the site; in a movie or book, it's the plot. The presentation or delivery is how the content is presented, essentially the atmosphere of the product. It's the reason you might pay more for a meal at a fancy restaurant than a cheaper one even though the taste of the food is comparable. Finally there's the spread: how people first hear about your product or service. This could be the result of advertisements or a marketing campaign, but it also could be through word of mouth or recommendations.

Obviously the best products and services do all three of these very well. For instance, Apple manages to design a phone that is technologically years ahead of the competition. Its user interface and packaging are easy to use and stylish, and even on the fourth iteration of almost the same exact product, huge crowds still show up at the launch. Someone over there knows what they are doing. However, though these three factors can be combined to create a dynamite product or service, I'd like to make the case that content alone, with only minimal spread, can create a very successful site or product, often outweighing the delivery entirely. It's part of a trend that we've seen time and time again in technology, and I think it demonstrates which way the world is heading as it becomes more information and data centric.

A tale of two games
I'll start with an example from old RTS gaming: Starcraft versus Total Annihilation. While I'm sure that most of you have heard of Starcraft, TA is generally less well known. Per its Wikipedia article and my own experience playing the game, Total Annihilation was the first 3D RTS game made available; the graphics were cutting edge at the time, and it featured two distinct factions, similar to Starcraft's separate races. The end result was two very similar games: on the one hand there was Starcraft, a strictly 2D game with three distinct but equally balanced races; on the other, Total Annihilation, which had an impressive 3D set of units and terrain and outmatched Starcraft in presentation and graphical display.

So what happened? TA was named game of the year in 1997 and managed to beat Starcraft to release by six months. Yet Starcraft continues to enjoy a significantly larger following than its flashier counterpart. Starcraft was BIG; people still hold competitions the world over with this game. In the end Starcraft's mechanics made it the better game, despite its graphical disadvantage. The gameplay was extremely balanced, the campaign compelling, and it shipped with a fully featured map editor that let players define their own game types. These core features gave it staying power, which is really the defining characteristic of a product with solid content. In the end the presentation doesn't matter when the core storyline or mechanics are flawed.

However, it's important to note that I'm not saying that a product without good content can't be successful. For example, the film Avatar was wildly successful and broke all sorts of box office records. I'll concede that it had some pretty amazing scenery, and having a 3D movie was a pretty interesting new medium that I think James Cameron explored well. However, how many people do you think will remember Avatar as one of the great movies in the future? I'm wagering that it will fade out in ten years or so. Avatar didn't have the unusual plot twists or interesting storyline that define any sort of great movie. It skimped on content by essentially reusing someone else's story, following cliche after cliche. While it's entertaining, I'd argue that it doesn't have the ability to stand the test of time.

Well, what about Twitter?
Ah yes... what about Twitter? How can a service that limits postings to 140 characters possibly be successful with such limited content? That was my whole case, right: that content is the key to a service's longevity and depth.

Well, it's an interesting point, but I think there is a fundamental difference with Twitter. While an individual user's content might be very limited by the 140-character cap, Twitter's aggregate content is very strong and provides a lot of value. Think about it: if I told you that you could sign up for a special service just to get random messages I'd send out from time to time, would you do it? Even if you were my best friend and I updated it daily? Heck no, you don't care that I switched from mint toothpaste to cinnamon! But what if many of your friends, news sources, and businesses did, giving you a realtime flow of information? Now THAT is some powerful information.

And if you look at Twitter's evolution over time, you can see that they have realized this too. Twitter started out as something whose presentation made it interesting: "Tweets, those sound fun!" But over time they've started consolidating their data. Messages are now tagged and directed at different users, giving a high-level view of "what's happening now." They recently added the ability to attach location data, giving tweets more local context and allowing better search. Imagine you're on the street and see some sort of event going on: you can just flip to Twitter and search nearby tweets to discover it's the yearly kumquat festival.

But what I think makes Twitter such a powerful platform goes back to its power in the aggregate over the individual. How many tweets can you recall off the top of your head that were posted more than a week ago? I'm guessing it's not too many, simply because you can't fit that much into 140 characters. Tweets are rarely memorable; more often they're a link to something else or part of an ongoing conversation. Thus the individual user doesn't really have a lot of pull for interesting ideas and content on Twitter. Instead the platform provides the content, and I think Twitter knows it.

What does this mean for me?
Well, I mentioned that this whole idea about what makes a site successful initially came from redesigning my website. I started thinking about some of the people I admire, whose blogs I read and whose ideas I consider. I came to the startling realization that all of them use very similar formats that showcase the site's content above anything else. A few of them are simply blogs such as this one, while others are just plainly formatted HTML. These are sites I read all the time, and they spread via Reddit, Slashdot, and other word-of-mouth channels. Despite the age of some of the articles, they get linked to again and again for the simple reason that their content is interesting and insightful.

So during all that time I spent trying to figure out how to design the new iteration of my site, I was just avoiding the root problem through design: I had little of importance to say. I could try to put up some interesting Flash animations or amusing bits about myself, but in the end no design can disguise the fact that the site has little content. I think that's a valuable lesson, and one that's fundamental to any online business.

However, it's clear sites can make the transition too, with a little work and a focused goal. Twitter originally focused on the presentation, making an interesting medium to use, but by playing smart with their information they have turned into the web's realtime search system. Facebook has been undergoing a similar process: they started with a clean interface and used their data to feed their site. Sure, most users gripe about how the presentation changes all the time, but those changes are pretty insignificant compared to the power of Facebook Connect and the Pages data consolidation. Facebook has set itself up to give a more personalized web and search experience than perhaps even Google. One thing's for sure: it's an interesting time as we see who will be victorious in the fight for the web, and I think it will be won by the power of data.

Monday, June 21, 2010

Dynamic Images in Django

Those of you rolling around the internet back in the day might remember images that displayed some seemingly "personal" information to you. Granted, I'm not exactly THAT much of an old timer that I programmed my own Ethernet protocol in assembly or anything, so you too might remember something like the image at right. It's a little disconcerting at first: "How does some random image know this much about my computer?" Well, today I'd like to revisit one of my old favorites for quick prototyping, Django, to figure out just that.

When you think about it, it's not really too complicated. When you load a website, your browser sends packets to the site requesting information, the server creates a response addressed to you, and then the browser reassembles all of this disparate information into a single webpage. IP addresses are clearly part of this scheme; all data has to be sent somewhere, after all, and the server needs to know how to send the data back to its clients (you). Part of this information travels in the request's headers. HTTP headers give metadata or context to a server or client, telling them how to interpret the actual data that's about to come across. Sometimes they're used, sometimes not; it's really up to the developer, but the key point is that almost all browsers send this information. You can take my word that it's not really dangerous or revealing... that is, if you choose to trust me.

Okay, let's get down to the meat of this thing. There are a few key features that we want here (assuming we'll just duplicate the image above):

  1. We need to get the IP address, this one's pretty easy, we know it will be in a header.
  2. We need a way to get the ISP probably from the IP address.
  3. We need to figure out the Operating System, probably from a header as well.
  4. Finally, we need to return all of this as an image.
Looking through the list of HTTP headers on the Django website, which are exposed as a dictionary on the request (request.META), it seems the most relevant are REMOTE_ADDR, which gives the client's IP, and HTTP_USER_AGENT, which tends to give the most information about the client. The user agent string is how the client identifies itself to the server; it contains some info about the browser and operating system to let the server know whether it is speaking to a Windows 3.1 desktop or an iPhone 4g. User agents aren't always reliable and are pretty easy to spoof, but for now that's not a mission-critical problem.

So the IP and OS can be gleaned from headers... what about the ISP? This one is a little trickier, but essentially ISPs buy up blocks of the IP space and then lease them out to their users. Fortunately, these mappings don't change that often, and there are many sources online for a mapping from IP to ISP. I chose to go with an open source GeoIP wrapper for Python, pygeoip. It loads a mapping .dat file and then lets you query easily by IP, resolving it to an ISP name or region.

Well, we've identified the information we need; let's take a look at how to extract it. The first part of my function gets the information and turns it into nice, user-friendly strings. So far, this is what we have in the views.py of my application:

from django.shortcuts import render_to_response
from django.http import HttpResponse
import pygeoip as geo
import re

gi = geo.GeoIP('GeoIPISP.dat')
SYSTEMS =   ("Windows", "Macintosh", "Linux")

def generate(request):
    # Get the names for things
    os = findOS(request.META['HTTP_USER_AGENT'])
    ip = request.META['REMOTE_ADDR']
    org = gi.org_by_addr(ip)

    # Put them in a friendly string
    ip_string = "Welcome, your IP is: " + ip
    org_string = "Your ISP is: " + org
    os_string = "And your operating system is: " + os
 
# I <3 Regex for extracting the OS name
def findOS(ua_string):
    result = [system for system in SYSTEMS if \
              re.search(system, ua_string, re.I) != None]
    result = result[0] if len(result) > 0 else "Other"
    return result


So that is pretty simple: a little regex fanciness combined with a list comprehension gets us the operating system name, and a single function call gets the ISP from the IP in the header dictionary. Man, I love Python.

But wait! We're not quite done. It has to be an image; the info by itself isn't good enough! For this I used PIL, the aptly named Python Imaging Library. In PIL it's pretty easy to create a new solid-color image, so that's what I did. I added a little text at strategic locations and made both the text and the background a random color with one other function. That's it! Here's the entire views.py file:
from django.shortcuts import render_to_response
from django.http import HttpResponse
import pygeoip as geo
import re
import random
from PIL import Image, ImageDraw, ImageFont

gi      =   geo.GeoIP('GeoIPISP.dat')
SYSTEMS =   ("Windows", "Macintosh", "Linux")
size = (400, 100)
font = ImageFont.truetype("arial.ttf", 15)
top, middle, bottom = (5,10), (5,40), (5,70)

def generate(request):
    # Get the names for things
    os = findOS(request.META['HTTP_USER_AGENT'])
    ip = request.META['REMOTE_ADDR']
    org = gi.org_by_addr(ip)

    # Put them in a friendly string
    ip_string = "Welcome, your IP is: " + ip
    org_string = "Your ISP is: " + org
    os_string = "And your operating system is: " + os
    
    image = Image.new("RGB", size, randomColor())
    draw = ImageDraw.Draw(image)

    color = randomColor()
    
    # Draw the text
    draw.text(top, ip_string, fill=color, font=font)
    draw.text(middle, org_string, fill=color, font=font) 
    draw.text(bottom, os_string, fill=color, font=font)
    
    response = HttpResponse(mimetype="image/png")
    image.save(response, "PNG")
    return response

# I <3 Regex for extracting the OS name
def findOS(ua_string):
    result = [system for system in SYSTEMS if \
              re.search(system, ua_string, re.I) != None]
    result = result[0] if len(result) > 0 else "Other"
    return result

# Generate a random color
def randomColor():
    return tuple([random.randint(0,255) for i in [0]*3])
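
To actually serve this view, you'd hook it into the project's urls.py. A minimal sketch for a Django 1.x-era project might look like the following (the app name and URL pattern are made up):

from django.conf.urls.defaults import *

urlpatterns = patterns('',
    # Any <img src="/whoami.png"> tag will now be answered by the generate view
    (r'^whoami\.png$', 'whoami.views.generate'),
)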


BAM! Was that quick or what? I've put up the entire Django project here if you'd like to read through the source. It comes with two free GeoIP .dat files, so you lucked out! The result is something like the image at right: not a bad facsimile for only a few lines of code, and if the image template had already existed, it would be even shorter. Well, that's it for now. Until next time, feel free to poke around in HTTP headers; it's interesting to see how much extra data is being passed around!

Sunday, June 6, 2010

LZW Compression

All right, as promised last week, today's topic is LZW compression. As we talked about in our networking class, LZW gets about as close as you can to an optimal compression rate on large inputs. Its single biggest advantage is that, unlike the traditional Huffman encoding discussed in class, very little information needs to be transmitted from the compressor to the decompressor. Moreover, this information is known ahead of time, before any data has even been generated.

Getting into the mechanics: like Huffman encoding, LZW is a variable-length encoding, meaning that transmitted symbols can be of different sizes. In my implementation I'm just going to represent the compressed data as a list of integers for simplicity, but it's easy to imagine a smaller symbol type for large compressions.

So let's get started on how LZW works. Let's call the person compressing and then sending the data 'S', and the receiver who decompresses the message 'R'. LZW relies on the fact that both S and R keep a dictionary of the strings they have already seen in the transmission. As S builds its dictionary and sends further symbols to R, R will not only add the received symbols to its output, but will also build up a "mirrored dictionary" identical to the one S has already built. In this way both sender and receiver are kept in sync, yet they never need to send dictionary updates to one another, because the updates are inherent to the algorithm!

Below is the code to build the initial dictionaries; it's pretty straightforward. It takes in a string of valid characters and returns both the sender and receiver dictionaries for that initial set. At any given time you only need one of them, so the other can just be assigned to a dummy variable. The dictionaries initially assign an index to each single character; later on, whole strings of characters will receive indices, and that is where the compression comes in.

'''
    Creates the initial compression and decompression
    dictionaries, and returns both.
'''
def dict_init(characters):
    compress = dict()
    decompress = dict()
    index = 0
    for char in characters:
        compress[char] = index
        decompress[index] = char
        index += 1
    return compress, decompress

Once we have our dictionaries, it's time to compress the message. This step is pretty simple: we move along the characters until we arrive at a string that is not in our dictionary. We then output the index of the longest string that was in the dictionary and create a new entry for that string plus the new character. Thus we build our dictionary and our compressed message at the same time.

'''
    Takes in a string of characters along with
    a string of valid characters and returns their
    compressed output as a list of integers.
'''
def compress(string, valid_characters):
    
    # Initialize our bookkeeping
    record, empty = dict_init(valid_characters)
    best = ""
    index = len(valid_characters)
    compressed_output = []
    
    for char in string:
        # Look for the longest string in record
        if best + char not in record:
            # Output the best so far and make new index
            compressed_output.append(record[best])
            record[best+char] = index
            index += 1
            best = char
        # Otherwise, keep looking for longer string
        else:
            best += char

    compressed_output.append(record[best])
    return compressed_output

Alright, now comes the tricky part: decompression. The catch is that at each symbol, the receiver is essentially "one step behind" the sender, because while the receiver can guess what the sender just added to its dictionary, it's impossible to know for certain until the next symbol is read.

For instance, say the first thing the sender sends is a 4, where 'T' maps to 4 in our dictionary. Because 'T' was already in the dictionary (it is a valid character), the sender continued compressing, but then it came to an 'h'. At that point the index 4 is sent, and 'Th' is given a new index in the dictionary. Yet when the receiver learns that the first symbol is 'T', it has no idea what entry has just been added to the sender's dictionary. Therefore the receiver has to keep a 'guess' for the trailing character of the entry to be added. You can see in the following code that on each iteration this guess is completed with the first character of the newly decoded string, the character which terminated the previous string and has since been added to the dictionary.

'''
    Takes in a list of integers along with
    a string of valid characters and returns the
    decompressed output as a string of characters.
'''
def decompress(compressed, valid_characters):
    
    # Initialize our bookkeeping
    empty, record = dict_init(valid_characters)
    prev = record[compressed[0]]
    index = len(valid_characters)
    decompressed_output = prev

    record[index] = prev # Initial guess is first character

    for i in compressed[1:]:
        record[index] += record[i][0] # Update guess
        index += 1
        record[index] = record[i] # Insert new record
        decompressed_output += record[index]
        
    return decompressed_output
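
A quick round-trip check, plus a compression rate computed with one symbol per sent index, presumably the same calculation behind the numbers below (the sample text and alphabet are just for illustration):

alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ "
message = "TO BE OR NOT TO BE THAT IS THE QUESTION"
packed = compress(message, alphabet)
print decompress(packed, alphabet) == message   # True: we recover the original
print float(len(packed)) / len(message)         # compression rate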

And there you have it! So how much compression did all that work get us? Assuming that each sent index counts as one 'symbol' in our output, we can see how compressed each message becomes. As we increase the size of the message, it is much more likely that strings will repeat, and thus we get a better compression rate.

For Shakespeare's Julius Caesar:
Original size: 112472 Compressed Size: 28725
Compression rate: 0.255396898784

For Joyce's Ulysses:
Original size: 1539989 Compressed Size: 305737
Compression rate: 0.198531937566

And for Tolstoy's War and Peace:
Original size: 3223402 Compressed Size: 513291
Compression rate: 0.159238903494

Clearly the bigger they are, the harder they... compress. Generally speaking anyway. If you'd like the full source along with some of my tests, you can download them from here, specially zipped with none other than LZW!

Monday, May 31, 2010

The Magic of B-Trees

I'd like to talk a little about, and give an implementation of, B-Trees, perhaps the best example of an algorithmic data structure designed specifically for 'real computers' rather than the theory of academia. Even so, it took me a while to figure out why on earth B-Trees are used. The B-Tree is sort of like the binary tree's uncoordinated half-cousin: it looks slow, but turns out to be speedy when tested.

As it turns out, there is a very real need for a structure that avoids the space constraints of a gigantic hash table while simultaneously providing efficient lookup. When creating database indices or file systems, a sorted order is often necessary, which rules out the unordered nature of the hash table. Additionally, resizing in particular would be a huge problem in a file system, as all entries would have to be rehashed.

Okay, "That's fine," you might say, "but why not use a simple binary tree like AVL that does all comparisons and rebalancing in O(log(n))"? It's true, binary trees are both efficient in space, search, and reordering so if you have 1,000,000 elements it will take roughly 19 comparisons to find a given element in average time. The problem however, comes that in practice 19 comparisons becomes too many when the disk must seek to search for the next element.

That's the crux of the B-Tree: when we get out of the fairyland of theory and into the harsh realities of life, disk seeking becomes the operation that dominates runtime. B-Trees create trees of much smaller height, where each node contains an index over its subtrees and can be loaded from disk as a single unit. Each node is then scanned in memory to decide which child to load next, saving disk seeks.

Whew, now that we're done with the intro, let's get to coding. I've outlined my B-Tree class below. The class allows for any comparable type, really, but I'm going to use integers since they are our friends.

class BTree:

    def __init__(self, array=None, children=None):
        # Avoid mutable default arguments so each node gets its own lists
        self.array = array if array is not None else []
        self.children = children if children is not None else []

Initially our tree starts with an array of numbers and an array of children. Simple, right? The 'array' variable acts as the index into the children, directing the search algorithm to the next node. However, with all these arrays it can be tricky to keep our tree balanced and efficient. In practice there are many different rules and algorithms governing how nodes reallocate data, but I'm going to go with a very simple variant, the 2-3 Tree. 2-3 Trees are perhaps the simplest type of B-Tree, with each node containing either 1 or 2 elements and either no children or 2 to 3 children.

First I'll outline the search function. We start at the root of the tree and look through its array for our search term. If we encounter an element greater than the one we are searching for, then we look at the child pointer that comes before that element and recurse on that subtree. This can be made more efficient with binary search, but since our arrays are limited to 2 elements we will simply do a linear search here.

def search(self, element):

        if len(self.array) == 0:
            return False

        # Go through the array looking for the element or the child to descend into
        for i in range(len(self.array)):
            if self.array[i] == element:
                return True
            if self.array[i] > element:
                if len(self.children) > i:
                    return self.children[i].search(element)
                else:
                    return False
        # Not found, so check the last child pointer
        if len(self.children) > len(self.array):
            return self.children[-1].search(element)
        else:
            return False
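
To sanity-check the search, here's a tiny hand-built 2-3 tree (the values are arbitrary, and search is assumed to be defined as a method on the BTree class above):

# Root [10, 20] with three leaf children
root = BTree(array=[10, 20],
             children=[BTree(array=[5]), BTree(array=[15]), BTree(array=[25])])
print root.search(15)   # True
print root.search(7)    # False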

Alright, the final piece is insertion, which gets complicated quickly with B-Trees. I'm just going to sketch a moderately correct algorithm for the leaves, though there could definitely be errors in it, as I just pseudocoded it in Python. In a few weeks I'll follow up with a more complete (tested) implementation and a comparison to binary trees.

def insert(self, element):
        # Case 1: Leaf is empty
        if len(self.array) == 0:
            self.array.append(element)

        # Case 2: Leaf has only a single data item
        elif len(self.array) == 1:
            if self.array[0] > element:
                self.array.insert(0, element)
            else:
                self.array.append(element)

        # Case 3: Leaf is full and must be split
        elif len(self.array) == 2:
            # Sketch only: assumes each node also tracks its parent
            parent = self.parent
            # Order the three values: small stays here, middle is pushed up,
            # and large goes into a new sibling node
            if element < self.array[0]:
                small, middle, large = element, self.array[0], self.array[1]
            elif element > self.array[1]:
                small, middle, large = self.array[0], self.array[1], element
            else:
                small, middle, large = self.array[0], element, self.array[1]

            self.array = [small]

            # No parent: we are at the root, so a new root must be created
            if parent is None:
                root = BTree(array=[middle], children=[self, BTree(array=[large])])

            # Otherwise push the middle value up and hang the new sibling off the parent
            elif len(parent.array) >= 1:
                # Check whether this is the left child or the right child
                if self is parent.children[0]:
                    index = 1
                else:
                    index = 2

                parent.insert(middle)
                parent.children.insert(index, BTree(array=[large]))

The most important point to keep in mind is that B-Tree insertions are typically a bottom-up process, as in an AVL tree, which keeps things balanced and working well. Sorting, in-order traversal, and search are all pretty easy and space efficient with a B-Tree, and the indexing of the nodes lets them be loaded like pages, one at a time. At any rate, the take-home message of B-Trees is this: when the cost of loading a node is greater than the cost of comparing two values, use a B-Tree. Hopefully I'll run a later post with a non-pseudocoded version that loads nodes as files to simulate the speedup gained from using a B-Tree.

Next week I'll try and explore some compression techniques, hopefully with some tested code and metrics this time. Until then, stay classy!