Monday, June 28, 2010

Content is Key

Earlier this weekend, I took time to redo my website. The old site was primarily Flash, didn't have much information about me, and was on the whole pretty useless. Looking it over and planning the new layout got me thinking about what a website, or really any consumable product, needs to succeed. So this week there won't be any code or quick tutorials on data structures, just some of my thoughts on products and the web: what has staying power and what does not. For now I'm going to focus on digital and print media because those are what I'm most familiar with, but I think these principles apply to almost anything people buy.

In my mind, there are three basic parts to selling a product, whether it's a website, software, a movie or a book: the content, the presentation, and the spread. The content is the substance of the product, the purpose it serves. For a website, it's the information on the site; in a movie or book, it's the plot. The presentation or delivery is how the content is packaged, essentially the atmosphere of the product. It's the reason you might pay more for a meal at a fancy restaurant than at a cheaper one even though the food tastes comparable. Finally there's the spread: how people first hear about your product or service. This could be the result of advertisements or a marketing campaign, but it could also be word of mouth or recommendations.

Obviously the best products and services do all three of these very well. For instance, Apple manages to design a phone that is technologically years ahead of the competition. Its user interface and packaging are easy to use and stylish, and even on the fourth iteration of almost the same product, huge crowds still show up at the launch. Someone over there knows what they are doing. However, though these three factors can combine to create a dynamite product or service, I'd like to make the case that content alone, with some very minimal spread, can create a very successful site or product, often outweighing the delivery entirely. It's part of a trend we've seen time and time again in technology, and I think it demonstrates which way the world is heading as it becomes more information and data centric.

A tale of two games
I'll start with an example from old RTS gaming: Starcraft versus Total Annihilation. While I'm sure most of you have heard of Starcraft, TA is generally less well known. Per its Wikipedia article and my own experience playing it, Total Annihilation was the first 3D RTS game released; its graphics were cutting edge at the time, and it featured two distinct player types similar to Starcraft's separate races. The end result was two very similar games: on the one hand, Starcraft, a strictly 2D game with three distinct but equally balanced races; on the other, Total Annihilation, with an impressive 3D set of units and terrain. It outmatched Starcraft in presentation and graphical display.

So what happened? TA was named game of the year in 1997 and beat Starcraft to release by six months. Yet Starcraft continues to enjoy a significantly more massive following than its flashier counterpart. Starcraft was BIG; people still hold competitions the world over with this game. In the end Starcraft's mechanics made it the better game, despite its graphical disadvantage. The gameplay was extremely balanced, the campaign was compelling, and it shipped with a fully-featured map editor that let players define their own game types. These core features gave it staying power, which is really the defining characteristic of a product with solid content. Presentation ultimately doesn't matter when the core storyline or mechanics are flawed.

However, it's important to note that I'm not saying a product without good content can't be successful. For example, the film Avatar was wildly successful and broke all sorts of box office records. I'll concede that it had some pretty amazing scenery, and 3D was an interesting new medium that I think James Cameron explored well. But how many people do you think will remember Avatar as one of the great movies? I'm wagering it will fade out in ten years or so. Avatar didn't have the unusual plot twists or interesting storyline that define a great movie. It skimped on content by essentially reusing someone else's story, following cliche after cliche. While it's entertaining, I'd argue it doesn't have the ability to stand the test of time.

Well, what about Twitter?
Ah yes... what about Twitter? How can a service that limits postings to 140 characters possibly be successful with such limited content? That was my whole case, right? That content is the key to a service's longevity and depth.

Well, it's an interesting point, but I think there is a fundamental difference with Twitter. While an individual user's content might be constrained by the 140 character limit, Twitter's aggregate content is very strong and provides a lot of value. Think about it: if I told you that you could sign up for a special service just to get random messages I'd send out from time to time, would you do it? Even if you were my best friend and I updated it daily? Heck no, you don't care that I switched from mint toothpaste to cinnamon! But what if many of your friends, news sources and businesses did the same, giving you a realtime flow of information? Now THAT is some powerful information.

And if you look at Twitter's evolution over time, you can see that they have realized this too. Twitter started out as something whose presentation made it interesting: "Tweets, those sound fun!" But over time they've started consolidating their data. Messages are now tagged and directed at different users, giving a high-level view of "what's happening now." They recently added the ability to attach location data, giving tweets more local context and enabling better search. Imagine you're on the street and see some sort of event going on; you can just flip to Twitter and search nearby tweets to discover it's the yearly kumquat festival.

But what I think makes Twitter such a powerful platform goes back to its strength in the aggregate over the individual. How many tweets can you recall off the top of your head that were posted more than a week ago? I'm guessing not many, simply because you can't fit that much into 140 characters. Tweets are rarely memorable; more often they're a link to something else or part of an ongoing conversation. Thus the individual user doesn't really have a lot of pull for interesting ideas and content on Twitter. Instead the platform provides the content, and I think Twitter knows it.

What does this mean for me?
Well, I mentioned that this whole idea about what makes a site successful initially came from redesigning my website. I started thinking about some of the people I admire, whose blogs I read and whose ideas I consider. I came to the startling realization that all of them use very similar formats that showcase the site's content above everything else. A few are simply blogs like this one, while others are just plain, simply formatted HTML. These are sites I read all the time, and they spread via reddit, slashdot, and other word-of-mouth channels. Despite the age of some of the articles, they get linked again and again for the simple reason that their content is interesting and insightful.

So all that time I spent trying to figure out a new design for my site, I was just using design to avoid the root problem: I had little of importance to say. I could put up some interesting Flash animations or amusing bits about myself, but in the end no design can disguise the fact that a site has little content. I think that's a valuable lesson, and one that's pretty fundamental to any online business.

However, it's clear sites can make the transition too, with a little work and a focused goal. Twitter originally focused on the presentation, making an interesting medium to use, but by playing smart with their information they have turned into the web's realtime search system. Facebook has been undergoing a similar process: they started with a clean interface and used their data to feed their site. Sure, most users gripe about how the presentation changes all the time, but these changes are pretty insignificant compared to the power of Facebook Connect and the Pages data consolidation. Facebook has set itself up to give a more personalized web and search experience than perhaps even Google. One thing's for sure: it's an interesting time as we watch the fight for the web play out, and I think it will be won by the power of data.

Monday, June 21, 2010

Dynamic Images in Django

Those of you rolling around the internet back in the day might remember images that displayed some seemingly "personal" information to you. Granted, I'm not exactly THAT much of an old timer that I programmed my own Ethernet protocol in assembly or anything, so you too might remember something like the image at right here. It's a little disconcerting at first: "How does some random image know this much about my computer?" Well, today I'd like to revisit one of my old favorites for quick prototyping, Django, to figure out just that.

When you think about it, it's not really too complicated. When you load a website, your browser sends packets to the site requesting information, the server creates a response addressed to you, and then the browser assembles all this disparate information into a single webpage. It's pretty clear IP addresses are part of this scheme; all data has to be sent somewhere, after all, and the server needs to know how to send the data back to its clients (you). Alongside the data, each request carries headers. HTTP headers give metadata and context to a server or client, telling them how to interpret the actual data that's about to come across. Sometimes they're used, sometimes not; it's really up to the developer, but the key point is that almost all browsers send this information. You can take my word that it's not really dangerous or revealing... that is, if you choose to trust me.
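
To make this concrete, here's roughly what the header portion of a browser's request looks like. This particular set of fields is just an illustration; the exact headers vary by browser:

GET /index.html HTTP/1.1
Host: www.example.com
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.30 (KHTML, like Gecko)
Accept: text/html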

Okay, let's get down to the meat of this thing. There are a couple of key features we want here (assuming we'll just duplicate the image above):

  1. We need to get the IP address. This one's pretty easy; we know it will be in a header.
  2. We need a way to get the ISP, probably from the IP address.
  3. We need to figure out the operating system, probably from a header as well.
  4. Finally, we need to return all this as an image.
Looking through the list of HTTP headers available on the Django website, which are exposed as a dictionary on the request object, the most relevant seem to be REMOTE_ADDR, which holds the client's IP, and HTTP_USER_AGENT, which tends to give the most information about the client. The user agent string is how the client identifies itself to the server; it contains some info about the browser and operating system to let the server know whether it is speaking to a Windows 3.1 desktop or an iPhone 4. User agents aren't always reliable and are pretty easy to spoof, but for now that's not a mission-critical problem.
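
To see what we're up against, here's a minimal sketch of the kind of check we'll build below. The user agent string is an illustrative sample, not one captured from a real client:

import re

ua = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.30 (KHTML, like Gecko)"
# Case-insensitive search for a known OS name within the string
print(re.search("Windows", ua, re.I) is not None)   # True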

So the IP and OS can be gleaned from headers... what about the ISP? This one is a little trickier, but essentially ISPs buy up blocks of the IP space and then lease them out to their users. Fortunately, these mappings don't change very often, and there are many sources online for a mapping from IP to ISP. I chose to go with an open source GeoIP wrapper for Python, pygeoip. It loads a mapping .dat file and then lets the user query easily by IP, resolving it to an ISP name or region.
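
As a quick sanity check, basic usage looks something like this; the .dat filename and sample IP are placeholders, so substitute whatever database and address you have on hand:

import pygeoip

gi = pygeoip.GeoIP('GeoIPISP.dat')     # load the IP-to-ISP database
print(gi.org_by_addr('64.233.160.0'))  # prints the owning organization's name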

Well, we've identified the information we need; let's take a look at how to extract it. The first part of my function gets the information and formats it into nice, user-friendly strings. So far, this is what we have in the views.py of my application:

from django.shortcuts import render_to_response
from django.http import HttpResponse
import pygeoip as geo
import re

gi = geo.GeoIP('GeoIPISP.dat')
SYSTEMS =   ("Windows", "Macintosh", "Linux")

def generate(request):
    # Get the names for things
    os = findOS(request.META['HTTP_USER_AGENT'])
    ip = request.META['REMOTE_ADDR']
    org = gi.org_by_addr(ip) or "Unknown"  # may be None for local/unknown IPs

    # Put them in a friendly string
    ip_string = "Welcome, your IP is: " + ip
    org_string = "Your ISP is: " + org
    os_string = "And your operating system is: " + os
 
# I <3 regex for extracting the OS name
def findOS(ua_string):
    result = [system for system in SYSTEMS
              if re.search(system, ua_string, re.I) is not None]
    return result[0] if result else "Other"


So that's pretty simple: a little regex fanciness combined with a list comprehension gets us the operating system name, and a single function call gets the ISP from the IP found in the HTTP header dictionary. Man, I love Python.

But wait! We're not quite done. It has to be an image; the info alone isn't good enough! For this I used PIL, the aptly named Python Imaging Library. In PIL it's pretty easy to create a new solid-color image, so that's what I did. I added a little text at strategic locations, and made both the text and the background a random color with one other function. That's it! Here's the entire views.py file:
from django.shortcuts import render_to_response
from django.http import HttpResponse
import pygeoip as geo
import re
import random
from PIL import Image, ImageDraw, ImageFont

gi      =   geo.GeoIP('GeoIPISP.dat')
SYSTEMS =   ("Windows", "Macintosh", "Linux")
size = (400, 100)
font = ImageFont.truetype("arial.ttf", 15)
top, middle, bottom = (5,10), (5,40), (5,70)

def generate(request):
    # Get the names for things
    os = findOS(request.META['HTTP_USER_AGENT'])
    ip = request.META['REMOTE_ADDR']
    org = gi.org_by_addr(ip) or "Unknown"  # may be None for local/unknown IPs

    # Put them in a friendly string
    ip_string = "Welcome, your IP is: " + ip
    org_string = "Your ISP is: " + org
    os_string = "And your operating system is: " + os
    
    image = Image.new("RGB", size, randomColor())
    draw = ImageDraw.Draw(image)

    color = randomColor()
    
    # Draw the text
    draw.text(top, ip_string, fill=color, font=font)
    draw.text(middle, org_string, fill=color, font=font) 
    draw.text(bottom, os_string, fill=color, font=font)
    
    response = HttpResponse(mimetype="image/png")
    image.save(response, "PNG")
    return response

# I <3 regex for extracting the OS name
def findOS(ua_string):
    result = [system for system in SYSTEMS
              if re.search(system, ua_string, re.I) is not None]
    return result[0] if result else "Other"

# Generate a random color
def randomColor():
    return tuple(random.randint(0, 255) for _ in range(3))
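
One piece I haven't shown is the URL wiring. A minimal urls.py sketch for the Django of this era might look like the following, where 'myapp' and the URL pattern are placeholder names:

from django.conf.urls.defaults import patterns

urlpatterns = patterns('',
    (r'^whatismyip\.png$', 'myapp.views.generate'),
)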


BAM! Was that quick or what? I've put up the entire django project here if you'd like to read through the source. It comes with two free GeoIP .dat files, so you lucked out! The result is something like the image at right: not a bad facsimile for only a few lines of code, and if the image template had already existed, it would be even shorter. Well, that's it for now. Until next time, feel free to dig into HTTP headers; it's interesting to see how much extra data is being passed around!

Sunday, June 6, 2010

LZW Compression

All right, as promised last week, today's topic is LZW compression. As we discussed in our networking class, LZW gets about as close as you can to an optimal compression rate on large inputs. Its single biggest advantage is that, unlike the traditional Huffman encoding discussed in class, very little information needs to be transmitted from the compressor to the decompressor. Moreover, that information is known ahead of time, before any data has even been generated.

Getting into the mechanics: like Huffman encoding, LZW is a variable-length encoding, meaning that transmitted symbols can be of different sizes. In my implementation I'm just going to represent the compressed data as a list of integers for simplicity, but it's easy to imagine a more compact symbol type for large compressions.

So let's get into how LZW works. Call the party compressing and sending the data 'S' (the sender), and the party decompressing the message 'R' (the receiver). LZW relies on the fact that both S and R keep a dictionary of strings they have already seen in the transmission. As S builds its dictionary and sends further symbols to R, R not only adds the decoded symbols to its output but also builds up a "mirrored dictionary" identical to the one S has already built. In this way sender and receiver stay in sync, yet never need to send dictionary updates to one another, because the updates are inherent to the algorithm!

The code below builds the initial dictionaries; it's pretty straightforward. It takes in a string of valid characters and returns both the sender's and receiver's dictionaries for that initial set. At any given time you only need one of the two, so the other can just be assigned to a dummy variable. The dictionaries initially assign an index to each single character; later, whole strings of characters will receive indices of their own, and that is where the compression comes in.

'''
    Creates the initial compression and decompression
    dictionaries, and returns both.
'''
def dict_init(characters):
    compress = dict()
    decompress = dict()
    # Assign each single character an index, in both directions
    for index, char in enumerate(characters):
        compress[char] = index
        decompress[index] = char
    return compress, decompress
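
For a toy alphabet, the two dictionaries come out as mirror images of one another:

c, d = dict_init("ab")
assert c["a"] == 0 and d[0] == "a"
assert c["b"] == 1 and d[1] == "b"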

Once we have our dictionaries, it's time to compress the message. This step is pretty simple: we move along the input characters until we reach a string that is not in our dictionary. At that point we output the index of the longest string that was in the dictionary, and create a new entry for that string plus the new character. Thus we build our dictionary and our compressed message at the same time.

'''
    Takes in a string of characters along with
    a string of valid characters and returns their
    compressed output as a list of integers.
'''
def compress(string, valid_characters):
    
    # Initialize our bookkeeping
    record, empty = dict_init(valid_characters)
    best = ""
    index = len(valid_characters)
    compressed_output = []
    
    for char in string:
        # Look for the longest string in record
        if best + char not in record:
            # Output the best so far and make new index
            compressed_output.append(record[best])
            record[best+char] = index
            index += 1
            best = char
        # Otherwise, keep looking for longer string
        else:
            best += char

    compressed_output.append(record[best])
    return compressed_output

Alright, now comes the tricky part: decompression. The catch is that the receiver is always "one step behind" the sender; while the receiver can guess what string the sender just added to its dictionary, it can't know the final character of that string for certain until the next symbol arrives.

For instance, say the first thing the sender sends is a 4, where 'T' maps to 4 in our dictionary. Because 'T' had already been added to the dictionary (it is a valid symbol), the sender continued compressing, but then it came to an 'h'. At that point the index 4 is sent and 'Th' is given a new index in the dictionary. Yet when the receiver decodes the first symbol as 'T', it has no idea what entry has just been added to the sender's dictionary. Therefore the receiver keeps a "guess" about the trailing character of the next entry to be added. You can see in the following code that on each iteration this guess is completed with the first character of the newly decoded string, the character that terminated the previous string on the sender's side and has since been added to its dictionary.

'''
    Takes in a list of integers along with
    a string of valid characters and returns the
    decompressed output as a string of characters.
'''
def decompress(compressed, valid_characters):
    
    # Initialize our bookkeeping
    empty, record = dict_init(valid_characters)
    prev = record[compressed[0]]
    index = len(valid_characters)
    decompressed_output = prev

    record[index] = prev # Initial guess is first character

    for i in compressed[1:]:
        record[index] += record[i][0] # Update guess
        index += 1
        record[index] = record[i] # Insert new record
        decompressed_output += record[index]
        
    return decompressed_output
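
Here's a quick round trip to convince ourselves the two halves agree; this string exercises the repeated-substring cases nicely:

alphabet = "abcdefghijklmnopqrstuvwxyz"
message = "tobeornottobeortobeornot"
symbols = compress(message, alphabet)
print(len(symbols))                               # 17 symbols for 24 characters
print(decompress(symbols, alphabet) == message)   # True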

And there you have it! So how much compression did all that work get us? Assuming each sent index counts as one "symbol" in our message, we can measure how compressed each message becomes. As the message grows, it becomes much more likely that strings will repeat, and thus we get a better compression rate.
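
Concretely, the rate here is just output symbols over input characters, a simplification since a dictionary index generally needs more bits than a single character. A small helper (my own naming) along these lines produces the numbers below:

def compression_rate(text, valid_characters):
    # Symbols out per character in -- lower is better
    return float(len(compress(text, valid_characters))) / len(text)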

For Shakespeare's Julius Caesar:
Original size: 112472 Compressed Size: 28725
Compression rate: 0.255396898784

For Joyce's Ulysses:
Original size: 1539989 Compressed Size: 305737
Compression rate: 0.198531937566

And for Tolstoy's War and Peace:
Original size: 3223402 Compressed Size: 513291
Compression rate: 0.159238903494

Clearly the bigger they are, the harder they... compress. Generally speaking, anyway. If you'd like the full source along with some of my tests, you can download them from here, specially zipped with none other than LZW!