LinuxWorld
Subscribe to this site with RSS

The code monkey's guide to cryptographic hashes for content-based addressing

Keeping track of large blocks of content using hash values only can be a a way to save bandwidth and build exciting new applications. Follow some important rules to use hashes correctly and securely.

Prologue

It was 1996, the bandwidth between Australia and the rest of the world was miserable, and Andrew Tridgell had a problem. He wanted to synchronize source code located in Australia with source code on machines around the world, but sending patches was annoying and error-prone, and just sending all the files was painfully slow. Most people would have just waited a few years for trans-Pacific bandwidth to improve; instead Tridgell wrote rsync, the first known instance of content-based addressing (also known as content-addressed storage, or compare-by-hash), an innovation which eventually spread to software like BitTorrent, git, and many document storage products on the market.

Used properly, content-based addressing dramatically reduces bandwidth use, simplifies code, and improves security. But used when it doesn't make sense, it can reduce performance, increase bandwidth usage, and create new security risks. Figuring out when content-based addressing is good and when it is very, very bad requires an understanding of both systems programming and mathematics. What's a programmer to do? In this article, we'll lay out the considerations in language that any programmer (and even some managers) can understand.

Content-based addressing: The basics

Let's start with an example problem. Say you have an MP3 music collection, and your friend sends you a link to a new MP3 but doesn't give you the title. If it's already in your collection, you don't want to bother downloading and storing a second copy, so you want to quickly find out if it is identical to one of your other MP3s. If you only have a few MP3s, you'll just directly compare the new MP3 to each of your existing MP3s. This takes an amount of time proportional to the number of MP3s you already have, and if you have a lot of MP3s (and we know you do), it will run too slowly and you'll look for a smarter algorithm. A method for quickly finding a particular item that is nearly as old as computer science is called the hash table. There are a thousand variations on the hash table, but we'll describe a very simple example, and then use this to understand the tradeoffs of content-based addressing.

Invalid query - session: Can't connect to local MySQL server through socket '/tmp/mysql.sock' (2)
Newsletter sign-up

Sign up for one of Network World's newsletters compliments of Linux World

Linux & Open Source News Alert
Web Applications Alert
Video and Podcast Alert
Security Alert
Virtualization Alert

Email Address: