“The general idea is pretty simple. We take the input audio. We condition it (adjust it to a known sampling rate and volume). We pass it through the psychoacoustic model (it’s about a notch more complicated than what you’d see in an MP3 encoder, which ain’t saying much. This is all stuff that was mostly hashed out decades ago). This model effectively strips the parts of the sound you can’t hear; the desired result is that even if the audio has been compressed or manipulated subaudibly, the result is still the same. Okay, so the net result of all of this is a vector that covers a very small segment (a fraction of a second) of audio. We stack several of these vectors (possibly separated in time by a bit) side by side to get a big vector. Then we do completely boring, standard, well-understood statistical and pattern-matching stuff on the vector to make it smaller and more palatable for the server; think of it as lossy compression. Then it goes off to the server. The server is about equal in complexity to a text search engine. (I say this fully realizing that I have only a vague impression of how Google works. It’s certainly a lot more complicated than the obvious hash-table-of-sorted-lists stuff.) It finds the database vector that’s the best match in a fairly boring but efficient way. (No, it does not involve searching through all tracks one by one, any more than AltaVista searches through all web pages one by one every time you want to find some porn.) Call the result a submatch. Back at the client, the whole process is repeated a bunch more times, generating a stream of submatches (“Radiohead offset 0.. Radiohead offset 1024 or 16384.. Slashdot’s Gr34test Hits 5262324.. Radiohead offset 3072..”) from the input audio stream. Then the client looks at the submatches and tries to figure out what the input audio was and where the song boundaries are (did somebody really stick in a sample from Slashdot’s Gr34test Hits, or was that just an unlucky match?).
See? Not magic. It’s a challenging problem, but not an impossible problem. The reason that this doesn’t exist right now is not that generations of scientists have tried and failed, but rather that people didn’t care too much until lately and nobody’s gotten off their ass and done anything about it yet. I like big but approachable problems, which is one of the reasons I’m excited about this.
FOR ALL OF YOU WHO FELL ASLEEP THROUGH THAT: YOU CANNOT ADD AN INAUDIBLE TONE TO THE MUSIC AND BREAK TUNEPRINT. THE FINGERPRINT IS BASED ON THE LARGE-SCALE PSYCHOACOUSTIC FEATURES OF THE MUSIC. IF MP3 ENCODERS CAN DO IT, SO CAN WE. Maybe not perfectly, but enough to have a fighting chance. THAT’S THE WHOLE POINT HERE.”
So take that, naysayers.
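
For the curious, the last step in the quote (the client sorting real matches from unlucky ones) can be sketched in a few lines. This is only my guess at the flavor of it, not Tuneprint's actual code. I'm assuming each query is taken a fixed 1024 samples after the previous one (the `HOP` constant is hypothetical, picked to fit the offsets in the quote), and that an ambiguous submatch like "1024 or 16384" comes back as a list of candidates. The function names `vote` and `label_steps` are made up for illustration. The trick is that a genuine match keeps the same alignment (track offset minus input position) from one query to the next, while a fluke does not.

```python
from collections import Counter

HOP = 1024  # assumed spacing, in samples, between successive queries


def vote(submatches, hop=HOP, min_votes=2):
    """Pick the best-supported (track, alignment) hypothesis.

    submatches[i] is the list of (track, offset) candidates the server
    returned for the query taken at input position i * hop.
    Returns (track, alignment, votes), or None if nothing has enough support.
    """
    votes = Counter()
    for i, candidates in enumerate(submatches):
        # set() so a duplicated candidate cannot vote twice for one query
        for track, offset in set(candidates):
            votes[(track, offset - i * hop)] += 1
    if not votes:
        return None
    (track, alignment), count = votes.most_common(1)[0]
    return (track, alignment, count) if count >= min_votes else None


def label_steps(submatches, hop=HOP, min_votes=2):
    """Mark each query with the winning track, or None if it looks like a fluke."""
    best = vote(submatches, hop, min_votes)
    if best is None:
        return [None] * len(submatches)
    track, alignment, _ = best
    return [
        track if (track, alignment + i * hop) in candidates else None
        for i, candidates in enumerate(submatches)
    ]


# The stream of submatches from the quote, one list of candidates per query.
stream = [
    [("Radiohead", 0)],
    [("Radiohead", 1024), ("Radiohead", 16384)],
    [("Slashdot's Gr34test Hits", 5262324)],
    [("Radiohead", 3072)],
]

print(vote(stream))         # ('Radiohead', 0, 3)
print(label_steps(stream))  # ['Radiohead', 'Radiohead', None, 'Radiohead']
```

Three of the four queries agree on Radiohead at alignment 0, so the Slashdot hit gets written off as an unlucky match. This toy version assumes the whole stream is one track; finding song boundaries would presumably mean running the same vote over a sliding window and watching for the winner to change.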