What Colour are your bits?

There’s a classic role-playing game called Paranoia which is set in an extremely ~~repressive~~ Utopian futuristic world run by The Computer, who is Your Friend. While I was looking at a recent LawMeme posting and related discussion, it occurred to me that the concept of colour-coded security clearances in Paranoia provides a good metaphor for a lot of copyright and intellectual freedom issues, and it may illuminate why we sometimes have difficulty communicating and understanding the ideologies in these areas.

In Paranoia, everything has a colour-coded security level (from Infrared up to Ultraviolet) and everybody has a clearance on the same scale. You are not allowed to touch, or have any dealings with, anything that exceeds your clearance. If you’re a Red Troubleshooter, you’re not allowed to walk through an Orange door. Formally, you’re not really supposed to even know about the existence of anything above your clearance. Anyone who breaks the rules is a Commie Mutant Traitor, subject to the death penalty.

Much of the game revolves around the consequences of the security levels. For instance, Friend Computer might assign a team of Red Troubleshooters to re-paint a hallway that ought to be Orange but was painted Yellow by ~~mistake~~ the Commie Mutant Traitors. It’s quite likely in such a case that the Troubleshooters will all end up shooting each other for treason against Friend Computer, since none of them are allowed to touch the paint, go near the hallway, or talk about their mission, and they’re all charged with enforcing the rules on one another.

In intellectual property and some other fields we’re very interested in information, data, artistic works, a whole lot of things that I’ll summarize with the term “bits”. Bits are all the things you can (at least in principle) represent with binary ones and zeroes. And a great deal of intellectual property law comes down to rules regarding intangible attributes of bits: Who created the bits? Where did they come from? Where are they going? Are they copies of other bits? Those questions are perhaps answerable by “metadata”, but metadata suggests to me additional bits attached to the bits in question, and I’d like to emphasize that I’m talking here about something that is not properly captured by bits at all and actually cannot be, ever. Let’s call it “Colour”, because it turns out to behave a lot like the colour-coded security clearances of the Paranoia universe.

Bits do not naturally have Colour. Colour, in this sense, is not part of the natural universe. Most importantly, you cannot look at bits and observe what Colour they are. I encountered an amusing example of bit Colour recently: one of my friends was talking about how he’d performed John Cage’s famous silent musical composition 4’33” for MP3. Okay, we said, (paraphrasing the conversation here) so you took an appropriate-sized file of zeroes out of /dev/zero and compressed that with an MP3 compressor? No, no, he said. If I did that, it wouldn’t really be 4’33” because to perform the composition, you have to make the silence in a certain way, according to the rules laid down by the composer. It’s not just four minutes and thirty-three seconds of any old silence.

My friend had gone through an elaborate process that basically amounted to performing some other piece of music four minutes and thirty-three seconds long, with a software synthesizer and the volume set to zero. The result was an appropriate-sized file of zeroes — which he compressed with an MP3 compressor. The MP3 file was bit-for-bit identical to one that would have been produced by compressing /dev/zero... but this file was (he claimed) legitimately a recording of 4’33” and the other one wouldn’t have been. The difference was the Colour of the bits. He was asserting that the bits in his copy of 433.mp3 had a different Colour from those in a copy of 433.mp3 I might make by means of the /dev/zero procedure, even though the two files would contain exactly the same bits.

Now, the preceding paragraph is basically nonsense to computer scientists or anyone with a mathematical background. (My friend is one; he’d done this as a sort of elaborate joke.) Numbers are numbers, right? If I add 39 plus 3 and get 42, and you do the same thing, there is no way that “my” 42 can be said to be different from “your” 42. Given two bit-for-bit identical MP3 files, there is no meaningful (to a computer scientist) way to say that one is a recording of the Cage composition and the other one isn’t. There would be no way to test one of the files and see which one it was, because they are actually the same file. Having identical bits means by definition that there can be no difference. Bits don’t have Colour; computer scientists, like computers, are Colour-blind. That is not a mistake or deficiency on our part: rather, we have worked hard to become so. Colour-blindness on the part of computer scientists helps us understand the fact that computers are also Colour-blind, and we need to be intimately familiar with that fact in order to do our jobs.
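The Colour-blind view is easy to check mechanically. What follows is a minimal Python sketch, not the procedure described above: it skips the MP3 compressor and compares raw 16-bit samples, and the sample rate, the 440 Hz tone standing in for "some other piece of music", and the function names are all assumptions made for illustration.

```python
import hashlib
import math

SAMPLE_RATE = 2000          # assumed; kept low so the sketch runs quickly
SECONDS = 4 * 60 + 33       # four minutes and thirty-three seconds
N_SAMPLES = SAMPLE_RATE * SECONDS


def silence_from_zeroes(n_samples: int) -> bytes:
    """The /dev/zero route: n_samples 16-bit samples, every byte zero."""
    return bytes(2 * n_samples)


def silence_from_performance(n_samples: int, volume: float = 0.0) -> bytes:
    """'Perform' a 440 Hz tone with the volume at zero, as 16-bit PCM."""
    out = bytearray()
    for i in range(n_samples):
        wave = math.sin(2 * math.pi * 440 * i / SAMPLE_RATE)
        sample = int(volume * 32767 * wave)
        out += sample.to_bytes(2, "little", signed=True)
    return bytes(out)


a = silence_from_zeroes(N_SAMPLES)
b = silence_from_performance(N_SAMPLES)

# Two different histories, one sequence of bits.
print(a == b)
print(hashlib.sha256(a).hexdigest() == hashlib.sha256(b).hexdigest())
```

Nothing in `a` or `b` records which function produced it; any difference between them has to live somewhere other than the bytes.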

The trouble is, human beings are not in general Colour-blind. The law is not Colour-blind. It makes a difference not only what bits you have, but where they came from. There’s a very interesting page on the US Naval Observatory Web site that illustrates the Coloured nature of bits in law. They provide information on that site about when the Sun rises and sets and so on... but they also provide it under a disclaimer saying that this information is not suitable for use in court. If you need to know when the Sun rose or set for use in a court case, then you need an expert witness, because you don’t actually just need the bits that say when the Sun rose. You need those bits to be Coloured with the Colour that allows them to be admissible in court, and the USNO doesn’t provide that. It’s not just a question of accuracy; we all know perfectly well that the USNO’s numbers are good. It’s a question of where the numbers came from. It makes perfect sense to a lawyer that where the information came from is important, in fact maybe more important than the information itself. The law sees Colour.

Suppose you publish an article that happens to contain a sentence identical to one from this article, like “The law sees Colour.” That’s just four words, all of them common, and it might well occur by random chance. Maybe you were thinking about similar ideas to mine and happened to put the words together in a similar way. If so, fine. But maybe you wrote “your” article by cutting and pasting from “mine” — in that case, the words have the Colour that obligates you to follow quotation procedures and worry about “derivative work” status under copyright law and so on. Exactly the same words — represented on a computer by the same bits — can vary in Colour and have differing consequences. When you use those words without quotation marks, either you’re an author or a plagiarist depending on where you got them, even though they are the same words. It matters where the bits came from.

I think Colour is what the designers of Monolith are trying to challenge, although I’m afraid I think their understanding of the issues is superficial on both the legal and computer-science sides. The idea of Monolith is that it will mathematically combine two files with the exclusive-or operation. You take a file to which someone claims copyright, mix it up with a public file, and then the result, which is mixed-up garbage supposedly containing no information, is supposedly free of copyright claims even though someone else can later undo the mixing operation and produce a copy of the copyright-encumbered file you started with. Oh, happy day! The lawyers will just have to all go away now, because we’ve demonstrated the absurdity of intellectual property!
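The mixing step itself is ordinary arithmetic. Here is a rough Python sketch of the exclusive-or idea as described above; it is not Monolith's actual code or file format, and `xor_bytes` and the sample contents are invented for illustration.

```python
import os


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise exclusive-or of two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("inputs must be the same length")
    return bytes(x ^ y for x, y in zip(a, b))


claimed = b"A file to which someone claims copyright."
public = os.urandom(len(claimed))      # stands in for the "public file"

mixed = xor_bytes(claimed, public)     # the "mixed-up garbage"
recovered = xor_bytes(mixed, public)   # XOR with the same file undoes the mixing

print(recovered == claimed)
```

The arithmetic is symmetric: `mixed` and `public` play interchangeable roles, and either one alone is useless for recovering `claimed` without the other.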

The fallacy of Monolith is that it’s playing fast and loose with Colour, attempting to use legal rules one moment and math rules another moment as convenient. When you have a copyrighted file at the start, that file clearly has the “covered by copyright” Colour, and you’re not cleared for it, Citizen. When it’s scrambled by Monolith, the claim is that the resulting file has no Colour — how could it have the copyright Colour? It’s just random bits! Then when it’s descrambled, it still can’t have the copyright Colour because it came from public inputs. The problem is that there are two conflicting sets of rules there. Under the lawyer’s rules, Colour is not a mathematical function of the bits that you can determine by examining the bits. It matters where the bits came from. The scrambled file still has the copyright Colour because it came from the copyrighted input file. It doesn’t matter that it looks like, or maybe even is bit-for-bit identical with, some other file that you could get from a random number generator. It happens that you didn’t get it from a random number generator. You got it from copyrighted material; it is copyrighted. The randomly-generated file, even if bit-for-bit identical, would have a different Colour. The Colour inherits through all scrambling and descrambling operations and you’re distributing a copyrighted work, you Commie Mutant Traitor.

To a computer scientist, on the other hand, bits are bits are bits and it is absolutely fundamental that two identical chunks of bits cannot be distinguished. Colour does not exist. I’ve seen computer people claim (indeed, one did this to me just today in the very discussion that inspired this posting) that copyright law inescapably leads to nonsense conclusions like “If I own copyright on one thing, and copyright inherits through XOR, then I own copyright on everything because everything can be obtained from my one thing by XORing it with the right file.” That sounds profound only if you’re a Colour-blind computer scientist; it would be boring nonsense to a lawyer because lawyers are trained to believe in and use Colour, and it’s obvious to a lawyer that the Colour doesn’t magically bleed to the entire universe through the hypothetical random files that might be created some day. You could create the file randomly, but you didn’t. Maybe you could create a file identical to the complete works of Shakespeare by XORing together two files of apparently random garbage. “Why, so can I, or so can any man;” but that doesn’t mean that I am William Shakespeare.
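The arithmetic behind the "I own everything" argument is correct as far as it goes, and a short Python sketch makes that concrete. The names `xor_bytes`, `right_file_for`, and the sample strings are assumptions made for illustration.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise exclusive-or of two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("inputs must be the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def right_file_for(mine: bytes, target: bytes) -> bytes:
    """Return the 'right file': XORing it with `mine` yields `target`."""
    return xor_bytes(mine, target)


size = 32
mine = b"my one copyrighted thing".ljust(size, b"\0")
target = b"something else entirely".ljust(size, b"\0")

key = right_file_for(mine, target)
print(xor_bytes(mine, key) == target)
```

Notice that `right_file_for` needs `target` as an input. The "right file" can only be built by someone who already has the thing it is supposed to produce, which is the mathematical shadow of the lawyer's point: you could have got there from my file, but you didn't.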

This idea of Colour is a problem for communication between those of us who work in the world of computers, where Colour does not exist, and those of us who work in the law, where Colour exists and is important. Lawyers will ask computer scientists questions about how to determine the Colour of bits (like “How can Friend Computer prevent the Commie Mutant Traitors from making illegal copies of files, while still allowing loyal Troubleshooters to use disk-copying equipment?”), and computer scientists will find it difficult to say anything in response that the lawyers can comprehend — because a big part of computer science is about understanding that Colour does not exist. Someone who cares a lot about what Colour the bits are, and spends a lot of resources on trying to answer that question, is a dangerous idiot if not a Commie Mutant Traitor. In intellectual property law the Colour of bits exists and is of absolutely paramount importance. A computer scientist who won’t tell what Colour the bits are is being deliberately unhelpful, and a computer scientist who denies the very existence of Colour (as any conscientious computer scientist must eventually do) is a dangerous idiot and/or a Commie Mutant Traitor.

There are several ways we could try to avoid the issue. Computer scientists who want to try to be helpful may say, “Okay, you, the lawyer, are a dangerous idiot, but I have to work with you or be thrown in jail as a Commie Mutant Traitor as happened to Dmitry Sklyarov, so I’ll try to address your concerns. You say there is some special property of some bits and we need to know which bits have this property. Fine. We’ll attach tags to the files to say what Colour they are.” In the copyright realm, that’s the “rights management information” solution. It’s what they do with DVDs (region coding), VHS tapes (Macrovision), Adobe eBooks (“you may not read this file aloud”), CDs (SCMS), and many other formats. The trouble is, if we (as computer scientists) are intellectually honest about it, we’ll have to admit that it can’t really work.

The tags are just more bits. You can write a tag that says “this is an Orange tag”, but it will be made out of bits and so it can’t really have a Colour because Colour does not exist. It will just be a Colour-less tag saying “this is an Orange tag”. It will be subject to all the consequences of the fact that Colour does not exist — such as the fact that the tag could be stripped out somewhere down the line. The computer scientists are aware of that; we have to be, because knowing about the non-existence of Colour is what makes us computer scientists in the first place.
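A toy example shows how little the tag can do for itself. This Python sketch uses a made-up one-line header format; it does not correspond to any real rights-management scheme, and `attach_tag`, `read_tag`, and `strip_tag` are hypothetical names.

```python
TAG_PREFIX = b"COLOUR:"


def attach_tag(payload: bytes, colour: str) -> bytes:
    """Prepend a one-line header naming the claimed Colour."""
    return TAG_PREFIX + colour.encode("ascii") + b"\n" + payload


def read_tag(tagged: bytes) -> str:
    """Report whatever Colour the header claims."""
    header, _, _ = tagged.partition(b"\n")
    if not header.startswith(TAG_PREFIX):
        raise ValueError("no tag present")
    return header[len(TAG_PREFIX):].decode("ascii")


def strip_tag(tagged: bytes) -> bytes:
    """Drop the header; the payload survives untouched."""
    _, _, payload = tagged.partition(b"\n")
    return payload


payload = b"some bits"
orange = attach_tag(payload, "Orange")
relabelled = attach_tag(strip_tag(orange), "Infrared")

print(read_tag(orange), read_tag(relabelled))
print(strip_tag(relabelled) == payload)
```

Anyone who can read the file can run `strip_tag` or write a different header, because the header is made of the same stuff as the payload.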

What we are doing with rights management information is simulating Colour in a computer-sciencey way. But lawyers will seize on the possibility of doing this kind of simulation and say, “See! You admit it! You can recognize the Colour of bits after all!” and then conclude from there that all the other rules they want to make (such as “Red Troubleshooters may not walk down Orange hallways”) are meaningful in the computer science realm. They’ll say “You can recognize the Colour of bits after all!” rather than “Colour exists after all!” because the idea of Colour not existing in the first place is not within their imagination. The “fact” that Colour is something real is so fundamental to law that it can’t be challenged. Of course Colour exists. We lawyers think about Colour so much that we think we can see it. Why can’t you? Maybe there is something wrong with your eyes. As computer scientists, we need to make clear that Colour simulated by Colour-less tags saying “this is an Orange tag” and such, is still only a simulation. The properties that Colour is supposed to have do not automatically come with the tags, because those properties are Colour, the tags are bits, and bits do not have Colour. Even bits that talk about Colour do not have Colour themselves. There is no such thing as Colour.

Another thing computer scientists will try to do is to treat Colour as a function (in the strict mathematical sense of “function”) of the bits: maybe an uncomputable function (in the strict mathematical sense of “uncomputable”), maybe intractable, but a function nevertheless. We do that either because we mistakenly believe that Colour really is a function, or because we’re a little more sophisticated: we know that it’s not a function, but we think that we can fake it closely enough with a function to get the lawyers off our backs. Either way, the idea is that we should be able to look at bits and somehow determine, from the bits themselves, what Colour they ought to be.

Treating Colour as a function is almost the same as attaching tags to the bits — the difference is that when the Colour is a function of the bits, we don’t have to worry about the tags being detached; on the other hand, when the Colour is a function of the bits, we can never have more than one possible Colour for a given sequence of bits. Monolith depends on exploiting this problem: it assumes that one file can only ever have one Colour, asserts that the Colour of its output file is the “you may copy this” Colour because of the (correct) claim that fixing any other single unchangeable Colour would raise legal problems, and then follows the logic to a claim that it can produce what would otherwise be an illegal copy of the copyrighted input, without breaking copyright law. One Colour per file was never one of the lawyers’ rules of Colour; it’s merely a consequence of “Colour is a function”, and Colour being a function is just something we computer people decided to believe because functions make sense to our training and Colour doesn’t. Colour is not actually a function at all.
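The "one Colour per sequence of bits" consequence can be sketched in a few lines of Python. Everything here is hypothetical: the rule inside `colour_as_function` is arbitrary, and the `Copy` record with its `origin` field is only a simulation kept beside the bits, not something the bits themselves carry.

```python
import hashlib
from dataclasses import dataclass


def colour_as_function(bits: bytes) -> str:
    """Any rule that looks only at the bits; this particular one is arbitrary."""
    return "Orange" if hashlib.sha256(bits).digest()[0] % 2 else "Red"


@dataclass(frozen=True)
class Copy:
    bits: bytes
    origin: str   # out-of-band note on where this copy came from


def colour_from_origin(copy: Copy) -> str:
    """A rule that looks only at provenance and never at the bits."""
    if copy.origin == "pasted from the article":
        return "quotation required"
    return "original"


pasted = Copy(b"The law sees Colour.", "pasted from the article")
independent = Copy(b"The law sees Colour.", "written independently")

# Same bits, so any function of the bits must give the same answer for both.
print(colour_as_function(pasted.bits) == colour_as_function(independent.bits))
# Different histories, so a provenance rule is free to disagree.
print(colour_from_origin(pasted), "/", colour_from_origin(independent))
```

No choice of body for `colour_as_function` can make the first comparison come out false, which is exactly the restriction the lawyers never signed up for.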

Trying to infer the Colour from the bits may seem like an okay thing to do as long as bits are tied to physical objects. You can examine a paper document and determine whether it is an original or a photocopy. You can probably examine something purporting to be a photograph and determine whether it is a photograph of a real scene, or something more complicated. But even in the analog realm, determining Colour by examination is not always possible. You can’t determine by looking at a photograph of two people having sex whether they consented to the sex or not, let alone whether they consented to the making of the photograph. That’s a Colour distinction that is not a function of the bits that make up the photograph — and it’s true even of analog photographs.

Other important questions which you may or may not be able to answer by examining a photograph are “Are those things actually humans, or some kind of simulation?” and “How old are they?” Those questions may have been difficult with analog; they become even more difficult with digital. It is easy to imagine that someone could render by innocent means (drawing or ray tracing or whatever) an image bit-for-bit identical to an image that has the Colour (presumably Pink) of illegal child pornography. In that case, depending on your view of such things, it may matter where the bits came from to the determination of whether they are Pink (illegal) or Green (legal). Identical bits may have different Colour.

Child pornography is an interesting case because I find myself, and I think many people in the computing community will find themselves, on the opposite side of the Colourful/Colour-blind gap from where I would normally be. In copyright I spend a lot of time explaining why Colour doesn’t exist and it doesn’t matter where the bits came from. But when it comes to child pornography, I think maybe Colour should make a difference — if we’re going to ban it at all, it should matter where it came from. Whether any children were actually involved, who did or didn’t give consent, in short: what Colour the bits are. The other side takes the opposite tack: child pornography is dangerous by its very existence, and it doesn’t matter where it came from. They’re claiming that whether some bits are child pornography or not, and if so, whether they’re illegal or not, should be entirely determined by (strictly a function of) the bits themselves. Legality, at least under the obscenity law, should not involve Colour distinctions.

I think computer scientists could actually understand Colour a lot better than we do, because there are places in computer science where Colour does matter. I already mentioned the idea of quoting and plagiarism — identical words are or are not okay to use without quote marks in an academic paper depending on their Colour. Those of us with degrees are able to follow the rules for that because people who aren’t get kicked out of school before finishing their degrees. That’s a general academic application of Colour.

If you’ve any exposure to metrology (not “meteorology”; I mean the science of measurement), you’ll be familiar with the idea of tracing the pedigree of standards. Down in the chemistry lab they have a big jar of buffer solution with a label asserting that it not only has a pH of exactly 7.00, but that its pH is “traceable” to such-and-such primary standard, through a chain that probably terminates at the National Institute of Standards and Technology (NIST, formerly the National Bureau of Standards) in the USA. That’s Colour. Not only do you know the pH of the buffer solution, but you know where it came from. Someone other than NIST might be able to produce a buffer solution that is just as good and just as accurately 7.00 pH. If you have a sample of good pH 7.00 buffer solution it might be indistinguishable from the real traceable standard solution; but it wouldn’t really be the traceable solution unless it had the intangible Colour to make it authentic.

The computer science applications of Colour seem to be mostly specific to security. Suppose your computer is infected with a worm or virus. You want to disinfect it. What do you do? You boot it up from original write-protected install media. Sure, you have a copy of the operating system on the drive already, but you can’t use that copy — it’s the wrong Colour. Then you go through a process of replacing files, maybe examining files, swapping disks around and carefully write-protecting them; throughout, you’re maintaining information on the Colour of each part of the system and each disk until you’ve isolated the questionable files and everything else is known to be the “not infected with virus” Colour. Note that developers of Web applications in Perl use a similar scorekeeping system to keep track of which bits are “tainted” by influence from user input.

When we use Colour like that to protect ourselves against viruses or malicious input, we’re using the Colour to conservatively approximate a difficult or impossible to compute function of the bits. Either our operating system is infected, or it is not. A given sequence of bits either is an infected file or isn’t, and the same sequence of bits will always be either infected or not. Disinfecting a file changes the bits. Infected or not is a function, not a Colour. The trouble is that because any of our files might be infected including the tools we would use to test for infection, we can’t reliably compute the “is infected” function, so we use Colour to approximate “is infected” with something that we can compute and manage — namely “might be infected”. Note that “might be infected” is not a function; the same file can be “might be infected” or “not (might be infected)” depending on where it came from. That is a Colour.
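The "tainted" scorekeeping mentioned above can be imitated in Python. This is a loose sketch in the spirit of that idea, not how Perl implements it: the `Tainted` class, `is_tainted`, and `untaint` are invented names, and only concatenation propagates the mark here (other string methods such as `upper()` quietly return plain strings).

```python
import re


class Tainted(str):
    """A str subclass marking text as influenced by user input."""

    def __add__(self, other):
        if not isinstance(other, str):
            return NotImplemented
        return Tainted(str.__add__(self, other))

    def __radd__(self, other):
        if not isinstance(other, str):
            return NotImplemented
        return Tainted(str.__add__(other, self))


def is_tainted(value: str) -> bool:
    """Report the mark; it depends on the object's history, not its characters."""
    return isinstance(value, Tainted)


def untaint(value: str, pattern: str) -> str:
    """Return a plain str only if the whole value matches `pattern`."""
    if re.fullmatch(pattern, value) is None:
        raise ValueError("value failed validation")
    return str(value)


user = Tainted("alice")             # pretend this arrived from a Web form
greeting = "hello " + user          # the mark spreads through concatenation
clean = untaint(user, r"[a-z]+")    # explicit validation clears it

print(user == "alice", is_tainted(user), is_tainted("alice"))
print(is_tainted(greeting), is_tainted(clean))
```

`user` and the literal `"alice"` compare equal character for character, yet one "might be dangerous" and the other does not. The mark lives in the bookkeeping around the string, not in the string's contents.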

But the “might be infected” Colour is clearly a fictional thing we create to help us approximate a tricky function. It’s still easy to argue that Colour doesn’t really exist. I’ve saved until last what I think is the best example of a Colour in computer science, and I think even the most hardline mathematicians will have to agree that even though this isn’t a function and cannot be represented in bits, it’s something real that we have to be able to think about and care about.

Random numbers have a Colour different from that of non-random numbers. The question of how to determine whether numbers are random or not by looking at them is one of the recurring flame wars of sci.crypt. You can’t do it. Here’s a number: 2. Was that a random number? Well, maybe I got it by rolling a die (a random generator); or maybe I got it by counting my legs (probably not random). If I give you a file of supposedly random bits, there’s no way you can tell whether they are randomly generated or not. The same file could have been generated by a quantum-mechanical random source, monkeys on typewriters, or by encrypting some well-known non-random file with some scheme that may or may not be generally known.

There are statistical tests you can do; for instance, if you look at the file and discover that it contains a copy of the works of Shakespeare, then it doesn’t look much like you would expect randomly generated numbers to look. But it could still be randomly generated. The test tells you whether the file has the statistical properties expected from randomly generated files, not whether the file really is randomly generated or not. It’s not even correct to say “the probability of this being from a random generator is very low” because that’s not true — it either was or was not randomly generated, that’s not open to probability. At best you could say “If we ran a random generator to produce a file this size, the probability of it generating this file would be very low”, which sounds almost the same, but is not.
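To see what such a test does and does not tell you, here is a deliberately crude Python sketch. It is a single bit-counting check with an arbitrary tolerance, standing in for real test batteries; `passes_bit_balance_check`, `hash_counter_stream`, and the sample inputs are all assumptions made for illustration.

```python
import hashlib
import os


def ones_fraction(data: bytes) -> float:
    """Fraction of bits in `data` that are 1."""
    ones = sum(bin(byte).count("1") for byte in data)
    return ones / (8 * len(data))


def passes_bit_balance_check(data: bytes, tolerance: float = 0.01) -> bool:
    """Crude check: roughly half the bits should be 1."""
    return abs(ones_fraction(data) - 0.5) < tolerance


def hash_counter_stream(n_bytes: int) -> bytes:
    """Fully deterministic bytes: SHA-256 of "0", "1", "2", ... concatenated."""
    out = bytearray()
    counter = 0
    while len(out) < n_bytes:
        out += hashlib.sha256(str(counter).encode("ascii")).digest()
        counter += 1
    return bytes(out[:n_bytes])


def log2_probability_of_exact_file(n_bytes: int) -> int:
    """log2 of the chance that a uniform random source emits one particular file."""
    return -8 * n_bytes


N = 100_000
from_os = os.urandom(N)                 # the operating system's random source
from_counter = hash_counter_stream(N)   # no randomness anywhere in its history
text = (b"To be, or not to be, that is the question. " * 2500)[:N]

for name, data in [("os.urandom", from_os), ("hash counter", from_counter), ("text", text)]:
    print(name, passes_bit_balance_check(data), log2_probability_of_exact_file(len(data)))
```

The deterministic stream passes the check, the text fails it, and the last column is identical for all three: a uniform random source is exactly as unlikely to emit any one of these files as any other of the same size. The check describes the bits; it says nothing about where they came from.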

Note my terminology — I spoke of “randomly generated” numbers. Conscientious cryptographers refuse to use the term “random numbers”. They’ll persistently and annoyingly correct you to say “randomly generated numbers” instead, because it’s not the numbers that are or are not random, it’s the source of the numbers that is or is not random. If you have numbers that are supposed to come from a random source and you start testing them to make sure they’re really “random”, and you throw out the ones that seem not to be, then you end up reducing the Shannon entropy of the source, violating the constraints of the one-time pad if that’s relevant to your application, and generally harming security. I just threw a bunch of math terms at you in that sentence and I don’t plan to explain them here, but all cryptographers understand that it’s not the numbers that matter when you’re talking about randomness. What matters is where the numbers came from — that is, exactly, their Colour.
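One of those terms can at least be illustrated with arithmetic. The Python sketch below computes the Shannon entropy, $H = -\sum_i p_i \log_2 p_i$, of a source that emits one uniformly random byte, before and after a well-meaning filter throws away outputs that "don't look random". The filter rule `looks_non_random` is a made-up example, not a practice taken from any real system.

```python
import math


def shannon_entropy(probabilities) -> float:
    """Shannon entropy in bits of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def looks_non_random(byte: int) -> bool:
    """Hypothetical filter: reject bytes whose bits are all equal or strictly alternate."""
    return byte in (0x00, 0xFF, 0x55, 0xAA)


# Unfiltered source: each of the 256 byte values is equally likely.
unfiltered = [1 / 256] * 256

# Filtered source: rejected values never appear, the rest stay equally likely.
kept = [b for b in range(256) if not looks_non_random(b)]
filtered = [1 / len(kept)] * len(kept)

print(shannon_entropy(unfiltered))   # 8.0
print(shannon_entropy(filtered))     # log2(252), a little under 8
```

Each filtered byte carries slightly less than eight bits of uncertainty, and an attacker who knows about the filter knows four values that can never occur. The numbers that were thrown away were perfectly good; what changed was the source.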

So if we think we understand cryptography, we ought to be able to understand that Colour is something real even though it is also true that bits by themselves do not have Colour. I think it’s time for computer people to take Colour more seriously — if only so that we can better explain to the lawyers why they must give up their dream of enforcing Colour inside Friend Computer, where Colour does not and cannot exist. Maybe then they’d stop trying to shoot us as Commie Mutant Traitors.