A Dispatch on: Probability & Collisions

Math can be difficult. Especially frequency based statistics. While it and probabilities are very nice tools for the mind, if one doesn't take time to think deeply about what these numbers actually mean there's a high chance they'll come to some very incorrect assumptions. But let's back up a bit, why am I doing frequency based statistics?

When most people need unguessable, unique identifiers, they reach for UUIDs. And most often for version 4. These are randomly generated with entropy that makes collisions infeasible. At least, that's the theory. For those who want a little more assurance around collision prevention, there are several other UUID versions. For example, versions 1 and 6 use a random node identifier1 and a 100-nanosecond precision timestamp. However, these are meant for legacy systems , and the new version 7 is recommended for sortable, index friendly UUIDs instead. It uses a 48 bit millisecond precision timestamp along with bits for entropy and an optional node identifier. You can even design your own UUID by using version 8 which just hands you 122 bits to do what you'd like with.

Even with all of these versions, I don't particularly like UUIDs. They're large, taking up either 32 or 36 characters. They only have 122 usable bits, since 6 bits are used for version and variant information. If we swap hexadecimal (or base16) encoding for base32 and drop 2 bits, we end up with a 24 or 28 character encoding, which is roughly 25% smaller. If we don't mind length, we can use 160 bits giving us space for additional information while keeping the same length encoded string.

I have some entries coming about identifiers and the design choices that I'm making, but one of the benefits that I had originally contemplated was around collisions. Some of the examples of collisions resistance felt too high to be reasonable. For example, the Wikipedia page for UUIDs mentions that it would take roughly 86 years of generating 1 billion UUIDs per second to have a 50% chance of collision. That is a lot of UUIDs. But also reaching that point in the system would make it virtually unusable, since each new UUID has a high chance of colliding with another one. But what is a reasonable probability?

Perhaps 1%? That does seem low. Maybe 1 in 1 billion? I went with a time based approach: 1 per year. This value changes depending on how many values are being generated per year, but with a rate of 1,000 per second, it's a percentage with 8 zeros at the front of it. It would require 18 trillion identifiers being generated before such a probability of collision is reached, which would require around 570 years. Not bad.

I built a spreadsheet to calculate more of these values. I generated some pretty graphs. I started writing a dispatch on this and to display how brilliant my new identifier design was. And then I threw it all in the trash.

The math that I was doing was too simple. It was difficult for me to wrap my head around how to think about the probabilities correctly. My main issue was that the probability isn't static, it changes with each value that's generated. As we create more, the chance collision with a previous value increases. But also, we can't just convert a probability into something that actually happens once in a certain amount of time. The USGS wrote an article about what a 100-year flood is and why the term is suboptimal. The term "100-year flood" is actually shorthand for "a flood with a 100 year recurrence rate", which itself is a compact way of saying a flood is expected to happen once every 100 years based on historical data. Just because there was a 100-year flood last year, doesn't mean there can't be one this year.

In general, people are bad at understanding probability. See the Monty Hall problem. Or election polls. And it felt like I was, in the moment, bad at understanding probability. If I wanted a system design that eliminated collisions, then I shouldn't rely on probability to do it. Instead, I should design a collision-free coordination-free format. Of course, without using totally randomly generated values, there will be some coordination required for partition identifiers, at least in theory.

I could have used UUID version 7 or version 8 for this, but again, there are reasons why I wouldn't want to do that, like the aforementioned base16 encoding. The code will be a little trickier to write, but the benefits in overall system design feel worth those couple extra lines of code.

Designing a custom identifier and token type is another instance of a decision most would avoid. However, I'm not sure if I would have found my way to RFC 9562 and versions 6, 7, and 8 of UUIDs had I not made that choice. I might have used UUID version 4, called it a day, and found other ways to achieve the others goals that I had. Sometimes it's worth it to explore a new path, even if it eventually leads you back to the well trodden one.


  1. Previously a MAC address, but this is no longer recommended ↩︎