Hey all, I’ve been working with @jamiebrynes on getting a Rust SDK working. We’ve been looking at setting things up such that it’s possible to distribute schema files along with worker code using Rust’s package manager, and in doing so came across some questions about packaging and distributing schema files.
In particular, it seems like the expected way of assigning component IDs in schema files (i.e. manually assigning IDs to components, with related components being given a contiguous range of ID values) doesn’t mesh well with having arbitrary third-party schema files included in a project. When I’m defining a component, how can I be sure that an ID that I’m using now isn’t being used by another package that I’m pulling in? Even if I can be sure that the ID doesn’t conflict with any packages I’m currently depending on, I also need to be sure that it won’t conflict with any dependency I pull in at any point in the future of my project.
The way this issue is generally solved in distributed systems is to have random (or effectively random) values generated within a valid range that’s so large as to make collisions a statistical impossibility. Git hashes, for example, are 160 bits (and will at some point switch to being 256 bits), allowing a distributed network of users to make commits independently without needing a central authority to generate unique commit IDs. Similarly, the UUID format is a family of methods for generating pseudo-random 128-bit values with an extremely low likelihood of collision, specifically for the purpose of generating collision-free ID values in distributed systems.
I’ve started randomly generating component IDs for my schema files, but in doing so I learned that the valid range of component IDs is far too small to reliably avoid collisions. Specifically, component IDs are signed 32-bit integer values that must be:
- Greater than 100.
- Less than 536,870,911.
- Not in the range 190,000 to 199,999.
This gives a total of 536,860,810 possible component IDs (536,870,810 values in the open range, minus the 10,000 reserved ones), which seems like a surprisingly small range, at least considering that other systems use 128-bit IDs and larger.
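For concreteness, here’s a sketch of what generating a random component ID under those rules might look like. The constants and the `is_reserved` helper are my own naming, and the tiny xorshift PRNG is just a stand-in to keep the example dependency-free; a real implementation would likely use the `rand` crate instead:

```rust
// Bounds taken from the rules above. These names are illustrative, not
// from any official SDK.
const MIN_ID: i32 = 101; // must be greater than 100
const MAX_ID: i32 = 536_870_910; // must be less than 536,870,911

/// The reserved block that component IDs must avoid.
fn is_reserved(id: i32) -> bool {
    (190_000..=199_999).contains(&id)
}

/// Minimal xorshift64 PRNG; a placeholder for a proper RNG.
struct XorShift(u64);

impl XorShift {
    fn next(&mut self) -> u64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        self.0
    }
}

/// Draw IDs uniformly from [MIN_ID, MAX_ID], rejecting reserved values.
/// (Modulo bias is negligible at this scale; fine for a sketch.)
fn random_component_id(rng: &mut XorShift) -> i32 {
    let span = (MAX_ID - MIN_ID + 1) as u64;
    loop {
        let candidate = MIN_ID + (rng.next() % span) as i32;
        if !is_reserved(candidate) {
            return candidate;
        }
    }
}

fn main() {
    // Seed from the clock; xorshift requires a non-zero seed.
    let seed = std::time::SystemTime::now()
        .duration_since(std::time::UNIX_EPOCH)
        .unwrap()
        .as_nanos() as u64
        | 1;
    let mut rng = XorShift(seed);
    let id = random_component_id(&mut rng);
    println!("generated component ID: {}", id);
}
```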
This article talks about how to calculate the probability of collision with randomly generated IDs. Using the same formula the article does, I got an estimate of a 0.18% chance that a newly generated ID collides with an existing one, given an ecosystem of 1,000,000 components. While those odds aren’t super high, that’s still orders of magnitude more likely than what I would expect in order for distribution of shared schemas to be considered reliable.
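The arithmetic is easy to reproduce. The sketch below assumes the ~0.18% figure refers to the chance that one new random ID collides with any of a million existing IDs (n/N), and also shows the birthday-problem approximation for a collision occurring anywhere among the million IDs; the exact constant for the ID space is rounded:

```rust
fn main() {
    // Approximate number of valid component IDs, per the range above.
    let id_space: f64 = 5.37e8;
    // Hypothetical ecosystem size used in the post.
    let n: f64 = 1_000_000.0;

    // Chance that a single newly generated ID collides with one of the
    // n existing IDs: just under 0.2%, matching the ~0.18% estimate.
    let single_new = n / id_space;

    // Birthday approximation: chance of at least one collision somewhere
    // among n independently generated IDs. At this n/N ratio the exponent
    // is around -930, so this is effectively certain.
    let any_pair = 1.0 - (-n * (n - 1.0) / (2.0 * id_space)).exp();

    println!("single new ID collision chance: {:.4}%", single_new * 100.0);
    println!("any collision among 1M IDs: {:.4}", any_pair);
}
```

Either way you slice it, a 536-million-value space is nowhere near the 2^128 space that UUIDs rely on to make collisions negligible.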
Okay, now that I’ve finished my initial investigation and shown all of my work, I can get to the questions I actually wanted to ask:
First: Is there already an expected method for generating component IDs and distributing shared schemas that I’m not aware of? If this is already a solved problem, I apologize for going on so long about a non-issue.
Second: If randomly generating component IDs is the expected way of coming up with them, would it make sense to increase the width of the ID value? Switching to 32-bit unsigned integers and allowing the full range of values would already be an improvement. For even better future-proofing, going up to 64-bit unsigned integers seems like a good idea.