NOMAD Gids
For identifiers in NOMAD we want to rely, as much as possible, on something that depends only on the data. Cryptographic checksums have the nice property that it is infeasible to find two different values that have the same checksum (a collision). Still, computing the checksum of exactly the same bit sequence always gives the same result. So, for example, the checksum of a json dictionary can collide with the checksum of a file: a file containing exactly that json will have the same checksum. To avoid this, we use the strings given in the table below to identify the type of object whose checksum is computed.
| Prefix | Type | Description of the checksummed data |
|---|---|---|
| f | File content | content of a file |
| d | Directory content | sorted json dictionary with the ‘f’ and ‘d’ gids of the content of the directory |
| F | File and dates | sorted json dictionary containing the creation and modification dates, along with the ‘f’ gid of the data in the file |
| D | Directory and dates | sorted json dictionary containing the creation and modification dates, along with the ‘F’ and ‘D’ gids of the content of the directory |
| R | Raw data archive | ‘D’ gid of the archived directory, replacing ‘D’ with ‘R’ |
| S | Parsed data archive | ‘R’ gid of the raw data archive, replacing ‘R’ with ‘S’ |
| N | Normalized data archive | ‘R’ gid of the raw data archive, replacing ‘R’ with ‘N’ |
| C | Calculation | main file URI of the calculation |
| p | Meta info | dictionary with the meta info and the gids of the direct dependencies in the keys meta_parent_section_gid and meta_abstract_types_gid |
Table with the currently defined prefixes for NOMAD gids. The nomad_gid_type meta info contains an always up-to-date list.
To avoid such collisions, one could add the type string, followed by a separator, to the data to be checksummed. A simple collision would then be infeasible, because the checksummed data would differ in the type string. The resulting identifier, however, would be fully opaque: its type could not be determined from the identifier itself without looking it up.
Instead the type string is used as an explicit prefix of the checksum. This makes the gid longer (currently only one extra character at the beginning), but has several advantages:
- The type of the identified object is immediately clear from the gid, without having to look it up first.
- Related types might calculate the checksum on the same data. In this case the gid of one can be generated from the gid of the other simply by changing the prefix (given that the checksum will be the same).
- As the type is at the beginning, one can shorten the gid and still have an identifier that has a large likelihood of being unique.
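The first two advantages can be sketched in a few lines. This is only an illustration: the helper names `gid_type` and `change_gid_type` are hypothetical and not part of any NOMAD API, and the prefix set is copied from the table above rather than read from the nomad_gid_type meta info.

```python
# Hypothetical helpers, for illustration only (not NOMAD's own code).
# Prefixes copied from the table of currently defined gid types.
KNOWN_PREFIXES = set("fdFDRSNCp")


def gid_type(gid: str) -> str:
    """Return the one-character type prefix of a gid."""
    prefix = gid[:1]
    if prefix not in KNOWN_PREFIXES:
        raise ValueError(f"unknown gid prefix: {prefix!r}")
    return prefix


def change_gid_type(gid: str, new_prefix: str) -> str:
    """Derive the gid of a related type that checksums the same data."""
    if new_prefix not in KNOWN_PREFIXES:
        raise ValueError(f"unknown gid prefix: {new_prefix!r}")
    gid_type(gid)  # validates the existing prefix
    return new_prefix + gid[1:]


# Made-up placeholder gid (the 28 'x' characters are not a real checksum).
raw_gid = "R" + "x" * 28
parsed_gid = change_gid_type(raw_gid, "S")
```

Here the parsed data archive gid is obtained from the raw data archive gid without recomputing any checksum, as described for the ‘R’ and ‘S’ types in the table.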
The object to be checksummed often has a json representation. In this case we always use the checksum of the json serialized after sorting the keys in all dictionary objects, written as the most compact json possible (no spaces or newlines).
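A minimal sketch of such a serialization, using Python's standard `json` module. The function name `canonical_json` is hypothetical, and the handling of non-ASCII characters (escaped by `json.dumps` by default) and the UTF-8 encoding are assumptions, since the text above does not specify them.

```python
import json


def canonical_json(obj) -> bytes:
    """Serialize obj as compact json with sorted keys (hypothetical helper)."""
    # sort_keys=True sorts the keys of every dictionary, including nested ones;
    # separators=(",", ":") removes all optional whitespace.
    text = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    # Assumption: the bytes to checksum are the UTF-8 encoding of the text.
    return text.encode("utf-8")


example = canonical_json({"b": 1, "a": {"d": [1, 2], "c": None}})
# example == b'{"a":{"c":null,"d":[1,2]},"b":1}'
```

Two dictionaries with the same content but different key order serialize to the same bytes, and therefore get the same checksum.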
As checksum we use the SHA-512 hash function, because on 64-bit hardware it is faster to compute than SHA-256 even for relatively short messages. For our purposes 512 bits are too many, as they lead to unwieldy long identifiers. SHA-512 can be truncated without losing its good properties (aside from the obvious reduction of the size of the target domain and the correspondingly higher collision probability). Hence we use its first 168 bits, encoded with the URL-safe base 64 encoding (base64url, RFC 4648) without padding. This produces checksums of 28 characters.
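The steps above can be sketched with Python's standard library. The function names `checksum168` and `make_gid` are hypothetical, chosen for this example; the procedure follows the description in the text (SHA-512, first 168 bits, base64url without padding, one-character prefix).

```python
import base64
import hashlib


def checksum168(data: bytes) -> str:
    """First 168 bits of SHA-512, base64url encoded (hypothetical helper)."""
    digest = hashlib.sha512(data).digest()  # 64 bytes
    truncated = digest[:21]                 # 168 bits = 21 bytes
    encoded = base64.urlsafe_b64encode(truncated).decode("ascii")
    # 21 bytes are a multiple of 3, so no '=' padding is produced;
    # rstrip is kept only to make the "without padding" rule explicit.
    return encoded.rstrip("=")


def make_gid(prefix: str, data: bytes) -> str:
    """Build a gid from a one-character type prefix and the data."""
    if len(prefix) != 1:
        raise ValueError("prefix must be a single character")
    return prefix + checksum168(data)


file_gid = make_gid("f", b"content of a file")
```

Since 168 bits map to exactly 28 base64 characters (6 bits each), the resulting gid is 29 characters long including the prefix.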
168 bits might not be enough for some cryptographic applications, but for practical purposes the collision probability is so small that it can be neglected. In fact, only recently, and with a large effort, has it been possible to find collisions in SHA-1 [sha1Collision], whose output is 160 bits long. It can be safely assumed that different data will have different checksums.
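To give a feeling for how small the probability of an accidental collision is, the following sketch evaluates the usual birthday bound. It assumes an ideal hash whose 168-bit outputs are uniformly distributed, and the figure of 10^12 objects is an arbitrary illustration, not a NOMAD statistic. It says nothing about deliberately constructed collisions.

```python
def collision_probability_bound(n_objects: int, bits: int = 168) -> float:
    """Upper bound on the chance that any two of n_objects checksums collide.

    Union bound over all pairs: p <= n * (n - 1) / 2 / 2**bits,
    assuming ideal, uniformly distributed checksums.
    """
    pairs = n_objects * (n_objects - 1) // 2
    return pairs / 2**bits


# Illustrative only: a trillion distinct objects.
p = collision_probability_bound(10**12)
```

Under these assumptions `p` is of the order of 1e-27, which supports treating the checksums as unique in practice.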
Thus a NOMAD gid is a unique identifier made of one of the prefixes listed above and a checksum encoded as described. It is relatively compact (29 characters), can be used for sharding (excluding the prefix, every character of the checksum should be uniformly distributed, so it can be used to split work for parallelization), can be truncated, and is self-describing. This makes it a good choice for global identifiers.
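As an illustration of the sharding use, the sketch below assigns a gid to a shard from the first two checksum characters. The function `shard_of` and the choice of two characters are assumptions made for this example and do not describe how NOMAD actually distributes work.

```python
import string

# base64url alphabet of RFC 4648, in value order (A=0 ... _=63).
B64URL_ALPHABET = (
    string.ascii_uppercase + string.ascii_lowercase + string.digits + "-_"
)


def shard_of(gid: str, n_shards: int) -> int:
    """Map a gid to a shard index in range(n_shards) (hypothetical helper)."""
    # Skip the type prefix (gid[0]); it is not uniformly distributed.
    high = B64URL_ALPHABET.index(gid[1])
    low = B64URL_ALPHABET.index(gid[2])
    value = high * 64 + low  # 12 bits, 0..4095
    return value % n_shards


# Made-up placeholder gids (not real checksums).
first = shard_of("f" + "AA" + "x" * 26, 16)
last = shard_of("f" + "__" + "x" * 26, 16)
```

The distribution over shards is exactly even only when `n_shards` divides 4096; for other values it is slightly skewed, and more checksum characters can be used to reduce the skew.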