Multiset Hashes (MSH)
Suppose we had an unordered set of possibly repeating elements, and we wanted to be able to efficiently update its hash without needing to keep the entire set around in memory. That's exactly what a multiset hash does.
Why a multiset hash, not a set hash? Two reasons - firstly, because when we're incrementally updating the hash, we can't know if the element we're adding already exists in the set if we only have the hash, so we instead consider the number of times the element appears in the set.
Secondly, we use them because collections in the Maru programming model are multisets. In particular, for "linear" operators that don't require an arrangement (e.g. map, filter, sum, count) we won't need to use merkle trees at all, which are quite expensive to verify and update in a SNARK/STARK.
A Multiset is a set of elements such that, for each element, we keep track of its "multiplicity" - that is, the number of times the element appears in the set. This is different from a set, because sets do not consider "duplicates" - adding an element to a set that already contains will not change . If was a multiset, it would change. For more background on multiset hashes, check out this awesome article by Lúcás Meier.
The Hash Function
In Maru, we use the following hash function from Clarke et.al:
Where arithmetic is done over a field in which the discrete log problem is hard and is a cryptographic hash function whose output is an element in the field, and denotes the multiplicity of in the multiset .
We can then incrementally update the hash with a (possibly single-element) set of new values to add as follows, without keeping the entirety of in memory:
An implementation of this function can be found at https://github.com/proxima-one/multiplicity.
There's just one problem with the hash function as it was introduced by Clarke et.al - it requires a really big field to be secure (~4000 bits), and that's way too big to put into a starky STARK or plonky2 SNARK, whose native field is only 64 bits.
Luckily we can do the same thing in an elliptic curve group instead of a field:
The only difference here is that is now a hash function which outputs a point on the elliptic curve, not the field. This allows us to use a much smaller elliptic curve representation, which should be cheaper in a starky STARK or plonky2 SNARK. Lúcás Meier wrote a concrete implementation of this version over the Ristretto curve.
The "Goldilocks Curve"
Since we're doing this in STARKs and SNARKs, we want to avoid non-native field arithmetic. Even though the elliptic curve representation is smaller, non-native field arithmetic is quite expensive. Luckily, Thomas Pornin found a curve whose base field is a degree-5 extension of goldilocks, the native field for both starky and plonky2. While this curve is technically called ecgfg5, we call it the "goldilocks curve" because its value to us is that it's base field is a native extension.
An implementation of the multiset hash using this curve will allow us to avoid non-native field arithmetic when performing multiset hashes, which will make our STARK/SNARK circuits much more efficient.
A safe-rust implementation of the curve can be found here - in practice we'll want to build atop plonky2's implementation of the quintic goldilocks extension instead since we'll be using it in circuits.