I have plenty more beginner DevOps material to share (see DevOps 101: Hello There!), but I also want to post about problems I’ve run into. This is something I’ve been mulling over a bit lately.
The Issue
My team is coordinating with another team on passing files from one AWS S3 bucket to another, and we need to verify that the file at the destination is the same as the file at the source.
Normally, you would probably just rely on S3; it’s reliable, and it does its own checksum validation on its copy operations. However, the work is part of a permanent archive, and verifying file integrity is the core-est of core requirements. Not only do we need to verify the integrity every time we move the file, but we also need to occasionally spot-check our archives while they’re at rest.
Our customer has traditionally used one of the SHA algorithms for file validation. That’s not a problem per se, but calculating a SHA on a very large file (100+ GB is not that unusual) is slow and expensive! We’d rather avoid it as much as possible.
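For context, even the straightforward way to get that SHA means streaming every byte of the file through the hash. Here’s a minimal sketch of what our manual calculation boils down to (the path and chunk size are just placeholders):

```python
import hashlib

def sha256_of_file(path: str, chunk_size: int = 8 * 1024 * 1024) -> str:
    """Stream the file through SHA-256; every byte has to be read, in order."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

# On a 100+ GB object, this is dominated by reading the whole file end to end.
print(sha256_of_file("/archive/big-file.dat"))
```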
Potential Solution: S3’s “Checksum”
One solution would be to have AWS calculate the checksum for us as part of its own file integrity checking. If we could get an AWS-calculated checksum of the full object, we could use that for our spot-checks later, and I think that might fit our requirements.
As it turns out, AWS automatically provides this as a sort of by-product of S3’s copy object feature. When S3 calculates the checksum it uses for a copy, it stores that value and makes it available both for its own later use and for the user.
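In boto3, pulling that stored checksum back out is a one-call affair. Here’s a sketch, assuming the object was uploaded or copied with a checksum algorithm enabled (bucket and key names are made up):

```python
import boto3

s3 = boto3.client("s3")

# Ask S3 for the checksum metadata it already stored for this object.
attrs = s3.get_object_attributes(
    Bucket="my-archive-bucket",
    Key="big-file.dat",
    ObjectAttributes=["Checksum", "ObjectSize"],
)

# Whatever S3 has on file, e.g. a ChecksumSHA256 or ChecksumCRC32 value.
print(attrs.get("Checksum"))
```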
However, AWS doesn’t offer what it calls full-object checksums if you’re using SHA; it only offers composite checksums. The distinction, as the documentation puts it, is:
Full object checksums: A full object checksum is calculated based on all of the content of a multipart upload, covering all data from the first byte of the first part to the last byte of the last part.
Composite checksums: A composite checksum is calculated based on the individual checksums of each part in a multipart upload. Instead of computing a checksum based on all of the data content, this approach aggregates the part-level checksums (from the first part to the last) to produce a single, combined checksum for the complete object.
The main problem, as you may have noticed, is that if you’re using multipart uploads, the composite checksum of the entire object depends on your parts being exactly the same size every time the file is moved. When you’re moving between different services, that can be super brittle: a change in one system’s part size produces a different composite checksum even though the underlying bytes haven’t changed at all.
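To make that concrete, here’s a toy version of the composite idea: hash each part, then hash the concatenated part digests. (This is the shape of the scheme, not S3’s exact byte layout.) Same bytes, different part size, different checksum:

```python
import hashlib

def composite_sha256(data: bytes, part_size: int) -> str:
    """Hash each part, then hash the concatenation of the part digests."""
    part_digests = b"".join(
        hashlib.sha256(data[i:i + part_size]).digest()
        for i in range(0, len(data), part_size)
    )
    return hashlib.sha256(part_digests).hexdigest()

blob = b"x" * 1_000_000
print(composite_sha256(blob, 100_000))  # uploaded in 100 KB parts
print(composite_sha256(blob, 250_000))  # same bytes in 250 KB parts: different result
```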
Full-Object Checksums and Linearization
The reason you can’t do a full-object checksum using SHA is that SHA is a cryptographic hash: it processes the file as one long chain, feeding the state from each block into the next, and it’s designed so that changing one bit anywhere changes the final hash completely. SHA can’t be linearized, meaning you can’t calculate the hashes of different parts independently and then recombine them into the hash of the whole.
This brings us to Cyclic Redundancy Check (CRC) algorithms. These are also error-detection algorithms, but they’re calculated more or less like the remainder of a giant division problem. And, importantly, remainders play nicely with addition: if you take the remainders of n separate division problems, add them together, and take the remainder of that sum, you get the same answer as taking the remainder of the whole sum at once. So, in a super-simple example, if you want the remainder from 150 % 4, you could do it this way:
- 150 % 4 = ??
- 100 % 4 = 0 and 50 % 4 = 2
- (0 + 2) % 4 = 2, therefore 150 % 4 = 2
It’s a roundabout way of calculating such a small remainder, but if you have a thousand file chunks, you can find the CRC of the whole file in essentially the same way: compute each part’s CRC independently, then combine them. (The one wrinkle is that each part’s CRC has to be “shifted” by the length of the data that follows it, the same way the 1 in 150 really contributes 100 because of its place value; CRC combine functions handle that bookkeeping.)
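Here’s what that combining step looks like for CRC-32. The C zlib library ships a crc32_combine() for exactly this; as far as I know Python’s zlib module doesn’t expose it, so this sketch fakes the “shift” by feeding zero bytes through the first CRC (fine for a demo, not something you’d do for multi-gigabyte parts):

```python
import zlib

def crc32_combine(crc_a: int, crc_b: int, len_b: int) -> int:
    """Combine independently computed CRC-32s of A and B into the CRC-32 of A + B.

    Running len_b zero bytes through crc_a "shifts" it past B's position; the
    extra all-zeros CRC term cancels out zlib's init/final conditioning.
    """
    zeros = b"\x00" * len_b
    return zlib.crc32(zeros, crc_a) ^ crc_b ^ zlib.crc32(zeros)

part_a = b"first chunk of a much larger file..."
part_b = b"...second chunk of that same file"

combined = crc32_combine(zlib.crc32(part_a), zlib.crc32(part_b), len(part_b))
assert combined == zlib.crc32(part_a + part_b)  # matches, without reading the whole thing at once
```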
So, that’s why AWS only offers full-object checksums for the CRC algorithms: the part-level CRCs can be combined into the whole-object value after the fact, whereas a full-object SHA would mean re-hashing the entire object, and AWS doesn’t want to do that any more than we do.
What does that mean for us?
I obviously think using CRCs and full-object checksums to replace our manual calculations would save a lot of compute (and therefore both time and money).
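For what it’s worth, asking S3 to do the work at upload time looks like a very small change on our side. A sketch with boto3 (bucket, key, and the choice of CRC32 are placeholders, and multipart uploads have their own knobs I’m glossing over here):

```python
import boto3

s3 = boto3.client("s3")

# Have S3 compute and store a CRC alongside the object as it lands.
with open("/archive/big-file.dat", "rb") as f:
    resp = s3.put_object(
        Bucket="my-archive-bucket",
        Key="big-file.dat",
        Body=f,
        ChecksumAlgorithm="CRC32",
    )

print(resp.get("ChecksumCRC32"))  # S3's stored value, available for later spot-checks
```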
It’s still an open question whether switching to CRCs will satisfy our requirements. There might also be weird legacy issues that crop up after relying on one specific algorithm for years.
Let me know if anyone has thoughts on this issue or, especially, if I’ve gotten things wrong.