Checksums are SHAping up to be complicated!

I have plenty more beginner DevOps material on the way (see DevOps 101: Hello There!), but I also want to post about problems I’ve run into. So, this is something I’ve been mulling over lately.

The Issue

My team is coordinating with another team on passing files from one AWS S3 bucket to another, and we need to verify that the file at the destination is the same as the file at the source.

Normally, you would probably just rely on S3; it’s reliable, and it does its own checksum validation on its copy operations. However, the work is part of a permanent archive, and verifying file integrity is the core-est of core requirements. Not only do we need to verify the integrity every time we move the file, but we also need to occasionally spot-check our archives while they’re at rest.

Our customer has traditionally used one of the SHA algorithms for file validation. That’s not a problem per se, but calculating a SHA on a very large file (100+ GB is not that unusual) is slow and expensive! We’d rather avoid it as much as possible.
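For a sense of scale, the manual calculation is basically a streaming SHA-256 over the whole file, something like the sketch below (the path and chunk size are placeholders, not our actual setup). There’s no shortcut: every byte has to pass through the hash, which is why a 100+ GB file gets slow and expensive.

    import hashlib

    def sha256_of_file(path, chunk_size=8 * 1024 * 1024):
        """Stream a file through SHA-256 without loading it all into memory."""
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                digest.update(chunk)
        return digest.hexdigest()

    # print(sha256_of_file("/archive/some-100gb-file.bin"))  # hypothetical path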

Potential Solution: S3’s “Checksum”

One solution would be to have AWS calculate the checksum for us as part of its own file-integrity checking. If we could get an AWS-calculated checksum of the full object, I think that would fit our requirements: we could do our spot-checking against that value later.

As it turns out, AWS provides this automatically as a sort of by-product of S3’s copy-object feature. When it calculates the checksum it uses for a copy, it stores that value with the object and makes it available both for its own later use and for the user.
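For example, here’s roughly how you could pull that stored checksum back out with boto3 (a sketch, not our production code; the bucket and key are placeholders). GetObjectAttributes hands back whatever checksum S3 recorded when the object was uploaded or copied with a checksum algorithm enabled, without re-reading the object.

    import boto3

    s3 = boto3.client("s3")

    # Placeholder bucket/key for wherever the archived object lives.
    resp = s3.get_object_attributes(
        Bucket="my-archive-bucket",
        Key="some/huge/file.bin",
        ObjectAttributes=["Checksum", "ObjectParts"],
    )

    # If a checksum algorithm was in play, the stored value comes back here
    # (e.g. ChecksumCRC32C or ChecksumSHA256); no re-read of the data needed.
    print(resp.get("Checksum"))
    print(resp.get("ObjectParts", {}).get("TotalPartsCount"))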

However, if you’re using SHA, AWS doesn’t offer what it calls full-object checksums for multipart uploads; it only offers composite checksums. The distinction, as the documentation puts it, is:

Full object checksums: A full object checksum is calculated based on all of the content of a multipart upload, covering all data from the first byte of the first part to the last byte of the last part.

Composite checksums: A composite checksum is calculated based on the individual checksums of each part in a multipart upload. Instead of computing a checksum based on all of the data content, this approach aggregates the part-level checksums (from the first part to the last) to produce a single, combined checksum for the complete object.

The main problem, as you may have noticed, is that if you’re using multi-part uploads, then the composite checksum of the entire object is going to depend on your chunks being exactly the same size every time a file is moved. When you’re moving between different services, that can be super brittle: a change in one system’s chunk size would affect the ultimate checksum in completely unpredictable ways.
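To make that concrete, here’s a rough local imitation of the composite scheme as I understand it from the docs: hash each part, then hash the concatenation of the binary part digests, and tack on a part count. (S3 reports its values base64-encoded; the mechanics, not the encoding, are the point here.) The same bytes with a different part size produce a different composite value, while a plain full-object digest doesn’t budge.

    import hashlib
    import os

    def composite_sha256(data, part_size):
        """Roughly mimic an S3 composite checksum: hash each part, then hash
        the concatenated part digests, and append the part count."""
        part_digests = [
            hashlib.sha256(data[i:i + part_size]).digest()
            for i in range(0, len(data), part_size)
        ]
        combined = hashlib.sha256(b"".join(part_digests)).hexdigest()
        return f"{combined}-{len(part_digests)}"

    data = os.urandom(1_000_000)             # stand-in for a big archive file

    print(composite_sha256(data, 200_000))   # "uploaded" in 200 KB parts
    print(composite_sha256(data, 250_000))   # same bytes, 250 KB parts: different value
    print(hashlib.sha256(data).hexdigest())  # full-object digest ignores part size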

Full-Object Checksums and Linearization

The reason you can’t do a full-object checksum using SHA is that SHA is a cryptographic hash, deliberately designed so that the output reveals nothing about the structure of the input: change one bit in the original and, by design, the hash changes completely. SHA can’t be linearized, meaning you can’t calculate hashes of different parts independently and then recombine them into the hash of the whole.
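You can see the “change one bit, change everything” behavior in a couple of lines (the input here is just for illustration):

    import hashlib

    original = b"archive contents go here"
    flipped = bytes([original[0] ^ 0x01]) + original[1:]   # flip one bit in the first byte

    print(hashlib.sha256(original).hexdigest())
    print(hashlib.sha256(flipped).hexdigest())
    # The two digests share no usable structure, and there's no way to build
    # the digest of the whole file out of digests of its pieces.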

This brings us to Cyclic Redundancy Check (CRC) algorithms. These are also error-detection algorithms, but they really are one giant division problem: the file is treated as an enormous polynomial (over GF(2)), and the CRC is essentially the remainder after dividing it by a fixed generator polynomial. And, importantly, remainders combine: take the remainders from n division problems, add them together, take the remainder of that total, and you get the same answer as taking the remainder of the sum of the original numbers. So, in a super-simple example, if you want the remainder from 150 % 4, you could do it this way:

  • 150 % 4 = ??
  • 100 % 4 = 0 and 50 % 4 = 2
  • (0 + 2) % 4 = 2, therefore 150 % 4 = 2

It’s a roundabout way of calculating such a small remainder, but if you have a thousand file chunks, you can find the CRC of the whole file in essentially the same way. (The real math is polynomial division rather than integer division, and each chunk’s remainder has to be adjusted for where that chunk sits in the file, but the combine-the-remainders idea is the same.)
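Here’s a small demonstration using Python’s built-in zlib CRC-32: given only the CRC of each part and the length of the second part, you can recover the CRC of the whole, without touching the original data again. I’m doing the positional adjustment the lazy way, by running zero bytes through the CRC; real combiners (zlib itself has a crc32_combine) apply the same shift in logarithmic time.

    import zlib

    def crc32_combine(crc_a, crc_b, len_b):
        """Combine the CRC-32s of two adjacent parts into the CRC-32 of the whole.

        Feeding len_b zero bytes through the CRC shifts crc_a into position;
        XORing in crc_b then yields the CRC of the concatenation.
        """
        shifted_a = zlib.crc32(b"\x00" * len_b, crc_a ^ 0xFFFFFFFF) ^ 0xFFFFFFFF
        return shifted_a ^ crc_b

    part_1 = b"first half of a big archive file..."
    part_2 = b"...and the second half of that file"

    crc_1 = zlib.crc32(part_1)
    crc_2 = zlib.crc32(part_2)

    print(hex(crc32_combine(crc_1, crc_2, len(part_2))))
    print(hex(zlib.crc32(part_1 + part_2)))   # same value, computed the long way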

So, that’s why AWS only offers full-object checksums for the CRC algorithms: S3 can combine the part-level CRCs into a checksum of the whole object without re-reading the data, and it doesn’t want to compute a whole SHA over a reassembled object any more than we do.

What does that mean for us?

Obviously, I think that replacing our manual SHA calculations with CRC full-object checksums would save a lot of compute (and therefore both time and money).

It’s still an open question whether switching to CRCs will satisfy our requirements. There might also be weird legacy issues that crop up after years of relying on one specific algorithm.

Let me know if anyone has thoughts on this issue or, especially, if I’ve gotten things wrong.

Comments


  1. Miguel A. Tovar

    Sounds like there are really only two ways to tackle this problem:
    – either change the algorithm used for the checksum (as you’ve done, by using CRC instead of SHA),
    – or change the size of the file to make it more manageable.
    You could compress the file to make the size more manageable and then take the SHA, and this should remain consistent as long as the compression method stays the same. Or perhaps engineer your own solution, like writing a script that breaks a huge file down into multiple files, takes the SHA of each individual file, places those hashes into a file, and then SHAs that file to create one hash. A change in the original file should lead to a change in the final hash. The script could then be shared with the teams that need to verify the validity of the file.


      1. Andrew Kramer

      If you mean ensuring that our files are always below the 5 GB threshold for a single-part copy, I don’t think that’s always going to be possible. So we’ll need something that accounts for multi-part copies, even if they’re relatively rare.

      I agree with you that compression might be a good idea, for a variety of reasons. Not sure how that would affect other business requirements, though. And compressed or not, we still have the same essential problem of double-checking the integrity unless, as you say, we can always get under that threshold.

