Tag: aws

  • A little self-congratulations

    A quick personal update: I didn’t really intend to post nothing for the past week or so, but I’ve been really head-down focused on a couple things. One of them was this:

    I took my AWS Certified DevOps Engineer – Professional exam on Thursday, and I passed! I’m not really allowed to say much about the exam itself, but I will say that the questions are challenging, even on topics I felt I had an excellent grasp of.

    Anyway, I took yesterday to celebrate by watching my favorite soccer team, and I’ll be back to posting substantive things tomorrow!

  • Multi-Cloud Infrastructure as Code?

    I’m going to do an uncomfortable thing today: I was thinking about a problem, and I’m just going to share my thoughts before I research it. Then, I’ll actually do the research and refine things a bit. The goal is to show the thought process and learning.


    Courtesy: Wikimedia Commons

    One of the main selling points of HashiCorp’s Terraform is that it can be used for multi-cloud deployments. The benefits of this type of deployment are significant:

    • If one provider has an outage, you can simply use your infrastructure in a different provider.
    • You’re far more resistant to vendor lock-in. If a provider raises its prices, you aren’t stuck there.

    The problem of vendor lock-in is huge. Wherever I’ve worked, there’s always this pervasive background question: “Well, what if we wanted to go with Google instead?” And the answers have been unsatisfying. Realistically, the answer is sort of: “Well, we could start over and re-design all this infrastructure for the new platform.”

    If you look at production Terraform, it’s going to be full of resources such as aws_s3_bucket, which is by definition tied to one specific cloud provider.

    So how can you have Infrastructure as Code (IaC) for multi-cloud deployments, when all your resources are tied to a specific provider?

    One solution (and the one that HashiCorp probably recommends) would be to abstract your infrastructure into generic modules that implement your intentions in each of the cloud providers’ specific resources.

    The user would specify “I want a storage container that is readable to the public and highly available.” The module would then be able to create such a container in AWS, Azure, GCP, or wherever you needed to deploy it.

    So you’d have a module that looked maybe something like this:

    # Module "multicloud_storage"

    variable "cloud_provider" {
    type = "string"
    }

    resource "aws_s3_bucket" "main_bucket" {
    count = var.cloud_provider == "aws" ? 1 : 0
    ...
    }

    resource "azurerm_storage_blob" "main_bucket" {
    count = var.cloud_provider == "azure" ? 1 : 0
    ...
    }

    Disclaimer: Please don’t use this code. It’s exactly as untested as it looks.

    Note the awkward count field on every resource block. I think you could probably make a generic module like this work, but you’d have to implement it in every provider you wanted to support.

    But the configurations for the different providers’ storage systems don’t match up one-to-one. Take, for example, the access tier of your storage: how available the objects are and how quickly they can be accessed. AWS S3 has at least nine, plus an Intelligent-Tiering option, whereas Azure uses hot, cool, cold, and archive. In our hypothetical multi-cloud module, we probably want to abstract this away from the user. We might do something like this:

    Module         Azure Blob   AWS S3
    Lava           Hot          S3 Express One Zone
    Hot            Hot          S3 Standard
    Cool           Cool         S3 Standard-IA
    Cold           Cold         S3 Glacier Instant Retrieval
    Glacier        Archive      S3 Glacier Flexible Retrieval
    Micro-Kelvin   Archive      S3 Glacier Deep Archive

    This would allow us to offer the available storage classes in both providers, though the actual storage tier chosen is somewhat obscured from the user.
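
    In code, that abstraction mostly boils down to a lookup table per provider. Here’s a rough sketch in Python (rather than HCL) just to show the shape of it; the module-side tier names are the made-up ones from the table above, and the provider-specific values are the S3 StorageClass strings and Azure Blob access tiers as best I recall them:

    TIER_MAP = {
        # module tier   ->  provider-specific storage class / access tier
        "lava":         {"aws": "EXPRESS_ONEZONE", "azure": "Hot"},
        "hot":          {"aws": "STANDARD",        "azure": "Hot"},
        "cool":         {"aws": "STANDARD_IA",     "azure": "Cool"},
        "cold":         {"aws": "GLACIER_IR",      "azure": "Cold"},
        "glacier":      {"aws": "GLACIER",         "azure": "Archive"},
        "micro_kelvin": {"aws": "DEEP_ARCHIVE",    "azure": "Archive"},
    }

    def provider_tier(tier: str, cloud_provider: str) -> str:
        """Translate the module's generic tier into what the provider expects."""
        return TIER_MAP[tier][cloud_provider]

    In Terraform itself this would presumably be a locals map and a lookup(), but the idea is the same.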

    But what about features that exist in one provider and not another? For example, S3 offers Transfer Acceleration to speed up transfers to and from the bucket, whereas Azure’s Blob seems to rely mainly on parallel uploads for performance.

    Then we get to whole types of resources that exist in one provider but not another, to leaky abstractions, and to the poor juice-to-squeeze ratio of maintaining implementations for lesser-used or highly specific resource types like QuickSight.

    I’m about to end the rambling, self-reflecting portion of this post and do some actual research. I hope that someone has created modules like this that allow the same infrastructure to work for multi-cloud deployments. My intuition is that it’s too unwieldy.

    Here I go into the Internet. Fingers crossed!


    Hey, it’s me. I’m back, fifteen minutes later.

    I didn’t find a ton. There’s a smattering of tools that claim to handle this.

    For example, Pulumi, an Infrastructure as Code tool and alternative to Terraform, says that it handles multi-cloud deployments natively. I’d be interested in learning more.

    I found several articles offering a guide to multi-cloud Terraform modules. I did not, however, find any well-maintained modules themselves.

    The void feels a little weird to me: there’s obviously a great need for this sort of module. It’s the sort of problem that the open source community has traditionally been good at solving. Like I said before, my intuition is that this is a very difficult (expensive) problem, so maybe the cost just outweighs the demand?

    One Stack Overflow post mentioned that one of the reasons people don’t share Terraform in open source is that it makes it easy to find vulnerabilities in your infrastructure. (But isn’t that supposed to be a strength of open source: crowdsourcing the identification and correction of these vulnerabilities?) Anyway, extrapolating a bit: this reluctance to share infrastructure might also be a huge barrier to making such a multi-cloud module.

    If I were going to implement something professionally, I’d do a lot more than fifteen minutes of research. But, gentle reader, it looks bleak out there. Let me know if there’s anything good out there that I missed.

  • No! Shut them all down, hurry!

    Luke Skywalker: [interrupting] Will you shut up and listen to me! Shut down all the garbage mashers on the detention level, will ya? Do you copy? Shut down all the garbage mashers on the detention level! Shut down all the garbage mashers on the detention level!


    C-3PO: [to R2-D2] No! Shut them all down, hurry!

    The Official AWS Blog (great resource, by the way!) has a post describing how to reduce costs in EC2 by automatically shutting down instances while not in use.

    The short version of the blog post is that you can achieve this sort of shutdown in two ways:

    1. Create a CloudWatch alarm that stops the instance when CPU usage stays below a threshold for a period of time.
    2. Create an EventBridge trigger and a Lambda function that stop instances on a schedule.
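
    For option #1, my understanding is that the alarm can use one of the built-in EC2 stop actions. A boto3 sketch, with the instance ID, region, and thresholds all made up:

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Stop the instance after an hour of near-idle CPU. The "automate" ARN is
    # the built-in alarm action for stopping an instance; match it to your region.
    cloudwatch.put_metric_alarm(
        AlarmName="stop-idle-test-instance",
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
        Statistic="Average",
        Period=300,
        EvaluationPeriods=12,
        Threshold=5.0,
        ComparisonOperator="LessThanThreshold",
        AlarmActions=["arn:aws:automate:us-east-1:ec2:stop"],
    )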

    I would argue that in most deployments you’d want a more precise metric, one that actually reflects the number of incoming HTTP requests.

    There are other guides on how to do this; I’ve looked at some, as I’ve been planning to do this for our Fargate instances (not EC2, obviously, but similar enough) in our test environments. However, it’s nice to have an official source on how to do this kind of shutdown.

    The reason we want to do this on my project is to save on cloud costs. The savings probably aren’t that much, but they come from an area of the budget that is limited and needed for other things.

    At any rate, option #2 better reflects what my team will want to do. We have very spiky usage, but when we do go to test an instance, we don’t want to have to wait for it to spin up. Since we have similar work hours, we’ll probably want to shut down the instances except for around 06:30 – 20:00 on weekdays. That way, it’s up whenever people are likely to be working and down at other times.
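
    To sketch what option #2 might look like for us (the cluster, service, and schedules below are hypothetical): two EventBridge schedules invoke a small Lambda that sets the ECS service’s desired count, since Fargate tasks “shut down” by scaling the service to zero rather than by stopping an instance.

    import boto3

    ecs = boto3.client("ecs")

    # Invoked by two schedules, e.g. cron(30 6 ? * MON-FRI *) with
    # {"desired_count": 1} and cron(0 20 ? * MON-FRI *) with {"desired_count": 0}.
    # Plain EventBridge cron runs in UTC; EventBridge Scheduler can take a time zone.
    def handler(event, context):
        ecs.update_service(
            cluster="test-cluster",
            service="test-service",
            desiredCount=event.get("desired_count", 0),
        )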

    One difficulty I’ve anticipated: what if someone tries to test at an odd hour? I don’t mind terribly if they need to manually start an instance; it should happen very infrequently. However, I’d like them to see a clear error message describing what’s going on. We won’t use this system in production for obvious reasons, but it would be nice if future devs years from now don’t get confused because they can’t reach the service.

    So, I’m wondering if there’s a good way to dynamically point our ALB to a static file on an S3 bucket or something while the instances are down. It might be possible to set a static file as an error page in the ALB? Not sure yet. Clearly I have not yet given this more than cursory research, but it’s on my mind.
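
    One thing I want to check: ALB listeners support a fixed-response action, which returns a small static page directly from the load balancer, no S3 involved. A boto3 sketch of swapping the listener’s default action for the off-hours window (the ARN and message are placeholders, and you’d need to restore the original forward action when scaling back up):

    import boto3

    elbv2 = boto3.client("elbv2")

    # Replace the default action with a static "we're asleep" page while the
    # service is scaled down. Fixed-response bodies are limited to 1024 bytes.
    elbv2.modify_listener(
        ListenerArn="arn:aws:elasticloadbalancing:...:listener/app/test/abc/def",
        DefaultActions=[{
            "Type": "fixed-response",
            "FixedResponseConfig": {
                "StatusCode": "503",
                "ContentType": "text/html",
                "MessageBody": "<p>Test environment is offline outside 06:30-20:00 weekdays.</p>",
            },
        }],
    )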

  • AWS Launches Trust Center

    Compliance just got a tiny bit easier in AWS-land. AWS announced that they’re launching their new AWS Trust Center, an all-in-one hub for AWS’s security-related documentation.

    I certainly haven’t read through the whole site, but just eyeballing what they’ve got:

    • Security
    • Compliance
    • Data protection and privacy
    • Operational visibility
    • Report an incident
    • Agreement and terms

    I doubt they’ve even released any new documentation, but it’s a nice step forward to put all this stuff in one place.

  • Easier Cloud-to-Cloud Migrations?

    Cloud with a lock. Courtesy of Wikimedia Commons.

    An Empty (Theoretical) Promise

    It’s long been a promise of Infrastructure as Code tools like Terraform that you could theoretically create platform-independent IaC and deploy freely into any cloud environment. I doubt anyone ever really meant that literally, but the reality is that your cloud infrastructure is inevitably going to be tied quite closely to your provider. If you’re using an aws_vpc resource, it’s pretty unlikely that you could easily turn that into its equivalent in another provider.

    And yet, several of the organizations I’ve worked with have been reluctant to tie themselves closely to one cloud provider or another. The business reality is that vendor lock-in is a huge source of anxiety: if AWS suddenly and drastically raised its prices, or if it for some reason became unavailable, lots and lots of businesses would be in a big pickle!

    The amount of work required to manually transfer an existing system from one provider to another would be nearly as much as creating the system in the first place.

    GenAI as the Solution?

    I ran across this article about StackGen’s Cloud Migration product. The article isn’t long, so go read it.

    Instead of requiring DevOps teams to map out resources manually, the system uses read-only API access to scan existing cloud environments. It automatically identifies resources, maps dependencies, and – perhaps most importantly – maintains security policies during the transition.

    StackGen isn’t new to using generative AI for infrastructure problems, but they have an interesting approach here:

    1. Use read-only APIs to identify resources, including those not already in IaC.
    2. Use generative AI to map those resources, including security policies, compliance policies, and resource dependencies.
    3. Convert those mapped resources into deployment-ready IaC for the destination environment.

    Using a process like this to migrate from provider to provider is interesting, but the one use case that really gets me thinking is the ability to deploy into a multi-cloud environment.

    I’ll be keeping my eyes on this one.

  • Checksums are SHAping up to be complicated!

    I have plenty more of my beginner DevOps materials (see DevOps 101: Hello There!), but I also want to post about problems I’ve run into. So, this is something I’ve been mulling over a bit lately.

    The Issue

    My team is coordinating with another team on passing files from one AWS S3 bucket to another, and we need to verify that the file at the destination is the same as the file at the source.

    Normally, you would probably just rely on S3; it’s reliable, and it does its own checksum validation on its copy operations. However, the work is part of a permanent archive, and verifying file integrity is the core-est of core requirements. Not only do we need to verify the integrity every time we move the file, but we also need to occasionally spot-check our archives while they’re at rest.

    Our customer has traditionally used one of the SHA algorithms for file validation. That’s not a problem per se, but calculating a SHA on a very large file (100+ GB is not that unusual) is slow and expensive! We’d rather avoid it as much as possible.
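
    For context, this is roughly what that calculation entails no matter how you chunk the reads; every byte of the object has to pass through the hash:

    import hashlib

    def sha256_of_file(path: str, chunk_size: int = 8 * 1024 * 1024) -> str:
        """Stream a file through SHA-256. Chunking keeps memory flat, but the
        whole 100+ GB still has to be read and hashed."""
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            while chunk := f.read(chunk_size):
                digest.update(chunk)
        return digest.hexdigest()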

    Potential Solution: S3’s “Checksum”

    One solution would be for AWS to calculate the checksum for us as part of its own file-integrity checking. I think it might fit our requirements if we could get an AWS-calculated checksum of the full object that we could then use for our spot-checking later.

    As it turns out, AWS provides this automatically as a sort of by-product of S3’s copy-object feature. When it calculates the checksum it uses for a copy, it stores that value and makes it available for its own later use and for the user.
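
    Retrieving that stored checksum later is a single metadata call. Something like this boto3 sketch (the bucket and key are made up), assuming the object was uploaded or copied with a checksum algorithm specified:

    import boto3

    s3 = boto3.client("s3")

    # Ask S3 to include the checksum it already computed and stored; nothing is
    # recalculated and the object body is never downloaded.
    resp = s3.head_object(
        Bucket="example-archive-bucket",
        Key="big/file.dat",
        ChecksumMode="ENABLED",
    )
    print(resp.get("ChecksumSHA256"), resp.get("ChecksumCRC64NVME"))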

    However, AWS doesn’t offer what they call full-object checksums if you’re using SHA. They only offer composite checksums. The distinction, as the documentation puts it, is:

    Full object checksums: A full object checksum is calculated based on all of the content of a multipart upload, covering all data from the first byte of the first part to the last byte of the last part.

    Composite checksums: A composite checksum is calculated based on the individual checksums of each part in a multipart upload. Instead of computing a checksum based on all of the data content, this approach aggregates the part-level checksums (from the first part to the last) to produce a single, combined checksum for the complete object.

    The main problem, as you may have noticed, is that if you’re using multipart uploads, then the composite checksum of the entire object depends on your chunks being exactly the same size every time a file is moved. When you’re moving between different services, that can be super brittle: a change in one system’s chunk size produces a completely different checksum for identical data.
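
    To make the brittleness concrete, here’s a toy illustration of how I understand a composite checksum to be derived (hash each part, then hash the concatenation of the part digests); the exact mechanics on AWS’s side may differ, but the dependence on part size is the point:

    import hashlib

    def composite_sha256(data: bytes, part_size: int) -> str:
        """Toy composite checksum: digest each part, then digest the digests."""
        parts = [
            hashlib.sha256(data[i:i + part_size]).digest()
            for i in range(0, len(data), part_size)
        ]
        return hashlib.sha256(b"".join(parts)).hexdigest() + f"-{len(parts)}"

    blob = b"x" * 1_000_000
    print(composite_sha256(blob, 100_000))  # same bytes...
    print(composite_sha256(blob, 250_000))  # ...different part size, different value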

    Full-Object Checksums and Linearization

    The reason you can’t do a full-object checksum using SHA is that SHA is deliberately non-linear: change one bit in the original, and the hash, by design, changes completely. SHA can’t be linearized, meaning you can’t calculate digests of different parts independently and then re-combine them into the digest of the whole.

    This brings us to Cyclic Redundancy Check (CRC) algorithms. These are also error-detection algorithms, but they’re calculated more or less like the remainder of a giant division problem. And, importantly, if you take the remainders from n division problems, add them together, take the remainder again, you get the remainder of the sum. So, in a super-simple example, if you want the remainder from 150 % 4, you could do it this way:

    • 150 % 4 = ??
    • 100 % 4 = 0 and 50 % 4 = 2
    • (0 + 2) % 4 = 2, therefore 150 % 4 = 2

    It’s a roundabout way of calculating such a small modulus, but if you have a thousand file chunks, then you can find the CRC in essentially the same way.
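
    The same toy example in code, generalized to a list of chunks:

    # Remainders of the pieces, combined, give the remainder of the whole.
    # Real CRC combination is the GF(2) analogue of this (with an extra shift
    # for each chunk's length), not plain addition, but the principle holds.
    chunks = [100, 30, 20]                      # pieces that sum to 150
    combined = sum(c % 4 for c in chunks) % 4
    assert combined == sum(chunks) % 4 == 2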

    So, that’s why AWS only offers full-object checksums for CRC algorithms: they don’t want to have to compute the whole SHA any more than we do.

    What does that mean for us?

    I obviously think using CRCs and full-object checksums to replace our manual calculations would save a lot of compute (and therefore both time and money).

    It’s still an open question whether or not switching to CRCs will satisfy our requirements. There also might be weird legacy issues that could crop up after relying on one specific algorithm for years.

    Let me know if anyone has thoughts on this issue or, especially, if I’ve gotten things wrong.