Tag: cloud

  • Legacy Modernization: Do We Even Need to Do This?

    If you’re very, very lucky, then you’re just at the beginning of your legacy modernization project. You haven’t done development work, and you haven’t even designed anything. Your first step is to carefully consider what you want and how to get there. If so, then well done! You’re already off to a better start than a lot of such projects.

    More often, though, you’re partway through your modernization effort, and it’s struggling. The main risks in such a project, as we’ve discussed, are losing momentum and running out of money. However, there are plenty of other things that could go wrong, some of which we’ve talked about in other posts. But something isn’t going well, and you’ve got to figure out what to do.

    In either scenario, your first decision has to be whether or not to modernize. Odd though it may sound, I don’t think the answer is always “yes.”

    I firmly believe that there’s nothing inherently wrong with using old software. Old software has already been adapted to meet user needs. If it’s old and still in use, that’s because it works. Upgrading to newer technologies or moving to a public cloud are not worthwhile as goals just for their own sake. Legacy modernization is extremely expensive and not to be undertaken lightly.

    Is It Broken?

    The first question we should consider here is whether our legacy system is meeting all of our business needs. Odds are, it’s not. Something must not be right about it; otherwise, we hopefully wouldn’t be considering modernization at all.

    Maybe the legacy system is too slow because it has reached the limits of vertical scaling and can’t be scaled horizontally. Maybe its on-premises infrastructure is being retired. Maybe everyone is tired of using a 1980s green-screen interface. Or possibly you have new compliance requirements that aren’t satisfied.

    Whatever is going on, you should identify the particular requirements that are unmet and their importance to your mission. Write them down, and be specific. We’ll need that list in a moment.

    Can It Be Fixed?

    The second question is just as important: can we modify the existing system to meet our requirements?

    • Can we add an API and modern web interface on top of the legacy system?
    • What if we separate the data layer from the application/business layers, so that we can scale them horizontally?
    • Can we lift-and-shift into public cloud without rewriting the whole application? (NB: a lift-and-shift is never as easy as it first appears).

    The answer here is often that the existing application can’t be modified.

    • If it’s in COBOL, I understand it’s really hard to find COBOL developers. So the price of maintaining the system might be out of control.
    • If the infrastructure is going away and is out of your control, you’ll definitely need to move to somewhere else, and it might necessitate modernization.
    • Reaching the limits of scaling is very tough to remedy, so it’s one of those scenarios where I think modernization of some sort is frequently justified.

    So, now also write down what makes it impractical to upgrade your existing system.

    What If We’re Already Modernizing?

    Even if your modernization project is already in progress, you need to do this analysis, maybe especially if you’re already underway.

    I’m sure most of my readers understand the sunk cost fallacy, but even so the warning bears repeating.

    A partially-complete modernization project is expensive in terms of money and effort. You’ve probably spent months or years of developer time. There have been a lot of stakeholder conversations, and it takes effort to build the will to modernize in the first place.

    However, no matter what you’ve invested already, you should be considering the cost-benefit of completing the project versus avoiding the remainder of the project by upgrading the existing software.

    This can be a hard analysis to complete: project timelines are notoriously imprecise, and that is doubly true in a project that involves, for example, understanding and replicating obscure or poorly-understood functionality of a decades-old monolith.

    However, this analysis is critical to project success: the difference in cost between upgrading in place versus a complete rewrite can be absolutely enormous. Even if you’re mostly sure you need to replace the system, a small chance of saving all that effort is worth an hour or two of analysis.


    Next time, we’ll cover choosing the architecture in a modernization project. We’ve mostly talked about the Strangler Fig versus Big Bang approaches, and that implies moving to a cloud-based microservices architecture. However, that’s far from the only reasonable approach, and it’s critically important that you move to the architecture that makes sense for your requirements and available developer resources.


    Other posts in this series:

  • Legacy Modernization: Strangler Fig and Big Bang

    I started to write this as one post, but it was turning out to be a lot longer than I want to write out in one sitting. So, I’m going to divide this up into three separate posts: the problem, the misaligned incentives, and thoughts on solutions. I’ll link the other parts here when they’re done.


    Modernization Projects

    I’ve been working in GovTech for a little while, and one of the things I find fascinating about the space is that a lot of the projects involve modernizing an existing system. I know not everyone agrees, but I enjoy learning a long-established system and the challenge of updating it to fit new requirements.

    I think a fair number of modernization projects begin for the wrong reasons. In my opinion, if the existing system meets the functional requirements (does it do its job?) and nonfunctional requirements (is it fast enough? maintainable?), then you don’t need a new system!

    In other words, putting your system in the cloud is not desirable for its own sake. Microservices architecture is a great way to achieve some requirements at certain levels of resources, but it shouldn’t be the goal itself!

    As a consultant, though, if you’re being asked to help with a legacy modernization project, it’s usually not at the start of the project, and it’s definitely not on a project that’s going well. Nobody asks for help when it’s smooth sailing. If you’re calling in a consultant on your project, then odds are you’re already in trouble.

    There are a lot of reasons why these types of projects might struggle, but in my opinion the main one is that they sound a lot easier than they really are. After all, one might think, “We created this software in the first place; it should be easy to fix it up.”

    I’ll probably write a lot more on other types of modernization pitfalls, but for now I want to focus on one specific issue: the high-level approach to modernizing a monolithic architecture and converting it into an equivalent microservices architecture.

    The Big Bang

    Courtesy: Wikimedia Commons

    Your first instinct when trying to replace a legacy system might be to start from scratch and kick off your application in an entirely fresh metaphorical universe with a Big Bang-style creation. Then you develop the new system for a while until it does everything that the old system does. And finally, you switch over from the old system to the new and never look back!

    Look, I get why the Big Bang approach is tempting. It’s less complex: you only have to develop on one code base, and there’s little risk of upsetting things during the pre-release phase. Your developers would undoubtedly rather write new code than read the old, so they’re also going to be pushing for this apparently-simple approach.

    Now, don’t get me wrong: I love the simple approach to things. I think in most of software development that the less complex the solution, the easier it will be to maintain and the more resilient it’s likely to be. I think if you want to do something in the complicated way, there had better be a good reason.

    “If you do a big bang rewrite, the only thing you are guaranteed of is the big bang.” – Martin Fowler

    Many readers probably noticed that the Big Bang is also a highly risky approach. For a project of any size, it’s going to be a lengthy process that delivers results only after months or years. It defers all user input until the end of that years-long development. And, from the user’s perspective, it renovates a large amount of business logic all at once; the chances of accidentally breaking business rules are very high. To top it off, the difficulty of troubleshooting defects is much higher when you make such sweeping changes all at once.

    From the stakeholders’ perspective, too, it can be a massive risk to expect the collective will for change to persist for years at a time without a deliverable. People leave jobs, and new people have different priorities. The same people move around in an organization. And even the same people in the same position will often change their minds over that period of time, especially if progress isn’t visible.

    In short, the lack of iteration makes the Big Bang approach a huge risk.

    Strangler Fig

    Courtesy: Wikimedia Commons

    The alternative is to use the Strangler Fig pattern: split your monolith into microservices iteratively in small pieces instead of all at once.

    The basic loop of the Strangler Fig approach is:

    1. Identify a piece of the monolith that can be separated from the rest.
    2. Create a microservice that addresses the business logic of that monolith piece.
    3. In the monolith, call the microservice instead of whatever you were doing before.
    4. Remove the piece of the monolith that is no longer used.
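    As a minimal sketch of step 3, here’s roughly what the routing shim inside the monolith might look like. Everything here is hypothetical (the INVOICE_SERVICE_URL variable, the /totals endpoint, and the function names are made up for illustration); the point is just the shape of the cutover.

```python
import json
import os
import urllib.request

def legacy_invoice_total(order):
    # The original monolith code path, kept around until the cutover completes.
    return sum(item["price"] * item["qty"] for item in order["items"])

def invoice_total(order):
    # Step 3 of the loop: if the new service is configured, call it;
    # otherwise fall through to the legacy logic unchanged.
    service_url = os.environ.get("INVOICE_SERVICE_URL")
    if not service_url:
        return legacy_invoice_total(order)
    req = urllib.request.Request(
        f"{service_url}/totals",
        data=json.dumps(order).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["total"]
```

    Once every caller goes through invoice_total() and the new service is handling all the traffic, step 4 is deleting legacy_invoice_total() entirely.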

    It’s easy to see that this approach is far more iterative and has much more frequent feedback from users. Each release carries a risk of business disruption, but by working on a much smaller piece of the application at a time, resolving these disruptions should be far easier than trying to troubleshoot the entire application at once.

    Just as importantly, you get tangible results on a more frequent basis, so stakeholders have visible progress to share and celebrate! You can announce your progress in organization-wide newsletters, demo for your stakeholders, and report the percentage of the legacy application that you’ve retired.

    In short, you’re trading some complexity in the development work — you have to refactor the monolith — for all of the benefits of iterative, Agile development. Personally, I think that for a project of any appreciable size, that decision is a no-brainer.

    And yet, a lot of organizations still make the easy-seeming choice at the outset of their project. What’s more, they often have difficulty adjusting their approach once their modernization efforts get bogged down. Next time we’ll look closely at the incentives that might push an organization in the wrong direction.


    If you find this topic interesting, I highly recommend Kill It With Fire: Manage Aging Computer Systems (and Future-Proof Modern Ones) by Marianne Bellotti, especially if you find yourself in GovTech.


    Other posts in this series:

  • Multi-Cloud Infrastructure as Code?

    I’m going to do an uncomfortable thing today: I was thinking about a problem, and I’m just going to share my thoughts before I research it. Then, I’ll actually do the research and refine things a bit. The goal is to show the thought process and learning.


    Courtesy: Wikimedia Commons

    One of the main selling points of HashiCorp’s Terraform is that it can be used for multi-cloud deployments. The benefits of this type of deployment are significant:

    • If one provider has an outage, you can simply use your infrastructure in a different provider.
    • You’re far more resistant to vendor lock-in. If a provider raises its prices, you aren’t stuck there.

    The problem of vendor lock-in is huge. Wherever I’ve worked, there’s always this pervasive background question: “Well, what if we wanted to go with Google instead?” And the answers have been unsatisfying. Realistically, the answer is sort of: “Well, we could start over and re-design all this infrastructure for the new platform.”

    If you look at production Terraform, it’s going to be full of resources such as aws_s3_bucket, which is definitively tied to one specific cloud provider.

    So how can you have Infrastructure as Code (IaC) for multi-cloud deployments, when all your resources are tied to a specific provider?

    One solution (and the one that HashiCorp probably recommends) would be to abstract your infrastructure into generic modules that implement your intentions in each of the cloud providers’ specific resources.

    The user would specify “I want a storage container that is readable to the public and highly available.” The module would then be able to create such a container in AWS, Azure, GCP, or wherever you needed to deploy it.

    So you’d have a module that looked maybe something like this:

    # Module "multicloud_storage"

    variable "cloud_provider" {
      type = string
    }

    resource "aws_s3_bucket" "main_bucket" {
      count = var.cloud_provider == "aws" ? 1 : 0
      ...
    }

    resource "azurerm_storage_blob" "main_bucket" {
      count = var.cloud_provider == "azure" ? 1 : 0
      ...
    }

    Disclaimer: Please don’t use this code. It’s exactly as untested as it looks.

    Note that awkward count field on every block. I think you could probably make such a generic module work, but you’d have to implement the thing in every provider that you wanted to support.

    But the configurations for the different providers’ storage systems don’t match up one-to-one. Take, for example, the access tier of your storage: how available the objects are and how quickly they can be accessed. AWS S3 has at least nine, plus an Intelligent-Tiering option, whereas Azure uses hot, cool, cold, and archive. In our hypothetical multi-cloud module, we probably want to abstract this away from the user. We might do something like this:

    Module          Azure Blob    AWS S3
    ------          ----------    ------
    Lava            Hot           S3 Express One Zone
    Hot             Hot           S3 Standard
    Cool            Cool          S3 Standard-IA
    Cold            Cold          Glacier Instant Retrieval
    Glacier         Archive       Glacier Flexible Retrieval
    Micro-Kelvin    Archive       Glacier Deep Archive

    This would allow us to offer the available storage classes in both providers, but the actual storage tier chosen is a little obfuscated from the user.

    But what about features that exist in one provider and not another? For example, S3 offers Transfer Acceleration to speed up transfers to and from the bucket, whereas Azure’s Blob seems to rely mainly on parallel uploads for performance.

    Then we get to whole types of resources that exist in one provider but not another. Leaky abstractions. The juice-squeeze ratio of maintaining all of these implementations for lesser-used resource types or highly specific ones like QuickSight.

    I’m about to end the rambling, self-reflecting portion of this post and do some actual research. I hope that someone has created modules like this that allow the same infrastructure to work for multi-cloud deployments. My intuition is that it’s too unwieldy.

    Here I go into the Internet. Fingers crossed!


    Hey, it’s me. I’m back, fifteen minutes later.

    I didn’t find a ton. There’s a smattering of tools that claim to handle this.

    For example, Pulumi, an Infrastructure as Code tool and alternative to Terraform, says that it handles multi-cloud deployments natively. I’d be interested in learning more.

    I found several articles offering a guide to multi-cloud Terraform modules. I did not, however, find any well-maintained modules themselves.

    The void feels a little weird to me: there’s obviously a great need for this sort of module. It’s the sort of problem that the open source community has traditionally been good at solving. Like I said before, my intuition is that this is a very difficult (expensive) problem, so maybe the cost just outweighs the demand?

    One Stack Overflow post mentioned that one of the reasons people don’t share Terraform in open source is that it makes it easy to find vulnerabilities in your infrastructure. (But isn’t that supposed to be a strength of open source: to crowdsource the identification and correction of these vulnerabilities?) Anyway, extrapolating a bit: this reluctance to share infrastructure might also be a huge barrier to making such a multi-cloud module.

    If I were going to implement something professionally, I’d do a lot more than fifteen minutes of research. But, gentle reader, it looks bleak out there. Let me know if there’s anything good out there that I missed.

  • North Korean Malware Wins at Hide and Seek

    Courtesy: Wikimedia Commons

    SecurityScorecard has released a report describing how they uncovered evidence of an attack by North Korea’s Lazarus Group against developers. The attack uses sophisticated anti-detection techniques to deliver its new implant Marstech1, designed to steal cryptocurrency wallets.

    Marstech1, a JavaScript implant, is being served by Lazarus’s Command & Control (C2) server, and a similar implant was also added to several open source GitHub repositories.

    This malware targets the directories used by Exodus and Atomic Crypto wallets. It can copy the data, package it, and send it to the C2 server.

    What makes Marstech1 unique, though, is the extent to which its authors have gone to obfuscate the code to avoid detection. From the report:

    The Marstech implants utilize different obfuscation techniques than previously seen. The JS implant that was
    observed utilizes;

    • Control flow flattening & self-invoking functions
    • Random variable and function names
    • Base64 string encoding
    • Anti-debugging (anti-tamporing [sic] checks)
    • Splitting and recombining strings

    This ensures that if the threat actor embedded the JS into a software project it would go unnoticed.

    There’s a full explanation in the report, so if you’re interested I highly recommend it. Suffice it to say that security researchers have their work cut out for them right now.

  • No! Shut them all down, hurry!

    Luke Skywalker: [interrupting] Will you shut up and listen to me! Shut down all the garbage mashers on the detention level, will ya? Do you copy? Shut down all the garbage mashers on the detention level! Shut down all the garbage mashers on the detention level!


    C-3PO: [to R2-D2] No! Shut them all down, hurry!

    The Official AWS Blog (great resource, by the way!) has a post describing how to reduce costs in EC2 by automatically shutting down instances while not in use.

    The short version of the blog post is that you can achieve this sort of shutdown in two ways:

    1. Create a CloudWatch alarm to shut down when CPU usage is below a threshold for a period of time.
    2. Create an EventBridge trigger and Lambda to shut down on a schedule.

    I would argue that in most deployments you would have a more precise metric that actually reflects the number of your HTTP requests.

    There are other guides on how to do this; I’ve looked at some, as I’ve been planning to do this for our Fargate instances (not EC2, obviously, but similar enough) in our test environments. However, it’s nice to have an official source on how to do this kind of shutdown.

    The reason we want to do this on my project is to save on cloud costs. The savings probably aren’t that much, but they come from an area of the budget that is limited and needed for other things.

    At any rate, option #2 better reflects what my team will want to do. We have very spiky usage, but when we do go to test an instance, we don’t want to have to wait for it to spin up. Since we have similar work hours, we’ll probably want to shut down the instances except for around 06:30 – 20:00 on weekdays. That way, it’s up whenever people are likely to be working and down at other times.
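    For the schedule-based option, the Lambda side can stay tiny. Here’s a rough sketch under a few assumptions of my own: instances to stop are tagged auto-shutdown=true (my convention, not an AWS one), and an EventBridge schedule rule like cron(0 20 ? * MON-FRI *) invokes this function at 20:00 on weekdays.

```python
def ids_of(described):
    # Flatten a describe_instances response into a list of instance IDs.
    return [
        inst["InstanceId"]
        for res in described["Reservations"]
        for inst in res["Instances"]
    ]

def handler(event, context):
    import boto3  # the AWS SDK; preinstalled in the Lambda runtime
    ec2 = boto3.client("ec2")
    described = ec2.describe_instances(
        Filters=[
            {"Name": "tag:auto-shutdown", "Values": ["true"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    ids = ids_of(described)
    if ids:
        ec2.stop_instances(InstanceIds=ids)
    return {"stopped": ids}
```

    A mirror-image rule at cron(30 6 ? * MON-FRI *) would run a function that filters for stopped instances and calls start_instances instead.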

    One difficulty I anticipate: what if someone tries to test at an odd hour? I don’t mind terribly if they need to manually start an instance; it should happen very infrequently. However, I’d like them to see a clear error message describing what’s going on. We won’t use this system in production for obvious reasons, but it would be nice if future devs years from now don’t get confused because they can’t reach the service.

    So, I’m wondering if there’s a good way to dynamically point our ALB to a static file on an S3 bucket or something while the instances are down. It might be possible to set a static file as an error page in the ALB? Not sure yet. Clearly I have not yet given this more than cursory research, but it’s on my mind.

  • Can AI Generate Functional Terraform?

    Nope.

    The end. Thanks for reading my post. Reading time: 1 minute.


    I posted the other day about this topic, and I am intrigued by the possibilities. I’m certainly interested in the ways that you can use it for infrastructure, and the article in that post offers a somewhat-different use case for AI-generated Terraform: cloud migrations and multi-cloud solutions. But I’d be lying if I said I wasn’t very skeptical of the code that it writes.

    With all that on my mind, I appreciate the analysis in this article: “Can AI Generate Functional Terraform?” by Rak Siva.

    I’d add mainly that GenAI is currently about as useful as a very junior developer, probably because they’re both doing the same thing: Googling the problem and copy-pasting the results without really understanding them.

    Then again, if you’ll indulge a quickly-emerged cliché: none of us saw any of this coming just five years ago.

  • AWS Launches Trust Center

    Compliance just got a tiny bit easier in AWS-land. AWS announced that they’re launching their new AWS Trust Center, an all-in-one hub for AWS’s security-related documentation.

    I certainly haven’t read through the whole site, but just eyeballing what they’ve got:

    • Security
    • Compliance
    • Data protection and privacy
    • Operational visibility
    • Report an incident
    • Agreement and terms

    I doubt they’ve even released any new documentation, but it’s a nice step forward to put all this stuff in one place.

  • Easier Cloud-to-Cloud Migrations?

    Cloud with a lock. Courtesy of Wikimedia Commons.

    An Empty (Theoretical) Promise

    It’s long been a promise of Infrastructure as Code tools like Terraform that you could theoretically create platform-independent IaC and deploy freely into any cloud environment. I doubt anyone ever really meant that literally, but the reality is that your cloud infrastructure is inevitably going to be tied quite closely to your provider. If you’re using an aws_vpc resource, it’s pretty unlikely that you could easily turn that into its equivalent in another provider.

    And yet, several of the organizations I’ve worked with have been reluctant to tie themselves closely with one cloud provider or another. The business reality is that the vendor lock-in is a huge source of anxiety: if AWS suddenly and drastically raised their prices, or if they for some reason became unavailable, lots and lots of businesses would be in a big pickle!

    The amount of work required to manually transfer an existing system from one provider to another would be nearly as much as creating the system in the first place.

    GenAI as the Solution?

    I ran across this article about StackGen’s Cloud Migration product. The article isn’t long, so go read it.

    Instead of requiring DevOps teams to map out resources manually, the system uses read-only API access to scan existing cloud environments. It automatically identifies resources, maps dependencies, and – perhaps most importantly – maintains security policies during the transition.

    StackGen isn’t new to using generative AI for infrastructure problems, but they have an interesting approach here:

    1. Use read-only APIs to identify resources, including those not already in IaC.
    2. Use generative AI to map those resources, including security policies, compliance policies, and resource dependencies.
    3. Convert those mapped resources into deployment-ready IaC for the destination environment.

    Using a process like this to migrate from provider to provider is interesting, but the one use case that really gets me thinking is the ability to deploy into a multi-cloud environment.

    I’ll be keeping my eyes on this one.

  • DevOps 101: Cross-Functional Teams

    Crayons (courtesy of Wikimedia Commons)

    Cross-functional teams play a vital role in a DevOps culture as they bring together individuals with diverse skills and expertise from different areas of software development and operations.

    By embracing cross-functional teams, organizations can foster collaboration, improve communication, streamline processes, and create an environment conducive to innovation and continuous improvement. In a DevOps culture, where speed, agility, and quality are paramount, these teams are crucial for breaking down barriers and delivering high-value software efficiently.

    Read more!

  • Checksums are SHAping up to be complicated!

    I have plenty more of my beginner DevOps materials (see DevOps 101: Hello There!), but I also want to post about problems I’ve run into. So, this is something I’ve been mulling over a bit lately.

    The Issue

    My team is coordinating with another team on passing files from one AWS S3 bucket to another, and we need to verify that the file at the destination is the same as the file at the source.

    Normally, you would probably just rely on S3; it’s reliable, and it does its own checksum validation on its copy operations. However, the work is part of a permanent archive, and verifying file integrity is the core-est of core requirements. Not only do we need to verify the integrity every time we move the file, but we also need to occasionally spot-check our archives while they’re at rest.

    Our customer has traditionally used one of the SHA algorithms for file validation. That’s not a problem per se, but calculating a SHA on a very large file (100+ GB is not that unusual) is slow and expensive! We’d rather avoid it as much as possible.

    Potential Solution: S3’s “Checksum”

    One solution would be if AWS would handle calculating the checksum for us as part of its own file integrity checking. I think it might fit our requirements if we could get an AWS-calculated checksum of the full object that we could then do our spot-checking on later.

    As it turns out, AWS automatically provides this as a sort of by-product of the S3’s copy object feature. When it calculates the checksum that it uses for a copy, it stores that information and makes it available for its own later use and for the user.
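    If we went this route, reading the stored checksum back might look something like the sketch below. A caveat: this is untested, and the head_object parameter and response fields (ChecksumMode, ChecksumType, and the Checksum* keys) are from my reading of the S3 docs, so treat them as assumptions.

```python
def full_object_checksum(head):
    # head is the response dict from s3.head_object(..., ChecksumMode="ENABLED").
    # Return (algorithm, value) for a full-object checksum, or None if S3
    # only has a composite checksum for this object.
    if head.get("ChecksumType") == "COMPOSITE":
        return None
    for key in ("ChecksumCRC64NVME", "ChecksumCRC32C", "ChecksumCRC32"):
        if key in head:
            return key, head[key]
    return None

def stored_checksum(bucket, key):
    import boto3  # the AWS SDK; preinstalled in most AWS runtimes
    s3 = boto3.client("s3")
    head = s3.head_object(Bucket=bucket, Key=key, ChecksumMode="ENABLED")
    return full_object_checksum(head)
```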

    However, AWS doesn’t offer what they call full-object checksums if you’re using SHA. They only offer composite checksums. The distinction, as the documentation puts it, is:

    Full object checksums: A full object checksum is calculated based on all of the content of a multipart upload, covering all data from the first byte of the first part to the last byte of the last part.

    Composite checksums: A composite checksum is calculated based on the individual checksums of each part in a multipart upload. Instead of computing a checksum based on all of the data content, this approach aggregates the part-level checksums (from the first part to the last) to produce a single, combined checksum for the complete object.

    The main problem, as you may have noticed, is that if you’re using multi-part uploads, then the composite checksum of the entire object is going to depend on your chunks being exactly the same size every time a file is moved. When you’re moving between different services, that can be super brittle: a change in one system’s chunk size would affect the ultimate checksum in completely unpredictable ways.
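    To make that brittleness concrete, here’s a toy composite checksum in Python, built the same way the docs describe: hash each part, then hash the concatenation of the part digests. Same bytes, different part size, completely different result.

```python
import hashlib

def composite_sha256(data, part_size):
    # Hash each part, then hash the concatenated part digests -- the same
    # shape as S3's composite checksums for multipart uploads.
    part_digests = b"".join(
        hashlib.sha256(data[i:i + part_size]).digest()
        for i in range(0, len(data), part_size)
    )
    return hashlib.sha256(part_digests).hexdigest()

data = b"archival bytes " * 100_000  # ~1.5 MB stand-in for a large file

# Identical bytes, but 300 KB parts vs. 500 KB parts give
# completely different composite checksums.
assert composite_sha256(data, 300_000) != composite_sha256(data, 500_000)
```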

    Full-Object Checksums and Linearization

    The reason why you can’t do a full-object checksum using SHA is that a SHA hash is computed as a chain over the whole input: each block of the computation depends on everything that came before it, and changing one bit in the original changes the hash, by design, completely. SHA can’t be linearized, meaning you can’t calculate different parts independently and then re-combine them.

    This brings us to Cyclic Redundancy Check (CRC) algorithms. These are also error-detection algorithms, but they’re calculated more or less like the remainder of a giant division problem. And, importantly, if you take the remainders from n division problems, add them together, and take the remainder of that sum, you get the remainder of the original number. So, in a super-simple example, if you want the remainder from 150 % 4, you could do it this way:

    • 150 % 4 = ??
    • 100 % 4 = 0 and 50 % 4 = 2
    • (0 + 2) % 4 = 2, therefore 150 % 4 = 2

    It’s a roundabout way of calculating such a small modulus, but if you have a thousand file chunks, then you can find the CRC in essentially the same way. (A real CRC combination also has to account for each chunk’s position in the file, but the principle is the same.)
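    Python’s zlib shows the practical upshot: a CRC-32 can be computed chunk-at-a-time by feeding each running value back in as the seed for the next call, and the result matches the one-pass computation exactly. (Combining chunk CRCs that were computed fully independently additionally needs the position-aware crc32_combine() from the underlying C library, which Python’s zlib doesn’t expose.)

```python
import zlib

data = b"contents of a big archive file " * 50_000
chunks = [data[i:i + 65_536] for i in range(0, len(data), 65_536)]

# One pass over the whole file...
full = zlib.crc32(data)

# ...matches a chunk-at-a-time computation, feeding each running CRC in
# as the seed for the next chunk. The whole file never sits in memory.
running = 0
for chunk in chunks:
    running = zlib.crc32(chunk, running)

assert running == full
```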

    So, that’s why AWS only offers full-object checksums for CRC algorithms: they don’t want to have to compute the whole SHA any more than we do.

    What does that mean for us?

    I obviously think using CRCs and full-object checksums to replace our manual calculations would save a lot of compute (and therefore both time and money).

    It’s still an open question whether or not switching to CRCs will satisfy our requirements. There also might be weird legacy issues that could crop up after relying on one specific algorithm for years.

    Let me know if anyone has thoughts on this issue or, especially, if I’ve gotten things wrong.