Tag: technology

  • Google DeepMind Announced an LLM-Based Problem-Solver

    Earlier this week, Google DeepMind announced its new research tool AlphaEvolve. Basically, it’s an LLM-driven tool that uses evolutionary algorithms to find solutions for certain math or software problems. It’s already come up with optimizations on a few important problems that could lead to efficiency gains within the AI space and perhaps beyond.

    Disclaimer: I haven’t had time to read the whole paper yet, but I’ve managed to read Google DeepMind’s blog post, watch this interview and read a few news articles.

    The main limitation of AlphaEvolve is that it can only work on problems where the solution can be evaluated by a machine. So, on the trivial end of things, this might be similar to a LeetCode problem such as “Reverse a linked list”. To solve this, AlphaEvolve would come up with a few solutions and evaluate them for both correctness and efficiency. Obviously, this is the sort of problem that computer science students should be able to solve in their sleep.

    Of course, what’s interesting is when you direct AlphaEvolve to work on more interesting problems.

    How does it work?

    Evolutionary algorithms can solve problems within a large solution-space by tweaking parameters and running many different trials. Selection criteria and evaluation methods can vary, but the general idea is to choose the best solutions from one generation, tweak them a bit, and run a new generation of trials.

    Where AlphaEvolve improves on this method of problem-solving is that it uses an LLM to direct progress rather than relying solely on random tweaks to the parameters. It also uses automatic code generation, so the candidates being tested are (or can be?) full code implementations.
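
    To make that loop concrete, here’s a toy sketch in Python of the kind of evolutionary loop described above, with the evaluator and the LLM step stubbed out. The names and structure are my own illustration, not AlphaEvolve’s actual pipeline.

    import random

    def evaluate(candidate: str) -> float:
        """Automated scoring of a candidate program (correctness, speed, etc.).
        A random stand-in here; the real evaluator is problem-specific."""
        return random.random()

    def llm_propose(parent: str, score: float) -> str:
        """Stand-in for an LLM proposing an edited version of a parent program."""
        return parent + f"  # variant of a parent that scored {score:.3f}"

    population = ["initial candidate program"]
    for generation in range(10):
        ranked = sorted(population, key=evaluate, reverse=True)
        survivors = ranked[:3]                                        # selection
        children = [llm_propose(p, evaluate(p)) for p in survivors]   # LLM-guided variation
        population = survivors + children                             # next generation

    best = max(population, key=evaluate)
    print(best)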

    The novel thing here is that LLMs aren’t just generating code, they’re guiding the search across a massive algorithmic space. This leads to verifiably novel solutions, not just rediscovering old ones.

    Purloined from their blog post

    What problems can it solve?

    AlphaEvolve can only work on problems that can be evaluated by machine. These evaluations can be mathematical correctness, performance metrics, or even physical simulations. The key is that there’s an automated, not human, way to judge success. By taking the human out of the loop, they can run thousands or millions of trials until it finds its solutions.

    Despite being limited to this specific type of question, there are a lot of problems in that space, including data center scheduling, hardware design, AI training and inference, and mathematics. In Google DeepMind’s blog post, they said:

    “To investigate AlphaEvolve’s breadth, we applied the system to over 50 open problems in mathematical analysis, geometry, combinatorics and number theory. The system’s flexibility enabled us to set up most experiments in a matter of hours. In roughly 75% of cases, it rediscovered state-of-the-art solutions, to the best of our knowledge.”

    One of the most highly touted results is its algorithm for multiplying 4×4 matrices in 48 multiplications, which is general-purpose and can be applied recursively, with major implications for machine learning and graphics. The algorithms are a bit beyond my understanding of linear algebra, but here’s what I’ve gathered:

    • Multiplying two 2×2 matrices the usual way takes 8 multiplications: each entry of the result is the dot product of a row of one matrix with a column of the other, and there are four entries needing two multiplications each.
    • There is an optimization (Strassen’s algorithm) that multiplies two 2×2 matrices in only 7 multiplications, and mathematicians have proven that 7 is the optimal number for this problem.
    • Because the 7-multiplication algorithm can be applied recursively, you can multiply two 4×4 matrices in 7^2 = 49 multiplications: treat each 4×4 matrix as a 2×2 matrix of 2×2 blocks and apply the same scheme at both levels (see the sketch after this list).
    • AlphaEvolve’s solution uses one fewer multiplication than the 7^2 = 49-step algorithm above, so on the same problem it should be around 2% more efficient.
    • AlphaEvolve’s solution can also be used recursively, so calculating a larger matrix should also be more efficient. I’m not totally clear about how much it would speed things up for which size of matrix.
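
    To make the recursion concrete, here’s a quick Python sketch. To be clear, this is classic Strassen applied recursively, not AlphaEvolve’s new 48-multiplication scheme; it just shows why a 2×2 recipe with k multiplications costs k^2 multiplications on a 4×4 matrix.

    import numpy as np

    mults = 0  # running count of scalar multiplications

    def strassen(A, B):
        """Multiply square matrices (size a power of two) using Strassen's
        7-multiplication scheme at every level of recursion."""
        global mults
        n = A.shape[0]
        if n == 1:
            mults += 1
            return A * B
        h = n // 2
        A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
        B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
        P1 = strassen(A11 + A22, B11 + B22)
        P2 = strassen(A21 + A22, B11)
        P3 = strassen(A11, B12 - B22)
        P4 = strassen(A22, B21 - B11)
        P5 = strassen(A11 + A12, B22)
        P6 = strassen(A21 - A11, B11 + B12)
        P7 = strassen(A12 - A22, B21 + B22)
        C11 = P1 + P4 - P5 + P7
        C12 = P3 + P5
        C21 = P2 + P4
        C22 = P1 - P2 + P3 + P6
        return np.block([[C11, C12], [C21, C22]])

    A, B = np.random.rand(4, 4), np.random.rand(4, 4)
    C = strassen(A, B)
    assert np.allclose(C, A @ B)
    print(mults)  # 49 = 7^2 multiplications, versus 64 for the schoolbook method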

    The reason this seemingly small optimization is so important is that we do a ton of matrix multiplication in machine learning, in both training and inference. So a small difference here can make an enormous difference.

    Similarly, one of the other problems that AlphaEvolve worked on was something (we don’t seem to know exactly what, and it’s probably proprietary) that provided Google with an optimization to its data centers that “recovers on average 0.7% of Google’s fleet-wide compute resources”. Given the immense scale of Google’s data centers, this would be a huge sum of money!

    Why does it matter?

    The major advance here isn’t just speed—it’s novelty. AlphaEvolve didn’t just find better implementations of known algorithms; in some cases, it created ones that are new to science, like the 48-step recursive matrix multiplication.

    One of the major criticisms of LLMs has been that, despite their vast reservoir of knowledge, they haven’t really synthesized that knowledge to come up with new discoveries. (To be clear, there have been a few such discoveries from other areas of AI, such as DeepMind’s AlphaFold.) Well, now we have an LLM-based method for making those discoveries, albeit only for a specific type of problem. Even keeping its limitations in mind, the algorithmic improvements to matrix multiplication alone could yield huge savings in energy and cooling, and reduce environmental impact, in the coming years.

  • Legacy Modernization: Do We Even Need to Do This?

    If you’re very, very lucky, then you’re just at the beginning of your legacy modernization project. You haven’t done development work, and you haven’t even designed anything. Your first step is to carefully consider what you want and how to get there. If so, then well done! You’re already off to a better start than a lot of such projects.

    More often, though, you’re partway through your modernization effort, and it’s struggling. The main risks in such a project, as we’ve discussed, are losing momentum and running out of money. However, there are plenty of other things that could go wrong, some of which we’ve talked about in other posts. Whatever it is, something isn’t going well, and you’ve got to figure out what to do.

    In either scenario, your first decision has to be whether or not to modernize. Odd though it may sound, I don’t think the answer is always “yes.”

    I firmly believe that there’s nothing inherently wrong with using old software. Old software has already been adapted to meet user needs. If it’s old and still in use, that’s because it works. Upgrading to newer technologies or moving to a public cloud are not worthwhile as goals just for their own sake. Legacy modernization is extremely expensive and not to be undertaken lightly.

    Is It Broken?

    The first question we should consider here is whether our legacy system is meeting all of our business needs. Odds are, it’s not. Something isn’t right about it, or hopefully we wouldn’t be considering modernization at all.

    Maybe the legacy system is too slow, has reached the limits of vertical scaling, and it can’t be horizontally scaled. Maybe its on-premises infrastructure is being retired. Maybe everyone is tired of using a 1980s green-screen interface. Or possibly you have new compliance requirements that aren’t satisfied.

    Whatever is going on, you should identify the particular requirements that are unmet and their importance to your mission. Write them down, and be specific. We’ll need that list in a moment.

    Can It Be Fixed?

    The second question is just as important: can we modify the existing system to meet our requirements?

    • Can we add an API and modern web interface on top of the legacy system?
    • Can we separate the data layer from the application and business layers, so that we can scale them horizontally?
    • Can we lift-and-shift into public cloud without rewriting the whole application? (NB: a lift-and-shift is never as easy as it first appears).

    The answer here is often that the existing application can’t be modified.

    • If it’s written in COBOL, I understand it’s really hard to find COBOL developers, so the cost of maintaining the system might be out of control.
    • If the infrastructure is going away and is out of your control, you’ll definitely need to move to somewhere else, and it might necessitate modernization.
    • Reaching the limits of scaling is very tough to remedy, so it’s one of those scenarios where I think modernization of some sort is frequently justified.

    So, now also write down what makes it impractical to upgrade your existing system.

    What If We’re Already Modernizing?

    Even if your modernization project is already in progress, you need to do this analysis, maybe especially if you’re already underway.

    I’m sure most of my readers understand the sunk cost fallacy, but even so the warning bears repeating.

    A partially-complete modernization project is expensive in terms of money and effort. You’ve probably spent months or years of developer time. There have been a lot of stakeholder conversations, and it takes effort to build the will to modernize in the first place.

    However, no matter what you’ve invested already, you should be considering the cost-benefit of completing the project versus avoiding the remainder of the project by upgrading the existing software.

    This can be a hard analysis to complete: project timelines are notoriously imprecise, and that is doubly true in a project that involves, for example, understanding and replicating obscure or poorly-understood functionality of a decades-old monolith.

    However, this analysis is critical to project success: the difference in cost between upgrading in place versus a complete rewrite can be absolutely enormous. Even if you’re mostly sure you need to replace the system, a small chance of saving all that effort is worth an hour or two of analysis.


    Next time, we’ll cover choosing the architecture in a modernization project. We’ve mostly talked about the Strangler Fig versus Big Bang approaches, and that implies moving to a cloud-based microservices architecture. However, that’s far from the only reasonable approach, and it’s critically important that you move to the architecture that makes sense for your requirements and available developer resources.



  • Legacy Modernization: This is fine!

    Timeline of the Universe. Courtesy: Wikimedia Commons

    Let’s say you have a legacy piece of software, and something isn’t working about it. Maybe it’s too expensive to maintain, or nobody knows how, or it’s impossible to hire developers who know COBOL. Maybe it’s unreliable or unscalable.

    • The Big Bang approach is where you make a whole new piece of software, do your best to replicate the old software, and on one terrifying day, you disable the old software and turn on the new.
    • The Strangler Fig pattern is the generally-accepted way to modernize legacy software into a microservices architecture. It involves breaking services off of the legacy software and replacing them one-by-one with new microservices. This has the advantage of making smaller, incremental changes, with corresponding smaller and more frequent risks.

    We’ve talked a bit about why it’s easy to start off in the easy-looking Big Bang approach. It’s simpler, and it’s tempting to avoid messing with your legacy system; after all, why make changes to the thing you’re trying to get rid of?

    So, you’re attempting a Big Bang start-over project. What now? Normally, the project goes off on the risky path and continues that way for a while. You’re probably even making progress, getting things off the ground, gathering requirements, starting to re-implement everything. If you’re very, very lucky you complete your project under-budget and on-time, you flip the switch and everyone migrates to the new system with a minimum of fuss. I’m sure this does happen sometimes; the Big Bang is risky but it’s not impossible.

    How do you know, then, if you’re in trouble? What are the risks, and what do they look like before they sink the project?

    Are we there yet?

    Big Bang projects often take years to get to their Minimum Viable Product (MVP). During that time, the organization is putting a lot of resources into the project, and there are no results to speak of. If you’re clever, you can claim small wins and keep stakeholders interested, but even so if you’re pinned down and asked “Are we there yet?” your answer is “no.”

    Some of the problem here is that it’s really hard to estimate the time frame of a modernization project. Often you’re modernizing because nobody understands the existing system, and gathering your current requirements is a large part of the project. What’s more, figuring out your requirements can often be a matter of archaeology: “Hey, what’s this section of code?” followed by “Hey, is anybody still using this feature I discovered?” And without firm requirements, how can you guess how long it will take to complete the rewrite?

    So projects are left with either a really wide time frame (“It could take anywhere from one week to six years”) or they take the Engineer Scotty Method and estimate the top end (“It’ll take six years, cap’n!”). In either case, an astute stakeholder would be extremely skeptical of such a long-running modernization project, but sometimes they just have no other choice.

    A lot can happen over the course of a multi-year project. Budgets tighten, and unfinished projects are often among the first to be cut. Personnel change, and new managers might not buy into a years-long project with no tangible results. Attitudes change; people that agreed to the project in the past may no longer believe in it.

    The real danger here, though, is when the project exceeds the time estimates. The original estimate might have been too optimistic. Requirements could be discovered in the course of code-spelunking. Requirements can change – there could be new legal or security requirements, or workflows might have changed. But any of these can cause the project to extend past the original time estimates.

    And when the project extends beyond its schedule, that’s when patience really gets thin.

    So, if you’re part of a legacy modernization project, how do you identify and head off these sorts of problems? Unfortunately, I don’t have much of a solution. If you have multi-year time estimates, I’d say your risk of losing momentum is pretty high. But the main thing is to keep an eye on stakeholders’ temperature with regard to the project: Are people getting impatient? Are budgets in doubt? Do requirements change frequently?

    What if it also did this?

    Another big risk is scope creep. With any project, you run the risk of stakeholders asking for new features before you’re even done with your MVP. Ideally, you’d have a product manager that can say no to changes in the scope, but even so it can sometimes be difficult to say no.

    Obviously, these types of scope changes can extend your timeframe and stretch your developer resources. They’re challenging, and they’re not unique to legacy modernization projects. However, the extreme time to MVP on a Big Bang project comes with an equally high risk of scope creep.

    What do you mean it’s supposed to do that?

    Your Big Bang project has carefully gathered requirements, worked with users to verify that every use case is accounted for, and combed through the old code to find everything the legacy system used to do. Then you’ve painstakingly recreated all of the old features and maybe discarded a few that nobody needs anymore. You’ve maintained stakeholder momentum for months or years and managed to last all the way to the end of your long timeline. You’re ready to switch over from the legacy system to the replacement!

    Now comes perhaps the biggest risk of the project: Day One. There are a million things that can go wrong: mistaken requirements, mis-implemented features, unmigrated or incorrectly migrated data, incorrect integrations with outside systems, networking problems, scaling problems, features marked for deletion that – oops! – it turns out Accounting is actually using. Many of these can be discovered with careful testing, but not all of them. Flipping the switch is the moment of truth: the moment when all of the changes first encounter a real user.

    It almost goes without saying that frequent deployments and short user feedback loops are considered best practices for a reason. User feedback is the gold standard for the success of a product, and the more changes you’ve made between feedback cycles, the greater the risk of building things that don’t align with expectations. With potentially years between feedback, you’re running an awfully high risk!

    Moreover, the blast radius on an all-at-once deployment is the entire application. If you release a single feature, you likely won’t cause other features to fail. If you release everything, anything can fail. And when the blast radius is huge, so is the area you have to search in troubleshooting the things that have failed.

    So, how do you know when you’re at risk of this type of Day One catastrophe? Anytime you’re making a lot of changes at once, but especially if you’re changing out one application for another. Fixing it really just requires not doing that.


    This isn’t an exhaustive list of the ways you can get into hot water with a Big Bang-style rewrite, but it covers some of the worst trouble you can get into. If you’re seeing these things on your project, then you probably need to change course. In the next post, I’ll touch on how to change course once you’ve gone a significant way in the wrong direction.



  • Legacy Modernization: Your Instincts Might Be Wrong

    I was originally going to write about legacy modernization in a single post, but it was long, and it made more sense to split it up. See the first part on the design patterns and anti-patterns.


    Courtesy: Wikimedia Commons

    There are a lot of possible reasons why a legacy modernization project might end up using the Big Bang anti-pattern. To be totally clear, it’s not usually that the organization is doing things wrong: there are quite a few traps that can land the project in a difficult situation. We’ll try to explore a few of the more common such challenges, but for today we’re going to consider why the Big Bang approach seems to be everyone’s first instinct.

    Not Everyone Knows the Proven Pattern

    Not everyone who takes on a legacy modernization project is familiar with the established pattern. Your architect certainly should be, but in a lot of organizations (especially in government, where software development is not their primary mission) it is not reasonable to expect stakeholders or maybe even managers to understand software architecture.

    The problem, for sure, is one of education. Ideally, you would start off the modernization project by making sure everyone was aware of the different types of approaches and the pros and cons of each.

    However, from the perspective of the consultant (which is how I usually find myself in these types of projects), it’s pretty rare to come in at the beginning. The usual situation is that a legacy modernization project has languished for months or even years before anyone admits that they need outside expertise. Put another way, consultants don’t get called in when the modernization project has gone well from the start.

    So, in the absence of experts in legacy modernization, the beginning of a project — when the high-level approach is chosen — tends to be dominated by people who aren’t aware of the patterns.

    The Big Bang is Simpler

    It’s harder to read code than write it.

    — Joel Spolsky, “Things You Should Never Do, Part I”

    Let’s compare the two approaches at a surface level.

    The Strangler Fig necessarily requires working on more parts of the code, since it involves modifying the legacy application. And as a result, it involves a risk of disrupting things during implementation. Furthermore, modifying the legacy application involves reading and making changes to code that has often been neglected for years.

    The Big Bang approach instead limits the changes only to the replacement system, and therefore it also limits the risk of disruption during implementation.

    The Big Bang is the simpler approach, and your instincts as a software developer ought to lean toward the less-complex solution.

    However, as we’ve discussed, the Big Bang trades immediate safety for a huge risk of catastrophic failure at change-over time. It’s one of the exceptions, where taking the more complicated approach is worthwhile, but it’s almost undeniable that on a surface level the Big Bang is simpler.

    The Big Bang is How We Replace Most Other Things

    The Big Bang is also analogous to a lot of the way we replace things in life: if you want a new car, you don’t replace one part at a time like Theseus’s Ship; you procure a completely different car and dispose of the old.

    The problem, though, is that custom software is not like a car. If we need an analogy, it’s probably more like a house: individual, decorated to your own tastes, and difficult to replace. When you don’t like your bathroom, it’s usually much better to remodel than to replace the entire house and go through the whole expensive, time-consuming process of buying, selling, packing up your life, and relocating.

    But no matter what analogy you subscribe to for software development or legacy modernization, replacing a legacy application is a huge task — bigger than writing a greenfield application. You have to copy business logic that has usually evolved over years. What’s more, the application itself is often tightly coupled, making it difficult to split off pieces for incremental modernization.

    We’ll come back to the tight coupling in another post, but for now it’s important because the obvious solution to such a problem is to not even try to split up the legacy application.


    As a software engineer, your instincts are an important part of crafting a solution. For small-scale changes, it can often be sufficient to choose the most obvious solution to the most obvious problem. That’s even a good maxim for large-scale problems, but with an important caveat: the larger the scope of your project, the more crucial it is to consider all of your alternatives, including the not-obvious or not-simple ones.



  • Legacy Modernization: Strangler Fig and Big Bang

    I started to write this as one post, but it was turning out to be a lot longer than I want to write out in one sitting. So, I’m going to divide this up into three separate posts: the problem, the misaligned incentives, and thoughts on solutions. I’ll link the other parts here when they’re done.


    Modernization Projects

    I’ve been working in GovTech for a little while, and one of the things I find fascinating about the space is that a lot of the projects involve modernizing an existing system. I know not everyone agrees, but I enjoy learning a long-established system and the challenge of updating it to fit new requirements.

    I think a fair number of modernization projects begin for the wrong reasons. In my opinion, if the existing system meets the functional requirements (does it do its job?) and nonfunctional requirements (is it fast enough? maintainable?), then you don’t need a new system!

    In other words, putting your system in the cloud is not desirable for its own sake. Microservices architecture is a great way to achieve some requirements at certain levels of resources, but it shouldn’t be the goal itself!

    But as a consultant, if you’re being asked to consult on a legacy modernization project, it’s usually not at the start of the project, and it’s definitely not on a project that’s going well. Nobody asks for help when it’s smooth sailing. If you’re calling a consultant on your project, then odds are you’re already in trouble.

    There are a lot of reasons why these types of projects might struggle, but in my opinion the main one is that they sound a lot easier than they really are. After all, one might think, “We created this software in the first place; it should be easy to fix it up.”

    I’ll probably write a lot more on other types of modernization pitfalls, but for now I want to focus on one specific issue: the high-level approach to modernizing a monolithic architecture and converting it into an equivalent microservices architecture.

    The Big Bang

    Courtesy: Wikimedia Commons

    Your first instinct when trying to replace a legacy system might be to start completely fresh: kick off your application in an entirely new metaphorical universe with a Big Bang-style creation. Then you develop the new system for a while until it does everything that the old system does. And finally, you switch over from the old system to the new and never look back!

    Look, I get why the Big Bang approach is tempting. It’s less complex: you only have to develop on one code base, and there’s little risk of upsetting things during the pre-release phase. Your developers would undoubtedly rather write new code than read the old, so they’re also going to be pushing for this apparently-simple approach.

    Now, don’t get me wrong: I love the simple approach to things. I think in most of software development that the less complex the solution, the easier it will be to maintain and the more resilient it’s likely to be. I think if you want to do something in the complicated way, there had better be a good reason.

    “If you do a big bang rewrite, the only thing you are guaranteed of is the big bang.” – Martin Fowler

    Many readers have probably noticed that the Big Bang is also a highly risky approach. For a project of any size, it’s going to be a lengthy process that delivers results only after months or years. It defers all user input until the end of that years-long development. And, from the user’s perspective, it overhauls a large amount of business logic all at once, so the chances of accidentally breaking business rules are very high. To top it off, troubleshooting defects is much harder when you make such sweeping changes all at once.

    From the stakeholders’ perspective, too, it can be a massive risk to expect the collective will for change to persist for years at a time without a deliverable. People leave jobs, and new people have different priorities. The same people move around in an organization. And even the same people in the same position will often change their minds over that period of time, especially if progress isn’t visible.

    In short, the lack of iteration makes the Big Bang approach a huge risk.

    Strangler Fig

    Courtesy: Wikimedia Commons

    The alternative is to use the Strangler Fig pattern: split your monolith into microservices iteratively in small pieces instead of all at once.

    The basic loop of the Strangler Fig approach is:

    1. Identify a piece of the monolith that can be separated from the rest.
    2. Create a microservice that addresses the business logic of that monolith piece.
    3. In the monolith, call the microservice instead of whatever you were doing before (a minimal routing sketch follows this list).
    4. Remove the piece of the monolith that is no longer used.
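
    Step 3 usually means putting some kind of routing layer, or “strangler facade,” in front of the whole thing. Here’s a minimal sketch in Python of what that might look like: carved-out endpoints go to the new microservice, and everything else still goes to the legacy monolith. The URLs, the single /invoices service, and the Flask/requests stack are all illustrative assumptions, not a prescription.

    import requests
    from flask import Flask, Response, request

    app = Flask(__name__)

    LEGACY_MONOLITH = "http://legacy.internal:8080"
    NEW_SERVICES = {
        "/invoices": "http://invoice-service.internal:8000",  # already strangled off
    }

    @app.route("/", defaults={"path": ""}, methods=["GET", "POST", "PUT", "DELETE"])
    @app.route("/<path:path>", methods=["GET", "POST", "PUT", "DELETE"])
    def route(path):
        # Route by the first path segment; default to the monolith.
        prefix = "/" + path.split("/", 1)[0]
        backend = NEW_SERVICES.get(prefix, LEGACY_MONOLITH)
        upstream = requests.request(
            method=request.method,
            url=f"{backend}/{path}",
            headers={k: v for k, v in request.headers if k.lower() != "host"},
            params=request.args,
            data=request.get_data(),
        )
        return Response(upstream.content, upstream.status_code)

    if __name__ == "__main__":
        app.run()

    Each time you strangle off another piece, you add an entry to the routing table; when the table covers everything, the monolith is gone.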

    It’s easy to see that this approach is far more iterative and has much more frequent feedback from users. Each release carries a risk of business disruption, but by working on a much smaller piece of the application at a time, resolving these disruptions should be far easier than trying to troubleshoot the entire application at once.

    Just as importantly, you get tangible results on a more frequent basis, so stakeholders have visible progress to share and celebrate! You can announce your progress in organization-wide newsletters, demo for your stakeholders, and report the percentage of the legacy application that you’ve retired.

    In short, you’re trading some complexity in the development work — you have to refactor the monolith — for all of the benefits of iterative, Agile development. Personally, I think that for a project of any appreciable size, that decision is a no-brainer.

    And yet, a lot of organizations still make the easy-seeming choice at the onset of their project. What’s more, they often have difficulty adjusting their approach once their modernization efforts get bogged down. Next time we’ll look closely at the incentives that might push an organization in the wrong direction.


    If you find this topic interesting, I highly recommend Kill It With Fire: Manage Aging Computer Systems (and Future-Proof Modern Ones) by Marianne Bellotti, especially if you find yourself in GovTech.



  • DeepSeek Security Review: “Not overtly malicious” but still concerning

    I think by now everyone in the tech industry already knows about DeepSeek: it’s the new mold-breaking, disruptive Large Language Model (LLM) from the Chinese company of the same name. It achieves good performance, and the company claims to have trained it for a tiny fraction of the cost of the top LLMs. Certainly, it’s svelte enough to run a version of it on an Android device.

    There have been security concerns from the start, and a few countries have banned or restricted its use, including Italy, Australia, and the United States Navy.

    SecurityScorecard’s STRIKE team has performed in-depth analysis of DeepSeek, and their results are mixed. Their key findings:

    • The DeepSeek Android app has security vulnerabilities, such as weak encryption, SQL injection risks, and hardcoded keys.
    • It has a broad data collection scope, including user inputs, device data, and keystroke patterns, stored in China.
    • There are concerns about data transmission to Chinese state-owned entities and ByteDance.
    • The app employs anti-debugging mechanisms.
    • DeepSeek has faced regulatory scrutiny and bans in multiple countries.
    • Code analysis reveals integration with ByteDance’s services.
    • The app requests permissions for internet access, phone state, and location.
    • Third-party domains that the app connects to, like Ktor, have failing security scores, which raises business risks related to data security.
    • Despite security weaknesses and privacy concerns, no overtly malicious behavior was detected.

    I think a lot of these are unsurprising: DeepSeek was up front about their data being stored within the People’s Republic of China. The requests for permissions that the app doesn’t really need are almost standard these days, and if Google did it (they do), we wouldn’t think twice.

    What concerns me is their generally poor security practices, combined with the collection of potentially quite private data. As STRIKE points out, it’s odd to use anti-debugging mechanisms, especially for a company claiming to be transparent.

    I don’t think this analysis is going to change anyone’s opinion of DeepSeek: it was widely criticized as a security risk before, just on the basis of sending information to China. Lax security within the app is probably not a big deal compared to that, but it does potentially mean that your data might be exposed to other entities as well.


    I promise: next time I’ll write about something other than SecurityScorecard. I came across this one while reading the previous report, and I wanted to see what they had to say.

  • LLMs are not going to take your job (yet)

    It seems like every site I go to with DevOps or Software Engineering news is inundated with — and I mean positively drowning in — articles about LLMs doing the job of engineers.

    I was away for the weekend, and this is a smattering of the headlines I came back to, just from the first couple of sites I read regularly:

    To be fair, the tone has generally shifted away from the early Chicken Little sky-is-falling doomsday analysis; three of those four articles take a much more limited view of AI’s ability to code (and the fourth isn’t testing the possibility of replacing SWEs).

    There have been benchmarks done on the coding ability of AI, but the last article above — “AI Coding: New Research Shows…” — talks about a new academic paper proposing a more in-depth benchmark that perhaps better captures the work done by a SWE or SWE Manager.

    The study, “SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?“, describes their new benchmark. I’ll let the abstract speak for itself:

    We introduce SWE-Lancer, a benchmark of over 1,400 freelance software engineering tasks from Upwork, valued at $1 million USD total in real-world payouts. SWE-Lancer encompasses both independent engineering tasks — ranging from $50 bug fixes to $32,000 feature implementations — and managerial tasks, where models choose between technical implementation proposals. Independent tasks are graded with end-to-end tests triple-verified by experienced software engineers, while managerial decisions are assessed against the choices of the original hired engineering managers. We evaluate model performance and find that frontier models are still unable to solve the majority of tasks … By mapping model performance to monetary value, we hope SWE-Lancer enables greater research into the economic impact of AI model development.

    Some interesting things from the article:

    • They let the real payouts act as a proxy for the difficulty of each task.
    • They use the real-world Expensify open-source repository, and the tasks sometimes require the context of different parts of the code base to solve.
    • They grade by end-to-end tests instead of unit tests. This is much less susceptible to manipulation, and it provides better insight into the actual efficacy of the solution within the product.
    • They used a variety of tasks categorized either as “IC SWE” for implementation tasks or “SWE Manager” for making a choice between different proposals.

    Results

    On the IC SWE tasks, the article compared OpenAI’s o1 model against Anthropic’s Claude 3.5 Sonnet. (They also included GPT-4o, which performed less well.)

    Model                Tasks Passed    Money Earned
    o1                   16.5%           12.1%
    Claude 3.5 Sonnet    26.2%           24.5%

    Notice that the percentage of tasks passed is higher than the percentage of money earned. This tells me that on average the models were more successful at the easier (lower-value) tasks, as one might expect.

    The rates of success were much higher on the SWE Manager tasks:

    Model                Tasks Passed    Money Earned
    o1                   41.5%           51.8%
    Claude 3.5 Sonnet    44.9%           56.8%

    Interestingly, on these tasks the models seem to have done relatively better at the higher-value (presumably harder) tasks.

    I’d also like to point out that since the SWE Manager tasks involved choosing between 4-5 proposed solutions, random guessing would have passed 20-25% of them. I don’t say this to minimize the results, only to note that there’s no meaningful comparison between the performance on the two task sets.


    So what does this mean? Am I out of a job yet? Probably not, but it’s very useful to have better benchmarks.

    I’d really like to see a proper analysis of how well the AIs do based on the difficulty of the task: are they competent at junior-type problems yet?

    And for comparison, I’d like to see how actual human engineers do on these benchmarks.

  • Multi-Cloud Infrastructure as Code?

    I’m going to do an uncomfortable thing today: I was thinking about a problem, and I’m just going to share my thoughts before I research it. Then, I’ll actually do the research and refine things a bit. The goal is to show the thought process and learning.


    Courtesy: Wikimedia Commons

    One of the main selling points of HashiCorp’s Terraform is that it can be used for multi-cloud deployments. The benefits of this type of deployment are significant:

    • If one provider has an outage, you can simply use your infrastructure in a different provider.
    • You’re far more resistant to vendor lock-in. If a provider raises its prices, you aren’t stuck there.

    The problem of vendor lock-in is huge. Wherever I’ve worked, there’s always this pervasive background question: “Well, what if we wanted to go with Google instead?” And the answers have been unsatisfying. Realistically, the answer is sort of: “Well, we could start over and re-design all this infrastructure for the new platform.”

    If you look at production Terraform, it’s going to be full of resources such as aws_s3_bucket, which is definitively tied to one specific cloud provider.

    So how can you have Infrastructure as Code (IaC) for multi-cloud deployments, when all your resources are tied to a specific provider?

    One solution (and the one that HashiCorp probably recommends) would be to abstract your infrastructure into generic modules that implement your intentions in each of the cloud providers’ specific resources.

    The user would specify “I want a storage container that is readable to the public and highly available.” The module would then be able to create such a container in AWS, Azure, GCP, or wherever you needed to deploy it.

    So you’d have a module that looked maybe something like this:

    # Module "multicloud_storage"

    variable "cloud_provider" {
      type = string # "aws" or "azure"
    }

    # Only one of these resources is actually created, depending on the provider.
    resource "aws_s3_bucket" "main_bucket" {
      count = var.cloud_provider == "aws" ? 1 : 0
      ...
    }

    resource "azurerm_storage_blob" "main_bucket" {
      count = var.cloud_provider == "azure" ? 1 : 0
      ...
    }

    Disclaimer: Please don’t use this code. It’s exactly as untested as it looks.

    Note that awkward count field on every block. I think you could probably make such a generic module work, but you’d have to implement the thing in every provider that you wanted to support.

    But the configurations for the different providers’ storage systems don’t match up one-to-one. Take, for example, the access tier of your storage: how available the objects are and how quickly they can be accessed. AWS S3 has at least nine, plus an Intelligent-Tiering option, whereas Azure uses hot, cool, cold, and archive. In our hypothetical multi-cloud module, we probably want to abstract this away from the user. We might do something like this:

    Module          Azure Blob    AWS S3
    Lava            Hot           S3 Express One Zone
    Hot             Hot           S3 Standard
    Cool            Cool          S3 Standard-IA
    Cold            Cold          Glacier Instant Retrieval
    Glacier         Archive       Glacier Flexible Retrieval
    Micro-Kelvin    Archive       Glacier Deep Archive

    This would allow us to offer the available storage classes in both providers, but the actual storage tier chosen is a little obfuscated from the user.

    But what about features that exist in one provider and not another? For example, S3 offers Transfer Acceleration to speed up transfers to and from the bucket, whereas Azure’s Blob seems to rely mainly on parallel uploads for performance.

    Then we get to whole types of resources that exist in one provider but not another, where the abstraction gets leaky. And the juice-to-squeeze ratio of maintaining all of these implementations looks poor for lesser-used resource types or highly provider-specific ones like QuickSight.

    I’m about to end the rambling, self-reflecting portion of this post and do some actual research. I hope that someone has created modules like this that allow the same infrastructure to work for multi-cloud deployments. My intuition is that it’s too unwieldy.

    Here I go into the Internet. Fingers crossed!


    Hey, it’s me. I’m back, fifteen minutes later.

    I didn’t find a ton. There are a smattering of tools that claim to handle this.

    For example, Pulumi, an Infrastructure as Code tool and alternative to Terraform, says that it handles multi-cloud deployments natively. I’d be interested in learning more.

    I found several articles offering a guide to multi-cloud Terraform modules. I did not, however, find any well-maintained modules themselves.

    The void feels a little weird to me: there’s obviously a great need for this sort of module. It’s the sort of problem that the open source community has traditionally been good at solving. Like I said before, my intuition is that this is a very difficult (expensive) problem, so maybe the cost just outweighs the demand?

    One Stack Overflow post mentioned that one of the reasons people don’t share Terraform in open source is that it makes it easy to find vulnerabilities in your infrastructure. (But isn’t that supposed to be a strength of open source: to crowdsource the identification and correction of these vulnerabilities?) Anyway, extrapolating a bit: this reluctance to share infrastructure code might also be a huge barrier to building such a multi-cloud module.

    If I were going to implement something professionally, I’d do a lot more than fifteen minutes of research. But, gentle reader, it looks bleak out there. Let me know if there’s anything good out there that I missed.

  • North Korean Malware Wins at Hide and Seek

    Courtesy: Wikimedia Commons

    SecurityScorecard has released a report describing how they uncovered evidence of an attack by North Korea’s Lazarus Group against developers. The attack uses sophisticated anti-detection techniques to deliver its new implant Marstech1, designed to steal cryptocurrency wallets.

    Marstech1, a JavaScript implant, is being served by Lazarus’s Command & Control (C2) server, and a similar implant was also added to several open source GitHub repositories.

    This malware targets the directories used by Exodus and Atomic Crypto wallets. It can copy the data, package it, and send it to the C2 server.

    What makes Marstech1 unique, though, is the extent to which its authors have gone to obfuscate the code to avoid detection. From the report:

    The Marstech implants utilize different obfuscation techniques than previously seen. The JS implant that was observed utilizes;

    • Control flow flattening & self-invoking functions
    • Random variable and function names
    • Base64 string encoding
    • Anti-debugging (anti-tamporing [sic] checks)
    • Splitting and recombining strings

    This ensures that if the threat actor embedded the JS into a software project it would go unnoticed.

    There’s a full explanation in the report, so if you’re interested I highly recommend it. Suffice it to say that security researchers have their work cut out for them right now.

  • No! Shut them all down, hurry!

    Luke Skywalker: [interrupting] Will you shut up and listen to me! Shut down all the garbage mashers on the detention level, will ya? Do you copy? Shut down all the garbage mashers on the detention level! Shut down all the garbage mashers on the detention level!


    C-3PO: [to R2-D2] No! Shut them all down, hurry!

    The Official AWS Blog (great resource, by the way!) has a post describing how to reduce costs in EC2 by automatically shutting down instances while not in use.

    The short version of the blog post is that you can achieve this sort of shut down in two ways:

    1. Create a CloudWatch alarm to shut down when CPU usage is below a threshold for a period of time.
    2. Create an EventBridge trigger and Lambda to shut down on a schedule (a rough Lambda sketch follows this list).
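
    Here’s roughly what the Lambda in option #2 might look like. This is my own sketch, not the code from the AWS post; the auto-shutdown tag is a made-up convention, and the EventBridge schedule (for example, a cron rule firing at 20:00 on weekdays) lives outside this function.

    import boto3

    ec2 = boto3.client("ec2")

    def handler(event, context):
        # Find running instances tagged for auto-shutdown (hypothetical tag).
        reservations = ec2.describe_instances(
            Filters=[
                {"Name": "tag:auto-shutdown", "Values": ["true"]},
                {"Name": "instance-state-name", "Values": ["running"]},
            ]
        )["Reservations"]
        instance_ids = [
            inst["InstanceId"] for r in reservations for inst in r["Instances"]
        ]
        if instance_ids:
            ec2.stop_instances(InstanceIds=instance_ids)
        return {"stopped": instance_ids}

    A mirror-image function (or the same one, keyed off the event) would start the instances again at 06:30.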

    I would argue that in most deployments you would have a more precise metric for option #1, one that actually reflects your HTTP request volume.

    There are other guides on how to do this; I’ve looked at some, as I’ve been planning to do this for our Fargate instances (not EC2, obviously, but similar enough) in our test environments. However, it’s nice to have an official source on how to do this kind of shutdown.

    The reason we want to do this on my project is to save on cloud costs. The savings probably aren’t that much, but they come from an area of the budget that is limited and needed for other things.

    At any rate, option #2 better reflects what my team will want to do. We have very spiky usage, but when we do go to test an instance, we don’t want to have to wait for it to spin up. Since we have similar work hours, we’ll probably want to shut down the instances except for around 06:30 – 20:00 on weekdays. That way, it’s up whenever people are likely to be working and down at other times.

    One difficulty I’ve anticipated: what if someone tries to test at an odd hour? I don’t mind terribly if they need to start an instance manually; it should happen very infrequently. However, I’d like them to see a clear error message describing what’s going on. We won’t use this system in production for obvious reasons, but it would be nice if future devs years from now aren’t confused because they can’t reach the service.

    So, I’m wondering if there’s a good way to dynamically point our ALB to a static file on an S3 bucket or something while the instances are down. It might be possible to set a static file as an error page in the ALB? Not sure yet. Clearly I have not yet given this more than cursory research, but it’s on my mind.
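
    For future reference, and very much untested: ALB listener rules do support a fixed-response action with a small inline body, which isn’t quite a static file on S3 but might be enough for a clear “we’re asleep” page. The shutdown Lambda could add the rule and the wake-up Lambda could remove it. A rough boto3 sketch, with the listener ARN, priority, and wording all hypothetical:

    import boto3

    elbv2 = boto3.client("elbv2")

    def enable_sleep_page(listener_arn: str) -> None:
        """Add a catch-all rule that answers every request with a friendly
        'environment is asleep' page while the backends are stopped."""
        elbv2.create_rule(
            ListenerArn=listener_arn,
            Priority=1,  # evaluated before the normal forwarding rules
            Conditions=[{"Field": "path-pattern", "Values": ["/*"]}],
            Actions=[
                {
                    "Type": "fixed-response",
                    "FixedResponseConfig": {
                        "StatusCode": "503",
                        "ContentType": "text/html",
                        "MessageBody": "<h1>Test environment is asleep (weekdays 06:30-20:00 only).</h1>",
                    },
                }
            ],
        )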