Tag: ai

  • Google DeepMind Announced an LLM-Based Problem-Solver

    Earlier this week, Google DeepMind announced its new research tool AlphaEvolve. Basically, it’s an LLM-driven tool that uses evolutionary algorithms to find solutions for certain math or software problems. It’s already come up with optimizations on a few important problems that could lead to efficiency gains within the AI space and perhaps beyond.

    Disclaimer: I haven’t had time to read the whole paper yet, but I’ve managed to read Google DeepMind’s blog post, watch this interview and read a few news articles.

    The main limitation of AlphaEvolve is that it can only work on problems where the solution can be evaluated by a machine. So, on the trivial end of things, this might be similar to a LeetCode problem such as “Reverse a linked list”. To solve this, AlphaEvolve would come up with a few solutions and evaluate them for both correctness and efficiency. Obviously, this is the sort of problem that computer science students should be able to solve in their sleep.

    Of course, what’s interesting is when you direct AlphaEvolve to work on harder problems.

    How does it work?

    Evolutionary algorithms can solve problems within a large solution-space by tweaking parameters and running many different trials. Selection criteria and evaluation methods can vary, but the general idea is to choose the best solutions from one generation, tweak them a bit, and run a new generation of trials.

    Where AlphaEvolve improves on this method of problem-solving is that it uses an LLM to direct progress rather than relying solely on random tweaks to the parameters. It also uses automatic code generation, so the candidates being tested are (or can be?) entire code implementations.
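    As a rough mental model (and emphatically not AlphaEvolve’s actual implementation, which is far more elaborate), the loop might look something like the toy sketch below. The evaluate and llm_propose_variant functions are placeholders of my own invention, standing in for the automated evaluator and the LLM call:

    ```python
    import random

    def evaluate(program: str) -> float:
        # Automated fitness function: a stand-in for compiling and running the
        # candidate, then scoring it on correctness and speed. As a toy
        # placeholder, shorter "programs" simply score higher here.
        return -len(program)

    def llm_propose_variant(parent: str) -> str:
        # Stand-in for an LLM call that rewrites the parent program. A real
        # system would prompt a model with the parent code, its score, and
        # other context from earlier generations.
        if not parent:
            return parent
        i = random.randrange(len(parent))
        return parent[:i] + parent[i + 1:]  # toy "mutation": drop one character

    def evolve(seed: str, generations: int = 50, population_size: int = 20) -> str:
        population = [seed]
        for _ in range(generations):
            # Score every candidate automatically (no human in the loop), keep
            # the fittest as parents, and let the "LLM" propose the rest of the
            # next generation.
            ranked = sorted(population, key=evaluate, reverse=True)
            parents = ranked[: max(2, population_size // 4)]
            children = [llm_propose_variant(random.choice(parents))
                        for _ in range(population_size - len(parents))]
            population = parents + children
        return max(population, key=evaluate)

    print(evolve("a deliberately long and clunky first attempt"))
    ```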

    The novel thing here is that LLMs aren’t just generating code, they’re guiding the search across a massive algorithmic space. This leads to verifiably novel solutions, not just rediscovering old ones.

    Purloined from their blog post

    What problems can it solve?

    AlphaEvolve can only work on problems that can be evaluated by machine. These evaluations can be based on mathematical correctness, performance metrics, or even physical simulations. The key is that there’s an automated, not human, way to judge success. By taking the human out of the loop, the system can run thousands or millions of trials until it finds its solutions.
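    To make that concrete with the linked-list example from earlier: a machine evaluator for that problem might look something like the toy sketch below. The function names, test cases, and scoring scheme are my own illustration, not anything from the paper.

    ```python
    import time

    def evaluate_reverse(candidate) -> float:
        # Purely illustrative evaluator: a real one would sandbox the candidate
        # code, use far more test cases, and measure performance more carefully.
        test_cases = [[], [1], [1, 2, 3], list(range(500))]
        for case in test_cases:
            if candidate(list(case)) != list(reversed(case)):
                return float("-inf")  # incorrect solutions are rejected outright

        start = time.perf_counter()
        for _ in range(200):
            candidate(list(range(500)))
        return -(time.perf_counter() - start)  # among correct solutions, faster is fitter

    # Two correct candidates; the evaluator prefers the faster one.
    def rev_slicing(xs):
        return xs[::-1]

    def rev_insert(xs):
        out = []
        for x in xs:
            out.insert(0, x)  # correct, but O(n^2) overall
        return out

    print(evaluate_reverse(rev_slicing) > evaluate_reverse(rev_insert))  # True, in practice
    ```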

    Despite being limited to this specific type of question, there are a lot of problems in that space, including data center scheduling, hardware design, AI training and inference, and mathematics. In Google DeepMind’s blog post, they said:

    “To investigate AlphaEvolve’s breadth, we applied the system to over 50 open problems in mathematical analysis, geometry, combinatorics and number theory. The system’s flexibility enabled us to set up most experiments in a matter of hours. In roughly 75% of cases, it rediscovered state-of-the-art solutions, to the best of our knowledge.”

    One of the solutions that has been highly touted is its universal and recursively applicable 48-step algorithm for multiplying 4×4 matrices, with major implications for machine learning and graphics. The algorithms are a bit beyond my understanding of linear algebra, but here’s what I’ve gathered:

    • If you do it the usual way, multiplying two 2×2 matrices takes 8 scalar multiplications: each of the four entries of the product is the sum of two products of a row entry and a column entry.
    • Strassen’s algorithm does the same job in only 7 multiplications, and mathematicians have determined that 7 is the optimal number for this problem.
    • Because the 7-multiplication algorithm can be applied recursively, you can multiply two 4×4 matrices in 7^2 = 49 multiplications: treat each 4×4 matrix as a 2×2 matrix of 2×2 blocks and multiply the blocks.
    • AlphaEvolve’s solution needs one multiplication fewer than that 7^2 = 49 algorithm, so on the same problem it should be around 2% more efficient.
    • AlphaEvolve’s solution can also be used recursively, so multiplying larger matrices should also get more efficient. I’m not totally clear about how much it would speed things up for which size of matrix, but the quick arithmetic after this list is my attempt to gauge it.
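    Here’s the quick arithmetic I mean, using only the figures quoted above. Treat it as back-of-the-envelope: I may well be missing subtleties about exactly when and how the recursion applies.

    ```python
    # Scalar multiplications needed to multiply two 2^k x 2^k matrices when you
    # recurse on 2x2 blocks: the schoolbook method needs 8 block products per
    # level and Strassen needs 7, so a 4x4 product costs 8^2 = 64 or 7^2 = 49.
    for k in range(1, 6):
        n = 2 ** k
        print(f"{n}x{n}: schoolbook = {8 ** k}, Strassen = {7 ** k}")

    # AlphaEvolve's 4x4 algorithm uses 48 multiplications instead of 49. If it
    # can be applied recursively on 4x4 blocks, the saving compounds: a 16x16
    # product would take 48^2 = 2304 multiplications rather than 49^2 = 2401,
    # roughly 4% fewer.
    print(48 ** 2, 49 ** 2)  # 2304 2401
    ```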

    The reason this seemingly small optimization is so important is that we do a ton of matrix multiplication in machine learning, in both training and inference. So a small improvement here can make an enormous difference.

    Similarly, one of the other problems that AlphaEvolve worked on was something (we don’t seem to know exactly what, and it’s probably proprietary) that provided Google with an optimization to its data centers that “recovers on average 0.7% of Google’s fleet-wide compute resources”. Given the immense scale of Google’s data centers, this would be a huge sum of money!

    Why does it matter?

    The major advance here isn’t just speed—it’s novelty. AlphaEvolve didn’t just find better implementations of known algorithms; in some cases, it created ones that are new to science, like the 48-step recursive matrix multiplication.

    One of the major criticisms of LLMs has been that, despite their vast reservoir of knowledge, LLMs haven’t really synthesized that knowledge to come up with new discoveries. (To be clear, there have been a few such discoveries from other areas of AI, such as DeepMind’s AlphaFold.) Well, now we have an LLM-based method to make those discoveries, albeit only for a specific type of problem. Even keeping its limitations in mind, the algorithmic improvements to matrix multiplication alone could generate huge savings in energy and cooling, and reduce environmental damage, in the coming years.

  • Legacy Modernization: Do We Even Need to Do This?

    If you’re very, very lucky, then you’re just at the beginning of your legacy modernization project. You haven’t done development work, and you haven’t even designed anything. Your first step is to carefully consider what you want and how to get there. If so, then well done! You’re already off to a better start than a lot of such projects.

    More often, though, you’re partway through your modernization effort, and it’s struggling. The main risks in such a project, as we’ve discussed, are losing momentum or running out of money. However, there are plenty of other things that could go wrong, some of which we’ve talked about in other posts. But something isn’t going well, and you’ve got to figure out what to do.

    In either scenario, your first decision has to be whether or not to modernize. Odd though it may sound, I don’t think the answer is always “yes.”

    I firmly believe that there’s nothing inherently wrong with using old software. Old software has already been adapted to meet user needs. If it’s old and still in use, that’s because it works. Upgrading to newer technologies or moving to a public cloud are not worthwhile as goals just for their own sake. Legacy modernization is extremely expensive and not to be undertaken lightly.

    Is It Broken?

    The first question we should consider here is whether our legacy system is meeting all of our business needs. Odds are, it’s not. Something isn’t right about it; otherwise, hopefully, we wouldn’t be considering modernization at all.

    Maybe the legacy system is too slow, has reached the limits of vertical scaling, and can’t be horizontally scaled. Maybe its on-premises infrastructure is being retired. Maybe everyone is tired of using a 1980s green-screen interface. Or possibly you have new compliance requirements that aren’t satisfied.

    Whatever is going on, you should identify the particular requirements that are unmet and their importance to your mission. Write them down, and be specific. We’ll need that list in a moment.

    Can It Be Fixed?

    The second question is just as important: can we modify the existing system to meet our requirements?

    • Can we add an API and modern web interface on top of the legacy system? (A rough sketch of what that might look like follows this list.)
    • What if we separate the data layer from the application/business layers, so the application tier can be scaled horizontally?
    • Can we lift-and-shift into a public cloud without rewriting the whole application? (NB: a lift-and-shift is never as easy as it first appears.)
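    On that first option, a thin facade is often enough to get started. Below is a minimal sketch of the idea, assuming (hypothetically) that the legacy system can be driven as a command-line program, and using Flask purely as an example web framework; the command name and record layout are made up for illustration.

    ```python
    import subprocess
    from flask import Flask, jsonify

    app = Flask(__name__)

    @app.route("/customers/<customer_id>")
    def get_customer(customer_id: str):
        # Thin modern API over an unchanged legacy system. Hypothetical example:
        # the legacy application is invoked as a command-line program and its
        # fixed-width output is parsed here. In practice the bridge might be a
        # database view, a message queue, or even terminal screen-scraping.
        result = subprocess.run(
            ["legacy-app", "CUSTINQ", customer_id],  # made-up legacy command
            capture_output=True, text=True, timeout=30,
        )
        if result.returncode != 0:
            return jsonify({"error": result.stderr.strip()}), 502

        # Translate the legacy record layout into JSON for modern clients.
        name = result.stdout[:30].strip()
        balance = result.stdout[30:42].strip()
        return jsonify({"id": customer_id, "name": name, "balance": balance})

    if __name__ == "__main__":
        app.run(port=8080)
    ```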

    The answer here is often that the existing application can’t be modified.

    • If it’s in COBOL, I understand it’s really hard to find COBOL developers. So the price of maintaining the system might be out of control.
    • If the infrastructure is going away and is out of your control, you’ll definitely need to move somewhere else, and that may necessitate modernization.
    • Reaching the limits of scaling is very tough to remedy, so it’s one of those scenarios where I think modernization of some sort is frequently justified.

    So, now also write down what makes it impractical to upgrade your existing system.

    What If We’re Already Modernizing?

    Even if your modernization project is already in progress, you need to do this analysis; in fact, perhaps especially then.

    I’m sure most of my readers understand the sunk cost fallacy, but even so the warning bears repeating.

    A partially-complete modernization project is expensive in terms of money and effort. You’ve probably spent months or years of developer time. There have been a lot of stakeholder conversations, and it takes effort to build the will to modernize in the first place.

    However, no matter what you’ve invested already, you should be considering the cost-benefit of completing the project versus avoiding the remainder of the project by upgrading the existing software.

    This can be a hard analysis to complete: project timelines are notoriously imprecise, and that is doubly true in a project that involves, for example, understanding and replicating obscure or poorly-understood functionality of a decades-old monolith.

    However, this analysis is critical to project success: the difference in cost between upgrading in place versus a complete rewrite can be absolutely enormous. Even if you’re mostly sure you need to replace the system, a small chance of saving all that effort is worth an hour or two of analysis.


    Next time, we’ll cover choosing the architecture in a modernization project. We’ve mostly talked about the Strangler Fig versus Big Bang approaches, and that implies moving to a cloud-based microservices architecture. However, that’s far from the only reasonable approach, and it’s critically important that you move to the architecture that makes sense for your requirements and available developer resources.



  • DeepSeek Security Review: “Not overtly malicious” but still concerning

    I think by now everyone in the tech industry already knows about DeepSeek: it’s the new mold-breaking, disruptive Large Language Model (LLM) from the Chinese company of the same name. It achieves good performance, and the company claims to have trained it for a tiny fraction of the cost of the top LLMs. Certainly, it’s svelte enough to run a version of it on an Android device.

    There have been security concerns from the start, and a few countries have banned or restricted its use, including Italy, Australia, and the United States Navy.

    SecurityScorecard’s STRIKE team has performed in-depth analysis of DeepSeek, and their results are mixed. Their key findings:

    • The DeepSeek Android app has security vulnerabilities, such as weak encryption, SQL injection risks, and hardcoded keys.
    • It has a broad data collection scope, including user inputs, device data, and keystroke patterns, stored in China.
    • There are concerns about data transmission to Chinese state-owned entities and ByteDance.
    • The app employs anti-debugging mechanisms.
    • DeepSeek has faced regulatory scrutiny and bans in multiple countries.
    • Code analysis reveals integration with ByteDance’s services.
    • The app requests permissions for internet access, phone state, and location.
    • Third-party domains that the app connects to, like Ktor, have failing security scores, which raises business risks related to data security.
    • Despite security weaknesses and privacy concerns, no overtly malicious behavior was detected.

    I think a lot of these are unsurprising: DeepSeek was up front about their data being stored within the People’s Republic of China. The requests for permissions that the app doesn’t really need are almost standard these days, and if Google did it (they do), we wouldn’t think twice.

    Of concern to me are their poor security practices in general, combined with the collection of potentially quite private data. As STRIKE points out, it’s weird to use anti-debugging mechanisms, especially for a company claiming to be transparent.

    I don’t think this analysis is going to change anyone’s opinion of DeepSeek: it was widely criticized as a security risk before, just on the basis of sending information to China. Lax security within the app is probably not a big deal compared to that, but it does potentially mean that your data might be exposed to other entities as well.


    I promise: next time I’ll write about something other than SecurityScorecard. I came across this one while reading the previous report, and I wanted to see what they had to say.

  • LLMs are not going to take your job (yet)

    It seems like every site I go to with DevOps or Software Engineering news is inundated with — and I mean positively drowning in — articles about LLMs doing the job of engineers.

    I was away for the weekend, and this is a smattering of the headlines I came back to, just on the first couple sites that I frequently read from:

    To be fair, the tone has generally shifted away from the early Chicken Little sky-is-falling doomsday analysis; three of those four articles take a much more limited view of AI’s ability to code (and the fourth isn’t testing the possibility of replacing SWEs).

    There have been benchmarks done on the coding ability of AI, but the last article above — “AI Coding: New Research Shows…” — talks about a new academic paper proposing a more in-depth benchmark that perhaps better captures the work done by a SWE or SWE Manager.

    The study, “SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?”, describes their new benchmark. I’ll let the abstract speak for itself:

    We introduce SWE-Lancer, a benchmark of over 1,400 freelance software engineering tasks from Upwork, valued at $1 million USD total in real-world payouts. SWE-Lancer encompasses both independent engineering tasks — ranging from $50 bug fixes to $32,000 feature implementations — and managerial tasks, where models choose between technical implementation proposals. Independent tasks are graded with end-to-end tests triple-verified by experienced software engineers, while managerial decisions are assessed against the choices of the original hired engineering managers. We evaluate model performance and find that frontier models are still unable to solve the majority of tasks … By mapping model performance to monetary value, we hope SWE-Lancer enables greater research into the economic impact of AI model development.

    Some interesting things from the article:

    • They let the real payouts act as a proxy for the difficulty of the task.
    • They use the real-world Expensify open-source repository, and the tasks sometimes require the context of different parts of the code base to solve.
    • They grade by end-to-end tests instead of unit tests. This is much less susceptible to manipulation and it provides a better insight into the actual efficacy of the solution within the product.
    • They used a variety of tasks categorized either as “IC SWE” for implementation tasks or “SWE Manager” for making a choice between different proposals.

    Results

    On the IC SWE tasks, the article compared OpenAI’s o1 model against Anthropic’s Claude 3.5 Sonnet. (They also included GPT-4o, which performed less well.)

    | Model             | Tasks Passed | Money Earned |
    |-------------------|--------------|--------------|
    | GPT o1            | 16.5%        | 12.1%        |
    | Claude 3.5 Sonnet | 26.2%        | 24.5%        |

    Notice that the percentage of tasks passed is higher than the percentage of money earned. This tells me that on average the models were more successful at the easier tasks, as one might expect.

    The rates of success were much higher on the SWE Manager tasks:

    | Model             | Tasks Passed | Money Earned |
    |-------------------|--------------|--------------|
    | GPT o1            | 41.5%        | 51.8%        |
    | Claude 3.5 Sonnet | 44.9%        | 56.8%        |

    Interestingly, the pattern reverses here: the models earned a larger share of the money than of the tasks, so they seem to have done disproportionately well on the higher-value tasks.

    I’d like to also point out that since the SWE Manager tasks involved choosing between 4-5 proposed solutions, random guessing would have passed 20-25% of tasks. I don’t say this to minimize the results; only to say that there’s no meaningful comparison between the performance on the two data sets.


    So what does this mean? Am I out of a job yet? Probably not, but it’s very useful to have better benchmarks.

    I’d really like to see a proper analysis of how well the AIs do based on the difficulty of the task: are they competent at junior-type problems yet?

    And for comparison, I’d like to see how actual human engineers do on these benchmarks.

  • Can AI Generate Functional Terraform?

    Nope.

    The end. Thanks for reading my post. Reading time: 1 minute.


    I posted the other day about this topic, and I am intrigued by the possibilities. I’m certainly interested in the ways that you can use it for infrastructure, and the article in that post offers a somewhat-different use case for AI-generated Terraform: cloud migrations and multi-cloud solutions. But I’d be lying if I said I wasn’t very skeptical of the code that it writes.

    With all that on my mind, I appreciate the analysis in this article: “Can AI Generate Functional Terraform?” by Rak Siva.

    I’d add mainly that GenAI is currently about as useful as a very junior developer, and that’s probably because they’re both doing the same thing: Googling for an answer and copy-pasting it without really understanding.

    Then again, if you’ll indulge a quickly-emerged cliché: none of us saw any of this coming just five years ago.

  • Easier Cloud-to-Cloud Migrations?

    Cloud with a lock. Courtesy of Wikimedia Commons.

    An Empty (Theoretical) Promise

    It’s long been a promise of Infrastructure as Code tools like Terraform that you could theoretically create platform-independent IaC and deploy freely into any cloud environment. I doubt anyone ever really meant that literally, but the reality is that your cloud infrastructure is inevitably going to be tied quite closely to your provider. If you’re using an aws_vpc resource, it’s pretty unlikely that you could easily turn that into its equivalent in another provider.

    And yet, several of the organizations I’ve worked with have been reluctant to tie themselves closely to one cloud provider or another. The business reality is that vendor lock-in is a huge source of anxiety: if AWS suddenly and drastically raised their prices, or if they for some reason became unavailable, lots and lots of businesses would be in a big pickle!

    The amount of work required to manually transfer an existing system from one provider to another would be nearly as much as creating the system in the first place.

    GenAI as the Solution?

    I ran across this article about StackGen’s Cloud Migration product. The article isn’t long, so go read it.

    Instead of requiring DevOps teams to map out resources manually, the system uses read-only API access to scan existing cloud environments. It automatically identifies resources, maps dependencies, and – perhaps most importantly – maintains security policies during the transition.

    StackGen isn’t new to using generative AI for infrastructure problems, but they have an interesting approach here:

    1. Use read-only APIs to identify resources, including those not already in IaC. (A rough sketch of this step follows the list.)
    2. Use generative AI to map those resources, including security policies, compliance policies, and resource dependencies.
    3. Convert those mapped resources into deployment-ready IaC for the destination environment.
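    I have no insight into how StackGen actually implements any of this, but step 1 is easy to picture. Below is a minimal sketch of a read-only scan, assuming AWS as the source environment and using standard boto3 describe calls; everything beyond those calls is my own illustration of the idea.

    ```python
    import json
    import boto3

    def scan_network_resources(region: str = "us-east-1") -> dict:
        # Inventory a small slice of an AWS environment using read-only API
        # calls only. Illustrative sketch: a real migration tool would cover far
        # more services, follow pagination, and capture IAM policies, tags, etc.
        ec2 = boto3.client("ec2", region_name=region)

        inventory = {
            "vpcs": ec2.describe_vpcs()["Vpcs"],
            "subnets": ec2.describe_subnets()["Subnets"],
            "security_groups": ec2.describe_security_groups()["SecurityGroups"],
        }

        # Record simple dependencies (which subnets and security groups belong
        # to which VPC) so that the mapping step has a graph to work from.
        inventory["edges"] = [
            {"from": sn["SubnetId"], "to": sn["VpcId"], "type": "subnet_of"}
            for sn in inventory["subnets"]
        ] + [
            {"from": sg["GroupId"], "to": sg["VpcId"], "type": "secures"}
            for sg in inventory["security_groups"]
            if "VpcId" in sg
        ]
        return inventory

    if __name__ == "__main__":
        print(json.dumps(scan_network_resources(), indent=2, default=str))
    ```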

    Using a process like this to migrate from provider to provider is interesting, but the one use case that really gets me thinking is the ability to deploy into a multi-cloud environment.

    I’ll be keeping my eyes on this one.