Category: News

  • Google DeepMind Announced an LLM-Based Problem-Solver

    Earlier this week, Google DeepMind announced its new research tool AlphaEvolve. Basically, it’s an LLM-driven tool that uses evolutionary algorithms to find solutions for certain math or software problems. It’s already come up with optimizations on a few important problems that could lead to efficiency gains within the AI space and perhaps beyond.

    Disclaimer: I haven’t had time to read the whole paper yet, but I’ve managed to read Google DeepMind’s blog post, watch this interview and read a few news articles.

    The main limitation of AlphaEvolve is that it can only work on problems where the solution can be evaluated by a machine. So, on the trivial end of things, this might be similar to a LeetCode problem such as “Reverse a linked list”. To solve this, AlphaEvolve would come up with a few solutions and evaluate them for both correctness and efficiency. Obviously, this is the sort of problem that computer science students should be able to solve in their sleep.
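
    As a minimal sketch of what “evaluated by a machine” could mean for that toy example, here is a hypothetical fitness function in Python. The function name, the scoring formula, and the correctness-then-speed structure are all my own inventions for illustration, not AlphaEvolve’s actual interface.

    ```python
    import time

    def evaluate(candidate_reverse, test_lists):
        """Hypothetical automated evaluator: correctness first, then speed."""
        # Correctness gate: every test case must match the known-good answer.
        for xs in test_lists:
            if candidate_reverse(list(xs)) != list(reversed(xs)):
                return 0.0  # failed candidates score zero and are discarded

        # Efficiency score: faster candidates get a higher fitness value.
        start = time.perf_counter()
        for xs in test_lists:
            candidate_reverse(list(xs))
        elapsed = time.perf_counter() - start
        return 1.0 / (1.0 + elapsed)

    # Example: score one trivial candidate solution.
    tests = [[], [1], [1, 2, 3], list(range(1000))]
    print(evaluate(lambda xs: xs[::-1], tests))
    ```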

    Of course, things get interesting when you point AlphaEvolve at harder problems.

    How does it work?

    Evolutionary algorithms can solve problems within a large solution-space by tweaking parameters and running many different trials. Selection criteria and evaluation methods can vary, but the general idea is to choose the best solutions from one generation, tweak them a bit, and run a new generation of trials.

    Where AlphaEvolve improves on this method is that it uses an LLM to direct progress rather than relying solely on random mutation of the parameters. It also uses automatic code generation, so that the candidates being tested are (or can be?) entire code implementations.

    The novel thing here is that LLMs aren’t just generating code, they’re guiding the search across a massive algorithmic space. This leads to verifiably novel solutions, not just rediscovering old ones.
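
    To make that loop concrete, here is a heavily simplified, hypothetical sketch of an LLM-guided evolutionary search. The function llm_propose_variants stands in for a call to a code-generating LLM and is not a real API, and the population size and greedy selection rule are placeholders of my own; the real system is considerably more sophisticated.

    ```python
    # Hypothetical sketch of an LLM-guided evolutionary loop, not AlphaEvolve's
    # actual architecture. `evaluate` is any automated scorer that maps a
    # candidate program to a fitness value, like the evaluator sketched above.

    def llm_propose_variants(parent_program: str, n: int) -> list[str]:
        """Placeholder for asking an LLM to rewrite/mutate a program in n ways."""
        raise NotImplementedError("stand-in for an LLM call")

    def evolve(seed_program: str, evaluate, generations: int = 10, population: int = 8):
        best_program, best_score = seed_program, evaluate(seed_program)
        for _ in range(generations):
            # The LLM proposes code-level variants of the current best program,
            # rather than relying on purely random parameter mutations.
            candidates = llm_propose_variants(best_program, population)
            scored = [(evaluate(p), p) for p in candidates]
            top_score, top_program = max(scored)
            if top_score > best_score:
                best_score, best_program = top_score, top_program
        return best_program, best_score
    ```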

    [Image purloined from their blog post]

    What problems can it solve?

    AlphaEvolve can only work on problems that can be evaluated by machine. These evaluations can be based on mathematical correctness, performance metrics, or even physical simulations. The key is that there’s an automated, not human, way to judge success. By taking the human out of the loop, the system can run thousands or millions of trials in its search for solutions.

    Despite being limited to this specific type of question, there are a lot of problems in that space, including data center scheduling, hardware design, AI training and inference, and mathematics. In Google DeepMind’s blog post, they said:

    “To investigate AlphaEvolve’s breadth, we applied the system to over 50 open problems in mathematical analysis, geometry, combinatorics and number theory. The system’s flexibility enabled us to set up most experiments in a matter of hours. In roughly 75% of cases, it rediscovered state-of-the-art solutions, to the best of our knowledge.”

    One of the solutions that has been most highly touted is its 48-step, recursively applicable algorithm for multiplying 4×4 complex-valued matrices (here a “step” is a scalar multiplication), with major implications for machine learning and graphics. The algorithms are a bit beyond my understanding of linear algebra, but here’s what I’ve gathered:

    • If you do it the usual way, multiplying two 2×2 matrices takes 8 multiplications: each entry of the result is the sum of two products of a row entry and a column entry.
    • There is an optimization (Strassen’s algorithm) that multiplies two 2×2 matrices with only 7 multiplications, and mathematicians have proven that 7 is the minimum for this problem. (The construction is sketched below.)
    • Because the 7-multiplication algorithm can be applied recursively, you can multiply two 4×4 matrices in 7^2 = 49 multiplications: treat each 4×4 matrix as a 2×2 matrix of 2×2 blocks and apply the same formulas to the blocks.
    • AlphaEvolve’s solution uses one fewer multiplication than that 7^2 = 49 method, so on the same problem it should be roughly 2% more efficient.
    • AlphaEvolve’s solution can also be applied recursively, so multiplying larger matrices should also be more efficient. I’m not totally clear how much it would speed things up for which matrix sizes.
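
    For reference, the classic 7-multiplication construction for 2×2 matrices (Strassen’s algorithm, published in 1969) is sketched below. This is the well-known existing algorithm, not AlphaEvolve’s new one; it’s here only to make the “7 steps, applied recursively to blocks” idea concrete.

    ```python
    def strassen_2x2(A, B):
        """Multiply two 2x2 matrices using 7 multiplications (Strassen, 1969).

        The same seven products also work when a11..b22 are themselves matrix
        blocks, which is where the 7^2 = 49 count for 4x4 matrices comes from.
        """
        (a11, a12), (a21, a22) = A
        (b11, b12), (b21, b22) = B

        m1 = (a11 + a22) * (b11 + b22)
        m2 = (a21 + a22) * b11
        m3 = a11 * (b12 - b22)
        m4 = a22 * (b21 - b11)
        m5 = (a11 + a12) * b22
        m6 = (a21 - a11) * (b11 + b12)
        m7 = (a12 - a22) * (b21 + b22)

        return [
            [m1 + m4 - m5 + m7, m3 + m5],
            [m2 + m4, m1 - m2 + m3 + m6],
        ]

    # Sanity check against the ordinary 8-multiplication method.
    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    naive = [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)] for i in range(2)]
    assert strassen_2x2(A, B) == naive
    ```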

    The reason this seemingly small optimization is so important is that we do a ton of matrix multiplication in machine learning, in both training and inference, so even a small improvement here can add up to an enormous difference at scale.

    Similarly, one of the other problems that AlphaEvolve worked on was something (we don’t seem to know exactly what, and it’s probably proprietary) that provided Google with an optimization to its data centers that “recovers on average 0.7% of Google’s fleet-wide compute resources”. Given the immense scale of Google’s data centers, this would be a huge sum of money!

    Why does it matter?

    The major advance here isn’t just speed—it’s novelty. AlphaEvolve didn’t just find better implementations of known algorithms; in some cases, it created ones that are new to science, like the 48-step recursive matrix multiplication.

    One of the major criticisms of LLMs has been that, despite their vast reservoir of knowledge, they haven’t really synthesized that knowledge to come up with new discoveries. (To be clear, there have been a few such discoveries from other areas of AI, such as DeepMind’s AlphaFold.) Well, now we have an LLM-based method to make those discoveries, albeit only for a specific type of problem.

    Keeping its limitations in mind, the algorithmic improvements to matrix multiplication alone could yield huge savings in energy and cooling, and reduce environmental damage, in the coming years.

  • DeepSeek Security Review: “Not overtly malicious” but still concerning

    I think by now everyone in the tech industry already knows about DeepSeek: it’s the new mold-breaking, disruptive Large Language Model (LLM) from the Chinese company of the same name. It achieves good performance, and the company claims to have trained it for a tiny fraction of the cost of the top LLMs. Certainly, it’s svelte enough that a version of it can run on an Android device.

    There have been security concerns from the start, and a few countries have banned or restricted its use, including Italy, Australia, and the United States Navy.

    SecurityScorecard’s STRIKE team has performed in-depth analysis of DeepSeek, and their results are mixed. Their key findings:

    • The DeepSeek Android app has security vulnerabilities, such as weak encryption, SQL injection risks, and hardcoded keys.
    • It has a broad data collection scope, including user inputs, device data, and keystroke patterns, stored in China.
    • There are concerns about data transmission to Chinese state-owned entities and ByteDance.
    • The app employs anti-debugging mechanisms.
    • DeepSeek has faced regulatory scrutiny and bans in multiple countries.
    • Code analysis reveals integration with ByteDance’s services.
    • The app requests permissions for internet access, phone state, and location.
    • Third-party domains that the app connects to, like Ktor, have failing security scores, which raises business risks related to data security.
    • Despite security weaknesses and privacy concerns, no overtly malicious behavior was detected.

    I think a lot of these are unsurprising: DeepSeek was up front about their data being stored within the People’s Republic of China. The requests for permissions that the app doesn’t really need are almost standard these days, and if Google did it (they do), we wouldn’t think twice.

    Of concern to me are their poor security practices in general, combined with the collection of potentially quite private data. As STRIKE points out, it’s weird to use anti-debugging mechanisms, especially for a company claiming to be transparent.

    I don’t think this analysis is going to change anyone’s opinion of DeepSeek: it was widely criticized as a security risk before, just on the basis of sending information to China. Lax security within the app is probably not a big deal compared to that, but it does potentially mean that your data might be exposed to other entities as well.


    I promise: next time I’ll write about something other than SecurityScorecard. I came across this one while reading the previous report, and I wanted to see what they had to say.

  • LLMs are not going to take your job (yet)

    It seems like every site I go to with DevOps or Software Engineering news is inundated with — and I mean positively drowning in — articles about LLMs doing the job of engineers.

    I was away for the weekend, and this is a smattering of the headlines I came back to, just from the first couple of sites that I read regularly:

    To be fair, the tone has generally shifted away from the early Chicken Little sky-is-falling doomsday analysis; three of those four articles take a much more limited view of AI’s ability to code (and the fourth isn’t testing the possibility of replacing SWEs).

    There have been benchmarks done on the coding ability of AI, but the last article above — “AI Coding: New Research Shows…” — talks about a new academic paper proposing a more in-depth benchmark that perhaps better captures the work done by a SWE or SWE Manager.

    The study, “SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?”, describes their new benchmark. I’ll let the abstract speak for itself:

    We introduce SWE-Lancer, a benchmark of over 1,400 freelance software engineering tasks from Upwork, valued at $1 million USD total in real-world payouts. SWE-Lancer encompasses both independent engineering tasks — ranging from $50 bug fixes to $32,000 feature implementations — and managerial tasks, where models choose between technical implementation proposals. Independent tasks are graded with end-to-end tests triple-verified by experienced software engineers, while managerial decisions are assessed against the choices of the original hired engineering managers. We evaluate model performance and find that frontier models are still unable to solve the majority of tasks… By mapping model performance to monetary value, we hope SWE-Lancer enables greater research into the economic impact of AI model development.

    Some interesting things from the article:

    • They let the real payouts act as a proxy for the difficulty of the task.
    • They use the real-world Expensify open-source repository, and the tasks sometimes require the context of different parts of the code base to solve.
    • They grade by end-to-end tests instead of unit tests. This is much less susceptible to manipulation and it provides a better insight into the actual efficacy of the solution within the product.
    • They used a variety of tasks categorized either as “IC SWE” for implementation tasks or “SWE Manager” for making a choice between different proposals.

    Results

    On the IC SWE tasks, the article compared OpenAI’s o1 model against Anthropic’s Claude 3.5 Sonnet. (They also included GPT-4o, which performed less well.)

    Model                 Tasks Passed    Money Earned
    GPT o1                16.5%           12.1%
    Claude 3.5 Sonnet     26.2%           24.5%

    Notice that the percentage of tasks passed is higher than the percentage of money earned. This tells me that, on average, the models were more successful at the easier (lower-paying) tasks, as one might expect.
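
    A toy calculation (made-up numbers, not data from the paper) shows how the two percentages diverge when a model only clears the cheap tasks:

    ```python
    # Hypothetical figures purely for illustration: the model passes three
    # $50 bug fixes but fails one $1,000 feature implementation.
    payouts = [50, 50, 50, 1000]
    passed = [True, True, True, False]

    task_pass_rate = sum(passed) / len(passed)
    money_earned = sum(p for p, ok in zip(payouts, passed) if ok) / sum(payouts)
    print(f"tasks passed: {task_pass_rate:.0%}, money earned: {money_earned:.0%}")
    # -> tasks passed: 75%, money earned: 13%
    ```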

    The rates of success were much higher on the SWE Manager tasks:

    Model                 Tasks Passed    Money Earned
    GPT o1                41.5%           51.8%
    Claude 3.5 Sonnet     44.9%           56.8%

    Interestingly, here the money earned exceeds the tasks passed, so the models seem to have done relatively well at the harder-than-average (higher-paying) tasks.

    I’d also like to point out that since the SWE Manager tasks involved choosing between 4-5 proposed solutions, random guessing would have passed 20-25% of tasks. I don’t say this to minimize the results; only to say that the performance numbers on the two task types aren’t directly comparable.


    So what does this mean? Am I out of a job yet? Probably not, but it’s very useful to have better benchmarks.

    I’d really like to see a proper analysis of how well the AIs do based on the difficulty of the task: are they competent at junior-type problems yet?

    And for comparison, I’d like to see how actual human engineers do on these benchmarks.

  • North Korean Malware Wins at Hide and Seek


    SecurityScorecard has released a report describing how they uncovered evidence of an attack by North Korea’s Lazarus Group against developers. The attack uses sophisticated anti-detection techniques to deliver its new implant Marstech1, designed to steal cryptocurrency wallets.

    Marstech1, a JavaScript implant, is being served by Lazarus’s Command & Control (C2) server, and a similar implant was also added to several open source GitHub repositories.

    This malware targets the directories used by Exodus and Atomic Crypto wallets. It can copy the data, package it, and send it to the C2 server.

    What makes Marstech1 unique, though, is the extent to which its authors have gone to obfuscate the code to avoid detection. From the report:

    The Marstech implants utilize different obfuscation techniques than previously seen. The JS implant that was observed utilizes;

    • Control flow flattening & self-invoking functions
    • Random variable and function names
    • Base64 string encoding
    • Anti-debugging (anti-tamporing [sic] checks)
    • Splitting and recombining strings

    This ensures that if the threat actor embedded the JS into a software project it would go unnoticed.
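
    To give a completely benign flavour of two of those techniques (Base64 encoding, plus splitting and recombining strings), here is a toy illustration of my own; the real implant is JavaScript and layers many more tricks on top.

    ```python
    import base64

    # Toy, harmless example: the "payload" is just a greeting, but a scanner
    # grepping the source for the plain string would never find it.
    parts = ["aGVsbG8s", "IHdvcmxk"]            # fragments of a Base64 string
    decoded = base64.b64decode("".join(parts))  # recombine, then decode
    print(decoded.decode())                     # -> hello, world
    ```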

    There’s a full explanation in the report, so if you’re interested I highly recommend it. Suffice it to say that security researchers have their work cut out for them right now.

  • Can AI Generate Functional Terraform?

    Nope.

    The end. Thanks for reading my post. Reading time: 1 minute.


    I posted the other day about this topic, and I am intrigued by the possibilities. I’m certainly interested in the ways that you can use it for infrastructure, and the article in that post offers a somewhat different use case for AI-generated Terraform: cloud migrations and multi-cloud solutions. But I’d be lying if I said I wasn’t very skeptical of the code that it writes.

    With all that on my mind, I appreciate the analysis in this article: “Can AI Generate Functional Terraform?” by Rak Siva.

    I’d mainly add that GenAI is currently about as useful as a very junior developer, and that’s probably because they’re both doing the same thing: Googling for answers and copy-pasting them without really understanding.

    Then again, if you’ll indulge a quickly-emerged cliché: none of us saw any of this coming just five years ago.

  • AWS Launches Trust Center

    Compliance just got a tiny bit easier in AWS-land. AWS announced that they’re launching their new AWS Trust Center, an all-in-one hub for AWS’s security-related documentation.

    I certainly haven’t read through the whole site, but just eyeballing what they’ve got:

    • Security
    • Compliance
    • Data protection and privacy
    • Operational visibility
    • Report an incident
    • Agreement and terms

    I doubt they’ve even released any new documentation, but it’s a nice step forward to put all this stuff in one place.

  • Easier Cloud-to-Cloud Migrations?


    An Empty (Theoretical) Promise

    It’s long been a promise of Infrastructure as Code tools like Terraform that you could theoretically create platform-independent IaC and deploy freely into any cloud environment. I doubt anyone ever really meant that literally, but the reality is that your cloud infrastructure is inevitably going to be tied quite closely to your provider. If you’re using an aws_vpc resource, it’s pretty unlikely that you could easily turn that into its equivalent in another provider.

    And yet, several of the organizations I’ve worked with have been reluctant to tie themselves closely to one cloud provider or another. The business reality is that vendor lock-in is a huge source of anxiety: if AWS suddenly and drastically raised their prices, or if they for some reason became unavailable, lots and lots of businesses would be in a big pickle!

    The amount of work required to manually transfer an existing system from one provider to another would be nearly as much as creating the system in the first place.

    GenAI as the Solution?

    I ran across this article about StackGen’s Cloud Migration product. The article isn’t long, so go read it.

    Instead of requiring DevOps teams to map out resources manually, the system uses read-only API access to scan existing cloud environments. It automatically identifies resources, maps dependencies, and – perhaps most importantly – maintains security policies during the transition.

    StackGen isn’t new to using generative AI for infrastructure problems, but they have an interesting approach here:

    1. Use read-only APIs to identify resources, including those not already in IaC (a rough sketch of this step follows the list).
    2. Use generative AI to map those resources, including security policies, compliance policies, and resource dependencies.
    3. Convert those mapped resources into deployment-ready IaC for the destination environment.
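
    As a rough idea of what the read-only scanning in step 1 could look like on AWS, here is a small sketch of my own using boto3’s standard describe calls. It is purely illustrative and says nothing about StackGen’s actual implementation; a real tool would cover far more services, plus IAM policies, tags, and dependencies.

    ```python
    import boto3

    def scan_network_resources(region: str = "us-east-1") -> dict:
        """Read-only inventory of a few core networking resources.

        Requires AWS credentials with describe/list permissions only.
        """
        ec2 = boto3.client("ec2", region_name=region)
        return {
            "vpcs": ec2.describe_vpcs()["Vpcs"],
            "subnets": ec2.describe_subnets()["Subnets"],
            "security_groups": ec2.describe_security_groups()["SecurityGroups"],
        }

    if __name__ == "__main__":
        inventory = scan_network_resources()
        print({name: len(items) for name, items in inventory.items()})
    ```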

    Using a process like this to migrate from provider to provider is interesting, but the one use case that really gets me thinking is the ability to deploy into a multi-cloud environment.

    I’ll be keeping my eyes on this one.

  • Cyberattack brings down Newspaper Publisher

    Lee Enterprises, one of the largest publishers of newspapers in the United States, has had outages caused by a cyberattack. There have been no details on the nature of the attack, but the St. Louis Post-Dispatch has been affected.

    Read more on TechCrunch.