Tag: LLM

  • Google DeepMind Announced an LLM-Based Problem-Solver

    Earlier this week, Google DeepMind announced its new research tool AlphaEvolve. Basically, it’s an LLM-driven tool that uses evolutionary algorithms to find solutions for certain math or software problems. It’s already come up with optimizations on a few important problems that could lead to efficiency gains within the AI space and perhaps beyond.

    Disclaimer: I haven’t had time to read the whole paper yet, but I’ve managed to read Google DeepMind’s blog post, watch this interview and read a few news articles.

    The main limitation of AlphaEvolve is that it can only work on problems where the solution can be evaluated by a machine. So, on the trivial end of things, this might be similar to a LeetCode problem such as “Reverse a linked list”. To solve this, AlphaEvolve would come up with a few solutions and evaluate them for both correctness and efficiency. Obviously, this is the sort of problem that computer science students should be able to solve in their sleep.
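
    For a concrete (if toy) picture of what “evaluated by a machine” means, here’s a minimal sketch of such an evaluator for the reverse-a-list case: it checks each candidate implementation against test cases and times it. Everything here is my own illustration (plain Python lists rather than a real linked list, and hand-written candidates rather than anything AlphaEvolve produced).

    import time

    # Toy stand-in for the automated evaluation step: score candidate
    # implementations of "reverse a list" for correctness and speed.
    # Both candidates below are illustrative placeholders.

    def candidate_slow(xs):
        out = []
        for x in xs:
            out.insert(0, x)        # O(n) insert at the front -> O(n^2) overall
        return out

    def candidate_fast(xs):
        return list(reversed(xs))   # linear time

    TEST_CASES = [([], []), ([1], [1]), ([1, 2, 3], [3, 2, 1])]

    def evaluate(candidate):
        """Return (correct, seconds) for a candidate implementation."""
        correct = all(candidate(list(inp)) == expected for inp, expected in TEST_CASES)
        start = time.perf_counter()
        candidate(list(range(20_000)))   # crude performance probe
        return correct, time.perf_counter() - start

    for fn in (candidate_slow, candidate_fast):
        ok, secs = evaluate(fn)
        print(f"{fn.__name__}: correct={ok}, time={secs:.4f}s")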

    Of course, what’s interesting is when you direct AlphaEvolve to work on more interesting problems.

    How does it work?

    Evolutionary algorithms can solve problems within a large solution-space by tweaking parameters and running many different trials. Selection criteria and evaluation methods can vary, but the general idea is to choose the best solutions from one generation, tweak them a bit, and run a new generation of trials.

    Where AlphaEvolve improves on this method of problem-solving is that it uses an LLM to direct progress rather than relying solely on random parameter tweaks. It also uses automatic code generation, so the candidates being tested are (or can be?) full code implementations.

    The novel thing here is that LLMs aren’t just generating code, they’re guiding the search across a massive algorithmic space. This leads to verifiably novel solutions, not just rediscovering old ones.
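
    A compressed sketch of that loop might look like the following. This is my own illustration under heavy assumptions: llm_propose_variant is a placeholder for a real LLM call, evaluate is a stub for the automated scorer, and none of it reflects AlphaEvolve’s actual internals.

    import random

    # Illustrative skeleton of an LLM-guided evolutionary loop, loosely in the
    # spirit of the description above. `llm_propose_variant` and `evaluate` are
    # placeholders I've invented, not AlphaEvolve internals.

    def llm_propose_variant(parent_program: str, feedback: str) -> str:
        """Stand-in for an LLM call that rewrites a candidate program."""
        return parent_program + f"  # tweaked ({feedback}, r={random.random():.2f})"

    def evaluate(program: str) -> float:
        """Stand-in for the automated scorer (tests, benchmarks, simulations)."""
        return random.random()

    def evolve(seed_program: str, generations: int = 10, population: int = 8):
        pool = [(evaluate(seed_program), seed_program)]
        for _ in range(generations):
            # Selection: keep the best-scoring candidates found so far.
            pool.sort(key=lambda pair: pair[0], reverse=True)
            parents = pool[:3]
            # Variation: ask the LLM for targeted rewrites instead of random tweaks.
            children = [
                llm_propose_variant(prog, feedback=f"score={score:.2f}")
                for score, prog in parents
                for _ in range(max(1, population // len(parents)))
            ]
            pool = parents + [(evaluate(child), child) for child in children]
        return max(pool, key=lambda pair: pair[0])

    best_score, best_program = evolve("def solve(x): return x")
    print(f"best score so far: {best_score:.3f}")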

    [Figure purloined from their blog post]

    What problems can it solve?

    AlphaEvolve can only work on problems that can be evaluated by machine. These evaluations can be mathematical correctness, performance metrics, or even physical simulations. The key is that there’s an automated, not human, way to judge success. By taking the human out of the loop, the system can run thousands or millions of trials until it finds a solution.

    Despite being limited to this specific type of question, there are a lot of problems in that space, including data center scheduling, hardware design, AI training and inference, and mathematics. In Google DeepMind’s blog post, they said:

    “To investigate AlphaEvolve’s breadth, we applied the system to over 50 open problems in mathematical analysis, geometry, combinatorics and number theory. The system’s flexibility enabled us to set up most experiments in a matter of hours. In roughly 75% of cases, it rediscovered state-of-the-art solutions, to the best of our knowledge.”

    One of the solutions that has been highly touted is its general-purpose, recursively applicable 48-step algorithm for multiplying 4×4 matrices, with major implications for machine learning and graphics. The algorithms are a bit beyond my understanding of linear algebra, but here’s what I’ve gathered:

    • If you do it the usual way, multiplying two 2×2 matrices takes 8 multiplications: each entry of the result is the sum of two products, taken from a row of one matrix and a column of the other.
    • There is an optimization (Strassen’s algorithm) that multiplies two 2×2 matrices with only 7 multiplications, and mathematicians have proven that 7 is the minimum for this problem.
    • Because the 7-step algorithm can be applied recursively, you can multiply two 4×4 matrices in 7^2 = 49 multiplications: treat each 4×4 matrix as a 2×2 arrangement of 2×2 blocks and apply the same trick at both levels.
    • AlphaEvolve’s solution uses one fewer multiplication than the 7^2 = 49-step approach, so on the same problem it should be around 2% more efficient.
    • AlphaEvolve’s solution can also be applied recursively, so multiplying larger matrices should also get more efficient. I’m not totally clear on how much it would speed things up for which size of matrix; a rough multiplication count is sketched after this list.
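
    To get a feel for how that compounds under recursion, here’s a rough back-of-the-envelope count. This is my own sketch, not anything from the paper: it counts only scalar multiplications for an idealized block-recursive scheme and ignores additions, constant factors, and other caveats (such as which number domains the new algorithm applies to), so treat the numbers as purely illustrative.

    # Rough count of scalar multiplications for n x n matrix multiplication when
    # an m x m base algorithm using k multiplications is applied block-recursively
    # (n a power of m). My own sketch; ignores additions and all practical details.

    def mults(n: int, base: int, k: int) -> int:
        if n <= base:
            return k
        # Treat each matrix as a (base x base) grid of (n/base)-sized blocks;
        # the base algorithm needs k block products, each a smaller matmul.
        return k * mults(n // base, base, k)

    for n in (4, 16, 64, 256):
        strassen_style = mults(n, 2, 7)    # Strassen's 7-multiplication 2x2 base
        new_base = mults(n, 4, 48)         # a 48-multiplication 4x4 base
        print(f"{n:>3}x{n:<3}  7-mult 2x2 base: {strassen_style:>10,}   "
              f"48-mult 4x4 base: {new_base:>10,}")

    At the 4×4 level this reproduces the 49 vs. 48 from the bullets above; under this very idealized counting, each extra level of recursion multiplies in another factor of 48/49, so the relative saving grows as the matrices get larger.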

    The reason this seemingly small optimization is so important is that machine learning does a ton of matrix multiplication, in both training and inference, so a small improvement here can make an enormous difference at scale.

    Similarly, one of the other problems that AlphaEvolve worked on was something (we don’t seem to know exactly what, and it’s probably proprietary) that provided Google with an optimization to its data centers that “recovers on average 0.7% of Google’s fleet-wide compute resources”. Given the immense scale of Google’s data centers, this would be a huge sum of money!

    Why does it matter?

    The major advance here isn’t just speed—it’s novelty. AlphaEvolve didn’t just find better implementations of known algorithms; in some cases, it created ones that are new to science, like the 48-step recursive matrix multiplication.

    One of the major criticisms of LLMs has been that, despite their vast reservoir of knowledge, they haven’t really synthesized that knowledge to come up with new discoveries. (To be clear, there have been a few such discoveries from other areas of AI, such as DeepMind’s AlphaFold.) Well, now we have an LLM-based method to make those discoveries, albeit only for a specific type of problem.

    Keeping its limitations in mind, the algorithmic improvements to matrix multiplication alone could generate huge savings in energy and cooling, and reduce environmental impact, in the coming years.

  • DeepSeek Security Review: “Not overtly malicious” but still concerning

    I think by now everyone in the tech industry already knows about DeepSeek: it’s the new mold-breaking, disruptive Large Language Model (LLM) from the Chinese company of the same name. It achieves good performance, and the company claims to have trained it for a tiny fraction of the cost of the top LLMs. Certainly, it’s svelte enough that a version of it can run on an Android device.

    There have been security concerns from the start, and a few countries have banned or restricted its use, including Italy, Australia, and the United States Navy.

    SecurityScorecard’s STRIKE team has performed an in-depth analysis of DeepSeek, and their results are mixed. Their key findings:

    • The DeepSeek Android app has security vulnerabilities, such as weak encryption, SQL injection risks, and hardcoded keys.
    • It has a broad data collection scope, including user inputs, device data, and keystroke patterns, stored in China.
    • There are concerns about data transmission to Chinese state-owned entities and ByteDance.
    • The app employs anti-debugging mechanisms.
    • DeepSeek has faced regulatory scrutiny and bans in multiple countries.
    • Code analysis reveals integration with ByteDance’s services.
    • The app requests permissions for internet access, phone state, and location.
    • Third-party domains that the app connects to, like Ktor, have failing security scores, which raises business risks related to data security.
    • Despite security weaknesses and privacy concerns, no overtly malicious behavior was detected.

    I think a lot of these are unsurprising: DeepSeek was up front about their data being stored within the People’s Republic of China. The requests for permissions that the app doesn’t really need are almost standard these days, and if Google did it (they do), we wouldn’t think twice.

    What concerns me is the combination of poor security practices in general with the collection of potentially quite private data. As STRIKE points out, it’s odd to use anti-debugging mechanisms, especially for a company claiming to be transparent.

    I don’t think this analysis is going to change anyone’s opinion of DeepSeek: it was widely criticized as a security risk before, just on the basis of sending information to China. Lax security within the app is probably not a big deal compared to that, but it does potentially mean that your data might be exposed to other entities as well.


    I promise: next time I’ll write about something other than SecurityScorecard. I came across this one while reading the previous report, and I wanted to see what they had to say.

  • LLMs are not going to take your job (yet)

    It seems like every site I go to with DevOps or Software Engineering news is inundated with — and I mean positively drowning in — articles about LLMs doing the job of engineers.

    I was away for the weekend, and this is a smattering of the headlines I came back to, just on the first couple sites that I frequently read from:

    To be fair, the tone has generally shifted away from the early Chicken Little sky-is-falling doomsday analysis; three of those four articles take a much more limited view of AI’s ability to code (and the fourth isn’t testing the possibility of replacing SWEs).

    There have been benchmarks done on the coding ability of AI, but the last article above — “AI Coding: New Research Shows…” — talks about a new academic paper proposing a more in-depth benchmark that perhaps better captures the work done by a SWE or SWE Manager.

    The study, “SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?”, describes their new benchmark. I’ll let the abstract speak for itself:

    “We introduce SWE-Lancer, a benchmark of over 1,400 freelance software engineering tasks from Upwork, valued at $1 million USD total in real-world payouts. SWE-Lancer encompasses both independent engineering tasks — ranging from $50 bug fixes to $32,000 feature implementations — and managerial tasks, where models choose between technical implementation proposals. Independent tasks are graded with end-to-end tests triple-verified by experienced software engineers, while managerial decisions are assessed against the choices of the original hired engineering managers. We evaluate model performance and find that frontier models are still unable to solve the majority of tasks … By mapping model performance to monetary value, we hope SWE-Lancer enables greater research into the economic impact of AI model development.”

    Some interesting things from the article:

    • They let the real payouts act as a proxy for the difficulty of each task (see the small sketch after this list for how that weighting plays out in the scoring).
    • They use the real-world Expensify open-source repository, and the tasks sometimes require the context of different parts of the code base to solve.
    • They grade by end-to-end tests instead of unit tests. This is much less susceptible to manipulation and gives better insight into how well the solution actually works within the product.
    • They used a variety of tasks categorized either as “IC SWE” for implementation tasks or “SWE Manager” for making a choice between different proposals.
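
    To make the two headline metrics concrete, here’s a tiny sketch with made-up numbers showing how “tasks passed” and “money earned” can diverge when payouts stand in for difficulty. Only the metric definitions follow the paper’s description; the tasks themselves are invented.

    # Toy illustration of the two SWE-Lancer headline metrics. The tasks and
    # payouts below are invented; only the metric definitions follow the
    # paper's description (pass rate vs. payout-weighted "money earned").

    tasks = [
        # (payout in USD, did the model's patch pass the end-to-end tests?)
        (50, True),        # small bug fix
        (250, True),
        (1_000, False),
        (8_000, False),
        (32_000, False),   # large feature implementation
    ]

    earned = sum(payout for payout, ok in tasks if ok)
    total = sum(payout for payout, _ in tasks)
    tasks_passed_pct = 100 * sum(ok for _, ok in tasks) / len(tasks)
    money_earned_pct = 100 * earned / total

    print(f"tasks passed: {tasks_passed_pct:.1f}%")   # 40.0%
    print(f"money earned: {money_earned_pct:.1f}%")   # ~0.7% -- the easy wins pay little

    That gap in the toy example has the same shape as the IC SWE results below, where the pass rate is higher than the share of money earned.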

    Results

    On the IC SWE tasks, the article compared OpenAI’s o1 model against Anthropic’s Claude 3.5 Sonnet. (They also included GPT-4o, which performed less well.)

    Model                Tasks Passed    Money Earned
    GPT o1               16.5%           12.1%
    Claude 3.5 Sonnet    26.2%           24.5%

    Notice that the percentage of tasks passed is higher than the share of money earned. This tells me that, on average, the models were more successful at the easier tasks, as one might expect.

    The rates of success were much higher on the SWE Manager tasks:

    Model                Tasks Passed    Money Earned
    GPT o1               41.5%           51.8%
    Claude 3.5 Sonnet    44.9%           56.8%

    Interestingly, here the pattern reverses: the share of money earned exceeds the pass rate, suggesting the models did relatively well on the higher-value, harder-than-average tasks.

    I’d also like to point out that since the SWE Manager tasks involved choosing between 4-5 proposed solutions, random guessing would have passed 20-25% of tasks. I don’t say this to minimize the results; only to note that the two sets of results aren’t directly comparable.


    So what does this mean? Am I out of a job yet? Probably not, but it’s very useful to have better benchmarks.

    I’d really like to see a proper analysis of how well the AIs do based on the difficulty of the task: are they competent at junior-type problems yet?

    And for comparison, I’d like to see how actual human engineers do on these benchmarks.