Tag: News

  • DeepSeek Security Review: “Not overtly malicious” but still concerning

    I think by now everyone in the tech industry already knows about DeepSeek: it’s the new mold-breaking, disruptive Large Language Model (LLM) from the Chinese company of the same name. It achieves good performance, and the company claims to have trained it for a tiny fraction of the cost of the top LLMs. It’s certainly svelte enough that a version of it can run on an Android device.

    There have been security concerns from the start, and a number of governments and agencies have banned or restricted its use, including Italy, Australia, and the United States Navy.

    SecurityScorecard’s STRIKE team has performed an in-depth analysis of DeepSeek, and their results are mixed. Their key findings:

    • The DeepSeek Android app has security vulnerabilities, such as weak encryption, SQL injection risks, and hardcoded keys (a brief illustration of these weakness classes follows this list).
    • It has a broad data collection scope, including user inputs, device data, and keystroke patterns, stored in China.
    • There are concerns about data transmission to Chinese state-owned entities and ByteDance.
    • The app employs anti-debugging mechanisms.
    • DeepSeek has faced regulatory scrutiny and bans in multiple countries.
    • Code analysis reveals integration with ByteDance’s services.
    • The app requests permissions for internet access, phone state, and location.
    • Third-party domains that the app connects to, like Ktor, have failing security scores, which raises business risks related to data security.
    • Despite security weaknesses and privacy concerns, no overtly malicious behavior was detected.
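
    To make that first finding more concrete, here is a minimal sketch of what those weakness classes typically look like in code. It’s written in Python purely for illustration; the actual app is an Android binary, and none of these identifiers or values come from the STRIKE report.

        # Illustration only: generic examples of the weakness classes named in the
        # findings (hardcoded keys, weak encryption, SQL injection), not code from
        # the DeepSeek app itself.
        import os
        import sqlite3
        from hashlib import sha256

        # Hardcoded key: anyone who unpacks the binary can read it.
        HARDCODED_KEY = b"0123456789abcdef"

        # "Weak encryption": a repeating-key XOR is trivially reversible and has no
        # integrity protection; a real app should use a vetted AEAD cipher with a
        # key kept in a platform keystore.
        def weak_encrypt(data: bytes, key: bytes = HARDCODED_KEY) -> bytes:
            return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

        # Safer key handling: derive/load the key from outside the source code.
        def load_key() -> bytes:
            return sha256(os.environ["APP_SECRET"].encode()).digest()

        # SQL injection risk: building queries by string concatenation.
        def find_user_unsafe(conn: sqlite3.Connection, name: str):
            # A crafted name like "x' OR '1'='1" changes the query's meaning.
            return conn.execute(
                "SELECT * FROM users WHERE name = '" + name + "'"
            ).fetchall()

        # Safer: parameterized queries keep user data out of the SQL text.
        def find_user_safe(conn: sqlite3.Connection, name: str):
            return conn.execute("SELECT * FROM users WHERE name = ?", (name,)).fetchall()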

    I think a lot of these are unsurprising: DeepSeek was up front about their data being stored within the People’s Republic of China. The requests for permissions that the app doesn’t really need are almost standard these days, and if Google did it (they do), we wouldn’t think twice.

    Of concern to me are their poor security practices in general, combined with the collection of potentially quite private data. As STRIKE points out, it’s weird to use anti-debugging mechanisms, especially for a company claiming to be transparent.

    I don’t think this analysis is going to change anyone’s opinion of DeepSeek: it was widely criticized as a security risk before, just on the basis of sending information to China. Lax security within the app is probably not a big deal compared to that, but it does potentially mean that your data might be exposed to other entities as well.


    I promise: next time I’ll write about something other than SecurityScorecard. I came across this one while reading the previous report, and I wanted to see what they had to say.

  • LLMs are not going to take your job (yet)

    It seems like every site I go to with DevOps or Software Engineering news is inundated with — and I mean positively drowning in — articles about LLMs doing the job of engineers.

    I was away for the weekend, and this is a smattering of the headlines I came back to, just on the first couple of sites that I read frequently:

    To be fair, the tone has generally shifted away from the early Chicken Little sky-is-falling doomsday analysis; three of those four articles take a much more limited view of AI’s ability to code (and the fourth isn’t testing the possibility of replacing SWEs).

    There have been benchmarks done on the coding ability of AI, but the last article above — “AI Coding: New Research Shows…” — talks about a new academic paper proposing a more in-depth benchmark that perhaps better captures the work done by a SWE or SWE Manager.

    The study, “SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?”, describes their new benchmark. I’ll let the abstract speak for itself:

    We introduce SWE-Lancer, a benchmark of over 1,400 freelance software engineering tasks from Upwork, valued at $1 million USD total in real-world payouts. SWE-Lancer encompasses both independent engineering tasks — ranging from $50 bug fixes to $32,000 feature implementations — and managerial tasks, where models choose between technical implementation proposals. Independent tasks are graded with end-to-end tests triple-verified by experienced software engineers, while managerial decisions are assessed against the choices of the original hired engineering managers. We evaluate model performance and find that frontier models are still unable to solve the majority of tasks… By mapping model performance to monetary value, we hope SWE-Lancer enables greater research into the economic impact of AI model development.

    Some interesting things from the article:

    • They let the real payouts act as a proxy for the difficulty of each task.
    • They use the real-world Expensify open-source repository, and the tasks sometimes require context from different parts of the code base to solve.
    • They grade by end-to-end tests instead of unit tests, which is much less susceptible to manipulation and gives better insight into the actual efficacy of the solution within the product.
    • They used a variety of tasks, categorized either as “IC SWE” for implementation tasks or “SWE Manager” for choosing between different proposals (see the sketch just after this list).
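
    To show how those design choices fit together, here’s a minimal, toy scoring harness in the style the paper describes. The structure and names are my own guesses based on the abstract (payout-weighted scoring, end-to-end pass/fail grading, separate IC SWE and SWE Manager tracks), not the authors’ actual code.

        # A toy SWE-Lancer-style scorer (my own sketch, not the paper's code):
        # tasks carry real dollar payouts, IC tasks count as passed only if the
        # end-to-end tests pass, and manager tasks count as passed only if the
        # model picks the proposal the original hired manager chose.
        from dataclasses import dataclass

        @dataclass
        class Task:
            kind: str      # "ic_swe" or "swe_manager"
            payout: float  # real-world dollar value, used as a difficulty proxy
            passed: bool   # end-to-end tests passed / correct proposal chosen

        def score(tasks: list[Task], kind: str) -> dict[str, float]:
            subset = [t for t in tasks if t.kind == kind]
            total_money = sum(t.payout for t in subset)
            money_earned = sum(t.payout for t in subset if t.passed)
            return {
                "tasks_passed_pct": 100 * sum(t.passed for t in subset) / len(subset),
                "money_earned_pct": 100 * money_earned / total_money,
            }

        example = [
            Task("ic_swe", 50, True),        # cheap bug fix: solved
            Task("ic_swe", 50, True),        # cheap bug fix: solved
            Task("ic_swe", 3200, False),     # expensive feature: failed
            Task("swe_manager", 1000, True), # picked the same proposal as the manager
        ]
        print(score(example, "ic_swe"))       # pass rate ~67%, money earned ~3%
        print(score(example, "swe_manager"))  # pass rate 100%, money earned 100%

    Weighting by dollar payout rather than simply counting tasks is what lets the two columns in the tables below diverge.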

    Results

    On the IC SWE tasks, the article compared OpenAI’s o1 model against Anthropic’s Claude 3.5 Sonnet. (They also included GPT-4o, which performed less well.)

    Model                Tasks Passed    Money Earned
    GPT o1               16.5%           12.1%
    Claude 3.5 Sonnet    26.2%           24.5%

    Notice that the percentage of tasks passed is higher than the percentage of money earned. This tells me that, on average, the models were more successful at the easier (lower-value) tasks, as one might expect.

    The rates of success were much higher on the SWE Manager tasks:

    Model                Tasks Passed    Money Earned
    GPT o1               41.5%           51.8%
    Claude 3.5 Sonnet    44.9%           56.8%

    Interestingly, here the models seem to have done well at harder-than-average tasks, since the money earned outpaces the pass rate.

    I’d also like to point out that since the SWE Manager tasks involved choosing between 4-5 proposed solutions, random guessing would have passed 20-25% of tasks. I don’t say this to minimize the results; only to say that there’s no meaningful comparison between the performance on the two data sets.
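
    For the arithmetic behind that baseline (my own back-of-the-envelope check, not a figure from the paper): picking uniformly at random among n proposals passes 1/n of tasks in expectation.

        # Expected pass rate for random guessing on an n-option manager task.
        for n_options in (4, 5):
            print(f"{n_options} proposals -> {100 / n_options:.0f}% expected pass rate")
        # 4 proposals -> 25% expected pass rate
        # 5 proposals -> 20% expected pass rate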


    So what does this mean? Am I out of a job yet? Probably not, but it’s very useful to have better benchmarks.

    I’d really like to see a proper analysis of how well the AIs do based on the difficulty of the task: are they competent at junior-type problems yet?

    And for comparison, I’d like to see how actual human engineers do on these benchmarks.

  • North Korean Malware Wins at Hide and Seek


    SecurityScorecard has released a report describing how they uncovered evidence of an attack by North Korea’s Lazarus Group against developers. The attack uses sophisticated anti-detection techniques to deliver its new implant Marstech1, designed to steal cryptocurrency wallets.

    Marstech1, a JavaScript implant, is being served by Lazarus’s Command & Control (C2) server, and a similar implant was also added to several open source GitHub repositories.

    This malware targets the directories used by the Exodus and Atomic cryptocurrency wallets. It can copy the data, package it, and send it to the C2 server.

    What makes Marstech1 unique, though, is the extent to which its authors have gone to obfuscate the code to avoid detection. From the report:

    The Marstech implants utilize different obfuscation techniques than previously seen. The JS implant that was observed utilizes;

    • Control flow flattening & self-invoking functions
    • Random variable and function names
    • Base64 string encoding
    • Anti-debugging (anti-tamporing [sic] checks)
    • Splitting and recombining strings

    This ensures that if the threat actor embedded the JS into a software project it would go unnoticed.
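
    To make that list a bit more concrete, here’s a small sketch of what a few of those techniques look like. Marstech1 itself is a JavaScript implant; this is written in Python purely for readability, and none of the strings or names below come from the actual malware.

        # Illustration only: toy versions of the obfuscation techniques quoted
        # above, not Marstech1's actual code (the real implant is JavaScript).
        import base64

        # Base64 string encoding: the real string never appears in the source.
        _b = "aHR0cDovL2V4YW1wbGUuY29tL2NvbGxlY3Q="   # hypothetical URL, encoded
        c2_url = base64.b64decode(_b).decode()         # "http://example.com/collect"

        # Splitting and recombining strings: defeats naive grep/signature matching.
        target_dir = "".join(["wal", "let", "_", "da", "ta"])   # "wallet_data"

        # Random identifiers plus a self-invoking function (an IIFE in the JS
        # original): logic runs immediately behind meaningless names.
        (lambda _x9f2: print("target:", _x9f2))(target_dir)

        # Control-flow flattening: straight-line logic rewritten as a state-machine
        # dispatcher, which hides the real order of operations from a reader.
        _state = 0
        while _state != -1:
            if _state == 0:
                payload = f"{target_dir} -> {c2_url}"
                _state = 7
            elif _state == 7:
                print("would exfiltrate:", payload)
                _state = -1

    Individually each trick is simple; it’s the combination, applied consistently, that makes an embedded implant hard to spot inside a large JavaScript project.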

    There’s a full explanation in the report, so if you’re interested I highly recommend it. Suffice it to say that security researchers have their work cut out for them right now.

  • Can AI Generate Functional Terraform?

    Nope.

    The end. Thanks for reading my post. Reading time: 1 minute.


    I posted the other day about this topic, and I am intrigued by the possibilities. I’m certainly interested in the ways that you can use it for infrastructure, and the article in that post offers a somewhat different use case for AI-generated Terraform: cloud migrations and multi-cloud solutions. But I’d be lying if I said I wasn’t very skeptical of the code that it writes.

    With all that on my mind, I appreciate the analysis in this article: “Can AI Generate Functional Terraform?” by Rak Siva.

    I’d mainly add that GenAI is currently about as useful as a very junior developer, and that’s probably because they’re both doing the same thing: Googling for answers and copy-pasting them without really understanding them.

    Then again, if you’ll indulge a cliché that emerged awfully quickly: none of us saw any of this coming just five years ago.

  • AWS Launches Trust Center

    Compliance just got a tiny bit easier in AWS-land. AWS announced that they’re launching their new AWS Trust Center, an all-in-one hub for AWS’s security-related documentation.

    I certainly haven’t read through the whole site, but just eyeballing what they’ve got:

    • Security
    • Compliance
    • Data protection and privacy
    • Operational visibility
    • Report an incident
    • Agreement and terms

    I doubt they’ve even released any new documentation, but it’s a nice step forward to put all this stuff in one place.

  • Cyberattack brings down Newspaper Publisher

    Lee Enterprises, one of the largest newspaper publishers in the United States, has suffered outages caused by a cyberattack. No details have been released on the nature of the attack, but the St. Louis Post-Dispatch has been affected.

    Read more on TechCrunch.