It seems like every site I go to with DevOps or Software Engineering news is inundated with — and I mean positively drowning in — articles about LLMs doing the job of engineers.
I was away for the weekend, and this is a smattering of the headlines I came back to, just from the first couple of sites I regularly read:
- AI in Software Development: Productivity at the Cost of Code Quality?
- Gemini Code Assist Goes Free: What DevOps Teams Need to Know About Google’s AI Coding Tool
- AI in Dev Tools: Accelerating but Learning the Limits
- AI Coding: New Research Shows Even the Best Models Struggle With Real-World Software Engineering
To be fair, the tone has generally shifted away from the early Chicken Little sky-is-falling doomsday analysis; three of those four articles take a much more limited view of AI’s ability to code (and the fourth isn’t testing the possibility of replacing SWEs).
There have been benchmarks done on the coding ability of AI, but the last article above — “AI Coding: New Research Shows…” — talks about a new academic paper proposing a more in-depth benchmark that perhaps better captures the work done by a SWE or SWE Manager.
The study, “SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?”, describes the new benchmark. I’ll let the abstract speak for itself:
> We introduce SWE-Lancer, a benchmark of over 1,400 freelance software engineering tasks from Upwork, valued at $1 million USD total in real-world payouts. SWE-Lancer encompasses both independent engineering tasks — ranging from $50 bug fixes to $32,000 feature implementations — and managerial tasks, where models choose between technical implementation proposals. Independent tasks are graded with end-to-end tests triple-verified by experienced software engineers, while managerial decisions are assessed against the choices of the original hired engineering managers. We evaluate model performance and find that frontier models are still unable to solve the majority of tasks… By mapping model performance to monetary value, we hope SWE-Lancer enables greater research into the economic impact of AI model development.
Some interesting things from the article:
- They let the real payouts act as a proxy for the difficulty of the task (see the scoring sketch after this list).
- They use the real-world Expensify open-source repository, and the tasks sometimes require context from different parts of the code base to solve.
- They grade by end-to-end tests instead of unit tests. This is much less susceptible to manipulation, and it gives better insight into how well the solution actually works within the product.
- They used a variety of tasks, categorized as either “IC SWE” for implementation tasks or “SWE Manager” for choosing between competing proposals.
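To make the two headline metrics concrete, here is a minimal sketch of how a payout-weighted score could be computed. The `TaskResult` structure and `score` function are my own illustration, not code from the paper; only the idea (fraction of tasks passed versus fraction of total payout earned) comes from the benchmark.

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    payout_usd: float  # the real Upwork payout attached to the task
    passed: bool       # did the model's patch pass the end-to-end tests?

def score(results: list[TaskResult]) -> tuple[float, float]:
    """Return (fraction of tasks passed, fraction of total payout earned)."""
    tasks_passed = sum(r.passed for r in results) / len(results)
    money_earned = (
        sum(r.payout_usd for r in results if r.passed)
        / sum(r.payout_usd for r in results)
    )
    return tasks_passed, money_earned

# Toy example: passing only the cheap task yields a much higher task rate
# than payout rate, which mirrors the IC SWE results below.
print(score([TaskResult(50, True), TaskResult(32_000, False)]))
# -> (0.5, 0.00156...)
```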
Results
On the IC SWE tasks, the article compared OpenAI’s o1 model against Anthropic’s Claude 3.5 Sonnet. (They also included GPT-4o, which performed worse.)
| Model | Tasks Passed | Money Earned |
|---|---|---|
| o1 | 16.5% | 12.1% |
| Claude 3.5 Sonnet | 26.2% | 24.5% |
Notice that the percentage of tasks passed is higher than the percentage of money earned. This tells me that, on average, the models were more successful at the easier (lower-payout) tasks, as one might expect.
The rates of success were much higher on the SWE Manager tasks:
| Model | Tasks Passed | Money Earned |
|---|---|---|
| o1 | 41.5% | 51.8% |
| Claude 3.5 Sonnet | 44.9% | 56.8% |
Interestingly, the relationship flips here: the share of money earned exceeds the share of tasks passed, which suggests the models did relatively well on the higher-value (presumably harder) tasks.
I’d also like to point out that since the SWE Manager tasks involved choosing between 4-5 proposed solutions, random guessing would have passed 20-25% of tasks. I don’t say this to minimize the results, only to note that the numbers from the two task types aren’t directly comparable.
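For the record, that 20-25% figure is just the expected pass rate of a uniform random guess over the listed proposals; a two-line check:

```python
# Expected pass rate of a uniform random guess among k proposals.
for k in (4, 5):
    print(f"{k} proposals -> {1 / k:.0%}")
# 4 proposals -> 25%
# 5 proposals -> 20%
```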
So what does this mean? Am I out of a job yet? Probably not, but it’s very useful to have better benchmarks.
I’d really like to see a proper analysis of how well the AIs do based on the difficulty of the task: are they competent at junior-type problems yet?
And for comparison, I’d like to see how actual human engineers do on these benchmarks.
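On the first point, if per-task results (payout plus pass/fail) were published, that difficulty breakdown would be straightforward to compute. Here is a hypothetical sketch; the numbers and tier cutoffs are made up purely for illustration and are not from the paper:

```python
import pandas as pd

# Hypothetical per-task results: payout and whether the model's patch passed.
# These values are invented just to show the shape of the analysis.
df = pd.DataFrame({
    "payout_usd": [50, 250, 1_000, 4_000, 32_000],
    "passed":     [True, True, False, False, False],
})

# Bucket tasks by payout (the benchmark's proxy for difficulty) and report
# the pass rate within each bucket.
df["tier"] = pd.cut(
    df["payout_usd"],
    bins=[0, 500, 5_000, float("inf")],
    labels=["junior-ish", "mid-level", "senior-ish"],
)
print(df.groupby("tier", observed=True)["passed"].mean())
```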


