Today's Digest

Simon Willison: I recently talked with Joseph Ruscio about AI coding tools for Heavybit's High Leverage podcast: Ep. #9, The AI Coding Paradigm Sh…

Simon Willison: Behind the Scenes Hardening Firefox with Claude Mythos Preview Fascinating, in-depth details on how Mozilla used their access to t…

Simon Willison: There weren't a lot of big new announcements from Anthropic at yesterday's Code w/ Claude event, but the biggest by far was the de…

Simon Willison: Using Claude Code: The Unreasonable Effectiveness of HTML Thought-provoking piece by Thariq Shihipar (on the Claude Code team at A…

Simon Willison: Release: llm-gemini 0.31 gemini-3.1-flash-lite is no longer a preview Here's my write-up of the Gemini 3.1 Flash-Lite Preview mode…

Summary + take: Tool: GitHub Repo Stats One of the things I alwa…| Take: What matters about GitHub Repo Stats isn't novelty but whether it improves engineering efficiency, deployment stability…

Summary + take: Tool: Big Words I'm using my vibe coded macOS pr…| Take: What matters about Big Words isn't novelty but whether it improves engineering efficiency, deployment stability, or developer workflows.

Summary + take: WebRTC is designed to degrade and drop my prompt…| Take: Around Quoting Luke Curley, what really matters is whether it affects teams' model choices, performance…

Summary + take: Release: datasette-referrer-policy 0.1 The OpenS…| Take: What matters about datasette-referrer-policy 0.1 isn't novelty but whether it can…

Vibe coding and agentic engineering are getting closer than I'd like

Source: Simon Willison

Tags: #ai_engineering_blogs #trend-signal

Author:

Original: I recently talked with Joseph Ruscio about AI coding tools for Heavybit's High Leverage podcast: Ep. #9, The AI Coding Paradigm Shift with Simon Willison. Here are some of my highlights, including my disturbing realization that vibe coding and agentic engineering have started to converge in my own work.

One thing I really enjoy about podcasts is that they sometimes push me to think out loud in a way that exposes an idea I've not previously been able to put into words.

Vibe coding and agentic engineering are starting to overlap

A few weeks after vibe coding was first coined I published Not all AI-assisted programming is vibe coding (but vibe coding rocks), where I firmly staked out my belief that "vibe coding" is a very different beast from responsible use of AI to write code, which I've since started to call agentic engineering. When Joseph brought up the distinction between the two I had a sudden realization that they're not nearly as distinct for me as they used to be:

Weirdly though, those things have started to blur for me already, which is quite upsetting. I thought we had a very clear delineation where vibe coding is the thing where you're not looking at the code at all. You might not even know how to program. You might be a non-programmer who asks for a thing, and gets a thing, and if the thing works, then great! And if it doesn't, you tell it that it doesn't work and cross your fingers. But at no point are you really caring about the code quality or any of those additional constraints.

And my take on vibe coding was that it's fantastic, provided you understand when it can be used and when it can't. A personal tool for you, where if there's a bug it hurts only you, go ahead! If you're building software for other people, vibe coding is grossly irresponsible because it's other people's information. Other people get hurt by your stupid bugs. You need to have a higher level than that.

This contrasts with agentic engineering where you are a professional software engineer. You understand security and maintainability and operations and performance and so forth. You're using these tools to the highest of your own ability. I'm finding the scope of challenges I can take on has gone up by a significant amount because I've got the support of these tools. But I'm still leaning on my 25 years of experience as a software engineer. The goal is to build high quality production systems: if you're building lower quality stuff faster, I think that's bad. I want to build higher quality stuff faster. I want everything I'm building to be better in every way than it was before.

The problem is that as the coding agents get more reliable, I'm not reviewing every line of code that they write anymore, even for my production level stuff. I know full well that if you ask Claude Code to build a JSON API endpoint that runs a SQL query and outputs the results as JSON, it's just going to do it right. It's not going to mess that up. You have it add automated tests, you have it add documentation, you know it's going to be good. But I'm not reviewing that code. And now I've got that feeling of guilt: if I haven't reviewed the code, is it really responsible for me to use this in production?

The thing that really helps me is thinking back to when I've worked at larger organizations where I've been an engineering manager. Other teams are building software that my team depends on. If another team hands over something and says, "hey, this is the image resize service, here's how to use it to resize your images"... I'm not going to go and read every line of code that they wrote. I'm going to look at their documentation and I'm going to use it to resize some images. And then I'm going to start shipping my own features. And if I start running into problems where the image resizer thing appears to have bugs or the performance isn't good, that's when I might dig into their Git repositories and see what's going on. But for the most part I treat that as a semi-black box that I don't look at until I need to.

I'm starting to treat the agents in the same way. And it still feels uncomfortable, because human beings are accountable for what they do. A team can build a reputation. I can say "I trust that team over there. They built good software in the past. They're not going to build something rubbish because that affects their professional reputations." Claude Code does not have a professional reputation! It can't take accountability for what it's done. But it's been proving itself anyway - time and time again it's churning out straightforward things and doing them right in the style that I like.

There's an element of the normalization of deviance here - every time a model turns out to have written the right code without me monitoring it closely there's a risk that I'll trust it at the wrong moment in the future and get burned.

The new challenge of evaluating software

It used to be if you found a GitHub repository with a hundred commits and a good readme and automated tests and stuff, you could be pretty sure that the person writing that had put a lot of care and attention into that project. And now I can knock out a git repository with a hundred commits and a beautiful readme and comprehensive tests of every line of code in half an hour! It looks identical to those projects that have had a great deal of care and attention. Maybe it is as good as them. I don't know. I can't tell from looking at it. Even for my own projects, I can't tell.

So I realized what I value more than the quality of the tests and documentation is that I want somebody to have used the thing. If you've got a vibe coded thing which you have used every day for the past two weeks, that's much more valuable to me than something that you've just spat out and hardly even exercised.

The bottlenecks have shifted

If you can go from producing 200 lines of code a day to 2,000 lines of code a day, what else breaks? The entire software development lifecycle was, it turns out, designed around the idea that it takes a day to produce a few hundred lines of code. And now it doesn't.

It's not just the downstream stuff, it's the upstream stuff as well. I saw a great talk by Jenny Wen who's the design leader at Anthropic, where she said we have all of these design processes that are based around the idea that you need to get the design right - because if you hand it off to the engineers and they spend three months building the wrong thing, that's catastrophic. There's this whole very extensive design process that you put in place because that design results in expensive work. But if it doesn't take three months to build, maybe the design process can be a whole lot riskier because the cost, if you get something wrong, has been reduced so much.

Why I'm still not afraid for my career

When I look at my conversations with the agents, it's very clear to me that this is moon language for the vast majority of human beings. There are a whole bunch of reasons I'm not scared that my career as a software engineer is over now that computers can write their own code, partly because these things are amplifiers of existing experience. If you know what you're doing, you can run so much faster with them.

I'm constantly reminded as I work with these tools how hard the thing that we do is. Producing software is a ferociously difficult thing to do. And you could give me all of the AI tools in the world and what we're trying to achieve here is still really difficult.

Matthew Yglesias, who's a political commentator, yesterday tweeted "Five months in, I think I've decided that I don't want to vibecode I want professionally managed software companies to use AI coding assistance to make more/better/cheaper software products that they sell to me for money." And that feels about right to me. I can plumb my house if I watch enough YouTube videos on plumbing. I would rather hire a plumber.

On the threat to SaaS providers of companies rolling their own solutions instead: I just realized it's the thing I said earlier about how I only want to use your side project if you've used it for a few weeks. The enterprise version of that is I don't want a CRM unless at least two other giant enterprises have successfully used that CRM for six months. You want solutions that are proven to work before you take a risk on them.

Tags: ai generative-ai llms podcast-appearances vibe-coding coding-agents agentic-engineering

Link: https://simonwillison.net/2026/May/6/vibe-coding-and-agentic-engineering/#atom-everything

Take: Judging from "Vibe coding and agentic engineering are getting closer than...", the thing to watch next is whether security incidents change enterprise procurement, integration, and pre-launch compliance gates.

Behind the Scenes Hardening Firefox with Claude Mythos Preview

Source: Simon Willison

Tags: #ai_engineering_blogs #trend-signal

Author:

Original: Behind the Scenes Hardening Firefox with Claude Mythos Preview. Fascinating, in-depth details on how Mozilla used their access to the Claude Mythos preview to locate and then fix hundreds of vulnerabilities in Firefox:

Suddenly, the bugs are very good

Just a few months ago, AI-generated security bug reports to open source projects were mostly known for being unwanted slop. Dealing with reports that look plausibly correct but are wrong imposes an asymmetric cost on project maintainers: it’s cheap and easy to prompt an LLM to find a “problem” in code, but slow and expensive to respond to it.

It is difficult to overstate how much this dynamic changed for us over a few short months. This was due to a combination of two main factors. First, the models got a lot more capable. Second, we dramatically improved our techniques for harnessing these models: steering them, scaling them, and stacking them to generate large amounts of signal and filter out the noise.

They include some detailed bug descriptions too, including a 20-year-old XSLT bug and a 15-year-old bug in the element. A lot of the attempts made by the harness were blocked by Firefox's existing defense-in-depth measures, which is reassuring. Mozilla were fixing around 20-30 security bugs in Firefox per month through 2025. That jumped to 423 in April.

Via Lobste.rs

Tags: firefox mozilla security ai generative-ai llms anthropic claude ai-security-research

Link: https://simonwillison.net/2026/May/7/firefox-claude-mythos/#atom-everything

Take: Judging from "Behind the Scenes Hardening Firefox with Claude Mythos Previ...", the thing to watch next is whether security incidents change enterprise procurement, integration, and pre-launch compliance gates.

Notes on the xAI/Anthropic data center deal

Source: Simon Willison

Tags: #ai_engineering_blogs #engineering-value

Author:

Original: There weren't a lot of big new announcements from Anthropic at yesterday's Code w/ Claude event, but the biggest by far was the deal they've struck with SpaceX/xAI to use "all of the capacity of their Colossus data center". As I mentioned in my live blog of the keynote, that's the one with the particularly bad environmental record: the gas turbines installed to power the facility initially ran without Clean Air Act permits or pollution control devices, which they got away with by classifying them as "temporary". Credible reports link it to increases in hospital admissions relating to low air quality.

Andy Masley, one of the most prolific voices pushing back against misleading rhetoric about data centers (see The AI water issue is fake and Data center land issues are fake), had this to say about Colossus: I would simply not run my computing out of this specific data center.

I get that Anthropic are severely compute-constrained, but in a world where the very existence of "AI data centers" is a red-hot political issue (see recent news out of Utah for a fresh example), signing up with this particular data center is a really bad look.

There was a lot of initial chatter about how this meant xAI were clearly giving up on their own Grok models, since all of their capacity would be sold to Anthropic instead. That was a misconception - Anthropic are getting Colossus 1, but xAI are keeping their larger Colossus 2 data center for their own work.

As an interesting side note, the night before the Anthropic announcement, xAI sent out a deprecation notice for Grok 4.1 Fast and several other models, providing just two weeks' notice before shutdown, reported here by @xlr8harder from SpeechMap: This is terrible @xai. I just spent time and money to migrate to grok 4.1 fast, and you're disabling it with less than two weeks notice, after releasing it in November, with no migration path to a fast/cheap alternative. I will never depend on one of your products again.

Here's SpeechMap's detailed explanation of how they selected Grok 4.1 Fast for their project in March. Were xAI serving those models out of Colossus 1?

xAI owner Elon Musk (who previously delighted in calling Anthropic "Misanthropic") tweeted the following: By way of background for those who care, I spent a lot of time last week with senior members of the Anthropic team to understand what they do to ensure Claude is good for humanity and was impressed. After that, I was ok leasing Colossus 1 to Anthropic, as SpaceXAI had already moved training to Colossus 2.

And then shortly afterwards: Just as SpaceX launches hundreds of satellites for competitors with fair terms and pricing, we will provide compute to AI companies that are taking the right steps to ensure it is good for humanity. We reserve the right to reclaim the compute if their AI engages in actions that harm humanity.

Presumably the criteria for "harm humanity" are decided by Elon himself. Sounds like a new form of supply chain risk for Anthropic to me!

Tags: ai llms anthropic ai-ethics ai-energy-usage xai andy-masley

Link: https://simonwillison.net/2026/May/7/xai-anthropic/#atom-everything

Take: What matters about Notes on the xAI/Anthropic data center deal isn't novelty but whether it improves engineering efficiency, deployment stability, or developer workflows.

Using Claude Code: The Unreasonable Effectiveness of HTML

Source: Simon Willison

Tags: #ai_engineering_blogs #trend-signal

Author:

Original: Using Claude Code: The Unreasonable Effectiveness of HTML. Thought-provoking piece by Thariq Shihipar (on the Claude Code team at Anthropic) advocating for HTML over Markdown as an output format to request from Claude. The article is crammed with interesting examples (collected on this site) and prompt suggestions like this one:

Help me review this PR by creating an HTML artifact that describes it. I'm not very familiar with the streaming/backpressure logic so focus on that. Render the actual diff with inline margin annotations, color-code findings by severity and whatever else might be needed to convey the concept well.

I've been defaulting to asking for most things in Markdown since the GPT-4 days, when the 8,192 token limit meant that Markdown's token-efficiency over HTML was extremely worthwhile. Thariq's piece here has caused me to reconsider that, especially for output. Asking Claude for an explanation in HTML means it can drop in SVG diagrams, interactive widgets, in-page navigation and all sorts of other neat ways of making the information more pleasant to navigate.

I wrote about Useful patterns for building HTML tools last December, but that was focused very much on interactive utilities like the ones on my tools.simonwillison.net site. I'm excited to start experimenting more with rich HTML explanations in response to ad-hoc prompts.

Trying this out on copy.fail

copy.fail describes a recently discovered Linux security exploit, including a proof of concept distributed as obfuscated Python. I tried having GPT-5.5 create an HTML explanation of the exploit like this:

curl https://copy.fail/exp | llm -m gpt-5.5 -s 'Explain this code in detail. Reformat it, expand out any confusing bits and go deep into what it does and how it works. Output HTML, neatly styled and using capabilities of HTML and CSS and JavaScript to make the explanation rich and interactive and as clear as possible'

Here's the resulting HTML page. It's pretty good, though I should have emphasized explaining the exploit over the Python harness around it.

Tags: html security markdown ai prompt-engineering generative-ai llms llm claude-code
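A minimal Python sketch of the same pipeline using the llm library's Python API rather than the CLI. The gpt-5.5 model ID and the copy.fail URL come from the post above; the output filename and the exact prompt wording are just illustrative:

```python
import urllib.request

import llm  # Simon Willison's llm library: https://llm.datasette.io/

# Fetch the obfuscated proof-of-concept code referenced in the post.
exploit_source = urllib.request.urlopen("https://copy.fail/exp").read().decode("utf-8")

# Ask the model for a rich HTML explanation instead of Markdown.
# The model ID comes from the post; swap in whatever model you have configured.
model = llm.get_model("gpt-5.5")
response = model.prompt(
    exploit_source,
    system=(
        "Explain this code in detail. Reformat it, expand out any confusing "
        "bits and go deep into what it does and how it works. Output HTML, "
        "neatly styled, using HTML, CSS and JavaScript to make the explanation "
        "rich, interactive and as clear as possible."
    ),
)

# Save the HTML so it can be opened directly in a browser.
with open("explanation.html", "w", encoding="utf-8") as f:
    f.write(response.text())
```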

Link: https://simonwillison.net/2026/May/8/unreasonable-effectiveness-of-html/#atom-everything

Take: Judging from "Using Claude Code: The Unreasonable Effectiveness of HTML", the thing to watch next is whether security incidents change enterprise procurement, integration, and pre-launch compliance gates.

llm-gemini 0.31

Source: Simon Willison

Tags: #ai_engineering_blogs #ecosystem-shift

Author:

Original: Release: llm-gemini 0.31. gemini-3.1-flash-lite is no longer a preview. Here's my write-up of the Gemini 3.1 Flash-Lite Preview model back in March. I don't believe this new non-preview model has changed since then. Tags: llm-release gemini llm google generative-ai ai llms
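A quick usage sketch, assuming the plugin is installed (llm install llm-gemini) and a Gemini API key is already configured; the model ID is the one named in the release note, and the prompt is just an example:

```python
import llm

# Assumes `llm install llm-gemini` has been run and a Gemini API key is configured.
model = llm.get_model("gemini-3.1-flash-lite")  # model ID from the release note above
response = model.prompt("In one sentence, what changes when a model graduates from preview to GA?")
print(response.text())
```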

Link: https://simonwillison.net/2026/May/7/llm-gemini/#atom-everything

Take: For llm-gemini 0.31, the more useful question is whether it ends up in teams' default toolchains, not how much short-term discussion it generates.

GitHub Repo Stats

Source: Simon Willison

Tags: #ai_engineering_blogs #trend-signal

Author:

Original: Tool: GitHub Repo Stats. One of the things I always look for when evaluating a new GitHub repository is the number of commits it has... but that number isn't visible on GitHub's mobile site layout. I built this tool to fix that, using this prompt: "Given a GitHub repo URL or foo/bar repo ID show information about that repo absorbed via either REST or GraphQL CORS fetch() including the number of commits in the repo and other useful stats". Example output for simonw/datasette and simonw/llm. Tags: github
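The tool itself is a vibe-coded browser page, but the underlying API trick is easy to spell out. Here's a rough Python sketch of one way to do it (not the tool's actual code): request a single commit per page from the commits endpoint and read the page number of the rel="last" pagination link, which equals the total commit count.

```python
import json
import re
import urllib.request


def repo_stats(repo: str) -> dict:
    """Fetch basic stats for an 'owner/name' GitHub repository via the REST API."""
    base = f"https://api.github.com/repos/{repo}"

    # Stars, forks and open issues come straight from the repo endpoint.
    with urllib.request.urlopen(base) as resp:
        info = json.load(resp)

    # Commit-count trick: ask for one commit per page, then read the page number
    # of the rel="last" pagination link, which equals the total number of commits.
    with urllib.request.urlopen(f"{base}/commits?per_page=1") as resp:
        link_header = resp.headers.get("Link", "")
    match = re.search(r'[?&]page=(\d+)>; rel="last"', link_header)
    commit_count = int(match.group(1)) if match else 1  # no Link header => one page

    return {
        "stars": info["stargazers_count"],
        "forks": info["forks_count"],
        "open_issues": info["open_issues_count"],
        "commits": commit_count,
    }


print(repo_stats("simonw/datasette"))
```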

Link: https://simonwillison.net/2026/May/7/github-repo-stats/#atom-everything

Take: What matters about GitHub Repo Stats isn't novelty but whether it improves engineering efficiency, deployment stability, or developer workflows.

Big Words

Source: Simon Willison

Tags: #ai_engineering_blogs #learning-value

Author:

Original: Tool: Big Words. I'm using my vibe coded macOS presentations tool to put together a talk, and I wanted to add a slide with some text on it. The tool only accepts URLs, so I put together a quick page that accepts query string arguments and turns them into a simple slide. Here's an example: https://tools.simonwillison.net/big-words?text=simonwillison.net&gradient=1&size=9.5

Double click or double tap the page to access a form for modifying the different options.

Tags: vibe-coding tools
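Since the page is driven entirely by query string arguments, slide URLs can be built programmatically. A tiny Python sketch using only the parameters visible in the example URL above (the tool may accept more options via its double-tap form):

```python
from urllib.parse import urlencode


def big_words_url(text: str, gradient: int = 1, size: float = 9.5) -> str:
    """Build a Big Words slide URL from its query-string options."""
    return "https://tools.simonwillison.net/big-words?" + urlencode(
        {"text": text, "gradient": gradient, "size": size}
    )


print(big_words_url("simonwillison.net"))
# -> https://tools.simonwillison.net/big-words?text=simonwillison.net&gradient=1&size=9.5
```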

Link: https://simonwillison.net/2026/May/7/big-words/#atom-everything

Take: What matters about Big Words isn't novelty but whether it improves engineering efficiency, deployment stability, or developer workflows.

Quoting Luke Curley

Source: Simon Willison

Tags: #ai_engineering_blogs #engineering-value

Author:

Original: WebRTC is designed to degrade and drop my prompt during poor network conditions. wtf my dude

WebRTC aggressively drops audio packets to keep latency low. If you’ve ever heard distorted audio on a conference call, that’s WebRTC baybee. The idea is that conference calls depend on rapid back-and-forth, so pausing to wait for audio is unacceptable.

…but as a user, I would much rather wait an extra 200ms for my slow/expensive prompt to be accurate. After all, I’m paying good money to boil the ocean, and a garbage prompt means a garbage response. It’s not like LLMs are particularly responsive anyway. But I’m not allowed to wait.

It’s impossible to even retransmit a WebRTC audio packet within a browser; we tried at Discord. The implementation is hard-coded for real-time latency or else.

-- Luke Curley, OpenAI’s WebRTC Problem, in response to How OpenAI delivers low-latency voice AI at scale

Tags: webrtc openai

Link: https://simonwillison.net/2026/May/9/luke-curley/#atom-everything

Take: Around Quoting Luke Curley, what really matters is whether it affects teams' model choices, performance boundaries, and product experience.

datasette-referrer-policy 0.1

Source: Simon Willison

Tags: #ai_engineering_blogs #ecosystem-shift

Author:

Original: Release: datasette-referrer-policy 0.1. The OpenStreetMap tiles on the Datasette global-power-plants demo weren't displaying correctly. This turned out to be caused by two bugs. The first is that the CAPTCHA I added to that site a few weeks ago was triggering for the .json fetch requests used by the map plugin, and since those weren't HTML the user was not being asked to solve them. Here's the fix. The second was that OpenStreetMap quite reasonably block tile requests from sites that use a Referrer-Policy: no-referrer header. Datasette does this by default, and I didn't want to change that default on people without warning - so I had Codex GPT-5.5 build me a new plugin to help set that header to another value.

Tags: openstreetmap http datasette
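I haven't read the plugin's source, but a Datasette plugin that overrides that header could look roughly like this, using Datasette's documented asgi_wrapper plugin hook. The policy value shown is just an example, and the real plugin's configuration options may differ:

```python
from datasette import hookimpl

# Example replacement value; Datasette's built-in default is "no-referrer".
POLICY = b"strict-origin-when-cross-origin"


@hookimpl
def asgi_wrapper(datasette):
    def wrap(app):
        async def wrapped_app(scope, receive, send):
            async def wrapped_send(event):
                if event["type"] == "http.response.start":
                    # Drop any existing Referrer-Policy header, then add our own.
                    headers = [
                        (name, value)
                        for name, value in event.get("headers", [])
                        if name.lower() != b"referrer-policy"
                    ]
                    headers.append((b"referrer-policy", POLICY))
                    event = dict(event, headers=headers)
                await send(event)

            await app(scope, receive, wrapped_send)

        return wrapped_app

    return wrap
```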

Link: https://simonwillison.net/2026/May/5/datasette-referrer-policy/#atom-everything

Take: What matters about datasette-referrer-policy 0.1 isn't novelty but whether it improves engineering efficiency, deployment stability, or developer workflows.