Could HTTP 402 be the Future of the Web?

When Tim Berners-Lee first created HTTP, way back in the early 1990s, he included a whole bunch of HTTP status codes that seemed like they might turn out to be useful. The one everybody knows is 404 Not Found - because as a human being, browsing around the web, you see that one a lot. Some of them are basically invisible to humans: every time you load a web page, there’s actually an HTTP 200 OK happening behind the scenes, but you don’t see it unless you go looking for it. Same with the various kinds of redirects - if the server gives back an HTTP 301, it’s saying “hey, the page you wanted has moved permanently; here’s the new address”; a 307 Temporary Redirect is saying “that’s over there right now, but it might be back here later, so don’t update your bookmarks”; and a 304 Not Modified is saying to your browser “hey, the version of the page that’s in your cache is still good; just use that one”.
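If you’ve never actually watched these codes go by, here’s a quick way to make them visible: a little Python sketch, using the requests library against a made-up URL, that prints each redirect hop your browser would normally follow silently.

```python
import requests

# Hypothetical URL that has "moved permanently" - any redirecting page will do.
resp = requests.get("https://example.com/old-page", allow_redirects=True)

# resp.history holds the intermediate responses: the 301s, 307s and friends.
for hop in resp.history:
    print(hop.status_code, "->", hop.headers.get("Location"))

# And the final response is, with any luck, a plain old 200 OK.
print(resp.status_code)
```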

If, like me, you’ve spent a lot of time building web apps and APIs, you’ve probably spent hours of your life poring over the HTTP specifications looking for the best response code for a particular situation… like if somebody asks for something they just deleted, should you return a 404 Not Found, or an HTTP 410 Gone - “hey, that USED to be here, but it’s gone and it isn’t coming back”?
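Here’s a minimal sketch of that decision - using Flask, with a made-up article store - just to show how the two codes divide the work:

```python
from flask import Flask, abort

app = Flask(__name__)

ARTICLES = {"pay-per-crawl": "Could HTTP 402 be the future of the web?"}  # live content
DELETED = {"my-hot-take-from-2009"}                                       # slugs we know used to exist

@app.route("/articles/<slug>")
def article(slug):
    if slug in ARTICLES:
        return ARTICLES[slug]   # 200 OK
    if slug in DELETED:
        abort(410)              # Gone: it existed once, and it isn't coming back
    abort(404)                  # Not Found: never heard of it
```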

And along the way, you’ve probably noticed this status code: HTTP 402 Payment Required. This code has been there since the dawn of the world wide web, but, to quote the Mozilla Developer Network: “the initial purpose of this code was for digital payment systems, however this status code is rarely used and no standard convention exists”.

Well, that might be about to change. Earlier this week, the folks over at Cloudflare announced they’re introducing something called “pay per crawl” - enabling content owners to charge AI crawlers for access.

Let’s put that in context for a second. For the vast majority of commercial websites out there, their ability to make money is directly linked to how many humans are browsing their site. Advertising, engagement, growth, subscribers - however you want to slice it, if you want to monetize the web, you probably want humans looking at your content. It’s not a great model - in fact, it’s mostly turned out to be pretty horrible in all kinds of ways - but, despite multiple valiant attempts to develop better ways to pay creators for content using some sort of microtransactions, advertising is the only one that’s really stood the test of time.

Then AI comes along. By which, in this context, I specifically mean language models trained on publicly available web data. I’ve published a transcript of this video on my blog. If somebody comes along in a couple of weeks and Googles “dylan beattie cloudflare pay per crawl”, Google could provide an AI-generated summary of this article. Or, like a lot of folks out there, that person might skip Google completely and just ask ChatGPT what Dylan Beattie thinks about pay per crawl - and they’ll get a nice, friendly - and maybe even accurate - summary of my post, and so that person never makes it as far as my website.

That is a tectonic shift in the way commercial web publishing works. For nearly thirty years, search engines have been primarily about driving traffic to websites; the entire field of SEO - search engine optimisation - is about how to engineer your own sites and your own content to make it more appealing to engines like Google and Bing. Now, we’re looking at a shift to a model where AI tools crawl your site, slurp up all your content, and endlessly regurgitate it for the edification of their users - and so the connection between the person who’s writing the thing, and the person who’s reading the thing, is completely lost.

For folks like me, who enjoy writing for humans, that’s just sad. For writers who rely on human web traffic to earn their living, it’s catastrophic… and that’s what makes Cloudflare’s proposal so interesting: they’re proposing a way to charge crawlers to read your content.

Now, websites have historically relied on something called a robots.txt file to control what search engines can and can’t see… but it’s advisory. The web was designed to be open. Robots.txt is like leaving all your doors and windows wide open and putting a sign on the lawn saying “NO BURGLARS ALLOWED”, and it’s just one of the many, many ways in which the architects of the web were… let’s say optimistically naïve about how humans actually behave. Which is maybe understandable, given that CERN, the nuclear research centre on the French/Swiss border where Tim Berners-Lee invented the World Wide Web, isn’t renowned for being a hotbed of unscrupulous capitalism.
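For anyone who hasn’t looked inside one lately, robots.txt is just a plain-text file of polite requests. Something like this - the crawler names are real published user-agent tokens, but the policy itself is a made-up example - asks a couple of AI crawlers to stay out while leaving the site open to everybody else:

```
# Please don't slurp up my content for training
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Everybody else, carry on
User-agent: *
Allow: /
```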

So you had a choice: make your content wide open and ask the robots to play nice, or lock it away behind a paywall so that only paying subscribers can see it. Which, of course, means the robots can’t see it either, so your site never shows up in Google - so we invented all kinds of clever ways to create content that was accessible to search engines but asked human visitors to register, or sign in, or create an account…

Now, in theory, Cloudflare’s proposal is pretty simple. If your website is hosted behind a Cloudflare proxy - and according to Backlinko, 43% of the world’s ten thousand busiest websites use Cloudflare, so that’s a LOT of websites - then when an AI crawler comes asking for content, you can reply with an HTTP 402 Payment Required and quote them a price - and if the crawler wants to pay, they try again and include a crawler-exact-price header indicating “yes, I will pay that much”. That’s reactive negotiation. Alternatively, there’s proactive negotiation, where the crawler says up front “hey, I’ll give you five bucks for that web page” and Cloudflare says “Yeah, sure!” - or, if you’ve told Cloudflare that page costs ten bucks, the crawler gets an HTTP 402 Payment Required, and they’re free to make a better offer or go crawl somewhere else.
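Here’s a rough sketch of that reactive flow from the crawler’s side, in Python, against a hypothetical URL. The crawler-exact-price header is the one Cloudflare’s announcement mentions; the name and format of the header carrying the quoted price are my assumptions, so treat the details as illustrative rather than gospel:

```python
import requests

PAGE = "https://example.com/articles/pay-per-crawl"  # hypothetical page behind a Cloudflare proxy
MAX_PRICE = 0.05                                     # the most this crawler is configured to pay, in USD

# First attempt: just ask for the page, with no payment headers at all.
resp = requests.get(PAGE)

if resp.status_code == 402:
    # Reactive negotiation: the 402 comes back with an asking price.
    # (Assumed header name and "USD 0.01"-style format - check Cloudflare's docs for the real thing.)
    currency, amount = resp.headers.get("crawler-price", "USD inf").split()
    if currency == "USD" and float(amount) <= MAX_PRICE:
        # Try again, committing to pay exactly the quoted price.
        resp = requests.get(PAGE, headers={"crawler-exact-price": f"{currency} {amount}"})

print(resp.status_code)  # 200 if the deal went through, 402 if we walked away
```

The proactive version is the same idea in reverse: send your best offer with the very first request and let Cloudflare either accept it or come back with a 402.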

Incidentally, folks, I’m anthropomorphising here. Crawlers are software. They don’t ask, they don’t want to pay, they don’t agree. Even “crawling” is a metaphor. We’re talking about configuration settings and binary logic in a piece of software that makes network requests. Sure, it makes a better story if you’ve got this picture in your head of some sort of creepy-crawly insect knocking on doors and haggling over prices, but it is just code. Don’t get too attached to the metaphor.

Anyway. That’s the model. Sounds simple, apart from two tiny little details… how do you know the thing making the requests is an AI crawler, and who handles the money?

The first part is being done using a proposal called Web Bot Auth, based on two drafts from the Internet Engineering Task Force, the IETF. One, known as the directory draft, is a mechanism for crawlers and web bots to publish a cryptographic key that websites can use to authenticate those bots; the second, the protocol draft, is a mechanism for using those keys to sign and validate individual HTTP requests.

That way, anybody running a web crawler can create a cryptographic key pair and register the public key with Cloudflare - “hey, we’re legit, our crawler will identify itself using THIS key, here’s the URL where you can validate it, and by the way, here’s where you send the bill”.
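Put the two drafts together and a signed crawler request ends up carrying headers along these lines. The drafts build on the HTTP Message Signatures standard, RFC 9421; the directory URL, key ID and signature value below are invented placeholders, and the exact header shapes may well drift as the drafts evolve:

```
GET /articles/pay-per-crawl HTTP/1.1
Host: example.com
Signature-Agent: "https://crawler.example/.well-known/http-message-signatures-directory"
Signature-Input: sig1=("@authority" "signature-agent");created=1751300000;expires=1751300600;keyid="NotARealKeyThumbprint";tag="web-bot-auth"
Signature: sig1=:Tm90QVJlYWxTaWduYXR1cmVKdXN0QVBsYWNlaG9sZGVy:
```

The idea is that the site - or Cloudflare, sitting in front of it - fetches the public key from that directory URL and checks the signature before deciding whether to serve the request, charge for it, or turn it away.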

And that’s the second part: Cloudflare is proposing to aggregate all of those requests, charge the crawler account, and distribute the revenue to the website owners whose content is being crawled. Cloudflare acts as what’s called the Merchant of Record for those transactions, which should make it all much more straightforward when it comes to things like taxation.

Let’s be realistic here. The technical bits of this are not that complicated. They’re built using existing web standards and protocols. The financial elements of the proposal are far more complex, but this is Cloudflare, a company that probably understands the relationship between internet traffic, billing and international revenue models better than anybody.

There’s one big question that isn’t addressed in their post: what stops AI bots just pretending to be regular humans using regular browsers, and bypassing the whole pay per crawl thing? I’m guessing that, this being Cloudflare, AI bot detection is one of the things they’re quite good at… but publishers also now have the option of putting everything behind some kind of paywall: humans have to sign in, and bots have to validate. There’s also no indication of what sort of amounts they have in mind - beyond the fact that their examples are in US dollars rather than, say, micros, the unit Google’s advertising and billing APIs use, worth one millionth of the base currency unit - a millionth of a dollar, if you’re dealing in dollars. But I guess capitalism will figure that out.

Folks, I have to be honest. I’ve been working on the web since it was invented, and this is the first thing I’ve seen in a long, long time that is genuinely exciting. Not necessarily at face value - I don’t care that much about Cloudflare making AI bots pay to crawl websites. No, what’s exciting is that if Cloudflare goes all-in on this, this could be a big step towards a standard model, and a set of protocols, for monetising automated access to online content - even if neither Cloudflare nor AI is involved.

Imagine a decentralised music streaming service, where the artists host their own media and playback apps negotiate access to that media via a central broker that validates the app requests and distributes the revenue. Playback costs ten cents; if an AI wants to ingest, mash up and remix your music? Fifty bucks. Or a local news site that can actually make money out of covering local news… how much would you pay to know what’s actually going on with all those sirens and smoke down the end of the High Street, from an experienced reporter who is actually on the scene asking questions, as opposed to somebody in an office recycling stuff they read on social media?

And the fact that the proposal is based around 402 Payment Required, something that’s been part of the web since the days before Google and Facebook, something that’s older than Netscape and Internet Explorer? That just makes me happy. It reminds me of the web back in the 1990s, when the protocols and proposals were all still new, and exciting, and it seemed like there was no limit to what we’d be able to build with them. And yeah, perhaps I’m being overly optimistic… but y’know, looking around at the state of the world, and the web, these days, maybe we could all use a little optimism.