The ‘web scraping consent model’, as you so bootlickingly put it, is simple: if you put something on the internet, anyone can and will look at it. If you don’t agree to that, then we’re better off without you.
I completely disagree. The road to hell is paved with good intentions. What he proposes is just another flavor of technofascism, control disguised as ethics. These so-called humane scraping barriers will end up blocking humans, not machines. We have seen this before with CAPTCHAs, reCAPTCHAs, and all those “human verification” gimmicks, always bypassed by bots and always annoying to real people. Personally, I have nothing against crawlers and bots; let them do their shady jobs. Trying to wall yourself off is just meh. And those fancy licenses or “ethical use” terms won’t change anything either. The web is not the United States, and nobody really cares about someone’s imaginary social contracts. Maybe it’s time to accept a simple fact: once something goes on the web, it becomes public territory, and no one can pretend to control the flow of information anymore.
I agree with you on some points here. The problem is that these crawlers are hostile to the point of DDoSing sites.
So the problem is not that someone archives your publicly accessible data; the problem is that doing so either breaks your site or makes you pay for the excess traffic.
I think the web is now broken beyond repair. Commercialisation killed it, and the tech monopolies are all that’s left.
So I think small, invite-only, fully encrypted enclaves are all that is left, until someone comes up with a “new Internet” that can resist the “Techbros”, but for now I don’t see that.
Also I don’t see the Fediverse as a solution, it’s just under the radar for now, but if it gets bigger it will be coöpted and sunk.
Personally, I have nothing against crawlers and bots
If they’re implemented reasonably, web crawlers aren’t the issue. The problems with them mostly stem from laziness and cost cutting. Crawlers run by AI companies frequently DDoS entire services, especially Git forges like GitLab or Forgejo. Not “intentionally”, but because these crawlers blindly request every URL on a service, no matter the actual content. It’s cheaper for the AI company to scrape everything this way and sift through the data later, but it forces the service to render and serve tens of thousands of times as much content as it actually hosts. These crawlers are also built to hide what they’re doing, which is the biggest reason we now see “modern” PoW CAPTCHAs everywhere, like Anubis or go-away.
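For anyone who hasn’t run into these yet: tools like Anubis work on simple proof-of-work. Here’s a minimal sketch of the general idea in Go (the function names, challenge token, and difficulty are made up for illustration; this is not Anubis’s actual code):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
	"strings"
)

// solve brute-forces a nonce such that SHA-256(challenge + nonce)
// starts with `difficulty` zero hex digits.
func solve(challenge string, difficulty int) (int, string) {
	target := strings.Repeat("0", difficulty)
	for nonce := 0; ; nonce++ {
		sum := sha256.Sum256([]byte(challenge + strconv.Itoa(nonce)))
		h := hex.EncodeToString(sum[:])
		if strings.HasPrefix(h, target) {
			return nonce, h
		}
	}
}

func main() {
	// The client grinds through nonces in the browser; the server
	// verifies the submitted answer with a single hash.
	nonce, h := solve("example-challenge-token", 4)
	fmt.Printf("nonce=%d hash=%s\n", nonce, h)
}
```

The asymmetry is the whole point: a human browser pays the hashing cost once per visit, while a crawler blindly requesting tens of thousands of URLs has to pay it on every single one.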
Robots.txt used to work because search engines needed there to be an “internet” in order to provide their services. Pre-AI web crawlers were built with the understanding that taking down a service meant another website dropping out of the index, which lowered the overall quality of search results.
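And robots.txt was only ever a politely worded request. A typical file looks something like this (the user-agent names here are just examples):

```
User-agent: GPTBot
Disallow: /

User-agent: *
Crawl-delay: 10
Disallow: /admin/
```

Nothing enforces any of it. Compliance was always voluntary, which worked fine while crawler operators had an incentive to keep websites alive, and fails completely now that they don’t.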
I’ve had LLM web crawlers take down my whole server by DDoSing it several times. Pre-LLMs, a Git forge would take maybe a couple hundred MB of RAM and be mostly idle while not in use. Nowadays, without a PoW CAPTCHA in front, there are often over 10,000 active concurrent connections to a small, single-person Git forge. This makes hosting costs go through the roof for any smaller entity.
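Plain rate limiting in the reverse proxy is the obvious first step, and it’s trivial to set up. A hypothetical nginx sketch (zone name, numbers, and upstream all invented):

```nginx
# Inside the http block: track clients by IP and allow each a
# sustained 5 requests per second.
limit_req_zone $binary_remote_addr zone=perip:10m rate=5r/s;

upstream forge {
    server 127.0.0.1:3000;  # e.g. a small local Git forge
}

server {
    listen 80;
    location / {
        # Absorb bursts of up to 20 requests without delay;
        # anything beyond that is rejected (503 by default).
        limit_req zone=perip burst=20 nodelay;
        proxy_pass http://forge;
    }
}
```

But per-IP limits like this barely register against crawlers that hide themselves by spreading requests across huge pools of addresses, which is how you end up needing the PoW wall instead.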
I don’t understand how their solution differs from what currently happens at any other big site like Reddit.
You have to create an account, and then there’s scraping resistance via CAPTCHAs and rate limiting.
There’s no need to pave a road to hell if you’re already in hell because you’ve surrendered to bad intentions and now they’re all that’s left. The logical conclusion of that acceptance is that it’s in nobody’s interest to put anything on the web (or anything equivalent), which leaves it even more of a consumption-only medium than it already is.
Also, what’s happening right now is, in fact, the flow of information being controlled: controlled to flow primarily towards a few powerful entities, that is. You’re neglecting to consider the effects of power differentials. Those powerful entities need to be constrained for the flow of information to actually be free.
Granted, the solution proposed in the blog post seems a bit too technical and high-friction to really be feasible, but at least people are thinking about it.
@HetareKing @hobata Is there a link to this blog post somewhere? All I’m getting is a headline and a circular link back to the headline.
Unfortunately, all I can say is that the link in the post works for me and it doesn’t appear to be forwarding to anything. Somewhat ironically, it’s on the Internet Archive, though, so maybe that works for you.
I don’t agree, and your conclusions aren’t obvious to me. All I see is that the only problems are with getting information, not with putting something on the web. And we’re not in hell; in fact, we’re in the very best place and time.
In order to be able to get information on the web, people need to put it on the web first. And for that to happen, there needs to be something to motivate them to do so. What those motivations are is going to differ between people and situations: it could be a pure desire to contribute to the commons, it could be part of how they make their income, it could be any number of other things. But if putting something on the web means accepting that you’re going to be helping vile companies achieve their goals, that the way most people see this information will be in a perverse form, riddled with falsehoods and with no attribution (or maybe worse, mostly falsehoods attributed to you), and that there’s nothing you can do about it, that’s going to put a damper on a lot of those motivations, and the ones it doesn’t dampen tend to be the less desirable ones.
And it’s not just information that’s on the web, it’s also collaborative efforts like open source software. Why do people release source code under licenses like the GPL? Because they believe those constraints lead to a better outcome than if they had just put it in the public domain: that their contributions to the commons lead to more contributions to the commons, even from people who may not otherwise be inclined or incentivised to make them. If it becomes trivial to undermine those licenses (and for the record, those licenses do get enforced, and there have been companies that had to release the source code of their products because they violated the license), that may erode the reasons many people have to contribute to such projects at all.
You can be all cool and cynical about how social contracts are made up and whatnot, but let’s be honest here: if someone beats you to a pulp because they didn’t like the way you looked at them, you’re not going to just coolly accept your broken nose and displaced ribs as just the way things work.
Not my problem if people are too lazy to go to actual sites. I couldn’t care less if they decide to use dogshit tools to bypass my page. Why on earth would that dampen my motivation to make a page? Intelligent people will still see my page.
People publish things for all sorts of reasons, sure, but the main one is that they think the information has value. Whether some corporation profits from it is completely irrelevant, the data doesn’t care who benefits. The author doesn’t matter either, what matters is the information itself. Licenses just try to put fences around what should be free, and most of the time they only get in the way. People can follow them or ignore them, and life goes on. The scene understands that better than anyone else. It’s the purest form of the web: information shared for its own sake, without permission, limits, or fake moral theater.
Fortunately this quixotic proposal is just some guy’s blog post.
It’s probably a little bit like when people used to copy tapes of their favourite artists in the 80s, back when that was just for private use and home projects.
But this is next level.