The ‘web scraping consent model’, as you so bootlickingly put it, is simple: if you put something on the internet, anyone can and will look at it. If you don’t agree to that, then we’re better off without you.
I completely disagree. The road to hell is paved with good intentions. What he proposes is just another flavor of technofascism, control disguised as ethics. These so-called humane scraping barriers will end up blocking humans, not machines. We have seen this before with CAPTCHAs, reCAPTCHAs, and all those “human verification” gimmicks, always bypassed by bots and always annoying to real people. Personally, I have nothing against crawlers and bots; let them do their shady jobs. Trying to wall yourself off is just meh. And those fancy licenses or “ethical use” terms won’t change anything either. The web is not the United States, and nobody really cares about someone’s imaginary social contracts. Maybe it’s time to accept a simple fact: once something goes on the web, it becomes public territory, and no one can pretend to control the flow of information anymore.
I agree with you on some points here. The problem is that these crawlers are hostile to the point of DDoSing sites.
So the problem is not that someone archives your publicly accessible data; the problem is that doing so either breaks your site or makes you pay for the excess traffic.
I think the web is now broken beyond repair. Commercialisation killed it, and the tech monopolies are all that’s left.
So I think small, invite-only, fully encrypted enclaves are all that is left, until someone comes up with a “new Internet” that can resist the “Techbros”, but for now I don’t see that.
Also I don’t see the Fediverse as a solution, it’s just under the radar for now, but if it gets bigger it will be coöpted and sunk.
Personally, I have nothing against crawlers and bots
If they’re implemented reasonably, web crawlers aren’t the issue. The problems with them mostly stem from laziness and cost cutting. Crawlers run by AI companies frequently DDoS entire services, especially Git forges like GitLab or Forgejo. Not “intentionally”, but because these crawlers blindly request every URL on a service, no matter the actual content. It’s cheaper for the AI company to scrape everything this way and sift through the data later, but it forces the service to render and serve tens of thousands of times as much content as it actually hosts. These crawlers are also built to hide what they’re doing, which is the biggest reason we now see “modern” PoW CAPTCHAs everywhere, like Anubis or go-away.
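For anyone who hasn’t run into these yet: tools like Anubis work on simple proof-of-work. Here’s a minimal sketch of the general idea in Go (the function names, challenge token, and difficulty are made up for illustration; this is not Anubis’s actual code):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
	"strings"
)

// solve brute-forces a nonce such that SHA-256(challenge + nonce)
// starts with `difficulty` zero hex digits.
func solve(challenge string, difficulty int) (int, string) {
	target := strings.Repeat("0", difficulty)
	for nonce := 0; ; nonce++ {
		sum := sha256.Sum256([]byte(challenge + strconv.Itoa(nonce)))
		h := hex.EncodeToString(sum[:])
		if strings.HasPrefix(h, target) {
			return nonce, h
		}
	}
}

func main() {
	// The client grinds through nonces in the browser; the server
	// verifies the submitted answer with a single hash.
	nonce, h := solve("example-challenge-token", 4)
	fmt.Printf("nonce=%d hash=%s\n", nonce, h)
}
```

The asymmetry is the whole point: a human browser pays the hashing cost once per visit, while a crawler blindly requesting tens of thousands of URLs has to pay it on every single one.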
Robots.txt used to work because search engines needed there to be an “internet” in order to provide their services. Pre-AI web crawlers were built with the understanding that taking down a service meant another website dropping out of the index, which lowered the overall quality of search results.
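And robots.txt was only ever a politely worded request. A typical file looks something like this (the user-agent names here are just examples):

```
User-agent: GPTBot
Disallow: /

User-agent: *
Crawl-delay: 10
Disallow: /admin/
```

Nothing enforces any of it. Compliance was always voluntary, which worked fine while crawler operators had an incentive to keep websites alive, and fails completely now that they don’t.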
I’ve had LLM web crawlers take down my whole server by DDoSing it several times. Pre-LLMs, a Git forge would take maybe a couple hundred MB of RAM and be mostly idle while not in use. Nowadays, without a PoW CAPTCHA in front, there are often over 10,000 active concurrent connections to a small, single-person Git forge. This makes hosting costs go through the roof for any smaller entity.
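Plain rate limiting in the reverse proxy is the obvious first step, and it’s trivial to set up. A hypothetical nginx sketch (zone name, numbers, and upstream all invented):

```nginx
# Inside the http block: track clients by IP and allow each a
# sustained 5 requests per second.
limit_req_zone $binary_remote_addr zone=perip:10m rate=5r/s;

upstream forge {
    server 127.0.0.1:3000;  # e.g. a small local Git forge
}

server {
    listen 80;
    location / {
        # Absorb bursts of up to 20 requests without delay;
        # anything beyond that is rejected (503 by default).
        limit_req zone=perip burst=20 nodelay;
        proxy_pass http://forge;
    }
}
```

But per-IP limits like this barely register against crawlers that hide themselves by spreading requests across huge pools of addresses, which is how you end up needing the PoW wall instead.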
I don’t understand how their solution differs from what currently happens at any other big site like Reddit.
You have to create an account, and then there’s scraping resistance via CAPTCHAs and rate limiting.
There’s no need to pave a road to hell if you’re already in hell because you’ve surrendered to bad intentions and now they’re all that’s left. The logical conclusion of that acceptance is that it’s in nobody’s interest to put anything on the web (or anything equivalent), which leaves it even more of a consumption-only medium than it already is.
Also, what’s happening right now is, in fact, the flow of information being controlled: controlled to flow primarily towards a few powerful entities, that is. You’re neglecting to consider the effects of power differentials. Those powerful entities need to be constrained for the flow of information to actually be free.
Granted, the solution proposed in the blog post seems a bit too technical and high-friction to really be feasible, but at least people are thinking about it.
@HetareKing @hobata Is there a link to this blog post somewhere? All I’m getting is a headline and a circular link back to the headline.
Unfortunately, all I can say is that the link in the post works for me and it doesn’t appear to be forwarding to anything. Somewhat ironically, it’s on the Internet Archive, though, so maybe that works for you.
I don’t agree, and your conclusions aren’t obvious to me. All I see is that the only problems are with getting information, not with putting something on the web. And we’re not in hell; in fact, we’re in the very best place and time.
In order to be able to get information on the web, people need to put it on the web first. And for that to happen, there needs to be something to motivate them to do so. What those motivations are is going to differ between people and situations: it could be a pure desire to contribute to the commons, it could be part of how they make their income, it could be any number of other things. But if putting something on the web means accepting that you’re going to be helping vile companies achieve their goals, that the way most people see this information will be in a perverse form, riddled with falsehoods and with no attribution (or maybe worse, mostly falsehoods attributed to you), and that there’s nothing you can do about it, that’s going to put a damper on a lot of those motivations, and the ones it doesn’t dampen tend to be the less desirable ones.
And it’s not just information that’s on the web, it’s also collaborative efforts like open source software. Why do people release source code under licenses like the GPL? Because they believe those constraints lead to a better outcome than if they had just put it in the public domain: that their contributions to the commons lead to more contributions to the commons, even from people who may not otherwise be inclined or incentivised to make them. If it becomes trivial to undermine those licenses (and for the record, those licenses do get enforced, and there have been companies that had to release the source code of their products because they violated the license), that may erode the reasons many people have to contribute to such projects at all.
You can be all cool and cynical about how social contracts are made up and whatnot, but let’s be honest here: if someone beats you to a pulp because they didn’t like the way you looked at them, you’re not going to just coolly accept your broken nose and displaced ribs as just the way things work.
Not my problem if people are too lazy to go to actual sites. I couldn’t care less if they decide to use dogshit tools to bypass my page. Why on earth would that dampen my motivation to make a page? Intelligent people will still see my page.
People publish things for all sorts of reasons, sure, but the main one is that they think the information has value. Whether some corporation profits from it is completely irrelevant, the data doesn’t care who benefits. The author doesn’t matter either, what matters is the information itself. Licenses just try to put fences around what should be free, and most of the time they only get in the way. People can follow them or ignore them, and life goes on. The scene understands that better than anyone else. It’s the purest form of the web: information shared for its own sake, without permission, limits, or fake moral theater.
Fortunately this quixotic proposal is just some guy’s blog post.
It’s probably a little bit like when people used to copy tapes of their favourite artists in the 80s, back when that was just for private use and home projects.
But this is next level.