Paul
In a perfect world, yes. Here's a trick you guys can use: generate a
robots.txt and add a few pages not to crawl. Assuming bad bots will
ignore it, one of the "do not crawl" pages will have a trigger that blocks
the IP address of the session. You would need the ability to communicate
the IP address of the offending bot to a process that does the blocking.
There are various ways to do that.
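A minimal sketch of the idea, in Python (the trap path and in-memory blocklist are hypothetical; a real deployment would push offending IPs to a firewall via something like fail2ban or ipset rather than keep them in a set):

```python
# Hypothetical trap path, advertised as Disallow in robots.txt.
# Well-behaved crawlers never request it; anything that does is a bad bot.
TRAP_PATHS = {"/private/do-not-crawl/"}

blocked_ips = set()  # stand-in for a real blocking mechanism

def robots_txt():
    """Serve a robots.txt that tells all crawlers to skip the trap path."""
    lines = ["User-agent: *"]
    lines += [f"Disallow: {p}" for p in sorted(TRAP_PATHS)]
    return "\n".join(lines) + "\n"

def handle_request(path, client_ip):
    """Return an HTTP status code; block any IP that fetches a trap path."""
    if client_ip in blocked_ips:
        return 403                    # already caught on a previous request
    if path in TRAP_PATHS:
        blocked_ips.add(client_ip)    # the trigger: record and block the IP
        return 403
    return 200
```

Once an IP trips the trap, every subsequent request from it is refused, which is the "communicate the IP to a process that does the blocking" step collapsed into one handler for illustration.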
On Wed, Sep 17, 2025 at 9:46 AM Paul Koning via cctalk <cctalk(a)classiccmp.org> wrote:
A web crawler that does not obey robots.txt is not a law-abiding outfit.
Best would be to block it entirely. If they are that dismissive of
honesty, they are also unlikely to pay attention to such matters as
copyright and intellectual property ownership.
paul
On Sep 16, 2025, at 8:55 PM, Wayne S via cctalk <cctalk(a)classiccmp.org> wrote:
They do not observe robots.txt.
Sent from my iPhone
> On Sep 16, 2025, at 17:53, Wayne S <wayne.sudol(a)hotmail.com> wrote:
>
> I did notice the scraping.
> I toyed with the idea of putting ludicrous text files up that a normal
> user would not see and see which bot got them.
>
> Sent from my iPhone
>
>> On Sep 16, 2025, at 17:02, Bill Degnan via cctalk <cctalk(a)classiccmp.org> wrote:
>>
>> For those of you who run vintage computing-related info sites, have you
>> noticed all of the LLM scraper activity? AI services are using the LLM
>> scrapers to populate their knowledge bases.
>>
>> At any given moment 5-10 of them are active on vintagecomputer.net. It’s
>> funny, when I ask an AI about something vintage computing-related,
>> something obscure, I can trick it into giving me an answer from my own
>> site.
>>
>> I have actually had to modify the site code to manage the traffic, to
>> improve efficiency.
>>
>> But they’re not going after just my site, these scrapers are absorbing
>> copies of the entire WWW.
>>
>> I wonder how long the WWW will remain open; it would be a bummer if I
>> found copies of my site elsewhere.
>>
>> Bill