Paul
In a perfect world, yes. Here's a trick you guys can use: generate a
robots.txt and add a few pages not to crawl. Assuming bad bots will
ignore it, one of the "do not crawl" pages will have a trigger that blocks
the IP address of the session. You would need the ability to communicate
the IP address of the offending bot to a process that does the blocking.
There are various ways to do that.
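A minimal sketch of the idea, in Python (the trap path and in-memory blocklist are hypothetical; a real deployment would push offending IPs to a firewall via something like fail2ban or ipset rather than keep them in a set):

```python
# Hypothetical trap path, advertised as Disallow in robots.txt.
# Well-behaved crawlers never request it; anything that does is a bad bot.
TRAP_PATHS = {"/private/do-not-crawl/"}

blocked_ips = set()  # stand-in for a real blocking mechanism

def robots_txt():
    """Serve a robots.txt that tells all crawlers to skip the trap path."""
    lines = ["User-agent: *"]
    lines += [f"Disallow: {p}" for p in sorted(TRAP_PATHS)]
    return "\n".join(lines) + "\n"

def handle_request(path, client_ip):
    """Return an HTTP status code; block any IP that fetches a trap path."""
    if client_ip in blocked_ips:
        return 403                    # already caught on a previous request
    if path in TRAP_PATHS:
        blocked_ips.add(client_ip)    # the trigger: record and block the IP
        return 403
    return 200
```

Once an IP trips the trap, every subsequent request from it is refused, which is the "communicate the IP to a process that does the blocking" step collapsed into one handler for illustration.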
On Wed, Sep 17, 2025 at 9:46 AM Paul Koning via cctalk <cctalk(a)classiccmp.org> wrote:
A web crawler that does not obey robots.txt is not a law-abiding outfit.
Best would be to block it entirely. If they are that dismissive of
honesty, they are also unlikely to pay attention to such matters as
copyright and intellectual property ownership.
paul
On Sep 16, 2025, at 8:55 PM, Wayne S via cctalk <cctalk(a)classiccmp.org> wrote:
They do not observe robots.txt.
Sent from my iPhone
> On Sep 16, 2025, at 17:53, Wayne S <wayne.sudol(a)hotmail.com> wrote:
>
> I did notice the scraping.
> I toyed with the idea of putting ludicrous text files up that a normal
> user would not see and see which bot got them.
>
> Sent from my iPhone
>
>> On Sep 16, 2025, at 17:02, Bill Degnan via cctalk <cctalk(a)classiccmp.org> wrote:
>>
>> For those of you who run vintage computing-related info sites, have you
>> noticed all of the LLM scraper activity? AI services are using the LLM
>> scrapers to populate their knowledge bases.
>>
>> At any given moment 5-10 of them are active on vintagecomputer.net. It’s
>> funny, when I ask an AI about something vintage computing-related,
>> something obscure, I can trick it into giving me an answer from my own
>> site.
>>
>> I have actually had to modify the site code to manage the traffic, to
>> improve efficiency.
>>
>> But they’re not going after just my site, these scrapers are absorbing
>> copies of the entire WWW.
>>
>> I wonder how long the WWW will remain open; it would be a bummer if I
>> found copies of my site elsewhere.
>>
>> Bill