The way I look at it, for my personal site and content:
1. I'm not going to win the arms race. It's not my area of expertise. I
can't put hours into devising anti-bot countermeasures.
2. If I do try to implement countermeasures to prevent bots, I will likely
also end up impacting/inconveniencing some legitimate users.
3. Ultimately my goal is to help people, so if my content ends up training
an AI model, and that model ends up helping people, then I'm indirectly
meeting my goal.
4. Many of the AIs are now citing their sources, and that means I get some
level of attribution and recognition.
5. Some archives, such as the Wayback Machine, I find extremely useful
for vintage computer research. People die. Providers shut down. A lot of
knowledge has been lost. I'll be happy if my content eventually outlives me.
I wish there were more focus on (4). Everyone deserves recognition for
their work and their content. I'd support legislation to require that
sources be cited/acknowledged when AI results are returned. I think
there's some risk of "content laundering": a bot is trained from your
content, someone publishes an AI-generated article, and the next bot is
trained from that AI-generated content, losing the original attribution.
Without discipline, it can turn into a bunch of slop where nobody knows
where it came from or how accurate the information is.
Scott
On Wed, Sep 17, 2025 at 11:31 AM Bill Degnan via cctalk <cctalk(a)classiccmp.org> wrote:
On Wed, Sep 17, 2025 at 1:27 PM The Doctor via cctalk <cctalk(a)classiccmp.org> wrote:
On Tuesday, September 16th, 2025 at 17:01, Bill Degnan via cctalk <cctalk(a)classiccmp.org> wrote:
I wonder how long the WWW will remain open; it would be a bummer if I
found copies of my site elsewhere.
I've been thinking about this myself. It does not please me.
What web server do you use for your site? I've got some pretty robust
but easy-to-admin countermeasures set up on my own website that I'd be
happy to share if there is interest.
I run a web services company; vintagecomputer.net is internally supported.
vintagecomputer.net has been dealing with scrapers of one sort or another
for 20 years. The site is privately hosted and has web scraping control
measures built to detect a whole array of bot activity. Rather than block,
I believe it's better to detect and log, and then determine how best to
manage new types of bot probing and scraping on an ongoing basis; it's a
great way to learn white-hat hacking.
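
A minimal sketch of that detect-and-log idea, written as Python WSGI
middleware, might look like the following. The user-agent patterns, the
log file name, and the little demo app are assumptions for illustration,
not the actual vintagecomputer.net setup:

# Minimal sketch of a "detect and log, don't block" approach as WSGI
# middleware. Patterns and log path are illustrative assumptions only.
import logging
import re

logging.basicConfig(filename="bot_activity.log", level=logging.INFO,
                    format="%(asctime)s %(message)s")

# A few substrings commonly seen in crawler user-agent strings (assumed).
BOT_PATTERNS = re.compile(r"bot|crawl|spider|scrape|GPT|Claude", re.IGNORECASE)

class BotLoggerMiddleware:
    """Log suspected bot requests, then serve them normally."""
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        path = environ.get("PATH_INFO", "")
        addr = environ.get("REMOTE_ADDR", "")
        if BOT_PATTERNS.search(ua):
            # Log rather than block: keep a record to study later.
            logging.info("suspected bot %s %s %r", addr, path, ua)
        return self.app(environ, start_response)

# Example: wrap a trivial WSGI app and serve it locally for testing.
def hello_app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello\n"]

if __name__ == "__main__":
    from wsgiref.simple_server import make_server
    make_server("", 8000, BotLoggerMiddleware(hello_app)).serve_forever()

The same shape works behind nginx or Apache; the point is that suspected
bots still get served while their activity accumulates in a log you can
review before deciding how (or whether) to manage them.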