Alex, your posts come over with the "flag" set (I get a red "flag" on my iPhone). Did you
mean to flag all your responses for some reason?
Sent from my iPhone
On Feb 3, 2025, at 13:15, Alexander Schreiber via cctalk <cctalk(a)classiccmp.org> wrote:
On Mon, Feb 03, 2025 at 03:54:31PM -0500, Paul Koning via cctalk wrote:
On Feb 3, 2025, at 3:40 PM, Alexander Schreiber via cctalk <cctalk(a)classiccmp.org> wrote:
... On top of that: A lot of those LLMs are built on theft on an epically
large scale. They hoovered up everything in sight (and then some) without
even pretending to care about intellectual property rights - e.g. the NY
Times has taken OpenAI to court because they managed to make the OpenAI
LLMs spit out long verbatim fragments of NY Times content. The hilarious
part is that DeepSeek essentially stole from OpenAI that which OpenAI had
previously stolen from everyone else, and OpenAI is very angry about the
lack of honor among thieves or something ;-)
Excellent point. I tend to refer to LLMs as "derived work generators" to
point out the copyright problems that are fundamental to what they do.
I just call them "bullshit generators", based on Harry Frankfurt's "On Bullshit".
I also tend to wonder about web hoovering as a training scheme, given that a
lot of web content is fiction. And I don't mean "misinformation", I just
mean novels and the like. What happens to an LLM that inhales "The Martian"
or "Ringworld"?
That's probably a lot less harmful than what already happened: more than
one model had to be pulled and deleted (along with the corpus it was
trained on) because its makers had unknowingly hoovered up CSAM, trained
the model on it, and the model was cheerfully spitting that filth back out.
If you blindly hoover up the entire Internet, you're going to find stuff
that you probably don't want to have on your systems.
Kind regards,
Alex.
--
"Opportunity is missed by most people because it is dressed in overalls and
looks like work." -- Thomas A. Edison