On Feb 3, 2025, at 3:40 PM, Alexander Schreiber via
cctalk <cctalk(a)classiccmp.org> wrote:
...
On top of that: A lot of those LLMs are built on theft at an epically large
scale. They hoovered up everything in sight (and then some) without even
pretending to care about intellectual property rights - e.g. the NY Times
has taken OpenAI to court because they managed to make the OpenAI LLMs
spit out long verbatim fragments of NY Times content. The hilarious part
is that DeepSeek essentially stole from OpenAI that which OpenAI previously
stole from everyone else and OpenAI is very angry about the lack of honor
among thieves or something ;-)
Excellent point. I tend to refer to LLMs as "derived work generators" to point
out the copyright problems that are fundamental to what they do.
I also tend to wonder about web hoovering as a training scheme, given that a lot of web
content is fiction. And I don't mean "misinformation", I just mean novels
and the like. What happens to an LLM that inhales "The Martian" or
"Ringworld"?
paul