Scanning docs for bitsavers
Grant Taylor
cctalk at gtaylor.tnetconsulting.net
Mon Dec 2 22:06:29 CST 2019
On 12/2/19 8:20 PM, Alexandre Souza via cctalk wrote:
> I cannot understand your problems with PDF files.
My problem with PDFs starts where most people stop using them.
Take the average PDF of text, try to copy and paste the text into a text
file. (That may work.)
Now try to edit a piece of the text, such as taking part of a line out,
or adding to a line. (You can probably do that too.)
Now fix the line wrapping to get the margins back to where they should
be. (This will likely be a nightmare without a good text editor to
reflow the text.)
All of the text I get out of PDFs is (at best) discrete lines that are
unassociated with other lines. They just happen to be next to each other.
Conversely, if I copy text off of a web page or out of many programs, I
can paste into an editor, make my desired changes, and the line
re-wrapping is already done for me. This works for non-PDF sources
because it's a continuous line of text that can be re-wrapped and re-used.
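A minimal sketch of that difference, using Python's standard textwrap
module. The function names and the sample sentence are made up for
illustration, and "hard" lines here only approximate what a PDF copy
and paste gives you:

```python
import textwrap


def rewrap(text: str, width: int) -> str:
    """Re-wrap one continuous paragraph to the given width."""
    return textwrap.fill(text, width=width)


def rewrap_hard_lines(lines, width):
    """Wrap each line on its own, as if every line were a separate paragraph.

    This mimics pasted text whose lines are not associated with each other.
    """
    return "\n".join(textwrap.fill(line, width=width) for line in lines)


# Illustrative sample text (an assumption, not taken from any PDF).
continuous = (
    "Data is originated somewhere. Something is done to it. It's "
    "manipulated, reformatted, processed, displayed and / or printed, "
    "and ultimately consumed."
)

# Simulate hard line breaks at 60 columns, then narrow both to 40.
hard_lines = textwrap.wrap(continuous, width=60)

if __name__ == "__main__":
    print(rewrap(continuous, 40))
    print("----")
    print(rewrap_hard_lines(hard_lines, 40))
```

The continuous version fills every line up to the new margin. The
hard-broken version keeps each old line break, so it tends to leave
short leftover lines wherever an old line did not fit the new width.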
In my opinion, PDFs are the last place that computer-usable data goes,
because getting anything out of a PDF as a data source is next to
impossible.
Sure, you, a human, can read it and consume the data.
Try importing a simple table from a PDF and working with the data in
something like a spreadsheet. You can't do it. The raw data is there.
But you can't readily use it.
This is why I say that a PDF is the end of the line for data.
I view it as effectively impossible to take data out of a PDF and do
anything with it without first reconstituting it.
> I've created lots and lots of PDFs, with treated and untreated scanned
> material. All of them are very readable and in use for years.
Sure, you, a human, can quite easily read it. But you are not
processing the data the way that I'm talking about.
> Of course, garbage in, garbage out.
I'm not talking about GIGO.
> I take the utmost care in my scans to have good enough source files,
> so I can create great PDFs.
>
> Of course, Guy's comments are very informative and I'll learn more from them.
> But I still believe in good preservation using PDF files. FOR ME it is the
> best we have in encapsulating info. Forget HTMLs.
I find HTML to be IMMENSELY easier to extract data from.
> Please, take a look at this PDF, and tell me: Isn't that good enough for
> preservation/use?
It's good enough for humans to use.
But it suffers from the same problem that I'm describing.
Try copying the text and pasting it into a wider or narrower document.
What happens to the line wrapping or margins? Based on my experience,
they are crap.
With HTML, I can copy content and paste it into a wider or narrower
window without any problem.
Data is originated somewhere. Something is done to it. It's
manipulated, reformatted, processed, displayed and / or printed, and
ultimately consumed. In my experience, PDF files are the end of that
chain. There is no good way to get text out of a PDF.
Take (part of) the first paragraph of your sample PDF: What's easier to
re-use in a new document:
This (direct copy and paste):
--8<--
Os transceptores Control modelo TAC-45 (versão de 10 a 45 Watts) e
TAC-70 (versão de 10 a
70 Watts) foram um marco na radiocomunicação comercial brasileira.
Lançados em 1983,
consistiam num transceptor dividido em dois blocos: o corpo do rádio e
um cabeçote de
comando, onde ficavam os comandos de volume, squelch e o seletor de 4
canais.
-->8--
Or this:
--8<--
Os transceptores Control modelo TAC-45 (versão de 10 a 45 Watts) e
TAC-70 (versão de 10 a 70 Watts) foram um marco na radiocomunicação
comercial brasileira. Lançados em 1983, consistiam num transceptor
dividido em dois blocos: o corpo do rádio e um cabeçote de comando, onde
ficavam os comandos de volume, squelch e o seletor de 4 canais.
-->8--
With format=flowed, the second copy will re-scale to any window width. I
can also triple click to select the entire paragraph, something I can't
do with the first copy. Heck, I can't even reliably do anything with a
sentence in the first copy. It's all broken lines. The second copy is
a continuous string that makes up (part of) the paragraph.
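For what it's worth, the repair needed to turn the first copy into the
second can be sketched in a few lines of Python. This is only a rough
illustration: the function name is invented, and it assumes that a
blank line separates paragraphs, which text copied out of a PDF does
not guarantee.

```python
def unwrap(text: str) -> str:
    """Join hard-broken lines into one continuous line per paragraph.

    Assumption: paragraphs are separated by at least one blank line.
    """
    paragraphs = []
    current = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped:
            # Part of the current paragraph: keep collecting.
            current.append(stripped)
        elif current:
            # Blank line closes the paragraph collected so far.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return "\n\n".join(paragraphs)


# A few of the broken lines from the direct copy and paste.
broken = (
    "Lançados em 1983,\n"
    "consistiam num transceptor dividido em dois blocos: o corpo do rádio e\n"
    "um cabeçote de\n"
    "comando, onde ficavam os comandos de volume, squelch e o seletor de 4\n"
    "canais."
)

if __name__ == "__main__":
    print(unwrap(broken))
```

Even this only works when the paragraph boundaries survive the copy.
When they don't, there is nothing left in the text to say where one
paragraph ends and the next begins.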
Which format would you like to work with if you need to extract text
from a file and use it in something else? One where you have to
repair the damage introduced by the file format? Or one that
preserves the text integrity?
--
Grant. . . .
unix || die