Scanning docs for bitsavers

Grant Taylor cctalk at gtaylor.tnetconsulting.net
Mon Dec 2 22:06:29 CST 2019


On 12/2/19 8:20 PM, Alexandre Souza via cctalk wrote:
> I cannot understand your problems with PDF files.

My problem with PDFs starts where most people stop using them.

Take the average PDF of text, try to copy and paste the text into a text 
file.  (That may work.)

Now try to edit a piece of the text, such as taking part of a line out, 
or adding to a line.  (You can probably do that too.)

Now fix the line wrapping to get the margins back to where they should 
be.  (This will likely be a nightmare without a good text editor to 
reflow the text.)

All of the text I get out of PDFs is (at best) discrete lines that are 
unassociated with other lines.  They just happen to be next to each other.

Conversely, if I copy text off of a web page or out of many programs, I 
can paste into an editor, make my desired changes, and the line 
re-wrapping is already done for me.  This works for non-PDF sources 
because it's a continuous line of text that can be re-wrapped and re-used.

In my opinion, PDFs are the last place that computer-usable data goes, 
because getting anything out of a PDF as a data source is next to 
impossible.

Sure, you, a human, can read it and consume the data.

Try importing a simple table from a PDF and working with the data in 
something like a spreadsheet.  You can't do it.  The raw data is there. 
But you can't readily use it.

This is why I say that a PDF is the end of the line for data.

I view it as effectively impossible to take data out of a PDF and do 
anything with it without first reconstituting it.

> I've created lots and lots of PDFs, with treated and untreated scanned
> material. All of them are very readable and in use for years.

Sure, you, a human, can quite easily read it.  But you are not 
processing the data the way that I'm talking about.

> Of course, garbage in, garbage out.

I'm not talking about GIGO.

> I take the utmost care in my scans to have good enough source files, 
> so I can create great PDFs.
> 
> Of course, Guy's comments are very informative and I'll learn more from them.
> But I still believe in good preservation using PDF files. FOR ME it is the
> best we have in encapsulating info. Forget HTMLs.

I find HTML to be IMMENSELY easier to extract data from.

> Please, take a look at this PDF, and tell me: Isn't that good enough for
> preservation/use?

It's good enough for humans to use.

But it suffers from the same problem that I'm describing.

Try copying the text and pasting it into a wider or narrower document. 
What happens to the line wrapping or margins?  Based on my experience, 
they are crap.

With HTML, I can copy content and paste it into a wider or narrower 
window without any problem.

Data is originated somewhere.  Something is done to it.  It's 
manipulated, reformatted, processed, displayed and / or printed, and 
ultimately consumed.  In my experience, PDF files are the end of that 
chain.  There is no good way to get text out of a PDF.

Take (part of) the first paragraph of your sample PDF.  Which is easier 
to re-use in a new document?

This (direct copy and paste):
--8<--
Os transceptores Control modelo TAC-45 (versão de 10 a 45 Watts) e 
TAC-70 (versão de 10 a
70 Watts) foram um marco na radiocomunicação comercial brasileira. 
Lançados em 1983,
consistiam num transceptor dividido em dois blocos: o corpo do rádio e 
um cabeçote de
comando, onde ficavam os comandos de volume, squelch e o seletor de 4 
canais.
-->8--

Or this:
--8<--
Os transceptores Control modelo TAC-45 (versão de 10 a 45 Watts) e 
TAC-70 (versão de 10 a 70 Watts) foram um marco na radiocomunicação 
comercial brasileira. Lançados em 1983, consistiam num transceptor 
dividido em dois blocos: o corpo do rádio e um cabeçote de comando, onde 
ficavam os comandos de volume, squelch e o seletor de 4 canais.
-->8--

With format=flowed, the second copy will re-scale to any window width. I 
can also triple click to select the entire paragraph, something I can't 
do with the first copy.  Heck, I can't even reliably do anything with a 
sentence in the first copy.  It's all broken lines.  The second copy is 
a continuous string that makes up (part of) the paragraph.
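
To make the repair step concrete, here is a minimal sketch of what has 
to be done to the first copy before it behaves like the second.  It 
assumes blank lines mark paragraph boundaries, which PDF copy and paste 
does not guarantee; the function names and the sample string are made up 
for illustration.

```python
import re
import textwrap


def unwrap_paragraphs(text):
    """Join hard-wrapped lines into one continuous string per paragraph.

    Assumption: paragraphs are separated by at least one blank line.
    Text copied from a PDF often loses even that, so this is a heuristic.
    """
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # Collapse every run of whitespace (including newlines) to one space.
    return [" ".join(p.split()) for p in paragraphs if p.strip()]


def reflow(text, width=72):
    """Re-wrap each paragraph to the requested width."""
    return "\n\n".join(
        # break_on_hyphens=False keeps tokens such as "TAC-70" on one line.
        textwrap.fill(p, width=width, break_on_hyphens=False)
        for p in unwrap_paragraphs(text)
    )


# Hypothetical sample: part of the broken-line copy shown above.
pdf_copy = (
    "TAC-70 (versão de 10 a\n"
    "70 Watts) foram um marco na radiocomunicação comercial brasileira.\n"
    "Lançados em 1983,\n"
    "consistiam num transceptor dividido em dois blocos"
)

if __name__ == "__main__":
    print(reflow(pdf_copy, width=40))
```

This only handles the easy case.  It cannot tell a real paragraph break 
from a line break the PDF introduced, and it does nothing for words 
hyphenated across lines or for tables.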

Which format would you like to work with if you need to extract text 
from a file and use it in something else?  One where you have to repair 
the damage introduced by the file format?  Or one that preserves the 
text integrity?



-- 
Grant. . . .
unix || die






