Lossy compression vs. archiving and OCR (was Re: Many things)

31 Jan 2005

Jim Battle wrote about DjVu:
...
  So for OCR purposes, I don't think this type of compression really
  hurts -- it replaces one plausible "e" image with another one.
No, that's exactly the kind of BS you DO NOT WANT for a file that you
plan to OCR.  What if you've got a mathematical formula that has some
Latin "e" letters and some Greek epsilons in it?  Or perhaps normal
and italic "e" letters?  DjVu may well think they are "close enough",
while a good OCR program might be able to tell them apart accurately.
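
To make the "close enough" worry concrete, here is a toy sketch in Python.
It is NOT DjVu's actual encoder. The 5x5 bitmaps, the pixel-difference
measure, and the threshold value are all made up for illustration. It only
assumes the general idea of a symbol-matching compressor: a glyph that
differs from an already-stored glyph by no more than some threshold is
replaced by a reference to the stored one.

```python
# Toy illustration only: hypothetical bitmaps and threshold, not DjVu's algorithm.

LATIN_E = [
    ".###.",
    "#...#",
    "#####",
    "#....",
    ".###.",
]

EPSILON = [
    ".###.",
    "#....",
    ".##..",
    "#....",
    ".###.",
]


def pixel_difference(a, b):
    """Count the pixels that differ between two same-sized bitmaps."""
    return sum(pa != pb for row_a, row_b in zip(a, b) for pa, pb in zip(row_a, row_b))


def encode(glyphs, threshold):
    """Replace each glyph by the index of the first stored glyph within threshold.

    Returns (dictionary, indices). With threshold 0 only identical glyphs
    are merged, so every input glyph can be recovered exactly.
    """
    dictionary = []
    indices = []
    for glyph in glyphs:
        for i, stored in enumerate(dictionary):
            if pixel_difference(glyph, stored) <= threshold:
                indices.append(i)
                break
        else:
            dictionary.append(glyph)
            indices.append(len(dictionary) - 1)
    return dictionary, indices


def decode(dictionary, indices):
    """Rebuild the glyph sequence from the dictionary and the indices."""
    return [dictionary[i] for i in indices]


page = [LATIN_E, EPSILON, LATIN_E]

exact_dict, exact_idx = encode(page, threshold=0)
lossy_dict, lossy_idx = encode(page, threshold=4)

print(len(exact_dict), exact_idx)
print(len(lossy_dict), lossy_idx)
print(decode(lossy_dict, lossy_idx)[1] == EPSILON)
```

With the threshold at 0 both shapes stay in the dictionary.  With the
looser threshold the epsilon is stored as a reference to the Latin "e",
and no later program can get it back from the decoded page.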
The point of wanting lossless compression is that even if a good
OCR program today can't tell them apart accurately, a good OCR program
ten years from now might.
But if you use lossy compression now, you are likely discarding
information that the OCR program will need.
Eric
