Does anyone read/study Don Knuth's "Art ..." (TAOCP) any more? Sorting
using tape drives ... what fun! Memory, what memory ... don't need no
stinkin' memory!
On Mon, Jun 3, 2013 at 11:41 PM, Fred Cisin <cisin at xenosoft.com> wrote:
> One of
the administrators at the college that I worked at for 30 years
> (until last week!) creates overly elaborate imagery for simple content
> ("curriculum committee Thursday 10:30, room 451"), then prints it out,
> SCANS that, and attaches the scanned image to an email with Subject:
> "FYI" and body of "See the attachment".
On Mon, 3 Jun
2013, Matthew D Stock wrote:
Fred, the folks at my university can do one
better. They send email
with trivial messages in MS word format (bad), and actually attach two
copies of the doc... one in a serif font and one in a sans serif font.
I have no idea why.
multiple variants in the document still don't compare with printing that
document, optically scanning the printout, and attaching that picture of
the printout.
When I
taught beginning Data Structures And Algorithms, I always got a
few students who rejected the idea of being efficient ("throw hardware
at it"), and specifically objected to spending an entire 3 hour class
session on how to create algorithms for sorting and searching datasets
too big to fit into memory ("just get a bigger computer and load it
all into a single array in memory!")
Funny, because when I was interviewing
for some positions in some "big
data" companies (Google, Bloomberg, etc), they specifically ask about
these kinds of skills because for all the horsepower they have access
to, there's always more data.
That was a principle that I TRIED to teach.
Some exercises:
Q:"There is a RIDICULOUSLY LARGE file of structured records.
. . . [some discussion of record format] . . .
Write an efficient program to find the highest and lowest value records."
In casual discussion of the problem, I referred to multi-TeraByte, and
running the program on one of the generic PCs in the lab.
Lots of college CS majors wanted to read the file into an array in memory,
RUN A [standard library] SORT function, and then print out the first and
last records of the array in memory!
A few got really bent when I didn't give them a perfect score.
A: Read 1st record, and assume that it is largest and smallest. In a
single pass through the file, read each record, if it is smaller than
smallest or larger than largest, then it becomes the new smallest or
largest.
Q: "A company has an ENORMOUS customer database file, that is kept IN
ORDER. Every day they have new customer records to add to the database.
Explain the best algorithms to run at the end of the day to produce the
new even larger file."
Many students insisted that the correct way to do it was to append the new
data to the end of the file, read the file into memory, and then SORT it.
Any problems with that approach should be handled by "get a big enough
computer".
A: separate the new data into a different file. Sort it (IF small
enough). You now have two ordered files, not necessarily the same sizes.
Read one record from each file. Select the appropriate record and write
it to a new file and read a new record from whichever file that one had
come from. repeat until all records have been written to the new file.
(one pass involving one read, one compare, one write, for each record)
Q: Retrospective conversion of library card catalog into Online Public
Access Catalog, by scanning and OCR'ing the card catalog. The card
catalog was ALMOST in correct order (occasionally a drawer would get
dropped and manually put back in order).
What sort algorithms can take advantage of existing order, and do the
minimum amount of work for data that is almost in order?
Qb: Discuss how large the dataset must be, and how much disorder for
the relative benefits of a Shell-Metzner sort V a properly written Shaker
sort (with completion tests, diminishing pass length, etc.)
Q: a retrieval or search system gets enormous number of hits that need to
be presented to the user in sequence. Discuss the SIMPLEST ways to reduce
the latency from retrieval of records to displaying the FIRST PAGE (25?)
hits. (further ordering can proceed while the user is happily browsing
that first page)
--
Grumpy Ol' Fred cisin at
xenosoft.com