Don, you're just seeing one side of the picture.
From the PoV of the programmer just seeking the easiest way of coding a
problem on an 8-bit, byte-oriented machine, it is indeed true that an
undifferentiated sequence of bytes has considerable advantages.
However, in the historical world - and even nowadays in parts of the real
world - other considerations come into play, most notably efficiency and
performance.
It is a truism throughout computing that an important skill is the selection
of the correct level of abstraction (and often of indirection) for the
problem.
For underutilised machines (e.g. personal desktops), programmer ease is
typically the primary requirement. For the heavily scheduled batch machines
of 40 years ago, and for real-time applications, one may need to get nearer
the metal.
Andy Holt wrote:
Yes, I *know* this has been done other ways in the past.
What I am trying to figure out is the rationale behind why it has
(apparently) migrated into the file *name*.
That, I think, was a necessary side effect of the original Unix design
decision that "a file is a sequence of characters" without special
properties that are known to the operating system.
IMO, a file *is* an untyped string of *bytes*. The OS shouldn't
care about its representation (none of this "text mode" vs.
"binary mode" crap). Its "attributes" should solely be things
like size, creation time, ACLs, etc.
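To make that view concrete: a minimal C sketch, assuming a POSIX system, of
roughly everything the kernel will tell you about a file through stat() -
size, timestamps, permission bits - and nothing at all about record
structure or character representation:

#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    struct stat sb;

    if (argc < 2) {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        return 1;
    }
    if (stat(argv[1], &sb) != 0) {
        perror("stat");
        return 1;
    }
    printf("size:  %lld bytes\n", (long long)sb.st_size);
    printf("mode:  %o\n", (unsigned)(sb.st_mode & 07777));
    printf("mtime: %lld\n", (long long)sb.st_mtime);
    /* no field here says "80-byte fixed records" or "EBCDIC text" -
       the kernel simply does not know */
    return 0;
}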
An old problem of differing character sizes - typically on 36-bit-word
machines, where character sizes might be 6, 7, 8, or 9 bits - is now
reappearing with Unicode.
I could argue that application programs ought to be "blind" to the
representation of characters in a simple serial text file. In Unix the "dd"
program tries, with modest success, to handle the problem. In the '60s and
'70s the saving in file space from using fixed-length records without
storing "newline" (or whatever) could be vital.
So, as far as the OS was concerned, files might be serial, sequential,
indexed sequential, random (and perhaps other organisations), with fixed or
variable record sizes (see the DCB card in OS/360 JCL); there may be a
IMO, this was a mistake. It forces the OS to know too much
about the applications that run on it -- instead of being a
resource manager. I.e. it should implement mechanisms, not
policy.
There is a problem here - if you present only an abstraction of the hardware
to the programmer, you have no means of using knowledge of the underlying
hardware to gain performance. Forty years ago there were large books on how
to design indexed-sequential files ... and for good reason. If your
carefully optimised layout gets abstracted away from under you, performance
can drop by orders of magnitude.
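Even the simplest case shows the gap. With fixed-length records, record N
sits at a known offset and can be fetched with a single seek; in a plain
byte-stream file you would have to scan for newlines from the beginning.
(Indexed-sequential proper was far more elaborate; the record length and
file name below are made up for illustration.)

#include <stdio.h>

#define RECLEN 80

/* fetch record n directly: one seek, no scanning */
int read_record(FILE *f, long n, char *buf)
{
    if (fseek(f, n * (long)RECLEN, SEEK_SET) != 0)
        return -1;
    return fread(buf, 1, RECLEN, f) == RECLEN ? 0 : -1;
}

int main(void)
{
    FILE *f = fopen("master.dat", "rb");
    char buf[RECLEN];

    if (f && read_record(f, 12345, buf) == 0)
        fwrite(buf, 1, RECLEN, stdout);
    if (f)
        fclose(f);
    return 0;
}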
complex set of access permissions (not just read/write/modify, or even an
access control list, but possibly password-controlled or time-of-day-limited
access as well); and there is probably also a large set of backup options.
These could make it very difficult, for example, to write a COBOL program
whose output was source for the FORTRAN compiler (and even harder to do the
reverse - COBOL, at least, could probably handle most file types).
Exactly. Or, any "unforeseen" file types...
Oh yes - your argument also has its points.
But try feeding a Unicode source text file into your copy of GCC and see
what happens.
Andy