[thread forked as this line is really not what *I* am trying
to pursue... :< ]
Andy Holt wrote:
Don, you're just seeing one side of the picture.
From the PoV of the programmer just seeking the easiest way of coding a
problem on an 8-bit, byte-oriented machine, it is indeed true that an
undifferentiated sequence of bytes has considerable advantages.
Sorry if my choice of the word "bytes" was a distraction.
Call them "storage units". I dislike the term "characters"
as it tends to imply additional semantics that many files don't
follow (or, should we have "Characters" and "characters"? :> )
However, in the historical world - and even nowadays in parts of the
real world - other considerations can occur, most notably in the field
of efficiency and performance.
It is a truism throughout computing that an important skill is the selection
of the correct level of abstraction (and often of indirection) for the
problem.
For underutilised machines (e.g. personal desktops) programmer ease is
typically the primary requirement. For the heavily scheduled batch machines
of 40 years ago and for real-time applications one may need to get nearer
the metal.
Sure. But what role should the OS play in that (as pertaining to
applying/enforcing structure on *files* within the filesystems
that it maintains)? Where does the boundary between application
and OS lie? If the OS promotes these lower level structures
to a more visible interface (enforcement), then what happens
to the application when the underlying hardware is changed?
Does it suddenly need to know about some new scheme -- that may
not have been in existence when the application was originally
crafted?
E.g., I can understand a database indexed using B-trees
later converted to use a different scheme -- the DBMS
handles this. If the DBMS wants to exploit features of
the underlying hardware (e.g., tablespaces to disperse
among multiple spindles), then it can do so. Because these
things are important to *it*. I just don't think this belongs
*in* the OS (if the OS wants to keep a descriptor ACCESSIBLE
BY APPLICATIONS that allows those applications to adapt their
strategies to those aspects portrayed in that descriptor,
that makes perfect sense).
I include this sort of support in my systems. "Little things"
like cache size, MMU page size, etc. -- so my code can adapt
to changes in the underlying hardware without major rewrites.
(e.g., I process audio in page-sized chunks so I can move those
chunks quickly between processes -- if the page size changes,
it is a big efficiency hit/gain for me, so I want to know about
it and react accordingly).
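Something along these lines -- a minimal sketch; POSIX sysconf() is
just one way to discover the page size, and the audio details here
are illustrative:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        /* Ask the system for its MMU page size instead of hard-coding it. */
        long page = sysconf(_SC_PAGESIZE);
        if (page < 0) {
            perror("sysconf");
            return 1;
        }

        /* Size the audio transfer buffer in whole pages so chunks can be
           handed between processes efficiently.  If the page size changes
           on new hardware, this adapts without a rewrite. */
        size_t chunk = (size_t)page;
        short *buffer = malloc(chunk);
        if (buffer == NULL)
            return 1;

        printf("page size = %ld bytes, samples per chunk = %zu\n",
               page, chunk / sizeof *buffer);

        free(buffer);
        return 0;
    }
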
Andy Holt wrote:
Yes, I *know* this has been done other ways in the past.
What I am trying to figure out is the rationale behind
why it has (apparently) migrated into the file *name*.
That, I think, was a necessary side effect of the original Unix
design decision that "a file is a sequence of characters" without
special properties that are known to the operating system.
IMO, a file *is* an untyped string of *bytes*. The OS shouldn't
care about its representation (none of this "text mode" vs.
"binary mode" crap). Its "attributes" should solely be things
like size, creation time, ACLs, etc.
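E.g., under POSIX that descriptor is roughly what stat() hands back --
size, permission bits, timestamps, ownership -- and *nothing* about
what the bytes mean. A quick sketch:

    #include <stdio.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <time.h>

    int main(int argc, char *argv[])
    {
        struct stat st;

        if (argc < 2) {
            fprintf(stderr, "usage: %s file\n", argv[0]);
            return 1;
        }
        if (stat(argv[1], &st) != 0) {
            perror("stat");
            return 1;
        }

        /* Everything the OS will say about the file: size, permission
           bits, timestamps.  Nothing here claims "text", "UTF-16", or
           "87-byte records" -- the contents are just bytes. */
        printf("size:  %lld bytes\n", (long long)st.st_size);
        printf("mode:  %o\n", (unsigned)(st.st_mode & 07777));
        printf("mtime: %s", ctime(&st.st_mtime));
        return 0;
    }
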
An old problem of differing-size characters - typically on 36-bit-word
machines, where character sizes might be 6, 7, 8, or 9 bits - is now
reappearing with Unicode.
Yes. But, my point is, the *application* should be responsible
for handling this. I don't want the OS to arbitrarily decide:
"this is a text file represented in USASCII -- I will automatically
convert it to UTF-16 for this application...". There are far
too many cases where conversions/coercions can't be done
unambiguously. Leave those policy decisions to the application
writers (or library designers).
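If an application *wants* a conversion, it can get one in library
space -- e.g. POSIX iconv(); the encodings below are just an example,
and the failure policy stays with the caller:

    #include <iconv.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* The application picks the conversion and owns the policy for
           failures -- the OS never sees anything but bytes. */
        iconv_t cd = iconv_open("UTF-16LE", "US-ASCII");
        if (cd == (iconv_t)-1) {
            perror("iconv_open");
            return 1;
        }

        char in[] = "hello";
        char out[64];
        char *inp = in, *outp = out;
        size_t inleft = strlen(in), outleft = sizeof out;

        if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
            perror("iconv");   /* no unambiguous mapping -- caller decides */
        else
            printf("converted to %zu bytes\n", sizeof out - outleft);

        iconv_close(cd);
        return 0;
    }
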
I could argue that an application program ought to be "blind" to the
representation of characters in a simple serial text file. In Unix the
"dd" program tries, with modest success, to handle the problem. In the
'60s and '70s the saving in file space by using fixed-length records
without storing "newline" (or whatever) could be vital.
Sure. I pack eight 5-bit characters into a five-byte struct to
save bits. But *I* do that -- not the OS.
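Something like this -- the particular 5-bit layout below is just one
arbitrary choice, which is exactly the point:

    #include <stdint.h>
    #include <stdio.h>

    /* Pack eight 5-bit codes (0..31) into five bytes (40 bits).  The
       layout -- high bits first -- is the application's own convention;
       the OS never knows or cares. */
    static void pack5(const uint8_t code[8], uint8_t out[5])
    {
        uint64_t bits = 0;
        for (int i = 0; i < 8; i++)
            bits = (bits << 5) | (code[i] & 0x1F);
        for (int i = 4; i >= 0; i--) {
            out[i] = bits & 0xFF;
            bits >>= 8;
        }
    }

    int main(void)
    {
        uint8_t codes[8] = { 7, 4, 11, 11, 14, 0, 30, 31 };
        uint8_t packed[5];

        pack5(codes, packed);
        for (int i = 0; i < 5; i++)
            printf("%02X ", packed[i]);
        putchar('\n');
        return 0;
    }
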
If you want to deal with "memory" (disk) as 87-byte records,
then use setbuf(), et al., and hope the folks who wrote the
library did so FOR THAT MACHINE in such a way as to take
advantage of the fact that the underlying hardware uses
87-byte records.
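E.g., roughly -- the 87-byte record is hypothetical, and whether the
stdio implementation can really exploit the hint is entirely up to
whoever wrote it for that machine:

    #include <stdio.h>

    #define RECORD_SIZE 87   /* hypothetical hardware record length */

    int main(void)
    {
        static char buf[RECORD_SIZE];
        FILE *fp = fopen("data.rec", "wb");
        if (fp == NULL)
            return 1;

        /* Ask stdio to buffer in record-sized units (must be done before
           any other operation on the stream).  Whether this lines up with
           real 87-byte records underneath is the library's business. */
        if (setvbuf(fp, buf, _IOFBF, sizeof buf) != 0)
            fprintf(stderr, "buffering hint not honoured\n");

        fputs("one record's worth of data...\n", fp);
        fclose(fp);
        return 0;
    }
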
So, as far as the OS was concerned, files might be serial, sequential,
indexed sequential, random (and perhaps other organisations) with fixed
or variable record sizes (see the DCB card in OS/360 JCL); there may
be a
IMO, this was a mistake. It forces the OS to know too much
about the applications that run on it -- instead of being a
resource manager. I.e. it should implement mechanisms, not
policy.
There is a problem here - if you only present an abstraction of the hardware
to the programmer you have no means of using information about the
underlying hardware to gain performance. 40 years ago there were large books
You don't have to *only* present the abstraction. As I said, you
can augment that with information about the hardware in an
auxiliary structure. A structure that a program *can* choose to
examine to improve its performance -- but that it can also
choose to *ignore* withOUT peril!
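Purely as a sketch of what I mean -- the structure and field names
below are hypothetical, not any real interface:

    #include <stddef.h>
    #include <stdio.h>

    /* The sort of read-only "hardware descriptor" an OS could export
       alongside the plain byte-stream abstraction. */
    struct hw_descriptor {
        size_t page_size;        /* MMU page size in bytes           */
        size_t cache_line_size;  /* L1 data-cache line size in bytes */
        size_t fs_block_size;    /* preferred filesystem block size  */
    };

    /* An application *may* consult the descriptor to tune its I/O... */
    static size_t pick_chunk_size(const struct hw_descriptor *hw)
    {
        return hw ? hw->fs_block_size : 4096;  /* ...or ignore it, safely */
    }

    int main(void)
    {
        struct hw_descriptor hw = { 4096, 64, 8192 };
        printf("chunk = %zu\n", pick_chunk_size(&hw));
        printf("chunk = %zu (descriptor ignored)\n", pick_chunk_size(NULL));
        return 0;
    }
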
on how to design indexed-sequential files ... and for good reason. If
your carefully optimised layout gets abstracted away from under you,
performance can drop by orders of magnitude.
The technology is still applicable (and still used in RDBMSs).
But, you will note that systems that rely on/exploit these
things are tied more intimately to their underlying implementations.
E.g., a change in the OS's filesystem implementation can
render "good designs" suboptimal. If done well, the application
can be tweaked to recover some/all of that original performance.
complex set of access permissions (not just read/write/modify, or even
access control lists, but possibly password-controlled access or
time-of-day restrictions); and there probably also is a large set of
backup options as well.
These could make it very difficult, for example, to write a COBOL
program whose output was source for the FORTRAN compiler (and even
harder to do the reverse - COBOL, at least, could probably handle
most file types).
Exactly. Or, any "unforeseen" file types...
Oh yes - your argument also has its points.
But try feeding a Unicode source text file into your copy of GCC and see
what happens.
That's *exactly* my point! GCC *should* vomit. It shouldn't
expect the OS to magically convert that Unicode file into
ASCII (what does U+2836 map to? '('? ')'? '['??). Instead,
if *it* (GCC) can handle Unicode, it should. Otherwise it should
choke (gracefully).
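I.e., the tool can police its own input and bail out politely -- a
rough sketch, and certainly not how GCC actually does it:

    #include <stdio.h>

    /* Refuse input this tool does not understand, rather than expecting
       the OS to have transcoded it behind our back. */
    static int check_ascii(FILE *fp, const char *name)
    {
        long pos = 0;
        int c;

        while ((c = fgetc(fp)) != EOF) {
            if (c > 0x7F) {
                fprintf(stderr,
                        "%s: byte 0x%02X at offset %ld is not ASCII; "
                        "refusing non-ASCII input\n", name, c, pos);
                return -1;
            }
            pos++;
        }
        return 0;
    }

    int main(int argc, char *argv[])
    {
        FILE *fp = (argc > 1) ? fopen(argv[1], "rb") : NULL;
        if (fp == NULL)
            return 1;
        int rc = check_ascii(fp, argv[1]);
        fclose(fp);
        return rc ? 1 : 0;
    }
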
I wouldn't expect an aria when an arbitrary .gz file is fed
to a .wav file player! :>
(sigh)
But, none of this answers the query of "Why file type information
has migrated into the file name"... :> (persistent, eh? :>)
To me, having "something" (IMO, *above* the filesystem) that
tracks the "type" of a file -- for the benefit of users *of*
that file -- is a desirable adjunct *to* the OS.
[though, having made these arguments, I am building support
for that mechanism *within* my OS -- since it is a key service
that all apps *must* use, in my case :< ]
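FWIW, one existing way to hang a "type" on a file without stuffing it
into the name is extended attributes (e.g. Linux's <sys/xattr.h>) -- a
rough sketch, with the attribute name and value picked arbitrarily:

    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/xattr.h>   /* Linux extended attributes */

    int main(int argc, char *argv[])
    {
        if (argc < 2)
            return 1;

        /* Record the file's "type" as metadata *beside* the file, rather
           than embedding it in the file name.  "user.filetype" is an
           arbitrary name chosen for this sketch. */
        const char *type = "audio/wav";
        if (setxattr(argv[1], "user.filetype", type, strlen(type), 0) != 0)
            perror("setxattr");

        char buf[64];
        ssize_t n = getxattr(argv[1], "user.filetype", buf, sizeof buf - 1);
        if (n >= 0) {
            buf[n] = '\0';
            printf("%s is tagged as %s\n", argv[1], buf);
        }
        return 0;
    }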