[thread forked as this line is really not what *I* am trying
to pursue... :< ]
Andy Holt wrote:
Don, you're just seeing one side of the picture.
From the PoV of the programmer just seeking the easiest way of coding a
problem on an 8-bit, byte-oriented machine, it is indeed true that an
undifferentiated sequence of bytes has considerable advantages.
Sorry if my choice of the word "bytes" was a distraction.
Call them "storage units". I dislike the term "characters"
as it tends to imply additional semantics that many files don't
follow (or, should we have "Characters" and "characters"? :> )
However, in the historical world - and even nowadays in parts of the
real world - other considerations can occur, most notably in the field
of efficiency and performance.
It is a truism throughout computing that an important skill is the selection
of the correct level of abstraction (and often of indirection) for the
problem.
For underutilised machines (e.g. personal desktops) programmer ease is
typically the primary requirement. For the heavily scheduled batch machines
of 40 years ago and for real-time applications one may need to get nearer
the metal.
Sure. But what role should the OS play in that (as pertaining to
applying/enforcing structure on *files* within the filesystems
that it maintains)? Where does the boundary between application
and OS lie? If the OS promotes these lower level structures
to a more visible interface (enforcement), then what happens
to the application when the underlying hardware is changed?
Does it suddenly need to know about some new scheme -- that may
not have been in existence when the application was originally
crafted?
E.g., I can understand a database indexed using B-trees
later converted to use a different scheme -- the DBMS
handles this. If the DBMS wants to exploit features of
the underlying hardware (e.g., tablespaces to disperse
among multiple spindles), then it can do so. Because these
things are important to *it*. I just don't think this belongs
*in* the OS (if the OS wants to keep a descriptor ACCESSIBLE
BY APPLICATIONS that allows those applications to adapt their
strategies to those aspects portrayed in that descriptor,
that makes perfect sense).
I include this sort of support in my systems. "Little things"
like cache size, MMU page size, etc. -- so my code can adapt
to changes in the underlying hardware without major rewrites.
(e.g., I process audio in page-sized chunks so I can move those
chunks quickly between processes -- if the page size changes,
it is a big efficiency hit/gain for me, so I want to know about
it and react accordingly).
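Something along these lines -- a minimal sketch; POSIX sysconf() is
just one way to discover the page size, and the audio details here
are illustrative:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        /* Ask the system for its MMU page size instead of hard-coding it. */
        long page = sysconf(_SC_PAGESIZE);
        if (page < 0) {
            perror("sysconf");
            return 1;
        }

        /* Size the audio transfer buffer in whole pages so chunks can be
           handed between processes efficiently.  If the page size changes
           on new hardware, this adapts without a rewrite. */
        size_t chunk = (size_t)page;
        short *buffer = malloc(chunk);
        if (buffer == NULL)
            return 1;

        printf("page size = %ld bytes, samples per chunk = %zu\n",
               page, chunk / sizeof *buffer);

        free(buffer);
        return 0;
    }
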
Andy Holt wrote:
Yes, I *know* this has been done other ways in the past.
What I am trying to figure out is the rationale behind
why it has (apparently) migrated into the file *name*.
That, I think, was a necessary side effect of the original Unix
design decision that "a file is a sequence of characters" without
special properties that are known to the operating system.
IMO, a file *is* an untyped string of *bytes*. The OS shouldn't
care about its representation (none of this "text mode" vs.
"binary mode" crap). Its "attributes" should solely be things
like size, creation time, ACLs, etc.
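E.g., under POSIX that descriptor is roughly what stat() hands back --
size, permission bits, timestamps, ownership -- and *nothing* about
what the bytes mean. A quick sketch:

    #include <stdio.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <time.h>

    int main(int argc, char *argv[])
    {
        struct stat st;

        if (argc < 2) {
            fprintf(stderr, "usage: %s file\n", argv[0]);
            return 1;
        }
        if (stat(argv[1], &st) != 0) {
            perror("stat");
            return 1;
        }

        /* Everything the OS will say about the file: size, permission
           bits, timestamps.  Nothing here claims "text", "UTF-16", or
           "87-byte records" -- the contents are just bytes. */
        printf("size:  %lld bytes\n", (long long)st.st_size);
        printf("mode:  %o\n", (unsigned)(st.st_mode & 07777));
        printf("mtime: %s", ctime(&st.st_mtime));
        return 0;
    }
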
An old problem of differing-size characters - typically on 36-bit-word
machines, where character sizes might be 6, 7, 8, or 9 bits - is now
reappearing with Unicode.
Yes. But, my point is, the *application* should be responsible
for handling this. I don't want the OS to arbitrarily decide:
"this is a text file represented in USASCII -- I will automatically
convert it to UTF-16 for this application...". There are far
too many cases where conversions/coercions can't be done
unambiguously. Leave those policy decisions to the application
writers (or library designers).
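If an application *wants* a conversion, it can get one in library
space -- e.g. POSIX iconv(); the encodings below are just an example,
and the failure policy stays with the caller:

    #include <iconv.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* The application picks the conversion and owns the policy for
           failures -- the OS never sees anything but bytes. */
        iconv_t cd = iconv_open("UTF-16LE", "US-ASCII");
        if (cd == (iconv_t)-1) {
            perror("iconv_open");
            return 1;
        }

        char in[] = "hello";
        char out[64];
        char *inp = in, *outp = out;
        size_t inleft = strlen(in), outleft = sizeof out;

        if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
            perror("iconv");   /* no unambiguous mapping -- caller decides */
        else
            printf("converted to %zu bytes\n", sizeof out - outleft);

        iconv_close(cd);
        return 0;
    }
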
I could argue that an application program ought to be "blind" to the
representation of characters in a simple serial text file. In Unix the
"dd" program tries, with modest success, to handle the problem. In the
'60s and '70s the saving in file space by using fixed-length records
without storing "newline" (or whatever) could be vital.
Sure. I pack eight 5-bit characters into a five-byte struct to
save bits. But *I* do that -- not the OS.
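Something like this -- the particular 5-bit layout below is just one
arbitrary choice, which is exactly the point:

    #include <stdint.h>
    #include <stdio.h>

    /* Pack eight 5-bit codes (0..31) into five bytes (40 bits).  The
       layout -- high bits first -- is the application's own convention;
       the OS never knows or cares. */
    static void pack5(const uint8_t code[8], uint8_t out[5])
    {
        uint64_t bits = 0;
        for (int i = 0; i < 8; i++)
            bits = (bits << 5) | (code[i] & 0x1F);
        for (int i = 4; i >= 0; i--) {
            out[i] = bits & 0xFF;
            bits >>= 8;
        }
    }

    int main(void)
    {
        uint8_t codes[8] = { 7, 4, 11, 11, 14, 0, 30, 31 };
        uint8_t packed[5];

        pack5(codes, packed);
        for (int i = 0; i < 5; i++)
            printf("%02X ", packed[i]);
        putchar('\n');
        return 0;
    }
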
If you want to deal with "memory" (disk) as 87-byte records,
then use setbuf(), et al., and hope the folks who wrote the
library did so FOR THAT MACHINE in such a way as to take
advantage of the fact that the underlying hardware uses
87-byte records.
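E.g., roughly -- the 87-byte record is hypothetical, and whether the
stdio implementation can really exploit the hint is entirely up to
whoever wrote it for that machine:

    #include <stdio.h>

    #define RECORD_SIZE 87   /* hypothetical hardware record length */

    int main(void)
    {
        static char buf[RECORD_SIZE];
        FILE *fp = fopen("data.rec", "wb");
        if (fp == NULL)
            return 1;

        /* Ask stdio to buffer in record-sized units (must be done before
           any other operation on the stream).  Whether this lines up with
           real 87-byte records underneath is the library's business. */
        if (setvbuf(fp, buf, _IOFBF, sizeof buf) != 0)
            fprintf(stderr, "buffering hint not honoured\n");

        fputs("one record's worth of data...\n", fp);
        fclose(fp);
        return 0;
    }
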
So, as far as the OS was concerned, files might be serial, sequential,
indexed sequential, random (and perhaps other organisations) with fixed
or variable record sizes (see the DCB card in OS/360 JCL); there may
be a
IMO, this was a mistake. It forces the OS to know too much
about the applications that run on it -- instead of being a
resource manager. I.e. it should implement mechanisms, not
policy.
There is a problem here - if you only present an abstraction of the hardware
to the programmer you have no means of using information about the
underlying hardware to gain performance. 40 years ago there were large books
You don't have to *only* present the abstraction. As I said, you
can augment that with information about the hardware in an
auxiliary structure. A structure that a program *can* choose to
examine to improve its performance -- but that it can also
choose to *ignore* withOUT peril!
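Purely as a sketch of what I mean -- the structure and field names
below are hypothetical, not any real interface:

    #include <stddef.h>
    #include <stdio.h>

    /* The sort of read-only "hardware descriptor" an OS could export
       alongside the plain byte-stream abstraction. */
    struct hw_descriptor {
        size_t page_size;        /* MMU page size in bytes           */
        size_t cache_line_size;  /* L1 data-cache line size in bytes */
        size_t fs_block_size;    /* preferred filesystem block size  */
    };

    /* An application *may* consult the descriptor to tune its I/O... */
    static size_t pick_chunk_size(const struct hw_descriptor *hw)
    {
        return hw ? hw->fs_block_size : 4096;  /* ...or ignore it, safely */
    }

    int main(void)
    {
        struct hw_descriptor hw = { 4096, 64, 8192 };
        printf("chunk = %zu\n", pick_chunk_size(&hw));
        printf("chunk = %zu (descriptor ignored)\n", pick_chunk_size(NULL));
        return 0;
    }
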
on how to design indexed-sequential files ... and for good reason. If
your carefully optimised layout gets abstracted away from under you,
performance can drop by orders of magnitude.
The technology is still applicable (and still used in RDBMSs).
But, you will note that systems that rely on/exploit these
things are tied more intimately to their underlying implementations.
E.g., a change in the OS's filesystem implementation can
render "good designs" suboptimal. If done well, the application
can be tweaked to recover some/all of that original performance.
complex set of access permissions (not just read/write/modify, or even
access control lists, but possibly password-controlled access or
time-of-day restrictions); and there probably also is a large set of
backup options as well.
These could make it very difficult, for example, to write a COBOL
program whose output was source for the FORTRAN compiler (and even
harder to do the reverse - COBOL, at least, could probably handle
most file types).
Exactly. Or, any "unforeseen" file types...
Oh yes - your argument also has its points.
But try feeding a Unicode source text file into your copy of GCC and see
what happens.
That's *exactly* my point! GCC *should* vomit. It shouldn't
expect the OS to magically convert that Unicode file into
ASCII (what does U+2836 map to? '('? ')'? '['??). Instead,
if *it* (GCC) can handle Unicode, it should. Otherwise it should
choke (gracefully).
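I.e., the tool can police its own input and bail out politely -- a
rough sketch, and certainly not how GCC actually does it:

    #include <stdio.h>

    /* Refuse input this tool does not understand, rather than expecting
       the OS to have transcoded it behind our back. */
    static int check_ascii(FILE *fp, const char *name)
    {
        long pos = 0;
        int c;

        while ((c = fgetc(fp)) != EOF) {
            if (c > 0x7F) {
                fprintf(stderr,
                        "%s: byte 0x%02X at offset %ld is not ASCII; "
                        "refusing non-ASCII input\n", name, c, pos);
                return -1;
            }
            pos++;
        }
        return 0;
    }

    int main(int argc, char *argv[])
    {
        FILE *fp = (argc > 1) ? fopen(argv[1], "rb") : NULL;
        if (fp == NULL)
            return 1;
        int rc = check_ascii(fp, argv[1]);
        fclose(fp);
        return rc ? 1 : 0;
    }
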
I wouldn't expect an aria when an arbitrary .gz file is fed
to a .wav file player! :>
(sigh)
But, none of this answers the query of "Why file type information
has migrated into the file name"... :> (persistent, eh? :>)
To me, having "something" (IMO, *above* the filesystem) that
tracks the "type" of a file -- for the benefit of users *of*
that file -- is a desirable adjunct *to* the OS.
[though, having made these arguments, I am building support
for that mechanism *within* my OS -- since it is a key service
that all apps *must* use, in my case :< ]
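FWIW, one existing way to hang a "type" on a file without stuffing it
into the name is extended attributes (e.g. Linux's <sys/xattr.h>) -- a
rough sketch, with the attribute name and value picked arbitrarily:

    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/xattr.h>   /* Linux extended attributes */

    int main(int argc, char *argv[])
    {
        if (argc < 2)
            return 1;

        /* Record the file's "type" as metadata *beside* the file, rather
           than embedding it in the file name.  "user.filetype" is an
           arbitrary name chosen for this sketch. */
        const char *type = "audio/wav";
        if (setxattr(argv[1], "user.filetype", type, strlen(type), 0) != 0)
            perror("setxattr");

        char buf[64];
        ssize_t n = getxattr(argv[1], "user.filetype", buf, sizeof buf - 1);
        if (n >= 0) {
            buf[n] = '\0';
            printf("%s is tagged as %s\n", argv[1], buf);
        }
        return 0;
    }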