DELUA technical manual, VAX diagnostic

Peter Coghlan cctech at beyondthepale.ie
Tue Nov 18 17:30:13 CST 2014


>
>> What is "crumbling" about the DELUA?
>
>There has been an ongoing issue with the network service for several years,
>but we haven't had the time to go after it.  (It appeared to be an issue
>with CMU/IP, and there were not enough software cycles among the 3 people
>on the museum project at that time.)  What happens is the following, taken
>from the latest console log:
>

As far as I know, CMU/IP was a valiant attempt to create a TCP/IP stack that
runs exclusively in user mode.  I've never known it to work satisfactorily for
any length of time in any sort of production environment.  Typical problems
include processes hanging or exiting for reasons that are difficult to debug
and harder to solve.

If the version of VMS and the available disk space allow, maybe you could install
Multinet from Process Software?  Even recent versions of Multinet will install
on VAX/VMS as far back as V5.0.  Other possibilities include TCPware and TCP/IP
Services.  I have little experience of either of the latter but in my opinion
they wouldn't have to do much to beat CMU/IP for reliability, performance,
usability and documentation.  And I'm sure none of them would make the
interactive response slow and painful the way CMU/IP used to either.

There is a certain argument for preserving the user experience in a museum
environment but maybe there are some experiences that are best not preserved :-)

>
>    %%%%%%%%%%%  OPCOM  17-NOV-2014 20:24:09.63  %%%%%%%%%%%
>    Message from user SYSTEM on ROSIE
>    IPACP: XE $QIO error (send),RC=0000001C
>

You can generally find out what a VMS status code is telling you by just giving
it as argument to an EXIT command at the dollar prompt.  Here's the result from
OpenVMS Alpha V8.3 - some things just don't change:

$ EXIT %X1C
%SYSTEM-F-EXQUOTA, process quota exceeded

IIRC this was a fairly standard CMU/IP failure mode.  Unfortunately, VMS has
lots of different process quotas and any one of them could be running out.
Prime suspect is pagefile quota (virtual memory).  If the process hasn't
exited, SHOW PROCESS /QUOTA /ID=<process id of IPACP process> may show what
the problem is, but my bet is that if the relevant quota is increased, it will
run out again.
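As an aside on why EXIT can make sense of a bare number like %X1C: a VMS
condition value packs several fields into one longword.  The severity sits in
bits 0-2 (0=W, 1=S, 2=E, 3=I, 4=F), the message number in bits 3-15, the
facility number in bits 16-27 (facility 0 being SYSTEM) and control bits in
28-31.  Below is a minimal Python sketch of that split.  It is nothing VMS
ships with; the name decode_condition is made up for illustration, and it
only extracts the fields.  It does not look up the message text, which VMS
does itself from its message files.

```python
# Hypothetical helper (not part of VMS): split a VMS condition value
# into its bit fields.  Severity codes 5-7 are reserved, shown as "?".
SEVERITY_LETTERS = {0: "W", 1: "S", 2: "E", 3: "I", 4: "F"}


def decode_condition(value: int) -> dict:
    return {
        "severity": SEVERITY_LETTERS.get(value & 0x7, "?"),  # bits 0-2
        "success": bool(value & 0x1),             # low bit set = success/info
        "message_number": (value >> 3) & 0x1FFF,  # bits 3-15
        "facility": (value >> 16) & 0xFFF,        # bits 16-27, 0 = SYSTEM
        "control": (value >> 28) & 0xF,           # bits 28-31
    }


if __name__ == "__main__":
    # The IPACP status from the OPCOM message
    print(decode_condition(0x1C))
```

Run on 0x1C, it reports severity F, message number 3 and facility 0, which
matches the %SYSTEM-F-EXQUOTA text that EXIT printed above.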

>
>Eventually, no connections to the VAX can be completed.  A shutdown and
>reboot (VMS = WNT) would clear it up for a few days--which made it look
>like a memory leak or something similar.
>

I remember having that sort of grief with CMU/IP on a VAX 6410 more than 20
years ago.  The problems persisted until it was possible to convince the
management to spend the money on a proper TCP/IP stack that ran in kernel
mode.  That cured it completely and made life so much easier.

>
>In the past month and a half, it's gotten more frequent; Friday evening,
>the system went south (for the Brits, west) after only 2 hours and was
>unavailable to our users, such as they are, all weekend.  This smells much
>more like a hardware failure than software, so I posted my query about the
>VAX diagnostic and the tech manual.
>

I would suspect CMU/IP before the hardware.  The increased frequency of the
problems may be due to differing conditions on your network.

If the network adaptor really is having hardware problems, it will probably
be making entries in the error log.  Use SHOW ERROR for a quick check on which
devices are clocking up errors, and ANALYZE /ERROR_LOG to format the error
log in human-readable form.  HELP ANALYZE should give hints on which command
qualifiers select the error log entries of interest.  If the error log is
already full of entries you are not interested in, you can, for example,
RENAME SYS$ERRORLOG:ERRLOG.SYS ERRLOG.OLD and the system will start a new
error log the next time it has something to log, or within a few minutes if
nothing happens.  Once the new log has been started, you can delete the
renamed one if it contains nothing of interest and it is large or disk space
is tight.

If your network adaptor is attached to a network transceiver that doesn't have
SQE test enabled, you will clock up errors similar to "collision detect carrier
check failed" every few seconds.  This is highly unlikely to represent a real
problem and can be ignored if you can put up with the irritation.  Using a
transceiver with SQE test enabled should get rid of it.

Regards,
Peter Coghlan.

