Lots of good tips there, Peter.
One thing, though.
I don't think that the error code from the $QIO in the OPCOM log is a
VMS exit code. I might be wrong on that, though, so it could do with
some more examination.
Things like SQE were things I was also thinking about. Checking the
actual system error logs would be an important first step.
Johnny
On 2014-11-18 15:30, Peter Coghlan wrote:
What is "crumbling" about the DELUA?
There has been an ongoing issue with the network service for several years,
but we haven't had the time to go after it. (It appeared to be an issue
with CMU/IP, and there were not enough software cycles among the 3 people
on the museum project at that time.) What happens is the following, taken
from the latest console log:
As far as I know, CMU/IP was a valiant attempt to create a TCP/IP stack that
runs exclusively in user mode. I've never known it to work satisfactorily for
any length of time in any sort of production environment. Typical problems
include processes hanging or exiting for reasons that are difficult to debug
and harder to solve.
If the version of VMS and available disk space allows, maybe you could install
Multinet from Process Software? Even recent versions of Multinet will install
on VAX/VMS as far back as V5.0. Other possibilities include TCPware and TCP/IP
Services. I have little experience of either of the latter but in my opinion
they wouldn't have to do much to beat CMU/IP for reliability, performance,
usability and documentation. And I'm sure none of them would make the
interactive response slow and painful the way CMU/IP used to either.
There is a certain argument for preserving the user experience in a museum
environment but maybe there are some experiences that are best not preserved :-)
%%%%%%%%%%% OPCOM 17-NOV-2014 20:24:09.63 %%%%%%%%%%%
Message from user SYSTEM on ROSIE
IPACP: XE $QIO error (send),RC=0000001C
You can generally find out what a VMS status code is telling you by just giving
it as argument to an EXIT command at the dollar prompt. Here's the result from
OpenVMS Alpha V8.3 - some things just don't change:
$ EXIT %X1C
%SYSTEM-F-EXQUOTA, process quota exceeded
IIRC this was a fairly standard CMU/IP failure mode. Unfortunately, VMS has
lots of different process quotas and any one of them could be running out.
Prime suspect is pagefile quota (virtual memory). If the process hasn't
exited, SHOW PROCESS /QUOTA /ID=<process id of IPACP process> may show what
the problem is but my bet is that if the relevant quota is increased, it will
run out again.
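
For anyone without a VMS prompt to hand, the RC value can also be picked apart
by hand. The sketch below assumes the value really is a standard VMS condition
value (which may not hold for what IPACP logs) and splits it using the usual
layout: severity in bits 0-2, message number in bits 3-15 and facility number
in bits 16-27. The function name and the returned dictionary are made up for
illustration.

```python
# Severity letters as they appear in VMS messages (%FACILITY-x-IDENT).
SEVERITY_LETTERS = {0: "W", 1: "S", 2: "E", 3: "I", 4: "F"}


def decode_condition(value):
    """Split a 32-bit VMS condition value into its main fields.

    Hypothetical helper: assumes the standard condition value layout.
    """
    severity = value & 0x7            # bits 0-2
    message = (value >> 3) & 0x1FFF   # bits 3-15
    facility = (value >> 16) & 0xFFF  # bits 16-27
    return {
        "severity": SEVERITY_LETTERS.get(severity, "?"),
        "message_number": message,
        "facility_number": facility,
    }


if __name__ == "__main__":
    # RC=0000001C from the OPCOM message quoted above
    print(decode_condition(0x1C))
```

For 0x1C this gives severity F, message number 3 and facility number 0 (the
system facility), which is consistent with the %SYSTEM-F- prefix that EXIT
reported.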
Eventually, no connections to the VAX can be completed. A shutdown and
reboot (VMS = WNT) would clear it up for a few days--which made it look
like a memory leak or something similar.
I remember having that sort of grief with CMU/IP on a VAX 6410 more than 20
years ago. The problems persisted until it was possible to convince the
management to spend the money on a proper TCP/IP stack that ran in kernel
mode. That cured it completely and made life so much easier.
In the past month and a half, it's gotten more frequent; Friday evening,
the system went south (for the Brits, west) after only 2 hours and was
unavailable to our users, such as they are, all weekend. This smells much
more like a hardware failure than software, so I posted my query about the
VAX diagnostic and the tech manual.
I would suspect CMU/IP before the hardware. The increased frequency of the
problems may be due to differing conditions on your network.
If the network adaptor is really having hardware problems, it will probably
be making entries in the error log. Use SHOW ERROR to make a quick check for
devices which are clocking up errors and ANALYZE /ERROR_LOG to format the error
log in human readable form. HELP ANALYZE should give hints on what command
qualifiers to use to select the error log entries of interest. If you've
already got all sorts of stuff in the error log that you are not interested
in, you can RENAME SYS$ERRORLOG:ERRLOG.SYS ERRLOG.OLD for example and the
system will start a new error log the next time it has something to log or in
a few minutes if nothing happens. If there is nothing of interest in the
renamed error log, you can delete it after the new log is started if it is
large and disk space is an issue for example.
If your network adaptor is attached to a network transceiver that doesn't have
SQE test enabled, you will clock up errors similar to "collision detect carrier
check failed" every few seconds. This is highly unlikely to represent a real
problem and can be ignored if you can put up with the irritation. Using a
transceiver with SQE test enabled should get rid of it.
Regards,
Peter Coghlan.