DELUA technical manual, VAX diagnostic

Tue Nov 18 23:18:24 CST 2014

Lots of good tips there, Peter.

One thing, though.
I don't think the error code from the $QIO in the OPCOM log is a VMS
exit code, though I might be wrong on that. It could do with some more
examination.
SQE was something I was also thinking about. Checking the actual
system error logs would be an important first step.
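
One way to examine the RC value is to split it into the fields of a
32-bit VMS condition value (severity in bits 0-2, message number in
bits 3-15, facility number in bits 16-27) and see whether the result
looks sensible. This is only a rough sketch: it assumes IPACP really
is reporting a VMS status in RC, which is exactly the open question,
and the helper names extract_rc and decode_condition are made up for
illustration.

```python
import re

# Severity codes held in the low three bits of a VMS condition value.
SEVERITY = {0: "W", 1: "S", 2: "E", 3: "I", 4: "F"}


def extract_rc(log_line):
    """Return the hex value after 'RC=' as an int, or None if absent."""
    match = re.search(r"RC=([0-9A-Fa-f]{8})", log_line)
    return int(match.group(1), 16) if match else None


def decode_condition(value):
    """Split a 32-bit value into VMS condition value fields.

    Assumption: the value is a VMS condition value at all.
    """
    return {
        "severity": SEVERITY.get(value & 0x7, "?"),
        "message": (value >> 3) & 0x1FFF,
        "facility": (value >> 16) & 0xFFF,
    }


rc = extract_rc("IPACP: XE $QIO error (send),RC=0000001C")
print(decode_condition(rc))
```

For 0000001C this gives severity F, message number 3 and facility 0,
which is at least consistent with the %SYSTEM-F- prefix in the EXIT
output quoted below. It does not prove that IPACP puts a VMS status
in that field.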

	Johnny

On 2014-11-18 15:30, Peter Coghlan wrote:
>>
>>> What is "crumbling" about the DELUA?
>>
>> There has been an ongoing issue with the network service for several years,
>> but we haven't had the time to go after it.  (It appeared to be an issue
>> with CMU/IP, and there were not enough software cycles among the 3 people
>> on the museum project at that time.)  What happens is the following, taken
>> from the latest console log:
>>
>
> As far as I know, CMU/IP was a valiant attempt to create a TCP/IP stack that
> runs exclusively in user mode.  I've never known it to work satisfactorily for
> any length of time in any sort of production environment.  Typical problems
> include processes hanging or exiting for reasons that are difficult to debug
> and harder to solve.
>
> If the version of VMS and available disk space allows, maybe you could install
> Multinet from Process Software?  Even recent versions of Multinet will install
> on VAX/VMS as far back as V5.0.  Other possibilities include TCPWare and TCPIP
> Services.  I have little experience of either of the latter but in my opinion
> they wouldn't have to do much to beat CMU/IP for reliability, performance,
> usability and documentation.  And I'm sure none of them would make the
> interactive response slow and painful the way CMU/IP used to either.
>
> There is a certain argument for preserving the user experience in a museum
> environment but maybe there are some experiences that are best not preserved :-)
>
>>
>>     %%%%%%%%%%%  OPCOM  17-NOV-2014 20:24:09.63  %%%%%%%%%%%
>>     Message from user SYSTEM on ROSIE
>>     IPACP: XE $QIO error (send),RC=0000001C
>>
>
> You can generally find out what a VMS status code is telling you by just giving
> it as the argument to an EXIT command at the dollar prompt.  Here's the result from
> OpenVMS Alpha V8.3 - some things just don't change:
>
> $ EXIT %X1C
> %SYSTEM-F-EXQUOTA, process quota exceeded
>
> IIRC this was a fairly standard CMU/IP failure mode.  Unfortunately, VMS has
> lots of different process quotas and any one of them could be running out.
> Prime suspect is pagefile quota (virtual memory).  If the process hasn't
> exited, SHOW PROCESS /QUOTA /ID=<process id of IPACP process> may show what
> the problem is but my bet is that if the relevant quota is increased, it will
> run out again.
>
>>
>> Eventually, no connections to the VAX can be completed.  A shutdown and
>> reboot (VMS = WNT) would clear it up for a few days--which made it look
>> like a memory leak or something similar.
>>
>
> I remember having that sort of grief with CMU/IP on a VAX 6410 more than 20
> years ago.  The problems persisted until it was possible to convince the
> management to spend the money on a proper TCP/IP stack that ran in kernel
> mode.  That cured it completely and made life so much easier.
>
>>
>> In the past month and a half, it's gotten more frequent; Friday evening,
>> the system went south (for the Brits, west) after only 2 hours and was
>> unavailable to our users, such as they are, all weekend.  This smells much
>> more like a hardware failure than software, so I posted my query about the
>> VAX diagnostic and the tech manual.
>>
>
> I would suspect CMU/IP before the hardware.  The increased frequency of the
> problems may be due to differing conditions on your network.
>
> If the network adaptor is really having hardware problems, it will probably
> be making entries in the error log.  Use SHOW ERROR to make a quick check for
> devices which are clocking up errors and ANALYZE /ERROR_LOG to format the error
> log in human readable form.  HELP ANALYZE should give hints on what command
> qualifiers to use to select the error log entries of interest.  If you've
> already got all sorts of stuff in the error log that you are not interested
> in, you can RENAME SYS$ERRORLOG:ERRLOG.SYS ERRLOG.OLD for example and the
> system will start a new error log the next time it has something to log or in
> a few minutes if nothing happens.  If there is nothing of interest in the
> renamed error log and it is large enough for disk space to be an issue, you
> can delete it once the new log has been started.
>
> If your network adaptor is attached to a network transceiver that doesn't have
> SQE test enabled, you will clock up errors similar to "collision detect carrier
> check failed" every few seconds.  This is highly unlikely to represent a real
> problem and can be ignored if you can put up with the irritation.  Using a
> transceiver with SQE test enabled should get rid of it.
>
> Regards,
> Peter Coghlan.
>