DELUA technical manual, VAX diagnostic
Johnny Billquist
bqt at update.uu.se
Tue Nov 18 23:18:24 CST 2014
Lots of good tips there, Peter.
One thing, though.
I don't think that the error code from the $QIO in the OPCOM log is a
VMS exit code. But I might be wrong on that.
But that could do with some more examining.
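As a starting point for that examining, here is a minimal Python sketch that splits a 32-bit value into fields using the standard VMS condition value layout (severity in bits 0-2, message number in bits 3-15, facility in bits 16-27, control bits in 28-31). It only shows what RC=0000001C would mean if it really is a VMS status; it says nothing about whether IPACP actually reports one. The function name decode_condition is made up for illustration.

```python
# Sketch: decode a 32-bit value as a VMS condition value.
# Assumption: the value follows the standard condition value layout.
# decode_condition is a hypothetical helper name, not part of any VMS tool.

SEVERITY = {
    0: "W (warning)",
    1: "S (success)",
    2: "E (error)",
    3: "I (informational)",
    4: "F (fatal)",
}


def decode_condition(value):
    """Split a condition value into its bit fields."""
    value &= 0xFFFFFFFF
    severity = value & 0x7
    return {
        "severity": SEVERITY.get(severity, "reserved"),
        "message_number": (value >> 3) & 0x1FFF,  # bits 3-15
        "facility": (value >> 16) & 0xFFF,        # bits 16-27, 0 = SYSTEM
        "control": (value >> 28) & 0xF,           # bits 28-31
    }


if __name__ == "__main__":
    # RC=0000001C from the OPCOM message
    print(decode_condition(0x1C))
```

Read this way, 0000001C comes out as facility 0 (SYSTEM), severity F, message number 3, which is at least consistent with the %SYSTEM-F-EXQUOTA that EXIT %X1C prints.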
Things like SQE were things I was also thinking about. Checking the
actual system error logs would be an important first step.
Johnny
On 2014-11-18 15:30, Peter Coghlan wrote:
>>
>>> What is "crumbling" about the DELUA?
>>
>> There has been an ongoing issue with the network service for several years,
>> but we haven't had the time to go after it. (It appeared to be an issue
>> with CMU/IP, and there were not enough software cycles among the 3 people
>> on the museum project at that time.) What happens is the following, taken
>> from the latest console log:
>>
>
> As far as I know, CMU/IP was a valiant attempt to create a TCP/IP stack that
> runs exclusively in user mode. I've never known it to work satisfactorily for
> any length of time in any sort of production environment. Typical problems
> include processes hanging or exiting for reasons that are difficult to debug
> and harder to solve.
>
> If the version of VMS and available disk space allows, maybe you could install
> Multinet from Process Software? Even recent versions of Multinet will install
> on VAX/VMS as far back as V5.0. Other possibilities include TCPWare and TCPIP
> Services. I have little experience of either of the latter but in my opinion
> they wouldn't have to do much to beat CMU/IP for reliability, performance,
> usability and documentation. And I'm sure none of them would make the
> interactive response slow and painful the way CMU/IP used to either.
>
> There is a certain argument for preserving the user experience in a museum
> environment but maybe there are some experiences that are best not preserved :-)
>
>>
>> %%%%%%%%%%% OPCOM 17-NOV-2014 20:24:09.63 %%%%%%%%%%%
>> Message from user SYSTEM on ROSIE
>> IPACP: XE $QIO error (send),RC=0000001C
>>
>
> You can generally find out what a VMS status code is telling you by just giving
> it as an argument to an EXIT command at the dollar prompt. Here's the result from
> OpenVMS Alpha V8.3 - some things just don't change:
>
> $ EXIT %X1C
> %SYSTEM-F-EXQUOTA, process quota exceeded
>
> IIRC this was a fairly standard CMU/IP failure mode. Unfortunately, VMS has
> lots of different process quotas and any one of them could be running out.
> Prime suspect is pagefile quota (virtual memory). If the process hasn't
> exited, SHOW PROCESS /QUOTA /ID=<process id of IPACP process> may show what
> the problem is but my bet is that if the relevant quota is increased, it will
> run out again.
>
>>
>> Eventually, no connections to the VAX can be completed. A shutdown and
>> reboot (VMS = WNT) would clear it up for a few days--which made it look
>> like a memory leak or something similar.
>>
>
> I remember having that sort of grief with CMU/IP on a VAX 6410 more than 20
> years ago. The problems persisted until it was possible to convince the
> management to spend the money on a proper TCP/IP stack that ran in kernel
> mode. That cured it completely and made life so much easier.
>
>>
>> In the past month and a half, it's gotten more frequent; Friday evening,
>> the system went south (for the Brits, west) after only 2 hours and was
>> unavailable to our users, such as they are, all weekend. This smells much
>> more like a hardware failure than software, so I posted my query about the
>> VAX diagnostic and the tech manual.
>>
>
> I would suspect CMU/IP before the hardware. The increased frequency of the
> problems may be due to differing conditions on your network.
>
> If the network adaptor is really having hardware problems, it will probably
> be making entries in the error log. Use SHOW ERROR to make a quick check for
> devices which are clocking up errors and ANALYZE /ERROR_LOG to format the error
> log in human readable form. HELP ANALYZE should give hints on what command
> qualifiers to use to select the error log entries of interest. If you've
> already got all sorts of stuff in the error log that you are not interested
> in, you can RENAME SYS$ERRORLOG:ERRLOG.SYS ERRLOG.OLD for example and the
> system will start a new error log the next time it has something to log or in
> a few minutes if nothing happens. If there is nothing of interest in the
> renamed error log, you can delete it after the new log is started if it is
> large and disk space is an issue for example.
>
> If your network adaptor is attached to a network transceiver that doesn't have
> SQE test enabled, you will clock up errors similar to "collision detect carrier
> check failed" every few seconds. This is highly unlikely to represent a real
> problem and can be ignored if you can put up with the irritation. Using a
> transceiver with SQE test enabled should get rid of it.
>
> Regards,
> Peter Coghlan.
>