[packman] PMBS down

Stefan Botter jsj at jsj.dyndns.org
Sat Aug 15 11:36:49 CEST 2020


Hi Manfred, hi Packmans,

On Saturday, 15 August 2020, 09:33:36 CEST, Manfred Hollstein wrote:
> I don't know if this is caused by a planned downtime, but
> pmbs.links2linux.org cannot be reached at the moment.
> 
> Can you please take a look?

That was not really planned, but ...

I had the long-standing issue with the defective network adapter in one 
of my hosts, which caused the frequent downtimes several months ago. 
Some of those were caused by the iSCSI connection to my storage devices, 
which are now connected via NFSv4.1 and have been rock-solid so far.
For some time now I have had several issues that come with the 
territory of running uncertified hardware (Dell gen11 servers with 
VMware 6.7U3, certified only up to 6.0; Intel PRO/1000 ET, certified and 
supported until VMware 6.7U1). The problem surfaced when putting heavy 
load on the network interfaces of the Intel card, for instance during a 
live migration from one host to the other. The interface simply stopped 
and could only be revived by unplugging and replugging the cable, or by 
shutting down and re-enabling the port on the switch. This is a known 
problem with the card.

I bought 2 HP NC375T quad-port cards and replaced the Intel ET last 
Wednesday. It looked good at first glance, but since Wednesday 
everything has been slow.

I mean: S-L-O-W.

I thought it might be due to over-committing the buildwk[1-4] workers 
with 12 vCPUs on 8-core physical CPUs (with HT), and to slow memory 
assignment in the NUMA architecture of the hosts (CPUs have "private" 
RAM, and can access RAM from different CPUs more slowly).
To check this I wanted to shut down all VMs, reboot the hosts and boot 
everything up. That should not have taken more than 15 minutes.
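
The over-commit theory can be sanity-checked with a little arithmetic. 
The following is only a minimal sketch: the function name is made up for 
illustration, and the placement of two 12-vCPU workers on one host with 
8 physical cores is an assumption, not the actual layout of the PMBS 
hosts.

```python
def vcpu_overcommit(vcpus_per_vm, vms_on_host, physical_cores, threads_per_core=2):
    """Return vCPU over-commit ratios for one host (hypothetical helper)."""
    total_vcpus = vcpus_per_vm * vms_on_host
    # With Hyper-Threading each physical core exposes two logical CPUs.
    logical_cpus = physical_cores * threads_per_core
    return {
        "total_vcpus": total_vcpus,
        "logical_cpus": logical_cpus,
        "per_core": total_vcpus / physical_cores,
        "per_thread": total_vcpus / logical_cpus,
    }


# Assumed placement: two 12-vCPU build workers on a host with 8 cores.
ratios = vcpu_overcommit(vcpus_per_vm=12, vms_on_host=2, physical_cores=8)
print(ratios)
```

A ratio above 1.0 per logical CPU means the vCPUs have to compete for 
scheduling time once all workers are busy. It says nothing about the 
NUMA memory placement, which would have to be checked separately.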

Unfortunately I found out that there is still a problem with the 
additional network adapters. During boot the cards cannot be trained, 
and the system stops there. A warm reboot can fix this.

There might still be a problem with the network cards, or a problem with 
the PCIe riser - the machines are about 8.5 years old.

After booting up the first machine I could not migrate the VMs.
I reconfigured the VMotion interfaces to use the on-board network cards 
and could then migrate.
I have now reconfigured both hosts to rely only on the 4 on-board 
cards. I have only 3 cards connected to the switch, so there might be 
some congestion until Monday, when I can plug in the 4th card and 
reconfigure.

Meanwhile PMBS is up again, and I will add the workers buildwk3 and -4 
in the next few minutes and check how the system behaves.

Additionally I ordered another 2 quad-port cards, now Broadcom, and will 
try with them sometime next week, or whenever they arrive.


Sorry for the unexpected downtime, it was indeed 3 hours.
Now enough with the chit-chat, go back to work :)


Greetings,

Stefan
-- 
Stefan Botter zu Hause
Bremen