[packman] PMBS down
jsj at jsj.dyndns.org
Sat Aug 15 11:36:49 CEST 2020
Hi Manfred, hi Packmans,
On Saturday, 15 August 2020, 09:33:36 CEST, Manfred Hollstein wrote:
> I don't know if this is caused by a planned downtime, but
> pmbs.links2linux.org cannot be reached at the moment.
> Can you please take a look?
That was not really planned, but ...
I had the long-standing issue with the defective network adapter in one
of my hosts, which caused the frequent downtimes several months ago.
Some of it was caused by the iSCSI connection to my storage devices,
which are now connected via NFSv4.1 - and rock-solid so far.
For some time now I have had several issues, which came with the territory of
running uncertified hardware (Dell gen11 Server and VMware 6.7U3 -
certified up to 6.0, Intel PRO/1000 ET, certified and supported until
VMware 6.7U1). The problem surfaced while putting heavy load on the
network interfaces on the Intel card, for instance a live-migration from
one host to the other. The interface simply stopped, and could only be
revived by unplugging/plugging the cable or shutting down and re-
enabling the port on the switch. Known problem with the card.
I bought 2 HP NC375T Quad-port cards and replaced the Intel ET last
Wednesday. It looked good at first glance, but since Wednesday
everything was slow.
I mean: S-L-O-W.
I thought it might be due to over-committing the buildwk[1-4] workers
with 12 vCPU on 8-core physical CPUs (with HT), and slow memory
assignment in the NUMA architecture of the hosts (CPUs have "private"
RAM, and can access RAM from different CPUs more slowly).
To check this I wanted to shut down all VMs, reboot the hosts and boot
everything up. Should not take more than 15 minutes.
Unfortunately I found out that there is still a problem with the
additional network adapters. During boot the cards cannot be trained,
and the system stops there. A warm reboot can fix this.
There might still be a problem with the network cards, or a problem with
the PCIe riser - the machines are about 8.5 years old.
After booting up the first machine I could not migrate the VMs.
I reconfigured the vMotion interfaces to the on-board network cards
and could then migrate.
I have reconfigured both hosts now to just rely on the 4 on-board
cards. I have only 3 cards connected to the switch, so there might be
some congestion until Monday, when I can plug in the 4th card and
Meanwhile PMBS is up again, and I will add the workers buildwk3 and -4
in the next few minutes and check how the system behaves.
Additionally I ordered another 2 quad-port cards, now Broadcom, and will
try with them sometime next week, or whenever they arrive.
Sorry for the unexpected downtime, it was indeed 3 hours.
Now enough with the chit-chat, go back to work :)
Stefan Botter zu Hause