08-16-2020 01:06 PM
I am having issues with an LSI 9201-16e SAS HBA card (PCIe 2.0). I've tried the following:
- Upgrading firmware and BIOS on the card
- Changing PCI slot
- Disabling MSI and MSI-X interrupts in my OS (FreeBSD 12.1)
The issue is that the card appears to become unresponsive on moderate disk load. It typically lasts for some time without load, but when moderate load is applied I have observed it become unresponsive in as little as an hour. This results in a reboot (panic) because the OS is unable to reset the controller. The card is well cooled and never hot to the touch.
I wanted to check whether there might be some kind of compatibility issue between the card and this HP system board.
mps0: <Avago Technologies (LSI) SAS2116> port 0x2000-0x20ff mem 0xf8040000-0xf8043fff,0xf8000000-0xf803ffff irq 42 at device 0.0 numa-domain 1 on pci11
mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
[...]
(da8:mps0:0:8:0): WRITE(10). CDB: 2a 00 24 42 79 b0 00 00 08 00
(da8:mps0:0:8:0): CAM status: CAM subsystem is busy
(da8:mps0:0:8:0): Error 5, Retries exhausted
mps0: mpssas_action_scsiio: Freezing devq for target ID 10
mps0: (da10:mps0:0:10:0): WRITE(10). CDB: 2a 00 24 42 79 b0 00 00 08 00
mpssas_action_scsiio: Freezing devq for target ID 8
(da10:mps0:0:10:0): CAM status: CAM subsystem is busy
(da8:mps0:0:8:0): WRITE(10). CDB: 2a 00 24 42 82 58 00 00 08 00
(da10:mps0:0:10:0): Error 5, Retries exhausted
(da8:mps0:0:8:0): CAM status: CAM subsystem is busy
mps0: mpssas_action_scsiio: Freezing devq for target ID 9
(da8:mps0:0:8:0): Retrying command, 3 more tries remain
(da9:mps0:0:9:0): WRITE(10). CDB: 2a 00 24 42 81 88 00 00 08 00
mps0: (da9:mps0:0:9:0): CAM status: CAM subsystem is busy
mpssas_action_scsiio: Freezing devq for target ID 10
(da9:mps0:0:9:0): Error 5, Retries exhausted
(da10:mps0:0:10:0): WRITE(10). CDB: 2a 00 24 42 82 58 00 00 08 00
(xpt0:mps0:0:9:0): SMID 1 task mgmt 0xfffffe000451e158 timed out
(da10:mps0:0:10:0): CAM status: CAM subsystem is busy
mps0: (da10:mps0:0:10:0): Retrying command, 3 more tries remain
Reinitializing controller
panic: mps_reinit hard reset failed with error 60
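For anyone reading the trace, the CDB bytes on the WRITE(10) lines can be decoded to see where on the disk each failed write was aimed and how large it was. The following is a minimal sketch, assuming the standard 10-byte READ(10)/WRITE(10) layout (opcode in byte 0, big-endian LBA in bytes 2-5, big-endian transfer length in bytes 7-8). The function name decode_rw10 is made up for illustration and is not part of any FreeBSD tool.

```python
def decode_rw10(cdb_hex: str) -> dict:
    """Decode a 10-byte READ(10)/WRITE(10) CDB given as space-separated hex."""
    cdb = bytes.fromhex(cdb_hex)  # fromhex skips the spaces between byte pairs
    if len(cdb) != 10:
        raise ValueError("expected a 10-byte CDB")

    opcodes = {0x28: "READ(10)", 0x2A: "WRITE(10)"}
    if cdb[0] not in opcodes:
        raise ValueError(f"unsupported opcode 0x{cdb[0]:02x}")

    return {
        "opcode": opcodes[cdb[0]],
        # Bytes 2..5: logical block address, most significant byte first.
        "lba": int.from_bytes(cdb[2:6], "big"),
        # Bytes 7..8: number of logical blocks to transfer.
        "blocks": int.from_bytes(cdb[7:9], "big"),
    }


if __name__ == "__main__":
    info = decode_rw10("2a 00 24 42 79 b0 00 00 08 00")
    print(info["opcode"], hex(info["lba"]), info["blocks"])
```

Every CDB in the trace has a transfer length of 8 blocks, which would be 4 KiB per write if the drives report 512-byte logical sectors (an assumption, since the sector size is not shown in the output). These are small writes, so the errors do not look tied to unusually large requests.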
08-16-2020 11:39 PM - edited 08-16-2020 11:51 PM
The SAS card that HP officially supported for this workstation was the 8888ELP.
These cards are quite cheap on eBay, so you might want to simply swap out the card and see what happens.
With that said, the 9201 is a "generic" LSI single-chip "ROC" card that came in many different variants, 9211/9212 being the most common. All of the low-end 92xx cards use the same LSI chip; they lack a dedicated XOR chip for RAID 5 calculations and have no onboard cache, so they are unsuitable for anything but light-duty RAID 0/1.
If you need RAID 5 and/or a card with sustained I/O and a large queue depth, the 9201 is not a card to consider.
Are you using the card for RAID or just as more SAS/SATA ports? I.e., which firmware is on the card, IR or IT?
Update: forgot to mention, check/replace the cabling from the card to the drive(s) and also try a different port on the card.
I've seen bad cables and borked (failing/defective) ports on LSI cards.
08-17-2020 10:21 AM
I don't actually know much about storage, but I would be surprised if this load itself is becoming a problem for the card. I called it "moderate", but the 3 disks which have almost all of that load are just 2.5" 5400 rpm 500 GB consumer-grade laptop drives. I haven't observed them closely, so I don't know if there has been a problem with those drives handling it, such as whether their queues fill up (I guess that could be the origin of the CAM messages in my trace output, though not the origin of the reinitialisation failure). On the other hand, I think this crash also happens far below peak load. In any case I assumed that those drives would be a bottleneck before anything became a problem for the card.
I will note that the card does have 16 total disks connected. I think the card is in IT mode, since it is operating as a JBOD, presenting each drive directly as a 'da' device to the OS. The only option I am familiar with on the BIOS is the 'Boot mode' which can be either PC only, BIOS only, PC/BIOS, Disabled; I have tried it on a few of those with the same crash.
I will take a look for a suitable 8888ELP card and see about trying to put drives on that instead, starting with the loaded ones.
I will also try shuffling the cables around.
There is also the question of why the card is not responding to the reinitialisation command. I guess the card itself could be crashing, if for example I am wrong about the load and it is too much. This failure is what is causing the system crash. Can this be caused by the bad ports / cable?
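For reference, the two error numbers in the trace can be read as FreeBSD errno values: 5 is EIO and 60 is ETIMEDOUT in FreeBSD's sys/errno.h. The sketch below looks them up from a small hard-coded table (Python's own errno module reflects the host OS, so it would give different numbers on Linux). It assumes that the driver is printing plain errno codes in both messages; the helper name explain_error is hypothetical.

```python
import re

# errno values from FreeBSD's sys/errno.h; only the two seen in the trace.
FREEBSD_ERRNO = {
    5: ("EIO", "Input/output error"),
    60: ("ETIMEDOUT", "Operation timed out"),
}


def explain_error(line: str) -> str:
    """Find 'error N' in a log line and describe N as a FreeBSD errno."""
    match = re.search(r"\berror (\d+)\b", line, flags=re.IGNORECASE)
    if match is None:
        return "no error number found"
    code = int(match.group(1))
    name, text = FREEBSD_ERRNO.get(code, ("unknown", "not in this table"))
    return f"{code} = {name} ({text})"


if __name__ == "__main__":
    print(explain_error("(da8:mps0:0:8:0): Error 5, Retries exhausted"))
    print(explain_error("panic: mps_reinit hard reset failed with error 60"))
```

Read this way, the panic line says the hard reset timed out, which fits the description of the card not responding to the reinitialisation command. It does not by itself say whether the cause is the card, the slot, or the cabling.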
Also, I was curious whether I might run into any issues plugging these three consumer SATA drives (mentioned above) into the SAS connectors on the motherboard, since the SATA connectors are occupied. I had removed all the drives from the SAS connectors, because I appeared to be having some issues with my partition labels being corrupted or not visible, though many of these drives are >2TB. If the 500GB drives are OK to plug into SAS, that might provide more time between crashes (so a partial workaround) but also might give some clues since we would be taking the drives with the presumed problematic load off of the HBA card.
08-17-2020 10:46 AM - edited 08-17-2020 10:51 AM
If you don't know which firmware mode is installed on your LSI card, STOP and do nothing else until you do know.
Simply entering the card's BIOS will give you this information: if it offers to create a RAID in its BIOS, you have "IR" firmware installed.
Your statement that you have 16 disks in JBOD suggests you should be using "IT" firmware.
The LSI 9201 card can do JBOD while running RAID firmware, but it's not recommended, as the RAID firmware adds a layer onto the drives' I/O and can cause major issues with software RAID products like ZFS.
The LSI 8888ELP card is a RAID-only card (no JBOD) using external ports (ELP = ext ports).
I recommend you look into buying an LSI 9201/9211/9212 card off eBay that has the "IT" firmware already installed, or an Adaptec ASR-71605. Make sure the Adaptec comes with the optional battery backup/cache module. You will also need the correct SFF-8087 FORWARD BREAKOUT CABLE(S), card to drive(s).
Do not use the reverse type unless you are connecting the card to a backplane.
08-20-2020 04:42 PM
OK, I confirmed that it is IT firmware. For reference here is output from the sas2flash utility.
Adapter Selected is a LSI SAS: SAS2116_1(B1)
Controller Number : 0
Controller : SAS2116_1(B1)
PCI Address : 00:58:00:00
SAS Address : 500062b-2-00de-1940
NVDATA Version (Default) : 14.01.00.07
NVDATA Version (Persistent) : 14.01.00.07
Firmware Product ID : 0x2213 (IT)
Firmware Version : 20.00.07.00
NVDATA Vendor : LSI
NVDATA Product ID : SAS9201-16e
BIOS Version : 07.39.02.00
UEFI BSD Version : 07.27.01.00
FCODE Version : N/A
Board Name : SAS9201-16e
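If the same check needs to be repeated across several cards, the IT/IR answer can be pulled out of this listing automatically. The sketch below is only an illustration: it assumes the text has one "Name : value" field per line, as in the listing above, and that the mode appears in parentheses after the Firmware Product ID. The function names parse_sas2flash and firmware_mode are invented here and are not part of sas2flash.

```python
def parse_sas2flash(text: str) -> dict:
    """Split 'Name : value' lines into a dict, keeping colons inside values."""
    fields = {}
    for line in text.splitlines():
        # partition() splits on the first colon only, so a value such as
        # a PCI address (00:58:00:00) stays intact.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def firmware_mode(fields: dict) -> str:
    """Return 'IT', 'IR', or 'unknown' based on the Firmware Product ID field."""
    product_id = fields.get("Firmware Product ID", "")
    for mode in ("IT", "IR"):
        if f"({mode})" in product_id:
            return mode
    return "unknown"


if __name__ == "__main__":
    sample = (
        "PCI Address : 00:58:00:00\n"
        "Firmware Product ID : 0x2213 (IT)\n"
        "Firmware Version : 20.00.07.00\n"
    )
    fields = parse_sas2flash(sample)
    print(firmware_mode(fields), fields["Firmware Version"])
```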
08-20-2020 05:59 PM - edited 08-20-2020 06:01 PM
Again, the LSI 9201/9211/9210/9240 cards are all basically the same hardware, and although three different ROC chips were used, all have the same basic feature set (LSI 2008: IT or IR firmware / 2116: "IT" firmware only / 2308: PCIe 3.0).
I don't use Linux or Unix, so I can't comment on the OS's ability to do sustained I/O to numerous disks at the same time.
Remember, these cards are LOW END "basic" cards used to do RAID 0/1 and/or expand a system's SAS/SATA ports.
Depending on what you are using the card for, you may be exceeding the card's I/O ability or the OS's ability to provide an uninterrupted data flow, or the application itself may be at fault.
Consider adding another 9201 to split the load between two cards or, as I mentioned, replace the card with one that has better specs, like the Adaptec card mentioned above.
As for a hardware issue with the xw9400: I don't think so, as this system came out quite a few years ago, and any hardware issues would have been found and corrected or documented a long time ago.