_Igor_
New member
Message 1 of 2

Z620 2xCPU 96 GB: Kernel reports MCE memory errors after two days of running

Z620
Linux

Hi everyone,

 

HP Z620 (158A) with 2x E5-2670 and 96 GB (12x 8 GB) of RAM, latest BIOS, Ubuntu Linux.

 

I am seeing strange behavior on a system that I use as a homelab server, and maybe someone has solved this before.

 

After about two days of uptime, the kernel reports MCE errors like this:

EDAC MC0: 32024 CE memory read error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x1044 offset:0x840 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0001:0091 socket:0 ha:0 channel_mask:2 rank:0)
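For anyone who wants to track these messages over time, here is a minimal Python sketch that pulls the location fields out of a line like the one above. The function name `parse_edac_line` and the regular expressions are my own and only match this message layout; other EDAC drivers or kernel versions may format the line differently.

```python
import re

# The exact message from the post, used as sample input.
LINE = ("EDAC MC0: 32024 CE memory read error on "
        "CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x1044 "
        "offset:0x840 grain:32 syndrome:0x0 - OVERFLOW area:DRAM "
        "err_code:0001:0091 socket:0 ha:0 channel_mask:2 rank:0)")

# Assumed layout: "EDAC MC<n>: <count> <CE|UE> ..." followed by "(key:value ...)".
HEADER = re.compile(r"EDAC MC(?P<mc>\d+): (?P<count>\d+) (?P<kind>CE|UE) ")
FIELD = re.compile(r"(\w+):(\S+)")


def parse_edac_line(line):
    """Return a dict describing one EDAC message, or None if it does not match."""
    header = HEADER.search(line)
    if header is None:
        return None
    start = line.find("(")
    end = line.rfind(")")
    details = line[start + 1:end] if start != -1 and end > start else ""
    return {
        "mc": int(header["mc"]),
        "count": int(header["count"]),
        "kind": header["kind"],              # CE = corrected, UE = uncorrected
        "overflow": "OVERFLOW" in details,   # flag text as printed in the message
        "fields": dict(FIELD.findall(details)),
    }


if __name__ == "__main__":
    info = parse_edac_line(LINE)
    loc = info["fields"]
    print(f"MC{info['mc']}: {info['count']} {info['kind']} "
          f"socket {loc['socket']} channel {loc['channel']} slot {loc['slot']}")
```

Feeding it the kernel log line by line and grouping by socket, channel, and slot would show whether the errors always land on the same position.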

 

A reboot helps, and the system can then run for two days again.

 

I tried rotating the memory modules between slots, but the bank number in the message remains the same. I think the modules themselves are completely fine at this point.

 

Memtest86 reports no errors.

 

If I remove the riser board that carries the second CPU and its memory (it's detachable), no errors are reported for a week (I have not tested longer). But losing one CPU looks like overkill to me as a way to beat this issue.

 

My next step was to add 'mce=ignore_ce' to the boot options. As designed, the kernel no longer reports MCE errors. But after two days I noticed that overall system performance had degraded drastically.

For example, an app starts in 16 seconds on a freshly booted system, and after two days it takes 101 seconds to start. The system was 99% idle before I started it.

 

Now I am puzzled. What should I try next?

1 REPLY 1
DGroves
Level 11
Message 2 of 2

Just because memtest reports no errors does not mean the memory module(s) are OK; some errors are not reported.

 

These errors occur when the Error Detection and Correction (EDAC) module reads the registers from the chipset. You may not notice any memory or CPU errors in the ESM/BMC/IPMI/iDRAC log because the registers are read-once and when enabled, EDAC will get them first.

 

Try swapping the memory in CPU 0, memory channel 1, DIMM 0 with a "new" module and see if the error changes.

 

I do not recommend just swapping the existing modules around in this case. Use a new, known-good module and, if necessary, start with the first memory channel, first DIMM socket, and proceed through all the modules, testing each one.
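To see whether the errors follow a module or stay with a slot while testing like this, it can help to read the per-DIMM corrected-error counters instead of waiting for the log message. Below is a rough Python sketch that assumes the kernel's EDAC sysfs layout (`/sys/devices/system/edac/mc/mc*/dimm*/` with `dimm_label`, `dimm_location`, and `dimm_ce_count`); whether those files are present depends on the kernel version and on the EDAC driver being loaded. The helper names are made up for illustration.

```python
from pathlib import Path

# Assumed location of the EDAC memory-controller tree; present only when an
# EDAC driver is loaded.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_text(path):
    """Return the stripped file contents, or None if the file cannot be read."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def dimm_ce_counts(root=EDAC_ROOT):
    """Collect the corrected-error count for every DIMM directory under root."""
    rows = []
    for dimm in sorted(Path(root).glob("mc*/dimm*")):
        count = read_text(dimm / "dimm_ce_count")
        if count is None or not count.isdigit():
            continue  # not a DIMM directory, or counter not exposed
        rows.append({
            "mc": dimm.parent.name,
            "dimm": dimm.name,
            "label": read_text(dimm / "dimm_label") or "",
            "location": read_text(dimm / "dimm_location") or "",
            "ce_count": int(count),
        })
    return rows


if __name__ == "__main__":
    for row in dimm_ce_counts():
        print(f"{row['mc']}/{row['dimm']} [{row['label']}] "
              f"{row['location']}: {row['ce_count']} corrected errors")
```

Run it once after boot and again a day or two later, before and after moving the known-good module. If the nonzero counter stays on the same location regardless of which module is installed, that points toward the slot, board, or CPU rather than the DIMM.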

 

 
