Hi,
I am running out of troubleshooting options, so I am asking for advice.
Dell Precision 5810. The CPU, the memory and the disk have all been replaced. Debian 12, fresh install. Everything is up to date: BIOS, packages, and so on.
From time to time the following shows up in dmesg and in # journalctl -f:
Jun 17 22:13:16 debian kernel: EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Jun 17 22:13:16 debian kernel: EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 13: 8c00004a000800c0
Jun 17 22:13:16 debian kernel: EDAC sbridge MC0: TSC 1859b6d484e3
Jun 17 22:13:16 debian kernel: EDAC sbridge MC0: ADDR 1605ad6000
Jun 17 22:13:16 debian kernel: EDAC sbridge MC0: MISC 90000008000948c
Jun 17 22:13:16 debian kernel: EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1718651596 SOCKET 0 APIC 0
Jun 17 22:13:16 debian kernel: EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#1_Chan#0_DIMM#0 (channel:0 page:0x1605ad6 offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c0 socket:0 ha:1 channel_mask:1 rank:255 )
---
[10074.494184] mce: [Hardware Error]: Machine check events logged
[10074.494192] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[10074.494195] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 13: 8c00004a000800c0
[10074.494200] EDAC sbridge MC0: TSC 1859b6d484e3
[10074.494202] EDAC sbridge MC0: ADDR 1605ad6000
[10074.494204] EDAC sbridge MC0: MISC 90000008000948c
[10074.494206] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1718651596 SOCKET 0 APIC 0
[10074.494225] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#1_Chan#0_DIMM#0 (channel:0 page:0x1605ad6 offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c0 socket:0 ha:1 channel_mask:1 rank:255 )
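To cross-check what the EDAC line says, the raw status word from Bank 13 can be decoded by hand. This is a minimal sketch that assumes the architectural IA32_MCi_STATUS layout from the Intel SDM (valid/uncorrected flags in the top bits, corrected error count in bits 52:38, MCA error code in bits 15:0); the function name and the dictionary keys are made up for illustration, and the model-specific bits are left undecoded.

```python
# Sketch: decode an IA32_MCi_STATUS value using the architectural bit layout.
# Assumption: the corrected-error-count field (bits 52:38) is implemented on
# this CPU; model-specific fields are reported raw and not interpreted.

MEMORY_ERROR_TYPES = {
    0b000: "generic undefined request",
    0b001: "memory read error",
    0b010: "memory write error",
    0b011: "address/command error",
    0b100: "memory scrubbing error",
}


def decode_mci_status(status: int) -> dict:
    """Return the architectural fields of a 64-bit machine-check status word."""
    mca_code = status & 0xFFFF
    decoded = {
        "valid": bool((status >> 63) & 1),
        "overflow": bool((status >> 62) & 1),
        "uncorrected": bool((status >> 61) & 1),
        "enabled": bool((status >> 60) & 1),
        "misc_valid": bool((status >> 59) & 1),
        "addr_valid": bool((status >> 58) & 1),
        "context_corrupt": bool((status >> 57) & 1),
        "corrected_count": (status >> 38) & 0x7FFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_code": mca_code,
        "memory_error": None,
        "channel": None,
    }
    # Memory controller errors use the compound code 000F 0000 1MMM CCCC.
    if (mca_code & 0xEF80) == 0x0080:
        mmm = (mca_code >> 4) & 0x7
        cccc = mca_code & 0xF
        decoded["memory_error"] = MEMORY_ERROR_TYPES.get(mmm, "reserved")
        decoded["channel"] = None if cccc == 0xF else cccc
    return decoded


if __name__ == "__main__":
    # Status and address copied from the log above.
    for key, value in decode_mci_status(0x8C00004A000800C0).items():
        print(f"{key}: {value}")
    print("page:", hex(0x1605AD6000 >> 12))
```

The two halves of err_code:0008:00c0 in the EDAC line are the model-specific code and the MCA code from this word, and the page number is the ADDR value shifted right by 12 bits, so the EDAC message and the raw MCE describe the same event.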
After that, kernel warning dumps follow as well:
Jun 17 22:35:43 debian kernel: ------------[ cut here ]------------
Jun 17 22:35:43 debian kernel: WARNING: CPU: 4 PID: 8050 at drivers/iommu/dma-iommu.c:1041 iommu_dma_unmap_page+0x79/0x90
Jun 17 22:35:43 debian kernel: Modules linked in: ccm rfcomm snd_seq_dummy snd_hrtimer snd_seq snd_seq_device qrtr cmac algif_hash algif_skcipher af_alg bnep binfmt_misc squashfs intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp btusb btrtl kvm_intel btbcm btintel btmtk kvm bluetooth irqbypass rtl8188ee rtl_pci snd_hda_codec_realtek rtlwifi ghash_clmulni_intel sha256_ssse3 sha1_ssse3 snd_hda_codec_generic mac80211 snd_hda_codec_hdmi ledtrig_audio aesni_intel snd_hda_intel crypto_simd libarc4 cryptd jitterentropy_rng snd_intel_dspcfg snd_intel_sdw_acpi sha512_ssse3 rapl sha512_generic cfg80211 snd_hda_codec ctr snd_hda_core intel_cstate snd_hwdep mei_wdt drbg snd_pcm dell_smm_hwmon ansi_cprng iTCO_wdt snd_timer intel_pmc_bxt dell_smbios mei_me iTCO_vendor_support ecdh_generic snd dcdbas mei intel_wmi_thunderbolt dell_wmi_descriptor wmi_bmof pcspkr rfkill ecc soundcore watchdog intel_uncore joydev sg evdev msr
Jun 17 22:35:43 debian kernel: parport_pc ppdev lp parport dm_mod fuse loop efi_pstore configfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic hid_generic usbhid hid nouveau sd_mod t10_pi sr_mod cdrom crc64_rocksoft crc64 crc_t10dif crct10dif_generic video i2c_algo_bit drm_display_helper cec rc_core ahci drm_ttm_helper libahci ttm libata drm_kms_helper xhci_pci ehci_pci xhci_hcd ehci_hcd mxm_wmi drm scsi_mod crct10dif_pclmul usbcore crct10dif_common crc32_pclmul e1000e crc32c_intel i2c_i801 lpc_ich i2c_smbus scsi_common usb_common wmi button
Jun 17 22:35:43 debian kernel: CPU: 4 PID: 8050 Comm: kworker/u48:1 Not tainted 6.1.0-18-amd64 #1 Debian 6.1.76-1
Jun 17 22:35:43 debian kernel: Hardware name: Dell Inc. Precision Tower 5810/0K240Y, BIOS A34 10/19/2020
Jun 17 22:35:43 debian kernel: Workqueue: phy0 ieee80211_iface_work [mac80211]
Jun 17 22:35:43 debian kernel: RIP: 0010:iommu_dma_unmap_page+0x79/0x90
Jun 17 22:35:43 debian kernel: Code: 2b 48 3b 28 72 26 48 3b 68 08 73 20 4d 89 f8 44 89 f1 4c 89 ea 48 89 ee 48 89 df 5b 5d 41 5c 41 5d 41 5e 41 5f e9 a7 b9 a6 ff <0f> 0b 5b 5d 41 5c 41 5d 41 5e 41 5f c3 cc cc cc cc 66 0f 1f 44 00
Jun 17 22:35:43 debian kernel: RSP: 0018:ffffb41f6108f938 EFLAGS: 00010046
Jun 17 22:35:43 debian kernel: RAX: 0000000000000000 RBX: ffff8fb4c29150d0 RCX: 0000000000000012
Jun 17 22:35:43 debian kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003
Jun 17 22:35:43 debian kernel: RBP: ffffb41f6108fa10 R08: 0000000000000002 R09: 0000000000000000
Jun 17 22:35:43 debian kernel: R10: 00000000ffffffa0 R11: 0000000000000000 R12: 0000000000000000
Jun 17 22:35:43 debian kernel: R13: 0000000000000300 R14: 0000000000000001 R15: 0000000000000000
Jun 17 22:35:43 debian kernel: FS: 0000000000000000(0000) GS:ffff8fd36f900000(0000) knlGS:0000000000000000
Jun 17 22:35:43 debian kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 17 22:35:43 debian kernel: CR2: 00007f5f0802a048 CR3: 0000000109424002 CR4: 00000000003706e0
Jun 17 22:35:43 debian kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jun 17 22:35:43 debian kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Jun 17 22:35:43 debian kernel: Call Trace:
Jun 17 22:35:43 debian kernel: <TASK>
Jun 17 22:35:43 debian kernel: ? __warn+0x7d/0xc0
Jun 17 22:35:43 debian kernel: ? iommu_dma_unmap_page+0x79/0x90
Jun 17 22:35:43 debian kernel: ? report_bug+0xe2/0x150
Jun 17 22:35:43 debian kernel: ? handle_bug+0x41/0x70
Jun 17 22:35:43 debian kernel: ? exc_invalid_op+0x13/0x60
Jun 17 22:35:43 debian kernel: ? asm_exc_invalid_op+0x16/0x20
Jun 17 22:35:43 debian kernel: ? iommu_dma_unmap_page+0x79/0x90
Jun 17 22:35:43 debian kernel: ? iommu_dma_unmap_page+0x2e/0x90
Jun 17 22:35:43 debian kernel: rtl88ee_set_hw_reg+0x12fe/0x16f0 [rtl8188ee]
Jun 17 22:35:43 debian kernel: ? rtl88ee_set_check_bssid+0xb6/0x120 [rtl8188ee]
Jun 17 22:35:43 debian kernel: rtl_op_bss_info_changed+0x1c8/0x8a0 [rtlwifi]
Jun 17 22:35:43 debian kernel: ? ieee80211_recalc_chanctx_min_def+0x14/0x60 [mac80211]
Jun 17 22:35:43 debian kernel: ieee80211_bss_info_change_notify+0xcf/0x2a0 [mac80211]
Jun 17 22:35:43 debian kernel: ieee80211_rx_mgmt_assoc_resp.cold+0x1946/0x198d [mac80211]
Jun 17 22:35:43 debian kernel: ieee80211_sta_rx_queued_mgmt+0x2d6/0x820 [mac80211]
Jun 17 22:35:43 debian kernel: ? psi_group_change+0x145/0x360
Jun 17 22:35:43 debian kernel: ieee80211_iface_work+0x325/0x440 [mac80211]
Jun 17 22:35:43 debian kernel: process_one_work+0x1c7/0x380
Jun 17 22:35:43 debian kernel: worker_thread+0x4d/0x380
Jun 17 22:35:43 debian kernel: ? rescuer_thread+0x3a0/0x3a0
Jun 17 22:35:43 debian kernel: kthread+0xda/0x100
Jun 17 22:35:43 debian kernel: ? kthread_complete_and_exit+0x20/0x20
Jun 17 22:35:43 debian kernel: ret_from_fork+0x22/0x30
Jun 17 22:35:43 debian kernel: </TASK>
Jun 17 22:35:43 debian kernel: ---[ end trace 0000000000000000 ]---
----
# uname -a
Linux debian 6.1.0-18-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.76-1 (2024-02-01) x86_64 GNU/Linux
----
At first glance this looks like a memory problem, which (possibly) causes trouble for the wifi card as well.
However:
# ras-mc-ctl --errors
No Memory errors.
No PCIe AER errors.
No Extlog errors.
No MCE errors.
----
# ras-mc-ctl --error-count
Label CE UE
CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 0 0
CPU_SrcID#0_Ha#1_Chan#1_DIMM#0 0 0
CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 0 0
CPU_SrcID#0_Ha#1_Chan#0_DIMM#0 0 0
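Since the kernel log reports a CE while the per-DIMM table shows zeros, it may be worth reading the EDAC counters straight from sysfs, including the controller-level and "noinfo" counters. This is a rough sketch: the file names (ce_count, ce_noinfo_count, dimm_ce_count, dimm_label and so on) are the ones the kernel EDAC sysfs interface documents, the function names are hypothetical, and it is only my guess (not verified) that an error logged with rank:255 may end up in a counter that is not tied to a specific DIMM.

```python
# Sketch: dump EDAC error counters directly from sysfs.
# Assumption: an EDAC driver (sb_edac on this machine) is loaded, otherwise
# the directory is empty or missing and the result is an empty dict.
from pathlib import Path


def read_int(path: Path):
    """Return the integer stored in a sysfs file, or None if unreadable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def edac_counters(root: str = "/sys/devices/system/edac/mc") -> dict:
    """Collect controller-level and per-DIMM CE/UE counters for every mcN."""
    result = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        entry = {
            name: read_int(mc / name)
            for name in ("ce_count", "ce_noinfo_count", "ue_count", "ue_noinfo_count")
        }
        dimms = {}
        for dimm in sorted(mc.glob("dimm[0-9]*")):
            label_file = dimm / "dimm_label"
            label = label_file.read_text().strip() if label_file.exists() else dimm.name
            dimms[label] = {
                "ce": read_int(dimm / "dimm_ce_count"),
                "ue": read_int(dimm / "dimm_ue_count"),
            }
        entry["dimms"] = dimms
        result[mc.name] = entry
    return result


if __name__ == "__main__":
    for mc_name, entry in edac_counters().items():
        print(mc_name, {k: v for k, v in entry.items() if k != "dimms"})
        for label, counts in entry["dimms"].items():
            print("   ", label, counts)
```

If ce_count or ce_noinfo_count on a controller is non-zero while every dimm_ce_count is zero, that would explain the mismatch between the log and the table above.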
-----
The Dell BIOS diagnostics show no errors (the BIOS is the latest version).
memtest86+ shows no errors.
---
To make it more interesting: I have older memory modules, 4x 4G, 2133 MHz. With those installed there are no errors.
With 4x 32G, 2400 MHz installed, the problems start. Swapping the modules between slots changes nothing. Removing the modules one at a time: with 1x 32G it works fine, with 2x 32G it works fine, with 3x 32G it works fine, with 4x it gives the errors above.
Officially the system supports 256G of RAM at 2400 MHz.
Any ideas on how to track down where the problem comes from?
One of my theories is that there is a problem with the fourth memory slot at 2400 MHz ... but I have no proof.
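One way to collect evidence for or against the slot theory is to tally the CE lines per DIMM label and per page across several boots and module arrangements. This is a minimal sketch that parses lines in the same shape as the "EDAC MC1: 1 CE ..." line above; the regular expression and the function name are my own and only cover that one message format.

```python
# Sketch: count corrected-error events per DIMM label and per memory page
# from kernel log text. Feed it ONE source (journal or dmesg), because the
# same event appears in both and would otherwise be counted twice.
import re
import sys
from collections import Counter

CE_LINE = re.compile(
    r"EDAC (?P<mc>MC\d+): (?P<count>\d+) CE (?P<kind>.+?) on (?P<label>\S+) "
    r"\(.*?page:(?P<page>0x[0-9a-fA-F]+)"
)


def tally_ce(lines):
    """Return (errors per DIMM label, errors per page number)."""
    by_label = Counter()
    by_page = Counter()
    for line in lines:
        match = CE_LINE.search(line)
        if match is None:
            continue
        count = int(match.group("count"))
        by_label[match.group("label")] += count
        by_page[int(match.group("page"), 16)] += count
    return by_label, by_page


if __name__ == "__main__":
    labels, pages = tally_ce(sys.stdin)
    for label, count in labels.most_common():
        print(f"{count:6d}  {label}")
    for page, count in pages.most_common(10):
        print(f"{count:6d}  page {page:#x}")
```

It can be run as `journalctl -k | python3 tally_ce.py` (the script name is arbitrary). If the label stays the same no matter which physical module sits in that slot, that points at the slot or the memory channel; if it follows one module, that points at the module. A single page repeating over and over would suggest one weak spot rather than a marginal channel.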
I read about a problem with the edac modules:
https://www.dell.com/support/kbdoc/en-us/000177028/edac-errors-in-messages-log-in-redhat-enterprise-linux-rhel-and-poweredge
When I blacklist them in
# cat edac_blacklist.conf
blacklist i5000_edac
blacklist igen6_edac
blacklist i7core_edac
blacklist pnd2_edac
blacklist i82975x_edac
blacklist skx_edac
blacklist ie31200_edac
blacklist sb_edac
blacklist i5400_edac
blacklist e752x_edac
blacklist i10nm_edac
blacklist edac_mce_amd
blacklist i5100_edac
blacklist i3200_edac
blacklist i7300_edac
blacklist i3000_edac
blacklist x38_edac
blacklist amd64_edac
the errors get slightly better, but without these modules ras-mc-ctl does not work .....
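In the "Modules linked in" list of the warning above, sb_edac is the only EDAC driver that shows up, so most of that blacklist probably does nothing on this machine. A small sketch to confirm which EDAC modules are actually loaded, by reading /proc/modules (the function name is hypothetical, and matching on the "_edac" suffix or "edac_" prefix is just a naming heuristic):

```python
# Sketch: list loaded kernel modules whose name looks like an EDAC driver.
# Each line of /proc/modules starts with the module name.

def loaded_edac_modules(proc_modules: str = "/proc/modules") -> list:
    """Return the sorted names of loaded modules matching *_edac or edac_*."""
    with open(proc_modules) as handle:
        names = [line.split()[0] for line in handle if line.strip()]
    return sorted(
        name for name in names
        if name.endswith("_edac") or name.startswith("edac_")
    )


if __name__ == "__main__":
    print(loaded_edac_modules())
```

If only sb_edac comes back, blacklisting just that one module would have the same effect as the long list, and removing the blacklist again brings ras-mc-ctl back.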
--
What I have tried (besides what is described so far):
- reinstalled Debian
- moved the wifi card to a different PCI slot
- swapped the positions of the memory modules
- installed FreeBSD - there are still problems in dmesg, and it even hangs
- removed the wifi card and ran on cable - the wifi-related dumps are gone. It also seems (to me) that there are fewer memory errors.
What I have not tried:
- pulling the CPU to check for bent pins that might be shorting.
For some strange reason Ubuntu 24.04 does not boot from USB (not even from an 8 GB flash drive, since it demonstrably does not boot from the 16G one). Debian boots from the same flash drive (the BIOS is set to legacy).
I have had the machine for a month; a problem with the motherboard or the chipset cannot be ruled out.
There is no overheating problem:
# sensors
coretemp-isa-0000
Adapter: ISA adapter
Package id 0: +43.0°C (high = +85.0°C, crit = +95.0°C)
Core 0: +39.0°C (high = +85.0°C, crit = +95.0°C)
Core 1: +39.0°C (high = +85.0°C, crit = +95.0°C)
Core 2: +38.0°C (high = +85.0°C, crit = +95.0°C)
Core 3: +38.0°C (high = +85.0°C, crit = +95.0°C)
Core 4: +40.0°C (high = +85.0°C, crit = +95.0°C)
Core 5: +39.0°C (high = +85.0°C, crit = +95.0°C)
Core 8: +38.0°C (high = +85.0°C, crit = +95.0°C)
Core 9: +39.0°C (high = +85.0°C, crit = +95.0°C)
Core 10: +40.0°C (high = +85.0°C, crit = +95.0°C)
Core 11: +39.0°C (high = +85.0°C, crit = +95.0°C)
Core 12: +39.0°C (high = +85.0°C, crit = +95.0°C)
Core 13: +39.0°C (high = +85.0°C, crit = +95.0°C)
nouveau-pci-0300
Adapter: PCI adapter
GPU core: 1.01 V (min = +0.60 V, max = +1.20 V)
fan1: 1530 RPM
temp1: +47.0°C (high = +95.0°C, hyst = +3.0°C)
(crit = +105.0°C, hyst = +5.0°C)
(emerg = +135.0°C, hyst = +5.0°C)
dell_smm-isa-0000
Adapter: ISA adapter
Processor Fan: 1056 RPM (min = 0 RPM, max = 4520 RPM)
Other Fan: 0 RPM
Other Fan: 1001 RPM (min = 0 RPM, max = 5000 RPM)
CPU: +43.0°C
SODIMM: +26.0°C
SODIMM: +42.0°C
SODIMM: +40.0°C
---
Any idea for localizing the problem is welcome.