Strange problem with xfs or bad sectors?!

Title: Strange problem with xfs or bad sectors?!
Posted by: amarth on Aug 10, 2011, 19:45

Hello,
I was checking what's in /raid0/mysql/ and here is what I ran into:

serv07:~# ls -la /raid0/mysql/
ls: reading directory /raid0/mysql/: Input/output error
total 0

The directory in question contained several databases.

Some info about the machine:
Debian 5.0.1, kernel 2.6.20.2

serv07:~# fdisk -l
Disk /dev/hda: 160.0 GB, 160041885696 bytes
255 heads, 63 sectors/track, 19457 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0xae7a3a28

Device Boot Start End Blocks Id System
/dev/hda1 1 1246 10008463+ fd Linux raid autodetect
/dev/hda2 1247 1369 987997+ 82 Linux swap / Solaris
/dev/hda3 1370 19457 145291860 fd Linux raid autodetect

Disk /dev/hdc: 160.0 GB, 160041885696 bytes
255 heads, 63 sectors/track, 19457 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00000000

Device Boot Start End Blocks Id System
/dev/hdc1 1 1246 10008463+ fd Linux raid autodetect
/dev/hdc2 1247 1369 987997+ 82 Linux swap / Solaris
/dev/hdc3 1370 19457 145291860 fd Linux raid autodetect

Disk /dev/md3: 297.5 GB, 297557557248 bytes
2 heads, 4 sectors/track, 72645888 cylinders
Units = cylinders of 8 * 512 = 4096 bytes
Disk identifier: 0x00000000

Disk /dev/md2: 10.2 GB, 10248585216 bytes
2 heads, 4 sectors/track, 2502096 cylinders
Units = cylinders of 8 * 512 = 4096 bytes
Disk identifier: 0x00000000

serv07:~# cat /etc/fstab
proc /proc proc defaults 0 0
/dev/md2 / xfs defaults 0 1
/dev/md3 /raid0 xfs defaults 0 2
/dev/hda2 none swap sw 0 0
/dev/hdc2 none swap sw 0 0

serv07:~# mdadm --detail /dev/md3
/dev/md3:
Version : 00.90
Creation Time : Fri Nov 17 19:09:14 2006
Raid Level : raid0
Array Size : 290583552 (277.12 GiB 297.56 GB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 3
Persistence : Superblock is persistent

Update Time : Wed Aug 10 16:53:23 2011
State : active
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0

Chunk Size : 64K

UUID : e047348c:50fcbfd0:2636a61a:93c132ec
Events : 0.15

Number Major Minor RaidDevice State
0 3 3 0 active sync /dev/hda3
1 22 3 1 active sync /dev/hdc3

Result of checking md3:
serv07:~# xfs_check /dev/md3
can't read btree block 1/14884
can't read block 0 for directory inode 139741471
no . entry for directory 139741471
no .. entry for directory 139741471
/usr/sbin/xfs_check: line 28: 4107 Segmentation fault xfs_db$DBOPTS -i -p xfs_check -c "check$OPTS" $1

I also tried repair, but it had no effect :(
serv07:~# xfs_repair /dev/md3
Phase 1 - find and verify superblock...
Phase 2 - using internal log
- zero log...
- scan filesystem freespace and inode maps...
xfs_repair: read failed: Input/output error

fatal error -- can't read btree block 1/14884
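The `1/14884` in these messages is an XFS allocation group number and a block offset inside that group. A minimal sketch of turning that pair into a 512-byte sector offset on /dev/md3, which makes it possible to compare the failing metadata block with the sectors the drives report as bad. It assumes 4096-byte filesystem blocks, and the `AGBLOCKS` value below is hypothetical: the real one is the `agsize` that `xfs_info /raid0` prints, and it has to be substituted before the result means anything.

```python
SECTOR = 512  # XFS "daddr" units are 512-byte basic blocks


def xfs_agb_to_daddr(agno: int, agbno: int, agblocks: int, blocksize: int = 4096) -> int:
    """Convert an XFS (allocation group, AG-relative block) pair into a
    512-byte sector offset from the start of the filesystem device."""
    if not 0 <= agbno < agblocks:
        raise ValueError("agbno must lie inside the allocation group")
    linear_block = agno * agblocks + agbno  # filesystem block number, counted linearly
    return linear_block * (blocksize // SECTOR)


# HYPOTHETICAL geometry: replace with the agsize reported by xfs_info.
AGBLOCKS = 4540368
daddr = xfs_agb_to_daddr(1, 14884, AGBLOCKS)
print(daddr, daddr * SECTOR)  # sector and byte offset inside /dev/md3
```

With the real `agsize`, the printed sector is a position on the array, not on either drive; RAID0 then spreads it to one member in 64K chunks.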

The RAID itself works; the problem is only with this directory, i.e. all the data in /raid0/* is fine except what is in /raid0/mysql/.

If anyone has ideas on how I can recover what is in /raid0/mysql/, I'd be very grateful.

Title: Re: Strange problem with xfs or bad sectors?!
Posted by: vyrgozunqk on Aug 10, 2011, 20:41

The easiest (though not very precise) option is to look at the drive's SMART info... Let's see what it says. But yes, it smells like bad sectors.

Separately, I'm not familiar with XFS, but if fsck supports it, you can try running it. It usually complains about bad sectors and even offers to try rewriting them, and if rewriting the block fails, the drive relocates it...

Title: Re: Strange problem with xfs or bad sectors?!
Posted by: aleximilian on Aug 10, 2011, 23:14

I went through a similar ordeal a while ago.
If you can, make a copy of the disk with dd or dd_rescue, and then see whether testdisk can help.
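dd_rescue and GNU ddrescue are separate tools, but both rest on the same idea: keep reading past errors and leave the unreadable parts as holes in the copy. A rough Python sketch of that idea, not how either tool is actually implemented, assuming 512-byte sectors and a source readable with `os.pread`: read in large blocks, and when a block fails, re-read it one sector at a time and zero-fill only the sectors that still fail. The name `rescue_copy` is hypothetical.

```python
import os

SECTOR = 512  # assumed logical sector size


def _read_exact(fd: int, n: int, off: int) -> bytes:
    # Treat a short read as a failure so the copy never drifts out of alignment.
    data = os.pread(fd, n, off)
    if len(data) != n:
        raise OSError("short read at offset %d" % off)
    return data


def rescue_copy(src_path: str, dst_path: str, block: int = 64 * 1024) -> list:
    """Copy src_path to dst_path, skipping unreadable sectors.

    Reads in `block`-sized pieces; a piece that fails is re-read sector by
    sector, and every sector that still fails is written as zeros.
    Returns the numbers of the sectors that could not be read."""
    bad = []
    src = os.open(src_path, os.O_RDONLY)
    try:
        size = os.lseek(src, 0, os.SEEK_END)  # works for files and block devices
        with open(dst_path, "wb") as dst:
            pos = 0
            while pos < size:
                n = min(block, size - pos)
                try:
                    data = _read_exact(src, n, pos)
                except OSError:
                    data = bytearray()
                    for off in range(pos, pos + n, SECTOR):
                        m = min(SECTOR, pos + n - off)
                        try:
                            data += _read_exact(src, m, off)
                        except OSError:
                            bad.append(off // SECTOR)
                            data += bytes(m)  # hole filled with zeros
                dst.write(data)
                pos += n
    finally:
        os.close(src)
    return bad
```

Usage would be something like `bad = rescue_copy("/dev/md3", "/mnt/spare/md3.img")`, with the image on a disk outside the array (both paths are only examples). Unlike the real tools, this sketch cannot resume an interrupted run and does nothing about long kernel retries on failing sectors.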

Title: Re: Strange problem with xfs or bad sectors?!
Posted by: amarth on Aug 11, 2011, 01:47

fsck isn't supported; here is the SMART output:
serv07:~# smartctl -a /dev/hdc
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 7200.7 and 7200.7 Plus family
Device Model: ST3160023A
Serial Number: 5JS3RRW0
Firmware Version: 8.01
User Capacity: 160,041,885,696 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 6
ATA Standard is: ATA/ATAPI-6 T13 1410D revision 2
Local Time is: Thu Aug 11 01:15:57 2011 EEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 430) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 111) minutes.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 054 053 006 Pre-fail Always - 4393832
3 Spin_Up_Time 0x0003 096 096 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 3
5 Reallocated_Sector_Ct 0x0033 099 099 036 Pre-fail Always - 47
7 Seek_Error_Rate 0x000f 079 060 030 Pre-fail Always - 86523270
9 Power_On_Hours 0x0032 078 078 000 Old_age Always - 20049
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 134
194 Temperature_Celsius 0x0022 037 059 000 Old_age Always - 37
195 Hardware_ECC_Recovered 0x001a 054 052 000 Old_age Always - 4393832
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 10
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 10
199 UDMA_CRC_Error_Count 0x003e 200 199 000 Old_age Always - 2
200 Multi_Zone_Error_Rate 0x0000 100 253 000 Old_age Offline - 0
202 TA_Increase_Count 0x0032 017 170 000 Old_age Always - 83

SMART Error Log Version: 1
ATA Error Count: 2890 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 2890 occurred at disk power-on lifetime: 20044 hours (835 days + 4 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 91 c6 79 e0 Error: UNC 8 sectors at LBA = 0x0079c691 = 7980689

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 08 91 c6 79 e0 00 20:31:43.955 READ DMA EXT
25 00 08 91 c6 79 e0 00 20:31:39.968 READ DMA EXT
25 00 10 89 e1 64 e0 00 20:31:39.954 READ DMA EXT
25 00 68 19 e1 64 e0 00 20:31:39.951 READ DMA EXT
25 00 80 99 e0 64 e0 00 20:31:39.948 READ DMA EXT

Error 2889 occurred at disk power-on lifetime: 20044 hours (835 days + 4 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 91 c6 79 e0 Error: UNC 8 sectors at LBA = 0x0079c691 = 7980689

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 08 91 c6 79 e0 00 20:31:39.884 READ DMA EXT
25 00 10 89 e1 64 e0 00 20:31:39.968 READ DMA EXT
25 00 68 19 e1 64 e0 00 20:31:39.954 READ DMA EXT
25 00 80 99 e0 64 e0 00 20:31:39.951 READ DMA EXT
25 00 80 19 e0 64 e0 00 20:31:39.948 READ DMA EXT

Error 2888 occurred at disk power-on lifetime: 20044 hours (835 days + 4 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 b9 9d 65 e0 Error: UNC 8 sectors at LBA = 0x00659db9 = 6659513

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 08 b9 9d 65 e0 00 20:31:31.108 READ DMA EXT
ea 00 00 00 00 00 00 00 20:31:31.099 FLUSH CACHE EXIT
35 00 08 bf 6e 31 e0 00 20:31:31.091 WRITE DMA EXT
ea 00 00 00 00 00 00 00 20:31:31.084 FLUSH CACHE EXIT
25 00 08 b9 9d 65 e0 00 20:31:31.084 READ DMA EXT

Error 2887 occurred at disk power-on lifetime: 20044 hours (835 days + 4 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 b9 9d 65 e0 Error: UNC 8 sectors at LBA = 0x00659db9 = 6659513

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 08 b9 9d 65 e0 00 20:31:31.108 READ DMA EXT
25 00 08 09 cc 64 e0 00 20:31:31.099 READ DMA EXT
25 00 08 71 f3 64 e0 00 20:31:31.091 READ DMA EXT
25 00 08 11 cc 64 e0 00 20:31:31.084 READ DMA EXT
25 00 08 e9 cf 64 e0 00 20:31:31.084 READ DMA EXT

Error 2886 occurred at disk power-on lifetime: 20044 hours (835 days + 4 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 91 c6 79 e0 Error: UNC 8 sectors at LBA = 0x0079c691 = 7980689

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 08 91 c6 79 e0 00 20:15:51.154 READ DMA EXT
ea 00 00 00 00 00 00 00 20:15:51.150 FLUSH CACHE EXIT
35 00 08 bf 6e 31 e0 00 20:15:51.147 WRITE DMA EXT
ea 00 00 00 00 00 00 00 20:15:51.141 FLUSH CACHE EXIT
25 00 08 91 c6 79 e0 00 20:15:51.138 READ DMA EXT

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
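The UNC entries above already show the decoded LBA, but the same number can be rebuilt from the register bytes, which helps when comparing entries by hand. A small sketch, assuming LBA addressing in which the low four bits of DH hold LBA bits 24 to 27. The registers printed here are only the low-order bytes, so this covers LBAs below 2^28, which holds for every entry in this log.

```python
def lba28_from_registers(sn: int, cl: int, ch: int, dh: int) -> int:
    """Rebuild a 28-bit LBA: DH bits 0-3 are LBA bits 24-27,
    CH is bits 16-23, CL bits 8-15 and SN bits 0-7."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


def lba_from_error_row(row: str) -> int:
    """Parse an 'ER ST SC SN CL CH DH ...' row from the smartctl error log."""
    _er, _st, _sc, sn, cl, ch, dh = (int(x, 16) for x in row.split()[:7])
    return lba28_from_registers(sn, cl, ch, dh)


print(lba_from_error_row("40 51 08 91 c6 79 e0 Error: UNC 8 sectors"))  # 7980689
print(lba_from_error_row("40 51 08 b9 9d 65 e0 Error: UNC 8 sectors"))  # 6659513
```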

And from /dev/hda:
serv07:~# smartctl -a /dev/hda
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen

...

Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 063 054 006 Pre-fail Always - 155371110
3 Spin_Up_Time 0x0003 096 096 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 12
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 4
7 Seek_Error_Rate 0x000f 084 060 030 Pre-fail Always - 312341655
9 Power_On_Hours 0x0032 095 095 000 Old_age Always - 5072
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 185
194 Temperature_Celsius 0x0022 031 060 000 Old_age Always - 31
195 Hardware_ECC_Recovered 0x001a 063 054 000 Old_age Always - 155371110
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 199 000 Old_age Always - 2
200 Multi_Zone_Error_Rate 0x0000 100 253 000 Old_age Offline - 0
202 TA_Increase_Count 0x0032 100 253 000 Old_age Always - 0

...

Error 16 occurred at disk power-on lifetime: 5045 hours (210 days + 5 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 d4 25 2c e0 Error: UNC at LBA = 0x002c25d4 = 2893268

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 00 3f 25 2c e0 00 00:12:26.047 READ DMA EXT
25 00 00 3f 21 2c e0 00 00:12:26.020 READ DMA EXT
25 00 00 3f 1d 2c e0 00 00:12:26.427 READ DMA EXT
25 00 00 3f 19 2c e0 00 00:12:26.400 READ DMA EXT
25 00 00 3f 15 2c e0 00 00:12:26.374 READ DMA EXT

Error 15 occurred at disk power-on lifetime: 4998 hours (208 days + 6 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 b5 75 46 e0 Error: UNC at LBA = 0x004675b5 = 4617653

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 08 ae 75 46 e0 00 14:09:14.685 READ DMA EXT
25 00 10 a6 75 46 e0 00 14:09:14.681 READ DMA EXT
25 00 08 9e 75 46 e0 00 14:09:14.680 READ DMA EXT
25 00 40 9e f3 46 e0 00 14:09:14.676 READ DMA EXT
25 00 08 4e 75 46 e0 00 14:09:14.669 READ DMA EXT

Error 14 occurred at disk power-on lifetime: 4998 hours (208 days + 6 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 b5 75 46 e0 Error: UNC at LBA = 0x004675b5 = 4617653

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 10 a6 75 46 e0 00 14:09:14.685 READ DMA EXT
25 00 08 9e 75 46 e0 00 14:09:14.681 READ DMA EXT
25 00 40 9e f3 46 e0 00 14:09:14.680 READ DMA EXT
25 00 08 4e 75 46 e0 00 14:09:14.676 READ DMA EXT
25 00 08 3e 75 46 e0 00 14:09:14.669 READ DMA EXT

Error 13 occurred at disk power-on lifetime: 4998 hours (208 days + 6 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 b5 75 46 e0 Error: UNC at LBA = 0x004675b5 = 4617653

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 08 ae 75 46 e0 00 14:08:46.386 READ DMA EXT
25 00 10 a6 75 46 e0 00 14:08:42.163 READ DMA EXT
25 00 18 9e 75 46 e0 00 14:08:42.150 READ DMA EXT
25 00 78 b7 a0 63 e0 00 14:08:42.149 READ DMA EXT
25 00 38 e6 3d 46 e0 00 14:08:42.142 READ DMA EXT

Error 12 occurred at disk power-on lifetime: 4998 hours (208 days + 6 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 b5 75 46 e0 Error: UNC at LBA = 0x004675b5 = 4617653

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 10 a6 75 46 e0 00 14:08:46.386 READ DMA EXT
25 00 18 9e 75 46 e0 00 14:08:42.163 READ DMA EXT
25 00 78 b7 a0 63 e0 00 14:08:42.150 READ DMA EXT
25 00 38 e6 3d 46 e0 00 14:08:42.149 READ DMA EXT
25 00 08 de 3d 46 e0 00 14:08:42.142 READ DMA EXT

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

I don't think dd_rescue can help me, since the directory in question, /raid0/mysql/, is on RAID0.
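RAID0 does not prevent imaging /dev/md3 as a whole, and a bad sector on one member maps to a single, computable position on the array. A sketch of that mapping for this array, assuming equal-sized members (hda3 and hdc3 have the same size in the fdisk output), the 64K chunk from `mdadm --detail`, and array data starting at offset 0 of each member, which is the case for 0.90 metadata because its superblock sits at the end of the device. The partition start is derived from fdisk's cylinder units on the assumption that hdc3 is cylinder-aligned (`fdisk -lu` prints it in sectors directly), and the bad LBA in the example is hypothetical.

```python
CHUNK_SECTORS = 64 * 1024 // 512      # "Chunk Size : 64K" -> 128 sectors
MEMBERS = ["/dev/hda3", "/dev/hdc3"]  # RaidDevice 0 and 1


def member_to_array_sector(member_index: int, member_sector: int,
                           ndisks: int = len(MEMBERS),
                           chunk: int = CHUNK_SECTORS) -> int:
    """Map a sector (relative to the start of a RAID0 member) to a sector on
    the md device, assuming equal-sized members and data at offset 0."""
    chunk_on_member, within = divmod(member_sector, chunk)
    array_chunk = chunk_on_member * ndisks + member_index  # chunks alternate between members
    return array_chunk * chunk + within


def array_to_member(array_sector: int, ndisks: int = len(MEMBERS),
                    chunk: int = CHUNK_SECTORS) -> tuple:
    """Inverse mapping: md sector -> (member index, sector on that member)."""
    array_chunk, within = divmod(array_sector, chunk)
    member_chunk, member_index = divmod(array_chunk, ndisks)
    return member_index, member_chunk * chunk + within


# Start of hdc3 in sectors: (start cylinder - 1) * 16065 sectors per cylinder.
HDC3_START = (1370 - 1) * 16065

disk_lba = HDC3_START + 1000  # HYPOTHETICAL bad LBA inside /dev/hdc3
md_sector = member_to_array_sector(1, disk_lba - HDC3_START)
print(md_sector, array_to_member(md_sector))  # 2024 (1, 1000)
```

An LBA from the SMART log that is smaller than the member partition's start does not belong to md3 at all, so it is worth checking which partition a reported sector falls in before mapping it.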

Title: Re: Strange problem with xfs or bad sectors?!
Posted by: Naka on Aug 11, 2011, 10:54

Ha... you really blew it, a database on RAID 0!

Throw that ST3160023A drive away.

These two lines show that there are 20 bad sectors (10 from the first line and another 10 from the second):

Code:

197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       10
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       10

and this line:

Code:

  5 Reallocated_Sector_Ct   0x0033   099   099   036    Pre-fail  Always       -       47

shows that there had already been another 47 bad sectors, which the drive managed to replace with good ones.
The drive started dying a long time ago, and it has about 20,000 hours on it.
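A sketch of pulling the three attributes discussed here out of smartctl's attribute table, assuming the column layout shown above (raw value in the tenth column). It reports them side by side rather than adding them up: on many drives Current_Pending_Sector and Offline_Uncorrectable count overlapping sets of sectors, so the two 10s may be the same 10 sectors, and this output alone cannot tell.

```python
WATCH = {5: "Reallocated_Sector_Ct",
         197: "Current_Pending_Sector",
         198: "Offline_Uncorrectable"}


def parse_smart_attributes(text: str) -> dict:
    """Return {attribute_id: raw_value} from smartctl attribute-table rows."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[0].isdigit() and fields[2].startswith("0x"):
            if fields[9].isdigit():  # skip composite raw values such as "0/0"
                attrs[int(fields[0])] = int(fields[9])
    return attrs


def sector_report(attrs: dict) -> dict:
    """Raw counts of the sector-related attributes, not summed."""
    return {name: attrs.get(attr_id, 0) for attr_id, name in WATCH.items()}


hdc_rows = """\
  5 Reallocated_Sector_Ct   0x0033   099   099   036    Pre-fail  Always       -       47
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       10
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       10
"""
print(sector_report(parse_smart_attributes(hdc_rows)))
# {'Reallocated_Sector_Ct': 47, 'Current_Pending_Sector': 10, 'Offline_Uncorrectable': 10}
```

The text can also come straight from the tool, e.g. `subprocess.run(["smartctl", "-A", "/dev/hdc"], capture_output=True, text=True).stdout`.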

Title: Re: Strange problem with xfs or bad sectors?!
Posted by: amarth on Aug 11, 2011, 12:10

And is there any way to recover the data?

Title: Re: Strange problem with xfs or bad sectors?!
Posted by: Naka on Aug 11, 2011, 15:03

No.
Where it won't read (where a file falls on a bad sector) there is nothing to be done; that information is already lost.

But the files that sit on good sectors can be archived or copied elsewhere (to another medium outside the array).

The only way out is to copy every file that can still be read to another hard drive, or to archive them.
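A sketch of that salvage step: walk the tree, copy each file with `shutil.copy2`, and record rather than abort on anything that fails with an I/O error, deleting any truncated copy. `salvage_tree` is a hypothetical helper, not an existing tool. In this thread listing /raid0/mysql/ itself already fails with EIO, so a file-level copy like this can only reach files in directories that still list.

```python
import os
import shutil


def salvage_tree(src_root: str, dst_root: str) -> list:
    """Copy every file under src_root that can still be read to dst_root,
    keeping the relative layout. Returns (path, reason) pairs for files and
    directories that failed with an OSError such as EIO."""
    failed = []

    def on_walk_error(err: OSError) -> None:
        # os.walk calls this when a directory cannot be listed.
        failed.append((err.filename, err.strerror))

    for dirpath, _dirnames, filenames in os.walk(src_root, onerror=on_walk_error):
        target_dir = os.path.join(dst_root, os.path.relpath(dirpath, src_root))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(target_dir, name)
            try:
                shutil.copy2(src, dst)  # data plus permission bits and timestamps
            except OSError as exc:
                failed.append((src, exc.strerror))
                if os.path.exists(dst):
                    os.remove(dst)  # don't keep a truncated copy
    return failed
```

For example, `failed = salvage_tree("/raid0", "/mnt/spare/raid0")`, where the destination path is only an illustration and must be on a disk outside the array.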

-----------------
I wonder why it hasn't kicked the drive out of the array yet. But I don't know how RAID 0 behaves here. Maybe with RAID 0 a drive isn't kicked out for bad sectors, because then you'd be left with no data at all. ???

Title: Re: Strange problem with xfs or bad sectors?!
Posted by: vyrgozunqk on Aug 11, 2011, 17:29

Try opening the disk through "mc". As far as I remember it copies bit by bit and will simply skip the "broken bits" on the bad sectors. I've rescued photos and other things this way several times, although afterwards some of the photos only rendered partially, say up to the middle... Nothing stops you from trying...

Title: Re: Strange problem with xfs or bad sectors?!
Posted by: laskov on Aug 11, 2011, 22:16

http://www.gnu.org/software/ddrescue/manual/ddrescue_manual.html

Title: Re: Strange problem with xfs or bad sectors?!
Posted by: amarth on Aug 12, 2011, 00:56

No luck with "mc"... I just started ddrescue; we'll see the result in the morning.
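If GNU ddrescue was given a mapfile (called a logfile in older versions), that file records how much was recovered. A sketch that totals a mapfile by status, assuming the layout described in the manual: `#` comment lines, one status line, then `pos size status` lines with hex values. In the manual, `+` marks finished blocks, `-` blocks given up on as bad sectors, and `?`, `*` and `/` areas not yet fully tried.

```python
def summarize_mapfile(text: str) -> dict:
    """Total bytes per status character in a GNU ddrescue mapfile."""
    totals = {}
    status_line_seen = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # comments and blank lines
        fields = line.split()
        if not status_line_seen:
            status_line_seen = True  # "current_pos current_status [current_pass]"
            continue
        size, status = int(fields[1], 0), fields[2]  # hex values such as 0x00001000
        totals[status] = totals.get(status, 0) + size
    return totals


def bad_bytes(totals: dict) -> int:
    # '-' marks blocks that ddrescue could not read (bad sectors).
    return totals.get("-", 0)
```

Usage would be `bad_bytes(summarize_mapfile(open("md3.map").read()))`, where the mapfile name is only an example; dd_rescue, the different tool used in the next post, writes no such file.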

Title: Re: Strange problem with xfs or bad sectors?!
Posted by: amarth on Aug 15, 2011, 14:44

I made a copy of /dev/md3 with dd_rescue, and then tried to recover the data with testdisk, photorec and mondo rescue...
unfortunately, without success :(

Linux за българи (Linux for Bulgarians): Forums
