Topic: Slow failover (read 3292 times)

K1r0 · « -: Feb 19, 2014, 15:52 »

I have two 460 G6 blades.

QLogic mezzanine

They are connected to two EVAs, one in one DC and the other in another.

We raised the buffer on the interconnect and two of the four active paths went down.

The problem: there is no failover to the remaining two paths:

Feb 19 09:22:36 xxxx kernel: qla2xxx 0000:06:00.0: scsi(1:7:2): Abort command issued -- 1 120b9f352 2002.
Feb 19 09:22:37 xxxx kernel: qla2xxx 0000:06:00.0: scsi(1:5:13): Abort command issued -- 1 120b9f354 2002.
Feb 19 09:27:38 xxxx kernel: INFO: task multipathd:12624 blocked for more than 120 seconds.
Feb 19 09:27:38 xxxx kernel: INFO: task multipathd:13946 blocked for more than 120 seconds.
Feb 19 09:27:38 xxxx kernel: INFO: task multipathd:13947 blocked for more than 120 seconds.
Feb 19 09:27:38 xxxx kernel: INFO: task multipathd:13948 blocked for more than 120 seconds.
Feb 19 09:27:38 xxxx kernel: INFO: task multipathd:13950 blocked for more than 120 seconds.
Feb 19 09:27:38 xxxx kernel: INFO: task multipathd:13951 blocked for more than 120 seconds.
Feb 19 09:27:38 xxxx kernel: INFO: task oracle:9114 blocked for more than 120 seconds.
Feb 19 09:29:02 xxxx kernel: qla2xxx 0000:06:00.0: scsi(1:7:14): Abort command issued -- 1 120bc2d8f 2002.
...
Feb 19 09:29:07 xxxx kernel: qla2xxx 0000:06:00.0: scsi(1:7:2): Abort command issued -- 1 120bc2d42 2002.
Feb 19 09:29:08 xxxx kernel: qla2xxx 0000:06:00.0: scsi(1:7:2): Abort command issued -- 1 120bc2d49 2002.
Feb 19 09:29:09 xxxx kernel: qla2xxx 0000:06:00.0: scsi(1:7:2): Abort command issued -- 1 120bc2d50 2002.
....
Feb 19 09:29:25 xxxx kernel: qla2xxx 0000:06:00.0: scsi(1:6:14): Abort command issued -- 1 120bc2d9a 2002.
Feb 19 09:29:25 xxxx kernel: sd 1:0:7:14: timing out command, waited 10s
Feb 19 09:29:25 xxxx kernel: sd 1:0:7:13: timing out command, waited 10s
Feb 19 09:29:25 xxxx kernel: sd 1:0:7:5: timing out command, waited 10s
Feb 19 09:29:25 xxxx kernel: sd 1:0:7:2: timing out command, waited 10s
Feb 19 09:29:25 xxxx kernel: sd 1:0:7:1: timing out command, waited 10s
Feb 19 09:29:25 xxxx kernel: sd 1:0:6:14: timing out command, waited 300s
Feb 19 09:29:25 xxxx multipathd: EVA20_220_SAVE_001: load table [0 629145600 multipath 1 queue_if_no_path 0 2 1 round-robin 0 4 1 68:48 100 67:80 100 65:176 100 8:144 100 round-robin 0 4 1 67:192 100 66:224 100 65:80 100 8:240 100]
Feb 19 09:39:19 xxxx kernel: qla2xxx 0000:06:00.0: scsi(1:6:13): Abort command issued -- 1 120bc4f4b 2002.
Feb 19 09:41:35 xxxx kernel: qla2xxx 0000:06:00.0: scsi(1:6:13): Abort command issued -- 1 120bc54af 2002.
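
A side note on the `load table` line in the log above: it is the device-mapper table for the map, and decoding it shows what the kernel was actually told. Below is a minimal Python sketch that splits it according to the dm-multipath table layout (feature count and features, hardware-handler count, number of priority groups, then selector, path count and per-path arguments for each group). The function name `parse_multipath_table` is made up for illustration; it is not part of multipath-tools.

```python
# The table string is copied from the "load table" log line above.
TABLE = ("0 629145600 multipath 1 queue_if_no_path 0 2 1 "
         "round-robin 0 4 1 68:48 100 67:80 100 65:176 100 8:144 100 "
         "round-robin 0 4 1 67:192 100 66:224 100 65:80 100 8:240 100")


def parse_multipath_table(table):
    """Split a dm-multipath table line into its fields (illustrative helper)."""
    tok = table.split()
    if tok[2] != "multipath":
        raise ValueError("not a multipath target")
    pos = 3

    def take(n):
        # Consume the next n tokens and return them as a list.
        nonlocal pos
        chunk = tok[pos:pos + n]
        pos += n
        return chunk

    features = take(int(take(1)[0]))
    hw_handler = take(int(take(1)[0]))
    n_groups, first_group = (int(x) for x in take(2))
    groups = []
    for _ in range(n_groups):
        selector = take(1)[0]
        take(int(take(1)[0]))  # skip the selector arguments
        n_paths, n_path_args = (int(x) for x in take(2))
        paths = []
        for _ in range(n_paths):
            paths.append((take(1)[0], take(n_path_args)))
        groups.append({"selector": selector, "paths": paths})
    return {
        "size_gib": int(tok[1]) * 512 / 2**30,  # table lengths are 512-byte sectors
        "features": features,
        "hw_handler": hw_handler,
        "first_group": first_group,
        "groups": groups,
    }


info = parse_multipath_table(TABLE)
print(info["size_gib"], info["features"], len(info["groups"]))
```

For this map it reports a 300 GiB device with `queue_if_no_path` set and two groups of four paths each, so I/O to it is queued rather than failed once every path is down.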

My damn question is:

Feb 19 09:29:25 xxxxx kernel: sd 1:0:6:14: timing out command, waited 300s

Why is this crap there?!?

According to the documentation, this is how much time I should have for failover:

time = (no_path_retry + 1) * polling_interval

no_path_retry
A numeric value for this attribute specifies the number of times the system should attempt to use a failed path before disabling queueing.
A value of fail indicates immediate failure, without queueing.
A value of queue indicates that queueing should not stop until the path is fixed.
The default value is 0.
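
To put numbers on that formula, here is a small Python sketch that applies it to the values from the configuration quoted further down in this post. It only encodes the formula as written above plus the `fail` and `queue` cases from the quoted documentation; the function name `queueing_window` is made up for illustration.

```python
def queueing_window(no_path_retry, polling_interval):
    """Time budget in seconds per the formula above.

    Returns 0 for "fail" (no queueing) and None for "queue" (no limit).
    """
    if no_path_retry == "fail":
        return 0
    if no_path_retry == "queue":
        return None
    return (int(no_path_retry) + 1) * polling_interval


POLLING_INTERVAL = 10  # polling_interval from the defaults section

for label, retry in [("defaults", 12), ("HP EVA device", 18), ("quorum disks", "fail")]:
    print(label, queueing_window(retry, POLLING_INTERVAL))
```

That works out to 130 s for the defaults, 190 s for the EVA device section and 0 s for the quorum disks, and none of those matches the 300 s in the kernel log.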

The multipath configuration is:

defaults {
udev_dir /dev
polling_interval 10
fast_io_fail_tmo 5
dev_loss_tmo 65
checker_timeout 10
path_selector "round-robin 0"
path_grouping_policy multibus
getuid_callout "/sbin/scsi_id -g -u -s /block/%n"
prio const
path_checker tur
rr_min_io 100
rr_min_io_rq 3
rr_weight uniform
failback immediate
no_path_retry 12
user_friendly_names yes
}

#For EVA4x00/EVA6x00/EVA8x00/P6300/P6500
device {
vendor "HP"
product "HSV2[01]0|HSV3[046]0|HSV4[05]0"
getuid_callout "/sbin/scsi_id -g -u -s /block/%n"
prio alua
hardware_handler "0"
path_selector "round-robin 0"
path_grouping_policy group_by_prio
failback immediate
rr_weight uniform
rr_min_io 100
no_path_retry 18
path_checker tur

}

These are the quorum disks:

   multipath {
      wwid   zzzz
      alias  EVA20_db_quorum_1
      no_path_retry fail
   }

   multipath {
      wwid   yyyy
      alias  EVA21_db_quorum_2
      no_path_retry fail
   }

   multipath {
      wwid   xxxx
      alias  EVA12_db_quorum_3
      no_path_retry fail
   }

Here is some info for one of the EVA disks:

multipath -ll xxxx
EVA21_db_quorum_2 (xxxx) dm-0 HP,HSV450
size=2.0G features='0' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=50 status=active
| |- 2:0:11:13 sdcj 69:112 active ready running
| |- 2:0:9:13 sdcb 68:240 active ready running
| |- 1:0:9:13 sdak 66:64 active ready running
| `- 1:0:8:13 sdaf 65:240 active ready running
`-+- policy='round-robin 0' prio=10 status=enabled
  |- 2:0:10:13 sdcf 69:48 active ready running
  |- 2:0:8:13 sdbx 68:176 active ready running
  |- 1:0:11:13 sdbw 68:160 active ready running
  `- 1:0:10:13 sdbk 67:224 active ready running

And here is how my database died:

2014-02-19 09:20:29.054
[cssd(925)]CRS-1615:No I/O has completed after 50% of the maximum interval. Voting file ORCL:EVA20_db_QUORUM_1 will be considered not functional in 99330 milliseconds
2014-02-19 09:22:23.620
[cssd(925)]CRS-1615:No I/O has completed after 50% of the maximum interval. Voting file ORCL:EVA20_db_QUORUM_1 will be considered not functional in 99560 milliseconds
2014-02-19 09:26:25.890
[cssd(925)]CRS-1615:No I/O has completed after 50% of the maximum interval. Voting file ORCL:EVA20_db_QUORUM_1 will be considered not functional in 99720 milliseconds
2014-02-19 09:26:29.894
[cssd(925)]CRS-1615:No I/O has completed after 50% of the maximum interval. Voting file ORCL:EVA12_db_QUORUM_3 will be considered not functional in 99570 milliseconds
2014-02-19 09:26:29.894
[cssd(925)]CRS-1615:No I/O has completed after 50% of the maximum interval. Voting file ORCL:EVA21_db_QUORUM_2 will be considered not functional in 99600 milliseconds
2014-02-19 09:27:16.566
[cssd(925)]CRS-1614:No I/O has completed after 75% of the maximum interval. Voting file ORCL:EVA20_db_QUORUM_1 will be considered not functional in 49050 milliseconds
2014-02-19 09:27:19.951
[cssd(925)]CRS-1614:No I/O has completed after 75% of the maximum interval. Voting file ORCL:EVA12_db_QUORUM_3 will be considered not functional in 49510 milliseconds
2014-02-19 09:27:19.952
[cssd(925)]CRS-1614:No I/O has completed after 75% of the maximum interval. Voting file ORCL:EVA21_db_QUORUM_2 will be considered not functional in 49540 milliseconds
2014-02-19 09:27:45.980
[cssd(925)]CRS-1613:No I/O has completed after 90% of the maximum interval. Voting file ORCL:EVA20_db_QUORUM_1 will be considered not functional in 19640 milliseconds
2014-02-19 09:27:49.972
[cssd(925)]CRS-1613:No I/O has completed after 90% of the maximum interval. Voting file ORCL:EVA12_db_QUORUM_3 will be considered not functional in 19500 milliseconds
2014-02-19 09:27:49.972
[cssd(925)]CRS-1613:No I/O has completed after 90% of the maximum interval. Voting file ORCL:EVA21_db_QUORUM_2 will be considered not functional in 19530 milliseconds
2014-02-19 09:28:05.836
[cssd(925)]CRS-1604:CSSD voting file is offline: ORCL:EVA20_db_QUORUM_1; details at (:CSSNM00058:) in /opt/clusterware/11.2.0.3/grid/log/xxxx/cssd/ocssd.log.
2014-02-19 09:28:10.007
[cssd(925)]CRS-1604:CSSD voting file is offline: ORCL:EVA12_db_QUORUM_3; details at (:CSSNM00058:) in /opt/clusterware/11.2.0.3/grid/log/xxxx/cssd/ocssd.log.
2014-02-19 09:28:10.007
[cssd(925)]CRS-1604:CSSD voting file is offline: ORCL:EVA21_db_QUORUM_2; details at (:CSSNM00058:) in /opt/clusterware/11.2.0.3/grid/log/xxxx/cssd/ocssd.log.
2014-02-19 09:28:10.008
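
A quick cross-check on the CSS messages above: each warning gives the percentage of the maximum interval already elapsed and the time remaining, so the maximum interval itself can be estimated from any one of them. The Python sketch below does that arithmetic for EVA20_db_QUORUM_1; `estimate_max_interval_ms` is a hypothetical helper, and the idea that the warnings scale linearly with the interval is an assumption, not something taken from the Oracle documentation.

```python
def estimate_max_interval_ms(percent_elapsed, remaining_ms):
    """Estimate the full interval from the elapsed percentage and the time left."""
    return remaining_ms * 100 / (100 - percent_elapsed)


# (percent elapsed, remaining ms) for ORCL:EVA20_db_QUORUM_1, from the
# CRS-1615 (09:26:25), CRS-1614 and CRS-1613 messages in the log above.
WARNINGS = [(50, 99720), (75, 49050), (90, 19640)]

estimates = [estimate_max_interval_ms(p, r) for p, r in WARNINGS]
for (percent, _), est in zip(WARNINGS, estimates):
    print(f"{percent}% warning -> about {est / 1000:.1f} s")
```

All three estimates land just under 200 s. That is shorter than the 300 s command timeout in the kernel log, so an I/O stuck for the full 300 s would outlast the voting-file limit.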

gat3way · « **Reply #1 -:** Feb 19, 2014, 16:54 »

If you switch it from multibus to failover, does it work?

K1r0 · « **Reply #2 -:** Feb 19, 2014, 17:19 »

Only the default is multibus. As you can see from multipath -ll, the paths are grouped by ALUA (I have 4 optimized and 4 non-optimized paths), which comes from the EVA device configuration.

gat3way · « **Reply #3 -:** Feb 19, 2014, 17:37 »

Ah, right. Well, I don't know, I haven't touched SANs in 4 years. See whether some retry parameter can be lowered. Also, are the drivers the stock Linux ones or the ones from the PSP? I have some hazy memories from back in the day that we installed the ones from the ProLiant Support Pack because the Red Hat ones misbehaved, but that was long ago and I don't remember what the drama was.

K1r0 · « **Reply #4 -:** Feb 19, 2014, 17:57 »

Unfortunately the machine runs Oracle Linux, which means the SPP (that is what they call it now, Service Pack for ProLiant) can only install firmware, not drivers.

The drivers come from Oracle (i.e. with the kernel).

Two thoughts come to mind now:

a) qla2xxx does not report the Abort command as an error to the layer above. So, as I understand it, whatever I tune will make no difference.
b) I decided to run a test: I started a dd from a disk and:

1) dd ran for a few seconds
2) I started an endless relogin loop on one HBA port
3) dd stalled for exactly 5 seconds (fast_io_fail_tmo 5)
4) the OS reported a SCSI error
5) dm-multipath marked one of the paths as failed
6) dd carried on over the remaining 2 paths (2 paths through two ports; one of the ports drops out of the fabric)

I think something in the driver is off. Since I only recently patched up to OL 5.10 (read: RHEL 5.10), I did not expect this...

But thinking about it, it makes sense: if no error is reported upward, fast_io_fail_tmo never kicks in.

HOWEVER!

In that case the path checker should run every polling_interval seconds, and if it fails no_path_retry times the path should be marked dead. Instead, multipathd blocks in D state and the largest possible timeout is used: 300s = 5 attempts of dev_timeout.

That is super dumb!

P.S. Unfortunately we cannot repeat this morning's stunt, because they would kill us (best-case scenario).

gat3way · « **Reply #5 -:** Feb 19, 2014, 19:32 »

In my day, to install the PSP on CentOS it was enough to simply tweak /etc/redhat-release. I don't know how far the Oracle folks have strayed from the Red Hat way, if they have at all. On a live system, though, that is a pretty risky experiment.

K1r0 · « **Reply #6 -:** Feb 24, 2014, 12:51 »

Hi, me again...

So I was poking around multipathing this Saturday and thought, let me look at the code a bit (OpenSource overrated...)

And I made the following map of the Red Hat-patched version 0.4.9-64.0.6.

(It's not very clear... but I'm not a dev)

------------------------ MULTIPATHD - USERSPACE LEVEL ----------------------

Source file: multipathd/main.c

Routine check_path:
   Calls get_state

   if new state differs from old state:

      Print the message from the path checker ( why is there a change? )
      If the new state is "OFFLINE":
         Tell device-mapper to mark the path as failed

      If the new state is "ONLINE":
         Tell device-mapper to mark the path as up


   If there was no change in the path status:

      If state is ONLINE:
         Double the time for the next_check until it reaches max_checkint(4*time)

      If state is OFFLINE:
         Put in the log the report from the path checker



   Calls update_prio - to update path priority.

   Updates priority groups

   If failover is required and configured in multipath.conf, it is performed.

Routine update_prio:

   If all paths in the priority group should be updated:
      #This is done only when a new path is up.
      Call pathinfo for each path in the priority group of the provided path

   Call pathinfo for the provided path.
   pathinfo(pp, conf->hwtable, DI_PRIO);



Source file: libmultipath/discovery.c

Routine: get_state:
   if there is no assigned checker for the path:
      Initialize checker

   if no checker_timeout has been configured in multipath.conf:
      Sets checker timeout to the SCSI timeout for the particular device.


   Calls checker_check routine

Routine: pathinfo:

   Fetches information about the provided path from sysfs

   If priority grouping policy is used and the path is up or its priority is unknown:
      #Means that we will try to get the priority of a path even if it is down.
      Call get_prio



Routine get_prio:
   if path is down(path_offline(pp) == PATH_DOWN), return failure

   If there is no priority callout, then select one

   Call prio_getprio with corresponding priority_callout: prio_getprio(pp->prio, pp);


Source file: libmultipath/prio.c

Routine: add_prio
   Initialize priority callout object
   Resolve routines from library

Routine: prio_getprio:
   Calls library routine getprio

Source file: libmultipath/checkers.c

Routine: add_checker:
   Creates a structure for the checker object
   Resolves routines from library

Routine checker_check:
   Calls library routine libcheck_check

Source file: libmultipath/checkers/tur.c

Routine: libcheck_check:
   Checks if checker supports sync or async mode

   if sync:
      Calls tur_check
      Returns path status

   if async:
      Checks if thread is still running
      if yes:
         return the path status
      if no:
         Set-up new thread
         Assign thread completion time = checker_timeout
         New thread calls tur_check


Routine: tur_check:
   Does TEST UNIT READY SPC3 command ( SCSI Timeout = checker_timeout )
   Returns path status

Source file: libmultipath/prioritizers/alua.c

Routine: getprio
   Calls get_target_port_group_support
   Calls get_target_port_group
   Calls get_asymmetric_access_state
   Analyses the output

Source file: libmultipath/prioritizers/alua_rtpg.c

Routine do_inquiry
   Timeout ( 300s - as coded )
   ioctl(fd, SG_IO, &hdr)

Routine do_rtpg
   Timeout ( 300s - as coded )
   ioctl(fd, SG_IO, &hdr)

Interestingly, multipathd checks the paths asynchronously (the check_path routine in main.c), so it should not block because of tur.

With the patches RHBZ-725541-01-async-tur-checker.patch and RHBZ-565933-checker-timeout.patch, they should have solved the problem.

But apparently not quite...

checkerloop in main.c calls check_path, which, when ALUA is in use, also calls pathinfo, and that can block, and how! The SCSI timeout there is hard-coded to 300s and cannot be changed.
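
The reasoning above can be illustrated with a toy model of one pass of the checker loop. This is a Python sketch under stated assumptions, not the multipathd code: it assumes the asynchronous TUR check returns to the loop immediately, a synchronous TUR check waits up to checker_timeout, and the ALUA priority call for a hung path waits for the full hard-coded 300 s. All names (`pass_duration`, `ALUA_SGIO_TIMEOUT`) are made up for illustration.

```python
ALUA_SGIO_TIMEOUT = 300  # seconds; the hard-coded value reported in the code map above


def pass_duration(hung_flags, checker_timeout, async_tur, alua_prio):
    """Simulated seconds the loop spends on one pass over all paths."""
    elapsed = 0
    for hung in hung_flags:
        if not hung:
            continue  # healthy paths answer at once in this model
        if not async_tur:
            elapsed += checker_timeout  # synchronous TUR waits for its timeout
        if alua_prio:
            elapsed += ALUA_SGIO_TIMEOUT  # the priority call blocks the loop
    return elapsed


paths = [True, False, False, False]  # one hung path out of four

print(pass_duration(paths, 10, async_tur=False, alua_prio=False))
print(pass_duration(paths, 10, async_tur=True, alua_prio=False))
print(pass_duration(paths, 10, async_tur=True, alua_prio=True))
```

In this model the async checker removes the TUR wait, but a single hung path still holds the loop for 300 s once the ALUA call is involved. That would be consistent with the "waited 300s" line in the kernel log, though the model alone does not prove it.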

On the other hand, fast_io_fail_tmo (which would have saved us...) only acts after the fact: the HBA applies it only once it decides something is wrong and logs out of the remote port.

But what happens if, instead of losing the connection to the port, there is a performance problem?

Say the credits on the DC interconnect run out and I get higher latency on some path... and the queues start filling up...

It sucks that I have too little hardware to test on.

I forgot to ask my question:

Could someone who has Red Hat and an EVA test this by limiting the IOPS? (If you have, say, 2 paths and use a cgroup to limit one of them to, say, 10 IOPS, then a dd should block that path, but multipathd will not fail over.) Both of my machines are too important to play with.

Or could someone look at the multipathd code and confirm the bug?
