1 reply Latest Post - ‏2012-12-12T14:38:53Z by SystemAdmin
SystemAdmin

Pinned topic Adapters down in llstatus -a

2012-12-11T11:43:07Z
When I issue the llstatus -a command on our cluster, all the network adapters are shown in the "ErrDown" state, e.g.:

...

comp01.zilina.vs.savba.sk
iba0(InfiniBand,,,,-1,0,0 rCxt Blks,0,NOT READY)
network1833865768265265971s(striped,,,,-1,0/0,0 rCxt Blks,0,ErrDown)
network18338657682652659712(aggregate,,,,-1,0/0,0 rCxt Blks,0,ErrDown)
ib1(InfiniBand,comp01-ib1,192.168.102.1,,-1,64,0 rCxt Blks,0,ErrDown,2)
ib0(InfiniBand,comp01-ib0,192.168.101.1,,-1,64,0 rCxt Blks,0,ErrDown,1)
en0(ethernet,comp01,172.20.101.1,,ErrDown)
en1(ethernet,comp01-en1,10.0.0.1,,ErrDown)

...
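
For a quick overview of which adapters are affected on each node, the adapter lines above can be summarized with a short script. This is only a minimal sketch: it assumes the state is the last comma-separated field that is not purely numeric (the ib0/ib1 entries carry a trailing number after the state), which matches the sample shown here but is not taken from any LoadLeveler documentation. The function name adapter_states is made up for illustration.

```python
import re

# Matches lines of the form name(field,field,...), as in the llstatus -a sample.
ADAPTER_LINE = re.compile(r"^\s*(?P<name>[\w.]+)\((?P<fields>.*)\)\s*$")

SAMPLE = """\
comp01.zilina.vs.savba.sk
iba0(InfiniBand,,,,-1,0,0 rCxt Blks,0,NOT READY)
ib1(InfiniBand,comp01-ib1,192.168.102.1,,-1,64,0 rCxt Blks,0,ErrDown,2)
ib0(InfiniBand,comp01-ib0,192.168.101.1,,-1,64,0 rCxt Blks,0,ErrDown,1)
en0(ethernet,comp01,172.20.101.1,,ErrDown)
en1(ethernet,comp01-en1,10.0.0.1,,ErrDown)
"""

def adapter_states(llstatus_output):
    """Return {adapter_name: state} for every adapter line in the text."""
    states = {}
    for line in llstatus_output.splitlines():
        match = ADAPTER_LINE.match(line)
        if not match:
            continue  # host names, "..." and blank lines are skipped
        fields = [f.strip() for f in match.group("fields").split(",")]
        # Assumption: trailing purely numeric fields are not the state.
        while fields and fields[-1].isdigit():
            fields.pop()
        if fields:
            states[match.group("name")] = fields[-1]
    return states

for name, state in adapter_states(SAMPLE).items():
    print(name, state)
```

Saving the real command output to a file and passing its contents to adapter_states would give the same kind of per-adapter listing for the whole cluster.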

I suspect PNSD is causing this, as can be seen in the StartLog files, for example:

12/11 11:43:07 TI-5 HB: LlAdapterConfig::getAdapterPNSD(LlAdapterConfigListPtr): calling nrt_query_adapter_names(...) with adapter_type=2.
12/11 11:43:07 TI-5 HB: LlAdapterConfig::getAdapterPNSD(LlAdapterConfigListPtr): max_windows of this type adapters = 0
12/11 11:43:07 TI-5 HB: LlAdapterConfig::getAdapterPNSD(LlAdapterConfigListPtr): calling nrt_query_adapter_info(...) with adapter_type=2, adapter_name=en0.
12/11 11:43:07 TI-5 LlAdapterConfig::getAdapterPNSD(LlAdapterConfigListPtr): HB: num_ports =1
12/11 11:43:07 TI-5 LlAdapterConfig::getAdapterPNSD(LlAdapterConfigListPtr): HB: PNSD reported value for if_name(en0)
12/11 11:43:07 TI-5 LlAdapterConfig::getAdapterPNSD(LlAdapterConfigListPtr): HB: PNSD: adapter state is 0, interface state is 0.
12/11 11:43:07 TI-5 LlAdapterConfig::getAdapterPNSD(LlAdapterConfigListPtr): HB: PNSD: adapter state finally is 0.
12/11 11:43:07 TI-5 LlAdapterConfig::getAdapterPNSD(LlAdapterConfigListPtr): HB: Find adapter config idx =0 , if_name =en0
12/11 11:43:07 TI-5 LlAdapterConfig::getAdapterPNSD(LlAdapterConfigListPtr): Removing adapter en0 from adapter config list.
12/11 11:43:07 TI-5 HB: LlAdapterConfig::getAdapterPNSD(LlAdapterConfigListPtr): adapter: en0 IOCTL_status: 1, PNSD_status: 0
12/11 11:43:07 TI-5 HB: LlAdapterConfig::getAdapterPNSD(LlAdapterConfigListPtr): adapter: en0, FINAL_status: 0
12/11 11:43:07 TI-5 PNSD en0
12/11 11:43:07 TI-5
adapter_name = en0
device_name = en0
adapter_type = 2
opstate = 0
adapter_ipv4_addr = 172.20.101.1
adapter_ipv4_netmask = 255.255.0.0
adapter_ipv6_addr = ::
adapter_ipv6_netmask = ::
* port_number = 1
* logical_id = 0
* special = 0
* network_id = 0
* node_number = -1
* rcontext_block_count = 0
* window_count = 0
* window_list = []
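
The excerpt above shows IOCTL_status: 1 next to PNSD_status: 0 and FINAL_status: 0 for en0, so the ioctl check and PNSD appear to disagree. A small script can collect these three values per adapter from a StartLog file. This is a hedged sketch: the regular expressions are derived only from the two log lines quoted here, the meaning of the values (1 as up, 0 as down) is an assumption, and the function name adapter_status_summary is hypothetical.

```python
import re

# Patterns derived from the two status lines in the log excerpt.
STATUS_RE = re.compile(
    r"adapter:\s*(\w+),?\s+IOCTL_status:\s*(\d+),\s*PNSD_status:\s*(\d+)"
)
FINAL_RE = re.compile(r"adapter:\s*(\w+),?\s+FINAL_status:\s*(\d+)")

SAMPLE_LOG = """\
12/11 11:43:07 TI-5 HB: LlAdapterConfig::getAdapterPNSD(LlAdapterConfigListPtr): adapter: en0 IOCTL_status: 1, PNSD_status: 0
12/11 11:43:07 TI-5 HB: LlAdapterConfig::getAdapterPNSD(LlAdapterConfigListPtr): adapter: en0, FINAL_status: 0
"""

def adapter_status_summary(log_text):
    """Return {adapter: {"ioctl": n, "pnsd": n, "final": n}} from log text."""
    summary = {}
    for line in log_text.splitlines():
        match = STATUS_RE.search(line)
        if match:
            entry = summary.setdefault(match.group(1), {})
            entry["ioctl"] = int(match.group(2))
            entry["pnsd"] = int(match.group(3))
            continue
        match = FINAL_RE.search(line)
        if match:
            summary.setdefault(match.group(1), {})["final"] = int(match.group(2))
    return summary

for name, status in adapter_status_summary(SAMPLE_LOG).items():
    # Flag adapters where the ioctl result and the PNSD result differ.
    note = "ioctl/PNSD disagree" if status.get("ioctl") != status.get("pnsd") else ""
    print(name, status, note)
```

If every adapter shows the same pattern as en0, that would be consistent with the suspicion that the PNSD query, rather than the interfaces themselves, is what marks the adapters as down.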

Does anybody know what one might do to fix this?

All the best,

Lukas Demovic
Updated on 2012-12-12T14:38:53Z by SystemAdmin
  • SystemAdmin

    Re: Adapters down in llstatus -a

    ‏2012-12-12T14:38:53Z  in response to SystemAdmin
    OK, so I have an update now.

    After updating the rsct.lapi.rte package to level 3.1.6.7, the adapters shown by llstatus -a changed to the READY state.

    However, when I try to run a parallel job in US (user space) mode, I get the following error messages:

    ATTENTION: 0031-408 2 tasks allocated by Resource Manager, continuing...
    MPI-LAPI ERROR: lapi_init() failed with rc(680)
    ERROR: 0031-309 Connect failed during message passing initialization, task 1, reason: 680 Communication subsystem internal error: OpenIB verbs failure.
    ERROR: 0031-007 Error initializing communication subsystem: return code -1
    ERROR: 0031-300 Forcing all remote tasks to exit due to exit code 1 in task 1
    MPI-LAPI ERROR: lapi_init() failed with rc(680)
    ERROR: 0031-309 Connect failed during message passing initialization, task 0, reason: 680 Communication subsystem internal error: OpenIB verbs failure.
    ERROR: 0031-007 Error initializing communication subsystem: return code -1
    ERROR: 0031-250 task 0: Terminated

    Does anybody have an idea what might cause this?

    Thanks.