OpenMPI in SGE fails when not observed

I know the title is weird, but so is my problem. On our cluster we have SGE with OpenMPI compiled for tight integration. When I set it up, it worked fine in my tests, and there were no complaints until recently. The thing is: when I submit a job using the OpenMPI PE and run my binary with mpirun, it fails.

The error messages look like this:

fully.qualified.host.name - daemon did not report back when launched

and

[hostname:\d{5}] [[63730,0],\d{1,2}] routed:binomial: Connection to lifeline [[63730,0],0] lost

This happens even for something as simple as mpirun -np 40 --pernode hostname

Now here’s the weird part: if I turn on verbose output for plm_base, it works. That is, mpirun -np 40 --mca plm_base_verbose 5 --pernode hostname does work! The loads of debugging output this produces on stderr contain no indication of a problem whatsoever.

I’ve tried this multiple times and can always reproduce it, so I’m quite sure this isn’t just a fluke. The problem is that I’m quite puzzled now.

I’m certainly missing something, so here are my questions:

  1. Does setting the verbosity in this case also silently set other parameters?
  2. What else could cause this weird behaviour?

Best Regards.

Edit: configuration of relevant PE:

pe_name           ompi-gcc
slots             2000
user_lists        NONE
xuser_lists       NONE
start_proc_args   /bin/true
stop_proc_args    /bin/true
allocation_rule   $round_robin
control_slaves    TRUE
job_is_first_task FALSE
urgency_slots     min

Nothing fancy there… OpenMPI is compiled for tight integration and detects everything it needs. Nevertheless, it doesn’t work with qrsh, i.e. it only works when qrsh is disabled in favor of rsh.

Answer

Never mind. After some trial and error with the other plm parameters, I found that setting plm_rsh_disable_qrsh fixes the problem. However, that doesn’t explain why setting plm_base_verbose to something other than 0 also fixed the problem. This is the part I still don’t get.
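
For reference, here is a minimal sketch of how the workaround might look in a job script. The PE name ompi-gcc and the mpirun arguments are taken from the question. The slot count, the -cwd option and the overall script layout are assumptions for illustration, not the exact script used here. The MCA parameter can be set either through the environment (Open MPI reads OMPI_MCA_<name> variables) or with --mca on the command line; the sketch shows both.

```shell
#!/bin/sh
# Hypothetical SGE job script. The PE name comes from the question;
# the slot count and -cwd are assumptions.
#$ -pe ompi-gcc 40
#$ -cwd

# Option 1: set the MCA parameter through the environment.
# Open MPI treats OMPI_MCA_<name> as the MCA parameter <name>.
export OMPI_MCA_plm_rsh_disable_qrsh=1

# Option 2: pass the same parameter on the mpirun command line.
# SGE sets PE_HOSTFILE inside a parallel job, so this guard keeps the
# launch from running when the script is executed outside the scheduler.
if [ -n "$PE_HOSTFILE" ] && command -v mpirun >/dev/null 2>&1; then
    mpirun -np 40 --mca plm_rsh_disable_qrsh 1 --pernode hostname
fi
```

Keep in mind that with qrsh disabled the remote daemons are started via plain rsh/ssh, so the nodes need to accept those logins without a password, and the remote processes are no longer started under SGE’s control.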

Attribution
Source : Link , Question Author : luxifer , Answer Author : luxifer
