MPAS | HPC file system failure or MPI failure
When running simulations on a new cluster, jobs that span multiple nodes fail.
- The scheduler reports two idle nodes: node up infinite 2 idle node-[1,2]
- A job submitted on 1 node succeeds.
- A job submitted on 2 nodes hangs.
- A job on node-[1,3] fails as well.
The run stops at the following point, no longer responds, and keeps the nodes occupied:
Using io_type Parallel-NetCDF (CDF-5, large variable support) for mesh stream
** Attempting to bootstrap MPAS framework using stream: input
Or,
1 |
|
Or,
1 |
|
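Because a hung run keeps its nodes allocated, it can help to wrap the launch in a wall-clock limit while debugging. The following is a minimal sketch, assuming GNU coreutils `timeout` is available on the cluster; the function name `run_with_limit` and the example launch line (including the executable name `atmosphere_model`) are illustrative assumptions, not taken from the original setup.

```shell
#!/usr/bin/env bash
# Run a command with a wall-clock limit so a hung job releases its nodes.
# Usage: run_with_limit SECONDS COMMAND [ARGS...]
run_with_limit() {
    local limit="$1"
    shift
    # GNU timeout sends SIGTERM once the limit expires and then exits with 124.
    timeout "$limit" "$@"
    local status=$?
    if [ "$status" -eq 124 ]; then
        echo "HUNG: no exit within ${limit}s: $*" >&2
    fi
    return "$status"
}

# Example (hypothetical launch line; adjust to your scheduler and executable):
# run_with_limit 600 mpirun -np 2 ./atmosphere_model
```

An exit status of 124 then separates a hang from an ordinary crash, which returns the failing program's own status.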
Solution:
1 |
|
This enables the extra file system support and switches the transport to TCP,
which is slightly slower.
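The exact settings depend on the MPI implementation, which is not named here. As a minimal sketch, assuming Intel MPI (2019 or later) running on libfabric, the two changes could be expressed with the environment variables `I_MPI_EXTRA_FILESYSTEM` and `I_MPI_OFI_PROVIDER`. Treat this as one possible form of the workaround under that assumption, not necessarily the settings used on this cluster.

```shell
#!/usr/bin/env bash
# Assumption: Intel MPI (2019 or later) on top of libfabric.
# Other MPI libraries use different variables for the same purpose.

# Turn on Intel MPI's native parallel file system support for MPI-IO.
export I_MPI_EXTRA_FILESYSTEM=1

# Force the libfabric TCP provider instead of a high-speed fabric provider.
export I_MPI_OFI_PROVIDER=tcp

# Print the settings so they appear in the job log.
env | grep -E '^I_MPI_(EXTRA_FILESYSTEM|OFI_PROVIDER)=' | sort
```

Setting these in the job script before the launch line keeps the workaround scoped to the affected runs, so single-node jobs can still use the faster default transport.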
https://waipangsze.github.io/2024/10/29/HPC-file-system-failure/