e297551e
Commit
e297551e
authored
Sep 25, 2020
by
Utz-Uwe Haus
add GNI and ulimit info to README
parent
b954b5b0
2 merge requests
!3
Jsc ci update
,
!2
update JSC-CI branch to devel
Showing
1 changed file
README.md
+70
-0
70 additions, 0 deletions
...
...
@@ -19,6 +19,20 @@ make check
to build and run the test examples.
This may take some time.
Limits
------
Maestro-core needs quite a few file descriptors and also wants to lock pages
into memory for RDMA purposes. We try to give a diagnostic message if errors
are triggered that may be due to resource constraints. Still, we recommend
```
ulimit -n 1024
ulimit -l 256
```
to allow at least 1024 open file descriptors and 256 KiB of locked memory for RDMA.
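
The recommended limits can be checked before launching anything. The following is a minimal sketch, assuming a POSIX-like shell whose `ulimit -n` and `ulimit -l` print either a number or `unlimited`; the helper names `limit_ok` and `check_limits` are hypothetical and not part of maestro-core.

```shell
# limit_ok CURRENT WANTED: succeed if CURRENT is "unlimited" or >= WANTED.
limit_ok() {
    [ "$1" = "unlimited" ] && return 0
    [ "$1" -ge "$2" ]
}

# check_limits: compare the soft limits with the values recommended above
# and try to raise them. Raising fails if the hard limit is lower.
check_limits() {
    status=0
    if ! limit_ok "$(ulimit -n)" 1024; then
        ulimit -n 1024 || { echo "cannot raise open-file limit to 1024" >&2; status=1; }
    fi
    if ! limit_ok "$(ulimit -l)" 256; then
        ulimit -l 256 || { echo "cannot raise locked-memory limit to 256" >&2; status=1; }
    fi
    return "$status"
}
```

Calling `check_limits` in the shell that later starts the application changes the limits for that shell and its children only. If it reports a failure, the hard limit is probably lower than the recommendation and has to be raised by an administrator.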
Local multithreaded demo (MVP1)
-------------------------------
...
...
@@ -46,6 +60,62 @@ The pool manager interlock demonstration `./tests/check_pm_interlock.sh` is
automatically launched with make check.
Fabric provider choice / High-Performance Interconnect usage
------------------------------------------------------------
Maestro-core tries hard to isolate the user from the multitude of network provider choices by using
libfabric and transparently choosing 'the best' connectivity between components. Unfortunately this
functionality is not fully working, due to issues in the upstream libfabric code and to incomplete testing
of our usage of it.
The safest (and lowest performance) connectivity is provided by the `sockets` provider. You can force usage of
that by setting
```
FI_PROVIDER=sockets
```
in your environment. It should work on almost any network that can support TCP/IP
networking, including Ethernet, InfiniBand (IB), and GNI (Aries).
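
As an illustration of forcing the provider for a whole shell session, here is a small sketch. It assumes the `fi_info` utility that ships with libfabric may or may not be installed, so the check is skipped when it is missing.

```shell
# Force the libfabric sockets provider for everything started from this shell.
export FI_PROVIDER=sockets

# Optional sanity check: fi_info lists what a provider offers.
# It is only run if the tool is found on PATH.
if command -v fi_info >/dev/null 2>&1; then
    fi_info -p sockets >/dev/null \
        || echo "sockets provider not offered by this libfabric build" >&2
fi
```

To restrict the setting to a single command instead, prefix it, for example `FI_PROVIDER=sockets make check`.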
Usage of the `tcp` and `tcp;ofi_rxm` providers is currently broken; an upstream
issue is open.
On Cray XC systems the GNI (Aries) provider is supported. If you compile with
the `rdma-credentials` and `gni-headers` modules loaded, the GNI provider should
be autoselected if a GNI NIC is found at runtime.
NOTE that GNI NICs on login nodes typically do not work, due to a limitation of
the libfabric/gni driver, so you will have to run your application exclusively
on compute nodes, or manually switch the components running on login nodes to
the sockets provider.
The GNI driver can be forced by setting
```
FI_PROVIDER=gni
```
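
For workflows that mix login-node and compute-node components, the manual switch described above could be wrapped in a small helper. This is only a sketch: the function `pick_provider` and its `login`/`compute` role labels are hypothetical, and you have to supply the role yourself because nothing here detects the node type.

```shell
# pick_provider ROLE: print the libfabric provider to use for a component.
# ROLE is a label chosen by the caller: "compute" or "login".
pick_provider() {
    case "$1" in
        compute) echo gni ;;       # GNI NIC expected to work here
        login)   echo sockets ;;   # fall back to the sockets provider
        *)       echo "unknown role: $1" >&2; return 1 ;;
    esac
}
```

A component started on a login node would then be launched as `FI_PROVIDER=$(pick_provider login) ./my_component`, where `./my_component` stands for your own executable.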
If you are using GNI, you will implicitly be using Cray libdrc, a mechanism to obtain network
authentication tokens. Maestro-core requests workflow-level tokens, which even
support running multiple components of a workflow under different user IDs. In
some cases the system may run out of tokens, and there is no user-level token
inquiry tool available. If GNI startup fails, try running your application with
```
DRC_DEBUG_LEVEL=DEBUG
```
and look for an error message like
```
LIBDRC:CORE:DEBUG rdmacred.c:658 - finished acquire request, rc=-28
```
If you see this, contact your system admin to clear cached DRC credentials.
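
If the debug output is captured in a file, the check for this message can be scripted. The sketch below assumes the output was redirected to a log file by you; the helper name `drc_acquire_failed` is hypothetical, and the pattern is taken from the sample message above, matching any negative return code.

```shell
# drc_acquire_failed LOGFILE: print matching lines and succeed if the log
# contains a libdrc acquire request that finished with a negative rc.
drc_acquire_failed() {
    grep -E 'LIBDRC:.*finished acquire request, rc=-[0-9]+' "$1"
}
```

A possible use is `DRC_DEBUG_LEVEL=DEBUG ./my_app > drc.log 2>&1` followed by `drc_acquire_failed drc.log`, where `./my_app` and `drc.log` are placeholder names.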
Documentation
-------------
...
...