Environment Variables
Collective algorithms selection
CCL_<coll_name>
Syntax
To set a specific algorithm for the whole message size range:
CCL_<coll_name>=<algo_name>
To set a specific algorithm for a specific message size range:
CCL_<coll_name>="<algo_name_1>[:<size_range_1>][;<algo_name_2>:<size_range_2>][;...]"
Where:
<coll_name> is selected from the list of available collective operations (Available collectives).
<algo_name> is selected from the list of available algorithms for a specific collective operation (Available algorithms).
<size_range> is described by the left and right size borders in the format <left>-<right>. Size is specified in bytes. Use the reserved word max to specify the maximum message size.
oneCCL internally fills the algorithm selection table with sensible defaults. User input complements the selection table.
To see the actual table values, set CCL_LOG_LEVEL=info.
Example
CCL_ALLREDUCE="recursive_doubling:0-8192;rabenseifner:8193-1048576;ring:1048577-max"
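The selection string above can also be assembled programmatically, which helps avoid typos in the size borders. The following is a minimal sketch, not part of oneCCL: the helper name `build_selection` is hypothetical, and it assumes the variable is read when the library initializes, so it must be set before that point.

```python
import os


def build_selection(rules):
    """Join (algo_name, left, right) rules into a CCL_<coll_name> value.

    `left` is a size in bytes; `right` is a size in bytes or the
    reserved word "max". This helper is illustrative, not a oneCCL API.
    """
    parts = []
    for algo, left, right in rules:
        # Reject ranges whose right border is below the left border.
        if right != "max" and int(right) < int(left):
            raise ValueError(f"invalid size range for {algo}: {left}-{right}")
        parts.append(f"{algo}:{left}-{right}")
    return ";".join(parts)


value = build_selection([
    ("recursive_doubling", 0, 8192),
    ("rabenseifner", 8193, 1048576),
    ("ring", 1048577, "max"),
])

# Assumption: the environment is read at library initialization,
# so set it before oneCCL is initialized in this process.
os.environ["CCL_ALLREDUCE"] = value
os.environ["CCL_LOG_LEVEL"] = "info"  # print the resulting selection table
print(value)
```

The resulting value is identical to the example string, and the range check catches a reversed border before the process starts.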
Available collectives
Available collective operations (<coll_name>):
ALLGATHERV
ALLREDUCE
ALLTOALL
ALLTOALLV
BARRIER
BCAST
REDUCE
REDUCE_SCATTER
SPARSE_ALLREDUCE
Available algorithms
Available algorithms for each collective operation (<algo_name>):
ALLGATHERV algorithms
|
Based on |
|
Send to all, receive from all |
|
Alltoall-based algorithm |
|
Series of broadcast operations with different root ranks |
ALLREDUCE algorithms
|
Based on |
|
Rabenseifner’s algorithm |
|
May be beneficial for imbalanced workloads |
|
reduce_scatter + allgather ring.
Use |
|
reduce_scatter+allgather ring using RMA communications |
|
Double-tree algorithm |
|
Recursive doubling algorithm |
|
Two-dimensional algorithm (reduce_scatter + allreduce + allgather) |
ALLTOALL algorithms
|
Based on |
|
Send to all, receive from all |
ALLTOALLV algorithms
|
Based on |
|
Send to all, receive from all |
BARRIER algorithms
|
Based on |
|
Ring-based algorithm |
BCAST algorithms
|
Based on |
|
Ring |
|
Double-tree algorithm |
|
Send to all from root rank |
REDUCE algorithms
|
Based on |
|
Rabenseifner’s algorithm |
|
Tree algorithm |
|
Double-tree algorithm |
REDUCE_SCATTER algorithms
|
Based on |
|
Use |
SPARSE_ALLREDUCE algorithms
|
Ring-allreduce based algorithm |
|
Mask matrix based algorithm |
|
3-allgatherv based algorithm |
Note
WARNING: ccl::sparse_allreduce is experimental and subject to change.
CCL_RS_CHUNK_COUNT
Syntax
CCL_RS_CHUNK_COUNT=<value>
Arguments
<value> |
Description |
---|---|
|
Maximum number of chunks. |
Description
Set this environment variable to specify the maximum number of chunks for the reduce_scatter phase in ring allreduce.
CCL_RS_MIN_CHUNK_SIZE
Syntax
CCL_RS_MIN_CHUNK_SIZE=<value>
Arguments
<value> |
Description |
---|---|
|
Minimum number of bytes in chunk. |
Description
Set this environment variable to specify the minimum number of bytes in a chunk for the reduce_scatter phase in ring allreduce. This setting affects the actual value of CCL_RS_CHUNK_COUNT.
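The text does not spell out how the two chunk settings interact. One plausible reading, sketched below purely as an assumption, is that the number of chunks is capped by CCL_RS_CHUNK_COUNT and reduced further when the message is too small to give every chunk at least CCL_RS_MIN_CHUNK_SIZE bytes. The function `effective_chunk_count` is hypothetical and is not taken from oneCCL.

```python
def effective_chunk_count(message_bytes, max_chunks, min_chunk_bytes):
    """Illustrative model only; oneCCL's real computation may differ.

    max_chunks      -- stands in for CCL_RS_CHUNK_COUNT
    min_chunk_bytes -- stands in for CCL_RS_MIN_CHUNK_SIZE
    """
    if max_chunks < 1 or min_chunk_bytes < 1:
        raise ValueError("both limits must be positive")
    # How many chunks fit if each chunk holds at least min_chunk_bytes.
    fit_by_size = max(1, message_bytes // min_chunk_bytes)
    # Never exceed the configured maximum number of chunks.
    return min(max_chunks, fit_by_size)


# A 128 KiB message with a 64 KiB minimum chunk fits only 2 chunks,
# even though up to 4 chunks are allowed.
print(effective_chunk_count(128 * 1024, max_chunks=4, min_chunk_bytes=64 * 1024))
```

Under this model, raising the minimum chunk size lowers the chunk count for small messages, while large messages still reach the configured maximum.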
Fusion
CCL_FUSION
Syntax
CCL_FUSION=<value>
Arguments
<value> |
Description |
---|---|
|
Enable fusion of collective operations |
|
Disable fusion of collective operations (default) |
Description
Set this environment variable to control fusion of collective operations. Whether operations are actually fused depends on the additional settings described below.
CCL_FUSION_BYTES_THRESHOLD
Syntax
CCL_FUSION_BYTES_THRESHOLD=<value>
Arguments
<value> |
Description |
---|---|
|
Bytes threshold for a collective operation. If the size of a communication buffer in bytes is less than or equal
to |
Description
Set this environment variable to specify the maximum size, in bytes, of a communication buffer for a collective operation to be eligible for fusion.
CCL_FUSION_COUNT_THRESHOLD
Syntax
CCL_FUSION_COUNT_THRESHOLD=<value>
Arguments
<value> |
Description |
---|---|
|
The threshold for the number of collective operations.
oneCCL can fuse together no more than |
Description
Set this environment variable to specify the maximum number of collective operations that can be fused together.
CCL_FUSION_CYCLE_MS
Syntax
CCL_FUSION_CYCLE_MS=<value>
Arguments
<value> |
Description |
---|---|
|
The frequency of checking for collective operations to be fused, in milliseconds:
|
Description
Set this environment variable to specify how often, in milliseconds, oneCCL checks for collective operations to be fused.
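To make the interplay of the byte and count thresholds concrete, here is a small model of what a single fusion check might do with the operations queued since the previous cycle. It is a sketch under stated assumptions, not oneCCL's scheduler: the function `plan_fusion` is hypothetical, and the assumption that only consecutive eligible operations are grouped is ours.

```python
def plan_fusion(op_sizes, bytes_threshold, count_threshold):
    """Group queued operations into batches for one fusion cycle.

    Illustrative model only. An operation is eligible when its buffer
    size is <= bytes_threshold; a batch holds at most count_threshold
    eligible operations. Larger operations run on their own.
    """
    if count_threshold < 1:
        raise ValueError("count_threshold must be at least 1")
    batches, current = [], []
    for size in op_sizes:
        if size <= bytes_threshold:
            current.append(size)
            if len(current) == count_threshold:
                batches.append(current)  # batch is full, close it
                current = []
        else:
            if current:
                batches.append(current)  # close the pending batch first
                current = []
            batches.append([size])  # too large to fuse, runs alone
    if current:
        batches.append(current)
    return batches


print(plan_fusion([100, 200, 5000, 300, 400, 500],
                  bytes_threshold=1024, count_threshold=2))
```

In this model a larger byte threshold makes more operations eligible, and a larger count threshold lets each batch grow, which is the trade-off the two variables expose.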
CCL_ATL_TRANSPORT
Syntax
CCL_ATL_TRANSPORT=<value>
Arguments
<value> |
Description |
---|---|
|
MPI transport (default). |
|
OFI (Libfabric*) transport. |
Description
Set this environment variable to select the transport for inter-node communications.
CCL_UNORDERED_COLL
Syntax
CCL_UNORDERED_COLL=<value>
Arguments
<value> |
Description |
---|---|
|
Enable execution of unordered collectives.
You have to additionally specify |
|
Disable execution of unordered collectives (default). |
Description
Set this environment variable to enable execution of unordered collective operations on different nodes.
CCL_PRIORITY
Syntax
CCL_PRIORITY=<value>
Arguments
<value> |
Description |
---|---|
|
You have to explicitly specify priority using |
|
Priority is implicitly increased on each collective call. You do not have to specify priority. |
|
Disable prioritization (default). |
Description
Set this environment variable to control the priority mode of collective operations.
CCL_WORKER_COUNT
Syntax
CCL_WORKER_COUNT=<value>
Arguments
<value> |
Description |
---|---|
|
The number of worker threads for oneCCL rank ( |
Description
Set this environment variable to specify the number of oneCCL worker threads.
CCL_WORKER_AFFINITY
Syntax
CCL_WORKER_AFFINITY=<proclist>
Arguments
<proclist> |
Description |
---|---|
|
Workers are automatically pinned to the last cores of the pin domain.
The pin domain depends on the process launcher.
If |
|
Affinity is explicitly specified for all local workers. |
Description
Set this environment variable to specify the CPU affinity for oneCCL worker threads.
CCL_LOG_LEVEL
Syntax
CCL_LOG_LEVEL=<value>
Arguments
<value> |
---|
|
|
|
|
|
Description
Set this environment variable to control the logging level.
CCL_MAX_SHORT_SIZE
Syntax
CCL_MAX_SHORT_SIZE=<value>
Arguments
<value> |
Description |
---|---|
|
Bytes threshold for a collective operation ( |
Description
Set this environment variable to specify the threshold of the number of bytes for a collective operation to be split.
CCL_MNIC
Syntax
CCL_MNIC=<value>
Arguments
<value> |
Description |
---|---|
|
Select all NICs available on the node. |
|
Select all NICs local for the NUMA node that corresponds to process pinning. |
|
Disable special NIC selection, use a single default NIC (default). |
Description
Set this environment variable to control the multi-NIC selection policy. oneCCL workers are pinned to the selected NICs in round-robin order.
CCL_MNIC_COUNT
Syntax
CCL_MNIC_COUNT=<value>
Arguments
<value> |
Description |
---|---|
|
The maximum number of NICs that should be selected for oneCCL workers. If not specified, it equals the number of oneCCL workers. |
Description
Set this environment variable to specify the maximum number of NICs to be selected. The actual number of NICs selected may be smaller because of limitations at the transport level or in the system configuration.
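The round-robin pinning and the NIC count limit can be pictured with a short model. This is a sketch of the described policy, not oneCCL code: the function `assign_nics` and the NIC names are hypothetical, and the model simply takes the first NICs from the list it is given.

```python
def assign_nics(worker_count, available_nics, mnic_count=None):
    """Map worker indices to NICs in round-robin order.

    Illustrative model only. mnic_count stands in for CCL_MNIC_COUNT;
    when it is None, the limit defaults to the number of workers.
    """
    if worker_count < 1:
        raise ValueError("worker_count must be at least 1")
    if not available_nics:
        raise ValueError("at least one NIC is required")
    limit = worker_count if mnic_count is None else mnic_count
    # The number actually selected may be smaller than the limit.
    selected = available_nics[:max(1, min(limit, len(available_nics)))]
    return {w: selected[w % len(selected)] for w in range(worker_count)}


# Four workers, three NICs on the node, at most two NICs selected.
print(assign_nics(4, ["nic0", "nic1", "nic2"], mnic_count=2))
```

With four workers and a limit of two NICs, workers 0 and 2 share the first NIC and workers 1 and 3 share the second.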