Sparse collective operations¶

Language models typically feature huge embedding tables within their topology. This makes straight-forward gradient computation followed by allreduce for the whole set of weights not feasible in practice due to both performance and memory footprint reasons. Thus, gradients for such layers are usually computed for a smaller sub-tensor on each iteration, and communication pattern, which is required to average the gradients across processes, does not map well to allreduce API.

To address these scenarios, frameworks usually utilize the allgather primitive, which may be suboptimal if there is a lot of intersections between sub-tensors from different processes.

Latest research paves the way to handling such communication in a more optimal manner, but each of these approaches has its own application area. The ultimate goal of oneCCL is to provide a common API for sparse collective operations that would simplify framework design by allowing under-the-hood implementation of different approaches.

oneCCL can work with sparse tensors represented by two tensors: one for indices and one for values.

Sparse allreduce is a collective communication operation that makes global reduction operation on sparse buffers from all ranks of communicator and distributes result back to all ranks. Sparse buffers are defined by separate index and value buffers.

ccl::event sparse_allreduce(const void* send_ind_buf,
                            size_t send_ind_count,
                            const void* send_val_buf,
                            size_t send_val_count,
                            void* recv_ind_buf,
                            size_t recv_ind_count,
                            void* recv_val_buf,
                            size_t recv_val_count,
                            ccl::datatype ind_dtype,
                            ccl::datatype val_dtype,
                            ccl::reduction rtype,
                            const ccl::communicator& comm,
                            const ccl::stream& stream,
                            const ccl::sparse_allreduce_attr& attr = ccl::default_sparse_allreduce_attr,
                            const ccl::vector_class<ccl::event>& deps = {});

send_ind_buf: the buffer of indices with send_ind_count elements of type ind_dtype
send_ind_count: the number of elements of type ind_type in send_ind_buf
send_val_buf: the buffer of values with send_val_count elements of type val_dtype
send_val_count: the number of elements of type val_type in send_val_buf
recv_ind_buf [out]: the buffer to store reduced indices, unused
recv_ind_count [out]: the number of elements in recv_ind_buf, unused
recv_val_buf [out]: the buffer to store reduced values, unused
recv_val_count [out]: the number of elements in recv_val_buf, unused
ind_dtype: the datatype of elements in send_ind_buf and recv_ind_buf
val_dtype: the the datatype of elements in send_val_buf and recv_val_buf
rtype: the type of the reduction operation to be applied
comm: the communicator that defines a group of ranks for the operation
stream: an optional stream associated with the operation
attr: optional attributes to customize the operation
deps: an optional vector of the events that the operation should depend on
return event: an object to track the progress of the operation

For sparse_allreduce, a completion callback or an allocation callback is required.

Use the following fields in operation attribute:

completion_fn - a completion callback function pointer
alloc_fn - an allocation callback function pointer
fn_ctx- an user context pointer of type void*

Completion callback should follow the signature:

typedef void (*completion_fn)
(
    const void*,   /* idx_buf      */
    size_t,        /* idx_count    */
    ccl::datatype, /* idx_dtype    */
    const void*,   /* val_buf      */
    size_t,        /* val_count    */
    ccl::datatype, /* val_dtype    */
    const void*    /* user_context */
);

Note that idx_buf and val_buf are temporary buffers. Thus, the data from these buffers should be copied. Use user_context for this purpose.

Allocation callback should follow the signature:

typedef void (*alloc_fn)
(
    size_t,        /* idx_count    */
    ccl::datatype, /* idx_dtype    */
    size_t,        /* val_count    */
    ccl::datatype, /* val_dtype    */
    const void*,   /* user_context */
    void**,        /* out_idx_buf  */
    void**         /* out_val_buf  */
);

Warning

ccl::sparse_allreduce is experimental and subject to change.

oneCCL documentation

Sparse collective operations¶