Sarus-slurm

A SPANK Slurm plugin written in Zig

Fawzi Mohamed

Containers

  • Give user control of the user space
  • install any software
  • architecture and kernel remain the same
  • various levels of isolation and control
  • can be completely rootless (unprivileged): the user does not acquire new privileges
  • large ecosystem of tools
  • tries to make execution reproducible by freezing the userspace tools and libraries
  • mount namespace
  • chroot (user has control on the whole filesystem layout)
  • uenv ~> container with a "golden image"
  • container can give the user more freedom/control
  • if one uses that freedom things are not necessarily optimized for the specific hardware
  • hooks help in making hardware optimization accessible to more containers

Container first

  • Make the usage of containers more transparent and seamless
  • To the user it should look as much as possible like the normal usage of a cluster
  • but with more control over their environment
    • well reproducible (prescriptive)
    • customizable to take advantage of existing containers, test new releases, and use new containers rebuilt from scratch
    • can have usual home/scratch folder with their data
  • EDF: Environment Definition File
    • our solution to avoid repeating a lot of options
  • A TOML file that defines
    • an image or a Dockerfile
    • mount points,
    • environment variables
    • annotations
  • still in evolution

sample.toml

mounts = ["$SCRATCH:$SCRATCH"]
entrypoint = false
[image]
dockerfile="""
FROM nvcr.io/nvidia/pytorch:22.12-py3
 
# to avoid interaction with apt-get
ENV DEBIAN_FRONTEND=noninteractive
 
RUN apt-get update && apt-get install -y \
    --allow-downgrades --allow-change-held-packages \
    --no-install-recommends \
        build-essential \
        automake \
        autoconf \
        libtool \
        wget \
        libpmi2-0-dev \
        ca-certificates \
    && apt-get clean && rm -rf /var/lib/apt/lists/*
RUN wget -q -O nccl-tests-2.13.6.tar.gz \
    https://github.com/NVIDIA/nccl-tests/archive/refs/\
tags/v2.13.6.tar.gz \
    && tar xf nccl-tests-2.13.6.tar.gz \
    && cd nccl-tests-2.13.6 \
    && MPI=1 MPI_HOME=/opt/hpcx/ompi make -j$(nproc) \
    && cd .. \
    && rm -rf nccl-tests-2.13.6.tar.gz
"""
 
[annotations]
com.hooks.aws_ofi_nccl.enabled = "true"
com.hooks.aws_ofi_nccl.variant = "cuda11"

Workload manager

  • gives the user a way to specify the hardware resources required for their job
  • it is a natural place to specify the environment one wants to use
  • use an EDF to define the environment
    • pull image only once per job, (or build it)
    • possibly cache it in the parallel filesystem
    • give a predictable shared environment (mounts, env, annotations)
    • cleanup all resources at the end
  • to be able to make the usage of the container seamless
    • we need various configuration points in a job's lifetime:
      • setup, run, and cleanup

Slurm Integration

  • Slurm is the workload manager we use
  • it can be extended with C-based plugins (SPANK)

challenges

  • C interface
  • loaded and executed in various contexts
  • difficult to test for failures, corner cases, special configurations of slurm
    • for example missing $USER in epilogue
  • dependencies of plugin are more difficult to handle in a container

Pyxis => Sarus-slurm

  • PoC done with Pyxis and enroot

  • pyxis issues

    • written in C
    • not ours
    • no unit tests, only integration tests
    • difficult to debug
  • migrate toward something we own and that we can evolve

    • "sarus-slurm" (provisional name)

Goals

  • a less error-prone language than C
  • just our solution
  • make it more flexible? Pluggable interface with
    • createContainer (master node)
    • startContainer (1x per node)
    • execTask (nTaskPerNode)
    • stopContainer (1x per node)
    • destroy/cleanupContainer (master node)
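
The pluggable interface above could look roughly like the following Zig sketch. It is only an illustration under stated assumptions, not the actual sarus-slurm code: `MockBackend`, `BackendError`, `runOnSingleNode` and all method signatures are hypothetical. The whole lifecycle is also collapsed into one process, whereas in a real job create/destroy would run on the master node and start/stop once per node.

```zig
const std = @import("std");

/// Hypothetical error set for this sketch.
const BackendError = error{BackendFailed};

/// Hypothetical mock backend: it only records which stage was called.
const MockBackend = struct {
    log: [16]u8 = undefined,
    len: usize = 0,

    fn record(self: *MockBackend, stage: u8) void {
        if (self.len < self.log.len) {
            self.log[self.len] = stage;
            self.len += 1;
        }
    }

    pub fn createContainer(self: *MockBackend) BackendError!void {
        self.record('c');
    }
    pub fn startContainer(self: *MockBackend) BackendError!void {
        self.record('s');
    }
    pub fn execTask(self: *MockBackend, task: usize) BackendError!void {
        _ = task; // a real backend would launch the task here
        self.record('e');
    }
    // Cleanup stages return void so that they can be used with defer.
    pub fn stopContainer(self: *MockBackend) void {
        self.record('p');
    }
    pub fn destroyContainer(self: *MockBackend) void {
        self.record('d');
    }

    pub fn history(self: *const MockBackend) []const u8 {
        return self.log[0..self.len];
    }
};

/// Drives any backend that offers the five stages (duck typing via anytype).
/// defer guarantees that stop and destroy run even if a task fails.
fn runOnSingleNode(backend: anytype, n_tasks: usize) !void {
    try backend.createContainer();
    defer backend.destroyContainer();
    try backend.startContainer();
    defer backend.stopContainer();
    var task: usize = 0;
    while (task < n_tasks) : (task += 1) {
        try backend.execTask(task);
    }
}

pub fn main() !void {
    var backend = MockBackend{};
    try runOnSingleNode(&backend, 2);
    std.debug.print("stages: {s}\n", .{backend.history()});
}
```

Because `runOnSingleNode` only relies on the method names, a mock like this one can stand in for a real backend in unit tests.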

Current PoC

  • Pyxis drop-in replacement
    • find all hidden magic
    • check that everything still works
  • pluggable interface
    • support multiple backends
  • error trace when failing
  • mock test
  • no extra dependencies
  • language with fewer footguns than C
  • git.cscs.ch/fmohamed/sarus-slurm

Zig

  • is a nice small language, quite suited to low level programming, that aims at replacing C.

  • exposes the low-level primitives and simplifies their usage

    • does not hide very much
  • debug your application, not your language

    • keep the language simple
  • stay in the same space as C

    • encourage reuse of and from C (and C++)

  • very explicit: no hidden control flow (implicit destructor calls,
    exceptions), and no hidden memory allocations.

    • a bit more verbose
    • simplifies the analysis of what happens by looking just at the local context
    • very useful for low level code, and to optimize
  • neither preprocessor, nor macros, but a very nice comptime execution:

    • any function can be called at compile time emulating the target architecture
    • compile time functions can manipulate types
    • compile-time evaluation is lazy (unused functions are not compiled), and generic code is duck-typed at compile time
  • zig can incrementally build a C/C++/Zig codebase, and easily crosscompile.
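
A small sketch of the comptime points above. All names (`factorial`, `Pair`, `lengthOf`, `Buffer`, `neverCalled`) are made up for this illustration.

```zig
const std = @import("std");

/// An ordinary function: callable at run time and at compile time.
fn factorial(n: u64) u64 {
    return if (n <= 1) 1 else n * factorial(n - 1);
}

/// A compile-time function that takes a type and returns a new type.
fn Pair(comptime T: type) type {
    return struct {
        first: T,
        second: T,

        pub fn sum(self: @This()) T {
            return self.first + self.second;
        }
    };
}

/// Compile-time duck typing: accepts any value that has a `len`.
fn lengthOf(x: anytype) usize {
    return x.len;
}

/// Never referenced, so it is never analyzed and the error never triggers.
fn neverCalled() void {
    @compileError("only reported if somebody calls this function");
}

// Top-level initializers and array lengths are evaluated by the compiler.
const fact5 = factorial(5);
const Buffer = [factorial(4)]u8;

pub fn main() void {
    const p = Pair(i32){ .first = 20, .second = 22 };
    std.debug.print("{d} {d} {d} {d}\n", .{
        fact5, @sizeOf(Buffer), p.sum(), lengthOf("zig"),
    });
}
```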

Community

  • sycl.it Software you can love
  • system programming
    • really understanding the low level
    • e.g.: glibc support
  • embedded/freestanding as an option
  • cross compilation
  • quick compilation -> real incremental compilation

Getting started

  • zig run file.zig

  • zigtools/zls

  • zig init-exe -> build.zig

  • zig build system can build Zig, C and C++

  • zig build

  • zig build test

  • https://ziglang.org/learn/

    • language
    • standard library
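
A minimal file.zig to try the commands above; the `add` function is just a placeholder for this sketch. `zig run file.zig` executes `main`, and `zig test file.zig` runs the test blocks of a single file without a build.zig.

```zig
const std = @import("std");

/// Placeholder function so that there is something to run and to test.
fn add(a: i32, b: i32) i32 {
    return a + b;
}

pub fn main() void {
    std.debug.print("2 + 3 = {d}\n", .{add(2, 3)});
}

test "add two numbers" {
    try std.testing.expect(add(2, 3) == 5);
}
```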

How is Zig

  • Order independent top level declarations
  • no header files (but h files can be generated)
  • type declaration after name
  • var/const to declare variables
  • const can hold anything, value, functions, struct
  • commas also for last element
  • fn for functions, first arg can be self (like python)
    • instance.f(x) <=> Class.f(instance,x)
  • error handling: !Type (inferred error set) or ErrorSet!Type
  • expressions can return and still assign a value
  • defer/errdefer to do cleanup
  • built-in unit tests
  • error traces available on all targets
  • explicit allocators
test "test_id_list" {
    const a = std.testing.allocator;
    var idList = IdList.init(a);
    defer idList.deinit();
    _ = try idList.addId("pippo");
    try std.testing.expectEqual(idList.ids.items.len, 1);
}
const IdList = struct {
  allocator: std.mem.Allocator,
  ids: std.ArrayListUnmanaged([:0]const u8) = .{},
  pub fn init(allocator: std.mem.Allocator) IdList {
    return .{ .allocator = allocator, }; }
  pub fn deinit(self: *IdList) void {
    for (self.ids.items) |el| self.allocator.free(el);
    return self.ids.deinit(self.allocator); }
  pub fn addId(self: *IdList, value: []const u8) !usize {
    const newId = self.allocator.allocSentinel(
        u8, value.len, 0) catch |err| { return err; };
    errdefer self.allocator.free(newId);
    for (newId, value) |*target, source|
      target.* = if (source == ' ') return error.InvalidId
                 else source;
    try self.ids.append(self.allocator, newId);
    return self.ids.items.len - 1;
} };
const std = @import("std");

C interoperability

pub usingnamespace @cImport({
    @cInclude("slurm/spank.h");
    @cInclude("sys/types.h");
    @cInclude("string.h");
    @cInclude("toml.h");
});
const std = @import("std");
const sarus = @import("sarus_slurm.zig");
const spank = @import("spank.zig");

// the global context used by the plugin
pub var globalContext = sarus.SarusPluginContext{
  .noContext = void{}
};
// define all the c function callbacks
export fn slurm_spank_init(spnk: spank.spank_t,
   ac: c_int, argv: ?[*][*:0]const u8) spank.slurm_err_t {
    return globalContext.slurm_spank_init(spnk, ac, argv);
}
  • import header files
  • pointers
    • [*c]i32: a generic C pointer to 32 bit integers (nullable, one or many items: very underdetermined)
    • [*]i32: a non null pointer to an array of unknown length
    • [*:0]i32: a non null pointer to a 0 terminated array
    • *i32: a non null pointer to a single 32 bit integer
  • ?T: a possibly null Type T (optional)
  • []i32: slice (range with .ptr and .len), should be preferred to bare pointers
  • [128]i32, [128:0]i32: fixed size arrays
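
A sketch of how these types combine when a C string crosses into Zig code. The helper `argToSlice` is hypothetical and not part of the plugin; `std.mem.span` is the standard library call that scans for the terminator.

```zig
const std = @import("std");

/// Hypothetical helper: turns a possibly null C string into a Zig slice.
/// Returns null for a null pointer and for an empty string.
fn argToSlice(arg: ?[*:0]const u8) ?[]const u8 {
    const ptr = arg orelse return null; // unwrap the optional pointer
    const text = std.mem.span(ptr); // find the 0 terminator: [:0]const u8
    if (text.len == 0) return null;
    return text; // coerces to a plain slice with .ptr and .len
}

pub fn main() void {
    // A string literal coerces to a 0 terminated many-item pointer.
    const raw: ?[*:0]const u8 = "my-env";
    const name = argToSlice(raw) orelse "(none)";
    std.debug.print("{s} has {d} bytes\n", .{ name, name.len });
}
```

Converting to an optional slice once at the boundary keeps the rest of the code free of null checks and manual length scans.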

sarus-args.zig

const std = @import("std");
const zspank = @import("sarus_spank.zig");
const spankErrorToCi = zspank.spankErrorToCi;
pub const SarusSlurmArgs = struct {
  allocator: std.mem.Allocator,
  environment: ?[:0]const u8 = null,
  /// Sets the name or path of the environment to read
  pub fn set_environment(self: *SarusSlurmArgs,
                         arg_newVal: ?[]const u8) !void {
    if (self.environment) |edf| self.allocator.free(edf);
    self.environment = null;
    if (arg_newVal) |newVal| {
      const newValCopy = try self.allocator
            .allocSentinel(u8, newVal.len, 0);
      @memcpy(newValCopy, newVal);
      self.environment = newValCopy;
    }
  }
  /// callback for environment option
  pub fn cb_spank_option_environment(arg_val: c_int,
        optarg: [*c]const u8, arg_remote: c_int)
           callconv(.C) c_int {
    _ = arg_val;
    _ = arg_remote;
    if (optarg == null or optarg.?[0] == 0) {
      return spankErrorToCi(zspank.SpankError.BAD_ARG);
    }
    const sarus_args: *SarusSlurmArgs =
        globalArgsPrintErr("--environment") catch |err| {
          return spankErrorToCi(err);
        };
    sarus_args.set_environment(std.mem.sliceTo(optarg, 0))
        catch |err| {
          return spankErrorToCi(err);
        };
    return 0;
  }
};
test "set_args" {
    const a = std.testing.allocator;
    const expect = std.testing.expect;
    var args: SarusSlurmArgs = SarusSlurmArgs.init(a);
    defer args.deinit();

    try args.set_environment("bla");
    try expect(std.mem.eql(u8, args.environment orelse "", "bla"));
}

Zig?

  • Why Not?
    • Not yet 1.0
    • smaller community
  • Why Zig after all
    • Error traces simplify debugging
    • no dependencies of plugin
    • nicer and safer than C
    • fast recompile/test
    • Everything changes: HW, libraries, OS,...
    • following language changes can be part of the development work
    • zig fmt took care of for (it) |el, index| {} -> for (it, 0..) |el, index| {}
    • we are not interested in experimental features (async / await)
    • some large projects like bun.sh, tigerbeetle.com, Mach,... use it already

Zig Alternatives

Rust

  • a better C++
  • focused on safety
  • large community
  • cdylib target can create plugins that should work
  • bindings to call C, bindgen can help generating them
  • Stacktraces but no error traces
  • large and complex language

C++

  • improves on C
  • largest community
  • already well known by all
  • larger/complex language
  • stacktrace possible
  • more "hairs" (due to the long history)
  • at least libstdc++ dependency

Thanks