HPC Infrastructure & Scheduler Integration Engineer (PBS Focus)

We’re looking for an engineer who can design, build, and operate infrastructure integrations around HPC job schedulers—primarily PBS Professional (PBS Pro/OpenPBS). This role sits at the intersection of systems engineering, automation, and platform integration.

You will own how compute, storage, and orchestration layers connect to the scheduler—ensuring jobs run reliably, scale efficiently, and integrate cleanly with modern tooling (cloud, Kubernetes, MLOps, etc.).

________________________________________

Scheduler Integration & Automation

• Build and maintain integrations with PBS Professional and/or OpenPBS

• Develop hooks, prolog/epilog scripts, and custom scheduling logic

• Automate job lifecycle workflows (submission → execution → teardown)

• Extend scheduler capabilities via APIs, CLI tooling, and event-driven systems

Infrastructure Engineering

• Design and manage HPC environments (bare metal, VM, hybrid cloud)

• Integrate the scheduler with:

o High-performance storage (Lustre, NFS, object stores)

o Networking (InfiniBand, Ethernet fabrics)

o Identity systems (LDAP, Kerberos, RBAC)

• Optimize node provisioning, boot workflows, and image management

Platform Integrations

• Bridge HPC schedulers with modern platforms:

o Kubernetes (e.g., batch offload, hybrid scheduling)

o MLOps stacks (e.g., ClearML, Kubeflow)

o Cloud bursting workflows (AWS, Azure, GCP)

• Build tooling for data locality, environment parity, and job portability

Performance & Reliability

• Diagnose scheduling bottlenecks, queue inefficiencies, and node failures

• Tune job placement, resource allocation, and backfill strategies

• Improve system throughput, fairness, and utilization

Observability & Operations

• Implement logging, metrics, and alerting for scheduler and cluster health

• Build dashboards for queue depth, job latency, and resource utilization

• Participate in incident response and root cause analysis

________________________________________

Core Skills

• Strong Linux systems engineering (RHEL/Rocky/SLES)

• Deep experience with HPC schedulers:

o PBS Professional, Torque, Slurm, or similar

• Scripting and automation:

o Python, Bash (required)

o Go or Rust (nice to have)

• Experience with distributed systems and cluster operations

HPC-Specific Experience

• Familiarity with:

o MPI workloads (Open MPI, MPICH)

o GPU scheduling (NVIDIA stack, MIG/MPS concepts)

o Parallel file systems (Lustre strongly preferred)

• Understanding of job scheduling concepts:

o Queues, priorities, backfill, fairshare, reservations

Infrastructure & Integration

• Experience with:

o Configuration management (Ansible, Puppet, or similar)

o CI/CD pipelines for infrastructure

o APIs and service integration patterns

• Exposure to cloud platforms and hybrid HPC models

Nice to Have

• Built custom PBS hooks or scheduler extensions in production

• Designed hybrid HPC + Kubernetes or cloud bursting architectures

• Solved real scaling problems (10k+ cores, multi-petabyte storage)

• Experience with security/compliance in HPC environments (STIGs, NIST, etc.)

• Strong debugging instincts across system layers (network → storage → scheduler)

Start: ASAP

End: End of August (with potential extension depending on outcomes)

Location: Remote, with some travel to Sweden (Stockholm or Gothenburg)

  • Locations: Remote, Stockholm
  • Technologies: Bash, Linux, Python
  • Language: English