HPC Engineer

Posted on Thursday, 5th February 2026

IT
Stockholm
Permanent

High-Performance Computing (HPC) Engineer (Stockholm, Sweden)

Overview

Our client, a high-growth technology company, is seeking an experienced High-Performance Computing (HPC) Engineer to help design, build, and operate large-scale compute platforms supporting demanding workloads (e.g., AI/ML, simulation, rendering, analytics, and research). You will work closely with infrastructure, platform, and research/engineering teams to deliver reliable, high-throughput systems with strong performance, automation, and observability.

Key Responsibilities

  • Design, deploy, and support HPC clusters (on-prem / colocation / cloud-connected) with a focus on performance, resilience, and scalability.
  • Administer and optimise Linux-based compute environments (provisioning, patching, kernel/driver tuning, user access, hardening).
  • Implement and maintain workload scheduling and cluster management (e.g., Slurm or equivalent), including partitions/queues, fair-share policies, and job efficiency improvements.
  • Support GPU-accelerated environments (where applicable): driver/toolkit management, performance profiling, stability troubleshooting.
  • Build and maintain automation for cluster lifecycle operations (IaC, config management, CI/CD-style ops).
  • Partner with networking and storage teams to ensure high-throughput, low-latency performance across the stack.
  • Own incident response and problem management for HPC services; lead root-cause analysis and preventative improvements.
  • Develop monitoring, logging, and capacity planning to meet throughput and availability targets.
  • Produce clear documentation (runbooks, architecture diagrams, operational standards) and contribute to continuous improvement.

Required Skills & Experience

  • Strong hands-on experience as an HPC Engineer / Linux Systems Engineer / Infrastructure Engineer in performance-critical environments.
  • Deep Linux administration skills (systemd, networking basics, storage, performance tuning, troubleshooting).
  • Experience operating HPC or large-scale compute platforms, including one or more of:
    • Schedulers / cluster managers (Slurm preferred; PBS, LSF, Kubernetes for batch, etc.)
    • GPU compute (NVIDIA drivers/CUDA, NCCL awareness, profiling tools)
    • MPI and distributed compute concepts (OpenMPI/MPICH understanding)
  • Solid scripting/automation skills (Bash, Python; plus Ansible/Terraform or similar).
  • Practical understanding of observability (metrics, logs, tracing), and using monitoring stacks to drive reliability.
  • Good knowledge of storage and data movement patterns used in HPC (parallel file systems and/or high-performance shared storage concepts).
  • Strong communication skills, with the ability to work across platform, network, storage, and application teams.

Desirable / Nice-to-Have

  • Experience with high-speed interconnects (e.g., InfiniBand, RoCE) and low-latency network troubleshooting.
  • Experience with containerised HPC or hybrid HPC workloads (Apptainer/Singularity, Docker where appropriate).
  • Familiarity with security best practices in shared compute environments (least privilege, auditing, secrets handling).
  • Background supporting AI/ML infrastructure at scale (GPU fleet operations, job efficiency, capacity optimisation).

Location & Working Model

  • Stockholm, Sweden (based locally).
  • Working model: hybrid or on-site, depending on operational needs.

What Success Looks Like

  • Stable, high-performance clusters with measurable improvements in throughput, utilisation, and job success rates.
  • Strong automation and repeatability across provisioning, configuration, and operations.
  • Clear operational practices (monitoring, alerting, runbooks) that reduce MTTR and improve reliability.

Next Steps

  • Please send me your most recent CV, tailored to this job description, together with your contact details.

Rami James

Advertised by:

Rami James
Lead Senior Consultant