Job description / Role
Our team is looking for a deeply technical HPC cluster administrator to manage a cluster of DGX servers. You will be part of the DGX platform software team, delivering next-generation HPC and deep learning platforms.
What you'll be doing:
• Administer DGX servers and the lab network.
• Automate workload management, system updates, and availability.
• Actively communicate with management about any equipment problems and propose resolution plans.
• Build, install, and upgrade systems that support the company's software teams.
• Deploy and maintain cluster management and scheduling tools.
What we need to see:
• BA, BS, or MS in CS, EE, or CE, or equivalent work experience.
• 5 years of previous experience deploying and administering HPC clusters.
• Familiarity with resource scheduling managers (Slurm preferred, LSF, etc.).
• Demonstrated ability to script in Bash and Python.
• Experience with containers (Docker, Singularity, LXC) and container orchestration technologies like Kubernetes.
• Experience designing and developing scripts and tools for continuous integration and continuous deployment of containers.
• Experience automating configuration management, infrastructure, and application deployments with a toolset such as Puppet, Chef, or Ansible.
• Deep understanding of operating systems, computer networks, high-performance applications, and deep learning frameworks (TensorFlow, PyTorch, Caffe, CNTK, etc.).
• Ability to work well with developers and test engineers.
• Passionate dedication to providing quality support for your users.
Ways to stand out from the crowd:
• Familiarity with GPU usage in compute clusters and with CUDA.
• Experience with, or knowledge of, deep learning.
• Experience with NVIDIA software for cluster management and provisioning, such as NVSM, DCGM, and DeepOps.
About the Company
Parisima specialises in building high-performing workforces that improve business performance. Our experience has shown that the most effective organisations view their employees as their most important asset and treat Talent Management as a holistic, end-to-end process.
Whether it’s a partial or fully outsourced recruitment solution or a focus on a particular area of your talent acquisition cycle, our solutions are tailored to address your specific challenges. We are experts in optimising talent acquisition and resourcing functions to build high performing organisations with high performing individuals.
Through key strategic partnerships, Parisima is the only organisation in the Middle East that specialises in addressing the full employee lifecycle. This includes Hiring (talent acquisition, applicant-tracking systems, assessments for recruitment and development) and Retention (employee engagement surveys, employee recognition and reward programs and executive leadership programs).