Sr. HPC System Administrator with Machine Learning

CoreHive Computing LLC
Aberdeen, MD
Oct 17, 2019
Oct 20, 2019
Full Time
The on-site System Administrator will be engaged in a wide range of system administration and management duties for the IBM Power9 High Performance Computing and Machine Learning (HPC-ML) system, and the candidate should demonstrate experience with and understanding of those duties. The System Administrator will serve as the IBM HPC-ML system subject matter expert (SME) and will be skilled and experienced in IBM HPC-ML systems management.

The work scope will include support of system installation, system configuration, and acceptance testing leading up to full system acceptance. Once the system completes acceptance testing, the System Administrator will take on primary responsibility for maintaining daily system operational efficiency, including user account management, configuration management, problem determination and resolution, remedial maintenance, preventive maintenance, software upgrades, and implementation of monitoring and support tools that improve service quality, efficiency, and user productivity.

The HPC-ML System Administrator must be skilled and experienced in IBM HPC-ML servers, IBM ESS/Spectrum Scale storage and file systems, the IBM HPC-ML software stack, orchestrators (ICP/OpenShift), containers (Singularity), job schedulers (LSF), Mellanox high-performance InfiniBand networks, and Red Hat Linux. The preferred candidate will have prior system administration experience with the ARL Aberdeen site and systems. The candidate must be a US citizen.

Task Description:

The System Administrator tasks include but are not limited to:

- Act as technical liaison between the Government and IBM's support teams to facilitate, optimize, and maintain site-specific customizations (e.g., job scheduler, system network, accounting, and security requirements).
- Provide maintenance and tuning of the Red Hat operating system (OS), IBM HPC-ML software stack, and IBM Spectrum Scale file systems.
- Implement Information Assurance (IA) required functionality and perform periodic Comprehensive Security Assessment (CSA) scanning.
- Provide the Government management team with operational and workload support, with responsibility for improving system availability and efficiency.
- Provide the Government management team with monthly reports tracking system availability, system-level interrupts, and user-level interrupts.
- Provide technical leadership and knowledge to the Government User Support Team to help resolve user application and data porting/migration issues.
- Assist the Government Team with:
  - Obtaining, managing, deriving, and analyzing accounting, auditing, performance, and utilization data.
  - Capacity and migration planning for new software and hardware products.
- Provide assistance with maintenance and tuning of Government-furnished third-party software.
- Take the lead role and work collaboratively with the IBM field support team on troubleshooting and problem determination for IBM-supplied hardware and software.
- Assist the Government technical team with troubleshooting and problem determination as it pertains to HPCMP software.
- Perform scheduled system maintenance:
  - Maintain the Red Hat OS and IBM HPC-ML software and firmware levels.
  - Work with the Government Technical Contact to schedule downtime for preventive maintenance (PM) actions.
  - Perform software and firmware upgrades as directed and appropriate.
  - As appropriate, work with the Government, IBM, third-party vendors, and other systems integrators to resolve issues associated with system upgrades and PM actions.
  - Use Fix Central to search for, select, order, and download fixes for the system. Fixes provide updates to software, licensed internal code, and machine code that fix known problems, add new function, and keep the system, software, and hardware management console operating efficiently.
- Perform shell scripting to automate and streamline system administration tasks.
- Allow Government priorities to dictate the sequencing of tasks requiring System Administrator support at any particular time.

Required skills/Level of Experience:

The ideal candidate will have experience administering the aforementioned hardware and software environments in support of machine learning and deep learning frameworks, including but not limited to Apache Spark, Hortonworks Data Platform, Caffe, TensorFlow, Torch, Theano, Chainer, MXNet, and DL4J. The ideal candidate will also have administration skills for both model training and production-level inferencing.

The System Administrator must be skilled and experienced with the following hardware and software:

- IBM HPC servers, or equivalent
- IBM GSS/ESS storage, or equivalent
- Mellanox InfiniBand networking, or equivalent
- IBM Spectrum Scale solutions, or equivalent
- IBM Spectrum MPI, or equivalent
- IBM Spectrum LSF, or equivalent
- IBM Extreme Cluster Administration Toolkit (xCAT)
- IBM Cloud Private (ICP) or Red Hat OpenShift, or equivalent
- Singularity or other Kubernetes orchestration software solutions
- TensorFlow, Caffe, and other analytics
- SKLM
- Open Source Libraries
- Parallel Performance Toolkit
- IBM XLC and XLF compilers
- PGI and Open Source compilers
- ARM DDT
- RogueWave and TotalView
- PBSPro

OTHER SKILL/REQUIREMENTS:

Required skills: Microsoft Excel and Word; GitHub or a preferred software control management system; IBM support tools (Slack, Fix Central, etc.); Linux command-line scripting; OS image building.

Nice to have skills: A competitive candidate will have experience working with decision support, big data, enterprise reporting and analytic tools, or other knowledge support systems. A competitive candidate will also have experience managing multiple storage management technologies, including object storage, parallel file systems, and block storage devices, as well as experience working with multiple operating environments and an expert-level understanding of deploying and developing within multiple Linux distributions.
The System Administrator must:

- Be a US citizen
- Have an adjudicated Tier 5 background investigation (DoD Top Secret)
- Have Information Assurance Technician (IAT) Level II or IAT Level III certification
- Work 40 hours per week
- Work full time, excluding Government holidays, personal vacation days, and personal sick days
- Work on site at the Government location in Aberdeen, MD
