Jr. GPFS Engineer at NASA Goddard

RedLine Performance Solutions
Greenbelt, MD
Jul 16, 2019
Jul 21, 2019
Engineer, IT, QA Engineer
Full Time
RedLine Performance Solutions (RedLine) has been in the HPC solutions engineering services business for approximately 17 years and is consistently determined to keep the "bar of excellence" quite high for new hires. This enables RedLine to accomplish what other firms cannot and promotes a high level of staff retention. We offer services ranging from full life cycle HPC systems engineering to remote managed services to HPC program analysis.

We are located in the Washington, DC area and are looking for a Junior GPFS Engineer to join us for our NASA NACS High Performance Computing contract. Although an experienced HPC Engineer is ideal, we are enthusiastic about bringing in a Linux engineer looking for the next challenge. We are well practiced at helping Linux engineers grow to become HPC Engineers.

US citizenship and the ability to obtain a Public Trust security clearance are mandatory requirements for this position. The position is located at a customer site in Greenbelt, MD. Preference is for local candidates, but we will consider relocation as well.

This position is a member of an HPC Support team focusing on storage hardware and software for two supercomputing clusters. You will specialize in both the monitoring and management of storage systems and storage-related network management for a large supercomputer.
Duties and Responsibilities

Storage tasks (hardware installation, hardware testing and daily maintenance/monitoring, LUN configuration and presentation with various controller OS's, filesystem and cluster management with GPFS)
Monitor and maintain Discover's storage hardware (spinning disk and NVMe-based) and backend storage network (Fibre Channel)
Monitor and maintain Discover's GPFS cluster, including all 3700 clients and 60 NSD servers (plus managers and quorum nodes)
Monitor and maintain Discover's 3 high-speed interconnect fabrics (2 FDR InfiniBand and 1 Omni-Path OPA100 fabric, including cables, switches, firmware, and software-level components such as the SMs)
Address user tickets and resolve issues in various cluster areas
Attend meetings with high-priority user groups to keep channels of communication open and address any concerns they may have
Maintain the test and development system to keep it consistent with the production cluster
Consult with the customer on new cluster hardware purchases (both storage and compute)
Assist with benchmarking new products (storage systems and switches) that will potentially be used in production
Test and verify hardware such as storage and high-speed fabrics to validate it for production

Requirements

Bachelor's degree in Computer Science, Management Information Systems, or other technical discipline plus 3 years of relevant work experience, or equivalent
Experience with HPC parallel filesystems (e.g., GPFS, Lustre)
Experience with storage systems (data/metadata/IO server configurations in GPFS, spinning disk, SSD, and NVMe)
Experience with high-speed interconnect networking (e.g., InfiniBand, Omni-Path, Fibre Channel): cabling, cards, switches, OFED/MOFED, etc.
Working knowledge of scripting and programming languages such as C, C++, Fortran, Bash, CSH, TCSH, Perl, Python, Ruby
Good organization skills to balance and prioritize work, and the ability to multitask
Good communication skills for working with support personnel, the customer, and managers