Remote - Principal Site Reliability Engineer

Laurel, MD
Sep 24, 2021
Sep 26, 2021
Full Time
Principal Site Reliability Engineer
Can be located anywhere in the US

Are you interested in building large-scale distributed infrastructure for the cloud? The HCGBU - Delivery Platform team is building new Software-as-a-Service technologies that operate at high scale in a broadly distributed, multi-tenant cloud environment. Oracle's extensive enterprise customer base is looking for rock-solid cloud solutions that provide the same reliability and effectiveness that they have come to expect from Oracle. Our customers run their businesses on our cloud, and our mission is to provide them with best-in-class, foundational cloud services.

Oracle's Cloud team is being built with an entrepreneurial spirit that promotes an energetic, creative, and collaborative environment, while ensuring that employees are supported in their career goals and have opportunities for training and education. We appreciate and value commitment to family and enthusiastically encourage work/life balance.

What we're looking for:
We are looking for a Site Reliability Engineer to join the HCGBU - Delivery Platform team. The ideal candidate is technically strong and able to persevere through complexity and ambiguity. They have worked directly on services that are highly available, scalable, and redundant. Automation is a core tenet of everything they do. They understand that simple systems are easier to operate and troubleshoot. They can balance speed with iteration and incremental improvements. They have made life easier for other developers and have motivated their teams to make both process and service improvements. If you are passionate about taking ownership of big technical challenges and producing software solutions that have broad, significant impact, come join our team!

Candidates should have broad working knowledge across multiple domains, but we love to see specialization as well.
The basics we expect are: Networking, Linux Systems Engineering, Software Engineering/Automation, Database Services (big data technologies), and Distributed Systems.

In this role, you will:
As a Site Reliability Engineer (SRE) within the HCGBU - Delivery Platform team, you will assist in designing and maintaining the operational processes for hosting, processing, transformation, and analysis. Your first mission will be to work closely with our software developers and Cloud architects to define a sustainable operational model for HCGBU services. This includes mechanisms to scale the systems by way of easy-to-use tooling and automation. You will work in concert with developers to evolve systems and products for better scalability and reliability, and to enable developer velocity. You will also author and maintain operational run books to help reduce mean Time of Incidents (TOI), and be responsible for managing and triaging operational tickets pertaining to the data platform services. Driving prioritization and execution of work based on business impact is a must.

Solve complex problems related to infrastructure cloud services and build automation to prevent problem recurrence.
Design, write, and deploy software to improve the availability, scalability, and efficiency of Oracle products and services.
Develop designs, architectures, standards, and methods for large-scale distributed systems.
Design, implement, and integrate monitoring solutions in pursuit of high reliability.
Investigate and implement approaches that make systems highly reliable and fault tolerant.
Facilitate service capacity planning and demand forecasting, software performance analysis, and system tuning.
Work with other engineers within the HCGBU - Delivery Platform team on the shared full-stack ownership of a collection of services and/or technology areas.
Understand the end-to-end configuration, technical dependencies, and overall behavioral characteristics of production services.
Articulate technical characteristics of services and technology areas and guide development teams to engineer and add capabilities to internal Oracle services.
Act as the ultimate escalation point for complex or critical issues that have not yet been documented as Standard Operating Procedures (SOPs).
Utilize a deep understanding of service topology and dependencies to troubleshoot issues and define mitigations.
Understand and explain the effect of product architecture decisions on distributed systems.
Serve as part of a 24x7 on-call rotation in support of the HCGBU - Delivery Platform.
Bring professional curiosity and a desire to develop a deep understanding of services and technologies.

Mandatory Qualifications:
Bachelor's or Master's degree in Computer Science, or equivalent related field experience
Experience with Python, Ruby, bash, and other scripting languages
Experience working with fault-tolerant, highly available, high-throughput, distributed, scalable systems
Aptitude to be a good team player and the desire to learn and implement new Cloud technologies as needed
Excellent organizational, verbal, and written communication skills

Preferred Qualifications:
5+ years of experience in two or more of the following:
  Software development/operations
  Developing/operating large-scale distributed services/applications
  System administration, including Linux internals, TCP/IP, DNS, and load balancing technologies
  Container administration and development utilizing Kubernetes, Docker, Mesos, or similar
  Infrastructure automation through Terraform, Chef, Ansible, Puppet, or similar
  Big Data infrastructure, including Hadoop, Spark, NoSQL, Object Storage, or similar
Experience with TCP/IP and socket programming
Knowledge of cloud compute technologies, network monitoring, data processing, and analytics
Experience with CI/CD pipelines
Proficiency in working with Git
