Site Reliability Engineer

Armedia LLC
Vienna, VA
Sep 16, 2021
Sep 22, 2021
Engineer, IT, QA Engineer
Full Time
We are looking to reinvent our system engineering practice into one modeled on Google's Site Reliability Engineering (SRE) discipline, ensuring that our customer-hosted systems and our applications have reliability and uptime appropriate to our customers' and users' needs. We want to implement a fast rate of continual improvement while keeping a close watch on current capacity and performance.

We are looking for an individual who:
- has strong problem-solving abilities, a can-do attitude, and the initiative and drive to conceive and complete complex automation tasks; writes clear and concise documentation; and can articulate solutions to various roles within the company and with clients
- has a solid software development background and is passionate about coding infrastructure using programming languages and orchestration tools, with the objective of significantly reducing, and perhaps eliminating, manual effort for complex deployments, along with reducing mundane manual work ("toil")
- can demonstrate having automated solutions, and can conceptualize a complex solution, start small, and then progressively implement additional code to get the full solution in place

We have a mixture of Windows and Linux systems that run as virtual machines on VMware vSphere and AWS, soon to include oVirt/Red Hat Virtualization. Our preferred Linux distributions are CentOS and RHEL. We are steadily improving automation for deployment of VMs within vSphere and will include oVirt once operational, along with EC2 and other infrastructure on AWS. We are improving monitoring for our infrastructure and applications, including coverage for application synthetic testing, expiring resources such as accounts and certificates, and system security scanning. We have several solutions that require certification against FedRAMP, HITRUST, ISO 27001, and PCI DSS.

We have deployments that range in scope and complexity.
The more complex deployments consist of provisioning virtual infrastructure (private cloud, servers/virtual machines, network ACLs/rules, and reverse proxy/load balancers/sentry points), installing "shared services" (PKI, Directory Services, Satellite/WSUS, application/system monitoring, log analysis, intrusion detection, vulnerability scanning, compliance checking), installing dependent application components, installing applications, and restoring datasets containing solution state captured from another deployment.

We have less complex deployments such as provisioning virtual infrastructure that uses existing shared services, updating application versions in solution stacks, applying system and application hardening, and updating application releases.

We are developing a CI/CD model for our ArkCase case management platform. Some of our deployments are partially automated, some are manual, and the remainder are a mix.

Position Duties and Responsibilities:

Windows
- Automating deployment and management of servers based on templates
- Implementing and maintaining centralized patch management for workstations and servers
- Implementing and maintaining centralized antivirus/endpoint protection for workstations and servers
- Using PowerShell for automation of frequent repetitive manual tasks related to administration, deployment, and system hardening
- Using Group Policy Objects to perform targeted pushes of organizational standards to various resources managed in Active Directory
- Implementing and maintaining PKI for users (certificates and smart cards), workstations, servers, applications, and domain controllers managed in Active Directory
- Implementing Kerberos on-premises, along with ADFS and other third-party Identity Providers (IdPs), to offer Single Sign-On (SSO) capability for applications
- Deploying, managing, and maintaining solutions using SQL Server
- Managing and optimizing SharePoint 365 as part of Office 365
- Managing and optimizing Office 365 subscriptions
- Implementing and managing backup procedures for applications and systems, including recovery and disaster recovery testing

Linux
- Automating deployment and management of RHEL/CentOS 6.x, 7.x, and 8.0 servers based on templates or Kickstart configurations
- Integrating Linux servers with Active Directory or another directory service for authentication and authorization
- Using Bash, Ansible, and/or Puppet for automation covering server deployments, installing/configuring/updating application stacks, applying hardening, and replacing frequent manual activities
- Using configuration management repositories as a source for artifacts, with appropriate labeling as needed to build specific combinations and releases
- Integrating monitoring of server and application components with a centralized monitoring solution, using a combination of agent and agentless approaches
- Deploying monitoring solutions based on Zabbix or Nagios, and using Grafana or Kibana for visualizations and dashboards
- Deploying and regularly updating ClamAV and other antivirus solutions on servers using a centralized approach
- Using Satellite or Foreman for managing package repositories and pushing updates to servers
- Implementing and managing server and application-level backup procedures that include application-level recovery and full disaster recovery testing
- Implementing Samba and/or NFS for file sharing on Windows and Linux platforms, and tuning for optimal performance with high user volumes and file counts
- Consistently implementing Linux OS security, including SELinux, as the standard for all deployments
- Debugging system and application component communication issues using tcpdump and Wireshark as needed
- Understanding networking and having used tcpdump and Wireshark for troubleshooting issues

Java and JavaScript-based applications
- Deploying, managing, and maintaining Java-based web applications on Spring Boot, Tomcat, and/or JEE application servers
- Deploying, managing, maintaining, and tuning JavaScript applications on Node.js
- For existing applications, developing load test workloads based on browser-based emulation and JMeter, along with optimizing tests and tailoring them to specific application use cases
- Deploying and configuring Keycloak in IdP and/or SP roles
- Analyzing and tuning Java applications using tools to identify resource issues, hotspots, and other problems, and working with development teams to resolve them
- Hardening to meet evolving security compliance requirements

Project and Productivity Tools
- JIRA for ticket and task management
- Microsoft Office for authoring and updating documents, presentations, project plans, and drawings
