Reliability Engineer III (REMOTE - Nashville location Preferred)

Job Locations: US
Req ID: 2025-7900
Category: Information Technology
Type: Full-Time Regular
Security Access Level: Access 1: US Citizenship Only (No Dual) / CFIUS Approval / Sole US Citizen (DMV & FBI Programs)
Work Schedule: Core Business Hours

Overview

IDEMIA is the global leader in identity and security. Our mission is to create a safe and simple future where identity verification is indisputable, and only you can assert your identity. We are a distributed company leveraging the latest technologies to deliver world-class products in the private and public sectors of finance, telecom, identity, security, retail, sports entertainment, commercial, government, and IoT. We use a variety of technologies and approaches to deliver quality products and services to government agencies and technology companies. IDEMIA is made up of 14,000 diverse people of different nationalities, speaking over 20 different languages. Together, our solutions impact the everyday lives of citizens and nations. In this ever-changing world, protecting your identity is paramount. Join the team that is ensuring one person, one identity.

Responsibilities

Site Reliability Engineering (SRE) is a discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems. The main goals are to create scalable and highly reliable software systems. According to Ben Treynor, founder of Google's Site Reliability Team, SRE is "what happens when a software engineer is tasked with what used to be called operations." 

 

A Site Reliability Engineer (SRE) will spend up to 50% of their time doing "ops" related work such as investigating and troubleshooting issues, incident response, and maintaining playbooks and other relevant documentation. Since the system that an SRE oversees is expected to be highly available and self-healing, the SRE should spend the other 50% of their time on development tasks such as improving CI and deployment pipelines, enhancing monitoring capabilities, and keeping systems updated. The ideal Site Reliability Engineer candidate is either a software engineer with a good administration background or a highly skilled system administrator with knowledge of deployment automation, coding, and DevOps.

 

You will be responsible for the following: 

  • Ownership of product KPIs and SLA reporting (e.g., outages).
  • Availability and performance of production services.
  • Deployment of upgrades and installation of new patches.
  • Troubleshooting, error log analysis, report generation, capacity planning, etc.
  • Management of automated deployments into production and lower environments.

Qualifications

Required Experience

  • Minimum 6 years of experience supporting cloud-based, highly available solutions.
  • Minimum 6 years of experience working in SRE, DevOps, or software engineering.
  • Experience in Network Administration.
  • Experience with Unix/Linux operating systems, CLI, and administration.
  • Certification or relevant experience with AWS and/or Azure cloud services is a big plus.
  • BS/MS in Computer Science, Mathematics, Engineering, or equivalent experience.

 

Required Skills

  • Log aggregation, reporting, and monitoring.
  • CI/CD automation and orchestration.
  • Experience in production environments supporting mission-critical applications.
  • Working knowledge of Java, JVM management, and configuration.
  • Familiarity with various levels of security compliance, such as SOC 2 and FedRAMP High.
  • Strong communication skills with the ability to articulate technical details to different audiences.

Pluses

  • Knowledge of and experience with Datadog, CloudWatch, or Splunk.
  • Experience in Database Administration.
  • Building observability and standing up monitoring within a FISMA High environment.
  • Ability to translate NIST 800-53 control requirements to implemented solutions.
  • Knowledge and experience designing and developing applications that take into account scalability, reliability, extensibility, etc.
  • Test automation experience with either unit/integration or functional API testing harnessed in a continuous delivery tool.
