What are the responsibilities and job description for the Site Reliability Engineer (SRE) - ForgeRock position at Empower Professionals?
Job Details
Role: Site Reliability Engineer (SRE) - ForgeRock
Location: Jefferson Park, Hanover, NJ 07981 (Hybrid, 2 days a week in office)
Duration: 12 Months
Job Description:
- Designing, implementing, deploying and running highly available, fault-tolerant, auto-scaling and auto-healing systems
- Deep expertise in AWS, Azure, and Google Cloud Platform, including Kubernetes and container services (EKS, ECS, Fargate, GKE) and serverless architectures.
- Implementing advanced monitoring (Prometheus, Grafana, Datadog, ELK), tracing, logging and automated alerting solutions.
- Scaling distributed systems, optimizing compute/storage efficiency, and managing cost.
- Designing failure simulations to improve system robustness and incident response.
- Expertise in AWS CLI, CloudFormation, Ansible, Helm, and GitOps for automated infrastructure provisioning.
- Driving reliability best practices across engineering teams, embedding SRE principles into the DevSecOps lifecycle.
- Partnering with engineering, security, and product teams to balance reliability and feature velocity.
- Expertise in CIAM and the ForgeRock stack (PingGateway, PingAM, PingIDM, PingDS), with certification or proof of completion of ForgeRock Deep-Dive 400 training.
- Building and mentoring high-performing SRE teams, fostering a culture of automation and innovation.
- Defining and enforcing reliability metrics to balance innovation with system stability.
- Optimizing deployment pipelines for high-frequency, zero-downtime releases.
- Leveraging machine learning for anomaly detection, predictive scaling, and automated remediation.
Additional Details:
- 5 years of hands-on experience configuring, deploying and running ForgeRock COTS-based IAM solutions (PingGateway, PingAM, PingIDM, PingDS) with automated GitOps CI/CD pipelines using GitLab.
- Design and hands-on implementation of GitOps CI/CD pipelines, automated failover, and data backup and restore solutions.
- Automating telemetry and dashboards.
- 10 years of experience running disaster recovery and zero-downtime deployment solutions.
- Designing and implementing continuous delivery.
- Hands-on coding in Python, Bash and JSON/YAML for configuration as code (CaC).
- Supporting large-scale, distributed, cloud-based microservice and API service solutions with 99.9% uptime.