Azure Cloud Site Reliability Engineer - Warren, NJ (Onsite) at Smart Caliber Technology
Job Details
Duration: 7 Months
Location: Warren, NJ (Onsite)
Job Summary: We are seeking an experienced Cloud Site Reliability Engineer with a proven track record in Azure cloud services to drive system reliability and resiliency.
The ideal candidate will have strong expertise in assessing and maintaining fault-tolerant architectures while collaborating across teams to ensure seamless operations.
Key Responsibilities:
- Identify and mitigate single points of failure (SPOFs) to enhance system reliability.
- Conduct Failure Modes and Effects Analysis (FMEA) to assess and address potential failure modes.
- Design and implement strategies to improve system resiliency and fault tolerance.
- Build and maintain architectures incorporating redundancy, load balancing, and automated failover.
- Perform and manage load testing and chaos testing using industry-standard tools.
- Partner with development, operations, and security teams to implement best practices for reliability.
- Provide insights and recommendations on observability, tracing, and telemetry tools.
Required Qualifications:
- Bachelor's degree in IT, Computer Science, or a related field.
- Deep understanding of single point of failure (SPOF) analysis and FMEA methodologies.
- Minimum of 10 years of experience in IT infrastructure management, with a focus on resiliency and chaos engineering.
- Demonstrated expertise in fault-tolerant architecture design and maintenance.
- Proficiency in observability tools, tracing, and telemetry frameworks.
- Strong skills in root cause analysis and incident management.
- Excellent analytical abilities coupled with effective communication and collaboration skills.
Must-Have Skills:
- Azure cloud-based services.
- Redundancy, load balancing, and fault tolerance expertise.
- Load and chaos testing tools.
- Knowledge of SPOF analysis and FMEA methodologies.
- Familiarity with automated failover processes.
- Competency with observability, tracing, and telemetry tools.
- Proven ability in root cause analysis and incident management.
Best Regards,
Chetna