What are the responsibilities and job description for the Site Reliability Engineer position at VetForce Solutions?
Job Details
Site Reliability Engineer
Long Term Contract
Hybrid - Warren, NJ
Primary Skills
Strong hands-on experience with Azure cloud services
Experience assessing complex cloud solutions, including redundancy, load balancing, and fault tolerance
Experience with load and chaos testing tools and processes
Job Duties and Responsibilities
Identify and eliminate single points of failure (SPOFs) to improve system reliability.
Conduct failure mode and effects analysis (FMEA) to identify potential failure modes and their impacts.
Develop mitigation strategies to enhance system resiliency.
Assess and maintain fault-tolerant architectures using redundancy, load balancing, and automated failover.
Perform load testing and chaos testing using appropriate tools.
Collaborate with development, operations, and security teams.
Provide guidance on best practices for resiliency and reliability.
Qualifications
Bachelor's degree in IT, Computer Science, or related field.
Knowledge of SPOF and FMEA methodologies.
10 years of experience in IT infrastructure management, focusing on resiliency and chaos engineering.
Experience in designing and maintaining fault-tolerant architectures.
Understanding of observability, tracing, and telemetry tools.
Proficiency in root cause analysis and incident management.
Excellent analytical and problem-solving skills.
Strong communication and interpersonal skills for effective collaboration.