What are the responsibilities and job description for the Site Reliability Engineer position at Talent Space?
Job Description
Talent Space, Inc. is seeking a Site Reliability Engineer for a remote, full-time opportunity!
This role is responsible for ensuring the stability, reliability, and scalability of our production systems. You will design and implement solutions that improve system performance, reduce downtime, and automate repetitive tasks. Combining systems engineering and operations engineering, you'll enhance operational processes, monitoring systems, and tooling to provide a seamless experience for our customers. The ideal background includes system administration and network management.
Responsibilities
- Monitor all systems and infrastructure to maintain the highest level of availability. Proactively identify and resolve incidents before they impact operations. Perform routine maintenance tasks, including monitoring, patching, and backups.
- Respond to incidents and outages in a timely and effective manner. Collaborate with other teams to diagnose and resolve complex issues. Document incident details and implement corrective actions to prevent recurrence. Document processes, configurations, and troubleshooting procedures.
- Diagnose and resolve application performance problems or system outages. Play the role of Incident Manager during outages.
- Resolve complex hardware and software issues, and work with vendors when necessary.
- Optimize system performance and resource utilization on-prem and in the cloud.
- Develop and maintain automation scripts to streamline repetitive tasks. Use scripting languages (e.g., PowerShell, Python) to automate system administration.
- Implement configuration management tools to ensure consistency and repeatability.
- Create and maintain comprehensive documentation of IT processes and procedures.
Qualifications