What are the responsibilities and job description for the Site Reliability Engineer position at F2Onsite?

Remote Site Reliability Engineer

Job Duties

1. Monitoring and Alert Response
2. Traffic Management Responsibilities
3. Infrastructure Management
4. Vendor Support/Escalation
5. Change Management
6. Runbook Management and Updates
7. Incident Management
8. In-House SRE Projects
9. Knowledge Sharing and Mentorship
10. Weekly, Monthly and Yearly Reports

Monitoring and Alert Response

Alert Monitoring:

Continuously monitor PagerDuty alerts, perform initial triage, and escalate major issues using predefined SRE runbooks and SOPs.
Monitor and respond to alerts triggered over Slack and email.
Acknowledge partner or vendor maintenance alerts and plan accordingly.
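
As an illustration of the initial triage step described above, the following is a minimal sketch of routing an alert to a runbook and deciding whether to escalate. The severity names, alert kinds, and runbook identifiers are hypothetical placeholders, not part of any real PagerDuty or F2Onsite configuration.

```python
from dataclasses import dataclass

# Hypothetical values for illustration only; a real team would take
# these from its own SOPs and runbook catalogue.
ESCALATE_SEVERITIES = {"critical", "major"}
RUNBOOKS = {
    "disk_full": "runbook-disk-cleanup",
    "high_latency": "runbook-latency-triage",
}
DEFAULT_RUNBOOK = "runbook-general-triage"


@dataclass
class Alert:
    source: str    # e.g. "pagerduty", "slack", "email"
    kind: str      # short label describing what fired
    severity: str  # e.g. "critical", "major", "minor"


def triage(alert: Alert) -> dict:
    """Suggest the next action and the runbook to follow for one alert."""
    runbook = RUNBOOKS.get(alert.kind, DEFAULT_RUNBOOK)
    if alert.severity.lower() in ESCALATE_SEVERITIES:
        action = "escalate"
    else:
        action = "handle"
    return {"action": action, "runbook": runbook}
```

Calling `triage(Alert("pagerduty", "disk_full", "critical"))` suggests escalation with the disk cleanup runbook, while an unrecognized alert kind falls back to the general triage runbook.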

System Health Checks:

Actively track system health using tools such as Nagios, Pingdom, Grafana, Prometheus, QAMC, and Splunk.

Ticket Management:

Track new SRE tickets, resolve existing ones, and follow up to ensure closure.

Traffic Management Responsibilities

Assisting Cross-Functional Teams During Maintenance:

Collaborate with cross-functional teams (for example, ProdOps) to safely redirect traffic away from the data center during maintenance activities, ensuring uninterrupted service.

Mitigating Production Issues:

In the event of production problems, promptly reroute traffic from the affected data center to maintain service continuity.

Facilitating Scheduled Hardware Maintenance:

Proactively manage traffic flow adjustments to accommodate planned hardware maintenance, minimizing potential disruptions.
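
The traffic adjustments described in this section can be pictured as shifting load-balancing weights away from one data center. The sketch below assumes a simple weighted scheme and uses hypothetical data center names; the posting does not say which traffic management tooling is actually in use.

```python
def drain(weights: dict[str, float], target: str) -> dict[str, float]:
    """Return new weights with `target` set to 0 and its share spread
    proportionally across the remaining data centers."""
    if target not in weights:
        raise KeyError(f"unknown data center: {target}")
    total = sum(weights.values())
    remaining = total - weights[target]
    if remaining <= 0:
        # Nothing else can absorb the traffic, so refuse to drain.
        raise ValueError("no other data center can take the traffic")
    return {
        dc: 0.0 if dc == target else w * total / remaining
        for dc, w in weights.items()
    }


# Hypothetical starting weights (percent of traffic per data center).
current = {"dc-east": 50, "dc-west": 30, "dc-central": 20}
after_drain = drain(current, "dc-east")
```

Here the total weight is preserved, so the remaining data centers keep their relative proportions while the drained site receives no traffic.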

Infrastructure Management

On-Prem Server Management:

Manage and configure on-premises servers, including onboarding and configuring new hardware.
Onboard servers to monitoring tools and set up alerts.

Collaborative Infrastructure Tasks:

Work closely with Network Engineering on major hardware and infrastructure changes and upgrades.
Coordinate with the Dev/DevOps and Platform teams during on-premises-to-cloud migrations.
Troubleshoot deployment issues during the CI/CD process.

Vendor Support/Escalation

Coordinate with vendors such as:

Dell: For hardware-related issues.
ISPs (CenturyLink, Lumen, Zayo, Level3): For data center internet-related issues.
VMware: For on-premises virtualization support.
GCP: For Google Cloud support.

Change Management

Change Monitoring:

Review all production change tickets to ensure proper procedures are followed.
Prevent unauthorized production changes.

Critical Change Support:

Work closely with cross-functional teams (NetEng, DevOps, Dev, and Platform) during production changes.
Help execute critical changes during maintenance windows, ensuring minimal disruption.
Monitor and validate the impact of changes post-deployment.

Runbook Management and Updates

Maintain and improve existing SRE runbooks by adding new troubleshooting steps and solutions.
Ensure SRE tasks have clear and detailed documentation for consistency.

Incident Management

Incident Response:

Act as the first responder during outages and provide updates in the Incident Management Slack channel.
Offer timely updates to stakeholders during ongoing incidents.

Incident Documentation:

Record incident details, actions taken, and outcomes in SRE incident tracking tickets.

Incident Resolution and RCA:

Conduct root cause analysis (RCA) for incidents and document findings.
Lead incident bridge calls and coordinate with stakeholders for resolution.

Post-Incident Management:

Conduct retrospectives/postmortems to evaluate incident handling and identify areas for improvement.
Document incident timelines, resolution steps, and follow-up actions.
Ensure completion of action items to prevent recurrence.

In-House SRE Projects

Participate in weekly project sync-ups to plan and execute initiatives for system scalability and reliability.
Optimize existing tools and applications.
Conduct proofs of concept (POCs) and onboard new tools to enhance capabilities.
Automate repetitive tasks to improve efficiency and reduce manual effort.
Contribute to in-house SRE projects, such as developing a Python Slack bot that collects data from Google Cloud Platform (GCP).
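
To give a sense of what a Slack bot project of this kind might involve, here is a minimal sketch that turns instance data into a Slack message and posts it to an incoming webhook. The instance records are hypothetical sample data standing in for whatever the bot would collect from GCP, and the webhook URL is a placeholder to be supplied by the caller; the actual project may be built quite differently.

```python
import json
import urllib.request


def build_summary(instances: list[dict]) -> dict:
    """Build a Slack message payload summarizing instance status.

    Each record is assumed to look like {"name": ..., "status": ...};
    this shape is an assumption made for the example.
    """
    running = [i["name"] for i in instances if i["status"] == "RUNNING"]
    other = [
        f'{i["name"]} ({i["status"]})'
        for i in instances
        if i["status"] != "RUNNING"
    ]
    lines = [f"Running instances: {len(running)}"]
    if other:
        lines.append("Needs attention: " + ", ".join(other))
    # Slack incoming webhooks accept a JSON body with a "text" field.
    return {"text": "\n".join(lines)}


def post_to_slack(webhook_url: str, payload: dict) -> int:
    """POST the payload to a Slack incoming webhook; return the HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


# Hypothetical sample data in place of a real GCP query.
sample = [
    {"name": "web-1", "status": "RUNNING"},
    {"name": "db-1", "status": "TERMINATED"},
]
payload = build_summary(sample)
```

Keeping message construction separate from the network call makes the formatting logic easy to check without sending anything to Slack.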

Knowledge Sharing and Mentorship

Create and update detailed SOPs, runbooks, and troubleshooting guides.
Train and mentor new SRE team members to enhance their technical skills.
Share insights and lessons learned during team meetings and knowledge-sharing sessions.

Weekly, Monthly and Yearly Reports

Deliver monthly, yearly, and postmortem reports on time, supporting transparency and continuous improvement in incident management.
