What are the responsibilities and job description for the Monitoring and Alerting Engineer position at Talent Group?
Job Details
Job Title: Monitoring and Alerting Engineer
Location: Fort Worth, TX
Job Type: Contract
Work Schedule: 4 days/week onsite
Job Description
We are seeking an experienced Monitoring and Alerting Engineer to join our team.
The ideal candidate will have 10 years of experience in monitoring and alerting systems, with strong hands-on expertise in configuring and building monitoring tools.
This role requires the ability to hit the ground running, without needing extensive coaching or handholding.
Key Responsibilities:
- System Monitoring: Implement and maintain comprehensive monitoring solutions to track the performance, health, and availability of IT systems, applications, and networks.
- Alert Management: Configure and manage alerting mechanisms to ensure timely notifications of system anomalies, failures, or performance issues.
- Incident Response: Collaborate with support and operations teams to resolve incidents and lead event resolution processes during system outages.
- Root Cause Analysis: Investigate and determine the root causes of incidents and implement corrective actions to prevent recurrence.
- Optimization: Identify opportunities for system optimization, performance improvement, and process streamlining.
- Tool Evaluation & Integration: Evaluate, recommend, and integrate new monitoring tools and technologies to enhance overall monitoring capabilities.
- Documentation & Reporting: Develop and maintain detailed documentation, including monitoring configurations, incident reports, and performance metrics.
- Collaboration & Communication: Work closely with internal IT teams and vendors to ensure effective communication and resolution of incidents.
Skills and Qualifications:
- 10 years of experience in monitoring and alerting systems.
- Strong proficiency in monitoring tools like Dynatrace, Datadog, CloudWatch, Splunk, or similar.
- Understanding of IT infrastructure such as servers, networks, databases, and cloud-based environments.
- Incident management experience and ability to handle critical system outages with minimal oversight.
- Excellent troubleshooting and problem-solving skills, particularly with complex systems.
- DevOps experience is a plus but not required.
- Ability to work flexibly in a 24/7 environment, providing after-hours support as necessary.
- Effective communication and collaboration skills with both technical and non-technical teams.
Performance of Duties:
- Operate in a 24/7 operational environment, supporting systems and applications at all hours.
- Collaborate with internal teams and external vendors to resolve incidents and maintain system uptime.
- Ensure business system availability through proactive incident, problem, and change management processes.
- Lead root cause analysis for major incidents and communicate findings clearly to senior management.
- Identify and escalate critical system events to the appropriate stakeholders.
- Continually evaluate and improve monitoring and alerting tools and processes to ensure maximum efficiency.
Qualifications:
- B.S. in Computer Science, Information Systems, or Engineering (preferred).
- Technical expertise in distributed systems and experience with scripting/programming languages (Python, Node.js, Ruby, Perl, Bash/sh).
- Prior experience with ServiceNow or other IT service management tools.
- Proven experience handling high-pressure situations and interacting with staff at all organizational levels.