What are the responsibilities and job description for the Infrastructure Lead position at MARS IT?
Job Details
Accountable for translating the high-level solution design into detailed infrastructure software and hardware design artifacts, and for ensuring they are maintained and reviewed throughout the project life cycle. Ensures the detailed design and implementation of the infrastructure meet the functional and non-functional requirements. Ensures the appropriate infrastructure components are procured, installed, and configured to meet the requirements for all environments (development, test, stage, and production). Ensures the project's alignment with the client's design and delivery best practices and provides team leadership for infrastructure delivery. Minimum of 3 years of experience required (5-7 years recommended).
Identify business requirements, analyze alternatives, and make product recommendations related to software, platform, network configurations, and relevant integrations, in alignment with the industry-standard TOGAF architecture framework.
Perform analysis and value stream mapping of existing ITIL processes such as event management, incident/change management, and server patching, and design runbook automation and orchestration solutions using whichever of Ansible, GitLab CI, or ServiceNow Orchestrator best fits.
Evaluate data analytics techniques and tools, such as the Elastic Stack, Splunk, and Dynatrace, to introduce AIOps capabilities for incident avoidance. Design a centralized visualization solution using tools such as Grafana or Kibana, or technologies such as Java, Node.js, or Python, to consolidate health results from different monitoring tools (e.g., New Relic, Dynatrace, Elasticsearch) and present an executive view of enterprise health.
Evaluate and conduct proofs of concept for new tools, product offerings, or capabilities (PagerDuty, Moogsoft, BigPanda, ServiceNow ITOM, ELK machine learning) to improve enterprise resiliency and enrich the user experience.
Evaluate current or emerging open-source technologies and tools (e.g., Docker, Node.js, Ansible, GitLab CI, Jenkins) against the capabilities of licensed enterprise tools for designing similar solutions, considering factors such as cost, portability, compatibility, and usability.
Design CI/CD pipelines to automate the entire release, testing, and deployment lifecycle of monitoring, management, and automation solutions, using GitLab for source control, GitLab CI for continuous integration pipelines, Ansible or equivalent GitLab CI runners for continuous deployment, and containerization tools for continuous testing.
Design and architect enterprise monitoring, management, automation, and analytics tool solutions, and guide solution development through the full implementation lifecycle.
Assist in devising a roadmap for transforming the existing enterprise monitoring landscape to reach the next level of maturity. Recommend best practices in IT infrastructure monitoring and automation, and deploy them for the NOC to enhance operational efficiency.
Key Responsibilities
- Design and implement observability solutions for metrics, logging, tracing, and events.
- Collaborate with SREs to enhance system reliability and performance.
- Develop and maintain dashboards, alerts, and monitoring systems.
- Integrate observability tools with existing infrastructure.
- Analyze and visualize data to provide actionable insights.
- Troubleshoot and resolve issues related to system performance and reliability.
- Mentor and guide junior engineers in best practices for observability.
- Stay updated with the latest trends and technologies in the observability and SRE space.
Required Skills and Qualifications
- Bachelor's degree in Computer Science, Engineering, or a related field.
- 5 years of experience in observability, monitoring, or a related field.
- Strong understanding of SRE principles and practices.
- Proficiency in observability tools such as:
  - Metrics: Prometheus, Grafana, Datadog
  - Logging: ELK Stack (Elasticsearch, Logstash, Kibana), Fluentd, Splunk
  - Tracing: Jaeger, Zipkin, OpenTelemetry
  - Event Tracking: Kafka, RabbitMQ, Apache Pulsar
- Expertise in cloud environments (AWS, Google Cloud Platform, Azure).
- Experience with infrastructure-as-code and configuration management tools (Terraform, Ansible).
- Proficiency in programming/scripting languages (Python, Go, Bash).
- Strong analytical and problem-solving skills.
- Excellent communication and collaboration abilities.
Preferred Qualifications
- Experience with Kubernetes and container orchestration.
- Familiarity with CI/CD pipelines.
- Certification in relevant technologies (e.g., AWS Certified DevOps Engineer - Professional, Google Cloud Professional Cloud DevOps Engineer).