Sr Site Reliability Engineer (remote)

Date:  Jul 26, 2022

remote, MA, US, remote

Onsite or Remote:  Remote
Company Name:  EBSCO Information Services

EBSCO Information Services (EIS) provides a complete and optimized research solution comprised of e-journals, e-books, and research databases - all combined with the most powerful discovery service to support the information needs and maximize the research experience of our end-users. Headquartered in Ipswich, MA, EIS employs more than 2,700 people worldwide, most now working hybrid or remotely. We are the leader in our field due to our cutting-edge technology, forward-thinking philosophy, and outstanding team. EIS is a company that will motivate you, inspire you, and allow you to grow. Our mission is to transform lives by providing relevant and reliable information when, where, and how people need it. We are looking for bright and creative individuals whose unique differences will allow us to achieve this inclusive mission around the world.


The Sr. Site Reliability Engineer serves as an important strategic and technical resource within the Incident Command team, which is responsible for supporting and automating our incident response process to ensure high availability, redundancy, recoverability, and capacity. This individual will have a passion for solving deep and complex technical and business problems and will want to have an outsized impact with the solutions they create, deliver, and iterate on.

Primary Responsibilities

· The Senior Site Reliability Engineer will take ownership of reliability metrics, evangelize Incident Command best practices, build a culture around tracking and leveraging SLOs and error budgets in engineering decision-making, and assist our teams in ensuring our infrastructure scales up and down efficiently

· Proficiency with scoping, right-sizing, tracking, and reporting on Service Level Objectives (SLOs), Service Level Indicators (SLIs), system availability, and other relevant measures as they relate to availability


· Proven experience driving standardized incident reports across the ecosystem for maximum observability, business context, APM and infrastructure monitoring, and app/service-specific logging.


· Work on optimizing how we deal with unexpected complex failures, including facilitating our incident response process, running blameless post-incident retrospectives, analyzing and learning from high-level trends, and leveraging innovations to reduce the level of manual toil in our workflows


· Develop and implement automation solutions for monitoring, alerting, and logging of services running in production-level systems, including application availability, reliability, security, and change management

· Use automation technologies to ensure repeatability, eliminate toil, reduce mean time to detection (MTTD) and mean time to resolution (MTTR), and repair services




· 5+ years of experience managing incidents and taking ownership of the end-to-end workflow of incident management, preferably in enterprise-class, loosely coupled environments

· 3+ years of demonstrated success working across deep technical stacks while leveraging the right tools for the right job, such as, but not limited to, Python, Go, Terraform, Docker, and OpsGenie


· Provide a unique blend of software engineering experience and infrastructure and automation experience, with a focus on how software runs in production. Close attention to detail is essential

· Expertise in observability libraries and ability to instrument code to expose new application metrics to deliver a superior end-to-end visibility posture

· Skilled in identifying performance bottlenecks, detecting anomalous system behavior, and resolving the root causes of outages. Ability to work effectively across the stack


· Superior ability to communicate in all directions of the organization, with a deep understanding of the importance of context when leading problem-solving efforts

Preferred Skills

· Creative problem solving and thinking outside of the box (how do we make our current solutions better?)

· Deep knowledge and experience of building highly available distributed systems, consensus protocols, service discovery, multi-tenancy paradigms, and operating in AWS resiliently at scale


· A track record of successful practical problem solving, excellent written and social communication, and documentation skills. Experience handling services in a large-scale enterprise environment

· Experience with managing private, hybrid, and public cloud environments (AWS preferred)

· Experience leading and participating in performance tests to identify bottlenecks, opportunities for automation, and capacity demands for new product and feature launches


· Expert knowledge of AWS cloud environments, with expertise working with NLB/ALB, S3, EC2, Auto Scaling, EKS, Lambda, etc. Experience with cloud and container platform strategy, design, architecture, and migrations

We are an equal opportunity employer and comply with all applicable federal, state, and local fair employment practices laws. We strictly prohibit and do not tolerate discrimination against employees, applicants, or any other covered persons because of race, color, sex, pregnancy status, age, national origin or ancestry, ethnicity, religion, creed, sexual orientation, gender identity, status as a veteran, disability, or any other federal, state, or local protected class. This policy applies to all terms and conditions of employment, including, but not limited to, hiring, training, promotion, discipline, compensation, benefits, and termination of employment. We comply with the Americans with Disabilities Act (ADA), as amended by the ADA Amendments Act, and all applicable state or local law.

Job Segment: Cloud, Software Engineer, Engineer, Change Management, Technology, Engineering, Management