Principal Software Engineer

Walmart Stores SUNNYVALE, CA

About the Job

Position Description


As a Principal Software Engineer (Reliability Engineering), you are responsible for working with Walmart’s state-of-the-art ecommerce fulfillment center warehouse management system (WMS) as part of the Supply Chain Technology organization. The initiatives require ensuring smooth functioning of the WMS and creating a great customer order fulfillment experience. We are looking to bring in more intellectually curious engineers who are passionate about technology and about creating engineering solutions to operations problems, such as optimizing existing systems, building monitoring infrastructure, and eliminating work through automation, and who find innovative ways to reduce time spent on manual operations and proactively identify potential downtime.

Responsibilities include:
• Work closely with the development team on maintaining operational health and performance of core application functions
• Manage and triage tickets. Drive prioritization and execution of work based on impact
• Scale systems sustainably through mechanisms such as easy-to-use tooling and automation. Work in concert with application developers, infrastructure engineers, and business operations to evolve systems/products for better scalability, reliability and development velocity
• Drive new playbooks to help reduce mean time to discover, mean time to triage and mean time to recovery for incidents. Prioritize and automate high-volume playbooks
• Develop optimal incident response processes and drive root cause analysis
• Demonstrate up-to-date expertise in Software Engineering and apply this to the development, execution, and improvement of action plans.
• Participate in multiple multi-scale projects.
• Technical understanding of core infrastructure, cloud services, platforms and microservices
• Ability to understand and capture key data from logs
• Ability to triage effectively: detect issues and distinguish symptom from cause.
• Analyze trends to proactively prevent incidents.
• Identify and drive continuous improvement efforts to reduce waste (eliminate, automate or streamline).
• Build tools to improve visibility, proactively detect issues and restore system availability.
• Strong focus on collecting and inferring metrics.
• Analyze systems and make recommendations to prevent possible problems.
• Take the lead on issue resolution activities using knowledge of complex and company-wide systems.
• Perform build, deployment and continuous integration processes to move the code and configurations from local development environments to QA & Production environments.
• Work as a Level 2 production support engineer on a rotation basis to help the Level 1 production support team with any production issue where engineering help is required.
• Responsible for production environment health as first priority, enabling automated monitoring and alerting to meet SLAs and ensuring close to 100% uptime.
• Clear communication skills.
• Troubleshoot business and production issues

Minimum Qualifications


• Bachelor's Degree or Master's Degree in Computer Science with 15+ years of experience
• Proven industry experience with large scale distributed systems
• Solid experience with object-oriented and/or event driven systems
• Strong Java programming experience
• Extensive experience building services using back end technologies (Java, Spring, Hibernate)
• In-depth knowledge of SQL/NoSQL and database technologies (Oracle, Cassandra, Hive)
• Experience automating tasks with scripting languages such as Shell, Perl, Python, Bash, and JavaScript
• Systematic problem-solving approach, strong communication skills, a sense of ownership and drive
• Deep understanding of service metrics and alarms through the development of dashboards, service KPIs, and alarming systems
• Aptitude to
• Experience working in an operational environment with mission-critical tier-one services and the associated pager duty

Additional Preferred Qualifications


• Strong aptitude for debugging and optimizing code
• Attitude to thrive in a fun, fast-paced, startup-like environment
• Experience in production system operations (logging, telemetry, alerting etc.)
• Excellent communication and problem-solving skills
• Has ambition and vigor to add value to a rapidly growing development team

Company Summary


The Walmart eCommerce team is rapidly innovating to evolve and define the future state of shopping. As the world’s largest retailer, we are on a mission to help people save money and live better. With the help of some of the brightest minds in technology, merchandising, marketing, supply chain, talent and more, we are reimagining the intersection of digital and physical shopping to help achieve that mission.
