Postmortem

Ronaldaguirre
2 min readFeb 8, 2021

Issue Summary:

On 1/02/21 at 2:23 pm PST, 100% of the website’s service was down for a total 12 minutes, with service reinstated at 2:35 pm PST. Users universally experienced a response with a status code of 500 (internal server error). The main cause of the interrupt was a single character typo in which a .py file was written as .py with a trailing space.

Timeline for 1/02/21 (PST):

  • 2:23 pm: After deploying a Flask update, a junior engineer noticed that the website was returning a 500 status code.
  • 2:25 pm: All running processes on a particular server were checked using ps auxf. Nginx and MySQL were found to be running as expected, indicating an error with PYTHON/Flask.
  • 2:26 pm: The Nginx configuration file /etc/nginx/sites-enabled/default was edited to enable debug mode.
  • 2:27 pm: The website was curled to reveal a fatal error, a missing file /var/www/html/index.py_ ("_" = represents the white space) required in /var/www/html/index.py. The nonexistent .py" ", extension indicated a potential typographical error.
  • 2:28 pm: ls was used to check the contents of /var/www/html/. It was discovered that the file /var/www/html/index.py_ ("_" = represents the white space) existed, confirming a typographical error was made.
  • 2:30 pm: The typographical error was fixed on the individual server using sed -i 's/py_/py/' /var/www/html/index.py.
  • 2:31 pm: Website service was then tested once more, with content being served as expected.
  • 2:32 pm: A puppet manifest was developed to fix this issue on a large scale.
  • 2:35 pm: The puppet manifest was deployed on all remaining servers, bringing website service back to 100%.

Root cause and resolution:

The root cause of this outage was a typo made in the python file /var/www/html/ in which the file /var/www/html/index.py was required. The extension of .py was a typographical error, meant to be .py_ . Since /var/www/html/index.py_ ("_" = represents the white space) did not exist and was required, a fatal error was raised, preventing content from being served. Since this code was deployed on all servers, this error caused a 100% outage. A puppet manifest to fix the typographical error was developed and deployed on all servers, reinstating service within 12 minutes of the outage.

Corrective and preventative measures:

To prevent wide-scale issues like this from occurring in the future, code should never be deployed on all servers before testing. Some things to consider for the future are:

  • the development of company-wide testing protocol
  • setting up isolated docker containers for testing purposes
  • and the implementation of a two-person sign-off before major deployment.

Author:

--

--