At my work we have quite a number of different sites/apps. Sometimes it is just a regular django website. Sometimes django + celery. Sometimes it also has extra django management commands, running from cronjobs. Sometimes Redis is used. Sometimes there are a couple of servers working together....
Anyway, life is interesting if you're the one that people go to when something is (inexplicably) broken :-) What are the moving parts? What do you need to check? Running top to see if there's a stuck process running at 100% CPU. Or if something eats up all the memory. df -h to check for a disk that's full. Or looking at performance graphs in Zabbix. Checking our "sentry" instance for error messages. And so on.
You can solve the common problems that way. Restart a stuck server, clean up some files. But what about a website that depends on background jobs, run periodically from celery? If there are 10 similar processes stuck? Can you kill them all? Will they restart?
I had just such a problem a while ago. So I sat down with the developer. Three things came out of it.
First, I was told I could just kill the smaller processes. They can be re-run later. This means it is a good, loosely-coupled design: fine :-)
Second, the README now has a section called "troubleshooting" with a couple of command line examples. For instance the specific celery command to purge a specific queue that's often troublesome.
This is essential! I'm not going to remember that. There are too many different sites/apps to keep all those troubleshooting commands in my head.
Third, there is now a handy script (bin/repair) that prints out the commands that need to be executed to get everything right again. Re-running previously-killed jobs, for instance.
The script grew out of the joint debugging session. My colleague was telling me about the various types of jobs and celery/redis queues. And showing me redis commands that told me which jobs still needed executing. "Ok, so how do I then run those jobs? What should I type in?"
And I could check several directories to see which files were missing. Plus commands to re-create them. "So how am I going to remember this?"
In the end, I asked him if he could write a small program that did all the work we just did manually: looking at the directories, looking at the redis queue and printing out the relevant commands.
Yes, that was possible. So a week ago, when the site broke down and the colleague was away on holiday, I could kill a few stuck processes, restart celery and run bin/repair. And copy/paste the suggested commands and execute them. Hurray!
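To give an idea of what such a script can look like, here is a minimal sketch in Python. It is not our actual bin/repair: the file names, the bin/django wrapper and the recreate_export and rerun_job management commands are all made up for the example, and the redis lookup is left as a placeholder. The important bit is that it only prints commands and doesn't execute anything itself.

```python
#!/usr/bin/env python3
"""Sketch of a bin/repair-style script: it prints commands, it runs nothing."""
import shlex
from pathlib import Path

# Hypothetical file names: adjust to whatever the real project should produce.
EXPECTED_FILES = ["var/exports/daily.csv", "var/exports/weekly.csv"]


def missing_files(base_dir, expected):
    """Return the expected relative paths that do not exist under base_dir."""
    base = Path(base_dir)
    return [rel for rel in expected if not (base / rel).exists()]


def repair_commands(base_dir, expected, pending_jobs):
    """Build the list of shell commands a human should review and run."""
    commands = []
    for rel in missing_files(base_dir, expected):
        # 'recreate_export' is an invented management command name.
        commands.append("bin/django recreate_export " + shlex.quote(rel))
    for job in pending_jobs:
        # 'rerun_job' is likewise invented for this example.
        commands.append("bin/django rerun_job " + shlex.quote(job))
    return commands


def read_pending_jobs():
    """Placeholder: a real script would ask redis which jobs are still queued."""
    return []


if __name__ == "__main__":
    for command in repair_commands(".", EXPECTED_FILES, read_pending_jobs()):
        print(command)
```

Printing instead of executing is deliberate: whoever is fixing the site gets to read the commands first and can copy/paste only the ones that make sense at that moment.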
So... make your sysadmin/devops/whatever happy and...
- Provide a good README with troubleshooting info. Stuff like "you can always run bin/supervisorctl restart all without everything breaking". Or warnings not to do that, but to do xyz instead.
- Provide a script that prints out what needs doing to get everything OK again.