A weird race condition
Some background: We have a model that is edited only via the Django admin. The save method of the model fires a celery task to update several other records. The reason for using celery here is that the amount of related objects can be pretty big and we decided that it is best to spawn the update as a background job. We have unit and integration tests, the code was also manually tested, everything looked nice and we deployed it.
On the next day we found out that the code is acting weird. No errors, everything looked like it has worked but the updated records actually contained the old data before the change. The celery task accepts the object ID as an argument and the object is fetched from the database before doing anything else so the problem was not that we were passing some old state of the object. Then what was going on?
Trying to reproduce it: Things are getting even weirder. The issues is happening every now and then.
Hm...? Race condition?! Let's take a look at the code:
class MyModel(models.Model):
def save(self, **kwargs):
is_existing_object = False if self.pk else True
super(MyModel, self).save(**kwargs)
if is_existing_object:
update_related_objects.delay(self.pk)
So the celery task is called after the "super().save()" and the changes should be already stored to the database. Unless...
The bigger picture: (this is simplified version of what is happening for the full one check Django's source)
def changeform_view(...):
with transaction.atomic():
...
self.save_model(...) # here happens the call to MyModel.save()
self.save_related(...)
...
Ok, so the save is actually wrapped in transaction. This explains what is going on. Before the transaction is committed the updated changes are not available for the other connections. This way when the celery task is called we end up in a race condition whether the task will start before or after the transaction is completed. If celery manages to pick the task before the transaction is committed it reads the old state of the object and here is the error.
Solution: Honestly, we picked a quick and ugly fix. We added a 60 seconds countdown to the task call giving the transaction enough time to complete. As the call to the task depends on some logic and which properties of the models instance are changes moving it out of the save method was a problem. Another option could be to pass all the necessary data to the task itself but we decided that it will make it too complicated.
However I am always open to other ideas so if you have hit this issue before I would like to know how you solved it.