QubitJob sometimes has a race condition due to SQL transactions
|Assignee:|Sara Allain|
|% Done:||
|Target version:|Release 2.3.0|
|Google Code Legacy ID:||
|Tested version:|2.3|
Recently we tried re-enabling the SQL transaction filter in AtoM to avoid timeouts corrupting our data. However, as was probably to be expected with such a big change, an issue has come up: it seems that some (all?) of the asynchronous jobs are now broken. For instance, if I run a worker manually to see its output on stdout and try running "update publication status", I just get:

2016-03-04 15:32:51 > Job failed: Called a Gearman worker with an invalid QubitJob id.

I guess that's a race condition: we're telling a different process "hey, go and look at the database to take this job with ID $id" before we have committed the database transaction, right? The transaction is probably not committed until the related Propel connection object is destroyed, when Symfony is done with the request.

I think we need to look at QubitJob::runJob() rolling its own transaction, so we can commit it before we dispatch the job to gearmand. I know Propel has some notion of nested transactions, but I don't know the details. You can start by looking at code like https://git.io/v2bGk, but I suspect that if you do a ->beginTransaction() ... ->commit() it will just work out of the box.
#4 Updated by José Raddaoui Marín about 6 years ago
PropelPDO keeps track of the currently running transaction and avoids nesting, so in the request process everything ends up wrapped in one big transaction.
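A hedged sketch of that behavior (this is not Propel's actual code, just the usual counter-based pattern): the connection tracks transaction depth, so only the outermost beginTransaction()/commit() pair actually reaches the database, and every nested pair is a no-op.

```python
class CountingConnection:
    """Collapses nested transactions with a depth counter, PropelPDO-style."""

    def __init__(self):
        self.depth = 0
        self.log = []  # records the statements actually sent to the DB

    def begin_transaction(self):
        if self.depth == 0:
            self.log.append("BEGIN")    # real BEGIN only at the outermost level
        self.depth += 1

    def commit(self):
        self.depth -= 1
        if self.depth == 0:
            self.log.append("COMMIT")   # real COMMIT only for the outermost pair

conn = CountingConnection()
conn.begin_transaction()  # the transaction filter opens the "big" transaction
conn.begin_transaction()  # model code "begins" again: no-op
conn.commit()             # inner commit: no-op
conn.commit()             # outermost commit: actually commits
print(conn.log)           # ['BEGIN', 'COMMIT']
```

This is why an extra ->beginTransaction() ... ->commit() inside the request, as suggested above, would not commit early: it just increments and decrements the counter.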
Also, and more importantly, the Gearman worker in AtoM uses a different (persistent) connection than the one created for each request, so I don't think Sevein's suggestion will work.
A solution to just this issue would be to send the entire job, instead of only the id, here. But we may find other race conditions with other resources used by both processes (the job and the request); although I don't think we have such a case right now, it could easily happen in the future.
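The "send the entire job" option can be sketched like this (hedged: field names and the in-process queue are made up, standing in for the Gearman message payload). Because the whole job is serialized into the queue message, the worker never has to read the job row from the database, so the commit ordering no longer matters for this particular lookup:

```python
import json
import queue

jobs_queue = queue.Queue()  # stand-in for the Gearman job queue

# Instead of enqueueing only the id, serialize the whole job row:
job = {"id": 1, "name": "updatePublicationStatus", "user_id": 7}
jobs_queue.put(json.dumps(job))

# Worker side: everything it needs is in the message, so there is
# no database lookup and no race on the job row's visibility.
received = json.loads(jobs_queue.get())
print(received["name"])
```

As the comment notes, this only fixes the job-row lookup; any other shared resource the worker reads could still race against the request's open transaction.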
Another option would be to commit the transaction just before we send the job to the queue here. We usually don't do much in the request process after launching the job, and it should be possible to start a new transaction after the job is sent. But I'll have to run some tests to be sure.
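A hedged sketch of that ordering, again modeled with sqlite3 (the dispatch helper and names are made up): commit the request's open transaction first so the job row is visible to other connections, then dispatch the id, then reopen a transaction for whatever the request still needs to do.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "atom.db")
conn = sqlite3.connect(path)  # the request's connection
conn.execute("CREATE TABLE job (id INTEGER PRIMARY KEY, name TEXT)")
conn.commit()

dispatched = []

def submit_to_gearman(job_id):  # stand-in for the real dispatch call
    dispatched.append(job_id)

conn.execute("INSERT INTO job (id, name) VALUES (1, 'updatePublicationStatus')")

# Proposed ordering: commit first, then dispatch, then reopen a transaction.
conn.commit()            # the job row is now visible to other connections
submit_to_gearman(1)     # safe: the worker can find the row by id
conn.execute("BEGIN")    # the rest of the request runs transactionally again

# A worker connection can now see the row even though the request
# has a new transaction open:
worker = sqlite3.connect(path)
row = worker.execute("SELECT name FROM job WHERE id = 1").fetchone()
print(dispatched, row)
```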
#5 Updated by José Raddaoui Marín about 6 years ago
- Status changed from New to Code Review
- Assignee changed from José Raddaoui Marín to Mike Gale
The last solution seems to work well to fix the job failure, but the never-ending jobs are caused by another, related issue, which was also breaking some of the tasks, as Dan reported to me from his local testing.

The problem was that we use the QubitTransactionFilter::getConnection() function in most of the model inserts and updates to get the Propel connection.

So I think that's why the beginTransaction call had been commented out there before: when you call that function from outside a request process (jobs and tasks), the transaction never gets committed. I've moved the beginTransaction call to the execute method, so that it's only called in the request cycle.
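A hedged sketch of that fix (class and method names mirror the ones mentioned above, but the bodies are invented): getConnection() is shared by requests, CLI tasks, and jobs, so it must not begin a transaction itself; only the filter's execute() method, which runs exclusively in the request cycle, wraps work in a transaction.

```python
class TransactionFilter:
    """Toy model of QubitTransactionFilter: transactions only in execute()."""

    def __init__(self, statement_log):
        self.statement_log = statement_log

    def get_connection(self):
        # Shared by requests, tasks, and jobs: must NOT begin a transaction,
        # otherwise tasks and jobs are left with a never-committed transaction.
        return self.statement_log

    def execute(self, handle_request):
        # Only the web-request cycle wraps its work in a transaction.
        self.statement_log.append("BEGIN")
        handle_request(self.get_connection())
        self.statement_log.append("COMMIT")

log = []
f = TransactionFilter(log)

# A CLI task just grabs the connection: no transaction is opened.
task_conn = f.get_connection()
task_conn.append("INSERT (task)")

# A web request goes through execute(), which wraps it in a transaction.
f.execute(lambda c: c.append("INSERT (request)"))
print(log)
```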
#8 Updated by Jesús García Crespo about 6 years ago
- Status changed from QA/Review to Feedback
- Assignee changed from Dan Gillean to José Raddaoui Marín
The latest changes in the transaction filter broke the installer, because Propel::getConnection() was raising an uncaught exception. I took care of that here: https://github.com/artefactual/atom/pull/301.