Bug #9533

QubitJob sometimes has a race condition due to SQL transactions

Added by Mike Gale about 6 years ago. Updated about 6 years ago.

Status:VerifiedStart date:03/04/2016
Priority:HighDue date:
Assignee:Sara Allain% Done:

0%

Category:Internals
Target version:Release 2.3.0
Google Code Legacy ID: Tested version:2.3
Sponsored:No Requires documentation:

Description

_Recently we tried re-enabling the SQL transaction filter in AtoM to avoid timeouts corrupting our data. However, as probably expected there's an issue that's come up from such a big change. It seems to me that some of the (all?) asynchronous jobs are now broken. For instance, if I run a worker manually to see its output in stdout, and try running update publication status I just get:

2016-03-04 15:32:51 > Job failed: Called a Gearman worker with an invalid QubitJob id._

Below is a good suggestion by Jesús:
_I guess that's a race condition where we're telling a different process "hey go and look at the database to take this job with ID $id" even before we commit the database transaction, right? Probably the transaction is not committed until the related Propel connection object is not destroyed when Symfony is done with the request.

I think that we need to look at QubitJob::runJob() rolling its own transaction, so we can commit it before we dispatch the job to gearmand. I know that Propel had some notion about nested transaction but I don't know the details. You can start looking some code like https://git.io/v2bGk but I'm suspecting that if you do a ->beginTransaction() ... ->commit() it will just work out of the box._


Related issues

Related to Access to Memory (AtoM) - Bug #9401: Fix transaction filter and use transactions for all SQL u... Verified 02/03/2016
Blocks Access to Memory (AtoM) - Bug #9519: Update publication status task not updating ES Verified 03/02/2016

History

#1 Updated by Mike Gale about 6 years ago

  • Related to Bug #9401: Fix transaction filter and use transactions for all SQL updates added

#2 Updated by Mike Gale about 6 years ago

  • Blocks Bug #9519: Update publication status task not updating ES added

#3 Updated by Nick Wilkinson about 6 years ago

  • Assignee changed from Nick Wilkinson to José Raddaoui Marín

#4 Updated by José Raddaoui Marín about 6 years ago

PropelPDO takes care of currently running transactions and avoids nested transactions, so, in the request process, everything is wrapped in one big transaction:

https://github.com/artefactual/atom/blob/qa/2.3.x/vendor/symfony/lib/plugins/sfPropelPlugin/lib/vendor/propel/util/PropelPDO.php#L120-L174

Also, and more important, the Gearman worker in AtoM uses a different (persistent) connection than the one created by each request, so I think Sevein's suggestion won't work.

A solution to just this issue would be to send the entire Job, instead of only the id here, but we may find other race conditions with other resources used by both process (the job and the request). Although I don't think we have such a case right now, it could easily happen in the future.

Another option could be to commit the transaction just before we send the job to the queue in here. We usually don't do a lot in the request process after we launch the job, and it should be possible to start a new transaction after the job is sent here. But I'll have to do some tests to be sure.

Other thoughts?

#5 Updated by José Raddaoui Marín about 6 years ago

  • Status changed from New to Code Review
  • Assignee changed from José Raddaoui Marín to Mike Gale

The last solution seems to be working well to fix the job fail, but the never ending jobs are caused by another related issue, that was also creating issues in some of the tasks, as Dan reported to me from his local testings.

The problem was that we use the QubitTransactionFilter::getConnection() function in most of the model inserts and updates to get the Propel connection:

https://github.com/artefactual/atom/blob/qa/2.3.x/lib/model/om/BaseNote.php#L495
https://github.com/artefactual/atom/blob/qa/2.3.x/lib/model/om/BaseNote.php#L547

So I think that's why the beginConnection call was commented before in there, because when you call that function from outside a request process (jobs and tasks) the transaction never gets committed. I've moved the beginTransaction call to the execute method, that way it's only called in the request cycle.

I've created a pull request with both fixes. Mike G., as you reviewed #9401, could you please take a look?

#6 Updated by Mike Gale about 6 years ago

  • Assignee changed from Mike Gale to José Raddaoui Marín

LGTM.

#7 Updated by José Raddaoui Marín about 6 years ago

  • Status changed from Code Review to QA/Review
  • Assignee changed from José Raddaoui Marín to Dan Gillean

Merged in qa/2.3.x

#8 Updated by Jesús García Crespo about 6 years ago

  • Status changed from QA/Review to Feedback
  • Assignee changed from Dan Gillean to José Raddaoui Marín

The latest changes in the transaction filter broke the installer because Propel::getConnection() was raising an uncontrolled exception. I took care of that here: https://github.com/artefactual/atom/pull/301.

#9 Updated by Jesús García Crespo about 6 years ago

  • Status changed from Feedback to QA/Review
  • Assignee changed from José Raddaoui Marín to Dan Gillean

#10 Updated by Sara Allain about 6 years ago

  • Assignee changed from Dan Gillean to José Raddaoui Marín

How should we go about testing this? Any reproduce-able steps that I can follow?

#11 Updated by Jesús García Crespo about 6 years ago

  • Assignee changed from José Raddaoui Marín to Sara Allain

If async jobs like CSV exports or finding aids are working for you in HEAD (qa/2.3.x) I think you can verify this.

#12 Updated by Sara Allain about 6 years ago

  • Status changed from QA/Review to Verified

Thanks Radda! Verified.

Also available in: Atom PDF