Bug #13109

AtoM worker service is unstable in systemd

Added by Miguel Angel Medinilla Luque 3 months ago. Updated about 1 month ago.

Status: Verified
Start date: 07/12/2019
Priority: Medium
Due date:
Assignee: Dan Gillean
% Done: 0%
Category: Job scheduling
Target version: Release 2.5.2
Google Code Legacy ID:
Tested version: 2.5
Sponsored: No
Requires documentation:

Description

AtoM DIP uploads fail and the nginx error log says:

2019/07/12 10:43:53 [error] 5616#5616: *124382 FastCGI sent in stderr: "PHP message: No Gearman worker available that can handle the job qtSwordPluginWorker" while reading response header from upstream, client: 51.89.136.99, server: host.accesstomemory.org, request: "POST /sword/deposit/demo HTTP/1.1", upstream: "fastcgi://unix:/var/run/php-fpm.atom.sock:", host: "host.accesstomemory.org" 

Checking gearmand from the atom-worker, the qtSword plugin was shown as enabled:

root@atom-host:/usr/share/nginx/udundee/src# sudo -u www-data php symfony jobs:worker
2019-07-12 12:37:59 > New ability: arFindingAidJob
2019-07-12 12:37:59 > New ability: arInheritRightsJob
2019-07-12 12:37:59 > New ability: arCalculateDescendantDatesJob
2019-07-12 12:37:59 > New ability: arObjectMoveJob
2019-07-12 12:37:59 > New ability: arInformationObjectCsvExportJob
2019-07-12 12:37:59 > New ability: qtSwordPluginWorker
2019-07-12 12:37:59 > New ability: arUpdatePublicationStatusJob
2019-07-12 12:37:59 > New ability: arFileImportJob
2019-07-12 12:37:59 > New ability: arInformationObjectXmlExportJob
2019-07-12 12:37:59 > New ability: arXmlExportSingleFileJob
2019-07-12 12:37:59 > New ability: arGenerateReportJob
2019-07-12 12:37:59 > New ability: arActorCsvExportJob
2019-07-12 12:37:59 > New ability: arActorXmlExportJob
2019-07-12 12:37:59 > New ability: arRepositoryCsvExportJob
2019-07-12 12:37:59 > New ability: arUpdateEsIoDocumentsJob
2019-07-12 12:37:59 > Running worker...
2019-07-12 12:37:59 > PID 7224

The worker and gearmand services were up:

root@atom-host:/var/log/nginx# ps aux | grep jobs:worker
root      7438  0.0  0.0  14856  1116 pts/0    S+   11:40   0:00 grep --color=auto jobs:worker
www-data 22971  0.0  0.6 361948 42592 ?        Ss   Jul10   0:48 /usr/bin/php7.2 -d memory_limit=-1 -d error_reporting=E_ALL /usr/share/nginx/host/src/symfony jobs:worker
root@atom-host:/var/log/nginx# ps aux | grep gearman
root      7440  0.0  0.0  14856  1012 pts/0    S+   11:40   0:00 grep --color=auto gearman
gearman  12852  0.0  0.0 497876  5668 ?        Ssl  Jun06  13:46 /usr/sbin/gearmand --pid-file=/run/gearman/gearmand.pid --listen=127.0.0.1 --port=4730 --daemon --log-file=/var/log/gearman-job-server/gearmand.log --queue-type=builtin

There are no errors in the gearmand logs.
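
As an aside, the abilities actually registered with the running Gearman server (as opposed to those listed by a fresh foreground jobs:worker run) can be checked with gearadmin, if the gearman-tools package is installed. A minimal sketch, assuming gearmand is listening on the default 127.0.0.1:4730 as above:

# List registered functions and the number of available workers for each;
# qtSwordPluginWorker should appear here with at least one worker.
gearadmin --status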

After restarting services in AtoM, the issue was fixed and the same DIP was uploaded to the same slug:

root@atom-host:/var/log/nginx# sudo service gearman-job-server restart
root@atom-host:/var/log/nginx# sudo service php7.2-fpm restart
root@atom-host:/var/log/nginx# sudo service atom-worker-host restart

No actions were taken on the Archivematica server, so the issue seems to be located in the AtoM server.

System Specs: AtoM 2.5.1 (4bc5202bdf6457e8de6eed2b9b9822ff76491ce0), Ubuntu Bionic, php7.2-fpm

I could reproduce the same issue in another AtoM 2.5 and Ubuntu Bionic deploy.


Related issues

Related to Access to Memory (AtoM) - Feature #13117: Notify the need of restarting the AtoM worker when the qt... Verified 07/20/2019

History

#1 Updated by José Raddaoui Marín 3 months ago

  • Status changed from New to Feedback
  • Assignee set to Miguel Angel Medinilla Luque

Hi Miguel Angel,

After enabling the SWORD plugin, the AtoM worker needs to be restarted to pick up the new qtSwordPluginWorker job. Restarting the worker is not the same as checking the output of the "jobs:worker" task, as they run in different contexts.
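
To make the distinction concrete, a hedged sketch of the two contexts (the unit name atom-worker is an assumption matching the later updates; in the description it is atom-worker-host):

# Restart the systemd-managed worker so it registers the new ability with gearmand
sudo systemctl restart atom-worker

# Running the task by hand only starts a separate, foreground worker for inspection;
# it does not change what the managed service has registered
sudo -u www-data php symfony jobs:worker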

Please, let me know if there are other issues, otherwise I think we're good.

#2 Updated by José Raddaoui Marín 3 months ago

I'm seeing a similar error in the Bionic vagrant box from deploy-pub, in this case with a different job, so this may be worth a deeper look as it's not only related to the qtSwordPlugin enable/disable process.

vagrant@ubuntu-bionic:/usr/share/nginx/atom$ sudo systemctl status atom-worker
● atom-worker.service - AtoM worker
   Loaded: loaded (/lib/systemd/system/atom-worker.service; enabled; vendor preset: enabled)
   Active: active (running) since Fri 2019-07-12 19:18:23 UTC; 10min ago
 Main PID: 3019 (php7.2)
    Tasks: 1 (limit: 4704)
   CGroup: /system.slice/atom-worker.service
           └─3019 /usr/bin/php7.2 -d memory_limit=-1 -d error_reporting=E_ALL symfony jobs:worker

Jul 12 19:18:27 ubuntu-bionic php7.2[3019]: 2019-07-12 12:18:27 > New ability: arFileImportJob
Jul 12 19:18:27 ubuntu-bionic php7.2[3019]: 2019-07-12 12:18:27 > New ability: arInformationObjectXmlExportJob
Jul 12 19:18:27 ubuntu-bionic php7.2[3019]: 2019-07-12 12:18:27 > New ability: arXmlExportSingleFileJob
Jul 12 19:18:27 ubuntu-bionic php7.2[3019]: 2019-07-12 12:18:27 > New ability: arGenerateReportJob
Jul 12 19:18:27 ubuntu-bionic php7.2[3019]: 2019-07-12 12:18:27 > New ability: arActorCsvExportJob
Jul 12 19:18:27 ubuntu-bionic php7.2[3019]: 2019-07-12 12:18:27 > New ability: arActorXmlExportJob
Jul 12 19:18:27 ubuntu-bionic php7.2[3019]: 2019-07-12 12:18:27 > New ability: arRepositoryCsvExportJob
Jul 12 19:18:27 ubuntu-bionic php7.2[3019]: 2019-07-12 12:18:27 > New ability: arUpdateEsIoDocumentsJob
Jul 12 19:18:27 ubuntu-bionic php7.2[3019]: 2019-07-12 12:18:27 > Running worker...
Jul 12 19:18:27 ubuntu-bionic php7.2[3019]: 2019-07-12 12:18:27 > PID 3019

vagrant@ubuntu-bionic:/usr/share/nginx/atom$ sudo tail -f /var/log/nginx/error.log
2019/07/12 19:18:36 [notice] 3238#3238: signal process started
2019/07/12 19:28:14 [error] 3239#3239: *8 FastCGI sent in stderr: "PHP message: No Gearman worker available that can handle the job arUpdateEsIoDocumentsJob" while reading response header from upstream, client: 192.168.168.1, server: _, request: "POST /art-gallery-of-ontario-research-library-and-archives/edit HTTP/1.1", upstream: "fastcgi://unix:/var/run/php-fpm.atom.sock:", host: "192.168.168.199", referrer: "http://192.168.168.199/art-gallery-of-ontario-research-library-and-archives/edit" 

Maybe a Gearman server <-> AtoM worker connection issue in Bionic??

#3 Updated by José Raddaoui Marín 3 months ago

As mentioned by Miguel Angel, restarting the worker solves the issue. However, after restarting the machine, the worker is down again, this time with a different error:

vagrant@ubuntu-bionic:~$ sudo systemctl status atom-worker
● atom-worker.service - AtoM worker
   Loaded: loaded (/lib/systemd/system/atom-worker.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Fri 2019-07-12 20:13:14 UTC; 5min ago
  Process: 1040 ExecStart=/usr/bin/php7.2 -d memory_limit=-1 -d error_reporting=E_ALL symfony jobs:worker (code=e
 Main PID: 1040 (code=exited, status=1/FAILURE)

Jul 12 20:13:14 ubuntu-bionic systemd[1]: Started AtoM worker.
Jul 12 20:13:14 ubuntu-bionic php7.2[1040]: Could not open input file: symfony
Jul 12 20:13:14 ubuntu-bionic systemd[1]: atom-worker.service: Main process exited, code=exited, status=1/FAILURE
Jul 12 20:13:14 ubuntu-bionic systemd[1]: atom-worker.service: Failed with result 'exit-code'.

Maybe the worker is started too soon?

Not sure if this particular issue will happen in production instances or if it's related to the VM synced folder ...

#5 Updated by José Raddaoui Marín 3 months ago

Hi David H. and Miguel Angel,

Sevein's fix from the PR above allows the worker to pick up the qtSwordPluginWorker ability when the qtSwordPlugin is enabled and the cache is NOT involved. When the plugin is enabled/disabled through the GUI, the cache is cleared in the same request, and restarting the worker will also add/remove the ability right away. However, if caching is involved and the plugin is enabled/disabled directly in the database or using a CLI task, the cache must be cleared manually before restarting the worker for the ability to be added/removed.
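
For reference, a minimal sketch of that manual sequence (the install path and unit name are assumptions based on the examples above; symfony cc is the standard Symfony cache-clear task AtoM uses):

cd /usr/share/nginx/atom
sudo -u www-data php symfony cc        # clear the Symfony cache
sudo systemctl restart atom-worker     # restart the worker so it re-reads the enabled plugins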

Additionally, I think we should revisit how we deploy the worker with the Ansible role. From my previous updates, you can ignore update 3 (related to the Vagrant synced folder), but update 2 is still an issue for me. In some cases, the worker seems to be running but the Gearman server is not aware of it, as explained in that update. In other cases, like restarting a default Bionic Vagrant box, the worker fails on start with the following error:

Jul 20 16:40:12 vagrant systemd[1]: Started AtoM worker.
Jul 20 16:40:13 vagrant php7.2[535]: Unable to open PDO connection [wrapped: SQLSTATE[HY000] [2002] Connection refused]
Jul 20 16:40:13 vagrant php7.2[535]: <!DOCTYPE html>
Jul 20 16:40:13 vagrant php7.2[535]: <html>
Jul 20 16:40:13 vagrant php7.2[535]:   <head>
Jul 20 16:40:13 vagrant php7.2[535]:     <title>Error</title>
Jul 20 16:40:13 vagrant php7.2[535]:     <link rel="stylesheet" type="text/css" href="symfony/plugins/arDominionPlugin/css/main.css"/>
Jul 20 16:40:13 vagrant php7.2[535]:   </head>
Jul 20 16:40:13 vagrant php7.2[535]:   <body class="yui-skin-sam admin error">
Jul 20 16:40:13 vagrant php7.2[535]:     <div id="wrapper" class="container">
Jul 20 16:40:13 vagrant php7.2[535]:       <section class="admin-message" id="error-404">
Jul 20 16:40:13 vagrant php7.2[535]:         <h2>
Jul 20 16:40:13 vagrant php7.2[535]:           <img alt="" src="symfony/images/logo.png"/>
Jul 20 16:40:13 vagrant php7.2[535]:           Oops! An Error Occurred
Jul 20 16:40:13 vagrant php7.2[535]:         </h2>
Jul 20 16:40:13 vagrant php7.2[535]:         <p>
Jul 20 16:40:13 vagrant php7.2[535]:           Sorry, something went wrong.<br />
Jul 20 16:40:13 vagrant php7.2[535]:           The server returned a 500 Internal Server Error.
Jul 20 16:40:13 vagrant php7.2[535]:         </p>
Jul 20 16:40:13 vagrant php7.2[535]:         <div class="tips">
Jul 20 16:40:13 vagrant php7.2[535]:           <p>
Jul 20 16:40:13 vagrant php7.2[535]:             Try again a little later or ask in the <a href="http://groups.google.ca/group/ica-atom-users">discussion group</a>.<br />
Jul 20 16:40:13 vagrant php7.2[535]:             <a href="javascript:history.go(-1)">Back to previous page.</a>
Jul 20 16:40:13 vagrant php7.2[535]:           </p>
Jul 20 16:40:13 vagrant php7.2[535]:         </div>
Jul 20 16:40:13 vagrant php7.2[535]:       </section>
Jul 20 16:40:13 vagrant php7.2[535]:   </body>
Jul 20 16:40:13 vagrant php7.2[535]: </html>
Jul 20 16:40:13 vagrant systemd[1]: atom-worker.service: Main process exited, code=exited, status=1/FAILURE
Jul 20 16:40:13 vagrant systemd[1]: atom-worker.service: Failed with result 'exit-code'.

In both situations it looks like we're starting the worker too soon. Maybe the OS upgrade or the new MySQL version in new deploys/vagrant boxes is making this problem appear more often than before.

Moreover, looking at the service file from the Ansible role, we don't restart the worker on failure. For example, if MySQL has gone away for a moment, the worker will die and never come back, with the following error:

Jul 20 16:42:57 vagrant php7.2[1389]: 2019-07-20 09:42:57 > Running worker...
Jul 20 16:42:57 vagrant php7.2[1389]: 2019-07-20 09:42:57 > PID 1389
Jul 20 17:13:31 vagrant php7.2[1389]:                                                                    
Jul 20 17:13:31 vagrant php7.2[1389]:   SQLSTATE[HY000]: General error: 2006 MySQL server has gone away
Jul 20 17:13:31 vagrant php7.2[1389]:                                                                    
Jul 20 17:13:31 vagrant systemd[1]: atom-worker.service: Main process exited, code=exited, status=1/FAILURE
Jul 20 17:13:31 vagrant systemd[1]: atom-worker.service: Failed with result 'exit-code'.

It looks like we had some respawn rules when we used Upstart.
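
For comparison, Upstart handled this with respawn stanzas along these lines (a generic sketch, not the actual old AtoM config, which isn't shown in this ticket), and the rough systemd counterparts are the directives discussed in the following updates:

# Upstart (generic sketch)
respawn
respawn limit 5 60          # at most 5 respawns within 60 seconds

# Approximate systemd equivalent
Restart=on-failure
StartLimitBurst=5
StartLimitIntervalSec=60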

#6 Updated by José Raddaoui Marín 3 months ago

  • Related to Feature #13117: Notify the need of restarting the AtoM worker when the qtSwordPlugin is enabled/disabled through the GUI added

#7 Updated by José Raddaoui Marín 3 months ago

  • Requires documentation set to Yes

The same issue is probably happening in external deploys, as we also have a similar AtoM worker service configuration in our docs:

https://www.accesstomemory.org/es/docs/2.5/admin-manual/installation/asynchronous-jobs/#installation-asynchronous-jobs

#8 Updated by Dan Gillean 3 months ago

  • Target version changed from Release 2.5.1 to Release 2.5.2

It looks like we had some respawn rules when we used Upstart.

Some further ideas that are worth investigating have been discussed in the user forum. See:

#9 Updated by José Raddaoui Marín 3 months ago

  • Subject changed from DIP Upload gearmand issue to AtoM worker service is unstable in systemd

#10 Updated by José Raddaoui Marín 3 months ago

  • Category set to Job scheduling

#11 Updated by Dan Gillean 3 months ago

To summarize this long ticket:

There are a couple of issues with the worker that need to be addressed. The smaller parts of this are:

1) When and how we start the worker, in relation to the qtSwordPlugin, the upgrade task, and others. Jesus has a PR that addresses this here: https://github.com/artefactual/atom/pull/928

2) We need to remind users that the worker will need a restart whenever the qtSwordPlugin is enabled/disabled manually via the GUI. There is a separate ticket to add a notification reminder of this in #13117.

The bigger part is deciding on how to revise the systemd atom-worker configuration file to prevent the worker from dying and not restarting. There are several considerations here:

1) We previously had some respawn rules which were lost when we moved to systemd. See comment 5 above and the previous config.

2) Several users have shared that they have added values such as Restart=always and RestartSec=5 to make the workers more persistent. We haven't tested this ourselves, and there are alternative options to consider, such as Restart=on-failure.

3) There are other systemd params we might want to consider to prevent race conditions, such as StartLimitIntervalSec, StartLimitBurst, or even JobTimeoutAction, JobTimeoutRebootArgument, JobTimeoutSec, JobRunningTimeoutSec, etc. See:

https://www.freedesktop.org/software/systemd/man/systemd.unit.html#StartLimitIntervalSec=interval

4) David has reported that he added a wait value for MySQL to his AtoM worker config and that seems to be working. We may want to use a combination of these approaches. Currently we are using a value like "ExecStartPre=/bin/sleep 10s" to delay the start of the atom-worker, but there may be cases where this is not enough. If waiting for MySQL is the only issue, then one alternative would be to add values such as Requires=mysqld.service and After=mysqld.service, as sketched below.
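
A sketch of what that combination could look like in the unit file (whether the unit is mysql.service or mysqld.service depends on the distribution, and update 16 below argues against hard MySQL dependencies for multi-node deploys):

[Unit]
After=network.target mysql.service
Requires=mysql.service

[Service]
# Crude stop-gap delay; may still not be enough on slow hosts
ExecStartPre=/bin/sleep 10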

#12 Updated by David Juhasz 3 months ago

EDIT: Updated to incorporate restart limiting options from https://github.com/artefactual/atom/issues/933 and to use "Restart=on-failure" as recommended by https://www.freedesktop.org/software/systemd/man/systemd.service.html

This atom-worker systemd configuration is working in my AtoM dev vagrant box, and includes waiting for the MySQL service to start and a Restart=on-failure directive:

[Unit]
Description=AtoM worker
After=network.target
After=mysql.service
Requires=mysql.service
StartLimitIntervalSec=60
StartLimitBurst=4

[Install]
WantedBy=multi-user.target

[Service]
Type=simple
User=vagrant
Group=vagrant
WorkingDirectory=/home/vagrant/atom
ExecStart=/usr/bin/php7.2 -d memory_limit=-1 -d error_reporting="E_ALL" symfony jobs:worker
KillSignal=SIGTERM
Restart=on-failure
RestartSec=5

# AtoM PHP pool vars
Environment=ATOM_DEBUG_IP="127.0.0.1" 
Environment=ATOM_READ_ONLY="off" 

Of course options such as User, Group, WorkingDirectory, and /usr/bin/php7.2 must be customized for the target environment.

#13 Updated by José Raddaoui Marín 3 months ago

I wonder if waiting for MySQL would be enough. Looking at the case from the description (which I was able to reproduce locally in update 2), the AtoM worker seems to be up without errors, but the jobs can't be triggered. It looks like the Gearman server doesn't know about the AtoM worker.

I don't know if the reason for that issue is starting the AtoM worker before the Gearman server is running, but I think we should investigate that case a little further.

#14 Updated by José Raddaoui Marín 3 months ago

I did a couple of tests in the Docker environment to see if I could reproduce the issue from the description there. Starting the AtoM worker fails when the Gearman server is down: "Couldn't connect to any available servers". Triggering a background job when the server is down shows a different error in the Nginx error log: "Could not connect to gearmand:4730". So I don't really know what the problem is :(

#15 Updated by José Raddaoui Marín 3 months ago

OK, the reason for the issue in the description and in update 2 is the same as the problem described in #13108. In my case it happens when I load the demo database over a running instance, and it's probably the same for Miguel Angel deploying with Ansible. In the purge task, or with any other change made directly to the database, the site title and/or base URL are modified and the worker is not restarted.

#16 Updated by José Raddaoui Marín 3 months ago

  • Assignee changed from Miguel Angel Medinilla Luque to David Juhasz

Hi David,

Looking at the configuration example from update 12, I think we should not include the MySQL conditions:

After=mysql.service
Requires=mysql.service

Those may help in single-machine environments, like the Vagrant box, but they will prevent the worker from starting if MySQL is running on a different machine. The AtoM worker will fail with an error if the connection to MySQL fails, triggering the restart on failure, so I think we're good with only the new start/restart conditions.

#17 Updated by David Juhasz 3 months ago

José Raddaoui Marín wrote:

Hi David,

Looking at the configuration example from update 12, I think we should not include the MySQL conditions:

[...]

Those may help in single-machine environments, like the Vagrant box, but they will prevent the worker from starting if MySQL is running on a different machine. The AtoM worker will fail with an error if the connection to MySQL fails, triggering the restart on failure, so I think we're good with only the new start/restart conditions.

Yeah, good point Radda. I would rather not rely on the service restart in single node environments to get around the MySQL/worker start order problem, but I agree that it's simpler to use a config that works for multi-node and single-node deploys. Updated config below:

[Unit]
Description=AtoM worker
After=network.target
StartLimitIntervalSec=60
StartLimitBurst=4

[Install]
WantedBy=multi-user.target

[Service]
Type=simple
User=vagrant
Group=vagrant
WorkingDirectory=/home/vagrant/atom
ExecStart=/usr/bin/php7.2 -d memory_limit=-1 -d error_reporting="E_ALL" symfony jobs:worker
KillSignal=SIGTERM
Restart=on-failure
RestartSec=5

# AtoM PHP pool vars
Environment=ATOM_DEBUG_IP="127.0.0.1" 
Environment=ATOM_READ_ONLY="off" 

#18 Updated by David Juhasz 3 months ago

  • Assignee changed from David Juhasz to David Hume

#19 Updated by José Raddaoui Marín 3 months ago

Even in single node deploys, waiting for the MySQL service won't guarantee that the worker starts properly, as MySQL may take some time to accept connections (I experienced this in the Docker env.). And the same issues may happen with the Gearman server.

#20 Updated by David Hume 3 months ago

  • Assignee changed from David Hume to David Juhasz

#22 Updated by José Raddaoui Marín 3 months ago

Thanks Davids!!

I was going to add these changes to the AtoM docs and took a look at the restart options. Sorry for not looking at all the changes sooner and for adding more questions now ...

Looking at the restart options docs (here and here), with the new configuration the worker will be restarted at 5-second intervals, a maximum of 4 times within 60 seconds. So, my questions and thoughts are ...

  • 1. If the worker fails immediately (MySQL connection issue):
      • Would 20 seconds be enough for MySQL to init and accept connections?
      • I think it would be better to use `RestartSec=10`.
  • 2. If the worker fails after some time (unexpected error in job execution, out of memory issue?):
      • Would 60 seconds be enough for the worker to restart and fail 4 times?
      • Ideally, we would have rock-solid jobs and over-provisioned servers to avoid this issue, but we can't guarantee that.
      • I couldn't find a way to set `StartLimitIntervalSec` to infinite, and setting it to 0 disables any kind of rate limiting.
      • We could set `StartLimitIntervalSec` to a really high value to make sure the restarts happen 4 times within that period, but I could not find the maximum value that can be used for that option.

I'm more worried about point 2 causing an infinite restart loop than about point 1, but I'd increase both values.

#23 Updated by David Juhasz 3 months ago

  • Assignee changed from David Juhasz to José Raddaoui Marín

Thanks for your feedback Radda. I think we should configure the restart based on the most likely scenarios that would cause the worker to fail. The likely causes of failure that I know of are:

1. The MySQL server is not accepting connections because it is starting up (waiting for a bit should resolve this issue)
2. A job causes a fatal error and kills the process (usually after the worker restarts the same job will run again, and cause the same fatal error, possibly adding duplicate data to the database)

For case #1 I think 30 seconds should be more than enough time to wait for the MySQL server to start and accept connections. I'm fine with bumping up RestartSec to RestartSec=10 to give a bit more time for the server to start.

For case #2 I think allowing the worker to restart too many times will actually increase the likelihood of adding bad data to the database, with a low likelihood of actually allowing the worker to recover from a fatal error.

I can't think of a case where increasing the StartLimitIntervalSec to a very large number would actually lead to better outcomes, but maybe I am just missing a common failure case? Can you think of any examples where the atom-worker is more likely to recover if the restart happens more than a minute after the initial failure?

#24 Updated by José Raddaoui Marín 3 months ago

  • Assignee changed from José Raddaoui Marín to David Juhasz

Thanks David,

The idea of increasing StartLimitIntervalSec is to be sure that the StartLimitBurst count happens within that time. I may be understanding the docs wrong, but it looks to me like both limits are tied together, and if the fourth restart happens 61 seconds after the first one, it won't hit the rate limit and the restart loop will continue.

I'm not trying to increase the chances for the worker to recover (I doubt it will if the problem is a recurring job); I'm trying to prevent an infinite loop restarting the service. We would keep StartLimitBurst at the same number of restarts, and by setting StartLimitIntervalSec to a higher value we would increase the likelihood that those restarts happen within that interval.

#25 Updated by David Juhasz 3 months ago

  • Assignee changed from David Juhasz to José Raddaoui Marín

Oh, you're right Radda - I was thinking that increasing StartLimitIntervalSec would lead to more restarts, but you are right that it actually means fewer restarts are allowed if the value is higher. Maybe we should go with a maximum of 3 restarts in a 24-hour period (e.g. StartLimitIntervalSec=86400)?

It's hard to know how strict is too strict... I think the worst-case scenario is a very long job (like an import) that dies and restarts consistently after running for 25 hours, and nobody notices for a month that duplicate data is being added to the database. :( That seems pretty unlikely though, and three worker restarts within a day is probably a good indication that there is a problem. It makes me wonder if we can set up systemd to email us when the restart limit is reached.

#26 Updated by José Raddaoui Marín 3 months ago

Thanks David, I think that's a good compromise.

If we consider the jobs we have, the ones that worry me the most at this point are arFileImportJob and qtSwordPluginWorker. Using the default PHP-FPM configuration, the post_max_size and upload_max_filesize limits will help with the file import job, making it easy to fail more than 3 times in 24 hours. I hope that, if someone increases those PHP limits a lot or plans to run heavy imports, they also consider using the CLI to perform them.

For the DIP upload job, one of the first things we do in the process is to create a QubitAip object. If the job fails later, causing a worker restart and its re-execution, it will try to create another QubitAip with the same UUID and cause a permanent failure (marking the job as failed and not triggering a worker restart). If the job errors before creating the QubitAip object, it will hit the 3-restarts-in-24-hours limit.

I'd use 24h instead of 86400 to make it more readable and, after reducing StartLimitBurst to 3, I'd increase RestartSec to 15. Please let me know if you think that's okay. Since David H. is on vacation, I'll give it a try locally, and I can amend the ansible-atom role PR and add these changes to the docs.

#27 Updated by José Raddaoui Marín 3 months ago

  • Assignee changed from José Raddaoui Marín to David Juhasz

I've tested the configuration changes using the Vagrant box and I can confirm everything we said. However, we need to increase StartLimitBurst to 4 to give MySQL the time we want. Check this log from trying to start the worker with MySQL down and using the following values:

StartLimitIntervalSec=24h
StartLimitBurst=3

Restart=on-failure
RestartSec=15

As can be seen, StartLimitBurst is actually a count of starts, and at restart 3 (start 4) the worker is stopped directly without trying. That means we're actually giving MySQL 30 seconds to start, while setting it to 4 would give it 45 seconds.
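
To make the timing explicit, assuming each start attempt fails immediately and RestartSec=15:

t=0s    start 1 (initial)    -> fails
t=15s   start 2 (restart 1)  -> fails
t=30s   start 3 (restart 2)  -> fails
t=45s   start 4 (restart 3)  -> blocked with StartLimitBurst=3, still attempted with StartLimitBurst=4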

An important consequence of setting StartLimitIntervalSec to 24h is that you won't be able to restart the worker as usual if it enters the "Start request repeated too quickly" status until those 24 hours expire completely or you run the following commands:

sudo systemctl reset-failed atom-worker
sudo systemctl start atom-worker

Considering that we have to restart the worker manually now, I don't think this is a big deal as long as we document it properly.

#28 Updated by José Raddaoui Marín 3 months ago

It makes me wonder if we can set up systemd to email us when the restart limit is reached.

We could probably use StartLimitAction with a custom script to notify via email, but I'm not sure if there are better ways.
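
As an alternative (untested here), systemd's OnFailure= directive can activate a helper unit once the service finally enters the failed state, i.e. when no further restarts will be attempted; the helper unit and script names below are hypothetical:

# In atom-worker.service, [Unit] section
OnFailure=atom-worker-notify.service

# atom-worker-notify.service (hypothetical helper unit)
[Unit]
Description=Notify by email that atom-worker has failed

[Service]
Type=oneshot
ExecStart=/usr/local/bin/notify-atom-worker-failure.sh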

#29 Updated by David Juhasz 2 months ago

José Raddaoui Marín wrote:

I've tested the configuration changes using the Vagrant box and I can confirm everything we said. However, we need to increase StartLimitBurst to 4 to give MySQL the time we want. Check this log from trying to start the worker with MySQL down and using the following values

I'd rather increase RestartSec to 30 or add an ExecStartPre=/bin/sleep 15 timer than bump up the number of restarts allowed. Again, I think the more restarts we allow, the more duplicate data could be added to the database.

#30 Updated by José Raddaoui Marín 2 months ago

  • Status changed from Feedback to QA/Review
  • Assignee changed from David Juhasz to Dan Gillean

Thanks David!

Pull requests merged:

Ansible role: https://github.com/artefactual-labs/ansible-atom/pull/36
Documentation: https://github.com/artefactual/atom-docs/pull/122

I tested the changes locally using the Vagrant box, but I'm not sure if we want to give it a try somewhere else before verifying this ticket.

#31 Updated by Dan Gillean 2 months ago

  • Requires documentation deleted (Yes)

#32 Updated by Dan Gillean about 1 month ago

  • Status changed from QA/Review to Verified
