Feature #13578

Schedule removal of old data from the "access_log" table

Added by David Juhasz about 1 month ago. Updated about 1 month ago.

Status: Verified
Start date: 10/20/2021
Priority: Medium
Due date:
Assignee: -
% Done: 100%
Category: Internals
Target version: Release 2.7.0
Google Code Legacy ID:
Tested version: 2.7
Sponsored: No
Requires documentation:

Description

A row is added to the "access_log" table every time an AtoM resource (e.g. archival description, authority record, archival institution) is loaded. This data is used to populate the "Popular this week" feature on the AtoM homepage.

Problem

AtoM currently has no mechanism to remove rows from the "access_log" table, so the table can quickly grow very large on a high-traffic AtoM site. The "Popular this week" feature only needs usage data for the previous seven days, so any data older than seven days is unnecessary.

Desired functionality

Provide an automated way to remove obsolete data from the "access_log" table.
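
As a rough illustration of what such a removal involves, the Python sketch below deletes rows older than seven days from an in-memory SQLite table. The column names (object_id, access_date), the expire_access_log helper and the use of SQLite are assumptions made for this example only; AtoM is a PHP application and its actual implementation may differ.

```python
import sqlite3
from datetime import datetime, timedelta


def expire_access_log(conn, now, max_age_days=7):
    """Delete access_log rows older than max_age_days; return the number deleted.

    Hypothetical helper: the access_date column name is assumed for illustration.
    """
    cutoff = (now - timedelta(days=max_age_days)).strftime("%Y-%m-%d %H:%M:%S")
    cur = conn.execute("DELETE FROM access_log WHERE access_date < ?", (cutoff,))
    conn.commit()
    return cur.rowcount


def demo():
    """Build a tiny sample table, expire old rows, and report the counts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE access_log ("
        "id INTEGER PRIMARY KEY, object_id INTEGER, access_date TEXT)"
    )
    now = datetime(2021, 10, 27, 12, 0, 0)
    rows = [
        (1, "2021-10-01 09:00:00"),  # older than seven days
        (2, "2021-10-19 09:00:00"),  # older than seven days
        (3, "2021-10-25 09:00:00"),  # within the last seven days
    ]
    conn.executemany(
        "INSERT INTO access_log (object_id, access_date) VALUES (?, ?)", rows
    )
    deleted = expire_access_log(conn, now)
    remaining = conn.execute("SELECT COUNT(*) FROM access_log").fetchone()[0]
    return deleted, remaining
```

Passing the current time in as a parameter, rather than reading the clock inside the helper, keeps the cutoff easy to test.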

History

#1 Updated by David Juhasz about 1 month ago

  • Description updated (diff)

#2 Updated by David Juhasz about 1 month ago

  • Description updated (diff)

#3 Updated by David Juhasz about 1 month ago

  • Status changed from New to In progress
  • Assignee set to David Juhasz

#4 Updated by David Juhasz about 1 month ago

  • Subject changed from Schedule removal of old data from the access_log table to Schedule removal of old data from the "access_log" table

I considered three possible solutions to expire "access_log" data after seven days:

  1. Expire old data when the "Popular this week" widget is loaded (in the DefaultPopularComponent class)
  2. Expire old data when adding a new "access_log" row (in the QubitAccessLogObserver class)
  3. [Chosen] Expire old data with a CLI script (e.g. tools:expire-data)
I selected Option 3 for the following reasons:
  • With Option 1, if a client removes the "Popular this week" widget from their home page (several of our clients do), data will continue to be added to the access_log table, will never be used (it only populates "Popular this week"), AND will never be expired
  • Option 2 would run the expiration code on every page load of an archival description, authority record, repository, or function (see: https://github.com/artefactual/atom/commit/caf01bb3a94e86ed74b630d66fc9cdcb0baaea4b), which seems excessive
  • Both Options 1 and 2 would delete "access_log" data automatically, even though site administrators may not be aware of the change (e.g. when they update to Release 2.7.0); if the "access_log" data is being used for purposes other than the "Popular this week" feature, the deletion could be problematic
  • Option 3 requires running the expiration script explicitly, either manually or via a scheduler. This ensures that AtoM site administrators choose whether the expiration happens, and on what schedule (e.g. daily, weekly, monthly)

#5 Updated by David Juhasz about 1 month ago

  • Status changed from In progress to Code Review

I opened a pull request that adds an "access_log" option to the tools:expire-data CLI script:
https://github.com/artefactual/atom/pull/1455

#6 Updated by David Juhasz about 1 month ago

  • Status changed from Code Review to QA/Review
  • Assignee deleted (David Juhasz)

#7 Updated by David Juhasz about 1 month ago

  • Tested version deleted (2.6)

#8 Updated by David Juhasz about 1 month ago

I found a bug when expiring multiple resource types at the same time: the first calculated "expiry date" is used for all subsequent types.

To reproduce:
1) Expire multiple resource types without using the "--older-than" option. E.g.

symfony tools:expire-data clipboard,job,access_log

Resulting error

The first "expiry date" calculated will be used for all resource types. E.g.

>> expire-data Used app_clipboard_save_max_age setting to set expiry date of 2021-10-20.

  Are you sure you want to delete saved clipboards older than 2021-10-20 (y/N)?

y
>> expire-data 0 saved clipboards deleted.

  Are you sure you want to delete jobs (and any related files) older than 2021-10-20 (y/N)?

y
>> expire-data 0 jobs (and any related files) deleted.

  Are you sure you want to delete access logs older than 2021-10-20 (y/N)?

Expected outcome

The expiry date for each resource type should be calculated independently based on the rules for that resource type. E.g. the default expiry date for access_log should be 7 days before the current date.
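
The difference between the reported and the expected behaviour can be sketched in Python as follows. This is an illustration only, not AtoM's PHP code: the function names are hypothetical, and the default ages for clipboard and job are invented values (only the seven-day access_log default comes from this ticket). The first function shows one plausible way the symptom can arise, where a date computed for the first type is reused; the second computes the date separately for each type.

```python
from datetime import date, timedelta

# Hypothetical default maximum ages in days. Only the 7-day access_log
# default is taken from the ticket; the other values are made up.
DEFAULT_MAX_AGE_DAYS = {"clipboard": 30, "job": 90, "access_log": 7}


def expiry_dates_buggy(types, today):
    """Reproduce the symptom: the first computed date is reused for all types."""
    expiry = None
    result = {}
    for name in types:
        if expiry is None:  # only computed for the first type in the list
            expiry = today - timedelta(days=DEFAULT_MAX_AGE_DAYS[name])
        result[name] = expiry
    return result


def expiry_dates_fixed(types, today):
    """Compute an independent expiry date for every resource type."""
    return {
        name: today - timedelta(days=DEFAULT_MAX_AGE_DAYS[name])
        for name in types
    }
```

With the types clipboard, job and access_log, the buggy variant returns the same date three times, while the fixed variant returns a date seven days before today for access_log regardless of its position in the list.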

#10 Updated by David Juhasz about 1 month ago

  • Requires documentation set to Yes

#12 Updated by Dan Gillean about 1 month ago

  • Status changed from QA/Review to Verified
  • Target version set to Release 2.7.0
  • % Done changed from 0 to 100
  • Requires documentation deleted (Yes)
