Task #13261

Improve Elasticsearch indexing process

Added by José Raddaoui Marín 8 months ago. Updated 5 months ago.

Status: Verified
Start date: 02/16/2020
Priority: Medium
Due date:
Assignee: -
% Done: 0%
Category: Performance / scalability
Target version: Release 2.6.0
Google Code Legacy ID:
Tested version:
Sponsored: No
Requires documentation:

Description

After the changes from #13224 and #13238, a few more queries have been identified that could be improved to speed up the indexing process:

- Get IO collection root.
- Get IO ancestors.
- Get Actor related terms (including ancestors).
- Get IO material types.
- ...

Also, when the search populate task was tested against a very large database, the process died complaining about memory usage (over 4 GB). We should try to avoid the ORM in places like:

- Get all Accessions.
- Get IO repository.
- ...

Other possible enhancements:

- Avoid stdClass where possible.
- Use PDO::FETCH_ASSOC to fetch results.
- Pass only the resource id (and type) to update (moved to #13272).
- Clear classes cache periodically during the process (may not be needed after avoiding the ORM).

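As a rough illustration of the idea behind avoiding the ORM and fetching plain associative rows, here is a minimal Python sketch. It is not AtoM code: it uses SQLite in place of the real database, and the `accession` table, its columns, and the `iter_rows` helper are all hypothetical. It only shows rows being streamed in batches as plain dicts instead of being hydrated into full objects.

```python
import sqlite3


def iter_rows(conn, sql, params=(), batch_size=1000):
    """Yield rows as plain dicts, fetching in batches to bound memory use."""
    cursor = conn.execute(sql, params)
    columns = [col[0] for col in cursor.description]
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        for row in batch:
            # One small dict per row, no ORM object graph behind it.
            yield dict(zip(columns, row))


# Hypothetical, simplified schema used only for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accession (id INTEGER PRIMARY KEY, identifier TEXT)")
conn.executemany(
    "INSERT INTO accession (identifier) VALUES (?)",
    [("2020-001",), ("2020-002",), ("2020-003",)],
)

accessions = list(
    iter_rows(conn, "SELECT id, identifier FROM accession ORDER BY id", batch_size=2)
)
```

In a real run the rows would be consumed one at a time rather than collected into a list, so only one batch is held in memory at once.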

Related issues

Related to Access to Memory (AtoM) - Task #13224: Improve hierarchy management queries Verified 12/09/2019
Related to Access to Memory (AtoM) - Task #13238: Avoid multiple fetches of the languages enabled from the ... Verified 01/13/2020
Related to Access to Memory (AtoM) - Task #13272: Use resource's id instead of the full resource to create/... New 03/13/2020
Related to Access to Memory (AtoM) - Task #13241: Use CTE (or parent_id keymap) to get the related terms an... Duplicate 01/13/2020

History

#1 Updated by José Raddaoui Marín 8 months ago

  • Related to Task #13224: Improve hierarchy management queries added

#2 Updated by José Raddaoui Marín 8 months ago

  • Related to Task #13238: Avoid multiple fetches of the languages enabled from the database on the search populate task added

#3 Updated by José Raddaoui Marín 8 months ago

  • Description updated (diff)

#4 Updated by José Raddaoui Marín 8 months ago

  • Description updated (diff)

#5 Updated by José Raddaoui Marín 8 months ago

  • Description updated (diff)

#6 Updated by José Raddaoui Marín 7 months ago

  • Description updated (diff)

#7 Updated by José Raddaoui Marín 7 months ago

  • Related to Task #13272: Use resource's id instead of the full resource to create/update the Elasticsearch document added

#8 Updated by José Raddaoui Marín 7 months ago

  • Status changed from In progress to Code Review

#9 Updated by José Raddaoui Marín 6 months ago

  • Related to Task #13241: Use CTE (or parent_id keymap) to get the related terms and their ancestors on the search populate task added

#10 Updated by José Raddaoui Marín 5 months ago

  • Status changed from Code Review to QA/Review
  • Assignee deleted (José Raddaoui Marín)

I have tested this with a considerably large dataset (~1M IOs) and the indexing time went from over 8 days to 6 hours. The four most time-consuming queries (over 90% of the total query time) have been completely removed from the process, and the next one has been reduced to approximately a third. CPU utilization stayed about the same with these changes, while the average disk I/O activity during the process increased from ~500 kB/s to ~2.30 MB/s and the average number of queries per second rose from 275 to 3K.

There are many things to test to verify that these improvements didn't change the indexed data. #13291 added a new task to easily get the indexed document by a resource's slug, which may be useful for these tests. The full indexing process and the single resource index differ a little more after these changes, so I'd suggest checking the resulting document in both cases: when resources are indexed in a full search:populate run, and when they are updated (either saved from the GUI or indexed with the slug option of the search:populate task). In both cases, the metadata most likely to be affected is:

All index types:

- i18n

QubitInformationObject:

- places, subjects and genres (direct and indirect)
- collectionRoot
- mediaType
- repository
- ancestors
- referenceCode

QubitActor:

- places and subjects (direct and indirect)

QubitAccession:

- all fields

QubitTerm:

- descendantsCount

Please let me know if you need more information.
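One possible way to compare the document produced by a full run with the one produced by a single-resource update is an order-insensitive diff of the two. The sketch below is only an assumption about how such a check could look: the `normalize` and `diff_fields` helpers and the sample documents are hypothetical, and the documents are assumed to be already loaded as JSON-compatible dicts.

```python
import json


def normalize(value):
    """Recursively sort lists so that element order does not affect equality."""
    if isinstance(value, dict):
        return {key: normalize(val) for key, val in value.items()}
    if isinstance(value, list):
        items = [normalize(item) for item in value]
        # A canonical JSON string gives a stable sort key for mixed content.
        return sorted(items, key=lambda item: json.dumps(item, sort_keys=True))
    return value


def diff_fields(doc_a, doc_b):
    """Return the sorted top-level field names whose normalized values differ."""
    keys = set(doc_a) | set(doc_b)
    return sorted(
        key for key in keys if normalize(doc_a.get(key)) != normalize(doc_b.get(key))
    )


# Made-up documents: "subjects" differs only in order, "ancestors" in content.
full_doc = {"referenceCode": "CA-001", "subjects": [{"id": 5}, {"id": 9}], "ancestors": [1, 2]}
single_doc = {"referenceCode": "CA-001", "subjects": [{"id": 9}, {"id": 5}], "ancestors": [1, 3]}

changed = diff_fields(full_doc, single_doc)
```

Here `changed` lists only `ancestors`, since the reordered `subjects` are treated as equal. If the order of a field matters for a given check, that field would need to be compared without normalization.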

#11 Updated by Dan Gillean 5 months ago

  • Status changed from QA/Review to Verified

Interesting: in descriptions, the order of the inherited access point terms differed between the two indexing methods. However, this did not result in any material differences to the search results. Calling this verified.

#12 Updated by José Raddaoui Marín 5 months ago

Good catch Dan ;)

The full index uses an in-memory term -> parent lookup to reduce the DB hits, while the single index uses a CTE to get those terms and ancestors without loading all terms in memory. As you noted, the order doesn't really matter at the moment, so that was an extra optimization we could make.

Thanks!
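The two approaches described above can be sketched roughly as follows. This is a simplified illustration and not the actual AtoM implementation: it uses SQLite as a stand-in database, and the `term` table layout and function names are hypothetical. It shows why both paths return the same set of ancestors even though the order is not guaranteed to match.

```python
import sqlite3

# Hypothetical term hierarchy: 1 is the root, 2 and 4 are its children, 3 is under 2.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE term (id INTEGER PRIMARY KEY, parent_id INTEGER)")
conn.executemany(
    "INSERT INTO term (id, parent_id) VALUES (?, ?)",
    [(1, None), (2, 1), (3, 2), (4, 1)],
)


def load_parent_map(conn):
    """Full-index style: load every term -> parent pair once, with one query."""
    return dict(conn.execute("SELECT id, parent_id FROM term"))


def ancestors_from_map(parent_map, term_id):
    """Walk up the in-memory map; no further database hits are needed."""
    ancestors = []
    parent = parent_map.get(term_id)
    while parent is not None:
        ancestors.append(parent)
        parent = parent_map.get(parent)
    return ancestors


def ancestors_from_cte(conn, term_id):
    """Single-index style: one recursive CTE, no need to load all terms."""
    sql = """
        WITH RECURSIVE ancestry(id, parent_id) AS (
            SELECT id, parent_id FROM term WHERE id = ?
            UNION ALL
            SELECT t.id, t.parent_id
            FROM term t JOIN ancestry a ON t.id = a.parent_id
        )
        SELECT id FROM ancestry WHERE id != ?
    """
    return [row[0] for row in conn.execute(sql, (term_id, term_id))]


parent_map = load_parent_map(conn)
from_map = ancestors_from_map(parent_map, 3)
from_cte = ancestors_from_cte(conn, 3)
```

The CTE query has no ORDER BY, so its row order is up to the database; comparing the two results as sets avoids depending on it.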
