Bug #13175

Improve Information Object Elasticsearch indexing speed

Added by Steve Breker over 2 years ago. Updated about 2 years ago.

Status:VerifiedStart date:09/16/2019
Priority:MediumDue date:
Assignee:-% Done:

0%

Category:Search / Browse
Target version:Release 2.5.4
Google Code Legacy ID: Tested version:
Sponsored:Yes Requires documentation:

Description

There is a performance issue when indexing descriptions with place, subject or genre terms for sites where there are a large number of these types of terms.

The specific issue is on this line:

https://github.com/artefactual/atom/blob/qa/2.6.x/plugins/arElasticSearchPlugin/lib/model/arElasticSearchInformationObjectPdo.class.php#L629

The outside bracketing on LFT and RGT will get slower as the number of terms in these taxonomies increases.

There are a few changes I would recommend for the arElasticSearchInformationObjectPdo::getRelatedTerms() method:
  1. Remove the join to get term_i18n.name as this is not used anywhere
  2. Remove the join to get the slug as this is not used anywhere
  3. Change the query to fetch PDO::FETCH_ASSOC instead of PDO::FETCH_OBJ. Change the calling functions to access the id from the resulting array
  4. Change the query structure to use the code from arElasticSearchInformationObjectPdo::getDirectlyRelatedTerms() and then get related terms by recursively crawling up the parent_id fields.

I have built a prototype that finds parents using recursion here: https://gist.github.com/sbreker/abb17641a1f50045056ee77a8041dd97

Finding the parents by id repeatedly in this fashion is faster than using the outside LFT RGT join on large sites.

e.g.:

This query will examine all ~300K place terms for each term on this description:

SELECT
DISTINCT term.id,
term.taxonomy_id,
term.source_culture,
slug.slug,
i18n.name
FROM object_term_relation otr
JOIN term current ON otr.term_id = current.id
JOIN term term ON term.lft <= current.lft AND term.rgt >= current.rgt
JOIN term_i18n i18n ON term.id = i18n.id
JOIN slug slug ON term.id = slug.object_id
WHERE otr.object_id = '1565240'
AND term.taxonomy_id = '35'

+----+-------------+---------+--------+-----------------------------------------------------+---------------------------+---------+------------------+--------+---------------------------------------+
| id | select_type | table   | type   | possible_keys                                       | key                       | key_len | ref              | rows   | Extra                                 |
+----+-------------+---------+--------+-----------------------------------------------------+---------------------------+---------+------------------+--------+---------------------------------------+
|  1 | SIMPLE      | otr     | ref    | object_term_relation_FI_2,object_term_relation_FI_3 | object_term_relation_FI_2 | 4       | const            |     10 | Using temporary                       |
|  1 | SIMPLE      | current | eq_ref | PRIMARY,lft                                         | PRIMARY                   | 4       | atom.otr.term_id |      1 | NULL                                  |
|  1 | SIMPLE      | i18n    | ALL    | PRIMARY                                             | NULL                      | NULL    | NULL             | 358653 | Using join buffer (Block Nested Loop) |
|  1 | SIMPLE      | term    | eq_ref | PRIMARY,lft,term_FI_2                               | PRIMARY                   | 4       | atom.i18n.id     |      1 | Using where                           |
|  1 | SIMPLE      | slug    | eq_ref | slug_U_1                                            | slug_U_1                  | 4       | atom.i18n.id     |      1 | NULL                                  |
+----+-------------+---------+--------+-----------------------------------------------------+---------------------------+---------+------------------+--------+---------------------------------------+

History

#1 Updated by Steve Breker over 2 years ago

Perhaps another solution is to load the terms into some kind of hash table for the duration of the indexing job.

#3 Updated by Steve Breker over 2 years ago

  • Category set to Search / Browse
  • Assignee set to Dan Gillean
  • Target version set to Release 2.6.0

Merged - Ready for QA.

This change directly affects the indexing of a description's subject, place and genre terms, related terms, and material type ID. It attempts to speed up the indexing process by pre-loading all subject, place and genre terms so they do not have to be repeatedly looked up for every description being indexed. For small databases, this may not result in a perceptible speed increase. For large systems with a large number of terms (> ~100K), this will result in a large speed increase when indexing descriptions.

Verifying that the indexed description document has the same data before and after the fix:

This can be verified via searching in the WebUI to ensure searching for these terms still results in finding the description in question. Or....

I tested that the same information is being indexed before and after this change via the CLI as follows:

Before deploying this fix to your test env:
1) index the description you are using to test:
php symfony search:populate --slug="anderson-archibald-se-2-06-32-w1"

2) retrieve and save the indexed json document:
curl -i -XGET 'http://localhost:9200/atom/QubitInformationObject/_search?q=+2-06-32-W1,+Anderson,+Archibald' > ~/indexed-document-before-deployment.txt

Note that you may need to change the index name (atom) and the search terms (in this case "2-06-32-W1", "Anderson", "Archibald") for you specific test case.

Then repeat after the change is deployed to the test env:

1) index the description you are using to test:
php symfony search:populate --slug="anderson-archibald-se-2-06-32-w1"

2) retrieve and save the indexed json document:
curl -i -XGET 'http://localhost:9200/atom/QubitInformationObject/_search?q=+2-06-32-W1,+Anderson,+Archibald' > ~/indexed-document-AFTER-deployment.txt

The indexed documents should be identical. You can test this by manually searching for your test terms in each document, or using the 'diff' command.

Related terms are terms that are not directly linked to the description, but are terms that are higher in the term hierarchy for a term that is attached to the description.

#4 Updated by Steve Breker over 2 years ago

This needs to be picked to the PAS client repo.

#5 Updated by Dan Gillean over 2 years ago

  • Status changed from New to Verified
  • Assignee deleted (Dan Gillean)
  • Sponsored changed from No to Yes

#6 Updated by Dan Gillean about 2 years ago

  • Target version changed from Release 2.6.0 to Release 2.5.4

Also available in: Atom PDF