Add weighting to archival description searches
|Category:||Search / Browse|
|Target version:||Release 2.4.0|
|Google Code Legacy ID:||Tested version:|
To provide better results in AtoM, we would like to add search weighting. This was done in ICA-AtoM 1.x (see search fields), and we have used that page as the basis for re-evaluating weighting for 2.x. In the end, we have kept the same weighting, but added scope and content as well as the admin/biog history fields. New fields that were previously not indexed will also be added to the search index, on another ticket.
Proposed weighting for archival descriptions
- Subject access point
- Scope and content
- Name access point
- Place access point
#2 Updated by Jesús García Crespo almost 8 years ago
Currently, only a few places in AtoM do query-time boosting. They don't use the boost search parameter but the boost operator (^) in Query String Query (see http://goo.gl/Dm4rSF).
taxonomy/actions/indexAction.class.php: $queryString->setFields(array('i18n.'.$culture.'.name^5', 'useFor.i18n.'.$culture.'.name'));
accession/actions/browseAction.class.php: 'identifier^10',
accession/actions/browseAction.class.php: 'donors.i18n.'.$culture.'.authorizedFormOfName^10',
accession/actions/browseAction.class.php: 'i18n.'.$culture.'.title^10',
accession/actions/browseAction.class.php: 'i18n.'.$culture.'.scopeAndContent^10',
accession/actions/browseAction.class.php: 'i18n.'.$culture.'.locationInformation^5',
accession/actions/browseAction.class.php: 'i18n.'.$culture.'.processingNotes^5',
accession/actions/browseAction.class.php: 'i18n.'.$culture.'.sourceOfAcquisition^5',
accession/actions/browseAction.class.php: 'i18n.'.$culture.'.archivalHistory^5',
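For readers unfamiliar with the `^` boost operator, here is a minimal sketch of the request body that a boosted query string query amounts to. It is written in Python only for illustration (AtoM itself builds this through Elastica in PHP); the helper names `boosted_fields` and `query_string_query` are hypothetical, and the weights mirror the taxonomy example above.

```python
import json


def boosted_fields(culture, weights):
    """Turn {field template: boost} into query_string field names.

    A boost of 1 is the default, so the "^1" suffix is omitted.
    """
    fields = []
    for template, boost in weights.items():
        field = template.replace("%s", culture)
        fields.append(f"{field}^{boost}" if boost != 1 else field)
    return fields


def query_string_query(text, culture, weights):
    """Build the body of a query_string search with per-field boosts."""
    return {
        "query": {
            "query_string": {
                "query": text,
                "fields": boosted_fields(culture, weights),
            }
        }
    }


# Hypothetical weight table matching the taxonomy browse example.
taxonomy_weights = {"i18n.%s.name": 5, "useFor.i18n.%s.name": 1}

if __name__ == "__main__":
    print(json.dumps(query_string_query("fonds", "en", taxonomy_weights), indent=2))
```

Because the boost lives in the field name, it only applies when the query targets explicit fields.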
#3 Updated by Jesús García Crespo almost 8 years ago
It looks like boosting the relevancy of certain fields at index time is not available anymore in ES 1.x, so I don't think it's a good idea to use it at this point (see http://goo.gl/wXYT5S).
In /search, we use query string query. The default field is _all for authenticated users, or, for unauthenticated users, a list of unhidden fields that is generated in setAllFields() (in arElasticSearchPluginUtil.class.php). Its code is especially slow because it parses mapping.yml and the results are not cached.
Fields can't be boosted individually when using _all. The easiest solution off the top of my head is to change setAllFields() to stop relying on _all so our custom weights can be applied. If we ever do that, we should take the chance to cache the results to avoid the extra CPU cycles that this function takes each time someone is visiting
This seems to be another reason to consider dropping query string query and building our own query parser. But that would mean more work than we can really afford.
#4 Updated by Jesús García Crespo almost 8 years ago
- Status changed from In progress to Feedback
- Assignee changed from Jesús García Crespo to Mike Gale
Thoughts? I was thinking of subclassing Elastica\Query\QueryString to create a custom QueryString class that is aware of the weighting requirements of our AtoM types, instead of continuing to add logic to setAllFields().
#5 Updated by Mike Gale almost 8 years ago
Hey Jesús, I think that's a pretty good strategy, to subclass it. That way we can have the field boosting baked into it transparently, and setAllFields() can focus on setting all the fields :). I agree that adding boosting right into that function is going to make it bloated.
#13 Updated by Dan Gillean almost 6 years ago
- Project changed from AtoM Wishlist to Access to Memory (AtoM)
- Description updated (diff)
- Category set to Search / Browse
- Status changed from New to In progress
- Assignee set to Mike Gale
- Target version set to Release 2.4.0
- Sponsored changed from No to Yes
- Requires documentation set to Yes
This has been sponsored for inclusion in AtoM 2.4 as part of the work being undertaken on #10082.
Note: weights in Elasticsearch are cumulative, so a result that includes a match in both identifier and scope and content would end up with a 10x boost (2 times 5x). Because we automatically add creator names as an access point, the original suggested weightings were boosting a creator name match above a title match (8x for creator + 3x for name access point = 11x vs. 10x for title). Accordingly, for now we've simply reduced the weighting for creator names to 6x.
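The arithmetic in this note can be checked with a short sketch. It follows the note's simplified additive model (real Elasticsearch scoring also depends on per-field relevance, so these are relative weights, not actual scores). The dictionary keys are illustrative labels, not AtoM field names.

```python
def combined_boost(matched_fields, weights):
    """Sum the boosts of every field that matched, per the note's model."""
    return sum(weights[field] for field in matched_fields)


# Weights mentioned in the note above.
original = {
    "title": 10,
    "creator": 8,
    "name_access_point": 3,
    "identifier": 5,
    "scope_and_content": 5,
}
revised = dict(original, creator=6)  # creator reduced from 8x to 6x

# A creator name is also indexed as a name access point, so both match.
creator_match = ["creator", "name_access_point"]

if __name__ == "__main__":
    print(combined_boost(creator_match, original))  # 11, above title's 10
    print(combined_boost(creator_match, revised))   # 9, below title's 10
    print(combined_boost(["identifier", "scope_and_content"], revised))  # 10
```

With the creator weight at 6x, a creator match totals 9x and no longer outranks a title match at 10x.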
#16 Updated by Dan Gillean almost 5 years ago
- Assignee deleted (
- Requires documentation deleted (
Updates added to 2.4 documentation in: