Feature #6681

Add weighting to archival description searches

Added by Dan Gillean about 6 years ago. Updated almost 3 years ago.

Status: Verified
Start date: 05/07/2014
Priority: Medium
Due date:
Assignee: -
% Done: 0%
Category: Search / Browse
Target version: Release 2.4.0
Google Code Legacy ID:
Tested version:
Sponsored: Yes
Requires documentation:

Description

To provide better results in AtoM, we would like to add search weighting. This was done in ICA-AtoM 1.x (see search fields), and we have used that page as the basis for re-evaluating weighting for 2.x. In the end, we have kept the same weighting, but added scope and content as well as the admin/biog history fields. Fields that were previously not indexed will also be added to the search index, on another ticket.

Proposed weighting for archival descriptions

10x weight

  • Title

6x weight

  • Creator

5x weight

  • Identifier
  • Subject access point
  • Scope and content

3x weight

  • Name access point
  • Place access point
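
As an illustration of how the proposed weights could be expressed as boosted field names, here is a minimal sketch. The function name buildBoostedFields() and the Elasticsearch field paths are assumptions made for the example, not AtoM's actual mapping; only the weights come from the list above.

```php
<?php
// Turn a "field => weight" table into query-string field names such as
// "i18n.en.title^10". The field paths below are hypothetical placeholders.
function buildBoostedFields(array $weights, string $culture): array
{
    $fields = array();
    foreach ($weights as $field => $weight) {
        // Substitute the culture code where the path contains "%s".
        $name = str_replace('%s', $culture, $field);
        // A weight of 1 is the default, so no "^" suffix is needed.
        $fields[] = $weight === 1 ? $name : $name.'^'.$weight;
    }

    return $fields;
}

// Weights from the proposal; paths are assumed for illustration only.
$proposedWeights = array(
    'i18n.%s.title' => 10,
    'creators.i18n.%s.authorizedFormOfName' => 6,
    'identifier' => 5,
    'subjects.i18n.%s.name' => 5,
    'i18n.%s.scopeAndContent' => 5,
    'names.i18n.%s.authorizedFormOfName' => 3,
    'places.i18n.%s.name' => 3,
);

print_r(buildBoostedFields($proposedWeights, 'en'));
```

Keeping the weights in one table like this would make later adjustments a one-line change.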

Related issues

Related to Access to Memory (AtoM) - Bug #5741: Title should be weighted in search results for authority ... New 10/03/2013
Related to Access to Memory (AtoM) - Feature #10082: Improve Elasticsearch mappings for archival descriptions Verified 05/02/2016

History

#1 Updated by Jesús García Crespo almost 6 years ago

  • Status changed from New to In progress
  • Assignee changed from José Raddaoui Marín to Jesús García Crespo

#2 Updated by Jesús García Crespo almost 6 years ago

Currently, only a few places in AtoM do query-time boosting. They use the boost operator (^) in Query String Query rather than the boost search parameter (see http://goo.gl/Dm4rSF).

taxonomy/actions/indexAction.class.php:    $queryString->setFields(array('i18n.'.$culture.'.name^5', 'useFor.i18n.'.$culture.'.name'));
accession/actions/browseAction.class.php:  'identifier^10',
accession/actions/browseAction.class.php:  'donors.i18n.'.$culture.'.authorizedFormOfName^10',
accession/actions/browseAction.class.php:  'i18n.'.$culture.'.title^10',
accession/actions/browseAction.class.php:  'i18n.'.$culture.'.scopeAndContent^10',
accession/actions/browseAction.class.php:  'i18n.'.$culture.'.locationInformation^5',
accession/actions/browseAction.class.php:  'i18n.'.$culture.'.processingNotes^5',
accession/actions/browseAction.class.php:  'i18n.'.$culture.'.sourceOfAcquisition^5',
accession/actions/browseAction.class.php:  'i18n.'.$culture.'.archivalHistory^5',
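
For context, a rough sketch of what boosted field names like the ones above look like in the search request itself. It builds a plain array in the shape of an Elasticsearch query_string body and prints it as JSON; the helper name buildQueryStringBody() and the sample query text are assumptions for the example, and Elastica is deliberately not used so the snippet runs on its own.

```php
<?php
// Build a query_string request body with per-field boosts.
// Hypothetical helper, not part of AtoM or Elastica.
function buildQueryStringBody(string $userQuery, array $boostedFields): array
{
    return array(
        'query' => array(
            'query_string' => array(
                'query' => $userQuery,
                // Each entry may carry a "^N" suffix to boost that field.
                'fields' => $boostedFields,
            ),
        ),
    );
}

$culture = 'en';
$body = buildQueryStringBody(
    'fonds',
    array('i18n.'.$culture.'.name^5', 'useFor.i18n.'.$culture.'.name')
);

echo json_encode($body, JSON_PRETTY_PRINT), "\n";
```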

#3 Updated by Jesús García Crespo almost 6 years ago

It looks like boosting the relevancy of certain fields at index time is not available anymore in ES 1.x, so I don't think it's a good idea to use it at this point (see http://goo.gl/wXYT5S).

In /search, we use query string query. The default field is _all for authenticated users, or a list of unhidden fields for unauthenticated users that is generated in setAllFields() (in arElasticSearchPluginUtil.class.php). That code is especially slow because it parses mapping.yml and the results are not cached.

Fields can't be boosted individually when using _all. The easiest solution off the top of my head is to change setAllFields() to stop relying on _all so our custom weights can be applied. If we ever do that, we should take the chance to cache the results to avoid the extra CPU cycles that this function takes each time someone is visiting /search.
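
The caching idea could look something like the sketch below. FieldListCache and the loader closure are hypothetical names for illustration; the loader stands in for the expensive mapping.yml parsing and returns placeholder field names.

```php
<?php
// Memoize an expensive field-list computation for the current request.
// Hypothetical class, not part of AtoM.
final class FieldListCache
{
    // Counts how often the expensive loader actually ran.
    public static $parseCount = 0;

    private static $cache = array();

    public static function get(string $type, callable $loader): array
    {
        if (!array_key_exists($type, self::$cache)) {
            ++self::$parseCount;
            self::$cache[$type] = $loader($type);
        }

        return self::$cache[$type];
    }
}

// Stand-in for parsing mapping.yml; the returned fields are placeholders.
$loader = function (string $type): array {
    return array('identifier', 'i18n.en.title');
};

$first = FieldListCache::get('informationObject', $loader);
$second = FieldListCache::get('informationObject', $loader);

echo FieldListCache::$parseCount, "\n"; // prints 1: the loader ran only once
```

A static property only lives for one PHP request, so avoiding the parse on every visit to /search would need a cache shared across requests.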

This seems to be another reason to consider dropping query string query and building our own query parser. But that would mean more work than we can really afford.

#4 Updated by Jesús García Crespo almost 6 years ago

  • Status changed from In progress to Feedback
  • Assignee changed from Jesús García Crespo to Mike Gale

Thoughts? I was thinking of subclassing Elastica\Query\QueryString to create a custom QueryString class that is aware of the weighting requirements of our AtoM types, instead of continuing to add logic to setAllFields().

#5 Updated by Mike Gale almost 6 years ago

Hey Jesús, I think that's a pretty good strategy, to subclass it. That way we can have the field boosting baked into it transparently, and setAllFields() can focus on setting all the fields :). I agree that adding boosting right into that function is going to make it bloated.
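
A very rough sketch of the subclassing approach discussed here. To keep it runnable without Elastica, QueryStringStandIn is a simplified stand-in for Elastica\Query\QueryString, and WeightedQueryString is a hypothetical name; neither reflects the code that was eventually written.

```php
<?php
// Simplified stand-in for Elastica\Query\QueryString (assumption: only the
// constructor and setFields() matter for this sketch).
class QueryStringStandIn
{
    protected $params = array();

    public function __construct(string $queryString = '')
    {
        $this->params['query'] = $queryString;
    }

    public function setFields(array $fields)
    {
        $this->params['fields'] = $fields;

        return $this;
    }

    public function getParams(): array
    {
        return $this->params;
    }
}

// Hypothetical subclass that applies boosts transparently.
class WeightedQueryString extends QueryStringStandIn
{
    private $weights;

    public function __construct(string $queryString, array $weights)
    {
        parent::__construct($queryString);
        $this->weights = $weights;
    }

    // Append "^weight" to every field that has a configured weight.
    public function setFields(array $fields)
    {
        $boosted = array();
        foreach ($fields as $field) {
            $boosted[] = isset($this->weights[$field])
                ? $field.'^'.$this->weights[$field]
                : $field;
        }

        return parent::setFields($boosted);
    }
}

$query = new WeightedQueryString('fonds', array('identifier' => 5));
$query->setFields(array('identifier', 'i18n.en.extentAndMedium'));
print_r($query->getParams());
```

With this split, the caller keeps passing a plain list of fields and the weighting stays in one place.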

#6 Updated by Mike Gale almost 6 years ago

  • Assignee changed from Mike Gale to Jesús García Crespo

#7 Updated by Jesús García Crespo almost 6 years ago

  • Target version changed from Release 2.1.0 to Release 2.2.0

#8 Updated by Dan Gillean over 5 years ago

  • Related to Bug #5741: Title should be weighted in search results for authority records and institutions added

#9 Updated by Dan Gillean over 5 years ago

  • Status changed from Feedback to New
  • Target version deleted (Release 2.2.0)

#10 Updated by Dan Gillean about 5 years ago

  • Project changed from Access to Memory (AtoM) to AtoM Wishlist
  • Category deleted (Search / Browse)

Moved to AtoM wishlist until sponsored for inclusion.

#11 Updated by Jesús García Crespo about 5 years ago

  • Assignee deleted (Jesús García Crespo)

#13 Updated by Dan Gillean almost 4 years ago

  • Project changed from AtoM Wishlist to Access to Memory (AtoM)
  • Description updated (diff)
  • Category set to Search / Browse
  • Status changed from New to In progress
  • Assignee set to Mike Gale
  • Target version set to Release 2.4.0
  • Sponsored changed from No to Yes
  • Requires documentation set to Yes

This has been sponsored for inclusion in AtoM 2.4 as part of the work being undertaken on #10082

Note: weights in Elasticsearch are cumulative, so a result that includes a match in both identifier and scope and content would end up with a 10x boost (2 times 5x). Because we automatically add creator names as an access point, the original suggested weightings were boosting a creator name match above a title match (8x for creator + 3x for name access point = 11x vs. 10x for title). Accordingly, for now we've simply reduced the weighting for creator names to 6x.
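
The arithmetic in the note can be checked with a small sketch. It uses the simplified additive model described above (actual Elasticsearch scores also depend on term statistics), and the function name combinedBoost() and the short field keys are assumptions for the example.

```php
<?php
// Sum the configured weights of all fields that matched.
// Simplified additive model from the note, not real Elasticsearch scoring.
function combinedBoost(array $weights, array $matchedFields): int
{
    $total = 0;
    foreach ($matchedFields as $field) {
        $total += $weights[$field] ?? 0;
    }

    return $total;
}

$original = array(
    'title' => 10,
    'creator' => 8,
    'identifier' => 5,
    'scopeAndContent' => 5,
    'nameAccessPoint' => 3,
);
// Revised proposal: creator reduced from 8x to 6x.
$revised = array_merge($original, array('creator' => 6));

$creatorMatch = array('creator', 'nameAccessPoint');

echo combinedBoost($original, $creatorMatch), "\n"; // 11, above title at 10
echo combinedBoost($revised, $creatorMatch), "\n";  // 9, below title at 10
echo combinedBoost($revised, array('identifier', 'scopeAndContent')), "\n"; // 10
```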

#14 Updated by Dan Gillean almost 4 years ago

  • Related to Feature #10082: Improve Elasticsearch mappings for archival descriptions added

#15 Updated by Dan Gillean almost 4 years ago

  • Status changed from In progress to Verified

#16 Updated by Dan Gillean almost 3 years ago

  • Assignee deleted (Mike Gale)
  • Requires documentation deleted (Yes)
