Feature #10082

Improve Elasticsearch mappings for archival descriptions

Added by Mike Gale about 4 years ago. Updated about 1 year ago.

Status: Verified
Start date: 05/02/2016
Priority: Medium
Due date:
Assignee: -
% Done: 0%
Category: Search / Browse
Estimated time: 70.00 hours
Target version: Release 2.4.0
Google Code Legacy ID:
Tested version:
Sponsored: Yes
Requires documentation:

Description

This feature will revise the ES implementation for archival descriptions to provide more targeted search results.

Currently, we include the entirety of related actors, repositories, and child records embedded within each archival description (we use _all to add these indiscriminately). This can lead to problematic search results, as outlined in issues #8128 and #5944.

This feature will involve 3 main enhancements:

First, we will analyze and select only relevant fields from related entities to add to each description in the index, resolving the problems outlined in issues #8128 and #5944.

Second, we will add weighting to archival descriptions, as outlined in issue #6681 (we will review the weightings proposed there first, given that new fields have been added to AtoM since then, such as Genre access points).

Finally, we will change the default operator in AtoM from "OR" to "AND".

one-to-twenty-test.csv - test CSV, 22 rows with random search term "puppa" added in different combinations of fields for testing (44.8 KB) Dan Gillean, 07/29/2016 05:21 PM

puppa-test-search.png - Screenshot of test search results for "puppa" (276 KB) Dan Gillean, 07/29/2016 05:35 PM

twentytwo-test-search.png - Screenshot of test search results for "twentytwo" after adding it to title, scope and content, identifier, and archival history of 4 different records (99.2 KB) Dan Gillean, 07/29/2016 05:42 PM

es_fields.tgz (4.13 KB) Mike Gale, 02/10/2017 02:04 AM

QubitInformationObject.txt (1.35 KB) Mike Gale, 02/10/2017 06:53 PM


Related issues

Related to Access to Memory (AtoM) - Feature #6681: Add weighting to archival description searches Verified 05/07/2014
Related to Access to Memory (AtoM) - Bug #11537: Notes should be searchable via the global search box Verified 09/21/2017
Related to Access to Memory (AtoM) - Task #11538: Consider returning creator histories to global archival d... Verified 09/21/2017
Related to Access to Memory (AtoM) - Bug #11943: Authority record search: default operator should be chang... Verified 02/01/2018
Related to Access to Memory (AtoM) - Feature #13096: Remove unnecessary data from Elasticsearch index Verified 06/21/2019

History

#2 Updated by Dan Gillean almost 4 years ago

  • Related to Feature #6681: Add weighting to archival description searches added

#3 Updated by Mike Gale almost 4 years ago

  • Status changed from New to Code Review
  • Assignee changed from Mike Gale to Mike Cantelon

#4 Updated by Jesús García Crespo almost 4 years ago

Should we ask Fiver to test this before it's merged?

#5 Updated by Dan Gillean almost 4 years ago

  • Assignee changed from Mike Cantelon to Jesús García Crespo

Can you CR this please Jesus? Mike is away all week.

#6 Updated by Dan Gillean almost 4 years ago

  • Sponsored changed from No to Yes

#7 Updated by Mike Gale almost 4 years ago

  • Status changed from Code Review to QA/Review
  • Assignee changed from Jesús García Crespo to Dan Gillean

#8 Updated by Mike Gale almost 4 years ago

merged qa/2.4.x
b7856e9e5e3379f06838ce6d8d926e21c69aa3bf

#9 Updated by Dan Gillean almost 4 years ago

Initial testing suggests that the updates to the indexed fields (solving #8128 and #5944) and changing the default operator from OR to AND are both working as expected. Right now, it appears to be the weighting that is giving unexpected results.

Attaching a test CSV I used for this (despite the name, it has 22 rows, not 20).

For each test record, I included the random word "puppa" in various fields.

Based on the proposed weights in #6681 and the assumption that multiple weighted field matches should lead to a combined weight (e.g. exact match in title + identifier = 10x weight title + 5x weight identifier, ergo total weight =15x), I then indicated in the title where the term was found, and what I thought the weight should be.

Test 1 - puppa

First, the new relevance sort option. This should be the default sort option for all users (public and authenticated) whenever a search in the search box is performed, otherwise the weighting isn't helpful or self-evident unless the user happens to stumble across it.

Next, I think we need to know exactly how the weighting works, because it is not returning the expected results. After the import and some initial testing, I got the impression that for a match in title for example, a short title (e.g. "Puppa") was considered more of a match than a longer one that also had the term (such as "Puppa one (10 weight)"). That's fine and perhaps to be expected... but I saw other mismatches I did not expect.

For example "Puppa fourteen - in creator and title (16 or 19 weight w name access)" appeared after both of these... so the combined weight theory doesn't seem to be holding?

Then lower down in the results, a record w a match in scope and content (5 weight) was listed higher than a match in both title and identifier (Puppa twelve, which should have had a combined weight of 15, or even just 10 based on the title, putting it higher up).

There are other mismatches too - for example, a match in identifier (eleven, 5 weight) appears AFTER several results with no weighting but with the term included - suggesting that in some cases the weighting is perhaps not working at all - or at least not how we expect.

Attaching a screenshot of my search results.

Test 2 - twentytwo

After the import, I edited a couple of different records to include the word twentytwo - it was in the title of the last imported record. I then tried searching for twentytwo after adding it to a scope and content, an identifier, and an archival history field on other records.

Expected result:

  1. Match in title (10x)
  2. Match in scope and content or identifier (5x)
  3. Match in scope and content or identifier (5x)
  4. Match in archival history (0x)

Actual results:

  1. Match in identifier (5x)
  2. Match in title (10x)
  3. Match in scope and content (5x)
  4. Match in archival history (0x)

Attaching a screenshot as well.

Mike, any thoughts you have or investigations you can make would be helpful.

#10 Updated by Mike Gale almost 4 years ago

  • Assignee changed from Mike Gale to Dan Gillean

Hi Dan, that's unfortunate to hear that none of the weighting works. Can you provide me with a SQL dump of your test data you were using? I think it'd be very helpful if I can test the exact same things you're describing when debugging, as in my tests before I pushed this feature, all the weighting works exactly as expected.

thanks.

#11 Updated by Jesús García Crespo almost 4 years ago

  • Assignee changed from Dan Gillean to Mike Gale

Mike, Dan is in a conference but he's provided the CSV file with the data causing the issues described. Could you try to import it?

#12 Updated by Mike Gale almost 4 years ago

  • Assignee changed from Mike Gale to Dan Gillean

Hi Dan,

Test 1:

I suspect the issues you were encountering were because you were adding a bunch of text in the descriptions surrounding the keywords you were searching. Please keep in mind we've changed the default operator to AND, so any extra words in the field are going to negatively affect the search result's relevance score.

For example the title fields:

"Puppa One (10 weight)"
"Puppa fourteen - in creator and title (16 or 19 weight w name access)"

The latter title will be far less relevant than the first, since it has a ton of text next to it and we're scoring based on AND as the operator. The Puppa One record will also gain a ton of relevance score because it gets the higher 10x boost, so that might push it over the edge even if Puppa fourteen has a creator with the word puppa in it as well (only 5x boost).

Note when I did my own testing with just the keywords in certain fields and no other words / noise surrounding those fields messing with the scores, the boosting worked fine.

Test 2:

I can't really say what happened here because you were editing records on the fly and I don't know exactly what changes you made. Sorry.

Possible solutions:

  • I noticed if I search "Puppa fourteen" it ONLY returns that single record. I think making the default operator AND might be a bad idea in the end? The records that don't have "Puppa" AND "fourteen" in them simply aren't showing now. It seems like all results with Puppa in them should appear, with Puppa fourteen taking the top result. Should we consider switching back to OR?
  • We could tweak the boost scores again to massage which results we want near the top, but I don't think that'll really solve this issue
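To make the AND/OR difference in the first point concrete, here is a minimal Python sketch. It is a toy model that splits on whitespace, not how Elasticsearch analyzes or matches text; the `matches` helper is hypothetical and the titles are taken from the test CSV.

```python
def matches(query, text, operator="AND"):
    """Return True if text satisfies the query under the given operator."""
    query_terms = query.lower().split()
    text_terms = set(text.lower().split())
    hits = [term in text_terms for term in query_terms]
    # AND requires every query term to be present; OR requires at least one.
    return all(hits) if operator == "AND" else any(hits)


titles = [
    "Puppa One (10 weight)",
    "Puppa fourteen - in creator and title (16 or 19 weight w name access)",
]

and_results = [t for t in titles if matches("Puppa fourteen", t, "AND")]
or_results = [t for t in titles if matches("Puppa fourteen", t, "OR")]

print(len(and_results), len(or_results))
```

Under AND only the "Puppa fourteen" record is returned; under OR both records are returned and ranking decides the order.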

I'm not sure how to proceed other than that. Elasticsearch has pretty basic boost functionality from what I can tell (just adding ^boostValue to the field name). It's the same way we do it when boosting accession and term searches, so unless we do one of my two suggestions above, I think we'll need to solicit feedback from the rest of the AtoM team on what we can do from here.
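For reference, a rough sketch of what a boosted query body could look like, written as a plain Python dict for an Elasticsearch `query_string` query. The field names and the `build_query` helper are assumptions for illustration, not AtoM's actual mapping or code; the boost values are the ones discussed in this ticket (10x title, 5x identifier, 5x scope and content).

```python
import json


def build_query(user_query, boosts, default_operator="AND"):
    """Build a query_string body with per-field boosts (field^boost)."""
    fields = ["{}^{}".format(name, boost) for name, boost in boosts.items()]
    return {
        "query": {
            "query_string": {
                "query": user_query,
                "fields": fields,
                "default_operator": default_operator,
            }
        }
    }


# Hypothetical field names; check the real mapping before reusing these.
boosts = {
    "i18n.en.title": 10,
    "identifier": 5,
    "i18n.en.scopeAndContent": 5,
}

body = build_query("puppa", boosts)
print(json.dumps(body, indent=2))
```

Switching `default_operator` to "OR" in this sketch is the only change needed to compare the two behaviours against the same boosts.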

#13 Updated by Mike Gale almost 4 years ago

Also note there are several factors that go into how ES scores documents; a decent outline can be found here: https://qbox.io/blog/optimizing-search-results-in-elasticsearch-with-scoring-and-boosting

It isn't simply a matter of "how many times does this keyword occur in the document + boost score", there are other things factoring in as well. It's kind of complicated! Heh.
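One of those factors is field length. As a rough sketch, assuming the index uses Lucene's classic TF/IDF similarity (where the field-length norm is 1 / sqrt(number of terms in the field)), the snippet below shows how extra words in a field dilute a match. The whitespace tokenizer and the helper names are simplifications for illustration; the real analyzer tokenizes differently and Lucene stores norms with reduced precision, so actual scores will not match these numbers.

```python
import math


def field_length_norm(text):
    """Classic TF/IDF length norm: shorter fields weigh more per match."""
    num_terms = len(text.split())
    return 1.0 / math.sqrt(num_terms)


def boosted_norm(text, boost):
    """Combine a query-time boost with the field-length norm."""
    return boost * field_length_norm(text)


short_title = "Puppa"
medium_title = "Puppa one (10 weight)"
long_title = "Puppa fourteen - in creator and title (16 or 19 weight w name access)"

for title in (short_title, medium_title, long_title):
    print(round(field_length_norm(title), 3), title)

# A 5x-boosted one-word identifier versus a 10x-boosted long title.
print(boosted_norm("twentytwo", 5) > boosted_norm(long_title, 10))
```

Under these assumptions a short, lower-boosted field can outweigh a long, higher-boosted one, which may be one reason a match in identifier can rank above a match in title.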

#14 Updated by Dan Gillean almost 4 years ago

Interesting... thanks for this update, Mike. Let's leave it as is for now and I will do a bit more testing next week when I'm back; maybe we can make some minor tweaks along the way. I'm not yet convinced we should go back to OR as the operator; I think most search engines tend to use AND as the default, and I've certainly heard a LOT of users asking for this change over the last year. I'll test further so I can understand the consequences better when it's embedded in other text, we'll document it all well, and it might even be a good topic for a future screencast I can point people to, as that will save a lot of time explaining the changes and showing off some of the tricks I've learned (like searching for empty fields, searching by field name, boosting the search for one word over another, etc). Anyway, not ruling anything out yet. We'll chat more soon when I'm in the office. Thanks!

#15 Updated by Dan Gillean almost 4 years ago

  • Assignee changed from Dan Gillean to Mike Gale

Okay, I've done a bunch of searching using the demo data and comparing 2.4 results to 2.3 - in 95% of cases, the results are definitely better with these changes. The tougher part was in multi-keyword searches where one keyword is misspelled or simply missing - using "AND" means that such a search would fail to turn up the relevant record. However, in general, the OR results of 2.3 and earlier usually would return so many results with just one of the keywords that it was pretty useless. I think we can cover strategies for handling this in the documentation and via the user forum - e.g. better explanations of some of the operators available for power searching, etc.

Mike, the only feedback we really need to implement at this point is making "Relevance" the default sort option for any search. The browse sort (e.g. when there is no search term input) should remain whatever the sort settings say they are (e.g. defaults are alphabetic for anonymous users and most recent for authenticated).

Thanks!

#16 Updated by Mike Gale almost 4 years ago

  • Status changed from Feedback to QA/Review

Mike C code reviewed the relevance by default sort change. Merging now.

#17 Updated by Mike Gale almost 4 years ago

  • Assignee changed from Mike Gale to Dan Gillean

merged qa/2.4.x

#18 Updated by Dan Gillean almost 4 years ago

  • Status changed from QA/Review to Verified

Awaiting client feedback but otherwise looking good.

#19 Updated by Dan Gillean almost 4 years ago

  • Status changed from Verified to Feedback
  • Assignee changed from Dan Gillean to Mike Gale

Hey Mike,

This feature is fine and done, but for documentation, are you able to give me a full list of the fields we index? I'd like this for all entities if possible, but archival descriptions are the first concern. Thanks!

#20 Updated by Mike Gale over 3 years ago

Hi Dan, I wrote a Python script to dump a comprehensive, user-friendly list of all our ES fields.

I've saved the script here in case our schema changes in the future and we need to reuse it: https://gist.github.com/MikeFE/9768069954808644a509b443b6244631. It should be Python 2/3 neutral.

I'm attaching a tarball with the results of when I ran it on 2.4.x's ES schema.

Note: I've only included the 'en' i18n fields; you can just tell the reader to replace 'en' with their culture's two-letter language code if they have AtoM set to a different language.
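A small sketch of that substitution, assuming the field names in the dump are dot-separated with the culture code as its own segment (for example `i18n.en.title`). Both that naming pattern and the `localize_field` helper are assumptions for illustration; check them against the attached field list.

```python
def localize_field(field_name, culture):
    """Swap the 'en' culture segment of a dotted field name for another code."""
    parts = field_name.split(".")
    return ".".join(culture if part == "en" else part for part in parts)


print(localize_field("i18n.en.title", "fr"))
print(localize_field("identifier", "fr"))
```

Fields without an 'en' segment are returned unchanged.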

#21 Updated by Mike Gale over 3 years ago

  • Assignee changed from Mike Gale to Dan Gillean

#22 Updated by Dan Gillean over 3 years ago

  • Assignee changed from Dan Gillean to Mike Gale

Mike!!!! This is amazing... only: I started looking at the IO output, and I'm confused. I can see that every sub-field from the linked authority record is listed in the file, but the whole point of this ticket was to stop nesting all those fields directly in the IO results, to prevent issues like #8128. Similarly, I see a ton of METS-related fields in there, but I'm pretty certain that you can't actually search all those fields for DOs from the archival description search page.

I guess what I'm looking for with IOs is a list of what can actually be searched from the global search/advanced search page, so users can use it as a guide for expert queries. This is still helpful for giving me an updated list of field names, but is there a way to find out which are actually included in the global search indexing?

#23 Updated by Mike Gale over 3 years ago

Hi Dan, I went through and removed all the nested type fields in QubitInformationObject.txt, then re-entered the few nested type i18n fields we still do search for during archival description search (see: https://github.com/artefactual/atom/blob/qa/2.4.x/plugins/arElasticSearchPlugin/config/mapping.yml#L397-L410). Said i18n fields are at the end of the .txt and separated by a blank line.

#24 Updated by Dan Gillean over 3 years ago

  • Status changed from Feedback to Document

This looks awesome, Mike - thanks sooooo much!!!

#25 Updated by Nick Wilkinson over 3 years ago

  • Status changed from Document to Deploy
  • Assignee changed from Dan Gillean to Nick Wilkinson

#26 Updated by Nick Wilkinson over 3 years ago

  • Status changed from Deploy to Verified
  • Assignee deleted (Nick Wilkinson)

#27 Updated by Dan Gillean almost 3 years ago

  • Requires documentation deleted (Yes)

#28 Updated by Dan Gillean almost 3 years ago

  • Related to Bug #11537: Notes should be searchable via the global search box added

#29 Updated by Dan Gillean almost 3 years ago

  • Related to Task #11538: Consider returning creator histories to global archival description search results added

#30 Updated by Dan Gillean over 2 years ago

  • Related to Bug #11943: Authority record search: default operator should be changed to AND to match description and terms search added

#31 Updated by David Juhasz about 1 year ago

  • Status changed from Verified to In progress
  • Assignee set to David Juhasz

#32 Updated by David Juhasz about 1 year ago

  • Related to Feature #13096: Remove unnecessary data from Elasticsearch index added

#33 Updated by David Juhasz about 1 year ago

  • Status changed from In progress to Verified
  • Assignee deleted (David Juhasz)

I've opened issue #13096 to follow up on the goal of storing only relevant data in the Elasticsearch index.
