Bug #9290

Accession numbers and acquisition dates are not easily searched

Added by Dan Gillean over 3 years ago. Updated 2 months ago.

Status: Verified
Start date: 01/12/2016
Priority: Medium
Due date:
Assignee: -
% Done: 0%
Category: Accessions
Target version: Release 2.5.0
Google Code Legacy ID:
Tested version: 2.5
Sponsored: No
Requires documentation:

Description

To reproduce:

Make sure you have a number of accessions in your database - ideally with different starting months and years. Examples in my test instance:
  • 2015-05-04/1
  • 2015-05-05/2 (this one has an Acquisition date of 2014-11-28)
  • 2015-05-06/3
  • 2016-05-06/4
  • Example Accession Number
  • 2015-10-15/8

etc

Conduct a search for:
  • 2015 = 0 results
  • 2015-05-06 = 0 results
  • 2015-05-06/3 = 500 error (caused by the slash)
  • 2015-05-06//3 = 0 results (attempt to escape the slash)
  • "2015-05-06" = 0 results
  • 2015-05-06* = 3 results (as expected)
  • 2015* = all results (ok)
  • 2014* = 0 results (expecting hit on Acquisition date match)
  • *2014 = 0 results
  • 2014 = 0 results
  • "2014-11-28" = 0 results
  • identifier:2015-05-06 = 0 results (attempt to target search to accession number ES field)
  • identifier:"2015-05-06" = 0 results
  • identifier:2015-05-06* = 3 results (as expected)
  • date:2014* = 0 results (attempt to target acquisition date ES field)
  • date:*2014* = 0 results
  • date:2014-11-28 = 1 result (correct one)
  • date:"2014-11-28" = 1 result (correct)
  • Accession = 0 hits
  • Example accession number = 2 results (the target record, and a different accession with "Example" in the title)
  • "Example Accession Number" = 1 result (correct)

Issues noted:

  • Only a few, fairly obscure query forms return the expected results when searching for an accession number or acquisition date. We should broaden this if at all possible
  • Part of the issue is that the slash used in the accession number causes a 500 error, but if escaped, the search returns nothing
  • For whatever reason, matches on the acquisition date do not seem to return results unless the specific ES field is targeted
  • I think that ES interprets dashes as stop-word characters and removes them, therefore it may be understanding a search for 2015-05-06 as a search for 2015 OR 05 OR 06... but partial matches (e.g. searching for 2015) do not seem to return results
  • This suggests that ES is tokenizing the search query parameters... but not tokenizing what it is comparing them to - so 2015 OR 05 OR 06 does not match any of the 2015-05-06/n accessions
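
The mismatch described in the last two points can be illustrated with a toy model. This is a minimal Python sketch, not AtoM or Elasticsearch code: `tokenize`, `matches_unanalyzed`, and `matches_prefix` are hypothetical helpers, and the regular-expression split is only a rough stand-in for a real analyzer. It assumes the accession number is stored as one untokenized term while the query is split into tokens that are combined with OR.

```python
import re

def tokenize(text):
    """Rough stand-in for an analyzer: lowercase, split on non-alphanumerics."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]

def matches_unanalyzed(query, stored_value):
    """OR semantics: a hit needs one query token to equal the WHOLE stored term."""
    return any(token == stored_value.lower() for token in tokenize(query))

def matches_prefix(prefix, stored_value):
    """A trailing-wildcard search compares the raw prefix to the stored term."""
    return stored_value.lower().startswith(prefix.lower())

# Sample accession numbers taken from the list in the description.
accessions = ["2015-05-04/1", "2015-05-05/2", "2015-05-06/3", "2016-05-06/4"]

# The query becomes ["2015", "05", "06"]; none equals a full stored term.
print([a for a in accessions if matches_unanalyzed("2015-05-06", a)])

# The prefix form is compared against the untouched stored term instead.
print([a for a in accessions if matches_prefix("2015-05-06", a)])
```

Under these assumptions the plain query finds nothing while the prefix form finds `2015-05-06/3`, which is consistent with the pattern of results reported above.
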
For now, Sara and I will add some notes to the documentation, but I would like to hear from a developer about ways we could improve this. Possibly:
  • don't tokenize queries for accession search?
  • Change default operator to AND instead of OR?
  • Investigate why "2014-11-28" = 0 results but date:"2014-11-28" = results and discuss options for improving
  • Any other suggestions to make it easier for users to search on accession number

Accession search is a major piece of functionality and we should really try to resolve this or improve this for the next release. Tentatively marking as 2.3 so we don't lose track of it.

Reproduced in both 2.2 and 2.3 branches.

Possible related tickets and user forum threads:

History

#1 Updated by Dan Gillean over 3 years ago

  • Assignee changed from Jesús García Crespo to José Raddaoui Marín
  • Target version set to Release 2.3.0

Tentatively tagging as 2.3, so we don't lose track of this. Not currently sponsored for fixing though, so no guarantees any improvements will make it into the next release.

#2 Updated by Dan Gillean about 3 years ago

  • Target version deleted (Release 2.3.0)

#4 Updated by Corinne Rogers 8 months ago

  • Tested version 2.4 added

I am finding the same problems in 2.4.1, getting odd returns by searching the accession number, full or partial. E.g. searching for 2018-09-24/10 in my local test instance returns two hits - one is the accession with that number, the other is a totally unrelated accession. I have added 5 new accession records, each with a number that begins 2018-09- but searching 2018 returns only one of them, while searching 2018* returns them all.

#5 Updated by Nick Wilkinson 7 months ago

  • Target version set to Release 2.4.1

#6 Updated by Nick Wilkinson 7 months ago

  • Target version changed from Release 2.4.1 to Release 2.5.0

#7 Updated by José Raddaoui Marín 7 months ago

Some notes:

- The identifier is a non-analyzed field, which means that it's not divided into tokens.
- The date field is not included in the query directly, as it's a date field and the query is a string query. However, it can be targeted by the field name, with some limitations compared to a normal string field.
- The escaping char is "\".
- The query is divided into tokens and it uses the OR operator and the default analyzer, as mentioned in the description.
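
As an illustration of the escaping note above, here is a minimal Python sketch of a helper that backslash-escapes the characters reserved by the Elasticsearch query string syntax. The function name `escape_query` is hypothetical and this is not how AtoM handles input; it only shows the idea. Note that it also escapes `*` and `?`, so it is not suitable when wildcards should keep working.

```python
import re

# Characters and operators reserved by the query string syntax.
# "<" and ">" cannot be escaped, so this sketch removes them instead.
_RESERVED = re.compile(r'([+\-=!(){}\[\]^"~*?:\\/]|&&|\|\|)')

def escape_query(text):
    """Prefix every reserved character or operator with a backslash."""
    text = text.replace("<", "").replace(">", "")
    return _RESERVED.sub(r"\\\1", text)

print(escape_query("2015-05-06/3"))  # prints 2015\-05\-06\/3
```

Escaping only avoids the syntax error caused by the slash. Whether the escaped query then matches anything still depends on how the target field is indexed.
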

To improve this I would:

- Make the identifier an analyzed field.
- Use AND instead of OR as operator.
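
To show what these two changes would mean in practice, here is a small Python sketch. It is a toy model under stated assumptions, not the AtoM implementation: `tokens` is a crude stand-in for an analyzer, `matches` ignores token order and position, and `build_query` only shows the general shape of a `query_string` request body using the real `default_operator` option with a hypothetical field list.

```python
import re

def tokens(text):
    """Crude analyzer stand-in: lowercase and split on non-alphanumerics."""
    return {t for t in re.split(r"[^0-9a-z]+", text.lower()) if t}

def matches(query, value, operator="AND"):
    """Compare query tokens with the tokens of an analyzed field value."""
    q, v = tokens(query), tokens(value)
    if not q:
        return False
    if operator == "AND":
        return q <= v        # every query token must be present
    return bool(q & v)       # OR: a single shared token is enough

def build_query(user_input, fields=("identifier",)):
    """Request body sketch; reserved characters should be escaped beforehand."""
    return {
        "query": {
            "query_string": {
                "query": user_input,
                "fields": list(fields),   # hypothetical field list
                "default_operator": "AND",
            }
        }
    }

# With OR, an unrelated number sharing "2018" and "09" is also a hit.
print(matches("2018-09-24/10", "2018-09-03/7", operator="OR"))
# With AND, only values containing all four tokens match.
print(matches("2018-09-24/10", "2018-09-03/7"))
# Partial searches such as a bare year now match the analyzed identifier.
print(matches("2018", "2018-09-03/7"))
```
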

Including the date field in the string search by default will cause more issues. For example, any wildcard search over a date field will cause the following error in ES 5+, so I'd leave that as it is.

Can only use prefix queries on keyword and text fields - not on [date] which is of type [date] [index: atom] [reason: all shards failed]

Please, let me know if you think that will help. It will require different changes on stable/2.4.x (ES 1.x) and qa/2.5.x (ES 5.x), but that should not be a problem.

#8 Updated by Dan Gillean 7 months ago

Hi Radda!

These changes sound good - but let's make them only for 2.5. We're trying to wrap up 2.4.1 now and I don't want to mess with the ES index just before the release.

I agree that we should leave the acquisitionDate field alone for now. Thanks!

#9 Updated by Dan Gillean 7 months ago

  • Requires documentation set to Yes

Adding YES to "requires documentation" so we can review the Accessions search docs - make sure we acknowledge that the operator will be AND going forward, and add any useful notes about acquisition date searching.

#10 Updated by José Raddaoui Marín 7 months ago

Just realized that the identifier field must have the same declaration in all index types, so the same change will be required in all identifiers (IOs and repos), which will improve the search option in all cases. However, these fields are also used to sort, which requires a non-analyzed (keyword) field. Therefore this will require a sub-field, like we do for other string fields used to sort, like the IO title.
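
The sub-field approach can be sketched as an ES 5.x style mapping, written here as a Python dict for illustration. This is an assumption about the general shape, not the mapping that was merged: the sub-field name `untouched` and the helper `sort_clause` are hypothetical.

```python
# Analyzed "text" field for searching, with a "keyword" sub-field for sorting.
identifier_mapping = {
    "identifier": {
        "type": "text",
        "fields": {
            # Hypothetical sub-field name; holds the untokenized value.
            "untouched": {"type": "keyword"},
        },
    }
}

def sort_clause(field="identifier", order="asc"):
    """Sort on the keyword sub-field rather than on the analyzed field."""
    return [{f"{field}.untouched": {"order": order}}]

print(sort_clause())
```

Searches would target `identifier`, while sorting would use `identifier.untouched`, so a single stored value serves both purposes.
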

#11 Updated by José Raddaoui Marín 7 months ago

  • Status changed from New to Code Review
  • Assignee changed from José Raddaoui Marín to Nick Wilkinson

#12 Updated by José Raddaoui Marín 7 months ago

The following fields have been affected by this change:

Accession -> identifier
Information object -> identifier
Information object -> alternative identifiers -> identifier
Repository -> identifier
Actor -> description identifier
Function -> description identifier

We should verify that searching works as expected and that the ones used to sort in some pages still work.

#13 Updated by Nick Wilkinson 7 months ago

  • Assignee changed from Nick Wilkinson to Mike Cantelon
  • Priority changed from High to Medium

Hi Mike, passing to you for CR.

#14 Updated by Mike Cantelon 7 months ago

  • Status changed from Code Review to Feedback
  • Assignee changed from Mike Cantelon to José Raddaoui Marín

Looks good!

#15 Updated by José Raddaoui Marín 7 months ago

  • Status changed from Feedback to QA/Review
  • Assignee changed from José Raddaoui Marín to Nick Wilkinson

Merged in qa/2.5.x. The search index needs to be re-built.

#16 Updated by Nick Wilkinson 7 months ago

  • Assignee changed from Nick Wilkinson to Corinne Rogers

Hi Corinne, passing this to you for QA.

#17 Updated by Corinne Rogers 7 months ago

  • Status changed from QA/Review to Feedback
  • Assignee changed from Corinne Rogers to Nick Wilkinson
  • Tested version 2.5 added

Testing 2.5; I have the following accession numbers/acquisition dates:
#   Accession number   Acquisition date
1.  2018-10-17/6       2013-10-17
2.  ACC2009-007        209-01-16 (note wrong year)
3.  2012-01-01/6       1984-12-01
4.  2013-12-20/5       2013-12-20
5.  2013-11-14/4       1982-11-14

Searched for:
  • 1984 = returns 3.
  • 1982 = nothing returned
  • 2018 = returns 1.
  • 2013 = returns 4. and 5., but not 1., which has 2013 in the acquisition date
  • /6 = returns 1. and 3., but also 4. (which has no /6 anywhere)
  • -11- = returns 1., 2., 3. and 4., none of which has "-11-" in either accession number or acquisition date, and does not return 5., which has "-11-" in the acquisition date

So it seems that there are still some unexpected returns.

#18 Updated by Dan Gillean 7 months ago

Hi Corinne,

One note on the acquisition date searches - see Radda's note 7, above. Essentially, what he's saying is that acquisition dates are set as a date type in AtoM - we're not able to tokenize these (break them into smaller components and index those) so partial searches will not return results. For example, I would not expect 1982 to return results, since it only appears in the acquisition date. Users will need to search the full acquisition date to get results, likely in quotes. We can perhaps add a note about this in the documentation?

Because of this, I'm surprised that 1984 DID return results! Might want to check the related accession and see if that appears in another field?
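
Since partial matches on a date field are not possible, one possible workaround for "everything acquired in a given year or month" is the range form of the query string syntax. The Python sketch below only builds such a query string. The helper `date_range_query` is hypothetical, the `date` field name is taken from the searches in the description, and this has not been tested against AtoM.

```python
import calendar

def date_range_query(year, month=None, field="date"):
    """Build a query string range covering a whole year or a single month."""
    if month is None:
        start, end = f"{year:04d}-01-01", f"{year:04d}-12-31"
    else:
        # monthrange returns (weekday of first day, number of days in month).
        last_day = calendar.monthrange(year, month)[1]
        start = f"{year:04d}-{month:02d}-01"
        end = f"{year:04d}-{month:02d}-{last_day:02d}"
    return f"{field}:[{start} TO {end}]"

print(date_range_query(2014))      # date:[2014-01-01 TO 2014-12-31]
print(date_range_query(2014, 11))  # date:[2014-11-01 TO 2014-11-30]
```
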

#19 Updated by Nick Wilkinson 7 months ago

  • Assignee changed from Nick Wilkinson to José Raddaoui Marín

Passing to Radda for any comments.

#20 Updated by Dan Gillean 6 months ago

  • Assignee changed from José Raddaoui Marín to Dan Gillean

Assigning to myself to investigate further

#21 Updated by Dan Gillean 2 months ago

  • Status changed from Feedback to Verified
  • Assignee deleted (Dan Gillean)
  • Requires documentation deleted (Yes)

Calling this good enough for now, so this can be included in the 2.5 release. This definitely improved identifier and accession number searching in AtoM. If we want to improve searching further, I propose we open a new issue in the future.
