Accession numbers and acquisition dates are not easily searched
|Target version:||Release 2.5.0|
|Google Code Legacy ID:||Tested version:||2.5|
To reproduce:
Make sure you have a number of accessions in your database - ideally with different starting months and years. Examples in my test instance:
- 2015-05-05/2 (this one has an Acquisition date of 2014-11-28)
- Example Accession Number
etc.
Conduct a search for:
- 2015 = 0 results
- 2015-05-06 = 0 results
- 2015-05-06/3 = 500 error (caused by the slash)
- 2015-05-06//3 = 0 results (attempt to escape the slash)
- "2015-05-06" = 0 results
- 2015-05-06* = 3 results (as expected)
- 2015* = all results (ok)
- 2014* = 0 results (expecting hit on Acquisition date match)
- *2014 = 0 results
- 2014 = 0 results
- "2014-11-28" = 0 results
- identifier:2015-05-06 = 0 results (attempt to target search to accession number ES field)
- identifier:"2015-05-06" = 0 results
- identifier:2015-05-06* = 3 results (as expected)
- date:2014* = 0 results (attempt to target acquisition date ES field)
- date:*2014* = 0 results
- date:2014-11-28 = 1 result (correct one)
- date:"2014-11-28" = 1 result (correct)
- Accession = 0 hits
- Example accession number = 2 results (the target record, and a different accession with "Example" in the title)
- "Example Accession Number" = 1 result (correct)
- Only a few, fairly obscure, search patterns for an accession number or acquisition date return the expected results. We should broaden this if at all possible
- Part of the issue is that the slash used in the accession number causes a 500 error, and when the slash is escaped the search returns nothing
- For whatever reason, matches on the acquisition date do not seem to return results unless the specific ES field is targeted
- I think that ES interprets dashes as stop-word characters and removes them, therefore it may be understanding a search for 2015-05-06 as a search for 2015 OR 05 OR 06... but partial matches (e.g. searching for 2015) do not seem to return results
- This suggests that ES is tokenizing the search query parameters... but not tokenizing what it is comparing them to - so 2015 OR 05 OR 06 does not match any of the 2015-05-06/n accessions
- don't tokenize queries for accession search?
- Change default operator to AND instead of OR?
- Investigate why "2014-11-28" = 0 results but date:"2014-11-28" = results and discuss options for improving
- Any other suggestions to make it easier for users to search on accession number
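The suspected mismatch described above (query split into tokens, identifier stored whole) can be sketched in a few lines of Python. This is only a toy model: `analyze` is a crude stand-in for the default analyzer, not the real Elasticsearch implementation, and all function names and sample accession numbers are invented for illustration.

```python
import re


def analyze(text):
    """Crude stand-in for a default analyzer: lowercase, split on non-alphanumerics."""
    return re.findall(r"[0-9a-z]+", text.lower())


def search_not_analyzed(query, identifiers):
    """OR search where each identifier is indexed as one whole, unsplit term."""
    tokens = set(analyze(query))
    return [i for i in identifiers if i.lower() in tokens]


def prefix_search(prefix, identifiers):
    """Rough equivalent of a trailing-wildcard query such as 2015-05-06*."""
    return [i for i in identifiers if i.lower().startswith(prefix.lower())]


# Hypothetical sample data, modelled on the accession numbers in this ticket.
accessions = ["2015-05-06/1", "2015-05-06/2", "2015-05-06/3"]

print(analyze("2015-05-06"))                          # ['2015', '05', '06']
print(search_not_analyzed("2015-05-06", accessions))  # []
print(prefix_search("2015-05-06", accessions))        # all three accessions
```

Under these assumptions, none of the query tokens equals a whole stored identifier, so a plain search finds nothing, while the wildcard search compares against the unsplit value and matches.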
Accession search is a major piece of functionality and we should really try to resolve this or improve this for the next release. Tentatively marking as 2.3 so we don't lose track of it.
Reproduced in both 2.2 and 2.3 branches.
Possible related tickets and user forum threads:
#1 Updated by Dan Gillean over 3 years ago
- Assignee changed from Jesús García Crespo to José Raddaoui Marín
- Target version set to Release 2.3.0
Tentatively tagging as 2.3, so we don't lose track of this. Not currently sponsored for fixing though, so no guarantees any improvements will make it into the next release.
#4 Updated by Corinne Rogers 11 months ago
- Tested version 2.4 added
I am finding the same problems in 2.4.1, getting odd returns by searching the accession number, full or partial. E.g. searching for 2018-09-24/10 in my local test instance returns two hits - one is the accession with that number, the other is a totally unrelated accession. I have added 5 new accession records, each with a number that begins 2018-09- but searching 2018 returns only one of them, while searching 2018* returns them all.
#7 Updated by José Raddaoui Marín 10 months ago
- The identifier is a non analyzed field, which means that it's not divided in tokens.
- The date field is not included in the query directly as it's a date field and the query a string query. However, it can be targeted by the field name, with some limitations compared to a normal string field.
- The escaping char is "\".
- The query is divided in tokens and it uses the OR operator and the default analyzer, as mentioned in the description.
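As a rough illustration of the escaping point above, the following sketch prefixes the characters that the Elasticsearch query string syntax treats as reserved with a backslash. The helper name `escape_query_string` is hypothetical and not part of AtoM. Note that it also escapes `*` and `?`, so it should not be applied to searches that rely on wildcards.

```python
# Characters reserved by the Elasticsearch query_string syntax.
# "<" and ">" cannot be escaped at all, so this sketch drops them instead.
RESERVED = set('+-=&|!(){}[]^"~*?:\\/')
UNESCAPABLE = set("<>")


def escape_query_string(term):
    """Return term with reserved characters backslash-escaped (hypothetical helper)."""
    escaped = []
    for ch in term:
        if ch in UNESCAPABLE:
            continue  # remove rather than escape
        if ch in RESERVED:
            escaped.append("\\" + ch)
        else:
            escaped.append(ch)
    return "".join(escaped)


print(escape_query_string("2015-05-06/3"))  # 2015\-05\-06\/3
```

If the escaped string is later embedded in a JSON request body, the backslashes need to be escaped again for JSON; a serializer such as `json.dumps` takes care of that.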
To improve this I would:
- Make the identifier an analyzed field.
- Use AND instead of OR as operator.
Including the date field in the string search by default will cause more issues. For example, any wildcard search over a date field will cause the following error in ES 5+, so I'd leave that as it is.
Can only use prefix queries on keyword and text fields - not on [date] which is of type [date] [index: atom] [reason: all shards failed]
Please, let me know if you think that will help. It will require different changes on stable/2.4.x (ES 1.x) and qa/2.5.x (ES 5.x), but that should not be a problem.
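To make the effect of the two proposed changes (analyzed identifier, AND operator) easier to picture, here is a toy Python model. It is a sketch under simplifying assumptions: the regex tokenizer only approximates an analyzer, and the function names and sample accession numbers are invented for the example.

```python
import re


def analyze(text):
    """Simplified tokenizer: lowercase, split on anything that is not a letter or digit."""
    return re.findall(r"[0-9a-z]+", text.lower())


def search_analyzed(query, identifiers, operator="OR"):
    """Match query tokens against identifiers that are themselves tokenized."""
    wanted = set(analyze(query))
    if not wanted:
        return []
    hits = []
    for identifier in identifiers:
        indexed = set(analyze(identifier))
        if operator == "AND":
            matched = wanted <= indexed        # every query token must be present
        else:
            matched = bool(wanted & indexed)   # any shared token is enough
        if matched:
            hits.append(identifier)
    return hits


# Hypothetical sample data.
accessions = ["2018-09-24/10", "2018-09-25/11", "2017-09-24/3"]

print(search_analyzed("2018-09-24/10", accessions, "OR"))   # all three accessions
print(search_analyzed("2018-09-24/10", accessions, "AND"))  # ['2018-09-24/10']
print(search_analyzed("2018", accessions, "AND"))           # the two 2018 accessions
```

In this model an analyzed identifier lets a partial search such as 2018 match, and AND keeps a full accession number from matching every record that merely shares a year or month.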
#10 Updated by José Raddaoui Marín 10 months ago
Just realized that the identifier field must have the same declaration in all index types, so the same change will be required in all identifiers (IOs and repos), which will improve the search options in all cases. However, these fields are also used to sort, which requires a not analyzed (keyword) field. Therefore this will require a sub-field, like we do for other string fields used to sort, like the IO title.
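A minimal sketch of what such a mapping could look like in ES 5.x syntax, written as Python dictionaries. The sub-field name `raw` and both helper functions are assumptions made for illustration; the actual AtoM mapping and naming may differ.

```python
def identifier_mapping(sort_subfield="raw"):
    """Analyzed field for searching plus a keyword sub-field for sorting (ES 5.x syntax).

    On ES 1.x the rough equivalent would use "type": "string" for the main field
    and a "string" sub-field with "index": "not_analyzed".
    """
    return {
        "type": "text",
        "fields": {
            sort_subfield: {"type": "keyword"},
        },
    }


def sort_clause(field="identifier", sort_subfield="raw", order="asc"):
    """Build a sort clause that targets the keyword sub-field instead of the analyzed field."""
    return [{f"{field}.{sort_subfield}": {"order": order}}]


print(identifier_mapping())
print(sort_clause())
```

Searches would then run against `identifier`, while pages that sort by identifier would point at `identifier.raw` (or whatever sub-field name is actually chosen).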
#12 Updated by José Raddaoui Marín 10 months ago
The following fields have been affected by this change:
Accession -> identifier
Information object -> identifier
Information object -> alternative identifiers -> identifier
Repository -> identifier
Actor -> description identifier
Function -> description identifier
We should verify that searching works as expected and that the ones used to sort in some pages still work.
#17 Updated by Corinne Rogers 10 months ago
- Status changed from QA/Review to Feedback
- Assignee changed from Corinne Rogers to Nick Wilkinson
- Tested version 2.5 added
Testing 2.5; I have the following accession numbers/acquisition dates:
1. 2018-10-17/6 2013-10-17
2. ACC2009-007 209-01-16 (note wrong year)
3. 2012-01-01/6 1984-12-01
4. 2013-12-20/5 2013-12-20
5. 2013-11-14/4 1982-11-14
1984 this returns 3. above
1982 nothing returned
2018 this returns 1.
2013 this returns 4., 5., but did not return 1., which has 2013 in the acquisition date
/6 this returns 1., 3., but also 4. (which has no /6 anywhere)
"-11-" (without the quotes) this returns 1., 2., 3., 4. - none of which has "-11-" in either accession number or acquisition date, and does not return 5., which has "-11-" in the acquisition date
So it seems that there are still some unexpected returns.
#18 Updated by Dan Gillean 10 months ago
One note on the acquisition date searches - see Radda's note 7, above. Essentially, what he's saying is that acquisition dates are set as a date type in AtoM - we're not able to tokenize these (break them into smaller components and index those) so partial searches will not return results. For example, I would not expect 1982 to return results, since it only appears in the acquisition date. Users will need to search the full acquisition date to get results, likely in quotes. We can perhaps add a note about this in the documentation?
Because of this, I'm surprised that 1984 DID return results! Might want to check the related accession and see if that appears in another field?
#21 Updated by Dan Gillean 5 months ago
- Status changed from Feedback to Verified
- Assignee deleted
- Requires documentation deleted
Calling this good enough for now, so this can be included in the 2.5 release. This definitely improved identifier and accession number searching in AtoM. If we want to improve searching further, I propose we open a new issue in the future.