Searching for titles with stop words will not give results
Tested version: 2.4
- Navigate to the AtoM demo site
- Search for "department of medical imaging" - 0 results
- Now search for "department medical imaging" - 1 result
No results are found for a search on department of medical imaging, even though there is a matching record.
Expected result: 1 matching record returned when the search is: department of medical imaging
I believe that two factors in Elasticsearch (ES) are causing this issue: we use the English stopwords list, and our default Boolean operator is AND in 2.4 and later. In ES, the default English stopwords are as follows:
a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with
This means that "of" is being stripped from the index. However, our default Boolean operator is now AND in 2.4 or later, meaning that a search for department of medical imaging is in fact:
department AND of AND medical AND imaging
However, with the stopword removed from the index, no record contains the term "of", so the "AND of" clause can never be satisfied and AtoM doesn't find a record that matches all the requirements.
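To make the suspected mechanism concrete, here is a minimal toy model in Python. It is not Elasticsearch or AtoM code: the function names and the whitespace tokenizer are assumptions for illustration only. It simply assumes that stopwords are dropped at index time while the query keeps them, which is the hypothesis described above and has not been confirmed against our actual index settings.

```python
# Toy model of the hypothesis only; this is NOT how ES is implemented.
# Stopword list copied from the ES default English list quoted above.
ENGLISH_STOPWORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
}


def index_terms(text):
    """Hypothetical index-time analysis: lowercase, split, drop stopwords."""
    return {t for t in text.lower().split() if t not in ENGLISH_STOPWORDS}


def and_search(query, indexed, strip_query_stopwords=False):
    """Return True only if every query term is present in the indexed terms."""
    terms = query.lower().split()
    if strip_query_stopwords:
        # What we would expect if the query were analyzed like the index.
        terms = [t for t in terms if t not in ENGLISH_STOPWORDS]
    return all(t in indexed for t in terms)


title = index_terms("Department of Medical Imaging")

# "of" is required by AND but is missing from the index, so no match.
print(and_search("department of medical imaging", title))
# Without the stopword in the query, every term matches.
print(and_search("department medical imaging", title))
# If the query were also stripped of stopwords, the match would succeed.
print(and_search("department of medical imaging", title, strip_query_stopwords=True))
```

If this model is right, the thing to check is whether the search-time analyzer differs from the index-time analyzer for the affected fields.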
You'd think this issue would come up for many ES users, but interestingly, I couldn't find many other examples of users reporting issues like this, so it's possible that there's something incorrect in how we've implemented our index settings.
More on stopwords:
- 1.x documentation: https://www.elastic.co/guide/en/elasticsearch/guide/1.x/stopwords.html and following chapter sections
- Master branch ES version: https://www.elastic.co/guide/en/elasticsearch/guide/master/using-stopwords.html
There are alternatives to stopwords we could investigate as well - like using Common terms instead. See for example:
There is also the common_grams token filter:
And likely other configuration options.
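As a starting point for that investigation, the sketch below builds an index settings fragment that uses the common_grams token filter in place of stopword removal, and prints it as JSON. The filter type `common_grams` and its `common_words` parameter are part of ES; the names `atom_common_grams` and `atom_text`, the shortened word list, and the helper function are hypothetical and do not reflect AtoM's current configuration. It has not been tested against our ES version.

```python
import json

# Shortened, illustrative word list (assumption); a real setup would likely
# reuse the full English stopwords list.
COMMON_WORDS = ["a", "an", "and", "of", "the", "to"]


def build_settings(common_words):
    """Build an ES 'settings' fragment that keeps common words in the index
    and additionally emits bigrams such as 'department_of'."""
    return {
        "analysis": {
            "filter": {
                # Hypothetical filter name.
                "atom_common_grams": {
                    "type": "common_grams",
                    "common_words": list(common_words),
                }
            },
            "analyzer": {
                # Hypothetical analyzer name.
                "atom_text": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "atom_common_grams"],
                }
            },
        }
    }


settings = build_settings(COMMON_WORDS)
print(json.dumps(settings, indent=2))
```

Because no stop filter is applied here, "of" stays in the index, so an AND query containing it could still match. The trade-off is a larger index.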
I think we need to first investigate if this is something we can adjust to work better while using the various stopwords lists. Barring that, we may have to investigate some of these alternative methods - which may require development support.