Bug #5850

Slash in search phrase causes internal server error

Added by Tim Hutchinson over 6 years ago. Updated almost 3 years ago.

Status:VerifiedStart date:10/22/2013
Priority:MediumDue date:
Assignee:-% Done:

0%

Category:Search / Browse
Target version:Release 2.4.0
Google Code Legacy ID: Tested version:2.2, 2.4
Sponsored:No Requires documentation:

Description

This came up in trying to search the identifier field via advanced search, but it occurs in the general search and for any field: A search such as MG01/01 (anything with a slash; most likely to come up for identifier) leads to an internal server error.

History

#1 Updated by Tim Hutchinson over 6 years ago

In debug mode, the error returned (this is for identifier) is below, but it seems to boil down to: Cannot parse 'MG01/01': Lexical error at line 1, column 8. Encountered: <EOF> after : "/01"

SearchPhaseExecutionException[Failed to execute phase [query], all shards failed; shardFailures {[0uqyggAxToKGc20bK8eeSQ][qubit_item2x][3]: SearchParseException[[qubit_item2x][3]: from[-1],size10: Parse Failure [Failed to parse source [{"size":"10","query":{"bool":{"must":[{"bool":{"must":[{"query_string":{"query":"MG01/01","default_field":"identifier","default_operator":"OR"}}]}}]}},"filter":{"bool":{"must":[{"term":{"publicationStatusId":160}}]}}}]]]; nested: QueryParsingException[[qubit_item2x] Failed to parse query [MG01/01]]; nested: ParseException[Cannot parse 'MG01/01': Lexical error at line 1, column 8. Encountered: <EOF> after : "/01"]; nested: TokenMgrError[Lexical error at line 1, column 8. Encountered: <EOF> after : "/01"]; }{[0uqyggAxToKGc20bK8eeSQ][qubit_item2x][1]: SearchParseException[[qubit_item2x][1]: from[-1],size10: Parse Failure [Failed to parse source [{"size":"10","query":{"bool":{"must":[{"bool":{"must":[{"query_string":{"query":"MG01/01","default_field":"identifier","default_operator":"OR"}}]}}]}},"filter":{"bool":{"must":[{"term":{"publicationStatusId":160}}]}}}]]]; nested: QueryParsingException[[qubit_item2x] Failed to parse query [MG01/01]]; nested: ParseException[Cannot parse 'MG01/01': Lexical error at line 1, column 8. Encountered: <EOF> after : "/01"]; nested: TokenMgrError[Lexical error at line 1, column 8. Encountered: <EOF> after : "/01"]; }{[0uqyggAxToKGc20bK8eeSQ][qubit_item2x][2]: SearchParseException[[qubit_item2x][2]: from[-1],size10: Parse Failure [Failed to parse source [{"size":"10","query":{"bool":{"must":[{"bool":{"must":[{"query_string":{"query":"MG01/01","default_field":"identifier","default_operator":"OR"}}]}}]}},"filter":{"bool":{"must":[{"term":{"publicationStatusId":160}}]}}}]]]; nested: QueryParsingException[[qubit_item2x] Failed to parse query [MG01/01]]; nested: ParseException[Cannot parse 'MG01/01': Lexical error at line 1, column 8. Encountered: <EOF> after : "/01"]; nested: TokenMgrError[Lexical error at line 1, column 8. Encountered: <EOF> after : "/01"]; }{[0uqyggAxToKGc20bK8eeSQ][qubit_item2x][0]: SearchParseException[[qubit_item2x][0]: from[-1],size10: Parse Failure [Failed to parse source [{"size":"10","query":{"bool":{"must":[{"bool":{"must":[{"query_string":{"query":"MG01/01","default_field":"identifier","default_operator":"OR"}}]}}]}},"filter":{"bool":{"must":[{"term":{"publicationStatusId":160}}]}}}]]]; nested: QueryParsingException[[qubit_item2x] Failed to parse query [MG01/01]]; nested: ParseException[Cannot parse 'MG01/01': Lexical error at line 1, column 8. Encountered: <EOF> after : "/01"]; nested: TokenMgrError[Lexical error at line 1, column 8. Encountered: <EOF> after : "/01"]; }]

#2 Updated by Dan Gillean over 6 years ago

  • Category set to Search / Browse
  • Assignee set to Jesús García Crespo
  • Target version set to Release 2.0.2

Reproduced in dev branch. Will have to look into ES parameters and see what is allowed, and if we can escape this character when searching or something.

#3 Updated by Tim Hutchinson over 6 years ago

It turns out that enclosing the entire string in quotes would be one way to make this work.

There also seem to be other reserved characters that raise this error, e.g. [ ], although using those in a search seems less likely.

#4 Updated by Jesús García Crespo almost 6 years ago

  • Target version changed from Release 2.0.2 to Release 2.1.0

#5 Updated by Jesús García Crespo almost 6 years ago

  • Target version changed from Release 2.1.0 to Release 2.2.0

#6 Updated by Sarah Romkey over 5 years ago

  • Assignee changed from Jesús García Crespo to José Raddaoui Marín

#7 Updated by Tim Hutchinson over 5 years ago

Further information about this "odd" bug in this user forum thread:
https://groups.google.com/d/topic/ica-atom-users/7BZ7xOkm7K4/discussion

#8 Updated by José Raddaoui Marín over 5 years ago

  • Status changed from New to QA/Review
  • Assignee changed from José Raddaoui Marín to Dan Gillean

As David said in the user forum thread, forward-slashes are used for regular expressions and it's one of the reserved characters:

http://www.elastic.co/guide/en/elasticsearch/reference/master/query-dsl-query-string-query.html

It's used as a delimiter in the tokenizer when the fields are anlyzed, e.g. "foo/bar" is splitted and saved as "foo", "bar":

http://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-word-delimiter-tokenfilter.html

For example, I've created three descriptions with the following titles:

- foo
- bar
- foo/bar

And I've done the following searches:

Search: foo

Results: foo, foo/bar

Search: bar

Results: bar, foo/bar

Search: foobar

Results:

Search: foo bar

Results: bar, foo, foo/bar

Search: foo\\/bar (the only way I could find to escape the /, but in the end is searching: foo bar)

Results: bar, foo, foo/bar

Search: "foo/bar"

Results: foo/bar

The only way to get "foo/bar" and not the other two, is using doble quotes, wich I think is okay being a reserved character. I don't think we should change anything, maybe add those links to our documentation somewhere.

#9 Updated by Dan Gillean over 5 years ago

  • Status changed from QA/Review to Won't fix
  • Target version deleted (Release 2.2.0)
  • Requires documentation set to Yes
  • Tested version 2.2 added

Thanks for the useful feedback, Radda. I agree, at this point there's not much we can do, and there are many workarounds. I have marked this as "won't fix," but I've also set it as requiring documentation, as a note to myself to supplement our advanced search documentation with some more information about search behavior, reserved characters, and search strategies.

#10 Updated by Dan Gillean over 4 years ago

UNTESTED fix suggestion from a community user (on the related user forum thread, linked above by Tim):

I am still testing this but it seems to work if I add:
../apps/qubit/modules/search/actions/indexAclion.class.php
use Elastica\Util;

$queryText = new \Elastica\Query\QueryString(Util::escapeTerm($request->query));

#12 Updated by David Juhasz about 3 years ago

My expectation is that using Util::escapeTerm($request->query) will prevent ALL reserved characters from working in search. This may be an acceptable trade-off for some AtoM users, but would definitely need to be a configuration option as it would be unacceptable for other users.

I'm surprised that escaping the query works for searches that include forward slashes, as I would expect the slashes to be removed from the ES strings by the standard tokenizer. Maybe escaping the slashes is sufficient for ES to treat the entire string as one search phrase even though the forward slashes are not indexed?

#13 Updated by Nick Wilkinson about 3 years ago

  • Status changed from Won't fix to In progress
  • Assignee changed from Dan Gillean to Steve Breker

HI Steve, can you please look into testing the fix suggested in comment # 10 on this issue?

#14 Updated by Steve Breker about 3 years ago

  • Status changed from In progress to Feedback
  • Assignee changed from Steve Breker to Nick Wilkinson

I inserted the change in atom/apps/qubit/modules/informationobject/actions/browseAction.class.php at line 462 to change:

$this->getParameters['sq0'] = $request->query;

to

$this->getParameters['sq0'] = \Elastica\Util::escapeTerm($request->query);

Gist here: https://gist.github.com/sbreker/a327b875a59139d62037d02bac3d8578

This seems to give the desired results in that searches with forward slashes in them "/" do not trigger ES errors. Conducting Radda's tests over again we see:

Pre-changes:

Search: foo
Results: foo, foo/bar

Search: bar
Results: bar, foo/bar (search quick results contains only 'bar')

Search: foobar
Results:

Search: foo bar
Results: foo/bar (diff results from Radda (foo, bar, foo/bar) - result of chg from OR to AND search. Search quick results contains 'bar', 'foo' and 'foo/bar')

Search: foo\/bar
Results: foo/bar (diff results from Radda (foo, bar, foo/bar) - result of chg from OR to AND search. Search quick results contains 'bar', 'foo' and 'foo/bar')

Search: "foo/bar"
Results: foo/bar (Search quick results contains 'bar', 'foo' and 'foo/bar')

Search: foo/bar
Results: Elasticsearch error: Elastica\Exception\ResponseException (SearchPhaseExecutionException)

Post-changes:

Search: foo
Results: foo, foo/bar

Search: bar
Results: bar, foo/bar

Search: foobar
Results:

Search: foo bar
Results: foo/bar (diff results from Radda - result of chg from OR to AND search. Search quick results contains 'bar', 'foo' and 'foo/bar')

Search: foo\/bar (the only way I could find to escape the /, but in the end is searching: foo bar)
Results: foo/bar (diff results from Radda (foo, bar, foo/bar) - result of chg from OR to AND search. Search quick results contains 'bar', 'foo' and 'foo/bar')

Search: "foo/bar"
Results: foo/bar (Search quick results contains 'bar', 'foo' and 'foo/bar')

Search: foo/bar
Results: foo/bar (Search quick results contains 'bar', 'foo' and 'foo/bar')

#15 Updated by David Juhasz about 3 years ago

Hi Steve,

Two questions about your tests:

  1. What does "(Search quick results contains 'bar', 'foo' and 'foo/bar')" mean exactly? Are these the quick search results? Did you do tests with other types of search?
  2. Did you test searching with other reserved characters? A test with the or operator "||" or the wildcard operator "*" should confirm if they are being treated as literal characters.

#16 Updated by David Juhasz about 3 years ago

  • Assignee changed from Nick Wilkinson to Steve Breker

#17 Updated by Steve Breker about 3 years ago

  • Assignee changed from Steve Breker to David Juhasz

To clarify, I am searching from the top header search box to test this. When I was referring to the "quick search results" above, I am referring to the autocomplete dropdown results that populate under the search box as the user types their query. The main 'Results' I refer to above are what appears in the main page after the user executes the search by pressing 'enter'.

Additional tests:
- Added a new descr with title "Test*fond"
- Added a new descr with title "Test||fond"
- Added a new descr with title "Test fond"
- Added a new descr with title "Test * fond"

Searches performed:
- search for "Test*fond"
- search for "Test||fond"
- Search for "Test"
- search for "Test*"
- search for "Test||"
- search for "Test fond"
- search for "fond"
- search for "Test fond" <-- space in front of asterisk "Test" space asterisk "fond"
- search for "Test
fond" <-- space following asterisk "Test" asterisk space "fond"
- Search for "*Test"
- Search for "||Test"

The above searches all successfully return the same three records: "Test*fond", "Test||fond", "Test fond" & "Test * fond".

- search for "Test *" <-- with a space
- search for "Test * fond" <-- with spaces

Returns zero results

So it appears that the escaped special characters are not considered as tokens in the search. There is a strange edge case where when the special character is surrounded by whitespace that no results are returned, but no error results.

No errors are triggered when the special characters are included in a user's search phrase when the code change is in place.

#18 Updated by Dan Gillean about 3 years ago

Hi Steve,

What about searching where those symbols are used not as characters in a title, but as operators - see:

For example, what records are returned if you search for:

test?fonds

or

test*

or

test~

?

#19 Updated by Steve Breker about 3 years ago

Ah, I did not realize these operators were available. I have tested these and they break when using Elastica/Utils:: escapeTerm() since they are escaped prior to the ES search executing. 'tes~' returns zero results with escapeTerm() in place, and all expected results when it is not.

It would be really simple to create a custom version of escapeTerm() to only escape characters we choose. The contents of the Elastica escapeTerm() method are below. It would be easy to copy this to QubitHelper.php and customize the $chars array to include '/'. Or even make the list a configurable setting.

public static function escapeTerm($term)
{
$result = $term;
// \ escaping has to be first, otherwise escaped later once again
$chars = array('\\', '+', '-', '&&', '||', '!', '(', ')', '{', '}', '[', ']', '^', '"', '~', '*', '?', ':', '/', '<', '>');
foreach ($chars as $char) {
$result = str_replace($char, '\\'.$char, $result);
}
return $result;
}

#20 Updated by Dan Gillean about 3 years ago

I don't think we should implement a solution that disables all other available ES special character operators. I much prefer your second proposed approach, Steve.

#21 Updated by David Juhasz about 3 years ago

  • Assignee deleted (David Juhasz)

I think any escaping strategy needs to be configurable. Some institutions will only need forward slash escaping. Some will want hyphen escaping. Some may want to escape all reserve characters. Some may not want to escape any reserve characters.

The scope of the work is highly variable depending on how configurable we want the escaping to be, but in any case we will need a sponsor for the work to add escaping configuration.

#22 Updated by Steve Breker about 3 years ago

Regarding Borthwick: Dan suggested that they have this working in their prod site - I do not think this is true.

The following query seems to work at Borthwick supporting both '/' and '*' (MD/100/*):

https://borthcat.york.ac.uk/index.php/search?query=MD%2F100%2F*

In this situation, there are two '/' characters in the search phrase creating a correctly formatted regex expression. Changing this query to search for 'MD/100' as in the following query:

https://borthcat.york.ac.uk/index.php/search?query=MD%2F100

... will trigger a 500 error at Borthwick.

If you search with two forward slash characters in a search phrase in stock stable/2.3.x or qa/2.4.x, it will also seem to work, not generating any errors. It is considering the text between the slashes as a literal regex component.

https://demo.accesstomemory.org/informationobject/browse?topLod=0&query=ON%2F0%2F*&repos=

#23 Updated by Steve Breker about 3 years ago

In terms of the location where the escaping is done, the forum post indicates to make the change in:

search/actions/indexAction.class.php at line 31 (qa/2.4.x)

$queryText = new \Elastica\Query\QueryString(\Elastica\Util::escapeTerm($request->query));

Changes will need to be made in different locations as this controller no longer handles search from the top search box since changes made to support Adv Search (ticket 9141 See https://github.com/artefactual/atom/commit/d9f449c58b91db6b3fa70678dd08bf274cb66ade and look for changes to indexAction). I think indexAction now is used for XHR requests from IO treeview search.

The search phrase is passed back and forth in $request->sq0 (not sure what sf0 is for). When making this change be careful no to modify the contents of $request->sq0 as this will cause the escape chars to show up in the display fields when the page is redrawn.

#24 Updated by Steve Breker about 3 years ago

  • Assignee set to David Juhasz

Development plan:

- extend Elastica/Util.php in a new file e.g. QubitElasticaUtil.class.php
- look for all calls in AtoM to Elastica/Util.php functions and update to instead call QubitElasticaUtil.
- add a setting to activate escapeTerm() in WebUI
- check setting before call to escapeTerm()
- default list of escaped chars will include '/' only.
- allow override of this list using settings.yml.

Steve will test if escaping hyphen ('-') provides different search results than when not escaping hyphen.

#25 Updated by Steve Breker about 3 years ago

I see no difference searching with '-' vs '\\-'. It's like the hyphen is ignored in both cases - exact same search results.

E.g. searching for 'test-fond' or 'test\\-fond' both return 'test-fond', 'test*fond', 'test fond' and 'test * fond'.

#26 Updated by David Juhasz about 3 years ago

  • Subject changed from slash in search phrase causes internal server error to Slash in search phrase causes internal server error

#29 Updated by Dan Gillean about 3 years ago

  • Status changed from Feedback to Verified
  • Assignee deleted (David Juhasz)
  • Target version set to Release 2.4.0
  • Tested version 2.4 added

We have implemented this as configurable via settings. Users can add multiple special characters to be escaped as needed, separated by commas.

When escaped, items such as foo/bar are still tokenized, so a search for foo/bar will return records such as:

  • foo bar/foo bar
  • foo bar
  • foo/bar
  • foo//bar
  • foo/foo/bar/bar

... and searching in quotations for "foo/bar" returns the same (as I believe the search character is being escaped in both search and result, so you get foo and bar as tokenized search params. This means that searching for bar/foo will return records that contain "foo" AND "bar" in any order, but searcing for "bar/foo" would not return any of the the above except:

  • foo bar/foo bar

#30 Updated by Dan Gillean almost 3 years ago

  • Requires documentation deleted (Yes)

Also available in: Atom PDF