Bug #7229
Elasticsearch 1.3 fails to create search index for long fields
| Status: | Verified | Start date: | 09/10/2014 |
| --- | --- | --- | --- |
| Priority: | Medium | Due date: | |
| Assignee: | Misty De Meo | % Done: | 0% |
| Category: | Search / Browse | | |
| Target version: | Release 2.1.0 | | |
| Google Code Legacy ID: | | Tested version: | 2.0.1, 2.1 |
| Sponsored: | No | Requires documentation: | |
Description
When attempting to index an AtoM database using AtoM 2.1 and Elasticsearch 1.3, Elasticsearch fails because the scope and content of some descriptions is too long.
Console output:
[InformationObject] Procès-verbaux. - 16 avril 1902-14 décembre 1906 inserted (107.32s) (2308/71667)

Error in one or more bulk request actions: index: /atom/QubitInformationObject/12229 caused IllegalArgumentException[Document contains at least one immense term in field="i18n.fr.scopeAndContent.untouched" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[76, 97, 32, 115, 111, 117, 115, 45, 115, -61, -87, 114, 105, 101, 32, 112, 111, 114, 116, 101, 32, 112, 114, 105, 110, 99, 105, 112, 97, 108]...', original message: bytes can be at most 32766 in length; got 56445]; nested: MaxBytesLengthExceededException[bytes can be at most 32766 in length; got 56445];

PHP Fatal error: Uncaught exception 'Elastica\Exception\Bulk\ResponseException' with message 'Error in one or more bulk request actions: index: /atom/QubitInformationObject/12229 caused IllegalArgumentException[Document contains at least one immense term in field="i18n.fr.scopeAndContent.untouched" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[76, 97, 32, 115, 111, 117, 115, 45, 115, -61, -87, 114, 105, 101, 32, 112, 111, 114, 116, 101, 32, 112, 114, 105, 110, 99, 105, 112, 97, 108]...', original message: bytes can be at most 32766 in length; got 56445]; nested: MaxBytesLengthExceededException[bytes can be at most 32766 in length; got 56445]; ' in /home/vagrant/atom-2/vendor/Elastica/lib/Elastica/Bulk.php:395
Stack trace:
#0 /home/vagrant/atom-2/vendor/Elastica/lib/Elastica/Bulk.php(345): Elastica\Bulk->_processResponse(Object(Elastica\Response))
#1 /home/vagrant/atom-2/vendor/Elastica/ in /home/vagrant/atom-2/vendor/Elastica/lib/Elastica/Bulk.php on line 395
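For context: Elasticsearch indexes each not_analyzed value as a single Lucene term, and Lucene caps terms at 32766 bytes. The following is a minimal sketch of the failure outside AtoM, assuming Elastica ~1.x against a local Elasticsearch 1.3 node (the index name is made up for illustration; the mapping shape mirrors AtoM's i18n props):

```php
<?php
// Hedged reproduction sketch, not AtoM code: assumes Elastica ~1.x and a
// local Elasticsearch 1.3 node.
require 'vendor/autoload.php';

$client = new \Elastica\Client(); // defaults to localhost:9200
$index  = $client->getIndex('immense_term_test');
$index->create(array(), true);    // true = delete and recreate if it exists

$type = $index->getType('QubitInformationObject');

// Mirror AtoM's i18n props: an analyzed field plus a not_analyzed
// "untouched" sub-field. The sub-field is the one that hits Lucene's
// 32766-byte limit, because the whole value becomes a single term.
$mapping = new \Elastica\Type\Mapping($type, array(
    'scopeAndContent' => array(
        'type'   => 'string',
        'index'  => 'analyzed',
        'fields' => array(
            'untouched' => array('type' => 'string', 'index' => 'not_analyzed'),
        ),
    ),
));
$mapping->send();

// Any value longer than 32766 UTF-8 bytes triggers the failure.
$doc = new \Elastica\Document(1, array('scopeAndContent' => str_repeat('a', 40000)));
$type->addDocument($doc); // throws an Elastica response exception on ES 1.x
```

On 0.90.x the same script should succeed, with the oversized term silently dropped, which matches the observation in comment #2 below.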
I was testing on the qa/2.1.x branch, which was fully up to date.
History
#1 Updated by Dan Gillean over 7 years ago
- Tested version 2.0.1, 2.1 added
#2 Updated by Misty De Meo over 7 years ago
This does not happen with Elasticsearch 0.9.x, by the way. The issue was introduced in 1.x.
According to a Stack Overflow comment, indexing of these long fields was never actually happening: older versions silently ignored the oversized terms, while 1.x turns them into an error.
#3 Updated by Jesús García Crespo over 7 years ago
Ok, thanks for the report Misty!
I believe a good solution could be to avoid indexing a not_analyzed version of a field unless it's strictly required. For example, information_object.title or actor.authorized_form_of_name are fine: they are varchar(1024), so the terms stay small even when not_analyzed, and they are necessary because we use them for sorting (it's suboptimal to sort on analyzed fields).
For the record, this is how the i18n props look before the fix in the QubitInformationObject type:
accessConditions
    accessConditions: type=string index=analyzed analyzer=std_portuguese
    untouched: type=string index=not_analyzed
accruals
    accruals: type=string index=analyzed analyzer=std_portuguese
    untouched: type=string index=not_analyzed
acquisition
    acquisition: type=string index=analyzed analyzer=std_portuguese
    untouched: type=string index=not_analyzed
alternateTitle
    alternateTitle: type=string index=analyzed analyzer=std_portuguese
    untouched: type=string index=not_analyzed
appraisal
    appraisal: type=string index=analyzed analyzer=std_portuguese
    untouched: type=string index=not_analyzed
archivalHistory
    archivalHistory: type=string index=analyzed analyzer=std_portuguese
    untouched: type=string index=not_analyzed
arrangement
    arrangement: type=string index=analyzed analyzer=std_portuguese
    untouched: type=string index=not_analyzed
edition
    edition: type=string index=analyzed analyzer=std_portuguese
    untouched: type=string index=not_analyzed
extentAndMedium
    extentAndMedium: type=string index=analyzed analyzer=std_portuguese
    untouched: type=string index=not_analyzed
findingAids
    findingAids: type=string index=analyzed analyzer=std_portuguese
    untouched: type=string index=not_analyzed
institutionResponsibleIdentifier
    institutionResponsibleIdentifier: type=string index=analyzed analyzer=std_portuguese
    untouched: type=string index=not_analyzed
locationOfCopies
    locationOfCopies: type=string index=analyzed analyzer=std_portuguese
    untouched: type=string index=not_analyzed
locationOfOriginals
    locationOfOriginals: type=string index=analyzed analyzer=std_portuguese
    untouched: type=string index=not_analyzed
physicalCharacteristics
    physicalCharacteristics: type=string index=analyzed analyzer=std_portuguese
    untouched: type=string index=not_analyzed
relatedUnitsOfDescription
    relatedUnitsOfDescription: type=string index=analyzed analyzer=std_portuguese
    untouched: type=string index=not_analyzed
reproductionConditions
    reproductionConditions: type=string index=analyzed analyzer=std_portuguese
    untouched: type=string index=not_analyzed
revisionHistory
    revisionHistory: type=string index=analyzed analyzer=std_portuguese
    untouched: type=string index=not_analyzed
rules
    rules: type=string index=analyzed analyzer=std_portuguese
    untouched: type=string index=not_analyzed
scopeAndContent
    scopeAndContent: type=string index=analyzed analyzer=std_portuguese
    untouched: type=string index=not_analyzed
sources
    sources: type=string index=analyzed analyzer=std_portuguese
    untouched: type=string index=not_analyzed
title
    autocomplete: type=string index=analyzed
    title: type=string index=analyzed analyzer=std_portuguese
    untouched: type=string index=not_analyzed
And this is how it looks after the change:
accessConditions
    accessConditions: type=string index=analyzed analyzer=std_portuguese
accruals
    accruals: type=string index=analyzed analyzer=std_portuguese
acquisition
    acquisition: type=string index=analyzed analyzer=std_portuguese
alternateTitle
    alternateTitle: type=string index=analyzed analyzer=std_portuguese
appraisal
    appraisal: type=string index=analyzed analyzer=std_portuguese
archivalHistory
    archivalHistory: type=string index=analyzed analyzer=std_portuguese
arrangement
    arrangement: type=string index=analyzed analyzer=std_portuguese
edition
    edition: type=string index=analyzed analyzer=std_portuguese
extentAndMedium
    extentAndMedium: type=string index=analyzed analyzer=std_portuguese
findingAids
    findingAids: type=string index=analyzed analyzer=std_portuguese
institutionResponsibleIdentifier
    institutionResponsibleIdentifier: type=string index=analyzed analyzer=std_portuguese
locationOfCopies
    locationOfCopies: type=string index=analyzed analyzer=std_portuguese
locationOfOriginals
    locationOfOriginals: type=string index=analyzed analyzer=std_portuguese
physicalCharacteristics
    physicalCharacteristics: type=string index=analyzed analyzer=std_portuguese
relatedUnitsOfDescription
    relatedUnitsOfDescription: type=string index=analyzed analyzer=std_portuguese
reproductionConditions
    reproductionConditions: type=string index=analyzed analyzer=std_portuguese
revisionHistory
    revisionHistory: type=string index=analyzed analyzer=std_portuguese
rules
    rules: type=string index=analyzed analyzer=std_portuguese
scopeAndContent
    scopeAndContent: type=string index=analyzed analyzer=std_portuguese
sources
    sources: type=string index=analyzed analyzer=std_portuguese
title
    autocomplete: type=string index=analyzed
    title: type=string index=analyzed analyzer=std_portuguese
    untouched: type=string index=not_analyzed
Note that only title keeps the extra raw (untouched) and autocomplete versions, as the sketch below illustrates. The same applies to the other entity types, like actor, repository or term.
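To illustrate why title keeps its untouched copy, here is a hedged sketch of a sorted query; the field path and Elastica ~1.x API are assumptions for illustration, not taken from AtoM's actual query code:

```php
<?php
// Hedged sketch: sorting on the not_analyzed sub-field. The field path is
// inferred from the mapping dump above.
require 'vendor/autoload.php';

$client = new \Elastica\Client();
$type   = $client->getIndex('atom')->getType('QubitInformationObject');

$query = new \Elastica\Query(new \Elastica\Query\MatchAll());

// Sorting on the analyzed "title" field would compare individual tokens;
// the not_analyzed "untouched" copy sorts on the whole string, which is
// what browse lists need. This is why title cannot lose its raw version,
// while long free-text fields like scopeAndContent safely can.
$query->setSort(array('i18n.fr.title.untouched' => 'asc'));

$resultSet = $type->search($query);
```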
#4 Updated by Jesús García Crespo over 7 years ago
- Status changed from New to QA/Review
- Assignee changed from Jesús García Crespo to Misty De Meo
Fixed in 847f75b.
FYI I've also tagged that commit as "v2.1.0-rc3".
Please verify whether this fix solves your problem :)
#5 Updated by Jesús García Crespo over 7 years ago
- Status changed from QA/Review to Verified