Bug #7229

ElasticSearch 1.3 fails to create search index for long fields

Added by Misty De Meo over 7 years ago. Updated over 7 years ago.

Status: Verified
Start date: 09/10/2014
Priority: Medium
Due date:
Assignee: Misty De Meo
% Done: 0%
Category: Search / Browse
Target version: Release 2.1.0
Google Code Legacy ID:
Tested version: 2.0.1, 2.1
Sponsored: No
Requires documentation:

Description

When attempting to index an AtoM database using AtoM 2.1 and Elasticsearch 1.3, Elasticsearch fails because the scope and content field of one description is too long.

Console output:

 [InformationObject] Procès-verbaux. - 16 avril 1902-14 décembre 1906 inserted (107.32s) (2308/71667)

Error in one or more bulk request actions:

index: /atom/QubitInformationObject/12229 caused IllegalArgumentException[Document contains at least one immense term in field="i18n.fr.scopeAndContent.untouched" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[76, 97, 32, 115, 111, 117, 115, 45, 115, -61, -87, 114, 105, 101, 32, 112, 111, 114, 116, 101, 32, 112, 114, 105, 110, 99, 105, 112, 97, 108]...', original message: bytes can be at most 32766 in length; got 56445]; nested: MaxBytesLengthExceededException[bytes can be at most 32766 in length; got 56445];

PHP Fatal error: Uncaught exception 'Elastica\Exception\Bulk\ResponseException' with message 'Error in one or more bulk request actions:

index: /atom/QubitInformationObject/12229 caused IllegalArgumentException[Document contains at least one immense term in field="i18n.fr.scopeAndContent.untouched" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[76, 97, 32, 115, 111, 117, 115, 45, 115, -61, -87, 114, 105, 101, 32, 112, 111, 114, 116, 101, 32, 112, 114, 105, 110, 99, 105, 112, 97, 108]...', original message: bytes can be at most 32766 in length; got 56445]; nested: MaxBytesLengthExceededException[bytes can be at most 32766 in length; got 56445];
' in /home/vagrant/atom-2/vendor/Elastica/lib/Elastica/Bulk.php:395
Stack trace:
#0 /home/vagrant/atom-2/vendor/Elastica/lib/Elastica/Bulk.php(345): Elastica\Bulk->_processResponse(Object(Elastica\Response))
#1 /home/vagrant/atom-2/vendor/Elastica/ in /home/vagrant/atom-2/vendor/Elastica/lib/Elastica/Bulk.php on line 395

I was testing on the qa/2.1.x branch, which was fully up to date.


Related issues

Blocks Access to Memory (AtoM) - Feature #6334: Support Elasticsearch 1.0+ Verified 02/19/2014

History

#1 Updated by Dan Gillean over 7 years ago

  • Tested version 2.0.1, 2.1 added

#2 Updated by Misty De Meo over 7 years ago

This does not happen with Elasticsearch 0.9.x, by the way. The issue was introduced in 1.x.

According to this Stack Overflow comment, indexing of these long fields was never actually happening: older versions silently skipped them, while 1.x raises an error.
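For reference, Elasticsearch's not_analyzed string fields accept an `ignore_above` option, which silently skips values longer than the given number of characters — effectively reproducing the old 0.9.x behaviour instead of failing. A sketch only, not the fix AtoM adopted; the threshold is illustrative (`ignore_above` counts characters while the Lucene limit is 32766 bytes, so a UTF-8-safe value must stay well below that):

```json
{
  "untouched": {
    "type": "string",
    "index": "not_analyzed",
    "ignore_above": 8191
  }
}
```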

#3 Updated by Jesús García Crespo over 7 years ago

Ok, thanks for the report, Misty!
I believe a good solution would be to avoid indexing a not_analyzed version of a field unless it's strictly required. For example, information_object.title and actor.authorized_form_of_name are fine: they are varchar(1024), so their terms stay small even when not_analyzed, and they are necessary because we use them for sorting (sorting on analyzed fields is suboptimal).

For the record, this is how the i18n props look before the fix in the QubitInformationObject type:

  accessConditions
    accessConditions: type=string index=analyzed analyzer=std_portuguese
    untouched: type=string index=not_analyzed

  accruals
    accruals: type=string index=analyzed analyzer=std_portuguese
    untouched: type=string index=not_analyzed

  acquisition
    acquisition: type=string index=analyzed analyzer=std_portuguese
    untouched: type=string index=not_analyzed

  alternateTitle
    alternateTitle: type=string index=analyzed analyzer=std_portuguese
    untouched: type=string index=not_analyzed

  appraisal
    appraisal: type=string index=analyzed analyzer=std_portuguese
    untouched: type=string index=not_analyzed

  archivalHistory
    archivalHistory: type=string index=analyzed analyzer=std_portuguese
    untouched: type=string index=not_analyzed

  arrangement
    arrangement: type=string index=analyzed analyzer=std_portuguese
    untouched: type=string index=not_analyzed

  edition
    edition: type=string index=analyzed analyzer=std_portuguese
    untouched: type=string index=not_analyzed

  extentAndMedium
    extentAndMedium: type=string index=analyzed analyzer=std_portuguese
    untouched: type=string index=not_analyzed

  findingAids
    findingAids: type=string index=analyzed analyzer=std_portuguese
    untouched: type=string index=not_analyzed

  institutionResponsibleIdentifier
    institutionResponsibleIdentifier: type=string index=analyzed analyzer=std_portuguese
    untouched: type=string index=not_analyzed

  locationOfCopies
    locationOfCopies: type=string index=analyzed analyzer=std_portuguese
    untouched: type=string index=not_analyzed

  locationOfOriginals
    locationOfOriginals: type=string index=analyzed analyzer=std_portuguese
    untouched: type=string index=not_analyzed

  physicalCharacteristics
    physicalCharacteristics: type=string index=analyzed analyzer=std_portuguese
    untouched: type=string index=not_analyzed

  relatedUnitsOfDescription
    relatedUnitsOfDescription: type=string index=analyzed analyzer=std_portuguese
    untouched: type=string index=not_analyzed

  reproductionConditions
    reproductionConditions: type=string index=analyzed analyzer=std_portuguese
    untouched: type=string index=not_analyzed

  revisionHistory
    revisionHistory: type=string index=analyzed analyzer=std_portuguese
    untouched: type=string index=not_analyzed

  rules
    rules: type=string index=analyzed analyzer=std_portuguese
    untouched: type=string index=not_analyzed

  scopeAndContent
    scopeAndContent: type=string index=analyzed analyzer=std_portuguese
    untouched: type=string index=not_analyzed

  sources
    sources: type=string index=analyzed analyzer=std_portuguese
    untouched: type=string index=not_analyzed

  title
    autocomplete: type=string index=analyzed
    title: type=string index=analyzed analyzer=std_portuguese
    untouched: type=string index=not_analyzed

And this is how it looks after the change:

  accessConditions
    accessConditions: type=string index=analyzed analyzer=std_portuguese

  accruals
    accruals: type=string index=analyzed analyzer=std_portuguese

  acquisition
    acquisition: type=string index=analyzed analyzer=std_portuguese

  alternateTitle
    alternateTitle: type=string index=analyzed analyzer=std_portuguese

  appraisal
    appraisal: type=string index=analyzed analyzer=std_portuguese

  archivalHistory
    archivalHistory: type=string index=analyzed analyzer=std_portuguese

  arrangement
    arrangement: type=string index=analyzed analyzer=std_portuguese

  edition
    edition: type=string index=analyzed analyzer=std_portuguese

  extentAndMedium
    extentAndMedium: type=string index=analyzed analyzer=std_portuguese

  findingAids
    findingAids: type=string index=analyzed analyzer=std_portuguese

  institutionResponsibleIdentifier
    institutionResponsibleIdentifier: type=string index=analyzed analyzer=std_portuguese

  locationOfCopies
    locationOfCopies: type=string index=analyzed analyzer=std_portuguese

  locationOfOriginals
    locationOfOriginals: type=string index=analyzed analyzer=std_portuguese

  physicalCharacteristics
    physicalCharacteristics: type=string index=analyzed analyzer=std_portuguese

  relatedUnitsOfDescription
    relatedUnitsOfDescription: type=string index=analyzed analyzer=std_portuguese

  reproductionConditions
    reproductionConditions: type=string index=analyzed analyzer=std_portuguese

  revisionHistory
    revisionHistory: type=string index=analyzed analyzer=std_portuguese

  rules
    rules: type=string index=analyzed analyzer=std_portuguese

  scopeAndContent
    scopeAndContent: type=string index=analyzed analyzer=std_portuguese

  sources
    sources: type=string index=analyzed analyzer=std_portuguese

  title
    autocomplete: type=string index=analyzed
    title: type=string index=analyzed analyzer=std_portuguese
    untouched: type=string index=not_analyzed

Note that only title keeps the extra untouched (raw) and autocomplete versions. The same applies to other entities such as actor, repository, and term.
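In Elasticsearch 1.x mapping syntax, the change described above amounts to something like the following (field and analyzer names are taken from the dumps above; the actual AtoM mapping code may differ):

```json
{
  "scopeAndContent": {
    "type": "string",
    "analyzer": "std_portuguese"
  },
  "title": {
    "type": "string",
    "analyzer": "std_portuguese",
    "fields": {
      "untouched": { "type": "string", "index": "not_analyzed" },
      "autocomplete": { "type": "string" }
    }
  }
}
```

Dropping the `untouched` sub-field from the long free-text fields keeps their analyzed (tokenized) terms, which are individually small, and only title retains the raw copy needed for sorting and autocompletion.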

#4 Updated by Jesús García Crespo over 7 years ago

  • Status changed from New to QA/Review
  • Assignee changed from Jesús García Crespo to Misty De Meo

Fixed in 847f75b.

FYI, I've also tagged that commit as "v2.1.0-rc3".
Please verify that this fix solves your problem :)

#5 Updated by Jesús García Crespo over 7 years ago

  • Status changed from QA/Review to Verified
