Feature #8687

Improve ES analyzers to work better with diacritics

Added by José Raddaoui Marín almost 7 years ago. Updated over 6 years ago.

Status: Verified
Start date: 06/03/2015
Priority: Medium
Due date:
Assignee: Dan Gillean
% Done: 0%
Category: Search / Browse
Target version: Release 2.3.0
Google Code Legacy ID:
Tested version:
Sponsored: Yes
Requires documentation:

Description

1. Searches return different results depending on whether accents/diacritics are used.
For example, searches for "évaluation" and "evaluation", "hôtel" and "hotel", or "déjà" and "deja" do not return the same results.
Would it be possible to change the catalogue so that it ignores diacritics/accents in searches?

2. When we search for a word like "évaluation", the results do not include all the descriptions containing "l'évaluation" or "d'évaluation".
Would it be possible to fix this so that entering "évaluation" in the search box returns all results, including descriptions with "l'évaluation" and "d'évaluation"?
This is a problem for all words beginning with a, e, i, o, u or y, since in French they are often preceded by "l'" or "d'". The apostrophe (') should be treated as a separator between two words.

Proposed solutions:

  1. Implement the ASCII folding token filter for French to ignore diacritics in searches to address case #1
  2. Implement the Elision token filter to address case #2
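
To illustrate what the two proposed filters are meant to achieve, here is a minimal Python sketch that imitates them outside Elasticsearch. It is not the AtoM or Elasticsearch implementation: the function names, the tokenizing regex, and the ELIDED_ARTICLES list are assumptions made for this example, and Unicode decomposition is only an approximation of what the ASCII folding filter does.

```python
import re
import unicodedata

# Assumed list of French elided articles, for illustration only;
# the list used by Elasticsearch's elision filter may differ.
ELIDED_ARTICLES = ("l", "d", "m", "t", "s", "n", "c", "j", "qu")


def fold_diacritics(token):
    """Approximate ASCII folding: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def strip_elision(token):
    """Remove a leading elided article such as l' or d' (case #2)."""
    for apostrophe in ("'", "\u2019"):
        head, sep, tail = token.partition(apostrophe)
        if sep and tail and head.lower() in ELIDED_ARTICLES:
            return tail
    return token


def analyze(text):
    """Tokenize on letters, lowercase, strip elisions, fold diacritics."""
    tokens = re.findall(r"[^\W\d_]+(?:['\u2019][^\W\d_]+)*", text.lower())
    return [fold_diacritics(strip_elision(t)) for t in tokens]


if __name__ == "__main__":
    print(analyze("L'évaluation de l'hôtel, déjà"))
```

With both steps applied at index time and at search time, "évaluation", "evaluation" and "l'évaluation" all reduce to the same token, which is the behaviour requested in cases #1 and #2.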

Related issues

Related to Access to Memory (AtoM) - Bug #8676: Elasticsearch analyzers not working over 'multi_field' ty... Verified 07/10/2015

History

#2 Updated by José Raddaoui Marín almost 7 years ago

  • File deleted (french_token_filters.png)

#3 Updated by José Raddaoui Marín almost 7 years ago

  • File deleted (french_analyzer.png)

#4 Updated by José Raddaoui Marín almost 7 years ago

  • File deleted (i18n_fr_fields_mapping.png)

#5 Updated by José Raddaoui Marín almost 7 years ago

  • Status changed from New to Code Review
  • Assignee changed from José Raddaoui Marín to Mike Gale

#6 Updated by Mike Gale almost 7 years ago

  • Status changed from Code Review to Feedback
  • Assignee changed from Mike Gale to José Raddaoui Marín

#7 Updated by Mike Gale almost 7 years ago

Looks good

#8 Updated by José Raddaoui Marín almost 7 years ago

  • Status changed from Feedback to QA/Review
  • Assignee changed from José Raddaoui Marín to Dan Gillean

Merged in qa/2.3.x

The search index needs to be rebuilt.

#9 Updated by José Raddaoui Marín almost 7 years ago

  • Related to Bug #8676: Elasticsearch analyzers not working over 'multi_field' type fields added

#10 Updated by José Raddaoui Marín almost 7 years ago

I've added the asciifolding filter to the ES default analyzer so that it also works with non-i18n fields:

https://github.com/artefactual/atom/commit/e510528327112720ab7de4781d1bba9dc3ce0185

The search index needs to be rebuilt again.
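
For readers unfamiliar with Elasticsearch analysis settings, the following Python sketch shows roughly what a default analyzer with the asciifolding filter can look like when expressed as index settings. The exact tokenizer and filter chain here are assumptions for illustration; the configuration actually used by AtoM is the one in the commit linked above.

```python
import json

# Assumed index settings, for illustration only. "standard", "lowercase"
# and "asciifolding" are built-in Elasticsearch components; the chain
# chosen by AtoM may differ from this one.
default_analyzer_settings = {
    "analysis": {
        "analyzer": {
            # An analyzer named "default" is applied to fields that do
            # not declare their own analyzer.
            "default": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "asciifolding"],
            }
        }
    }
}

if __name__ == "__main__":
    # Print the body that would be sent when creating the index.
    print(json.dumps({"settings": default_analyzer_settings}, indent=2))
```

Analyzer changes only affect documents indexed after the change, which is why the index has to be rebuilt.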

#11 Updated by Dan Gillean over 6 years ago

  • Status changed from QA/Review to Verified
  • Sponsored changed from No to Yes
  • Requires documentation deleted (Yes)
