Remove @encodinganalog from <language> tags and rework how EAD imports are determined for languages and scripts of description (and material)
|Assignee:||José Raddaoui Marín||% Done:|
|Target version:||Release 1.4.0|
Step 1: remove the encodinganalog from the language tags
Currently the code in AtoM appears to rely on @encodinganalog attributes used within <language> elements to determine where the language belongs (language or script of material, OR language/script of description.)
<language langcode="eng" encodinganalog="Language of Description">English</language> OR <language scriptcode="Cari" encodinganalog="Script">Carian</language>
However, this is an incorrect use of @encodinganalog, which the EAD Tag Library (2002) defines as "A field or element in another descriptive encoding system to which an EAD element or attribute is comparable." Generally, it is used to record the corresponding field or rule number in another standard, such as ISAD(G), MARC, or RAD. Users whose EAD does not use the attribute in the same peculiar way that AtoM currently does will therefore have problems importing their data into AtoM.
Step 2: Rework how the PHP code reads these elements and determines what goes where
Each language element is already wrapped in a tag that determines whether it is a language of description or of material, so the @encodinganalog should not be necessary, and the PHP code should not be using these attributes in its switch statement.
Language of Material and Script of Material should appear wrapped in <langmaterial> tags, like so:
<langmaterial encodinganalog="3.4.3"> <language langcode="eng">English</language> <language scriptcode="Latn">Latin</language> LANG AND SCRIPT NOTES HERE </langmaterial>
Language of Description and Script of Description should be similar, only wrapped in the <langusage> element:
<langusage> <language langcode="eng">English</language> <language scriptcode="Latn">Latin</language> </langusage>
Thus it should be possible to use the parent tags to determine which language belongs where, instead of relying on information jammed into the @encodinganalog.
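To illustrate the parent-tag approach, here is a minimal sketch in Python (not AtoM's actual PHP importer). The field names in `PARENT_TO_FIELDS` and the function name `classify_languages` are hypothetical, and the sketch assumes a non-namespaced EAD 2002 file with <langusage> under <profiledesc> and <langmaterial> under <did>.

```python
import xml.etree.ElementTree as ET

# Hypothetical target field names; AtoM's real property names may differ.
PARENT_TO_FIELDS = {
    "langmaterial": ("language_of_material", "script_of_material"),
    "langusage": ("language_of_description", "script_of_description"),
}


def classify_languages(ead_xml):
    """Group <language> codes by the element that wraps them.

    @encodinganalog is never consulted; only the parent tag decides.
    """
    root = ET.fromstring(ead_xml.strip())
    result = {name: [] for pair in PARENT_TO_FIELDS.values() for name in pair}
    for parent_tag, (lang_field, script_field) in PARENT_TO_FIELDS.items():
        for parent in root.iter(parent_tag):
            # Only direct <language> children of the wrapper element.
            for language in parent.findall("language"):
                if language.get("langcode"):
                    result[lang_field].append(language.get("langcode"))
                if language.get("scriptcode"):
                    result[script_field].append(language.get("scriptcode"))
    return result


SAMPLE = """
<ead>
  <eadheader><profiledesc>
    <langusage>
      <language langcode="eng">English</language>
      <language scriptcode="Latn">Latin</language>
    </langusage>
  </profiledesc></eadheader>
  <archdesc><did>
    <langmaterial encodinganalog="3.4.3">
      <language langcode="fre">French</language>
      <language scriptcode="Latn">Latin</language>
    </langmaterial>
  </did></archdesc>
</ead>
"""

if __name__ == "__main__":
    print(classify_languages(SAMPLE))
```

The same idea should carry over to the PHP importer: branch on the name of the wrapping element rather than on the attribute value.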
#3 Updated by José Raddaoui Marín about 9 years ago
We have a little problem here. Currently AtoM is exporting three types of languages:
We are already using <langmaterial> inside the <did> element to export these languages, but they are not imported yet.
<langmaterial encodinganalog="3.4.3"> <language langcode="fre">French</language> </langmaterial>
<langusage> <language langcode="eng" encodinganalog="Language Of Description">English</language> </langusage>
AtoM also exports the source culture of the archival description and the user's culture.
<langusage> <language langcode="eng" encodinganalog="Language">English</language> </langusage>
So AtoM is relying on @encodinganalog attributes to determine whether a <language> is the source language of the description or a language of description, not whether it is a language of material or a language of description.
I'm going to fix the export for scripts of material, add the language and script notes to the export, and add the import for the whole <langmaterial> element. But what can we do to distinguish the source language from the languages of description?
#4 Updated by David Juhasz about 9 years ago
From what I can tell from the EAD working group documents <http://www2.archivists.org/groups/technical-subcommittee-on-encoded-archival-description-ead>, the new version of EAD (scheduled for August 2013?) will allow the @xml:lang attribute on all tags. That will allow encoding mixed-language finding aids (hurray!), and mapping i18n data between AtoM and EAD will become much easier.
In the meantime, we've got a bit of a mess with the difference in the way AtoM handles languages/scripts and the way EAD 2002 does.
Here's what I suggest:
Exporting the language and script of material in <langmaterial> isn’t controversial as far as I can tell; we just need to get it importing properly. I'm not sure if we have a separate issue for encoding the "Language and script notes" field, but we should include it in the <langmaterial> element.
<langmaterial encodinganalog="3.4.3"> <language langcode="eng"/> <language langcode="fre"/> <language scriptcode="Latn"/> Correspondence is predominantly English, with some French. </langmaterial>
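As a rough sketch of what importing such an element could look like (in Python rather than AtoM's PHP), the function below separates the codes from the free-text note. The name `parse_langmaterial` and the returned keys are hypothetical. It assumes the note is the text that sits directly inside <langmaterial>, outside any <language> child, so prose that weaves <language> elements into a sentence would lose those words.

```python
import xml.etree.ElementTree as ET


def parse_langmaterial(xml_fragment):
    """Split a <langmaterial> element into language codes, script codes and a note."""
    elem = ET.fromstring(xml_fragment.strip())
    languages, scripts = [], []
    # Text that appears before the first child element.
    note_parts = [elem.text or ""]
    for child in elem:
        if child.tag == "language":
            if child.get("langcode"):
                languages.append(child.get("langcode"))
            if child.get("scriptcode"):
                scripts.append(child.get("scriptcode"))
        # Text that follows this child but is still inside <langmaterial>.
        note_parts.append(child.tail or "")
    # Collapse runs of whitespace left over from the mixed content.
    note = " ".join(" ".join(note_parts).split())
    return {"languages": languages, "scripts": scripts, "note": note}


SAMPLE = (
    '<langmaterial encodinganalog="3.4.3"> <language langcode="eng"/> '
    '<language langcode="fre"/> <language scriptcode="Latn"/> '
    "Correspondence is predominantly English, with some French. </langmaterial>"
)

if __name__ == "__main__":
    print(parse_langmaterial(SAMPLE))
```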
The problem with <langusage> is that we have two definitions for the language(s) of the description:
- The i18n data schema (i.e. information_object_i18n.culture)
- The “Language(s)” list (a “property” row in the database)
The @encodinganalog attribute is defined for another purpose, and using it to distinguish definition 1 from definition 2 creates confusion. Instead, I think we should assume that the first (and only the first) <language> element is the language of the EAD-XML file (i.e. the current culture in AtoM when the EAD-XML is exported), and that all other languages are pulled from the “Language(s)” list (definition 2).
For example, assume a Fonds which has English and French descriptions in AtoM (It has ‘en’ and ‘fr’ i18n rows), and it has “Language(s)”: [English, Spanish, Russian] and “Script(s)”: [Latin, Cyrillic]. If my current culture in AtoM is English, and I export the Fonds as EAD then I propose the representation should be:
<langusage> <language langcode="eng"/> <language langcode="spa"/> <language langcode="rus"/> <language scriptcode="Latn"/> <language scriptcode="Cyrl"/> </langusage>
Note that the ordering is only important for the first <language> element as this will set the “culture” of the description on import. Also notice that we are not including French in the list, even though the original description in AtoM includes a French translation. I think it’s important to respect the conscious choice of the archivist of what to include in the description, even if it doesn’t match the data. However, we must include the current culture when exporting the description, so we can use the correct culture when importing the EAD file into another instance of AtoM.
The <langusage> list should not contain duplicates.
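The export rule proposed here (current culture first, then the “Language(s)” list, then the scripts, with no duplicates) could be sketched as follows. This is an illustration in Python, not AtoM code; `build_langusage` is a hypothetical name, and the sketch assumes the caller has already mapped AtoM cultures such as ‘en’ to three-letter codes such as "eng".

```python
import xml.etree.ElementTree as ET


def build_langusage(current_culture, languages, scripts):
    """Serialize <langusage> with the export culture first and no duplicates."""
    # dict.fromkeys keeps the first occurrence of each code, in order.
    lang_codes = list(dict.fromkeys([current_culture, *languages]))
    script_codes = list(dict.fromkeys(scripts))
    root = ET.Element("langusage")
    for code in lang_codes:
        ET.SubElement(root, "language", langcode=code)
    for code in script_codes:
        ET.SubElement(root, "language", scriptcode=code)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # English is both the current culture and in the list; it appears once.
    print(build_langusage("eng", ["eng", "spa", "rus"], ["Latn", "Cyrl"]))
    # With French as the current culture, French leads the list.
    print(build_langusage("fre", ["eng", "spa", "rus"], ["Latn", "Cyrl"]))
```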
In AtoM, if I now switch my culture to French and export the same Fonds, the EAD <langusage> representation should be:
<langusage> <language langcode="fre"/> <language langcode="eng"/> <language langcode="spa"/> <language langcode="rus"/> <language scriptcode="Latn"/> <language scriptcode="Cyrl"/> </langusage>
French would be the culture (and source_culture) of the description on import, and would be added to the “Language(s)” list in AtoM on import. Although including all of the <langusage><language> elements in the “Language(s)” list in AtoM means this data doesn’t roundtrip from AtoM to AtoM perfectly, I think this convention is necessary to support EAD imported from other (non-AtoM) sources.
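The import side of this convention might look like the sketch below, again in Python for illustration only. `read_langusage` and the returned keys are hypothetical names. It treats the first <language> element that carries a @langcode as the culture and keeps every language, including that first one, in the “Language(s)” list.

```python
import xml.etree.ElementTree as ET


def read_langusage(xml_fragment):
    """Derive culture, language list and script list from a <langusage> element."""
    elem = ET.fromstring(xml_fragment.strip())
    children = elem.findall("language")
    # Keep document order and drop repeated codes.
    languages = list(dict.fromkeys(
        e.get("langcode") for e in children if e.get("langcode")
    ))
    scripts = list(dict.fromkeys(
        e.get("scriptcode") for e in children if e.get("scriptcode")
    ))
    # The first language sets the culture (and source_culture) on import.
    culture = languages[0] if languages else None
    return {"culture": culture, "languages": languages, "scripts": scripts}


SAMPLE = (
    '<langusage> <language langcode="fre"/> <language langcode="eng"/> '
    '<language langcode="spa"/> <language langcode="rus"/> '
    '<language scriptcode="Latn"/> <language scriptcode="Cyrl"/> </langusage>'
)

if __name__ == "__main__":
    print(read_langusage(SAMPLE))
```

Because the culture is also kept in the language list, an export followed by an import adds the export culture to “Language(s)”, which is the imperfect roundtrip described above.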
#7 Updated by José Raddaoui Marín about 9 years ago
- Status changed from New to QA/Review