XML import drops data in <unittitle> if it is wrapped in EAD tags like <corpname>
Assignee: José Raddaoui Marín
Target version: Release 2.0.1
We imported a large batch of EAD XML via the command line. Some of our EAD has tags like <corpname> and <persname> inside <archdesc><did><unittitle>, for example:
<unittitle><corpname>Dalhousie Women's Club</corpname> fonds</unittitle>
In 1.x, the entire title imported with no problem. In 2.0, the entire title is dropped during import. The information object appears as Untitled.
#3 Updated by Creighton Barrett over 8 years ago
We're also noticing lost data with other tags, such as <title>:
<did>
  <unittitle> <title render="italic">Darkling sea: a novel</title> : [typescript draft]</unittitle>
  <langmaterial>
    <language langcode="eng"/>
  </langmaterial>
  <unitid>MS-2-753, Box 11, Folder 7</unitid>
  <container id="cid1030001" type="Box-folder" label="Text">Box 11, Folder 7</container>
  <physdesc>160 pages</physdesc>
  <unitdate>September 19 2007</unitdate>
</did>
#4 Updated by Dan Gillean over 8 years ago
- Status changed from QA/Review to Feedback
Hey Radda, I haven't had a chance to test this issue yet but I've changed it to feedback to ensure that you've seen the comments added by Creighton. Let me know how you are approaching the fix, and if you need me to dig up a full list of EAD tags that are allowable within <unittitle>, to ensure we are not dropping titles with wrapped elements on import.
#5 Updated by Creighton Barrett over 8 years ago
Here is a list of EAD tags that can be nested within <unittitle>:
#PCDATA, abbr, archref, bibref, bibseries, corpname, date, edition, emph, expan, extptr, extref, famname, function, genreform, geogname, imprint, lb, linkgrp, name, num, occupation, persname, ptr, ref, subject, title, unitdate
I don't know how you're approaching the fix but some of those tags seem pretty obscure. If it helps, the tags that are available in the title area of the Archivist Toolkit resources module are:
corpname, date, emph, extref, famname, function, genreform, geogname, name, occupation, persname, ref, subject, title
#6 Updated by José Raddaoui Marín over 8 years ago
The fix I made adds support for the corpname, famname, geogname, name and persname tags under unittitle. What other tags should I look for? Also, I guess that I have to look for the same tags under '<unittitle type="parallel">' and '<unittitle type="otherInfo">', am I right?
Thank you both ;)
#7 Updated by Dan Gillean over 8 years ago
I feel that the most important tags to check for and support would be: date, emph, function, genreform, occupation, ref, subject, title, in addition to those you've already included. Most of them will be edge cases that I don't expect to see too often, but these are still the more likely tags, in my estimation, and since their use is supported in EAD, we should at least make sure that the title will still import properly if nothing else.
Applying the same logic to the two other unittitles would also be great. Thanks Radda!
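For illustration only, here is a minimal Python sketch of treating all three unittitle variants the same way. It is not the AtoM import code: the function name `titles_by_type`, the "main" label for an untyped unittitle, and the parallel and otherInfo sample values are all made up for this example.

```python
import xml.etree.ElementTree as ET

def titles_by_type(did):
    """Map each unittitle's type attribute to the full text of that element."""
    titles = {}
    for unittitle in did.findall("unittitle"):
        # "main" is a hypothetical label for a unittitle without a type attribute.
        kind = unittitle.get("type", "main")
        # itertext() walks every text node, so wrapped parts are kept too.
        titles[kind] = "".join(unittitle.itertext())
    return titles

# Sample data: the parallel and otherInfo values are invented for the sketch.
did = ET.fromstring(
    "<did>"
    "<unittitle><famname>Bushey</famname> Family Fonds</unittitle>"
    '<unittitle type="parallel">Fonds de la famille <famname>Bushey</famname></unittitle>'
    '<unittitle type="otherInfo"><emph>sample</emph> other title information</unittitle>'
    "</did>"
)

for kind, title in titles_by_type(did).items():
    print(kind, "->", title)
```

The point of the sketch is that the extraction step does not need to know which variant it is handling; only the destination field differs.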
#11 Updated by Dan Gillean over 8 years ago
If you can merge the code with 1.x, I think we should. Our main goal with 1.4 was to leave a release that vastly improves the data roundtripping (EAD, EAC, DC, MODS, SKOS, etc.) and since this is very much a part of improving the import/export routine, I'd love to see us pass on this fix to any users who choose to stay with the legacy 1.x branch and update to 1.4. That said, since it's unsponsored, if merging the fix with 1.x is going to be a lot of work, it's not our priority at the moment. So I'll leave it to you to make the call based on how long you think it will take.
#12 Updated by Dan Gillean over 8 years ago
- Status changed from QA/Review to Feedback
Unfortunately, now the import script is only bringing in data contained in the EAD tags that we've whitelisted.
For example, if I import:
<unittitle encodinganalog="3.1.2"><famname>Bushey</famname> Family Fonds</unittitle>
...then I end up with a fonds titled "Bushey" and not "Bushey Family Fonds".
If I import:
<unittitle encodinganalog="1.1B">The Ultimate ISAD <corpname>Artefactual</corpname> Test Fonds</unittitle>
... then the fonds ends up importing with the title "Artefactual" (not "The Ultimate ISAD Artefactual Test Fonds").
Basically, the behaviour we need is for any EAD tags nested inside unittitle to be ignored on import. The entire #PCDATA string should be imported (regardless of which parts are wrapped in other tags), and the nested tags themselves should ideally not affect the import.
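To make the difference concrete, here is a small Python sketch contrasting the two behaviours described above. It is an illustration under assumptions, not the actual import script: `WHITELIST`, `title_from_whitelist` and `title_ignoring_tags` are hypothetical names, and the whitelist simply mirrors the tags mentioned earlier in this ticket.

```python
import xml.etree.ElementTree as ET

# Hypothetical whitelist, mirroring the tags named earlier in this ticket.
WHITELIST = {"corpname", "famname", "geogname", "name", "persname"}

def title_from_whitelist(unittitle):
    """Reported behaviour: only text inside whitelisted child tags is kept."""
    return "".join(child.text or "" for child in unittitle if child.tag in WHITELIST)

def title_ignoring_tags(unittitle):
    """Requested behaviour: join every text node, whatever tag wraps it."""
    return "".join(unittitle.itertext())

elem = ET.fromstring(
    '<unittitle encodinganalog="1.1B">The Ultimate ISAD '
    "<corpname>Artefactual</corpname> Test Fonds</unittitle>"
)

print(title_from_whitelist(elem))  # Artefactual
print(title_ignoring_tags(elem))   # The Ultimate ISAD Artefactual Test Fonds
```

Because the second function never inspects tag names, it also covers the less common tags allowed inside unittitle without needing a list of them.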
#13 Updated by José Raddaoui Marín over 8 years ago
- Status changed from Feedback to QA/Review
So now the import is just ignoring any tag inside the 'unittitle' element.
Please also test an import with paragraphs and line breaks, because at first your examples came out as 'BusheyFamily Fonds' and 'The Ultimate ISADArtefactualTest Fonds' after the import, and I had to remove a call to 'trim()' that I had added in that fix.
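The whitespace problem described above can be reproduced with a short Python sketch. This is only an illustration of the effect, not the code from the fix: both function names are hypothetical, and trimming once on the joined string is an assumption about one reasonable alternative, not a description of what the import now does.

```python
import xml.etree.ElementTree as ET

def title_trim_each_node(unittitle):
    """Strips every text node before joining, so spaces next to tags are lost."""
    return "".join(text.strip() for text in unittitle.itertext())

def title_trim_once(unittitle):
    """Joins the raw text nodes first, then strips only the outer whitespace."""
    return "".join(unittitle.itertext()).strip()

# The line breaks and indentation imitate a pretty-printed EAD file.
elem = ET.fromstring(
    "<unittitle>\n  <famname>Bushey</famname> Family Fonds\n</unittitle>"
)

print(title_trim_each_node(elem))  # BusheyFamily Fonds
print(title_trim_once(elem))       # Bushey Family Fonds
```

The space between the closing famname tag and "Family" belongs to the text node that follows the tag, which is why trimming each node separately glues the words together.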
#14 Updated by Dan Gillean over 8 years ago
Tried, among other variations and combinations, the following:
<unittitle encodinganalog="1.1B"><persname>Anna Ruth</persname> <famname>Cummings</famname> fonds with a Line break in the middle of it and a <title>TITLE too!</title></unittitle>
...and it worked. Marking verified. Thanks Radda.