Bug #5864

XML import drops data in <unittitle> if it is wrapped in EAD tags like <corpname>

Added by Creighton Barrett over 8 years ago. Updated over 8 years ago.

Status: Verified
Start date: 10/24/2013
Priority: Medium
Due date:
Assignee: José Raddaoui Marín
% Done: 100%
Category: Import/Export
Target version: Release 2.0.1
Google Code Legacy ID:
Tested version:
Sponsored: No
Requires documentation:

Description

We imported a large batch of EAD XML via the command line. Some of our EAD has tags like <corpname> and <persname> inside <archdesc><did><unittitle>, for example:

<archdesc level="fonds">
<did>
<unittitle>
<corpname>Dalhousie Women's Club</corpname> fonds</unittitle>

In 1.x, the entire title imported with no problem. In 2.0, the entire title is dropped during import. The information object appears as Untitled.

History

#1 Updated by David Juhasz over 8 years ago

  • Assignee changed from Mike Gale to José Raddaoui Marín

#2 Updated by José Raddaoui Marín over 8 years ago

  • Status changed from New to QA/Review
  • % Done changed from 0 to 100

#3 Updated by Creighton Barrett over 8 years ago

We're also noticing lost data with other tags, such as <title>:

<did>
   <unittitle>
      <title render="italic">Darkling sea: a novel</title> : [typescript draft]</unittitle>
   <langmaterial>
      <language langcode="eng"/>
   </langmaterial>
   <unitid>MS-2-753, Box 11, Folder 7</unitid>
   <container id="cid1030001" type="Box-folder" label="Text">Box 11, Folder 7</container>
   <physdesc>160 pages</physdesc>
   <unitdate>September 19 2007</unitdate>
</did>

#4 Updated by Dan Gillean over 8 years ago

  • Status changed from QA/Review to Feedback

Hey Radda, I haven't had a chance to test this issue yet but I've changed it to feedback to ensure that you've seen the comments added by Creighton. Let me know how you are approaching the fix, and if you need me to dig up a full list of EAD tags that are allowable within <unittitle>, to ensure we are not dropping titles with wrapped elements on import.

#5 Updated by Creighton Barrett over 8 years ago

Here is a list of EAD tags that can be nested within <unittitle>:

#PCDATA, abbr, archref, bibref, bibseries, corpname, date, edition, emph, expan, extptr, extref, famname, function, genreform, geogname, imprint, lb, linkgrp, name, num, occupation, persname, ptr, ref, subject, title, unitdate

http://www.loc.gov/ead/tglib/elements/unittitle.html

I don't know how you're approaching the fix, but some of those tags seem pretty obscure. If it helps, the tags available in the title area of the Archivists' Toolkit resources module are:

corpname, date, emph, extref, famname, function, genreform, geogname, name, occupation, persname, ref, subject, title
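
To gauge which of these tags actually occur in a batch before importing it, something like the following Python sketch could help. It is not part of AtoM: the names `ALLOWED_IN_UNITTITLE`, `nested_tags_in_unittitles` and `unexpected_tags` are hypothetical, and it assumes the EAD has no XML namespace, as in the DTD-style samples in this issue.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Child elements permitted inside <unittitle>, copied from the list above
# (#PCDATA is character data, not an element, so it is left out).
ALLOWED_IN_UNITTITLE = {
    "abbr", "archref", "bibref", "bibseries", "corpname", "date", "edition",
    "emph", "expan", "extptr", "extref", "famname", "function", "genreform",
    "geogname", "imprint", "lb", "linkgrp", "name", "num", "occupation",
    "persname", "ptr", "ref", "subject", "title", "unitdate",
}


def nested_tags_in_unittitles(ead_xml):
    """Count the direct child elements of every <unittitle> in the document."""
    root = ET.fromstring(ead_xml)
    counts = Counter()
    for unittitle in root.iter("unittitle"):
        for child in unittitle:
            counts[child.tag] += 1
    return counts


def unexpected_tags(counts):
    """Return the counted tags that are not in the permitted list, sorted."""
    return sorted(tag for tag in counts if tag not in ALLOWED_IN_UNITTITLE)


if __name__ == "__main__":
    sample = (
        '<archdesc level="fonds"><did><unittitle>'
        "<corpname>Dalhousie Women's Club</corpname> fonds"
        "</unittitle></did></archdesc>"
    )
    found = nested_tags_in_unittitles(sample)
    print(dict(found))             # prints: {'corpname': 1}
    print(unexpected_tags(found))  # prints: []
```

Running this over a whole export would show which nested tags matter in practice, which may help decide how much of the full list needs testing.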

#6 Updated by José Raddaoui Marín over 8 years ago

Hi,

The fix I made adds support for the corpname, famname, geogname, name and persname tags under unittitle. What other tags should I look for? Also, I guess I have to look for the same tags under '<unittitle type="parallel">' and '<unittitle type="otherInfo">', am I right?

Thank you both ;)

#7 Updated by Dan Gillean over 8 years ago

I feel that the most important tags to check for and support would be: date, emph, function, genreform, occupation, ref, subject, title, in addition to those you've already included. Most of them will be edge cases that I don't expect to see too often, but these are still the more likely tags, in my estimation, and since their use is supported in EAD, we should at least make sure that the title will still import properly if nothing else.

Applying the same logic to the two other unittitles would also be great. Thanks Radda!

#8 Updated by Creighton Barrett over 8 years ago

Thanks guys!

#9 Updated by José Raddaoui Marín over 8 years ago

  • Status changed from Feedback to QA/Review

AtoM|commit: 7778e8d4e13033f8262b49533b7c8e5bc22fe3b6

#10 Updated by José Raddaoui Marín over 8 years ago

OK, this is done in the 2.x branch. But I'm wondering if this fix should be in the 1.4 version too. What do you think, Dan?

#11 Updated by Dan Gillean over 8 years ago

If you can merge the code with 1.x, I think we should. Our main goal with 1.4 was to leave a release that vastly improves the data roundtripping (EAD, EAC, DC, MODS, SKOS, etc.) and since this is very much a part of improving the import/export routine, I'd love to see us pass on this fix to any users who choose to stay with the legacy 1.x branch and update to 1.4. That said, since it's unsponsored, if merging the fix with 1.x is going to be a lot of work, it's not our priority at the moment. So I'll leave it to you to make the call based on how long you think it will take.

#12 Updated by Dan Gillean over 8 years ago

  • Status changed from QA/Review to Feedback

Hi Radda,

Unfortunately, now the import script is only bringing in data contained in the EAD tags that we've whitelisted.

For example, if I import:

<unittitle encodinganalog="3.1.2"><famname>Bushey</famname> Family Fonds</unittitle>

...then I end up with a fonds titled "Bushey" and not "Bushey Family Fonds".

If I import:

<unittitle encodinganalog="1.1B">The Ultimate ISAD <corpname>Artefactual</corpname> Test Fonds</unittitle>

... then the fonds ends up importing with the title "Artefactual" (not "The Ultimate ISAD Artefactual Test Fonds").

Basically, the behaviour we need is for any EAD tags nested inside unittitle to be ignored on import. The entire #PCDATA string should be imported (regardless of which parts are wrapped in other tags), and the nested tags themselves should ideally not affect the import.
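
As an illustration of the behaviour described above (and not AtoM's actual import code, which is not shown in this issue), here is a minimal Python sketch that collects all character data inside a unittitle while ignoring the nested tags. The function name `unittitle_text` is hypothetical.

```python
import xml.etree.ElementTree as ET


def unittitle_text(xml_fragment):
    """Return every piece of character data inside the element, in document order."""
    element = ET.fromstring(xml_fragment)
    # itertext() yields the element's own text plus the text and tail of each
    # descendant, so nested tags such as <famname> contribute only their content.
    return "".join(element.itertext())


if __name__ == "__main__":
    print(unittitle_text(
        '<unittitle encodinganalog="3.1.2"><famname>Bushey</famname> Family Fonds</unittitle>'
    ))  # prints: Bushey Family Fonds
    print(unittitle_text(
        '<unittitle encodinganalog="1.1B">The Ultimate ISAD '
        '<corpname>Artefactual</corpname> Test Fonds</unittitle>'
    ))  # prints: The Ultimate ISAD Artefactual Test Fonds
```

With this approach no whitelist of nested tags is needed, since any child element is flattened the same way.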

#13 Updated by José Raddaoui Marín over 8 years ago

  • Status changed from Feedback to QA/Review

Thanks Dan,

So now the import just ignores any tag inside the 'unittitle' element.
Please also test an import with paragraphs and line breaks, because at first your examples came out as 'BusheyFamily Fonds' and 'The Ultimate ISADArtefactualTest Fonds' after the import, and I had to remove a call to 'trim()' that I had added in that fix.
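
The effect of trimming each text fragment can be reproduced outside AtoM. The Python sketch below is only an illustration under assumptions: `join_fragments` and `normalized_title` are hypothetical names, `str.strip()` stands in for PHP's `trim()`, and collapsing whitespace after joining is one possible way to handle line breaks, not necessarily what the fix does.

```python
import xml.etree.ElementTree as ET


def join_fragments(xml_fragment, trim_each):
    """Concatenate all text inside the element, optionally trimming each piece."""
    element = ET.fromstring(xml_fragment)
    fragments = element.itertext()
    if trim_each:
        # Trimming per fragment removes the spaces that separate the words
        # on either side of a nested tag.
        fragments = (fragment.strip() for fragment in fragments)
    return "".join(fragments)


def normalized_title(xml_fragment):
    """Join untrimmed fragments, then collapse whitespace runs (including line breaks)."""
    return " ".join(join_fragments(xml_fragment, trim_each=False).split())


if __name__ == "__main__":
    bushey = '<unittitle encodinganalog="3.1.2"><famname>Bushey</famname> Family Fonds</unittitle>'
    print(join_fragments(bushey, trim_each=True))   # prints: BusheyFamily Fonds
    print(join_fragments(bushey, trim_each=False))  # prints: Bushey Family Fonds
    multiline = "<unittitle><persname>Anna Ruth</persname> fonds with a \n\n      line break</unittitle>"
    print(normalized_title(multiline))  # prints: Anna Ruth fonds with a line break
```

Normalizing once on the joined string, rather than on each fragment, keeps the spaces between words while still tidying indentation and line breaks.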

#14 Updated by Dan Gillean over 8 years ago

Tried, among other variations and combinations, the following:

<unittitle encodinganalog="1.1B"><persname>Anna Ruth</persname> <famname>Cummings</famname> fonds with a 

      Line break in the middle of it and a <title>TITLE too!</title></unittitle>

...and it worked. Marking verified. Thanks Radda.

#15 Updated by Dan Gillean over 8 years ago

  • Status changed from QA/Review to Verified
