Bug #6221

EAD <processinfo> is dropped during import

Added by Dan Gillean over 8 years ago. Updated about 7 years ago.

Status: Verified
Start date: 01/17/2014
Priority: Medium
Due date:
Assignee: Mike Gale
% Done: 0%
Category: EAD
Target version: Release 2.2.0
Google Code Legacy ID:
Tested version: 2.0.0, 2.0.1
Sponsored: No
Requires documentation:

Description

See: http://www.loc.gov/ead/tglib/elements/processinfo.html

"The <processinfo> element is comparable to ISAD data element 3.7.1 and MARC field 583. A <date> within a <processinfo><p> element is comparable to ISAD data element 3.7.3."

We will parse <processinfo> to map to ISAD 3.7.1 Archivist's note, regardless of whether or not there is a date. However, we should make sure that most of the possible nested elements in <processinfo> will import correctly.

<processinfo> may contain
address, blockquote, chronlist, head, list, note, p, processinfo, table

We should especially make sure that data nested in a <head>, <note>, <p>, and <address> tag inside of processinfo is not dropped.

This should also be mapped to DACS 8.1.5 Archivist and Date.

clara-fritz-collection-processInfo-test;ead.xml (5.47 KB) Dan Gillean, 09/12/2014 05:07 PM


Related issues

Related to Access to Memory (AtoM) - Bug #8385: AtoM EAD is not compliant with DTD, causes DTD warnings o... Verified 05/04/2015

History

#1 Updated by Mike Gale about 8 years ago

Hey Radda, I don't know how important it is for AtoM to parse child tags of <processinfo>, but in https://github.com/artefactual/atom/blob/2.x/apps/qubit/modules/object/config/import/ead.yml#L414 there are stray ']' characters in the XPath. Also, even with those ']'s removed, that XPath doesn't seem to work for a basic case:

<processinfo>
    <p>Testing</p>
    <p>123</p>
</processinfo>

Which is what my client's data had. Parsing will get confusing if we have to parse <note>s in there too. We should ask Dan for clarification maybe? At any rate, I just changed the XPath to 'processinfo' only and it seems to be picking that up. I'll just put it in my client repo for now (you can ask if you want to see it)
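
For anyone who wants to reproduce the basic case outside AtoM, here is a minimal sketch in Python. It is not AtoM's importer (which is PHP driven by ead.yml); the function name `processinfo_notes` and the `<archdesc>` wrapper are assumptions for illustration, and it assumes a DTD-style EAD file with no XML namespace. It selects each <processinfo> and joins the text of its direct <p> children with line breaks.

```python
import xml.etree.ElementTree as ET

# The basic case from this comment, wrapped in a hypothetical parent element.
SAMPLE = """<archdesc>
  <processinfo>
    <p>Testing</p>
    <p>123</p>
  </processinfo>
</archdesc>"""


def processinfo_notes(xml_text):
    """Return one note string per <processinfo>, joining its <p> children with newlines."""
    root = ET.fromstring(xml_text)
    notes = []
    for info in root.iter("processinfo"):
        # Only the leading text of each direct <p> child is read here.
        paragraphs = [(p.text or "").strip() for p in info.findall("p")]
        notes.append("\n".join(t for t in paragraphs if t))
    return notes


print(processinfo_notes(SAMPLE))
```

Reading only `p.text` keeps the sketch short, but it ignores anything nested inside a <p>, so it only covers the simple case shown above.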

#2 Updated by Mike Gale about 8 years ago

  • Assignee changed from José Raddaoui Marín to Mike Gale

Mike G will merge the bug fix from a client's repo to the general repo.

#3 Updated by Mike Gale about 8 years ago

  • Status changed from New to QA/Review
  • Assignee changed from Mike Gale to Dan Gillean

Pushed a fix to 2.x

#4 Updated by Dan Gillean about 8 years ago

  • Status changed from QA/Review to Feedback

There is some complexity here to consider. Adding notes for anyone reviewing this in the future. The only outstanding work, from my perspective, is to tweak the export behavior. See TO DO at bottom if you want to skip the TL;DR

Point one: mappings

EAD Tag Library 2002 says the following:

The <processinfo> element is comparable to ISAD(G) data element 3.7.1 and MARC field 583. A <date> within a <processinfo><p> element is comparable to ISAD(G) data element 3.7.3.

ISAD 3.7.1 is Archivist's note.
ISAD 3.7.3 is Date(s) of description - i.e. in our template, Dates of creation, revision, and deletion.

Previously, we have been mapping Archivist's note to the EAD <author> element. This still works and roundtrips - because the field is repeatable, when <author> and <processinfo> are both present, both can still import successfully. The mapping to author isn't great, but there is no direct ISAD/RAD equivalent for this field, so I propose we leave it as is.

However, on export, I think we should start using the recommended mapping, to <processinfo>.

A second point: nested tags

The following data was imported as a test:

<processinfo>
        <p>This data will be lost, I bet.            
            <date>Dates of creation, revision, and deletion</date>
        Yup, I sure do bet.
        </p>
        <p>This data will not be lost, I think. I hope. This is process info in a P tag.</p>
        <p>There is also a second paragraph tag. Note that in ISAD template we have also been mapping Archivist's notes to author</p>
    </processinfo>

Note that the first part ("This data will be lost..."), and its follow-up after the date element, was in fact lost on import, as it contained mixed content - the date element was nested inside. The "Dates of creation, revision, and deletion" data was still mapped to the 3.7.3 (Dates of creation, revision, deletion) field in the template. Similarly, the multiple paragraph elements worked fine. I also included an <author> element in the import. Where 2 paragraph tags appear in one <processinfo> element, they are imported into a single note separated by line breaks. Where multiple processinfo or author elements appear, they will import as separate Archivist's notes.

I think, given the suggested mappings and the limitations of XPath, that this is the best outcome we can hope for.
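
To illustrate why the mixed-content paragraph is awkward, here is a small Python sketch. It is illustration only and says nothing about how AtoM's importer is implemented. It shows how a typical XML API splits the first <p> from the test data into three pieces: the text before <date>, the <date> content, and the text after </date>.

```python
import xml.etree.ElementTree as ET

# The mixed-content paragraph from the test data, with whitespace trimmed.
P_XML = (
    "<p>This data will be lost, I bet."
    "<date>Dates of creation, revision, and deletion</date>"
    "Yup, I sure do bet.</p>"
)

p = ET.fromstring(P_XML)
date = p.find("date")

leading_text = p.text      # text before the first child element
date_text = date.text      # content of the nested <date>
trailing_text = date.tail  # text after </date>, still inside <p>

# Everything inside <p>, nested elements included, in document order.
full_text = "".join(p.itertext())

# Only the prose that belongs to <p> itself, skipping the <date> content.
prose_only = " ".join(
    part.strip() for part in (leading_text, trailing_text) if part and part.strip()
)

print(prose_only)
```

A mapping that reads the <date> for one field and the surrounding prose for another has to handle these pieces separately, which is the complexity described above.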

TO DO
  • Change export behavior to use <processinfo><p> for Archivist's notes, and <processinfo><date> for Dates of creation, revision, and deletion (separate processinfo elements, so they will roundtrip back into the appropriate fields).

The mapping on import of <author> to Archivist's notes can stay in place, but we should use the recommended mapping on export. This should also ensure that anyone with an older version of AtoM will still be able to roundtrip their data into a newer version with these updated mappings.

#5 Updated by Dan Gillean about 8 years ago

  • Assignee changed from Dan Gillean to Mike Gale

#6 Updated by Dan Gillean almost 8 years ago

  • Target version changed from Release 2.0.2 to Release 2.1.0

This issue is so close to being done - it'd be great if we could get that last TO DO above done in time to verify it for 2.1.

#7 Updated by Mike Gale almost 8 years ago

I don't know if I just am not setting the right fields, but I was trying to get these tags to show up when exporting http://2x.test.artefactual.com/clara-fritz-collection and I didn't even see any :/

So I don't know if there was a regression or what. Is there a URL to a description you can find that exports these tags? thanks

#8 Updated by Dan Gillean almost 8 years ago

  • File clara-fritz-collection;ead.xml added
  • Tested version 2.0.0, 2.0.1 added

I had to update the clara fritz collection to have the right fields filled in. Then I added some more to the EAD for testing, deleted it, and then reimported. See it after the import described below, here: http://2x.test.artefactual.com/clara-fritz-collection

Actually, the only thing that appears to be getting dropped on import right now is the Dates of creation, revision, and deletion - e.g:

<processinfo>
   <date>Dates of creation, revision, deletion</date>
</processinfo>

This is what we are testing, in the attached EAD file:

on lines 7-10 (testing the author tag):

<titlestmt>
   <titleproper encodinganalog="title">Clara Fritz collection</titleproper>
   <author encodinganalog="creator">this is archivists notes that were in an author EAD tag</author>
</titlestmt>

on lines 98-102:

<processinfo>
  <date>This is dates of creation, revision, and deletion - in processinfo then date EAD tags</date>
  <p>This is a test and should go to Archivist notes - was in processinfo then paragraph tag</p>
  <p>This should go to archivist notes too - was also in processinfo then paragraph tag</p>
</processinfo>

The 2 paragraph elements in process info, as well as the author element, imported correctly. So if we can just fix the processinfo-date import, I will verify this!

#9 Updated by Jesús García Crespo over 7 years ago

  • Target version changed from Release 2.1.0 to Release 2.2.0

#10 Updated by Jesús García Crespo over 7 years ago

  • Target version changed from Release 2.2.0 to Release 2.1.0

#11 Updated by Jesús García Crespo over 7 years ago

  • Status changed from Feedback to QA/Review
  • Assignee changed from Mike Gale to Dan Gillean

#12 Updated by Dan Gillean over 7 years ago

  • File deleted (clara-fritz-collection;ead.xml)

#13 Updated by Dan Gillean over 7 years ago

Ok, re-tested, and for whatever reason, we are still not getting the data from the nested <date> element into the Dates of Creation, revision, and deletion field.

In the attached sample, on lines 98-102:

<processinfo>
  <date>This is dates of creation, revision, and deletion - in processinfo then date EAD tags</date>
  <p>This is a test and should go to Archivist notes - was in processinfo then paragraph tag</p>
  <p>This should go to archivist notes too - was also in processinfo then paragraph tag</p>
</processinfo>

The <processinfo><date> data does not seem to import. I checked in the RAD, DACS, and ISAD templates, just to make sure it isn't a crosswalking issue between the templates. No dice.

If we get that working this issue can close!

#14 Updated by Jesús García Crespo over 7 years ago

  • Target version changed from Release 2.1.0 to Release 2.1.1

#15 Updated by Jesús García Crespo over 7 years ago

  • Assignee changed from Jesús García Crespo to Mike Gale

#16 Updated by Tim Hutchinson over 7 years ago

Re <processinfo><date>, I ran into a roundtripping error on that one, so found this issue. <date> is not allowed in <processinfo>, but <processinfo><p><date> would be valid.
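
A quick way to spot this in a file before importing is sketched below in Python. It is a rough check, not DTD validation: the allowed-children set is copied from the list in this ticket's description, the function name `misplaced_children` is hypothetical, the "2014" values are placeholder data, and the file is assumed to have no XML namespace.

```python
import xml.etree.ElementTree as ET

# Child elements the ticket description lists for <processinfo>.
ALLOWED_CHILDREN = {
    "address", "blockquote", "chronlist", "head", "list",
    "note", "p", "processinfo", "table",
}


def misplaced_children(xml_text):
    """Return tags found directly under any <processinfo> that are not in ALLOWED_CHILDREN."""
    root = ET.fromstring(xml_text)
    found = []
    # iter() also visits the root element itself and any nested <processinfo>.
    for info in root.iter("processinfo"):
        for child in info:
            if child.tag not in ALLOWED_CHILDREN:
                found.append(child.tag)
    return found


# Placeholder samples: <date> directly under <processinfo>, then wrapped in <p>.
INVALID = "<processinfo><date>2014</date><p>Note</p></processinfo>"
VALID = "<processinfo><p><date>2014</date></p><p>Note</p></processinfo>"

print(misplaced_children(INVALID), misplaced_children(VALID))
```

Running a real file through a validating parser against the EAD DTD remains the authoritative check; this only flags the specific nesting problem discussed here.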

#17 Updated by Dan Gillean over 7 years ago

  • Target version changed from Release 2.1.1 to Release 2.2.0

Bahhhhhhh you're right, Tim! I even put that in the issue ticket description, and then it got lost somewhere along the way. Hopefully we can come back to this and tidy up these remaining issues. There are going to be a bunch of EAD improvements in 2.2.

#18 Updated by Dan Gillean about 7 years ago

  • Related to Bug #8385: AtoM EAD is not compliant with DTD, causes DTD warnings on roundtrip added

#19 Updated by Dan Gillean about 7 years ago

  • Status changed from Feedback to Verified

Ok, after the fixes in #8385, going to mark this as verified. Tested with the following:

                <processinfo>
                  <p>
                    <date>Dates of creation, revision and deletion.</date>
                  </p>
                  <p>This is a test and should go to Archivist notes - was in processinfo then paragraph tag</p>
                  <p>This should go to archivist notes too - was also in processinfo then paragraph tag</p>
                </processinfo>

As well as with an archivist's note in the <author> element of the EAD header, since that is also a mapping.

Result:

The 2 paragraph elements in process info end up in a single archivist's note element, separated by a line break. The one in the <author> element of the EAD header ends up in a separate archivist's note. With the EAD corrected to use <processinfo><p><date>, the dates of creation, revision, and deletion roundtrip fine.
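
The outcome described above can be mimicked with the Python sketch below. It is not AtoM code: the function name `split_processinfo` and the returned keys are made up for illustration, and it assumes that a <p> containing a <date> contributes only the date. It routes <processinfo><p><date> to one field and joins the remaining <p> text into one note.

```python
import xml.etree.ElementTree as ET

# Shortened version of the test data used for verification.
SAMPLE = """<processinfo>
  <p>
    <date>Dates of creation, revision and deletion.</date>
  </p>
  <p>This is a test and should go to Archivist notes</p>
  <p>This should go to archivist notes too</p>
</processinfo>"""


def split_processinfo(xml_text):
    """Split one <processinfo> into a dates string and a single archivist's note."""
    info = ET.fromstring(xml_text)
    dates, notes = [], []
    for p in info.findall("p"):
        nested_dates = p.findall("date")
        if nested_dates:
            # Assumption: a paragraph with a <date> only feeds the dates field.
            dates.extend((d.text or "").strip() for d in nested_dates)
        else:
            text = "".join(p.itertext()).strip()
            if text:
                notes.append(text)
    # Paragraphs are joined with line breaks into one note.
    return {"dates": "\n".join(dates), "archivist_note": "\n".join(notes)}


print(split_processinfo(SAMPLE))
```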
