Bug #7171

Extent tags in EAD with ampersands in them crash XML export

Added by Mike Gale over 7 years ago. Updated over 7 years ago.

Status:VerifiedStart date:08/29/2014
Priority:MediumDue date:
Assignee:Dan Gillean% Done:

0%

Category:EAD
Target version:Release 2.1.0
Google Code Legacy ID: Tested version:2.0.0, 2.0.1, 2.1
Sponsored:No Requires documentation:

Description

If the extent tag has a & in it, it'll crap out half way through the export and just spew XML to stdout.

I presume other "bad" characters would crash this also. The solution was to call esc_specialchars() on the extent data before parsing it for display.

7171-fix.patch Magnifier (1.13 KB) Mike Gale, 09/13/2014 11:48 PM


Related issues

Related to Access to Memory (AtoM) - Bug #6949: Ampersand in identifier breaks EAD export Verified 07/08/2014
Related to Access to Memory (AtoM) - Bug #5588: Unencoded ampersand in ISDIAH record (authorized form of ... Verified 09/14/2013
Related to Access to Memory (AtoM) - Bug #3809: Parsing error in MODS and DC when title contains an amper... Verified
Related to Access to Memory (AtoM) - Bug #5984: Parsing error in DC export when institution name contains... Verified 11/16/2013

History

#1 Updated by Mike Gale over 7 years ago

  • Status changed from New to QA/Review
  • Assignee changed from Mike Gale to Dan Gillean

Committed a fix to the general AtoM repo: 9937f0ed044565a25a996fffbeb4424c1ce0c0b8

#2 Updated by Dan Gillean over 7 years ago

  • Status changed from QA/Review to New
  • Assignee changed from Dan Gillean to Mike Gale
  • Tested version 2.0.0, 2.0.1, 2.1 added

I've related this to a bunch of other issues - notably #6949 for the moment - but as you can see, this is affecting all of our metadata exchange formats. It would be nice to come up with a more global solution for escaping these and angle brackets ( < and > will also break export) and/or replacing them with encoding values so they still display properly, etc.

#3 Updated by Dan Gillean over 7 years ago

  • Status changed from New to QA/Review
  • Assignee changed from Mike Gale to Dan Gillean

oops, simultaneous update

#4 Updated by Dan Gillean over 7 years ago

  • Status changed from QA/Review to Feedback

Welp,

In my local testing, the ampersand is not being escaped so much as just removed entirely. Tried to import the following:

<physdesc>
  <extent encodinganalog="3.1.5">22 pp of textual records & other stuff. </extent>
</physdesc>

What I got, in both view and edit mode, was the extent with the ampersand completely removed. This is perhaps better than breaking the import entirely, but the user might not even notice that the character has been removed, making the sentence confusing and incomplete.

I don't know what's possible, but ideally, an ampersand would just be replaced with the html code for adding one, e.g.

&amp;

Testing with brackets found that the opening < bracket was removed, while the closing bracket > was successfully imported.

Warnings were thrown in all cases despite the successful import - but these warnings aren't verbose or clear enough for an end-user to understand that special characters were removed.

Willing to let this in as is, but would like to see this improved in the future if possible.

#5 Updated by Dan Gillean over 7 years ago

Another quick update:

An unintended consequence of escaping this field is that it affects the nested tags we were using to create struction for multiple fields (such as extent) nested in the physdesc element - the <dt> and <dd> tags are also being escaped:

Haven't yet tried to roundtrip this file, but the results might be unpredictable.

#6 Updated by Mike Gale over 7 years ago

Hey Dan, to clarify this fix was for exporting EAD was being broken by &s. Not importing.

I think this may be a separate bug unrelated to my fix?

#7 Updated by Mike Gale over 7 years ago

  • Status changed from Feedback to In progress
  • Assignee changed from Dan Gillean to Mike Gale

Possible regression with the <>s being escaped for dd and dt tags in the extent field

#8 Updated by Dan Gillean over 7 years ago

  • Status changed from In progress to Feedback

Not sure if this is caused by your recent fix on this issue as well or not, but strangely, <eadid> element tags are being escaped:

Don't see this happening with any other elements. Not sure why this one is being affected.

#9 Updated by Mike Gale over 7 years ago

I'm pretty sure I broke the eadid tag when I fixed #6949. It was dumb of me to escape the entire tag (which of course has escapable characters like < and > in it) instead of just the problem identifier. Now we can both say we did something silly this ticket, Dan :P

Working on a fix now.

#10 Updated by Mike Gale over 7 years ago

Attaching a GIT patch since my SSH keys aren't working on this new work VM I setup at home..

#11 Updated by Jesús García Crespo over 7 years ago

  • Status changed from Feedback to QA/Review
  • Assignee changed from Mike Gale to Dan Gillean

Ok, I've just applied your patch.

#12 Updated by Sarah Romkey over 7 years ago

  • Status changed from QA/Review to Verified

Verified! No more crapping out.

Also available in: Atom PDF