Task #9448

Improve character escaping behavior during editing and roundtripping

Added by Dan Gillean over 6 years ago. Updated over 6 years ago.

Status: New
Start date: 02/16/2016
Priority: Medium
Due date:
Assignee: Mike Gale
% Done: 0%
Category: Form validation
Target version: -
Google Code Legacy ID:
Tested version: 2.2, 2.3
Sponsored: No
Requires documentation:

Description

In AtoM 2.2, we introduced a number of changes, including:

  • #7647 - Escape HTML entities "<", ">", '"', "&" to prevent XSS exploits
  • #7587 - Unescaped ampersand in creator name causes description export errors
  • #8426 - Physical description field does not include character escaping
  • #8555 - EAD export fails when certain fields have ampersands

And likely more. #7647 was a major feature for security reasons; the others have to do with character escaping so that EAD XML can still be roundtripped.

However, a recent thread in the user forum has pointed out some issues. For example:

  • Create a description
  • In a text field, manually add a character escape for an ampersand, e.g.
    black &amp; white
    
  • Save the description. At this point the character escaping is literal in the user interface - the content does not display as "&" but as
    "&amp;"
  • Re-enter edit mode, edit a different field, and then save

Resulting issue:

  • Escaped ampersand has been transformed into a literal ampersand character

Issue 2

  • Import the attached escaping-test-fonds;ead.xml file. It's an older EAD file, so you may get some warnings, but it should import.
  • Each field in the EAD file has been populated with the following, for testing:
    Escaping test fonds ! & @ # $ % ^ &amp; * ( ) &lt; &gt; ? " ' ; : ~ ` { _ } [ ]
    
  • Note that this includes both a literal and an escaped ampersand, as well as escaped greater-than and less-than symbols

Resulting error:

  • On import, both the escaped and the literal ampersands, as well as the escaped greater-than and less-than symbols, are stripped out

Note: this may be an older bug we never successfully followed up on. See for example: https://projects.artefactual.com/issues/7171#note-4
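
For comparison, here is a minimal sketch of how a strict XML parser treats the two kinds of ampersand. This is not AtoM's import code. The helper name read_root_text is hypothetical, and the sketch assumes the test file really contains a bare "&" as displayed above. It uses PHP's DOM extension only to show that escaped characters are decoded rather than dropped, and that a bare ampersand is a well-formedness error that could be reported to the user:

```php
<?php
// Hypothetical helper for illustration; not part of AtoM.
// Returns the decoded text of the root element, or null when the
// input is not well-formed XML.
function read_root_text(string $xml): ?string
{
    // Collect parse errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $ok = $doc->loadXML($xml);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $ok ? $doc->documentElement->textContent : null;
}

// Escaped characters come back as the characters they stand for.
var_dump(read_root_text('<title>a &amp; b &lt; c</title>')); // decoded text: a & b < c

// A bare ampersand is not well-formed XML, so strict parsing fails.
var_dump(read_root_text('<title>a & b</title>'));            // NULL
```

A null result here is the point at which an importer could warn the user instead of silently dropping characters.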


Ideal behaviors:

I'm not sure, but in the user forum, it is recommended that any substitutions are only made at the last minute, during export:

- If you replace & with &amp; you can solve this issue for a short period. If you edit some other field of the same description where you have &amp; it will become just &; (the final solution should be to check for special characters just before exporting and to replace each one with its "code", for example using:

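function replace_special_characters($str)
{
    // Escape "&" first. If it ran after the other replacements, the "&"
    // in the entities they produce would be escaped again
    // (e.g. " -> &quot; -> &amp;quot;).
    $str = str_replace('&', '&amp;', $str);
    $str = str_replace('"', '&quot;', $str);
    $str = str_replace("'", '&apos;', $str);
    $str = str_replace('<', '&lt;', $str);
    $str = str_replace('>', '&gt;', $str);

    return $str;
}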

)
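
As an alternative to the hand-written function above, PHP's built-in htmlspecialchars() can do the same export-time substitution in one call. A minimal sketch, assuming the stored value is UTF-8 and is escaped only at the moment it is written into the XML; the helper name escape_for_xml_export is hypothetical and not part of AtoM:

```php
<?php
// Hypothetical helper for illustration; not part of AtoM.
function escape_for_xml_export(string $value): string
{
    // ENT_XML1 restricts output to the five predefined XML entities.
    // ENT_QUOTES also escapes single quotes (as &apos; under ENT_XML1).
    return htmlspecialchars($value, ENT_QUOTES | ENT_XML1, 'UTF-8');
}

echo escape_for_xml_export('black & white'), PHP_EOL;     // black &amp; white
echo escape_for_xml_export('black &amp; white'), PHP_EOL; // black &amp;amp; white
```

Because every character is handled in a single pass, the order of replacements is not a concern. Note that text the user typed as "&amp;" is escaped again, so an XML parser reading the export gets back exactly what was stored.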

We should consider if, and how, we display literal and escaped elements, given the overall escaping introduced for security in AtoM 2.2. We should definitely not strip out elements during import without a warning to the user. Ideally, they should just be replaced with the corresponding character escapes, but I guess that's only useful if they display properly in the user interface.

escaping-test-fonds;ead.xml (9.4 KB) Dan Gillean, 02/16/2016 02:17 PM

History

#1 Updated by Dan Gillean over 6 years ago

  • Description updated (diff)

#2 Updated by Dan Gillean over 6 years ago

Useful comments from a user on the related user forum thread:

Eventually, the EAD export has to be valid XML. We encountered invalid XML when characters were not escaped properly. This bothers other services, which expect valid XML :) and thus reject the invalid export for further processing.

  • So if one enters "&amp" via the user interface -> one should get back "&amp" in the user interface - and "&amp" in the XML export.
  • If one enters "&" via the user interface -> one should get back "&" in the user interface - and "&" in the XML export.

There are fairly standard methods for escaping dangerous HTML elements submitted in form fields and for ensuring proper values in the database (security issue solved).
For Java apps, I was pretty happy with the JSoup library; for PHP one can surely find similar libraries.
You might also consider a more restrictive workflow: reject any ingest (save) if the input contains insecure tags, ask users to correct it, and do not change anything automatically.
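
The "get back what you entered" expectation for the user interface can be sketched as follows. This is an illustration under stated assumptions, not AtoM's form handling: the value is assumed to be stored exactly as typed, both helper names are hypothetical, and simulate_browser_submit is only a stand-in for a browser decoding the markup and posting the field back:

```php
<?php
// Hypothetical helpers for illustration; these are not AtoM functions.

// Escape a stored value for output inside HTML (text node or attribute).
function escape_for_html(string $stored): string
{
    return htmlspecialchars($stored, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

// Stand-in for the browser: entities in the markup are decoded back
// to characters before the field value is submitted again.
function simulate_browser_submit(string $markup): string
{
    return htmlspecialchars_decode($markup, ENT_QUOTES | ENT_HTML5);
}

$stored = 'black &amp; white';                   // exactly what the user typed
$markup = escape_for_html($stored);              // black &amp;amp; white
$resubmitted = simulate_browser_submit($markup); // decoded once, not twice

var_dump($resubmitted === $stored);              // bool(true)
```

With escaping applied only on output, re-saving the form after editing a different field leaves the typed "&amp;" unchanged, and markup characters such as "<" are still neutralized when displayed.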
