Task #8574

Create script to scrub tags from content

Added by Mike Cantelon over 5 years ago. Updated almost 5 years ago.

Status: Verified
Start date: 06/17/2015
Priority: Medium
Due date:
Assignee: Sarah Romkey
% Done: 0%
Category: CLI tools
Target version: Release 2.2.1
Google Code Legacy ID:
Tested version: 2.2, 2.3
Sponsored: No
Requires documentation: No

Description

XSS prevention logic now escapes HTML in AtoM content. Some users, however, have HTML content in AtoM. We need a script to remove/transform HTML content (in information objects to start).


Related issues

Related to Access to Memory (AtoM) - Feature #7647: Escape HTML entities "<", ">", '"', "&" to prevent XSS ex... Verified 12/03/2014
Related to Access to Memory (AtoM) - Bug #9181: HTML scrubber sometimes escapes ampersands that are outsi... Verified 11/23/2015
Related to Access to Memory (AtoM) - Feature #9184: Update HTML scrub script to replace HTML links with custo... Verified 11/23/2015
Related to Access to Memory (AtoM) - Feature #12149: Convert existing AtoM data to Parsedown syntax Verified 03/13/2018

History

#1 Updated by Mike Cantelon over 5 years ago

Ideas from David J:

The script should target the information_object_i18n data to start. We may need to expand it to cover actor_i18n, term_i18n, etc., but let's start with information objects.

For most HTML tags I think we should just delete the start and end tags and leave the contents, e.g. "<p class="foo"><b>Mr. Bean</b> is an eminent..." should become "Mr. Bean is an eminent...". Hopefully there's some helpful library or regex out there to make this task easier.

AtoM does have some special cases though:
1) <a href="http://example.org/foo">Foo</a> should become "http://example.org/foo"
2) <a href="mailto:">email Jane Doe</a> should become ""
3) <p> tags should be replaced with a double linebreak (e.g. "\n\n")
4) <li>item 1</li> should become "* item 1"
5) Within extent and medium we've used some funky definition list tags. <dl><dt>Extent</dt><dd>12 meters of stuff</dd></dl> should become "12 meters of stuff". There are several possible <dt> values, but they should all be removed.
For this list I would consider 1 the highest priority and 5 the lowest priority. If any of them are going to be a lot of work, then let me know. We'll have to evaluate at that point if the effort is worth the reward.
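
The rules above can be sketched in a few lines. This is only an illustration, written in Python with the standard library's html.parser rather than the PHP that the actual AtoM task uses. The names TagScrubber and scrub are hypothetical, and putting each list item on its own line is an assumption. The mailto case is left out because the example in this ticket does not show its expected output.

```python
from html.parser import HTMLParser


class TagScrubber(HTMLParser):
    """Hypothetical scrubber illustrating the rules above; not the AtoM task."""

    def __init__(self):
        # convert_charrefs=True (the default) decodes entities such as &amp;
        # into plain characters before handle_data() sees them.
        super().__init__(convert_charrefs=True)
        self.out = []
        self.in_dt = False    # True while inside <dt>, whose label text is dropped
        self.in_link = False  # True while inside <a href>, whose text is replaced

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.out.append(href)  # keep the URL, drop the link text
                self.in_link = True
        elif tag == "p":
            self.out.append("\n\n")    # paragraph becomes a double linebreak
        elif tag == "li":
            self.out.append("* ")
        elif tag == "dt":
            self.in_dt = True
        # Every other start tag is dropped; its contents are kept.

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False
        elif tag == "dt":
            self.in_dt = False
        elif tag == "li":
            self.out.append("\n")      # assumption: one list item per line

    def handle_data(self, data):
        if not (self.in_dt or self.in_link):
            self.out.append(data)


def scrub(html):
    """Return the text of `html` with tags removed or transformed."""
    parser = TagScrubber()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.out).strip()


print(scrub('<p class="foo"><b>Mr. Bean</b> is an eminent...'))
print(scrub('<dl><dt>Extent</dt><dd>12 meters of stuff</dd></dl>'))
```

A parser is used here instead of a regex so that attributes and nested tags need no special handling. A regex-based approach would also fit the "library or regex" idea above.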

Mike G started work on a task for this:

https://gist.github.com/MikeFE/23dec4147f2ae3bb4ef5

#2 Updated by Mike Cantelon over 5 years ago

  • Status changed from New to Code Review
  • Assignee changed from Mike Cantelon to Nick Wilkinson

Pull request for HTML scrubber:

https://github.com/artefactual/atom/pull/196

#3 Updated by Nick Wilkinson over 5 years ago

  • Assignee changed from Nick Wilkinson to Mike Gale

Hi Mike G, can you please take a look at this for code review?

#4 Updated by Mike Gale over 5 years ago

  • Assignee changed from Mike Gale to Mike Cantelon

Hey Mike, it mostly looks good. The main thing I think we should change is what we replace the 'dd' tags with. Comments in the pull request on GitHub. Danke.

#5 Updated by Mike Cantelon over 5 years ago

  • Status changed from Code Review to QA/Review
  • Assignee changed from Mike Cantelon to Dan Gillean

Hi Dan. You invoke the HTML scrubber by entering: ./symfony i18n:remove-html-tags

#6 Updated by Mike Cantelon over 5 years ago

I've created a version of the HTML removal script that'll be easy for AtoM 2.2 users to run.

1) On the command line, change to the AtoM root directory:

$ cd /directory/where/atom/lives

2) Download the HTML translation script:

$ curl https://gist.githubusercontent.com/mcantelon/082a6ceefcbfa66ede99/raw/b3322ea9d555df5ec99d09cb60ff35afb50c6d80/remove-html.php > remove-html.php

3) Run the script:

$ php symfony tools:run remove-html.php

#7 Updated by Sarah Romkey over 5 years ago

  • Status changed from QA/Review to Feedback
  • Assignee changed from Dan Gillean to Mike Cantelon
  • Target version changed from Release 2.3.0 to Release 2.2.0

Testing in qa-22, I made these changes to the scope and content of this fonds:

http://qa-22x.test.artefactual.com/archibald-galbraith-fonds


This <b>fonds</b> consists of <i>handwritten correspondence</i> from 1844-1884 between Archibald Galbraith and his family. The content of the correspondence from 1844-1855 mainly concern <a href="www.google.com">Archibald Galbraith's</a> experiences fighting in India, and the correspondence from 1857-1884 are focused on his later family life. 

Note also that an apostrophe was transformed later in the scope and content:


There are newspaper clippings from three editions of Glasgow Herald&#xE2;&#x80;&#x99;s Saturday Extra 

Original:


There are newspaper clippings from three editions of Glasgow Herald’s Saturday Extra
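
One possible explanation for the transformed apostrophe, which this ticket does not confirm: the three entities match the three UTF-8 bytes of the right single quotation mark (U+2019), as if the text had been read one byte per character and each non-ASCII byte written out as a character reference. A small Python sketch (the function name is hypothetical) reproduces the output under that assumption:

```python
def escape_bytes_as_entities(text):
    """Mimic a tool that misreads UTF-8 as a single-byte encoding and writes
    every non-ASCII character as a hexadecimal character reference.
    This is an assumed failure mode, not the script's confirmed behaviour."""
    misread = text.encode("utf-8").decode("latin-1")
    return "".join(
        ch if ord(ch) < 128 else f"&#x{ord(ch):X};" for ch in misread
    )


original = "Glasgow Herald\u2019s Saturday Extra"
print("\u2019".encode("utf-8"))            # b'\xe2\x80\x99'
print(escape_bytes_as_entities(original))  # Glasgow Herald&#xE2;&#x80;&#x99;s Saturday Extra
```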

#8 Updated by Sarah Romkey over 5 years ago

Another, from the same description:

Physical description:


2 b&amp;w photographs

was


2 b&w photographs

before transformation.
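
The ampersand change looks like ordinary HTML escaping applied to text that should have stayed plain. The ticket does not say where in the script this happens, so the following Python lines are only a sketch of the effect and of how unescaping reverses it:

```python
import html

before = "2 b&w photographs"

# Serializing text as HTML escapes a bare ampersand...
escaped = html.escape(before, quote=False)
print(escaped)   # 2 b&amp;w photographs

# ...and unescaping the result recovers the original plain text.
restored = html.unescape(escaped)
print(restored)  # 2 b&w photographs
```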

#9 Updated by Mike Cantelon over 5 years ago

  • Assignee changed from Mike Cantelon to David Hume

OK, I've fixed the issues hopefully. The script ( https://gist.githubusercontent.com/mcantelon/082a6ceefcbfa66ede99/raw/0b0bc35e04e9dbb3846b464a375b4cb7349dc07b/remove-html.php ) can be updated on the QA server.

#10 Updated by David Hume over 5 years ago

  • Status changed from Feedback to QA/Review
  • Assignee changed from David Hume to Sarah Romkey

Deployed and ran (4th time sounds the charm!)

Sarah, if you could summarize your positive result and then I guess consult with Mike on next steps as mentioned.

#11 Updated by Sarah Romkey over 5 years ago

  • Target version changed from Release 2.2.0 to Release 2.3.0
  • Requires documentation set to Yes
  • Tested version 2.2 added

Announced to the user forum here:

https://groups.google.com/forum/#!topic/ica-atom-users/_xdBK0ucegg

Leaving this ticket in QA review and changing target version to 2.3 so that we remember to QA for that release as well.

#12 Updated by Dan Gillean about 5 years ago

  • Status changed from QA/Review to Verified

Tested in AtoM 2.3 as a CLI task. To invoke:

php symfony i18n:remove-html-tags

The task will run through archival description i18n fields, and report the number of changes per information object, along with a total number of changed information objects as the final output.

#13 Updated by Dan Gillean about 5 years ago

  • Related to Feature #7647: Escape HTML entities "<", ">", '"', "&" to prevent XSS exploits added

#14 Updated by Dan Gillean about 5 years ago

  • Requires documentation changed from Yes to No
  • Tested version 2.3 added

#17 Updated by Mike Gale almost 5 years ago

  • Related to Bug #9181: HTML scrubber sometimes escapes ampersands that are outside html tags... added

#18 Updated by Dan Gillean almost 5 years ago

  • Related to Feature #9184: Update HTML scrub script to replace HTML links with custom linking formatting used in AtoM added

#19 Updated by Dan Gillean almost 5 years ago

  • Target version changed from Release 2.3.0 to Release 2.2.1

We've backported this to the 2.2.1 release and updated it to use the linking syntax described in feature #8410 (which is also backported).

#20 Updated by Dan Gillean over 2 years ago

  • Related to Feature #12149: Convert existing AtoM data to Parsedown syntax added
