Task #8574
Create script to scrub tags from content
| Status: | Verified | Start date: | 06/17/2015 |
|---|---|---|---|
| Priority: | Medium | Due date: | |
| Assignee: | Sarah Romkey | % Done: | 0% |
| Category: | CLI tools | | |
| Target version: | Release 2.2.1 | | |
| Google Code Legacy ID: | | Tested version: | 2.2, 2.3 |
| Sponsored: | No | Requires documentation: | No |
Description
XSS prevention logic now escapes HTML in AtoM content. Some users, however, have HTML content in AtoM. We need a script to remove/transform HTML content (in information objects to start).
Related issues
History
#1 Updated by Mike Cantelon almost 7 years ago
Ideas from David J:
The script should target the information_object_i18n data to start. We may need to expand it to cover actor_i18n, term_i18n, etc., but let's start with information objects.
For most HTML tags I think we should just delete the start and end tags and leave the contents, e.g. "<p class="foo"><b>Mr. Bean</b> is an eminent..." should become "Mr. Bean is an eminent...". Hopefully there's some helpful library or regex out there to make this task easier.
AtoM does have some special cases though:
1. <a href="http://example.org/foo">Foo</a> should become "http://example.org/foo"
2. <a href="mailto:janedoe@example.org">email Jane Doe</a> should become "janedoe@example.org"
3. <p> tags should be replaced with a double linebreak (e.g. "\n\n")
4. <li>item 1</li> should become "* item 1"
5. Within extent and medium we've used some funky definition list tags. <dl><dt>Extent</dt><dd>12 meters of stuff</dd></dl> should become "12 meters of stuff". There are several possible <dt> values, but they should all be removed.
For this list I would consider 1 the highest priority and 5 the lowest priority. If any of them are going to be a lot of work, then let me know. We'll have to evaluate at that point if the effort is worth the reward.
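The rules above can be sketched with a handful of regular expressions. This is only an illustration in Python, not the PHP script attached to this ticket: the function name scrub_html is hypothetical, the regexes assume reasonably well-formed markup with double-quoted href attributes, and HTML entities are left untouched. It follows the rules as first proposed here, so it does not reflect the later change to the 'dd' handling or the AtoM link syntax.

```python
import re

def scrub_html(text: str) -> str:
    """Hypothetical sketch of the proposed scrubbing rules (not the real AtoM task)."""
    flags = re.IGNORECASE | re.DOTALL

    # Definition lists: drop the <dt> label together with its contents.
    text = re.sub(r"<dt\b[^>]*>.*?</dt>", "", text, flags=flags)

    # Links: keep only the href target, without any "mailto:" prefix.
    def link_target(match: re.Match) -> str:
        href = match.group(1)
        if href.lower().startswith("mailto:"):
            return href[len("mailto:"):]
        return href

    text = re.sub(r'<a\b[^>]*?\bhref="([^"]*)"[^>]*>.*?</a>', link_target, text, flags=flags)

    # Paragraphs: an opening <p> becomes a double linebreak.
    text = re.sub(r"<p\b[^>]*>", "\n\n", text, flags=re.IGNORECASE)

    # List items: "<li>item</li>" becomes "* item" on its own line.
    text = re.sub(r"<li\b[^>]*>", "* ", text, flags=re.IGNORECASE)
    text = re.sub(r"</li\s*>", "\n", text, flags=re.IGNORECASE)

    # Everything else: remove the start and end tags, keep the contents.
    text = re.sub(r"</?[a-zA-Z][^>]*>", "", text)

    return text.strip()


print(scrub_html('<p class="foo"><b>Mr. Bean</b> is an eminent...'))
print(scrub_html('<a href="mailto:janedoe@example.org">email Jane Doe</a>'))
print(scrub_html("<dl><dt>Extent</dt><dd>12 meters of stuff</dd></dl>"))
```

The <dt> and link rules run before the generic tag removal because they need to discard or replace the element contents, not just the tags. A real implementation would more likely use an HTML parser than regexes, which are fragile on malformed markup.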
Mike G started work on a task for this:
#2 Updated by Mike Cantelon almost 7 years ago
- Status changed from New to Code Review
- Assignee changed from Mike Cantelon to Nick Wilkinson
Pull request for HTML scrubber:
#3 Updated by Nick Wilkinson almost 7 years ago
- Assignee changed from Nick Wilkinson to Mike Gale
Hi Mike G, can you please take a look at this for code review?
#4 Updated by Mike Gale almost 7 years ago
- Assignee changed from Mike Gale to Mike Cantelon
Hey Mike, it mostly looks good. The main thing I think we should change is what we replace the 'dd' tags with. Comments in the pull request on GitHub. Thanks.
#5 Updated by Mike Cantelon almost 7 years ago
- Status changed from Code Review to QA/Review
- Assignee changed from Mike Cantelon to Dan Gillean
Hi Dan. You invoke the HTML scrubber by entering: ./symfony i18n:remove-html-tags
#6 Updated by Mike Cantelon almost 7 years ago
I've created a version of the HTML removal script that'll be easy for AtoM 2.2 users to run.
1) On the command line, change to the AtoM root directory:
$ cd /directory/where/atom/lives
2) Download the HTML translation script:
$ curl https://gist.githubusercontent.com/mcantelon/082a6ceefcbfa66ede99/raw/b3322ea9d555df5ec99d09cb60ff35afb50c6d80/remove-html.php > remove-html.php
3) Run the script:
$ php symfony tools:run remove-html.php
#7 Updated by Sarah Romkey almost 7 years ago
- Status changed from QA/Review to Feedback
- Assignee changed from Dan Gillean to Mike Cantelon
- Target version changed from Release 2.3.0 to Release 2.2.0
Testing in qa-22, I made these changes to the scope and content of this fonds:
http://qa-22x.test.artefactual.com/archibald-galbraith-fonds
This <b>fonds</b> consists of <i>handwritten correspondence</i> from 1844-1884 between Archibald Galbraith and his family. The content of the correspondence from 1844-1855 mainly concern <a href="www.google.com">Archibald Galbraith's</a> experiences fighting in India, and the correspondence from 1857-1884 are focused on his later family life.
Note also that an apostrophe was transformed later in the scope and content:
There are newspaper clippings from three editions of Glasgow Herald’s Saturday Extra
Original:
There are newspaper clippings from three editions of Glasgow Herald’s Saturday Extra
#8 Updated by Sarah Romkey almost 7 years ago
Another, from the same description:
Physical description:
2 b&w photographs
was
2 b&w photographs
before transformation.
#9 Updated by Mike Cantelon almost 7 years ago
- Assignee changed from Mike Cantelon to David Hume
OK, I've hopefully fixed the issues. The updated script (https://gist.githubusercontent.com/mcantelon/082a6ceefcbfa66ede99/raw/0b0bc35e04e9dbb3846b464a375b4cb7349dc07b/remove-html.php) can now be deployed to the QA server.
#10 Updated by David Hume almost 7 years ago
- Status changed from Feedback to QA/Review
- Assignee changed from David Hume to Sarah Romkey
Deployed and ran (4th time sounds the charm!)
Sarah, could you summarize your positive result and then consult with Mike on next steps, as mentioned?
#11 Updated by Sarah Romkey almost 7 years ago
- Target version changed from Release 2.2.0 to Release 2.3.0
- Requires documentation set to Yes
- Tested version 2.2 added
Announced to the user forum here:
https://groups.google.com/forum/#!topic/ica-atom-users/_xdBK0ucegg
Leaving this ticket in QA review and changing target version to 2.3 so that we remember to QA for that release as well.
#12 Updated by Dan Gillean almost 7 years ago
- Status changed from QA/Review to Verified
Tested in AtoM 2.3 as a CLI task. To invoke:
php symfony i18n:remove-html-tags
The task will run through archival description i18n fields, and report the number of changes per information object, along with a total number of changed information objects as the final output.
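The reporting behaviour described above can be pictured roughly as follows. This is a Python sketch under stated assumptions, not the task's actual code: strip_tags is a hypothetical stand-in for the real scrubber, the records are an in-memory dict rather than AtoM's database, and the field names and output wording are invented for illustration.

```python
import re

def strip_tags(value: str) -> str:
    """Hypothetical stand-in for the real scrubber: removes anything tag-shaped."""
    return re.sub(r"</?[a-zA-Z][^>]*>", "", value)

def scrub_records(records: dict) -> dict:
    """Clean every field in place; return {object id: number of fields changed}."""
    report = {}
    for object_id, fields in records.items():
        changed = 0
        for name, value in list(fields.items()):
            cleaned = strip_tags(value)
            if cleaned != value:
                fields[name] = cleaned
                changed += 1
        if changed:
            report[object_id] = changed
    return report

# Assumed sample data: object ids and field names are made up.
records = {
    101: {"title": "Plain title", "scope_and_content": "This <b>fonds</b> consists of letters."},
    102: {"title": "No markup here", "scope_and_content": "Nothing to change."},
}

report = scrub_records(records)
for object_id, changed in report.items():
    print(f"Information object {object_id}: {changed} field(s) changed")
print(f"Total information objects changed: {len(report)}")
```

Only objects with at least one modified field are counted in the total, which matches the "changed information objects" wording in the comment above.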
#13 Updated by Dan Gillean over 6 years ago
- Related to Feature #7647: Escape HTML entities "<", ">", '"', "&" to prevent XSS exploits added
#14 Updated by Dan Gillean over 6 years ago
- Requires documentation changed from Yes to No
- Tested version 2.3 added
Documentation added to 2.3 in: https://github.com/artefactual/atom-docs/commit/2f33b3ae62c7a17d79461f9bfe7da2330a85d0fb
#17 Updated by Mike Gale over 6 years ago
- Related to Bug #9181: HTML scrubber sometimes escapes ampersands that are outside html tags... added
#18 Updated by Dan Gillean over 6 years ago
- Related to Feature #9184: Update HTML scrub script to replace HTML links with custom linking formatting used in AtoM added
#19 Updated by Dan Gillean over 6 years ago
- Target version changed from Release 2.3.0 to Release 2.2.1
We've backported this to the 2.2.1 release and updated it to use the linking syntax described in feature #8410 (which is also backported).
#20 Updated by Dan Gillean about 4 years ago
- Related to Feature #12149: Convert existing AtoM data to Parsedown syntax added