Create script to scrub tags from content
|Assignee:|Sarah Romkey|
|Target version:|Release 2.2.1|
|Tested version:|2.2, 2.3|
XSS prevention logic now escapes HTML in AtoM content. Some users, however, have HTML content in AtoM. We need a script to remove/transform HTML content (in information objects to start).
#1 Updated by Mike Cantelon about 5 years ago
Ideas from David J:
The script should target the information_object_i18n data to start. We may need to expand it to cover actor_i18n, term_i18n, etc., but let's start with information objects.
For most HTML tags I think we should just delete the start and end tags and leave the contents, e.g. "<p class="foo"><b>Mr. Bean</b> is an eminent..." should become "Mr. Bean is an eminent...". Hopefully there's some helpful library or regex out there to make this task easier.
AtoM does have some special cases though:
1) <a href="http://example.org/foo">Foo</a> should become "http://example.org/foo"
2) <a href="mailto:firstname.lastname@example.org">email Jane Doe</a> should become "firstname.lastname@example.org"
3) <p> tags should be replaced with a double linebreak (e.g. "\n\n")
4) <li>item 1</li> should become "* item 1"
5) Within extent and medium we've used some funky definition list tags. <dl><dt>Extent</dt><dd>12 meters of stuff</dd></dl> should become "12 meters of stuff". There are several possible <dt> values, but they should all be removed.
For this list I would consider 1 the highest priority and 5 the lowest priority. If any of them are going to be a lot of work, then let me know. We'll have to evaluate at that point if the effort is worth the reward.
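As a rough illustration of the rules above, here is a minimal regex-based sketch in Python. It is not the PHP script attached to this ticket; the function name `strip_html` and the order in which the rules are applied are assumptions made for the example, and a real implementation would probably want a proper HTML parser for malformed markup.

```python
import re

def strip_html(text):
    """Hypothetical helper: apply the ticket's special cases, then drop remaining tags."""
    # <dt> labels are removed together with their contents.
    text = re.sub(r"<dt\b[^>]*>.*?</dt>", "", text, flags=re.I | re.S)

    # Links are replaced by their target; a leading "mailto:" is dropped.
    def link_target(match):
        href = match.group(2)
        if href.lower().startswith("mailto:"):
            return href[len("mailto:"):]
        return href

    text = re.sub(
        r"""<a\b[^>]*?\bhref=(["'])(.*?)\1[^>]*>.*?</a>""",
        link_target,
        text,
        flags=re.I | re.S,
    )

    # List items become "* item" lines.
    text = re.sub(r"<li\b[^>]*>(.*?)</li>", r"* \1\n", text, flags=re.I | re.S)

    # Opening and closing <p> tags become a double linebreak.
    text = re.sub(r"</?p\b[^>]*>", "\n\n", text, flags=re.I)

    # Every other tag: delete the tag itself, keep its contents.
    text = re.sub(r"</?[a-zA-Z][^>]*>", "", text)

    # Tidy up: collapse runs of blank lines and trim the ends.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


if __name__ == "__main__":
    print(strip_html('<p class="foo"><b>Mr. Bean</b> is an eminent...'))
```

The special cases run before the generic tag removal so that the link target and the <dt> contents are still available when their rules are applied.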
Mike G started work on a task for this:
#6 Updated by Mike Cantelon about 5 years ago
I've created a version of the HTML removal script that'll be easy for AtoM 2.2 users to run.
1) In command-line, change directory to AtoM root directory:
$ cd /directory/where/atom/lives
2) Download the HTML translation script:
$ curl https://gist.githubusercontent.com/mcantelon/082a6ceefcbfa66ede99/raw/b3322ea9d555df5ec99d09cb60ff35afb50c6d80/remove-html.php > remove-html.php
3) Run the script:
$ php symfony tools:run remove-html.php
#7 Updated by Sarah Romkey about 5 years ago
- Status changed from QA/Review to Feedback
- Assignee changed from Dan Gillean to Mike Cantelon
- Target version changed from Release 2.3.0 to Release 2.2.0
Testing in qa-22, I made these changes to the scope and content of this fonds:
This <b>fonds</b> consists of <i>handwritten correspondence</i> from 1844-1884 between Archibald Galbraith and his family. The content of the correspondence from 1844-1855 mainly concern <a href="www.google.com">Archibald Galbraith's</a> experiences fighting in India, and the correspondence from 1857-1884 are focused on his later family life.
Note also that an apostrophe was transformed later in the scope and content:
There are newspaper clippings from three editions of Glasgow Heraldâ€™s Saturday Extra
There are newspaper clippings from three editions of Glasgow Herald’s Saturday Extra
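The ticket does not state what caused the apostrophe to change. The garbled form does match what a UTF-8 right single quotation mark looks like when its bytes are decoded as Windows-1252, so a charset mismatch somewhere in the script is one plausible explanation. The following Python sketch only demonstrates that effect; it makes no claim about where in the PHP script the mismatch occurred.

```python
# Assumption for illustration: the stored text is UTF-8 and was decoded
# as Windows-1252 (cp1252) at some point during processing.
original = "Glasgow Herald\u2019s Saturday Extra"

# Decoding the UTF-8 bytes with the wrong charset produces the garbled text.
garbled = original.encode("utf-8").decode("cp1252")

# Reversing the mistake restores the original apostrophe.
repaired = garbled.encode("cp1252").decode("utf-8")

print(garbled)
print(repaired)
```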
#9 Updated by Mike Cantelon about 5 years ago
- Assignee changed from Mike Cantelon to David Hume
OK, I've hopefully fixed the issues. The script ( https://gist.githubusercontent.com/mcantelon/082a6ceefcbfa66ede99/raw/0b0bc35e04e9dbb3846b464a375b4cb7349dc07b/remove-html.php ) can be updated on the QA server.
#11 Updated by Sarah Romkey about 5 years ago
- Target version changed from Release 2.2.0 to Release 2.3.0
- Requires documentation set to Yes
- Tested version 2.2 added
Announced to the user forum here:
Leaving this ticket in QA review and changing target version to 2.3 so that we remember to QA for that release as well.
#12 Updated by Dan Gillean almost 5 years ago
- Status changed from QA/Review to Verified
Tested in AtoM 2.3 as a CLI task. To invoke:
php symfony i18n:remove-html-tags
The task will run through archival description i18n fields, and report the number of changes per information object, along with a total number of changed information objects as the final output.
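The per-object and total counts described above could be gathered along the lines of this Python sketch. It is not the task's actual code or output format: the `scrub_records` and `remove_tags` names, the in-memory `records` dictionary, and the simplistic tag removal are all assumptions for the example.

```python
import re

def remove_tags(value):
    """Hypothetical stand-in for the real tag-scrubbing logic."""
    return re.sub(r"</?[a-zA-Z][^>]*>", "", value)

def scrub_records(records):
    """Clean every field in place; return (changes per object, total changed objects)."""
    changes_per_object = {}
    for object_id, fields in records.items():
        changes = 0
        for name, value in fields.items():
            cleaned = remove_tags(value)
            if cleaned != value:
                fields[name] = cleaned
                changes += 1
        if changes:
            changes_per_object[object_id] = changes
    return changes_per_object, len(changes_per_object)


# Example data: object IDs mapped to their i18n field values (made up).
records = {
    1: {"title": "<b>Fonds</b>", "scope_and_content": "No markup here"},
    2: {"title": "Plain title"},
}
per_object, total = scrub_records(records)
print(per_object, total)
```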
#14 Updated by Dan Gillean almost 5 years ago
- Requires documentation changed from Yes to No
- Tested version 2.3 added
Documentation added to 2.3 in: https://github.com/artefactual/atom-docs/commit/2f33b3ae62c7a17d79461f9bfe7da2330a85d0fb