Bug #6064

EAD DTD causing errors and warnings on Import into AtoM 2.x

Added by Jessica Bushey over 8 years ago. Updated over 7 years ago.

Status: Verified
Start date: 12/02/2013
Priority: Critical
Due date:
Assignee: Dan Gillean
% Done: 0%
Category: EAD
Target version: Release 2.1.0
Google Code Legacy ID:
Tested version: 2.0.0, 2.0.1, 2.1
Sponsored: No
Requires documentation:

Description

Tested with 3 different files (all attached):

Test 1:
File "this-is-title-proper-rad;ead.xml" will import but with warnings below:
libxml error 1549 on line 0 in input file: failed to load external entity "http://lcweb2.loc.gov/xmlcommon/dtds/ead2002/ead.dtd"
libxml error 517 on line 0 in input file: Could not load the external subset "http://lcweb2.loc.gov/xmlcommon/dtds/ead2002/ead.dtd"
When I look at the DocType Definition of this file I see:


<!DOCTYPE ead PUBLIC "+//ISBN 1-931666-00-8//DTD ead.dtd (Encoded Archival Description (EAD) Version 2002)//EN" "http://lcweb2.loc.gov/xmlcommon/dtds/ead2002/ead.dtd">
<ead>
<eadheader langencoding>

Test 2:
File "cookies-monster-archives;ead.xml" will NOT import and I get errors below:
(By the way, I created this archival description in AtoM 2.x and tried to roundtrip it.)

libxml error 1549 on line 0 in input file: failed to load external entity "http://lcweb2.loc.gov/xmlcommon/dtds/ead2002/ead.dtd"
libxml error 517 on line 0 in input file: Could not load the external subset "http://lcweb2.loc.gov/xmlcommon/dtds/ead2002/ead.dtd"
When I look at the DocType Definition of this file I see:


<!DOCTYPE ead PUBLIC "+//ISBN 1-931666-00-8//DTD ead.dtd (Encoded Archival Description (EAD) Version 2002)//EN" "http://lcweb2.loc.gov/xmlcommon/dtds/ead2002/ead.dtd">
<ead xmlns:ns2="http://www.w3.org/1999/xlink" xmlns="urn:isbn:1-931666-22-9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<eadheader langencoding>

Test 3:
File "mssa.ms.1906.xml" will NOT import and I get the following errors below:

libxml error 522 on line 0 in input file: no DTD found!
When I look at the DocType Definition of this file I see:


<ead xmlns:ns2="http://www.w3.org/1999/xlink" xmlns="urn:isbn:1-931666-22-9"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="urn:isbn:1-931666-22-9 http://www.library.yale.edu/facc/schemas/ead/ead.xsd"
id="mssa.ms.1906">
<eadheader findaidstatus=

The only way I can get AtoM 2.x to import this file is to remove the declaration above and replace it with a plain <ead> tag, followed directly by the <eadheader> element.

All the namespaces are valid, so the problem is that AtoM is not recognizing namespace attributes on the <ead> element. We need to fix the errors (i.e., the cases where AtoM will not let us import), and we would also like to fix the warnings.
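To compare the three test files quickly, a small script can report which DTD a file points to and which namespace declarations sit on its root element. This is only a sketch using Python's standard library; the function name describe_ead is made up for illustration and is not part of AtoM. The parser used here does not fetch the remote DTD, so it works regardless of whether the LoC server is reachable.

```python
from xml.dom import minidom


def describe_ead(xml_text):
    """Return (DOCTYPE system identifier or None, sorted xmlns attribute names on the root)."""
    doc = minidom.parseString(xml_text)
    # doc.doctype is None when the file has no <!DOCTYPE ...> declaration (as in Test 3).
    system_id = doc.doctype.systemId if doc.doctype is not None else None
    root = doc.documentElement
    # Keep only namespace declarations: xmlns="..." and xmlns:prefix="...".
    ns_attrs = sorted(
        name for name in root.attributes.keys()
        if name == "xmlns" or name.startswith("xmlns:")
    )
    return system_id, ns_attrs
```

A file like Test 1 should report the LoC DTD URL and an empty list, Test 2 the same URL plus three xmlns names, and Test 3 no DTD at all.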

mssa.ms.1906.xml - Will not import (11.5 KB) Jessica Bushey, 12/02/2013 03:51 PM

cookies-monster-archives;ead.xml - Will not import (2.56 KB) Jessica Bushey, 12/02/2013 03:52 PM

this-is-title-proper-rad;ead.xml - Will import with warnings (9.5 KB) Jessica Bushey, 12/02/2013 03:52 PM

pierian-club-dundas-ont-fonds-2;ead.xml (13.4 KB) Dan Gillean, 08/22/2014 02:46 PM

Weekend Maintenance _ Library of Congress.html (4.25 KB) David Juhasz, 08/22/2014 03:05 PM


Related issues

Related to Access to Memory (AtoM) - Bug #6875: EAD import errors causing failure Duplicate 06/20/2014
Related to Access to Memory (AtoM) - Feature #2787: validate XML files using XSD files located in data/xsd Verified
Related to Access to Memory (AtoM) - Bug #6877: MODS import doesn't work if namespace attributes included Verified 06/21/2014

History

#1 Updated by Jessica Bushey over 8 years ago

#2 Updated by Jesús García Crespo over 8 years ago

  • Target version changed from Release 2.0.1 to Release 2.0.2

#3 Updated by José Raddaoui Marín over 8 years ago

  • Status changed from New to In progress

Hi Jessica,

I had the same problem loading "http://lcweb2.loc.gov/xmlcommon/dtds/ead2002/ead.dtd" but suddenly it stopped happening. I could only guess it was a problem trying to get it from the virtual machine. I think what I did was to access the page from the browser once, but I'm not sure that will solve the problem.

Without the problem loading the DTD file:

Test 1:

Imported without warnings.

Test 2:

Does not import and gives the following warnings:

libxml error 533 on line 3 in input file: No declaration for attribute xmlns:ns2 of element ead
libxml error 533 on line 3 in input file: No declaration for attribute xmlns of element ead
libxml error 533 on line 3 in input file: No declaration for attribute xmlns:xsi of element ead

Removing those attributes the file is imported without warnings. Those attributes were added in this commit: https://github.com/artefactual/atom/commit/6e72a07

Test 3:

Does not import and gives the following warning:

libxml error 522 on line 0 in input file: no DTD found!

Adding the <!DOCTYPE> tag and importing again gives these warnings:

libxml error 533 on line 6 in input file: No declaration for attribute schemaLocation of element ead
libxml error 533 on line 6 in input file: No declaration for attribute xmlns:ns2 of element ead
libxml error 533 on line 6 in input file: No declaration for attribute xmlns of element ead
libxml error 533 on line 6 in input file: No declaration for attribute xmlns:xsi of element ead
libxml error 533 on line 32 in input file: No declaration for attribute type of element extref
libxml error 533 on line 32 in input file: No declaration for attribute actuate of element extref
libxml error 533 on line 32 in input file: No declaration for attribute show of element extref
libxml error 533 on line 32 in input file: No declaration for attribute href of element extref
libxml error 533 on line 138 in input file: No declaration for attribute type of element dao
libxml error 533 on line 138 in input file: No declaration for attribute actuate of element dao
libxml error 533 on line 138 in input file: No declaration for attribute title of element dao
libxml error 533 on line 138 in input file: No declaration for attribute role of element dao
libxml error 533 on line 138 in input file: No declaration for attribute href of element dao
libxml error 533 on line 138 in input file: No declaration for attribute show of element dao
libxml error 533 on line 159 in input file: No declaration for attribute type of element dao
libxml error 533 on line 159 in input file: No declaration for attribute actuate of element dao
libxml error 533 on line 159 in input file: No declaration for attribute title of element dao
libxml error 533 on line 159 in input file: No declaration for attribute role of element dao
libxml error 533 on line 159 in input file: No declaration for attribute href of element dao
libxml error 533 on line 159 in input file: No declaration for attribute show of element dao
libxml error 533 on line 180 in input file: No declaration for attribute type of element dao
libxml error 533 on line 180 in input file: No declaration for attribute actuate of element dao
libxml error 533 on line 180 in input file: No declaration for attribute title of element dao
libxml error 533 on line 180 in input file: No declaration for attribute role of element dao
libxml error 533 on line 180 in input file: No declaration for attribute href of element dao
libxml error 533 on line 180 in input file: No declaration for attribute show of element dao

Then, after removing the same xmlns attributes that were removed in Test 2, the file is imported but gives these warnings:

libxml error 201 on line 4 in input file: Namespace prefix xsi for schemaLocation on ead is not defined
libxml error 201 on line 30 in input file: Namespace prefix ns2 for type on extref is not defined
libxml error 201 on line 30 in input file: Namespace prefix ns2 for actuate on extref is not defined
libxml error 201 on line 30 in input file: Namespace prefix ns2 for show on extref is not defined
libxml error 201 on line 30 in input file: Namespace prefix ns2 for href on extref is not defined
libxml error 201 on line 136 in input file: Namespace prefix ns2 for type on dao is not defined
libxml error 201 on line 136 in input file: Namespace prefix ns2 for actuate on dao is not defined
libxml error 201 on line 136 in input file: Namespace prefix ns2 for title on dao is not defined
libxml error 201 on line 136 in input file: Namespace prefix ns2 for role on dao is not defined
libxml error 201 on line 136 in input file: Namespace prefix ns2 for href on dao is not defined
libxml error 201 on line 136 in input file: Namespace prefix ns2 for show on dao is not defined
libxml error 201 on line 157 in input file: Namespace prefix ns2 for type on dao is not defined
libxml error 201 on line 157 in input file: Namespace prefix ns2 for actuate on dao is not defined
libxml error 201 on line 157 in input file: Namespace prefix ns2 for title on dao is not defined
libxml error 201 on line 157 in input file: Namespace prefix ns2 for role on dao is not defined
libxml error 201 on line 157 in input file: Namespace prefix ns2 for href on dao is not defined
libxml error 201 on line 157 in input file: Namespace prefix ns2 for show on dao is not defined
libxml error 201 on line 178 in input file: Namespace prefix ns2 for type on dao is not defined
libxml error 201 on line 178 in input file: Namespace prefix ns2 for actuate on dao is not defined
libxml error 201 on line 178 in input file: Namespace prefix ns2 for title on dao is not defined
libxml error 201 on line 178 in input file: Namespace prefix ns2 for role on dao is not defined
libxml error 201 on line 178 in input file: Namespace prefix ns2 for href on dao is not defined
libxml error 201 on line 178 in input file: Namespace prefix ns2 for show on dao is not defined
libxml error 201 on line 4 in input file: Namespace prefix xsi for schemaLocation on ead is not defined
libxml error 201 on line 30 in input file: Namespace prefix ns2 for type on extref is not defined
libxml error 201 on line 30 in input file: Namespace prefix ns2 for actuate on extref is not defined
libxml error 201 on line 30 in input file: Namespace prefix ns2 for show on extref is not defined
libxml error 201 on line 30 in input file: Namespace prefix ns2 for href on extref is not defined
libxml error 201 on line 136 in input file: Namespace prefix ns2 for type on dao is not defined
libxml error 201 on line 136 in input file: Namespace prefix ns2 for actuate on dao is not defined
libxml error 201 on line 136 in input file: Namespace prefix ns2 for title on dao is not defined
libxml error 201 on line 136 in input file: Namespace prefix ns2 for role on dao is not defined
libxml error 201 on line 136 in input file: Namespace prefix ns2 for href on dao is not defined
libxml error 201 on line 136 in input file: Namespace prefix ns2 for show on dao is not defined
libxml error 201 on line 157 in input file: Namespace prefix ns2 for type on dao is not defined
libxml error 201 on line 157 in input file: Namespace prefix ns2 for actuate on dao is not defined
libxml error 201 on line 157 in input file: Namespace prefix ns2 for title on dao is not defined
libxml error 201 on line 157 in input file: Namespace prefix ns2 for role on dao is not defined
libxml error 201 on line 157 in input file: Namespace prefix ns2 for href on dao is not defined
libxml error 201 on line 157 in input file: Namespace prefix ns2 for show on dao is not defined
libxml error 201 on line 178 in input file: Namespace prefix ns2 for type on dao is not defined
libxml error 201 on line 178 in input file: Namespace prefix ns2 for actuate on dao is not defined
libxml error 201 on line 178 in input file: Namespace prefix ns2 for title on dao is not defined
libxml error 201 on line 178 in input file: Namespace prefix ns2 for role on dao is not defined
libxml error 201 on line 178 in input file: Namespace prefix ns2 for href on dao is not defined
libxml error 201 on line 178 in input file: Namespace prefix ns2 for show on dao is not defined
libxml error 533 on line 4 in input file: No declaration for attribute schemaLocation of element ead
libxml error 533 on line 30 in input file: No declaration for attribute type of element extref
libxml error 502 on line 30 in input file: Value "onRequest" for attribute actuate of extref is not among the enumerated set
libxml error 533 on line 136 in input file: No declaration for attribute type of element dao
libxml error 502 on line 136 in input file: Value "onRequest" for attribute actuate of dao is not among the enumerated set
libxml error 533 on line 157 in input file: No declaration for attribute type of element dao
libxml error 502 on line 157 in input file: Value "onRequest" for attribute actuate of dao is not among the enumerated set
libxml error 533 on line 178 in input file: No declaration for attribute type of element dao
libxml error 502 on line 178 in input file: Value "onRequest" for attribute actuate of dao is not among the enumerated set

So, in the end, the xmlns attributes in the <ead> tag make the import fail, and the ns2 attributes create warnings. I'll ask Mike G. about the xmlns attributes added in that commit when he's back. I hope that solving that will sort out the ns2 attributes too.

Regards.
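Long warning lists like the ones in this comment are easier to compare once they are counted per libxml error code. The following is a rough sketch, assuming the messages keep the "libxml error N on line M in input file: ..." shape shown above; the function name summarize is made up for illustration.

```python
import re
from collections import Counter

# Matches one AtoM import message, e.g.
# "libxml error 533 on line 3 in input file: No declaration for attribute xmlns of element ead"
LINE = re.compile(r"libxml error (\d+) on line (\d+) in input file: (.*)")


def summarize(log_text):
    """Count how many messages were reported for each libxml error code."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LINE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts
```

Pasting one of the lists above into summarize() gives a quick tally of, for example, 533 (undeclared attribute) versus 201 (undefined namespace prefix) messages.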

#4 Updated by Jessica Bushey over 8 years ago

Yes, Mike G needs to be consulted about this issue.

#5 Updated by David Juhasz over 8 years ago

  • Priority changed from Critical to Medium

#6 Updated by Dan Gillean over 8 years ago

  • Priority changed from Medium to Critical

As per David's request, I have retested the 3 original use cases posted in this issue - they remain critical problems. Namely, Test 2 and Test 3 will not import at all, despite having been roundtripped from AtoM in the first place. Remarking as a critical issue for consideration in the next release.

#7 Updated by Jessica Bushey over 8 years ago

This bug is blocking my ability to test EAD import of archival descriptions created in languages other than English.

#8 Updated by Mike Gale over 8 years ago

  • Assignee changed from José Raddaoui Marín to Mike Gale

#9 Updated by Mike Gale over 8 years ago

  • Status changed from In progress to QA/Review
  • Assignee changed from Mike Gale to Jessica Bushey

So what happened is some XML namespaces were added so that the XML generated by AtoM could be processed properly by Apache-FOP (which is used when printing finding aid PDFs). Despite these being pretty standard namespaces in XML, the EAD DTD doesn't account for them and flags it as an error.

EAD does support one namespace, namely "urn:isbn:1-931666-00-8", but it's ignored by default. If I manually set the DTD to parse it, it seems to break the import. As it's disabled by default in the LoC DTD, I don't think we should waste time debugging this issue because it's very unlikely someone will be using that namespace. All other namespaces aren't allowed in EAD.

tl;dr: I reverted the code in AtoM to just use the <ead> tag with no namespaces and that should fix this issue. I'll find some other workaround to get our EAD XML files to parse in Apache-FOP.

#10 Updated by Jessica Bushey over 8 years ago

  • Status changed from QA/Review to Feedback
  • Assignee changed from Jessica Bushey to David Juhasz

Ok. So I roundtripped an archival description from AtoM 2.x and got the following errors when trying to import the EAD.xml:

  • libxml error 1549 on line 0 in input file: failed to load external entity "http://lcweb2.loc.gov/xmlcommon/dtds/ead2002/ead.dtd"
  • libxml error 517 on line 0 in input file: Could not load the external subset "http://lcweb2.loc.gov/xmlcommon/dtds/ead2002/ead.dtd"

But AtoM allows the import even with the errors. So I guess it works, but I'm worried that clients trying to import compliant EAD.xml created outside of AtoM won't be able to do so without changing their EAD. I think we need to make sure that we are accepting standardized EAD.

#11 Updated by Jessica Bushey over 8 years ago

I'd like to point out that AtoM users are running into this issue (see the email sent to Dan, dated February 12, 2014), to which Dan provided the following response:

One of our developers, who was doing some work with an XSLT, added namespace values to the <ead> element, without ensuring that this was allowed in EAD or capable of being parsed by our import script. In 2.0.1, the <ead> element will export like so:

<ead xmlns:ns2="http://www.w3.org/1999/xlink" xmlns="urn:isbn:1-931666-22-9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

The simple workaround, until a fix is included in our 2.0.2 release, is to edit the EAD and remove the attributes, like so:

<ead>

This issue is being addressed on the following issue ticket in our bug-tracking system: https://projects.artefactual.com/issues/6064

Note that performing the above edit will allow the file to import as expected (providing the "View archival description" button, for example), but it will not remove the two warnings you provided:

libxml error 1549 on line 0 in input file: failed to load external entity "http://lcweb2.loc.gov/xmlcommon/dtds/ead2002/ead.dtd"
libxml error 517 on line 0 in input file: Could not load the external subset http://lcweb2.loc.gov/xmlcommon/dtds/ead2002/ead.dtd

These warnings will not affect the status of the import - we are still investigating their cause however (as we have verified that the URL is correct for the EAD 2002 DTD), and you will see that Jessica's most recent comment on the related issue ticket (here) has set the ticket to feedback and highlighted this outstanding issue, which we hope to resolve.

I hope that helps - you can expect a fix for this bug in our next release.
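For anyone with more than a handful of files, the manual edit described in this comment could be scripted. Below is a minimal sketch, assuming the namespace declarations appear only on the opening <ead> tag as in the 2.0.1 export shown above; the function name strip_root_namespaces is made up for illustration and is not part of AtoM.

```python
import re

# Matches the opening <ead ...> tag only (not <eadheader>); attributes land in group 1.
EAD_OPEN_TAG = re.compile(r"<ead(\s[^>]*)?>")
# Matches xmlns="..." and xmlns:prefix="..." declarations, including leading whitespace.
XMLNS_ATTR = re.compile(r"""\s+xmlns(?::[\w.-]+)?\s*=\s*("[^"]*"|'[^']*')""")


def strip_root_namespaces(xml_text):
    """Remove namespace declarations from the first <ead> opening tag."""
    def clean(match):
        attrs = XMLNS_ATTR.sub("", match.group(1) or "")
        return "<ead" + attrs + ">"
    return EAD_OPEN_TAG.sub(clean, xml_text, count=1)
```

Note that this only removes the xmlns declarations. Prefixed attributes elsewhere in the file (such as xsi:schemaLocation or the ns2 attributes in mssa.ms.1906.xml) are left untouched and would still produce the "Namespace prefix ... is not defined" warnings reported in comment #3.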

#12 Updated by Dan Gillean almost 8 years ago

  • Target version changed from Release 2.0.2 to Release 2.1.0

#13 Updated by Dan Gillean almost 8 years ago

Devs: see #6836 for some other thoughts on the missing DTD problem.

#14 Updated by David Juhasz almost 8 years ago

Found this <http://stackoverflow.com/questions/24526493/check-for-malicious-xml-before-allowing-dtd-loading> which indicates that loading of external entities (like remote DTD files) is disabled by default in libxml >=2.9 to prevent loading of malicious files. It looks like it's possible to override this behaviour by including the LIBXML_DTDLOAD option when calling DOMDocument->load().

#15 Updated by David Juhasz almost 8 years ago

I confirmed via xmllint that cookies-monster-archives;ead.xml and mssa.ms.1906.xml are not valid EAD documents due to using unsupported namespaces in the <ead> element, so AtoM should not be required to import these EAD documents. I also confirmed that the current 2.1.0 candidate code does not include these unsupported namespaces in the <ead> element in EAD documents created by AtoM.

#16 Updated by David Juhasz almost 8 years ago

  • Status changed from Feedback to QA/Review
  • Assignee changed from David Juhasz to Sarah Romkey

I imported the this-is-title-proper-rad;ead.xml successfully, without warning, using the latest 2.1.0 candidate code (https://github.com/artefactual/atom/commit/26b294609e73d94ef77b50eea5d7d387a71a8b6a).

#17 Updated by David Juhasz almost 8 years ago

  • Tested version 2.1 added

#18 Updated by David Juhasz almost 8 years ago

Note: I tested with a local copy of AtoM, using the following environment:

$ lsb_release -a | grep Description
No LSB modules are available.
Description:    Ubuntu 12.04.4 LTS

$ php -v
PHP 5.5.11-3+deb.sury.org~precise+1 (cli) (built: Apr 23 2014 12:23:08) 
Copyright (c) 1997-2014 The PHP Group
Zend Engine v2.5.0, Copyright (c) 1998-2014 Zend Technologies
    with Zend OPcache v7.0.4-dev, Copyright (c) 1999-2014, by Zend Technologies

$ mysql -u root -p -e "SHOW VARIABLES LIKE 'version';" 
Enter password: 
+---------------+-------------------------+
| Variable_name | Value                   |
+---------------+-------------------------+
| version       | 5.5.36-34.2-648.precise |
+---------------+-------------------------+

#19 Updated by Dan Gillean almost 8 years ago

Still getting this issue in 2x.test. Exported a file from there, pretty printed the output, then reimported. Same warnings as ever:

Will try in my local environment as well, with the same EAD file.

#20 Updated by Dan Gillean almost 8 years ago

Same warnings reproduced with the above EAD file in my local environment, after doing git pull --rebase on qa/2.1.x.

So, I don't think it's a deployment issue. Problem is still there.

#21 Updated by David Juhasz almost 8 years ago

When I use http://lcweb2.loc.gov/xmlcommon/dtds/ead2002/ead.dtd in my web browser I get a "Weekend Maintenance" page (attached). I'm guessing the page went up after EST business hours today?

#22 Updated by David Juhasz almost 8 years ago

  • Status changed from Feedback to QA/Review
  • Assignee changed from David Juhasz to Dan Gillean

#23 Updated by Dan Gillean over 7 years ago

  • Status changed from QA/Review to Feedback
  • Assignee changed from Dan Gillean to David Juhasz

Checked that LoC and specifically the EAD DTD is back in working order (http://lcweb2.loc.gov/xmlcommon/dtds/ead2002/ead.dtd).

Imported the pierian-club-dundas-fonds file (attached above) which has no namespace attributes in the <ead> element.

Still getting the same 2 warnings:

libxml error 1549 on line 0 in input file: failed to load external entity "http://lcweb2.loc.gov/xmlcommon/dtds/ead2002/ead.dtd" 
libxml error 517 on line 0 in input file: Could not load the external subset "http://lcweb2.loc.gov/xmlcommon/dtds/ead2002/ead.dtd" 

Tested on my local machine as well after doing a git pull --rebase against qa/2.1.x to make sure I have the latest code, and to ensure that it is not a deployment issue in 2x.test - same warnings.

#24 Updated by Mike Gale over 7 years ago

I took another look at this tonight on my own time since this bug was "challenging". Still nothing conclusive. However, I imported XML files (no unsupported namespaces, had warnings on my dev box before) on the following platforms, and got mixed results:

- 2x.test.artefactual.com (Ubuntu 12.04, nginx, php 5.4): warnings present
- My office dev machine (Xubuntu 14.04 upgraded from Ubuntu 12.04, nginx, php 5.5): warnings present
- VM at home, (Mint 14 "Nadia", based on Ubuntu 12.04, apache2, php5.4): warnings missing
- VM at home, (Xubuntu 14.04 vanilla, nginx, php5.5): warnings missing

I have no idea why the warnings are present on some systems, and on others they're not present at all. It makes me suspect (just a theory) there's something different between systems in terms of their environments though, even if it doesn't seem to be Ubuntu version or PHP version. Maybe a setting in PHP somewhere, maybe a certain library version in use has the warning and another doesn't, or maybe I'm completely missing something. :)

I also checked an XML file that produced the warning a short time ago against the `xmllint` tool, and it passed with flying colours. So it isn't a problem of the DTD being invalid or the site hosting it being down or anything.

David: Maybe we could try that no-warnings XML file you had in comment 16 against 2x.test.artefactual.com to see if it gets warnings. If it does, it would support my findings & would seem likely this isn't a code/XML document issue.

#25 Updated by Mike Gale over 7 years ago

I found the issue! It was a PHP setting after all. http://php.net/manual/en/filesystem.configuration.php
If the option "allow_url_fopen" is set to OFF, the DTD errors will fire; if it is set to ON, the import will complete without the errors. Presumably this is because the validation code for libxml in PHP is using fopen on the DTD URL.

I'll let David or Jesús decide on the next course of action to take; there may be security implications for this option etc. Note we're instructing users to disable this option in our docs: https://www.accesstomemory.org/en/docs/2.0/admin-manual/installation/linux/#php

So as it stands, if users follow our on-site installation instructions, they'll be setting it up to throw the warnings.

cheers
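To check which way a given server is configured, the setting can be read out of its php.ini. This is a rough sketch only: it assumes the last uncommented allow_url_fopen line in the file wins, falls back to PHP's documented default (enabled) when the directive is absent, and ignores overrides made elsewhere (for example in a php-fpm pool file). The function name is made up for illustration.

```python
import re

# Uncommented "allow_url_fopen = value" lines; lines starting with ";" are comments in php.ini.
SETTING = re.compile(r"^[ \t]*allow_url_fopen[ \t]*=[ \t]*([^;\s]*)", re.MULTILINE)
# Values PHP treats as "on" for boolean directives.
TRUTHY = {"1", "on", "true", "yes"}


def allow_url_fopen_enabled(ini_text, default=True):
    """Return True if the php.ini text leaves allow_url_fopen enabled."""
    values = SETTING.findall(ini_text)
    if not values:
        # Directive absent: assume PHP's default, which is enabled.
        return default
    # Assumption: when the directive is repeated, the last occurrence takes effect.
    return values[-1].strip('"').lower() in TRUTHY
```

If this returns False for the php.ini actually used by the web server, the two DTD warnings discussed in this ticket are to be expected on import.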

#26 Updated by Jesús García Crespo over 7 years ago

Nice, Mike.

I think that it's fine to update our docs with allow_url_fopen = true.
It also makes sense to add a note about this particular finding, maybe in our FAQ and/or the security section?

#27 Updated by Jesús García Crespo over 7 years ago

  • Assignee changed from David Juhasz to Dan Gillean

Dan, MikeG has updated the php5-fpm configuration file shown in the installation instructions, under PHP; allow_url_fopen is now set to on. However, I suspect that a change like that won't be noticed by our users. Dan, do you want to mention this issue in the FAQ?

I've added a mention of the security impact of PHP settings like allow_url_fopen to our docs. I think we were using that allow_url_fopen setting for our demo sites and it somehow ended up in the docs, but it doesn't make sense to disable it at the configuration level. That is not the right layer to solve the problem: there's no analogue in other environments like Django, and it only seems to make sense if you are running untrusted code (e.g. you are hosting customers' PHP applications). Dan, feel free to reword it: https://github.com/artefactual/atom-docs/commit/78284fad38d40502549be86ffcfb9372f8903acc.

David Juhasz is analyzing a solution that avoids hitting remotes for retrieval of the DTDs, but I don't know if that's going to make it in AtoM 2.1. Should we just bump this to 2.2?
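One general way to avoid hitting remote servers for DTDs is to map the DOCTYPE's public identifier to a copy shipped with the application. The sketch below only illustrates that idea with Python's standard SAX resolver interface; it is not the solution being analyzed for AtoM, and the class name and the local path data/dtd/ead.dtd are hypothetical.

```python
from xml.sax.handler import EntityResolver

# Public identifier used in the DOCTYPE of the EAD 2002 files in this ticket.
EAD_2002_PUBLIC_ID = (
    "+//ISBN 1-931666-00-8//DTD ead.dtd "
    "(Encoded Archival Description (EAD) Version 2002)//EN"
)


class LocalDtdResolver(EntityResolver):
    """Point known public identifiers at local files instead of remote URLs."""

    def __init__(self, local_copies):
        # Mapping of public identifier -> local path (hypothetical paths).
        self.local_copies = local_copies

    def resolveEntity(self, publicId, systemId):
        # Fall back to the original system identifier for anything not mapped.
        return self.local_copies.get(publicId, systemId)
```

A resolver like this would be registered on a SAX parser with setEntityResolver(); whether a given parser actually consults it for the external DTD subset depends on how that parser is configured.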

#28 Updated by Dan Gillean over 7 years ago

  • Status changed from Feedback to Verified

Actually, going to mark this verified as of 2.1. The general fix for this issue is to set "allow_url_fopen" to ON in your PHP configuration file - you can read the updates to our documentation from above to learn more.

Our long-term solution to improve this kind of handling in the application (and hopefully resolve #6877 as well) is outlined in #2787 - and we currently have a client who is considering sponsoring this fix for 2.2. Further attempts to improve this handling can go in that ticket, or in a new ticket.

Nice to see this finally figured out!
