Allow users to upload a PDF finding aid instead of generating one from AtoM's descriptions
|Assignee:||Dan Gillean||% Done:|
|Category:||Information object||Estimated time:||48.00 hours|
|Target version:||Release 2.4.0|
|Google Code Legacy ID:||Tested version:|
In #7462, we added the ability to generate a PDF finding aid in AtoM from the descriptive hierarchy, via an XSLT that transforms the EAD XML generated from the archival unit.
This feature will give users the ability to upload a locally created PDF finding aid, instead of generating one from descriptions in AtoM. Changes will include:
- An option to delete an existing finding aid (whether generated in AtoM or uploaded)
- Note that only one option will be available per descriptive hierarchy - you cannot have a generated PDF and an uploaded one at this time. Future development work could be done to enhance this.
- An option to upload a PDF finding aid, added below the existing option to generate one
- A simple upload page, for adding the uploaded PDF from a local source
- The requirement to delete an existing finding aid before the option to generate or upload a new one is available in the user interface.
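The "one finding aid per hierarchy" rule above can be illustrated with a minimal sketch. The function name and the string values are hypothetical and are not taken from the AtoM codebase; the sketch only shows that generate and upload are offered when nothing exists, and that delete has to come first otherwise.

```python
from typing import List, Optional


def offered_actions(existing_finding_aid: Optional[str]) -> List[str]:
    """Return the actions the UI would offer for a descriptive hierarchy.

    existing_finding_aid is None when no finding aid exists, otherwise
    "generated" or "uploaded" (hypothetical labels for this sketch).
    """
    if existing_finding_aid is None:
        # Nothing to delete yet, so both ways of adding one are offered.
        return ["generate", "upload"]
    if existing_finding_aid in ("generated", "uploaded"):
        # Only one finding aid per hierarchy: it must be deleted first.
        return ["delete"]
    raise ValueError("unexpected finding aid type: " + existing_finding_aid)
```

After a delete, the state goes back to `None` and both options become available again.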
We will be using the job scheduler to implement the upload, to avoid timeouts. Developer tasks for this include:
- Pass to arGenerateFindingAidJob new parameters (location of user's upload, workflow type option) via Gearman
- Pass parameters for upload and generate finding aid workflows
- Refactor arGenerateFindingAidJob::runJob() to support the different workflows.
- Add upload workflow logic: store uploaded PDF and call PDF fulltext indexing
- Index uploaded PDF
- Update search index mapping and invoke pdf2text
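As a rough illustration of the developer tasks above, the sketch below shows one way a single job could dispatch on a workflow parameter. It is written in Python rather than AtoM's PHP, performs no real work (each step is only described as a string), and every name in it (`FindingAidJobParams`, `run_job`, `WORKFLOWS`) is an assumption made for this example, not the actual job code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class FindingAidJobParams:
    """Parameters handed to the job (hypothetical names)."""
    description_slug: str
    workflow: str                      # "generate" or "upload"
    upload_path: Optional[str] = None  # only meaningful for "upload"


def _generate(params: FindingAidJobParams) -> List[str]:
    # Steps of the existing workflow, described rather than executed.
    return [
        "export EAD XML for " + params.description_slug,
        "transform EAD to a finding aid via XSLT",
    ]


def _upload(params: FindingAidJobParams) -> List[str]:
    if not params.upload_path:
        raise ValueError("upload workflow needs the location of the uploaded file")
    # Steps of the new workflow, described rather than executed.
    return [
        "store uploaded PDF from " + params.upload_path,
        "extract text with pdf2text",
        "update search index with the transcript",
    ]


WORKFLOWS: Dict[str, Callable[[FindingAidJobParams], List[str]]] = {
    "generate": _generate,
    "upload": _upload,
}


def run_job(params: FindingAidJobParams) -> List[str]:
    """Pick the handler for the requested workflow and return its steps."""
    try:
        handler = WORKFLOWS[params.workflow]
    except KeyError:
        raise ValueError("unknown workflow: " + params.workflow)
    return handler(params)
```

Keeping the workflows in a lookup table means a further workflow could be added without touching `run_job` itself.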
#5 Updated by Jesús García Crespo over 4 years ago
- Status changed from Code Review to Feedback
- Assignee changed from Jesús García Crespo to José Raddaoui Marín
Looking good but I've added some comments in the PR.
Also... I tried to upload and delete and it works! But when I tried to generate it I got this error:
atom_worker_1 | 2016-04-08 22:52:53 > Job 434 "arFindingAidJob": Generating finding aid (ccha-2sta-k2mm)...
atom_worker_1 | 2016-04-08 22:53:01 > Job 434 "arFindingAidJob": Running: java -jar '/atom/src/lib/task/pdf/saxon9he.jar' -s:'/tmp/phpgLoOMc' -xsl:'/atom/src/lib/task/pdf/ead-pdf-inventory-summary.xsl' -o:'/tmp/phplaicjM' 2>&1
atom_worker_1 | 2016-04-08 22:53:08 > Job 434 "arFindingAidJob": Transforming the EAD with Saxon has failed.
atom_worker_1 | 2016-04-08 22:53:08 > Job 434 "arFindingAidJob": ERROR(SAXON): Error
atom_worker_1 | 2016-04-08 22:53:08 > Job 434 "arFindingAidJob": ERROR(SAXON): I/O error reported by XML parser processing file:/tmp/phpgLoOMc: lcweb2.loc.gov
atom_worker_1 | 2016-04-08 22:53:08 > Job 434 "arFindingAidJob": ERROR(SAXON): Transformation failed: Run-time errors were reported
atom_worker_1 | 2016-04-08 22:53:18 > Job 434 "arFindingAidJob": Job finished.
atom_worker_1 | 2016-04-08 23:10:58 > Job 435 "arFindingAidJob": Uploading finding aid (ccha-2sta-k2mm)...
atom_worker_1 | 2016-04-08 23:10:58 > Job 435 "arFindingAidJob": Finding aid uploaded successfully: /atom/src/downloads/ccha-2sta-k2mm.pdf
atom_worker_1 | 2016-04-08 23:10:58 > Job 435 "arFindingAidJob": Obtaining finding aid transcript...
atom_worker_1 | 2016-04-08 23:10:58 > Job 435 "arFindingAidJob": Obtaining the transcript has failed.
atom_worker_1 | 2016-04-08 23:10:58 > Job 435 "arFindingAidJob": Job finished.
This is probably unrelated to the work that you are doing. Most likely this is a problem in my Docker setup but I wanted to see if you can reproduce before I start digging. Have you seen that error before? It seems to come from saxon:
I/O error reported by XML parser processing.
#6 Updated by Jesús García Crespo over 4 years ago
I think the Saxon problem is solved. It looks like it was a connectivity issue when it tried to access lcweb2.loc.gov?
When the PDF is deleted the status is "File missing". Is that on purpose?
When a new description is created the status is "Unknown", which is also kind of misleading.
Do you think we should put more thought into these statuses? Dan, what do you think?
#7 Updated by José Raddaoui Marín over 4 years ago
Thanks Sevein! Sorry, this is the first time I've seen that Saxon issue.
I was talking with Dan last Friday about some improvements. I'll implement those alongside your feedback in the PR and the changes needed for #9655, and then I'll assign the ticket back to you for another review.
I'll wait to hear Dan's suggestions about the statuses.
#8 Updated by José Raddaoui Marín over 4 years ago
- Status changed from Feedback to Code Review
- Assignee changed from José Raddaoui Marín to Jesús García Crespo
Hi Sevein, about the partial updates to the ES documents: there are two ways to do it, scripts and partial documents. Both methods reindex the whole document after the script is executed or the partial document is merged, so there shouldn't be a big difference in performance, and both work the same way in almost every version. The differences are:
Scripts:
- Require different configurations depending on the ES version.
- The configuration changes have to be set in the global elasticsearch.yml config, so they can't be set only for the AtoM index.
- Instead of adding those configurations, the scripts can be added to the global ES 'config/scripts' folder, but that will require copying files from the AtoM folder on each deploy.
- Allow us to do more complex updates, for example deleting a field from a document.
Partial documents:
- Don't require any configuration changes or moving files.
- Fields can't be deleted from a document, but they can be set to 'null'.
We don't usually add 'null' values to the ES documents (we just omit the field), and keeping that behaviour would require using scripts for the updates. To use partial documents instead, we could set 'null' values by default in the fields that need them. For example, if a 'default_value' is not set in the index configuration, 'null' values and omitted fields behave the same for the 'missing' and 'exists' filters, and any other changes that may be needed should be easy fixes. I think that may be the better option, as it avoids all the deployment issues that come with using scripts.
More info in:
I've added the document updates using partial data to the PR, setting the FA transcript to 'null' by default. I've also tested with scripts and script files, and both update methods can be implemented using Elastica, so it should be easy to change if you think that scripts are better.
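The trade-off described in this comment can be sketched without a running cluster. The two helpers below only mimic, in plain Python, what each update style does to a stored document; they are not Elasticsearch or Elastica calls. The field names (`findingAid`, `transcript`, `status`) are hypothetical. The request body shown in the last lines uses the `doc` key of the Elasticsearch update API.

```python
import copy
import json
from typing import Any, Dict


def merge_partial(source: Dict[str, Any], partial: Dict[str, Any]) -> Dict[str, Any]:
    """Mimic a partial-document update: nested objects are merged
    recursively, any other value replaces the stored one."""
    merged = copy.deepcopy(source)
    for key, value in partial.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_partial(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def remove_field(source: Dict[str, Any], dotted_path: str) -> Dict[str, Any]:
    """Mimic what a script update can do: drop a field entirely."""
    result = copy.deepcopy(source)
    *parents, leaf = dotted_path.split(".")
    node = result
    for name in parents:
        node = node.get(name)
        if not isinstance(node, dict):
            return result  # path does not exist, nothing to remove
    node.pop(leaf, None)
    return result


stored = {"title": "Example fonds", "findingAid": {"transcript": "old text", "status": 2}}

# Partial document: the field stays in the document, set to null.
partial_body = {"doc": {"findingAid": {"transcript": None}}}
after_partial = merge_partial(stored, partial_body["doc"])

# Script-style update: the field disappears from the document.
after_script = remove_field(stored, "findingAid.transcript")

print(json.dumps(partial_body))
```

The difference between `after_partial` (key present, value `None`) and `after_script` (key absent) is the reason the comment proposes defaulting the transcript to 'null' when partial documents are used.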
#10 Updated by Dan Gillean over 4 years ago
Regarding JGC's comments @6:
To be fair to Radda, the finding aid statuses are old - yes, we should have put more time and consideration into them, but that is a result of not having enough time on the original feature, not this work. We have the status messages documented as such:
#11 Updated by Mike Gale over 4 years ago
The "file missing" status was meant to indicate that the database says a finding aid exists, but AtoM can't see it on the file system. So if this status is coming up in other situations (e.g. a user changing the PDF settings), that's a bug. The "unknown" status was meant to indicate that there aren't any database entries indicating a PDF exists. Perhaps this naming choice wasn't the best one; something like "None" might be better.
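The intended meaning of the two statuses can be written down as a small decision function. This is a sketch only: the function name, the `recorded_path` parameter and the "Available" label are hypothetical, and the real status list in AtoM has more values than the two discussed here.

```python
import os
from typing import Callable, Optional


def finding_aid_status(
    recorded_path: Optional[str],
    file_exists: Callable[[str], bool] = os.path.isfile,
) -> str:
    """Derive a status from the database record and the file system.

    recorded_path is the path stored in the database, or None when the
    database has no entry for a finding aid.
    """
    if recorded_path is None:
        # No database entry at all: the status discussed as "Unknown"
        # (with "None" suggested as a clearer name).
        return "Unknown"
    if not file_exists(recorded_path):
        # The database says a finding aid exists, but it is not on disk.
        return "File missing"
    # Placeholder label for "record and file both present".
    return "Available"
```

Passing `file_exists` as a parameter keeps the check testable without touching the real file system.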
#13 Updated by José Raddaoui Marín about 4 years ago
- Status changed from Code Review to QA/Review
- Assignee changed from Jesús García Crespo to Dan Gillean
- Target version deleted (
This feature and #9655 are ready for QA/review. As we don't have a 'qa/2.4.x' branch yet, the work is in a dev branch, 'dev/issue-9627', which is based on 'qa/2.3.x' at the moment. It doesn't require an 'sql-upgrade' if the database is at the current 'qa/2.3.x' version, but it does require clearing all caches (restart php-fpm), rebuilding the search index and restarting the AtoM worker.
About the AtoM worker, Sevein pointed out in the PR that we need to document the filename change for the 'arGenerateFindingAidJob' class to 'arFindingAidJob'. Sevein's notes in the PR:
The reason why I was thinking that changing the name of arGenerateFindingAidJob may not be a good idea is because that name is used from configuration files not tracked by git, e.g. we track config/gearman.yml but the user may override that from apps/qubit/config/gearman.yml. If you want to go ahead with the new name make sure that the name change is documented in the release notes (put a note on the ticket and mark as requires documentation?). Thanks!
Some notes about the upload form:
- It uses a MIME type validator based on the finding aid format setting: only PDF files are accepted if the setting is PDF, and only RTF files if it is RTF.
- It uses the accept attribute to suggest the MIME type to the file browser. In my tests, PDF works fine in Chrome and Mozilla, but RTF doesn't work in Chrome. I was using W10 as the OS.
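A minimal sketch of the two checks described in these notes follows. It assumes the validator can be approximated by looking at the leading bytes of the file (`%PDF-` for PDF, `{\rtf` for RTF); the actual AtoM validator may determine the MIME type differently. The dictionary and function names are hypothetical.

```python
# MIME type suggested to the browser for each finding aid format setting.
MIME_BY_FORMAT = {
    "pdf": "application/pdf",
    "rtf": "application/rtf",
}

# Leading bytes that identify each format.
SIGNATURE_BY_FORMAT = {
    "pdf": b"%PDF-",
    "rtf": b"{\\rtf",
}


def accept_attribute(format_setting: str) -> str:
    """Value for the file input's accept attribute (a hint only;
    browsers may ignore it, so server-side validation is still needed)."""
    return MIME_BY_FORMAT[format_setting]


def is_valid_upload(format_setting: str, head: bytes) -> bool:
    """Check that the first bytes of the upload match the configured format."""
    return head.startswith(SIGNATURE_BY_FORMAT[format_setting])
```

Because the accept attribute is only a suggestion to the browser, as the RTF behaviour in Chrome noted above shows, the server-side check is the one that decides whether a file is accepted.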
Also, about the finding aid format: before, if that setting was changed, the finding aid status showed 'File missing' for finding aids generated in the other format. Now you should be able to download and delete that finding aid, and the format only matters for the upload and generate processes.
There are other changes in the finding aid status to better reflect the current options.
#15 Updated by José Raddaoui Marín about 4 years ago
- We're now checking the user's update permissions over the top-level description to show the generate, upload and delete actions.
- Uploads are allowed over draft descriptions.
- The generate option is only allowed for drafts if the public finding aid setting is set to no.
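The three rules above can be combined into one small function. This is a sketch under stated assumptions: the names are hypothetical, and allowing delete on drafts is an assumption, since the notes above only mention upload and generate for drafts.

```python
from typing import Set


def permitted_actions(
    can_update_top_level: bool,
    is_draft: bool,
    public_finding_aid: bool,
) -> Set[str]:
    """Return which of generate, upload and delete are shown to the user."""
    if not can_update_top_level:
        # Without update permission on the top-level description,
        # none of the three actions is shown.
        return set()
    # Uploads are allowed on drafts; delete on drafts is assumed here.
    actions = {"upload", "delete"}
    # Generating is allowed for published descriptions, and for drafts
    # only when the public finding aid setting is "no".
    if not is_draft or not public_finding_aid:
        actions.add("generate")
    return actions
```

For example, `permitted_actions(True, True, True)` leaves out "generate", because the description is a draft and the public finding aid setting is enabled.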
#20 Updated by Dan Gillean over 3 years ago
- Requires documentation deleted (
Documentation updated for 2.4 in https://github.com/artefactual/atom-docs/commit/a7d6429783c54100a186c6286b59528f47256080