Feature #9627

Allow users to upload a PDF finding aid instead of generating one from AtoM's descriptions

Added by José Raddaoui Marín over 4 years ago. Updated over 3 years ago.

Status: Verified
Start date: 01/28/2016
Priority: Medium
Due date:
Assignee: Dan Gillean
% Done: 0%
Category: Information object
Estimated time: 48.00 hours
Target version: Release 2.4.0
Google Code Legacy ID:
Tested version:
Sponsored: Yes
Requires documentation:

Description

In #7462, we added the ability to generate a PDF finding aid in AtoM from the descriptive hierarchy, via an XSLT that transforms the EAD XML generated from the archival unit.

This feature will give users the ability to upload a locally created PDF finding aid, instead of generating one from descriptions in AtoM. Changes will include:

  • An option to delete an existing finding aid (whether generated in AtoM or uploaded)
    • Note that only one option will be available per descriptive hierarchy - you cannot have a generated PDF and an uploaded one at this time. Future development work could be done to enhance this.
  • An option to upload a PDF finding aid, added below the existing option to generate one
  • A simple upload page, for adding the uploaded PDF from a local source
  • The requirement to delete an existing finding aid before the option to generate or upload a new one is available in the user interface.

We will be using the job scheduler to implement the upload, to avoid timeouts. Developer tasks for this include:

  • Pass to arGenerateFindingAidJob new parameters (location of user's upload, workflow type option) via Gearman
    • Pass parameters for upload and generate finding aid workflows
  • Refactor arGenerateFindingAidJob::runJob() to support the different workflows.
    • Add upload workflow logic: store uploaded PDF and call PDF fulltext indexing
  • Index uploaded PDF
    • Update search index mapping and invoke pdf2text
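
The developer tasks above can be sketched roughly as follows. This is a minimal Python illustration of one job routing two workflows, not the PHP code in arGenerateFindingAidJob::runJob(). The names `FindingAidJobParams`, `run_job`, `generate`, `extract_text` and `index_transcript` are hypothetical, and the two callables stand in for the XSLT generation step and for pdf2text plus the search index update.

```python
import shutil
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Optional


@dataclass
class FindingAidJobParams:
    """Parameters the web request would hand to the worker (hypothetical shape)."""
    slug: str                           # slug of the top-level description
    workflow: str                       # "generate" or "upload"
    upload_path: Optional[Path] = None  # location of the user's upload, if any


def run_job(
    params: FindingAidJobParams,
    downloads_dir: Path,
    generate: Callable[[str, Path], None],
    extract_text: Callable[[Path], str],
    index_transcript: Callable[[str, str], None],
) -> Path:
    """Route the job to the requested workflow and return the stored file path."""
    target = downloads_dir / f"{params.slug}.pdf"

    if params.workflow == "upload":
        if params.upload_path is None:
            raise ValueError("upload workflow requires upload_path")
        # Store the uploaded PDF, then index its full text.
        shutil.copyfile(params.upload_path, target)
        index_transcript(params.slug, extract_text(target))
    elif params.workflow == "generate":
        # Existing behaviour: build the PDF from the descriptive hierarchy.
        generate(params.slug, target)
    else:
        raise ValueError(f"unknown workflow: {params.workflow!r}")

    return target
```

Passing the workflow type as an explicit parameter keeps a single job class responsible for both paths, which matches the refactoring task listed above.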

Related issues

Related to Access to Memory (AtoM) - Bug #9682: Changes in information object slugs break finding aid dow... Verified 04/11/2016
Related to Access to Memory (AtoM) - Feature #9700: Show finding aid links at all levels Verified 01/28/2016
Related to Access to Memory (AtoM) - Task #9787: Update Nginx configuration with downloads location Won't fix 05/03/2016
Related to Access to Memory (AtoM) - Feature #9655: Improvements to search to better support searching indexe... Verified 01/28/2016
Related to AtoM Wishlist - Feature #11048: Command-line task to automatically generate finding aids ... New 04/07/2017

History

#2 Updated by José Raddaoui Marín over 4 years ago

  • Status changed from New to Code Review
  • Assignee changed from José Raddaoui Marín to Nick Wilkinson

Ready for code review in PR 310

#3 Updated by José Raddaoui Marín over 4 years ago

  • Target version deleted (Release 2.3.0)

#4 Updated by Jesús García Crespo over 4 years ago

  • Assignee changed from Nick Wilkinson to Jesús García Crespo

#5 Updated by Jesús García Crespo over 4 years ago

  • Status changed from Code Review to Feedback
  • Assignee changed from Jesús García Crespo to José Raddaoui Marín

Looking good but I've added some comments in the PR.
Also... I tried to upload and delete and it works! But when I tried to generate it I got this error:

atom_worker_1 | 2016-04-08 22:52:53 > Job 434 "arFindingAidJob": Generating finding aid (ccha-2sta-k2mm)...
atom_worker_1 | 2016-04-08 22:53:01 > Job 434 "arFindingAidJob": Running: java -jar '/atom/src/lib/task/pdf/saxon9he.jar' -s:'/tmp/phpgLoOMc' -xsl:'/atom/src/lib/task/pdf/ead-pdf-inventory-summary.xsl' -o:'/tmp/phplaicjM' 2>&1
atom_worker_1 | 2016-04-08 22:53:08 > Job 434 "arFindingAidJob": Transforming the EAD with Saxon has failed.
atom_worker_1 | 2016-04-08 22:53:08 > Job 434 "arFindingAidJob": ERROR(SAXON): Error
atom_worker_1 | 2016-04-08 22:53:08 > Job 434 "arFindingAidJob": ERROR(SAXON):   I/O error reported by XML parser processing file:/tmp/phpgLoOMc: lcweb2.loc.gov
atom_worker_1 | 2016-04-08 22:53:08 > Job 434 "arFindingAidJob": ERROR(SAXON): Transformation failed: Run-time errors were reported
atom_worker_1 | 2016-04-08 22:53:18 > Job 434 "arFindingAidJob": Job finished.
atom_worker_1 | 2016-04-08 23:10:58 > Job 435 "arFindingAidJob": Uploading finding aid (ccha-2sta-k2mm)...
atom_worker_1 | 2016-04-08 23:10:58 > Job 435 "arFindingAidJob": Finding aid uploaded successfully: /atom/src/downloads/ccha-2sta-k2mm.pdf
atom_worker_1 | 2016-04-08 23:10:58 > Job 435 "arFindingAidJob": Obtaining finding aid transcript...
atom_worker_1 | 2016-04-08 23:10:58 > Job 435 "arFindingAidJob": Obtaining the transcript has failed.
atom_worker_1 | 2016-04-08 23:10:58 > Job 435 "arFindingAidJob": Job finished.

This is probably unrelated to the work that you are doing. Most likely this is a problem in my Docker setup but I wanted to see if you can reproduce before I start digging. Have you seen that error before? It seems to come from saxon: I/O error reported by XML parser processing.

#6 Updated by Jesús García Crespo over 4 years ago

I think the Saxon problem is solved. It looks like it was a connectivity issue when it tried to access lcweb2.loc.gov.

When the PDF is deleted the status is "File missing". Is that on purpose?
When a new description is created the status is "Unknown", which is also kind of misleading.
Do you think we should put more thought into these statuses? Dan, what do you think?

#7 Updated by José Raddaoui Marín over 4 years ago

Thanks Sevein! Sorry, this is the first time I've seen that Saxon issue.

I was talking with Dan last Friday about some improvements. I'll implement those alongside your feedback in the PR and the changes needed for #9655, and then I'll assign the ticket back to you for another review.

I'll wait to hear Dan's suggestions about the statuses.

#8 Updated by José Raddaoui Marín over 4 years ago

  • Status changed from Feedback to Code Review
  • Assignee changed from José Raddaoui Marín to Jesús García Crespo

Hi Sevein, about the partial updates to the ES documents: there are two ways to do it, scripts and partial documents. Both methods reindex the whole document after the script is executed or the partial document is merged, so there shouldn't be a big difference in performance, and both work the same way in almost every version. However:

Using scripts:

- Requires different configurations depending on the ES version.
- The configuration changes have to be made in the global elasticsearch.yml config, so they can't be set only for the AtoM index.
- Instead of adding those configurations, the scripts can be added to the global ES 'config/scripts' folder, but that requires copying files from the AtoM folder on each deploy.
- Allows us to do more complex updates, for example deleting a field from a document.

Using partial documents:

- Doesn't require any configuration changes or moving files.
- Fields can't be deleted from a document, but they can be set to 'null'.

We don't usually add 'null' values to the ES documents (we omit the field instead), and keeping that convention would require scripts for the updates. To use partial documents, we could instead set 'null' values by default in the fields that need them. For example, if a 'default_value' is not set in the index configuration, 'null' values and omitted fields behave the same for the 'missing' and 'exists' filters, and any other changes that may be needed should be easy fixes. I think that may be the better option, as it avoids all the deployment issues that come with scripts.
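
To make the comparison concrete, here is a small Python sketch of the two Update API request bodies and of how a partial document is merged. It only builds and merges dictionaries locally and never contacts a cluster. The field path `findingAid.transcript` and the helper names are assumptions for illustration, and the exact script syntax and settings depend on the ES version, as noted above.

```python
from copy import deepcopy


def partial_update_body(transcript):
    """Body for a partial-document update; None is sent as JSON null."""
    return {"doc": {"findingAid": {"transcript": transcript}}}


def script_update_body():
    """Body for a scripted update that removes the field entirely.

    Scripted updates need scripting enabled on the cluster, and the
    accepted body shape differs between ES versions.
    """
    return {"script": 'ctx._source.findingAid.remove("transcript")'}


def merge_partial(source, doc):
    """Local imitation of the merge: objects merge recursively, and other
    values (including None) overwrite what was there."""
    merged = deepcopy(source)
    for key, value in doc.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_partial(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def has_value(source, path):
    """Rough stand-in for an 'exists' filter: a null field and an absent
    field are both treated as missing."""
    node = source
    for part in path.split("."):
        if not isinstance(node, dict) or part not in node:
            return False
        node = node[part]
    return node is not None


existing = {"title": "Example fonds", "findingAid": {"transcript": "old text"}}
cleared = merge_partial(existing, partial_update_body(None)["doc"])
```

After the merge, `cleared` keeps its `title`, and `has_value(cleared, "findingAid.transcript")` is false, the same result as for a document that never had the field.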

More info in:

https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update.html
https://www.elastic.co/guide/en/elasticsearch/guide/current/partial-updates.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-scripting.html
https://github.com/elastic/elasticsearch/issues/5853
https://github.com/elastic/elasticsearch/issues/6418

I've added the document updates using partial data to the PR, setting the FA transcript to 'null' by default. I've also tested with scripts and script files, and both update methods can be implemented using Elastica, so it should be easy to change if you think that scripts are better.

#9 Updated by José Raddaoui Marín over 4 years ago

  • Related to Bug #9682: Changes in information object slugs break finding aid download/delete added

#10 Updated by Dan Gillean over 4 years ago

Regarding JGC's comments @6:

To be fair to Radda, the finding aid statuses are old - yes, we should have put more time and consideration into them, but that is a result of not having enough time on the original feature, not this work. We have the status messages documented as such:

From: https://www.accesstomemory.org/en/docs/2.3/user-manual/reports-printing/print-finding-aid/#generate-finding-aid

#11 Updated by Mike Gale over 4 years ago

Hi,

The "File missing" status was meant to indicate that the database says a finding aid exists, but AtoM can't see it on the file system. So if this status is coming up in other situations (e.g. a user changing the PDF settings), that's a bug. The "Unknown" status was meant to indicate that there aren't any database entries indicating a PDF exists. Perhaps this naming choice wasn't the best one; something like "None" might be better.

regards
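
The intended meaning of the two statuses can be written as a small decision function. This is a Python sketch for discussion only: the enum, the function name and the "Available" label are hypothetical, and the sketch uses the suggested "None" label for what the interface currently shows as "Unknown".

```python
from enum import Enum


class FindingAidStatus(Enum):
    NONE = "None"                  # shown as "Unknown" at the moment
    FILE_MISSING = "File missing"
    AVAILABLE = "Available"        # hypothetical label for the normal case


def finding_aid_status(db_has_entry: bool, file_on_disk: bool) -> FindingAidStatus:
    """Derive the status from the database record and the file system."""
    if not db_has_entry:
        # No database entries say a finding aid exists.
        return FindingAidStatus.NONE
    if not file_on_disk:
        # The database says it exists, but the file cannot be found.
        return FindingAidStatus.FILE_MISSING
    return FindingAidStatus.AVAILABLE
```

Under this reading, a deleted finding aid should end up as "None" rather than "File missing", provided the delete action also clears the database entry.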

#12 Updated by Dan Gillean over 4 years ago

  • Target version set to Release 2.4.0

#13 Updated by José Raddaoui Marín about 4 years ago

  • Status changed from Code Review to QA/Review
  • Assignee changed from Jesús García Crespo to Dan Gillean
  • Target version deleted (Release 2.4.0)

This feature and #9655 are ready for QA/review. As we don't have a 'qa/2.4.x' branch yet, the work is in a dev branch, 'dev/issue-9627', which is based on 'qa/2.3.x' at the moment. It doesn't require an 'sql-upgrade' if the database is at the current 'qa/2.3.x' version, but it does require clearing all caches (restarting php-fpm), rebuilding the search index and restarting the AtoM worker.

About the AtoM worker, Sevein pointed out in the PR that we need to document the filename change for the 'arGenerateFindingAidJob' class to 'arFindingAidJob'. Sevein's notes in the PR:

The reason why I was thinking that changing the name of arGenerateFindingAidJob may not be a good idea is because that name is used from configuration files not tracked by git, e.g. we track config/gearman.yml but the user may override that from apps/qubit/config/gearman.yml. If you want to go ahead with the new name make sure that the name change is documented in the release notes (put a note on the ticket and mark as requires documentation?). Thanks!

Some notes about the upload form:

- It uses a MIME type validator based on the finding aid format setting: only PDF files are accepted if the setting is PDF, and only RTF files if it is RTF.
- It uses the accept attribute to suggest the MIME type to the browser's file picker. In my tests I've noticed that PDF works fine in Chrome and Mozilla, but RTF doesn't work in Chrome. I was using Windows 10 as the OS.
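
As an illustration of validating against the format setting, the Python sketch below maps the setting to the value suggested through the accept attribute and checks the leading bytes of the file. Signature sniffing is a swapped-in technique for this example and is not necessarily how the form's validator works. The function names are hypothetical.

```python
# MIME type suggested to the browser for each finding aid format setting.
MIME_TYPES = {"pdf": "application/pdf", "rtf": "application/rtf"}

# Leading bytes that PDF and RTF files start with.
SIGNATURES = {"pdf": b"%PDF-", "rtf": b"{\\rtf"}


def accept_attribute(fmt: str) -> str:
    """Value for the file input's accept attribute, given the format setting."""
    return MIME_TYPES[fmt]


def is_valid_upload(data: bytes, fmt: str) -> bool:
    """Accept the upload only if its content matches the configured format."""
    return data.startswith(SIGNATURES[fmt])
```

Since the accept attribute is only a hint to the file picker, the server-side check remains the one that enforces the format.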

Also, about the finding aid format. Before, if that setting was changed, the finding aid status showed 'File missing' for finding aids generated in the other format. Now, you should be able to download and delete that finding aid and the format only matters for the upload and generate processes.

There are other changes in the finding aid status to better reflect the current options.

#14 Updated by José Raddaoui Marín about 4 years ago

  • Category set to Information object
  • Target version set to Release 2.4.0

#15 Updated by José Raddaoui Marín about 4 years ago

Other notes:

- We're now checking the user's update permissions on the top-level description before showing the generate, upload and delete actions.
- Uploads are allowed on draft descriptions.
- The generate option is only allowed for drafts if the public finding aid setting is set to 'no'.
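
Combined with the requirement from the description that an existing finding aid must be deleted before a new one can be generated or uploaded, these rules could be expressed as below. This is a Python sketch with hypothetical names, not the template logic from the PR.

```python
def available_actions(
    can_update_top_level: bool,
    is_draft: bool,
    public_finding_aid: bool,
    has_finding_aid: bool,
) -> set:
    """Return which of 'generate', 'upload' and 'delete' to show."""
    if not can_update_top_level:
        # All three actions require update permission on the top-level description.
        return set()
    if has_finding_aid:
        # An existing finding aid must be deleted before adding a new one.
        return {"delete"}
    actions = {"upload"}  # uploads are allowed on drafts too
    if not is_draft or not public_finding_aid:
        # Drafts can only be generated when the public setting is 'no'.
        actions.add("generate")
    return actions
```

For example, a draft description with the public finding aid setting enabled and no existing finding aid would offer only the upload action.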

#16 Updated by Dan Gillean about 4 years ago

  • Related to Feature #9700: Show finding aid links at all levels added

#17 Updated by José Raddaoui Marín about 4 years ago

  • Related to Task #9787: Update Nginx configuration with downloads location added

#18 Updated by Dan Gillean about 4 years ago

  • Status changed from QA/Review to Verified

#20 Updated by Dan Gillean over 3 years ago

  • Requires documentation deleted (Yes)

#21 Updated by Dan Gillean over 3 years ago

  • Related to Feature #9655: Improvements to search to better support searching indexed finding aid text added

#22 Updated by Dan Gillean over 3 years ago

  • Related to Feature #11048: Command-line task to automatically generate finding aids from existing descriptions based on Finding aid settings added
