Feature #8158

Improve OAI resumptionToken implementation

Added by Dan Gillean about 7 years ago. Updated about 7 years ago.

Status: Verified
Start date: 03/27/2015
Priority: Medium
Due date:
Assignee: Dan Gillean
% Done: 0%
Category: OAI-PMH
Target version: Release 2.2.0
Google Code Legacy ID:
Tested version:
Sponsored: No
Requires documentation:

Description

Currently, a call to list identifiers can be expressed in OAI like so:

http://example-atom-site.com/;oai?verb=ListIdentifiers&metadataPrefix=oai_dc

With most actively used AtoM sites, more than 100 results will be returned, at which point the response terminates and provides a resumptionToken so a harvester can continue the request. Currently, this output looks like so:

<resumptionToken>from=&until=&cursor=100</resumptionToken>

It has been noted in the User Forum that manually copying the resumptionToken and inserting it into the URL for the subsequent request will fail. Instead, the user must manually encode special characters for use in a URL, like so:

http://example-atom-site.com/;oai?verb=ListIdentifiers&metadataPrefix=oai_dc&resumptionToken=cursor%3D100
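For harvesters that build the follow-up request in code, the encoding step can be delegated to a standard library rather than done by hand. The following is a minimal Python sketch, not part of AtoM; the function name resumption_url is hypothetical, and the host is the example one used in this ticket. Note that the OAI-PMH specification treats resumptionToken as an exclusive argument, so this sketch sends it without metadataPrefix.

```python
from urllib.parse import urlencode

# Example host from this ticket; replace with a real AtoM base URL.
BASE_URL = "http://example-atom-site.com/;oai"


def resumption_url(token: str, verb: str = "ListIdentifiers") -> str:
    """Build a follow-up request URL with the token percent-encoded.

    Hypothetical helper for illustration only.
    """
    # urlencode escapes '=', '&' and other reserved characters in values.
    query = urlencode({"verb": verb, "resumptionToken": token})
    return f"{BASE_URL}?{query}"


if __name__ == "__main__":
    print(resumption_url("from=&until=&cursor=100"))
```

With the token shown above, every "=" in the token becomes %3D and every "&" becomes %26 in the resulting URL.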

Looking at the OAI standard, it notes the following:

Before including a resumptionToken in the URL of a subsequent request, a harvester must encode any special characters in it.

However, the Digital Library Federation's Best Practices for OAI Data Provider Implementations and Shareable Metadata notes that:

resumptionTokens in the response should not be URL encoded. This is different from an OAI request, in which resumptionTokens MUST be URL encoded. It is a best practice not to use characters in resumptionTokens that require URL encoding.

All other examples I have seen for resumption tokens are different from AtoM's current implementation. Because of this, we should improve the way AtoM generates resumption tokens so that URL encoding is not required.
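One possible way to produce tokens that need no URL encoding is to serialize the paging state and encode it with a URL-safe alphabet. The Python sketch below is an illustration under that assumption only; it is not AtoM's implementation, and the names encode_token and decode_token are hypothetical.

```python
import base64
import json


def encode_token(state: dict) -> str:
    """Pack paging state into a token using only A-Z, a-z, 0-9, '-' and '_'."""
    raw = json.dumps(state, separators=(",", ":")).encode("utf-8")
    # URL-safe base64 avoids '+' and '/'; stripping '=' padding removes the
    # last character that would otherwise need percent-encoding.
    return base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")


def decode_token(token: str) -> dict:
    """Reverse encode_token, restoring the stripped padding first."""
    padded = token + "=" * (-len(token) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


if __name__ == "__main__":
    token = encode_token({"from": "", "until": "", "cursor": 100})
    print(token)
    print(decode_token(token))
```

A token built this way can be pasted into a request URL unchanged, which matches the DLF recommendation quoted above.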

Some examples from other sites:

Example listed in the DLF Best Practice guidelines, linked above:

 <resumptionToken expirationDate="2005-07-26T16:57:24Z" completeListSize="31979" cursor="4">lr42e519f4d1e58</resumptionToken>

This resumptionToken indicates when it will expire, how many incomplete lists have been returned, and what the complete number of records is for the ListRecords request. As stated above it is a best practice to include both these attributes in a resumptionToken.

From: http://ecommons.usask.ca/oai/request?verb=ListRecords&metadataPrefix=oai_dc (DSpace)

<resumptionToken expirationDate="2015-03-27T21:34:54Z">0001-01-01T00:00:00Z/9999-12-31T23:59:59Z//oai_dc/100</resumptionToken>

Another DSpace example (pretty much same as above): http://repositorio-tematico.up.pt/oaiextended/request?verb=ListRecords&metadataPrefix=oai_dc&set=rap

History

#1 Updated by Mike Cantelon about 7 years ago

  • Status changed from New to QA/Review
  • Assignee changed from Mike Cantelon to Dan Gillean

This was fixed by Mark Triggs and I've merged to qa/2.2.x.

#2 Updated by Dan Gillean about 7 years ago

  • Status changed from QA/Review to Feedback
  • Target version set to Release 2.2.0

Hi Mark,

One thing I've noticed:

The first resumptionToken seems to work perfectly. However, the same resumption token is issued in the next request - meaning the harvester probably won't be able to page past the first 2 sets of results. Shouldn't it issue a different resumptionToken each time, so the harvester can continue on to the next batch of a truncated set?

#3 Updated by Mark Triggs about 7 years ago

Hi Dan,

It definitely should give a different resumption token on each page or it's just not going to work :) Are you sure they're exactly the same? I got tricked a couple of times because they're only different by one character. For example, here are two consecutive ones I saw:

  eyJmcm9tIjoiIiwidW50aWwiOiIiLCJjdXJzb3IiOjE2MDAsIm1ldGFkYXRhUHJlZml4Ijoib2FpX2RjIiwic2V0Ijoib2FpOnZpcnR1YWw6dG9wLWxldmVsLXJlY29yZHMifQ==
  eyJmcm9tIjoiIiwidW50aWwiOiIiLCJjdXJzb3IiOjE3MDAsIm1ldGFkYXRhUHJlZml4Ijoib2FpX2RjIiwic2V0Ijoib2FpOnZpcnR1YWw6dG9wLWxldmVsLXJlY29yZHMifQ==
                                             ^

Which at first blush seem identical, except for the character I marked with a '^'. The nature of the tokens is that they're mostly only changing by one character (offset=100 becomes offset=200), so the resulting base64-encoded version only changes by one character as well. I wasted an embarrassingly long amount of time "debugging" this ;)
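Decoding the two tokens quoted above makes the difference easier to spot than comparing the base64 text by eye. This is a small Python sketch for that purpose; it assumes only that the tokens are standard base64 over a JSON object, which is what the two strings above decode to.

```python
import base64
import json

# The two consecutive tokens quoted in this comment.
TOKEN_A = (
    "eyJmcm9tIjoiIiwidW50aWwiOiIiLCJjdXJzb3IiOjE2MDAsIm1ldGFkYXRhUHJlZml4"
    "Ijoib2FpX2RjIiwic2V0Ijoib2FpOnZpcnR1YWw6dG9wLWxldmVsLXJlY29yZHMifQ=="
)
TOKEN_B = (
    "eyJmcm9tIjoiIiwidW50aWwiOiIiLCJjdXJzb3IiOjE3MDAsIm1ldGFkYXRhUHJlZml4"
    "Ijoib2FpX2RjIiwic2V0Ijoib2FpOnZpcnR1YWw6dG9wLWxldmVsLXJlY29yZHMifQ=="
)


def decode(token: str) -> dict:
    """Decode a base64 token into the JSON object it wraps."""
    return json.loads(base64.b64decode(token))


def diff_tokens(first: str, second: str) -> dict:
    """Return {field: (old, new)} for every field that differs."""
    a, b = decode(first), decode(second)
    return {k: (a.get(k), b.get(k)) for k in sorted(set(a) | set(b)) if a.get(k) != b.get(k)}


if __name__ == "__main__":
    print(decode(TOKEN_A))
    print(diff_tokens(TOKEN_A, TOKEN_B))
```

For these two tokens the only field that changes is the cursor, which moves from 1600 to 1700.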

#4 Updated by Dan Gillean about 7 years ago

  • Status changed from Feedback to Verified

OMFG hahahaha!

You're totally right, Mark. I should have kept going with the tests before updating the ticket. It works. I've iterated through 4 consecutive pages, and I have checked that they are different from the records previously returned, and that the cursor position is iterating in the returned header.

Thanks, this is great!

I'll update the new docs with this functional example, and remove the warning I had there about URL encoding etc.
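The manual check described in this comment (paging through consecutive responses and confirming that each one carries a new token) could be automated roughly as follows. This is a Python sketch under stated assumptions: iter_pages is a hypothetical helper, and fetch is a caller-supplied function that takes the previous token (or None for the first request) and returns the response XML as a string.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Iterator, Optional

# XML namespace used by OAI-PMH 2.0 responses.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def iter_pages(
    fetch: Callable[[Optional[str]], str], max_pages: int = 1000
) -> Iterator[ET.Element]:
    """Yield each parsed response, following resumptionTokens until none remain.

    Raises RuntimeError if the server hands back a token it already issued,
    which would otherwise make the harvester loop forever.
    """
    token: Optional[str] = None
    seen = set()
    for _ in range(max_pages):
        root = ET.fromstring(fetch(token))
        yield root
        node = root.find(f".//{OAI_NS}resumptionToken")
        token = (node.text or "").strip() if node is not None else ""
        if not token:
            return  # an empty or missing token marks the last page
        if token in seen:
            raise RuntimeError("server repeated a resumptionToken")
        seen.add(token)
```

Passing fetch in as a parameter keeps the loop independent of any particular HTTP client and lets it be exercised against canned responses.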

#5 Updated by Dan Gillean about 7 years ago

V. simple docs update here, to correct the resumptionToken example and remove the warning:
