Task #12826

Improve CSV Import term creation logic performance

Added by Steve Breker 3 months ago. Updated 3 months ago.

Status:VerifiedStart date:02/18/2019
Priority:MediumDue date:
Assignee:Corinne Rogers% Done:

0%

Category:CSV importEstimated time:20.00 hours
Target version:Release 2.5.0
Google Code Legacy ID: Tested version:2.5
Sponsored:Yes Requires documentation:No

Description

There is code in the csv:import logic that will check the database to see if a given description import spreadsheet term already exists, and if not it will create it.

Currently this logic runs once for each individual term in the CSV cell e.g. place terms. This means for every term in a cell, it will fetch all place terms in the database, check if it exists, add it if necessary, then throw away the result set, and then repeat for the next place term in the row. This could be changed so that all place terms (for example) are fetched only once per CSV row so that they can be reused. In the case of a recent import, there were > 200K place terms, and multiple place terms per row.

Change the logic so that it retrieves all existing terms for a taxonomy once only, while still ensuring that terms that do not yet exist in the database are still created.

Resulting functionality should be exactly the same as before:
- new terms being imported should be added to the database if they do not already exist

History

#3 Updated by Dan Gillean 3 months ago

  • Tracker changed from Bug to Task
  • Category set to CSV import
  • Assignee set to Steve Breker
  • Target version set to Release 2.5.0
  • Estimated time set to 20.00
  • Sponsored changed from No to Yes
  • Requires documentation set to No

#4 Updated by Steve Breker 3 months ago

This enhancement is ready for code review.

I have tested to ensure that new terms are not duplicated, and that calls to this function can be made both as a single term string (the old way) or be passed as a list as an array.

Benchmarking notes using the PAS database (All times are in seconds):
https://docs.google.com/spreadsheets/d/1d6SKDHDLfIh-jzblmittnwhlPwDTcPH4LWZ31b94aZE/edit?usp=sharing

What I have found is that the original query had performance problems on the database side - all terms in the taxonomy had to be retrieved and then filtered using the term_i18n.name. This process was being performed for every term contained in the import spreadsheet and was not scaling for large databases like PAS where certain taxonomies contain a large number of terms. This change pulls all terms in the taxonomy back to the csv:import task once, then uses that data to find or create the new terms being added in the spreadsheet.

In testing, this reduces the time to load the PAS homestead data by about 30-40%.

https://github.com/artefactual/atom/pull/844

#5 Updated by Steve Breker 3 months ago

  • Status changed from New to QA/Review
  • Assignee changed from Steve Breker to Dan Gillean

CR completed by Mike C.

Ready for QA.

We can check the demo data on the PAS server ensuring that terms imported the same, and in the same order as on the PAS2 server.

#6 Updated by Corinne Rogers 3 months ago

QA review - taxonomies compared between demo sites (1) http://pas.datamigration.accesstomemory.org and (2) http://pas2.datamigration.accesstomemory.org. In many cases terms in (1) were subsets of terms in (2) but terms in (1) were all found in (2); order was sometimes different when viewing "Most recent" but otherwise all appears as expected; no duplicates; links to descriptions active when present; translations available when present.

#7 Updated by Corinne Rogers 3 months ago

  • Status changed from QA/Review to In progress
  • Assignee changed from Dan Gillean to Steve Breker

Back to Steve to fix issue with order of terms

#8 Updated by Steve Breker 3 months ago

  • Status changed from In progress to Code Review
  • Assignee changed from Steve Breker to Mike Cantelon

I have modified the code to ensure the order that the terms are found/added matches the original logic. This impacts the order in which the relations are created and ensures that the terms on a description are displayed in the order in which they occurred in the original CSV which was the original behaviour before this performance enhancement.

https://github.com/artefactual/atom/pull/849

#9 Updated by Mike Cantelon 3 months ago

  • Status changed from Code Review to Feedback
  • Assignee changed from Mike Cantelon to Steve Breker

Looks good to me!

#10 Updated by Steve Breker 3 months ago

  • Status changed from Feedback to QA/Review
  • Assignee changed from Steve Breker to Corinne Rogers

Ready for QA

#11 Updated by Corinne Rogers 3 months ago

Spot-checked several descriptions and the access terms are in the same order in both test instances.

#12 Updated by Dan Gillean 3 months ago

  • Status changed from QA/Review to Verified

Also available in: Atom PDF