Improve CSV Import term creation logic performance
|Assignee:||Corinne Rogers||% Done:|
|Category:||CSV import||Estimated time:||20.00 hours|
|Target version:||Release 2.5.0|
|Google Code Legacy ID:||Tested version:||2.5|
There is code in the csv:import logic that will check the database to see if a given description import spreadsheet term already exists, and if not it will create it.
Currently this logic runs once for each individual term in the CSV cell e.g. place terms. This means for every term in a cell, it will fetch all place terms in the database, check if it exists, add it if necessary, then throw away the result set, and then repeat for the next place term in the row. This could be changed so that all place terms (for example) are fetched only once per CSV row so that they can be reused. In the case of a recent import, there were > 200K place terms, and multiple place terms per row.
Change the logic so that it retrieves all existing terms for a taxonomy once only, while still ensuring that terms that do not yet exist in the database are still created.
Resulting functionality should be exactly the same as before:
- new terms being imported should be added to the database if they do not already exist
#1 Updated by Steve Breker 3 months ago
Start of logic where terms are checked and added:
Term query that grabs all term types of a given taxonomy (e.g. places):
Similar work was completed recently for notes:
#4 Updated by Steve Breker 3 months ago
This enhancement is ready for code review.
I have tested to ensure that new terms are not duplicated, and that calls to this function can be made both as a single term string (the old way) or be passed as a list as an array.
Benchmarking notes using the PAS database (All times are in seconds):
What I have found is that the original query had performance problems on the database side - all terms in the taxonomy had to be retrieved and then filtered using the term_i18n.name. This process was being performed for every term contained in the import spreadsheet and was not scaling for large databases like PAS where certain taxonomies contain a large number of terms. This change pulls all terms in the taxonomy back to the csv:import task once, then uses that data to find or create the new terms being added in the spreadsheet.
In testing, this reduces the time to load the PAS homestead data by about 30-40%.
#6 Updated by Corinne Rogers 3 months ago
QA review - taxonomies compared between demo sites (1) http://pas.datamigration.accesstomemory.org and (2) http://pas2.datamigration.accesstomemory.org. In many cases terms in (1) were subsets of terms in (2) but terms in (1) were all found in (2); order was sometimes different when viewing "Most recent" but otherwise all appears as expected; no duplicates; links to descriptions active when present; translations available when present.
#8 Updated by Steve Breker 3 months ago
- Status changed from In progress to Code Review
- Assignee changed from Steve Breker to Mike Cantelon
I have modified the code to ensure the order that the terms are found/added matches the original logic. This impacts the order in which the relations are created and ensures that the terms on a description are displayed in the order in which they occurred in the original CSV which was the original behaviour before this performance enhancement.