launchpad-reviewers team mailing list archive
Message #32899
[Merge] ~ilkeremrekoc/launchpad:chunk-librarian-gc into launchpad:master
İlker Emre Koç has proposed merging ~ilkeremrekoc/launchpad:chunk-librarian-gc into launchpad:master.
Commit message:
Change librarian-gc to query database in chunks
Requested reviews:
Launchpad code reviewers (launchpad-reviewers)
For more details, see:
https://code.launchpad.net/~ilkeremrekoc/launchpad/+git/launchpad/+merge/491294
librarian-gc processes are being killed because the single query over
the whole LibraryFileContent table takes too long. This change makes
the query use keyset pagination, fetching the ids in chunks of
1 million rows instead.
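
For readers unfamiliar with keyset pagination, here is a minimal sketch of the idea. It is an illustration only, not the code in this branch: it uses an in-memory sqlite3 database in place of the production PostgreSQL connection, and the helper name iter_ids_in_chunks, the sample ids, and the tiny chunk size are all hypothetical.

```python
import sqlite3


def iter_ids_in_chunks(con, chunk_size):
    """Yield ids in ascending order, issuing one bounded query per chunk.

    Hypothetical helper for illustration; assumes ids are positive integers.
    """
    last_id = 0
    while True:
        # Each query resumes strictly after the last id already seen,
        # so no chunk ever rescans rows that were returned earlier.
        rows = con.execute(
            "SELECT id FROM LibraryFileContent "
            "WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, chunk_size),
        ).fetchall()
        if not rows:
            return  # nothing beyond last_id: the table is exhausted
        for (row_id,) in rows:
            yield row_id
        last_id = rows[-1][0]


# Stand-in table with gaps in the id sequence (sample data, not real ids).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE LibraryFileContent (id INTEGER PRIMARY KEY)")
con.executemany(
    "INSERT INTO LibraryFileContent (id) VALUES (?)",
    [(i,) for i in (1, 2, 3, 5, 8, 13, 21)],
)

ids = list(iter_ids_in_chunks(con, chunk_size=3))
```

Because each query filters on the last id seen rather than using OFFSET, the cost of a chunk should not grow with how far into the table the scan has progressed, provided id is indexed.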
--
Your team Launchpad code reviewers is requested to review the proposed merge of ~ilkeremrekoc/launchpad:chunk-librarian-gc into launchpad:master.
diff --git a/lib/lp/services/librarianserver/librariangc.py b/lib/lp/services/librarianserver/librariangc.py
index bf2c4cf..2966295 100644
--- a/lib/lp/services/librarianserver/librariangc.py
+++ b/lib/lp/services/librarianserver/librariangc.py
@@ -34,6 +34,7 @@ debug = False
 STREAM_CHUNK_SIZE = 64 * 1024
+DATABASE_CHUNK_SIZE = 1_000_000
 def file_exists(content_id):
@@ -696,24 +697,45 @@ def delete_unwanted_disk_files(con):
     swift_enabled = getFeatureFlag("librarian.swift.enabled") or False
-    cur = con.cursor(name="librariangc_disk_lfcs")
+    cur = con.cursor()  # nameless cursor for multiple executions
     # Calculate all stored LibraryFileContent ids that we want to keep.
     # Results are ordered so we don't have to suck them all in at once.
-    cur.execute(
+    def get_next_wanted_content_id_generator():
         """
-        SELECT id FROM LibraryFileContent ORDER BY id
+        Generator that yields IDs from LibraryFileContent in chunks using
+        keyset pagination. Fetches rows in batches of DATABASE_CHUNK_SIZE for
+        efficiency and stops when done.
+
+        yields: int: Next ID from the table.
         """
-    )
-    content_id_iter = iter(cur)
+
+        last_id = 0
+        while True:
+            cur.execute(
+                """
+                SELECT id
+                FROM LibraryFileContent
+                WHERE id > %s
+                ORDER BY id
+                LIMIT %s
+                """,
+                (last_id, DATABASE_CHUNK_SIZE),
+            )
+
+            count = 0
+            for row in cur:  # stream rows
+                yield row[0]
+                last_id = row[0]
+                count += 1
+
+            if count == 0:
+                break  # no more rows
+
+    content_id_iter = get_next_wanted_content_id_generator()
     def get_next_wanted_content_id():
-        try:
-            result = next(content_id_iter)
-        except StopIteration:
-            return None
-        else:
-            return result[0]
+        return next(content_id_iter, None)
     removed_count = 0
     content_id = next_wanted_content_id = -1