FastCDC chunking #26

Merged
jmarya merged 5 commits from fastcdc into main 2025-12-26 16:10:03 +00:00
Owner

This switches blob storage again, this time to FastCDC for content-defined chunking, so small changes in large files don’t waste space. Old archives still work because the read path hasn’t changed, but existing records only get the new chunking benefits once they are copied to a new archive. That’s why there is now a `copy` command, which duplicates all files and records from one archive to another. `copy` is still pretty basic and needs some cleanup. Future improvements could make it smarter, for example by copying only certain domains or paths.
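
To illustrate why content-defined chunking keeps small edits cheap, here is a minimal sketch of a gear-hash chunker with a digest-keyed blob store. It is not the code from this PR and not full FastCDC: the names `chunk`, `store`, `GEAR`, and all size parameters are assumptions made up for the example, and the sizes are far smaller than a real archive would use.

```python
import hashlib
import random

# Illustrative sizes only (assumed for this sketch); real chunkers use much larger values.
MIN_SIZE = 64             # never cut before this many bytes
MAX_SIZE = 1024           # always cut at this many bytes
MASK = (1 << 8) - 1       # cut when the low 8 hash bits are all zero

# Gear table: one fixed pseudo-random 64-bit value per possible byte value.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]


def chunk(data: bytes) -> list[bytes]:
    """Split data at content-defined boundaries using a gear rolling hash."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Shifting left by one per byte means the low 8 bits depend only on
        # the last 8 bytes, so cut candidates follow content, not offsets.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        size = i - start + 1
        if size >= MAX_SIZE or (size >= MIN_SIZE and (h & MASK) == 0):
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def store(blobs: dict[bytes, bytes], data: bytes) -> int:
    """Add data's chunks to a digest-keyed store; return bytes newly stored."""
    added = 0
    for c in chunk(data):
        digest = hashlib.sha256(c).digest()
        if digest not in blobs:
            blobs[digest] = c
            added += len(c)
    return added
```

Storing a large blob and then a slightly edited version of it should add only the few chunks around the edit, because later boundaries fall at the same content positions again. FastCDC proper additionally skips hashing below the minimum size and uses normalized chunking with two masks, both of which this sketch leaves out.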

feat: fastcdc chunking
All checks were successful
ci/woodpecker/pr/test/3 Pipeline was successful
ci/woodpecker/pr/test/1 Pipeline was successful
ci/woodpecker/pr/test/2 Pipeline was successful
1be7edd2fd
feat: copy cmd
Some checks failed
ci/woodpecker/pr/test/1 Pipeline failed
ci/woodpecker/pr/test/2 Pipeline failed
ci/woodpecker/pr/test/3 Pipeline failed
c2aef66b17
feat: copy progress bar
All checks were successful
ci/woodpecker/pr/test/2 Pipeline was successful
ci/woodpecker/pr/test/3 Pipeline was successful
ci/woodpecker/pr/test/1 Pipeline was successful
75d39f3dfc
perf: 8 async copies
All checks were successful
ci/woodpecker/pr/test/1 Pipeline was successful
ci/woodpecker/pr/test/2 Pipeline was successful
ci/woodpecker/pr/test/3 Pipeline was successful
1dd5283724
jmarya changed title from WIP: FastCDC chunking to FastCDC chunking 2025-12-26 00:53:47 +00:00
Author
Owner

The `copy` command is now implemented as a logical copy over all records stored in the archive.
Copying the prod archive into a new FastCDC-based archive yielded another massive improvement:
the archive shrank from roughly 117G to 44G, or roughly 37% of the original size.
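
As a rough picture of what a logical copy with bounded concurrency can look like, here is a minimal sketch. The `Archive` class and `copy_archive` function are hypothetical stand-ins, not this project's API, and the default limit of eight in-flight copies is an assumption based only on the "perf: 8 async copies" commit title.

```python
import asyncio


class Archive:
    """Hypothetical in-memory archive; a stand-in, not the project's type."""

    def __init__(self) -> None:
        self.records: dict[str, bytes] = {}

    async def get(self, key: str) -> bytes:
        return self.records[key]

    async def put(self, key: str, payload: bytes) -> None:
        # A real destination would re-chunk the payload on write.
        self.records[key] = payload


async def copy_archive(src: Archive, dst: Archive, concurrency: int = 8) -> int:
    """Copy every record from src to dst, at most `concurrency` at a time."""
    sem = asyncio.Semaphore(concurrency)

    async def copy_one(key: str) -> None:
        async with sem:  # limits the number of in-flight copies
            await dst.put(key, await src.get(key))

    keys = list(src.records)
    await asyncio.gather(*(copy_one(k) for k in keys))
    return len(keys)
```

Because the copy goes through the normal read and write paths rather than duplicating stored blobs byte for byte, the destination is free to chunk the data differently from the source.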

jmarya merged commit 12ed3cbc8a into main 2025-12-26 16:10:03 +00:00
jmarya deleted branch fastcdc 2025-12-26 16:10:04 +00:00
Reference
jmarya/webarc!26