-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Reuse space more in gridstore #6445
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
|
||
| #[derive(Debug)] | ||
| enum PointerUpdate { | ||
| Set(ValuePointer), | ||
| Unset(ValuePointer), | ||
| #[derive(Debug, Default)] | ||
| struct PointerUpdates { | ||
| /// Whether the latest pointer is set (true) or unset (false) | ||
| latest_is_set: bool, | ||
| /// List of pointers where the value was written | ||
| history: SmallVec<[ValuePointer; 1]>, | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the main change of the PR
📝 WalkthroughWalkthroughThe changes introduce and standardize the use of the 📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
⏰ Context from checks skipped due to timeout of 90000ms (13)
✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (3)
lib/gridstore/src/tracker.rs (3)
39-45: Consider documenting the invariants ofPointerUpdates
Because correctness of the clean-up logic relies onhistorynever being empty oncelatest_is_setbecomestrue/false, a short doc-comment describing that assumption will help future maintainers.
172-190: Optionally deduplicateold_pointersbefore returning
With the fix above duplicates should be rare, but if external callers treat the returned vector as a set, it may be safer (and cheaper long-term) to weed out duplicates here:old_pointers.sort_unstable(); old_pointers.dedup();Not essential, but helps keep the freeing path idempotent.
291-293: Minor ergonomics—avoid doubleOptionwithand_then
Readability can be improved by replacing themap(..).or_else(..).flatten()dance with:self.pending_updates .get(&point_offset) .and_then(PointerUpdates::latest) .or_else(|| self.get_raw(point_offset).copied())Same behaviour, one fewer
Optionlayer.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (6)
Cargo.toml(1 hunks)lib/collection/Cargo.toml(1 hunks)lib/gridstore/Cargo.toml(1 hunks)lib/gridstore/src/gridstore.rs(1 hunks)lib/gridstore/src/tracker.rs(8 hunks)lib/segment/Cargo.toml(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (13)
- GitHub Check: test-snapshot-operations-s3-minio
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: Basic TLS/HTTPS tests
- GitHub Check: test-low-resources
- GitHub Check: test-consistency
- GitHub Check: rust-tests (macos-latest)
- GitHub Check: test-consensus-compose
- GitHub Check: rust-tests (windows-latest)
- GitHub Check: integration-tests-consensus
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: lint
- GitHub Check: integration-tests
- GitHub Check: storage-compat-test
🔇 Additional comments (8)
Cargo.toml (1)
208-208: Properly adding smallvec dependency.Adding the smallvec crate as a workspace dependency aligns with the PR objective to improve space reuse in gridstore by tracking update history.
lib/gridstore/Cargo.toml (1)
20-20: LGTM - Adding smallvec as a workspace dependency.This addition is consistent with the root Cargo.toml change and supports the implementation of pointer update history tracking in the tracker module.
lib/collection/Cargo.toml (1)
46-46: Good standardization of smallvec dependency.Converting from a fixed version to a workspace dependency enhances consistency across the project.
lib/segment/Cargo.toml (1)
94-94: Good standardization of smallvec dependency.Converting from a fixed version to a workspace dependency enhances consistency across the project.
lib/gridstore/src/gridstore.rs (1)
736-767: Good test refactoring with explicit block offset reuse test.The refactored test with a closure nicely validates the PR objective of improved space reuse. The test now explicitly verifies that after flushing updates, the first block offset becomes available again for reuse.
The closure eliminates code duplication and makes the test more maintainable while clearly demonstrating the improved behavior.
lib/gridstore/src/tracker.rs (3)
9-9: Good call introducingSmallVecto avoid heap allocations in the common case
The import is correct and sits with the other extern-crate imports, keeping the module header tidy.
309-313: Looks good – offset tracking logic preserved
Ensuresnext_pointer_offsetonly grows.
321-324: Behaviourally sound, but double-check race conditions in the caller
unsetrelies ongetreturning the latest flushed/pending value. IfTrackeris ever shared across threads, the caller must synchronise calls toset/unset/flush.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR updates the gridstore implementation to more effectively manage and free outdated pointers during flushes by tracking the complete history of pointer updates. Key changes include:
- Replacing the previous pointer update enum with a new PointerUpdates struct to record update history.
- Adjusting the persistence logic to free all outdated pointer locations.
- Updating tests and Cargo.toml dependency references for consistency.
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| lib/segment/Cargo.toml | Changed smallvec to be workspace-managed. |
| lib/gridstore/src/tracker.rs | Introduced PointerUpdates struct and updated flush logic to correctly free earlier pointer updates. |
| lib/gridstore/src/gridstore.rs | Modified tests to verify block offset reuse after flush. |
| lib/gridstore/Cargo.toml | Updated smallvec dependency to workspace. |
| lib/collection/Cargo.toml | Updated smallvec dependency to workspace. |
| Cargo.toml | Updated smallvec dependency version reference. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
lib/gridstore/src/tracker.rs (2)
78-88: Consider using a reference instead of consuming self.The
outdated_pointersmethod currently consumesselfby taking ownership, which means thePointerUpdatesinstance can't be used after calling this method. Consider changing the signature to use a reference if you need to preserve the instance.- fn outdated_pointers(self) -> impl Iterator<Item = ValuePointer> { + fn outdated_pointers(&self) -> impl Iterator<Item = ValuePointer> + '_ { - let take = if self.latest_is_set { + let take = if self.latest_is_set { // all but the latest one - self.history.len().saturating_sub(1) + self.history.len().saturating_sub(1) } else { // all of them - self.history.len() + self.history.len() }; - self.history.into_iter().take(take) + self.history.iter().take(take).copied() }
171-201: Consider handling errors more robustly.If
mmap.flush()fails, the method will return an error, but the pending updates have already been taken and will be lost. Consider handling this more robustly to preserve updates in case of errors.pub fn write_pending_and_flush(&mut self) -> std::io::Result<Vec<ValuePointer>> { // Write pending updates from memory - let mut pending_updates = std::mem::take(&mut self.pending_updates); + // Take a clone instead of ownership to preserve updates in case of errors + let mut pending_updates = self.pending_updates.clone(); + self.pending_updates.clear(); let mut old_pointers = Vec::new(); for (point_offset, updates) in pending_updates.drain() { match updates.latest() { // ... existing code ... } old_pointers.extend(updates.outdated_pointers()); } // increment header count if necessary self.persist_pointer_count(); // Flush the mmap - self.mmap.flush()?; + if let Err(e) = self.mmap.flush() { + // Restore the pending updates on error + self.pending_updates = pending_updates; + return Err(e); + } Ok(old_pointers) }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
lib/gridstore/src/tracker.rs(8 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (13)
- GitHub Check: Basic TLS/HTTPS tests
- GitHub Check: test-snapshot-operations-s3-minio
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: test-low-resources
- GitHub Check: test-consistency
- GitHub Check: integration-tests-consensus
- GitHub Check: rust-tests (macos-latest)
- GitHub Check: test-consensus-compose
- GitHub Check: rust-tests (windows-latest)
- GitHub Check: integration-tests
- GitHub Check: storage-compat-test
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: lint
🔇 Additional comments (10)
lib/gridstore/src/tracker.rs (10)
9-9: Good addition of the SmallVec dependency.The import of
smallvec::SmallVecis appropriate for this use case, as it provides a vector-like container that's optimized for the common case where there are few elements, avoiding heap allocations.
39-45: Well-structured replacement for PointerUpdate.This new
PointerUpdatesstruct effectively implements the core of the PR's objective, tracking all locations where values have been written between flushes, not just the last one.
47-62: Implementation looks good with proper safeguards.The implementation for
setandunsetis clear and includes the defensive check to prevent duplicate pointers in the history as suggested in previous review comments.
69-75: Clean implementation of latest pointer resolution.The
latestmethod correctly handles both set and unset states, returning the appropriate value based on thelatest_is_setflag.
78-88: Good implementation with safeguard against underflow.The
outdated_pointersmethod correctly identifies which pointers should be freed based on the latest state, and includes thesaturating_subsafeguard as suggested in a previous review.
107-107: Appropriate field type update.The
pending_updatesfield has been correctly updated to use the newPointerUpdatesstruct.
175-193: Core improvement for space reuse implemented correctly.This is the main change of the PR, as noted in previous comments. The implementation now properly:
- Handles the latest pointer state (set/unset)
- Collects all outdated pointers from the update history
- Ensures they're properly freed
This directly addresses the PR's objective of improving space reuse by tracking all locations where values were written between flushes.
294-294: Appropriate method update.The
getmethod has been properly updated to use the newPointerUpdates.latest()method.
312-315: Clean implementation of set method.The method now correctly creates or retrieves a
PointerUpdatesinstance and delegates to itssetmethod.
324-327: Well-implemented unset method.The method correctly creates or retrieves a
PointerUpdatesinstance and delegates to itsunsetmethod.
| /// If this is `true`, then history must have at least one element. | ||
| latest_is_set: bool, | ||
| /// List of pointers where the value has been written | ||
| history: SmallVec<[ValuePointer; 1]>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great! I like the use of SmallVec here.
This PR fixes one thing we overlooked in gridstore.
Pending updates only stored the last place where a value is. So, when flushing, we would only free the previous stored value. If multiple updates happened in between flushes, we'd end up with storage that is never marked as free.
With these changes, now we keep track of all the places that values have been written in between flushes, so that we only keep the latest one, and are able to mark all the previous places as free.
Testing
For testing, I modified one test, so that it failed without the changes, and also did manual validation by making sure that
bfb -n 5000000 --max-id 10 -d 128 --float-payloads true --segments 1 --on-disk-payload --text-payloads --text-payload-length 100 --skip-field-indicesdoesn't grow the size of payload storage forever. It now reduces in between flushes, and it is minimal after updates have finished.