Support for Segment Anything Model 2 (SAM 2) #32394
Conversation
@haithamkhedr Hi Haitham! I saw that you were directly working on SAM 2. We have closed #32317 to clarify that this is the main PR, with the original authors.
cc @NielsRogge, @qubvel and @amyeroberts!
@haithamkhedr Awesome - super excited to see SAM 2 available in transformers! 🥳 Let us know when the PR's ready for review or if you have any questions in the meantime.
For the HF team here, but this was also requested upstream by other users:
Hi @amyeroberts, it would be great to get an initial review on this PR now. It's currently functional, and I would appreciate your feedback. Thanks!
Hi @haithamkhedr, thanks for working on adding the model! It's great that the model is already functional, and we are looking forward to adding it. You can also get inspiration by looking at existing model implementations. Let me know if you have any specific questions regarding the implementation!
Hi @qubvel, thanks for the feedback. So overall, the interaction with the models has to be through a stateful video predictor.
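For context, the interaction in the original facebookresearch/sam2 repository looks roughly like the sketch below; the config/checkpoint file names are examples, and the method names (`init_state`, `add_new_points`, `propagate_in_video`) are taken from that codebase and may differ between versions:

```python
import torch
from sam2.build_sam import build_sam2_video_predictor

# Build the stateful video predictor from a config and checkpoint
# (example file names; use the ones shipped with the sam2 repo).
predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "sam2_hiera_large.pt")

with torch.inference_mode():
    # The predictor keeps per-video memory in an inference state object.
    state = predictor.init_state(video_path="path/to/video_frames")

    # Add a positive click prompt on frame 0 for object id 1.
    predictor.add_new_points(
        state, frame_idx=0, obj_id=1,
        points=[[210, 350]], labels=[1],
    )

    # Propagate the prompt through the remaining frames.
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        pass  # collect or save the per-frame masks here
```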
There was a similar "stateful" issue for serving inference with TorchServe, but there was a workaround: pytorch/serve#2743 (comment)
There will be something similar with ONNX exports once they become available for the video mode (currently only image exports are supported): microsoft/onnxruntime#20943
I suppose you can make something similar to text models and their cache. @amyeroberts, what do you think regarding this design?
This should work; the state is updated in place, so looping over the frames should work as expected.
I haven't yet seen something similar in the codebase, but it's better to ask @ArthurZucker or @amyeroberts. Here is a raw design I have in mind: the model is "stateless" (without memory), and the state or memory is passed at each step. It also allows user interaction, such as adding points at any frame. Moreover, it allows using the model concurrently with different state objects, without resetting it or creating several instances of the model. Let me know what you think about it.

```python
import torch
from typing import Optional


class Sam2VideoState:
    """Store frame-wise information for video segmentation:
    points, labels, and past model hidden states used in the model.
    """
    ...


class Sam2ForVideoSegmentation(Sam2PretrainedModel):
    ...

    def forward(
        self,
        pixel_values: torch.Tensor,
        points: Optional[torch.Tensor] = None,
        labels: Optional[torch.Tensor] = None,
        state: Optional[Sam2VideoState] = None,
    ) -> Sam2ForVideoSegmentationOutput:
        if state is None:
            state = Sam2VideoState()
        if points is not None:
            state.add_points(points)  # add points at the current frame
        if labels is not None:
            state.add_labels(labels)  # add labels at the current frame

        # Forward pass with `state`. The state is used to access the model's hidden
        # states from previous frames. The state is not modified inside the model;
        # hidden states are returned in `output` and added to the state below.
        output = self.video_model(pixel_values, state)

        # Add the current model's hidden states to the `state`
        state.add_hidden_states(output.hidden_states)
        state.step()

        return Sam2ForVideoSegmentationOutput(
            mask_logits=output.mask_logits,
            object_logits=output.object_logits,
            state=state,
            # ... other outputs
        )


model = Sam2ForVideoSegmentation.from_pretrained("model_name")
image_processor = Sam2ImageProcessor.from_pretrained("model_name")

state = None
for i, frame in enumerate(video):
    # points and labels can be passed for specific frames
    inputs = image_processor(images=frame, points=points, labels=labels)
    outputs = model(**inputs, state=state)
    state = outputs.state
    frame_annotations = image_processor.post_process_image_segmentation(
        **outputs, target_size=frame.size
    )
    # save/plot annotations
```
My understanding is that SAM 2 requires torch >= 2.3.1. Will that be the case for this transformers implementation as well?
I think @qubvel's suggestion is a good one and aligns well with transformers patterns. For other models which handle state, e.g. RWKV, we pass the state through the forward call and return the updated state with the outputs.
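A rough sketch of that pattern, assuming the RWKV integration in transformers (`RwkvForCausalLM` accepts a `state` argument and returns the updated state on its output when `use_cache=True`; the checkpoint name is just an example):

```python
import torch
from transformers import AutoTokenizer, RwkvForCausalLM

tokenizer = AutoTokenizer.from_pretrained("RWKV/rwkv-4-169m-pile")
model = RwkvForCausalLM.from_pretrained("RWKV/rwkv-4-169m-pile")

state = None
for chunk in ["The quick brown fox ", "jumps over the lazy dog."]:
    inputs = tokenizer(chunk, return_tensors="pt")
    with torch.no_grad():
        # The model itself stays stateless; the recurrent state is passed in
        # and the updated state comes back with the outputs.
        outputs = model(**inputs, state=state, use_cache=True)
    state = outputs.state  # carried over to the next chunk
```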
Have you evaluated the overhead of exchanging the cache/state on every frame in this case?
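For reference, one way such overhead could be measured is a per-frame timing loop around the proposed API; this is only a sketch that reuses the hypothetical `Sam2ForVideoSegmentation`/`Sam2VideoState` names from the design above:

```python
import time
import torch

def mean_time_per_frame(model, image_processor, frames):
    """Time each forward call while carrying the state object across frames."""
    state, timings = None, []
    for frame in frames:
        inputs = image_processor(images=frame, return_tensors="pt")
        start = time.perf_counter()
        with torch.no_grad():
            outputs = model(**inputs, state=state)  # hypothetical stateful forward
        state = outputs.state
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)
```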
Thanks for drafting this design. The main concern I have is that it requires the user to use two models to be able to do prediction on videos and to do some bookkeeping on video frames, whereas they really only need one model. Will this require splitting the checkpoints across these two models (`Sam2ForVideoSegmentation` and `Sam2ImageProcessor`)?
@haithamkhedr ImageProcessor is not a model; it's an object responsible for image/frame preprocessing (resizing, normalizing, ...) and postprocessing (e.g. threshold filtering, applying the final activation to the logits, ...). This pattern is used across all our vision models.
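For reference, the existing SAM (v1) integration in transformers already follows this split: `SamProcessor`/`SamImageProcessor` handle the pre- and post-processing while `SamModel` does the prediction (the image path is a placeholder):

```python
import torch
from PIL import Image
from transformers import SamModel, SamProcessor

processor = SamProcessor.from_pretrained("facebook/sam-vit-base")
model = SamModel.from_pretrained("facebook/sam-vit-base")

image = Image.open("frame.jpg").convert("RGB")
input_points = [[[450, 600]]]  # a single (x, y) point prompt

# Preprocessing: resize, normalize, and batch the image and prompts.
inputs = processor(image, input_points=input_points, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Postprocessing: resize the predicted masks back to the original image size.
masks = processor.image_processor.post_process_masks(
    outputs.pred_masks, inputs["original_sizes"], inputs["reshaped_input_sizes"]
)
```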
@haithamkhedr Thanks for your work! Will this be realized? Really looking forward to using it in the live-streaming mode, as @qubvel suggested.
@haithamkhedr Will you continue? It would be great to get this contribution merged into HF.
Hi, are there any updates on integrating SAM 2 into HF Transformers? We are trying to fine-tune it with the HF trainer, and it would be great if SAM 2 were in native HF format!
@haithamkhedr Thanks for all the work in this PR adding this model to the library! This is a model that the community is really excited about having in transformers, and so there's also a lot of interest in adding the model themselves. As there hasn't been any recent activity here, two contributors, @RUFFY-369 and @SangbumChoi, have (re)started an effort on #32317, which will likely be the PR to be merged in. If you're still interested in collaborating, what I would suggest is helping with that effort, and you can be added as a co-author on that PR.
This is currently not in active development. Closing for now.
This PR integrates SAM 2 models into Hugging Face transformers (closes #32308). Sample usage:
Video Predictor
Image Predictor
TODO: