Fixing DoRA docs, adding to mem opt tutorial #1918
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1918
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures
As of commit 0d76a65 with merge base 1bbd749.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
":ref:`glossary_opt_in_bwd`", "Helps reduce memory usage when using stateful optimizers, particularly when full-finetuning large models with high gradient memory usage. This is not compatible with ``gradient_accumulation_steps``, so training may slow down due to reduced model throughput."
":ref:`glossary_lora`", "When you want to significantly reduce the number of trainable parameters, saving gradient and optimizer memory during training, and significantly speeding up training."
":ref:`glossary_qlora`", "When you need even more memory savings than LoRA, at the potential cost of some training speed. Useful for very large models or limited hardware."
":ref:`glossary_dora`", "Like LoRA, DoRA can provide significant memory savings and training speed-ups. DoRA may improve performance over LoRA, particularly when using small rank updates."
do you know when someone would choose dora over lora?
honestly not sure, according to the paper it's just straight up better
is a scalar vector that adjusts the scale, while the direction component corresponds to the original LoRA decomposition and
updates the orientation of weights.
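To make the magnitude/direction split above concrete, here is a minimal pure-Python sketch of how a DoRA-style merged weight could be computed. It is not torchtune's implementation: the helper names (`matmul`, `dora_weight`) are hypothetical, and the choice to normalize per column of the weight is an assumption taken from the paper's notation; a real layer may normalize along a different axis depending on its weight layout.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def dora_weight(w0, lora_b, lora_a, magnitude):
    """Return magnitude * V / ||V||_c, where V = W0 + B @ A.

    Assumption: the norm is taken separately over each column of V.
    """
    delta = matmul(lora_b, lora_a)  # the usual low-rank LoRA update
    v = [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(w0, delta)]
    col_norms = [math.sqrt(sum(x * x for x in col)) for col in zip(*v)]
    # Direction: each column of V scaled to unit norm.
    # Magnitude: one learnable scale per column.
    return [[m * x / n for x, m, n in zip(row, magnitude, col_norms)] for row in v]


w0 = [[3.0, 0.0], [4.0, 2.0]]
lora_b = [[0.0], [0.0]]  # B starts at zero, so V equals W0 before training
lora_a = [[1.0, 1.0]]
magnitude = [5.0, 2.0]   # set to the column norms of W0

merged = dora_weight(w0, lora_b, lora_a, magnitude)
print(merged)
```

With the magnitude set to the column norms of the base weight and a zero LoRA update, the merged weight equals the base weight, so in this sketch training starts from the pretrained model. Changing only `magnitude` rescales each column without changing its direction.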
DoRA adds a small overhead to LoRA training due to the addition of the magnitude parameter, but it has been shown to
perf or memory overhead?
not 100% but there's an added parameter and extra computation so I'd say both
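For the memory side of that question, a rough back-of-the-envelope sketch of the extra trainable parameters per adapted linear layer may help. The function names are hypothetical, and the assumption that the magnitude holds one entry per output feature is mine, not something confirmed in this thread. It counts parameters only and says nothing about the extra compute for the norm.

```python
def lora_params(in_dim, out_dim, rank):
    """Trainable parameters in the two low-rank matrices A (rank x in) and B (out x rank)."""
    return rank * (in_dim + out_dim)


def dora_params(in_dim, out_dim, rank):
    """LoRA parameters plus a magnitude vector.

    Assumption: the magnitude has one entry per output feature.
    """
    return lora_params(in_dim, out_dim, rank) + out_dim


in_dim = out_dim = 4096  # illustrative layer size, not taken from the PR
rank = 8

lora = lora_params(in_dim, out_dim, rank)
dora = dora_params(in_dim, out_dim, rank)
extra = dora - lora
print(f"LoRA: {lora}, DoRA: {dora}, extra: {extra} ({extra / lora:.2%} more)")
```

Under these assumptions the added parameter count is small next to the LoRA matrices themselves, and the relative overhead shrinks as the rank grows.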
Few more typos, stamping to unblock
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1918      +/-   ##
==========================================
- Coverage   67.77%   67.63%   -0.15%
==========================================
  Files         304      304
  Lines       16199    16241      +42
==========================================
+ Hits        10979    10984       +5
- Misses       5220     5257      +37

☔ View full report in Codecov by Sentry.
Context
What is the purpose of this PR? Is it to
Fix DoRA docstring, add to the rest of the docs