Clean up RemoteCache classes by aorenste · Pull Request #134032 · pytorch/pytorch · GitHub

aorenste
Contributor

@aorenste aorenste commented Aug 20, 2024

Summary:
The existing RemoteCacheBackend classes were a bit haphazard: some accepted bytes only, some accepted objects, and some returned different types of objects than were passed in.

Update them to be more consistent:

  1. RemoteCacheBackend is an implementation of a backend: Redis, Memcache, Manifold, LocalFile

  2. RemoteCacheSerde is an implementation of a serde protocol that turns structured objects (dict, list, etc.) into bytes: RemoteCacheJsonSerde (JSON encoding), RemoteCachePassthroughSerde (strictly bytes only)

  3. RemoteCache is the cache implementation itself, combining a RemoteCacheBackend with a RemoteCacheSerde to provide structured caching.
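
The layering described in the list above can be sketched roughly as follows. The class names come from the description, but the method names, signatures, and the `InMemoryBackend` stand-in are assumptions made for illustration; this is not the code in `torch/_inductor/remote_cache.py`.

```python
import json
from typing import Any, Dict, Optional


class RemoteCacheBackend:
    """Storage layer: moves raw bytes in and out of some store."""

    def get(self, key: str) -> Optional[bytes]:
        raise NotImplementedError

    def put(self, key: str, data: bytes) -> None:
        raise NotImplementedError


class InMemoryBackend(RemoteCacheBackend):
    """Hypothetical stand-in for Redis, Memcache, Manifold, or LocalFile."""

    def __init__(self) -> None:
        self._store: Dict[str, bytes] = {}

    def get(self, key: str) -> Optional[bytes]:
        return self._store.get(key)

    def put(self, key: str, data: bytes) -> None:
        # A backend only ever sees bytes; anything else is a caller bug.
        if not isinstance(data, bytes):
            raise TypeError("backend accepts bytes only")
        self._store[key] = data


class RemoteCacheSerde:
    """Serde layer: converts between structured values and bytes."""

    def encode(self, value: Any) -> bytes:
        raise NotImplementedError

    def decode(self, data: bytes) -> Any:
        raise NotImplementedError


class RemoteCacheJsonSerde(RemoteCacheSerde):
    def encode(self, value: Any) -> bytes:
        return json.dumps(value).encode("utf-8")

    def decode(self, data: bytes) -> Any:
        return json.loads(data)


class RemoteCachePassthroughSerde(RemoteCacheSerde):
    def encode(self, value: bytes) -> bytes:
        return value

    def decode(self, data: bytes) -> bytes:
        return data


class RemoteCache:
    """Cache layer: pairs one backend with one serde."""

    def __init__(self, backend: RemoteCacheBackend, serde: RemoteCacheSerde) -> None:
        self.backend = backend
        self.serde = serde

    def get(self, key: str) -> Optional[Any]:
        data = self.backend.get(key)
        return None if data is None else self.serde.decode(data)

    def put(self, key: str, value: Any) -> None:
        self.backend.put(key, self.serde.encode(value))


# Structured values go through the JSON serde; raw bytes pass straight through.
json_cache = RemoteCache(InMemoryBackend(), RemoteCacheJsonSerde())
json_cache.put("k", {"config": [1, 2], "time_ns": 5})

bytes_cache = RemoteCache(InMemoryBackend(), RemoteCachePassthroughSerde())
bytes_cache.put("k", b"raw")
```

With this split, callers always hand structured values to the cache, and every backend can assume it receives bytes.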

Besides reorganizing the existing cache code, this also fixes Redis autotune caching for OSS.

Test Plan: unit tests

Reviewed By: oulgen

Differential Revision: D61178859

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang

@pytorch-bot

pytorch-bot bot commented Aug 20, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/134032

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit f972b3b with merge base 5dad6a5:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D61178859

aorenste added a commit to aorenste/pytorch that referenced this pull request Aug 21, 2024
@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Aug 27, 2024
@s-deeper

s-deeper commented Aug 28, 2024

This was suggested for 2.4.1 cherry-picking. Can you evaluate the relationship with the crash at #132400 (comment)?

@aorenste
Contributor Author

> This was suggested for 2.4.1 cherry-picking. Can you evaluate the relationship with the crash at #132400 (comment)?

That does seem like a problem fixed by this PR - but the PR isn't ready to land yet (hoping today or tomorrow) so definitely don't cherry-pick it until it lands.

@evkogs

evkogs commented Aug 28, 2024

> Remote cache is crashing I don't know if we could cherry-pick #134032

Yep, it does. I can confirm the issue, so I hope this PR fixes it!

[rank14]: Traceback (most recent call last):
[rank14]:   File "/home/ubuntu/efs_gpu/libs/sd-scripts/./sdxl_train.py", line 960, in <module>
[rank14]:     train(args)
[rank14]:   File "/home/ubuntu/efs_gpu/libs/sd-scripts/./sdxl_train.py", line 747, in train
[rank14]:     accelerator.backward(loss)
[rank14]:   File "/home/ubuntu/efs_gpu/libs/accelerate/src/accelerate/accelerator.py", line 2195, in backward
[rank14]:     self.scaler.scale(loss).backward(**kwargs)
[rank14]:   File "/home/ubuntu/efs_gpu/miniconda3/envs/diffusion_torch2.4/lib/python3.11/site-packages/torch/_tensor.py", line 521, in backward
[rank14]:     torch.autograd.backward(
[rank14]:   File "/home/ubuntu/efs_gpu/miniconda3/envs/diffusion_torch2.4/lib/python3.11/site-packages/torch/autograd/__init__.py", line 289, in backward
[rank14]:     _engine_run_backward(
[rank14]:   File "/home/ubuntu/efs_gpu/miniconda3/envs/diffusion_torch2.4/lib/python3.11/site-packages/torch/autograd/graph.py", line 768, in _engine_run_backward
[rank14]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank14]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank14]:   File "/home/ubuntu/efs_gpu/miniconda3/envs/diffusion_torch2.4/lib/python3.11/site-packages/torch/autograd/function.py", line 306, in apply
[rank14]:     return user_fn(self, *args)
[rank14]:            ^^^^^^^^^^^^^^^^^^^^
[rank14]:   File "/home/ubuntu/efs_gpu/miniconda3/envs/diffusion_torch2.4/lib/python3.11/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 1861, in backward
[rank14]:     out = call_compiled_backward()
[rank14]:           ^^^^^^^^^^^^^^^^^^^^^^^^
[rank14]:   File "/home/ubuntu/efs_gpu/miniconda3/envs/diffusion_torch2.4/lib/python3.11/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 1809, in call_compiled_backward
[rank14]:     out = call_func_at_runtime_with_args(
[rank14]:           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank14]:   File "/home/ubuntu/efs_gpu/miniconda3/envs/diffusion_torch2.4/lib/python3.11/site-packages/torch/_functorch/_aot_autograd/utils.py", line 120, in call_func_at_runtime_with_args
[rank14]:     out = normalize_as_list(f(args))
[rank14]:                             ^^^^^^^
[rank14]:   File "/home/ubuntu/efs_gpu/miniconda3/envs/diffusion_torch2.4/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 600, in _fn
[rank14]:     return fn(*args, **kwargs)
[rank14]:            ^^^^^^^^^^^^^^^^^^^
[rank14]:   File "/home/ubuntu/efs_gpu/miniconda3/envs/diffusion_torch2.4/lib/python3.11/site-packages/torch/_inductor/codecache.py", line 1131, in __call__
[rank14]:     return self.current_callable(inputs)
[rank14]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank14]:   File "/home/ubuntu/efs_gpu/miniconda3/envs/diffusion_torch2.4/lib/python3.11/site-packages/torch/_inductor/compile_fx.py", line 944, in run
[rank14]:     return model(new_inputs)
[rank14]:            ^^^^^^^^^^^^^^^^^
[rank14]:   File "/tmp/inductor_cache/7u/c7uxyl6sa2vhqznxqbjjnq6jlo2l4wwfegkn2ofv4b6b5srjtqac.py", line 9192, in call
[rank14]:     triton_poi_fused__to_copy_0.run(tangents_1, buf0, 64512, grid=grid(64512), stream=stream2)
[rank14]:   File "/home/ubuntu/efs_gpu/miniconda3/envs/diffusion_torch2.4/lib/python3.11/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 820, in run
[rank14]:     self.autotune_to_one_config(*args, grid=grid, **kwargs)
[rank14]:   File "/home/ubuntu/efs_gpu/miniconda3/envs/diffusion_torch2.4/lib/python3.11/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 722, in autotune_to_one_config
[rank14]:     self.save_cache_hook(self.launchers[0].config, time_taken_ns)
[rank14]:   File "/home/ubuntu/efs_gpu/miniconda3/envs/diffusion_torch2.4/lib/python3.11/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 1122, in save_cache_hook
[rank14]:     remote_cache.put(remote_cache_key, data)
[rank14]:   File "/home/ubuntu/efs_gpu/miniconda3/envs/diffusion_torch2.4/lib/python3.11/site-packages/torch/_inductor/remote_cache.py", line 44, in put
[rank14]:     return self._redis.set(self._get_key(key), data)
[rank14]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank14]:   File "/home/ubuntu/efs_gpu/miniconda3/envs/diffusion_torch2.4/lib/python3.11/site-packages/redis/commands/core.py", line 2333, in set
[rank14]:     return self.execute_command("SET", *pieces, **options)
[rank14]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank14]:   File "/home/ubuntu/efs_gpu/miniconda3/envs/diffusion_torch2.4/lib/python3.11/site-packages/redis/client.py", line 548, in execute_command
[rank14]:     return conn.retry.call_with_retry(
[rank14]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank14]:   File "/home/ubuntu/efs_gpu/miniconda3/envs/diffusion_torch2.4/lib/python3.11/site-packages/redis/retry.py", line 62, in call_with_retry
[rank14]:     return do()
[rank14]:            ^^^^
[rank14]:   File "/home/ubuntu/efs_gpu/miniconda3/envs/diffusion_torch2.4/lib/python3.11/site-packages/redis/client.py", line 549, in <lambda>
[rank14]:     lambda: self._send_command_parse_response(
[rank14]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank14]:   File "/home/ubuntu/efs_gpu/miniconda3/envs/diffusion_torch2.4/lib/python3.11/site-packages/redis/client.py", line 524, in _send_command_parse_response
[rank14]:     conn.send_command(*args)
[rank14]:   File "/home/ubuntu/efs_gpu/miniconda3/envs/diffusion_torch2.4/lib/python3.11/site-packages/redis/connection.py", line 477, in send_command
[rank14]:     self._command_packer.pack(*args),
[rank14]:     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank14]:   File "/home/ubuntu/efs_gpu/miniconda3/envs/diffusion_torch2.4/lib/python3.11/site-packages/redis/connection.py", line 71, in pack
[rank14]:     raise DataError(value).with_traceback(traceback)
[rank14]:   File "/home/ubuntu/efs_gpu/miniconda3/envs/diffusion_torch2.4/lib/python3.11/site-packages/redis/connection.py", line 68, in pack
[rank14]:     output.append(hiredis.pack_command(args))
[rank14]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank14]: redis.exceptions.DataError: A tuple item must be str, int, float or bytes.
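
The traceback ends with the Redis client rejecting the value passed to `set` from `remote_cache.put`. One plausible reading, consistent with the PR summary, is that a structured object reached a backend that only accepts str, int, float, or bytes. The sketch below reproduces that failure mode with a hypothetical in-memory client (`FakeBytesStore` is made up for illustration and is not the redis API) and shows that encoding the value to bytes first avoids it. The payload shape is an assumption.

```python
import json


class FakeBytesStore:
    """Hypothetical stand-in for a Redis client; not the real redis API."""

    _ALLOWED = (str, int, float, bytes)

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        # Mirror the constraint reported in the traceback's error message.
        if not isinstance(value, self._ALLOWED):
            raise TypeError("value must be str, int, float or bytes")
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


store = FakeBytesStore()
payload = {"config": {"XBLOCK": 128}, "time_taken_ns": 1234}  # assumed shape

try:
    store.set("autotune-key", payload)  # structured object: rejected
    rejected = False
except TypeError:
    rejected = True

# Encoding to bytes before the backend call succeeds and round-trips.
store.set("autotune-key", json.dumps(payload).encode("utf-8"))
restored = json.loads(store.get("autotune-key"))
```
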

@bhack
Contributor

bhack commented Aug 30, 2024

binary-manywheel failures aren't related to this PR. See #134236

@aorenste aorenste marked this pull request as ready for review August 30, 2024 19:47
@aorenste
Contributor Author

Passes all tests - ready for review.

@oulgen
Contributor

oulgen commented Aug 30, 2024

Looks good to me, you can go ahead and land it, already approved

bhack pushed a commit to bhack/pytorch that referenced this pull request Aug 30, 2024
Summary:
Pull Request resolved: pytorch#134032

The existing RemoteCacheBackend classes were a bit haphazard - some of them accepted bytes only, some accepted objects, some returned different types of objects than were passed in.

Update them to be more consistent:

1. RemoteCacheBackend is an implementation of a backend: Redis, Memcache, Manifold, LocalFile

2. RemoteCacheSerde is an implementation of a serde protocol - to turn structured objects (dict, list, etc) into bytes: RemoteCacheJsonSerde (json encoding), RemoteCachePassthroughSerde (strictly bytes only)

3. RemoteCache is the cache implementation itself, mixing a RemoteCacheBackend along with an RemoteCacheSerde to provide structured caching.

Other than simply reorganizing the existing cache code this also fixes the Redis autotune caching for OSS.

Test Plan: unit tests

Reviewed By: oulgen

Differential Revision: D61178859

@facebook-github-bot
Contributor

@pytorchbot merge -f 'Landed internally'

(Initiating merge automatically since Phabricator Diff has merged, using force because this PR might not pass merge_rules.json but landed internally)

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: Check the merge workflow status here

@atalman atalman added this to the 2.5.0 milestone Sep 4, 2024
tolleybot pushed a commit to tolleybot/pytorch that referenced this pull request Sep 14, 2024
Pull Request resolved: pytorch#134032
Approved by: https://github.com/oulgen, https://github.com/bhack
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Sep 20, 2024
Pull Request resolved: pytorch#134032
Approved by: https://github.com/oulgen, https://github.com/bhack