
[Inductor] Extend Inplacing to Reduction Kernels #132826

@eellison

Description


🚀 The feature, motivation and pitch

Currently, the inplacing logic only applies to pointwise uses. We should extend it to reduction kernels as well.

For example, in the snippet below, the `matmul_output` buffer could be reused in place by the LayerNorm reduction:

```python
import torch

torch.set_default_device("cuda")
torch.set_grad_enabled(False)

batch_size = 32
seq_length = 50
hidden_size = 768

inp = torch.randn(batch_size, seq_length, hidden_size)
weight = torch.randn(hidden_size, hidden_size)

layer_norm = torch.nn.LayerNorm(hidden_size)

@torch.compile()
def foo(inp, weight):
    # matmul_output is an intermediate that is only read by the LayerNorm
    # reduction, so its buffer is dead afterwards and could be reused as
    # the reduction's output buffer.
    matmul_output = inp @ weight
    final_output = layer_norm(matmul_output)
    return final_output

foo(inp, weight)
```

This can help with both perf and memory.
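One way to check whether inplacing kicked in is to inspect the Triton code Inductor generates. A rough sketch is below; `run_and_get_code` is an internal test utility rather than a stable API, and the `in_out_ptr` argument naming is an Inductor implementation detail, so both may change between releases:

```python
import torch
from torch._inductor.utils import run_and_get_code  # internal test helper

torch.set_default_device("cuda")
torch.set_grad_enabled(False)

hidden_size = 768
layer_norm = torch.nn.LayerNorm(hidden_size)

@torch.compile()
def foo(inp, weight):
    return layer_norm(inp @ weight)

inp = torch.randn(32, 50, hidden_size)
weight = torch.randn(hidden_size, hidden_size)

# Inductor names reused (inplaced) buffers `in_out_ptr<N>` in the generated
# Triton kernels, while a plain `out_ptr<N>` indicates a fresh allocation.
_, codes = run_and_get_code(foo, inp, weight)
print(any("in_out_ptr" in code for code in codes))
```

Alternatively, running with `TORCH_LOGS="output_code"` dumps the same generated code for manual inspection.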

Alternatives

No response

Additional context

No response

cc @ezyang @chauhang @penguinwu @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire

Metadata

Labels

internal ramp-up task (tasks that are suitable for new folks w/ high-touch guidance from senior PyTorch folks), module: inductor, oncall: pt2, triaged (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
