feat: lfilter with filterbanks support by yoyolicoris · Pull Request #1587 · pytorch/audio · GitHub

Conversation

@yoyolicoris
Contributor

@yoyolicoris yoyolicoris commented Jun 18, 2021

resolves #1528, relates to #1561

This PR adds the ability to apply multiple filters in F.lfilter.
When the input shape is (..., time) and the shape of the filter coefficients is (number_of_filters, filter_order), the output shape will be (..., number_of_filters, time), i.e. the input shape plus one additional dimension coming from the filters.

Before feeding to the C++ backend, the Python frontend first duplicates the input number_of_filters times and stacks the copies to form a shape of (..., number_of_filters, time).

The C++ backend assumes the shapes of waveform and *_coeffs are (batch, number_of_filters, time) and (number_of_filters, filter_order), respectively (previously they were (batch, time) and (filter_order,)). The different filters are applied in parallel along the second dimension of waveform.
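The shape handling described above can be sketched in isolation (a standalone illustration of the described semantics, not torchaudio's actual code path):

```python
import torch

# Standalone illustration of the frontend shape handling described above
# (not torchaudio's actual code path).
waveform = torch.randn(3, 2, 1000)       # input: (..., time)
num_filters = 4                          # i.e. number_of_filters
shape = waveform.size()

# The frontend flattens the leading dims, adds a singleton filter dim,
# then repeats the signal once per filter:
x = waveform.reshape(-1, 1, shape[-1])   # (batch, 1, time)
x = x.repeat(1, num_filters, 1)          # (batch, num_filters, time)

# After filtering, the result is reshaped back, gaining one dimension:
out = x.reshape(shape[:-1] + (num_filters, shape[-1]))
```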

cc @mthrok

@yoyolicoris yoyolicoris force-pushed the feat/lfilter-filter-banks branch from 5c239cb to e889232 Compare June 21, 2021 02:07
Contributor

@mthrok mthrok left a comment


Overall, it looks good. The semantics around the filter bank look straightforward, though the documentation can be improved.

If I understand correctly, the implementation cannot handle the case where one wants to apply different filters per channel (though I am not sure if that's a valid use case). Is that correct?

Returns:
Tensor: Waveform with dimension of ``(..., time)``.
Tensor: Waveform with dimension of ``(..., *, time)``.
Contributor


I think the shape and filter-bank semantics (noted as * in the current form) need more clarification.
Specifically, the distinction between the cases of 1D coeffs and 2D coeffs should be explained
(even though I think the use of * here is very sophisticated).

Contributor Author

@yoyolicoris yoyolicoris Jun 22, 2021


Sounds reasonable. How about

Suggested change
Tensor: Waveform with dimension of ``(..., *, time)``.
Tensor: Waveform with dimension of either ``(..., *, time)`` if ``a_coeffs`` and ``b_coeffs`` are 2D Tensors,
or ``(..., time)`` otherwise.

Contributor


This looks good, but can we simply replace the * with something like num_filters?

(And in the docstring of a_coeffs, b_coeffs, can we explicitly say either 1D with shape (num_order + 1) or 2D with shape (num_filters, num_order + 1)?)

The thing is, the ellipsis is used in many docstrings, but the asterisk is unique to this function, so I think explicitly mentioning the difference between 1D and 2D is easier to understand.
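The 1D-vs-2D contract under discussion could be illustrated roughly as follows (num_filters and num_order follow the naming suggested in this thread; this is a shape-only sketch, not torchaudio's docstring):

```python
import torch

# Shape-only sketch of the two coefficient cases (num_filters / num_order
# follow the naming suggested in this review thread).
num_order, num_filters = 2, 4
x = torch.randn(8, 2, 1000)                       # (..., time)

# 1D coeffs, shape (num_order + 1,): output keeps the input shape (..., time)
coeffs_1d = torch.tensor([1.0, -0.5, 0.1])

# 2D coeffs, shape (num_filters, num_order + 1):
# output gains a filter dimension -> (..., num_filters, time)
coeffs_2d = torch.randn(num_filters, num_order + 1)
out_2d_shape = x.shape[:-1] + (num_filters, x.shape[-1])
```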

@yoyolicoris
Contributor Author

Overall, it looks good. The semantic around filter bank looks straightforward though documentation can be improved.

If I understand correctly, the implementation cannot handle the case where one wants to apply different filters per channel. (though I am not sure if that's valid use-case.) Is that correct?

No, it cannot apply different filters per channel. The same set of filters is applied to all signals, regardless of channel.

Contributor

@mthrok mthrok left a comment


Hi @yoyololicon

Sorry for not getting back sooner.
I think this PR is good, but if you can consider the comments, that would be great.
Thanks!

waveform = waveform.reshape(-1, 1, shape[-1])
if a_coeffs.ndim > 1:
    shape = shape[:-1] + (a_coeffs.shape[0], shape[-1])
    waveform = waveform.repeat(1, a_coeffs.shape[0], 1)
Contributor


Instead of repeating the waveform manually and computing the shape manually, how about overwriting shape after repeating? I think that will be slightly easier to understand.

waveform = waveform.repeat(...)
shape = waveform.size()


@yoyolicoris yoyolicoris requested a review from mthrok June 28, 2021 15:02
@yoyolicoris
Contributor Author

Benchmarks

I have managed to build the sources with CUDA support using a Docker container.
The benchmarks were conducted on a Linux machine with an i5 4790K, 32 GB of RAM, and an RTX 3070.
The results indicate that the runtime on CUDA is roughly doubled.

Before

CPU

[-------------- IIR filter --------------]
                   |  forward  |  backward
1 threads: -------------------------------
      [32, 256]    |    307.2  |    808.0 
      [32, 1024]   |    614.9  |   1512.9 
      [32, 4096]   |   2036.1  |   5199.6 
      [64, 256]    |    399.5  |    992.8 
      [64, 1024]   |   1027.6  |   2417.2 
      [64, 4096]   |   7028.7  |  13454.1 
      [128, 256]   |    624.4  |   1493.7 
      [128, 1024]  |   1986.3  |   4805.7 
      [128, 4096]  |  17420.1  |  31806.2 
2 threads: -------------------------------
      [32, 256]    |    291.6  |    813.7 
      [32, 1024]   |    531.0  |   1417.3 
      [32, 4096]   |   1754.9  |   4138.1 
      [64, 256]    |    357.4  |    968.6 
      [64, 1024]   |    797.2  |   1973.9 
      [64, 4096]   |   6914.0  |  13116.0 
      [128, 256]   |    535.3  |   1398.3 
      [128, 1024]  |   1522.3  |   4060.9 
      [128, 4096]  |  17863.6  |  30650.4 
4 threads: -------------------------------
      [32, 256]    |    268.3  |    822.9 
      [32, 1024]   |    491.3  |   1395.4 
      [32, 4096]   |   1539.7  |   4215.9 
      [64, 256]    |    347.6  |    996.4 
      [64, 1024]   |    708.5  |   1801.9 
      [64, 4096]   |   7666.7  |  13461.7 
      [128, 256]   |    479.9  |   1306.3 
      [128, 1024]  |   1428.4  |   4137.1 
      [128, 4096]  |  18735.3  |  31495.1 

Times are in microseconds (us).

CUDA

[-------------- IIR filter --------------]
                   |  forward  |  backward
1 threads: -------------------------------
      [32, 256]    |    10.7   |    19.6  
      [32, 1024]   |    41.7   |    76.5  
      [32, 4096]   |   159.8   |   297.6  
      [64, 256]    |    10.3   |    19.7  
      [64, 1024]   |    41.8   |    75.9  
      [64, 4096]   |   157.9   |   299.2  
      [128, 256]   |    10.5   |    19.8  
      [128, 1024]  |    41.3   |    75.9  
      [128, 4096]  |   164.3   |   301.0  
2 threads: -------------------------------
      [32, 256]    |    10.6   |    19.6  
      [32, 1024]   |    40.2   |    77.6  
      [32, 4096]   |   161.9   |   302.6  
      [64, 256]    |    10.7   |    19.9  
      [64, 1024]   |    40.0   |    75.6  
      [64, 4096]   |   159.8   |   299.9  
      [128, 256]   |    10.9   |    19.8  
      [128, 1024]  |    41.4   |    75.2  
      [128, 4096]  |   170.6   |   301.2  
4 threads: -------------------------------
      [32, 256]    |    10.8   |    19.6  
      [32, 1024]   |    40.3   |    76.8  
      [32, 4096]   |   160.5   |   301.7  
      [64, 256]    |    10.6   |    19.7  
      [64, 1024]   |    40.3   |    75.5  
      [64, 4096]   |   163.7   |   302.8  
      [128, 256]   |    10.7   |    19.8  
      [128, 1024]  |    42.8   |    75.9  
      [128, 4096]  |   165.5   |   299.8  

Times are in milliseconds (ms).

After

CPU

[-------------- IIR filter --------------]
                   |  forward  |  backward
1 threads: -------------------------------
      [32, 256]    |    322.1  |    835.5 
      [32, 1024]   |    635.2  |   1538.1 
      [32, 4096]   |   2164.4  |   5424.1 
      [64, 256]    |    410.9  |   1026.8 
      [64, 1024]   |   1010.5  |   2299.0 
      [64, 4096]   |   6981.6  |  12721.7 
      [128, 256]   |    616.4  |   1451.6 
      [128, 1024]  |   1886.5  |   4810.8 
      [128, 4096]  |  17436.0  |  31758.8 
2 threads: -------------------------------
      [32, 256]    |    305.4  |    854.4 
      [32, 1024]   |    550.3  |   1464.6 
      [32, 4096]   |   1733.3  |   4914.4 
      [64, 256]    |    370.5  |   1003.6 
      [64, 1024]   |    791.4  |   1927.2 
      [64, 4096]   |   6829.2  |  12608.5 
      [128, 256]   |    536.6  |   1383.8 
      [128, 1024]  |   1429.7  |   4541.2 
      [128, 4096]  |  17876.0  |  31664.5 
4 threads: -------------------------------
      [32, 256]    |    283.0  |    850.6 
      [32, 1024]   |    492.4  |   1446.2 
      [32, 4096]   |   1586.0  |   4792.5 
      [64, 256]    |    362.8  |    978.7 
      [64, 1024]   |    735.8  |   1842.1 
      [64, 4096]   |   7331.0  |  12645.4 
      [128, 256]   |    492.1  |   1351.4 
      [128, 1024]  |   1265.6  |   3954.7 
      [128, 4096]  |  18502.5  |  33391.6 

Times are in microseconds (us).

CUDA

[-------------- IIR filter --------------]
                   |  forward  |  backward
1 threads: -------------------------------
      [32, 256]    |    19.6   |    37.9  
      [32, 1024]   |    77.4   |   148.7  
      [32, 4096]   |   314.3   |   596.4  
      [64, 256]    |    19.7   |    38.5  
      [64, 1024]   |    82.8   |   147.6  
      [64, 4096]   |   328.1   |   589.7  
      [128, 256]   |    20.5   |    38.0  
      [128, 1024]  |    82.3   |   148.7  
      [128, 4096]  |   323.4   |   595.7  
2 threads: -------------------------------
      [32, 256]    |    19.9   |    37.9  
      [32, 1024]   |    77.4   |   150.9  
      [32, 4096]   |   308.5   |   597.6  
      [64, 256]    |    19.7   |    38.6  
      [64, 1024]   |    82.7   |   147.6  
      [64, 4096]   |   330.5   |   590.1  
      [128, 256]   |    20.8   |    38.2  
      [128, 1024]  |    80.9   |   149.3  
      [128, 4096]  |   330.8   |   595.9  
4 threads: -------------------------------
      [32, 256]    |    19.8   |    37.9  
      [32, 1024]   |    78.2   |   150.8  
      [32, 4096]   |   309.5   |   598.0  
      [64, 256]    |    20.1   |    38.1  
      [64, 1024]   |    82.6   |   148.9  
      [64, 4096]   |   323.4   |   589.2  
      [128, 256]   |    20.9   |    38.2  
      [128, 1024]  |    81.4   |   148.7  
      [128, 4096]  |   322.0   |   596.0  

Times are in milliseconds (ms).

@mthrok
Contributor

mthrok commented Jul 2, 2021

Hi @yoyololicon

Thanks for the benchmark.
The shapes in the table, are they the size of input batch?
I am also curious to learn the performance difference of multiple filters.

What do you think is the feasibility of splitting CPU/CUDA code?
I assume that the extra overhead for CUDA case is coming from the newly added for-loop.

BTW, I found an attempt to use RNN in #226. Although it is also pointed out that RNN without activation would not improve the performance as much. https://discuss.pytorch.org/t/vanillarnn-without-activation-function/26507/2. I wonder if this could help us in custom CUDA lfilter code.

@yoyolicoris
Contributor Author

yoyolicoris commented Jul 3, 2021

Hi @mthrok ,

Thanks for the benchmark.
The shapes in the table, are they the size of input batch?
I am also curious to learn the performance difference of multiple filters.

Yes, they are the sizes of the input. I can do the multiple-filters case later.

What do you think is the feasibility of splitting CPU/CUDA code?
I assume that the extra overhead for CUDA case is coming from the newly added for-loop.

The overhead concern is technically true, although the actual implementation replaces addmv and a for-loop with bmm and some extra transpose operations.
To completely avoid these overheads, I think writing custom CUDA kernels is needed.
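A toy version of that idea, batching the per-filter feedback product at each time step (here einsum stands in for the bmm-plus-transpose formulation mentioned above; illustrative only, not the actual kernel):

```python
import torch

def iir_bank_recursion(x, a):
    """Toy sketch of the recursive (feedback) part of an IIR filter bank,
    computing one batched product per time step instead of addmv in a loop.
    Not torchaudio's actual kernel.
    x: (batch, num_filters, time); a: (num_filters, order + 1), a[:, 0] == 1.
    """
    batch, nf, T = x.shape
    order = a.shape[1] - 1
    y = torch.zeros(batch, nf, T + order)        # zero-padded past samples
    a_rev = a.flip(1)[:, :-1]                    # reversed feedback taps (nf, order)
    for t in range(T):
        window = y[:, :, t:t + order]            # (batch, nf, order)
        # batched per-filter dot product (bmm with transposes would also work)
        feedback = torch.einsum('bfo,fo->bf', window, a_rev)
        y[:, :, t + order] = x[:, :, t] - feedback
    return y[:, :, order:]

# impulse response of y[t] = x[t] + 0.5 * y[t-1], as a 1-filter "bank"
imp = torch.zeros(1, 1, 5)
imp[0, 0, 0] = 1.0
resp = iir_bank_recursion(imp, torch.tensor([[1.0, -0.5]]))
```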

BTW, I found an attempt to use RNN in #226. Although it is also pointed out that RNN without activation would not improve the performance as much. https://discuss.pytorch.org/t/vanillarnn-without-activation-function/26507/2. I wonder if this could help us in custom CUDA lfilter code.

Hmm, I can't see much potential in using an RNN for our case right now, because an RNN is strictly a first-order IIR and cannot generalize to arbitrary order.
I do see some possibility of using an RNN if we can implement lfilter in a state-space formulation, but I haven't looked into it carefully yet.
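For reference, the state-space direction mentioned here amounts to rewriting an order-N recursion as a first-order update on a state vector; a minimal all-pole sketch (illustrative only, not a proposed implementation):

```python
import torch

def allpole_state_space(x, a):
    """Order-N all-pole filter as a first-order recurrence on a state vector,
    the form an RNN-style kernel could exploit. Illustrative sketch only.
    x: (time,); a: (order + 1,), a[0] == 1.
    """
    order = a.shape[0] - 1
    state = torch.zeros(order)                   # [y[t-1], ..., y[t-order]]
    out = []
    for xt in x:
        yt = xt - torch.dot(a[1:], state)        # y[t] = x[t] - sum_k a_k y[t-k]
        out.append(yt)
        state = torch.cat([yt.reshape(1), state[:-1]])
    return torch.stack(out)

# impulse response of y[t] = x[t] + 0.5 * y[t-1]
imp = torch.zeros(6)
imp[0] = 1.0
resp = allpole_state_space(imp, torch.tensor([1.0, -0.5]))
```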

@mthrok
Contributor

mthrok commented Jul 12, 2021

@yoyololicon Sorry for not getting back to you. I am very busy these days and I should be able to get back to this at the end of this week.

@mthrok
Contributor

mthrok commented Jul 21, 2021

@yoyololicon Again, really sorry for taking a long time. Let's continue to the next step.

assert shape == waveform.size() == output_waveform.size()
assert input_shape == waveform.size()
assert target_shape == output_waveform.size()

Contributor


Follow-up: we should add tests to verify that the resulting signals are the same whether multiple filters are applied individually or as a bank.
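The requested consistency test could look roughly like this, using a toy direct-form filter-bank implementation in place of torchaudio.functional.lfilter (names and tolerances are placeholders):

```python
import torch

def lfilter_bank_toy(x, b, a):
    """Toy direct-form IIR over a filter bank (not torchaudio's kernel).
    x: (time,); b, a: (num_filters, order + 1), a[:, 0] == 1.
    Returns (num_filters, time), all filters advanced in lockstep."""
    nf, taps = a.shape
    T = x.shape[0]
    y = torch.zeros(nf, T)
    for t in range(T):
        acc = torch.zeros(nf)
        for k in range(taps):
            if t - k >= 0:
                acc = acc + b[:, k] * x[t - k]        # feedforward taps
                if k > 0:
                    acc = acc - a[:, k] * y[:, t - k]  # feedback taps
        y[:, t] = acc
    return y

# consistency check: row i of the bank output should match filtering
# with filter i alone
x = torch.randn(32)
b = torch.tensor([[0.2, 0.3], [0.5, -0.1]])
a = torch.tensor([[1.0, -0.4], [1.0, 0.2]])
bank = lfilter_bank_toy(x, b, a)
alone = torch.stack([lfilter_bank_toy(x, b[i:i + 1], a[i:i + 1])[0] for i in range(2)])
assert torch.allclose(bank, alone, atol=1e-5)
```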

Contributor Author


Got it, I will add batch consistency tests and also benchmarks of multiple filters in later commits.

@mthrok mthrok merged commit aa0dd03 into pytorch:master Jul 21, 2021
@yoyolicoris yoyolicoris deleted the feat/lfilter-filter-banks branch July 21, 2021 23:19
nateanl pushed a commit to nateanl/audio that referenced this pull request Jul 28, 2021

Successfully merging this pull request may close these issues.

Support multiple filters in lfilter
