Add RWKV2 (fast) #17230

@leondz

Description

Model description

I would like to implement a new model architecture.

Short description

RWKV v2 is an "RNN with transformer-level performance, without using attention. Similar to Apple's Attention Free Transformer. All trained models open-source. Inference is very fast (even on CPUs) and might work on cell phones. There's also a GPT-type implementation." -- (Hochreiter's description)

RWKV v2 is parallelizable because the time-decay of each channel is data-independent (and trainable). For example, in a usual RNN you can adjust the time-decay of a channel from, say, 0.8 to 0.5 (these are called "gates"), while in RWKV v2 you simply move the information from a W-0.8-channel to a W-0.5-channel to achieve the same effect. RWKV can leverage GPUs, but doesn't need to.
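To make the parallelization point concrete, here is a minimal sketch. It is a simplified illustration, not the actual RWKV v2 formulation: it assumes a single channel with a fixed decay `w` and a plain running sum, and the function names `recurrent_decay` and `parallel_decay` are hypothetical. Because `w` does not depend on the input, the step-by-step (RNN-style) update and the closed-form sum over all earlier positions give the same result, so every position can be computed independently.

```python
# Simplified illustration only: NOT the actual RWKV v2 formulation.
# Assumption: one channel, a fixed (data-independent) decay w, and a
# plain decayed running sum of the inputs.

def recurrent_decay(xs, w):
    """RNN-style: carry one running state forward, one step at a time."""
    state = 0.0
    out = []
    for x in xs:
        state = w * state + x
        out.append(state)
    return out


def parallel_decay(xs, w):
    """Closed form: each position is computed independently of the others."""
    return [
        sum(w ** (t - s) * xs[s] for s in range(t + 1))
        for t in range(len(xs))
    ]


xs = [1.0, 2.0, 3.0, 4.0]
for w in (0.8, 0.5):  # a slowly decaying channel and a quickly decaying one
    rec = recurrent_decay(xs, w)
    par = parallel_decay(xs, w)
    # Both forms agree because w never depends on the data.
    assert all(abs(a - b) < 1e-9 for a, b in zip(rec, par))
```

If `w` were recomputed from the input at every step, as a gate does, the closed form above would no longer apply directly, which is the contrast the description draws.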

Open source status

  • The model implementation is available
  • The model weights are available

Provide useful links for the implementation

Implementation and weights

There's an implementation at BlinkDL/RWKV-LM, which also gives a detailed description of the model internals and some performance benchmarks. Model weights are currently being trained for a few datasets, including the Pile (see e.g. BlinkDL/RWKV-v2-RNN-Pile) and Danish Gigaword (the latter by me). Both will be openly available; some checkpoints for the Pile already are, even though training is still ongoing.

Status

The model seems quite exciting and I'm able to replicate preliminary results. I'm already talking with @BlinkDL about the implementation. I'm happy to implement/port the model architecture (for both RNN and GPT variants), tokenizer, and tests myself (and have already started), and would appreciate help and advice.
