
A brief summary of the potential issues during the replication and corresponding solutions #81

@puyuanliu

Description

1. "module transformers has no attribute LLaMATokenizer" or "missing key 'llama'".

First install SentencePiece, then install transformers from the Hugging Face git repo, i.e., pip install sentencepiece followed by pip install git+https://github.com/huggingface/transformers.git.
The installation order matters.
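
As a quick sanity check after installing, you can try loading the tokenizer. This is only a sketch: the checkpoint path is a placeholder, and the class name depends on your transformers commit (older commits use LLaMATokenizer, later ones LlamaTokenizer).

```python
# Sanity check: SentencePiece must import cleanly, and the converted LLaMA
# checkpoint directory (placeholder path) must contain the tokenizer files.
import sentencepiece  # noqa: F401  (required by the LLaMA tokenizer)

from transformers import LlamaTokenizer  # on older commits: LLaMATokenizer

tokenizer = LlamaTokenizer.from_pretrained("/path/to/converted/llama-7b")
print(tokenizer.tokenize("Hello, Alpaca!"))
```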

2. CUDA OOM at the beginning of the training.

Use --fp16 True instead of --bf16 True. Lower the per-device batch size (you can raise gradient_accumulation_steps to keep the effective batch size unchanged).
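
The command-line flags map onto HuggingFace TrainingArguments, so the same switches look roughly like this. The output_dir and the specific batch-size values are assumptions to adapt to your hardware.

```python
# Illustrative memory-constrained settings; the concrete values are assumptions.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./alpaca-output",     # placeholder path
    fp16=True,                        # use fp16 instead of bf16
    bf16=False,
    per_device_train_batch_size=1,    # lower this first if you still OOM
    gradient_accumulation_steps=16,   # raise to preserve the effective batch size
)
```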

3. CUDA OOM during model saving.

Assuming you are using torch==1.13.0, edit python/lib/python3.9/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:2224 and change

state_dict[fqn] = state_dict[fqn].clone().detach()

to

state_dict[fqn] = state_dict[fqn].cpu().clone().detach()

This usually happens on GPUs with relatively little memory (e.g., 40 GB or 24 GB).
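As an alternative to patching the installed file, torch 1.13's FSDP can gather the full state dict directly onto the CPU. Below is a hedged sketch, assuming your script calls state_dict() on the FSDP-wrapped model itself; the function name and save path are placeholders.

```python
# Hedged sketch: gather a CPU-offloaded full state dict from an FSDP-wrapped
# model so the gathered copy never has to fit on a single GPU.
import torch
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    StateDictType,
    FullStateDictConfig,
)


def save_full_state_dict(model: FSDP, path: str) -> None:
    """Gather the full state dict on CPU and save it from rank 0 only."""
    save_policy = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
    with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, save_policy):
        cpu_state_dict = model.state_dict()  # tensors land on CPU, no GPU clone
    if torch.distributed.get_rank() == 0:
        torch.save(cpu_state_dict, path)
```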

4. How to perform inference?

Refer to #35 (comment)
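
In case the linked comment is hard to find, here is a minimal generation sketch with the transformers API. The checkpoint path is a placeholder, and the prompt follows the repo's instruction template for examples without an input field.

```python
# Minimal inference sketch; the checkpoint path is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/path/to/finetuned/alpaca"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16).to("cuda")

prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\nGive three tips for staying healthy.\n\n### Response:"
)
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```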

5. Generated tokens are not human-readable at inference time.

Assuming training itself went well (e.g., training loss < 0.5), the most likely cause is that the model weights were corrupted during saving. Make sure there were no error messages while the checkpoint was being written.
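
One quick sanity check is to scan the saved checkpoint for non-finite values (a hedged sketch; the checkpoint path is a placeholder, and a clean scan does not guarantee the save completed correctly).

```python
# Hedged sketch: scan a saved checkpoint for NaN/Inf weights, which often
# indicate a corrupted save. The path below is a placeholder.
import torch

state_dict = torch.load("/path/to/output_dir/pytorch_model.bin", map_location="cpu")
for name, tensor in state_dict.items():
    if tensor.is_floating_point() and not torch.isfinite(tensor).all():
        print(f"Non-finite values in {name}")
```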

6. Finetuning is slow.

Refer to #32 (comment)
