1.The Python packages needed for a local llama2 deployment are many and scattered — how can they be installed in one clean, unified way?
2.How can a new environment be set up quickly?
Consider hosting a private company mirror and providing an in-house download command that points at it (users can still download from overseas sources via pip, or pull dependencies from the company mirror via the in-house command, which works around slow downloads from external networks).
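One way to wire the company mirror in is a per-user pip config (usually ~/.config/pip/pip.conf on Linux). A minimal sketch of generating that file; the internal host name below is a made-up placeholder, not a real mirror:

```python
# Sketch: write a pip.conf that points pip at a company mirror.
# "pypi.example-corp.internal" is a hypothetical placeholder host.
import configparser

config = configparser.ConfigParser()
config["global"] = {
    # Primary index: the company's private mirror.
    "index-url": "https://pypi.example-corp.internal/simple",
    # Fallback public mirror for packages the internal index lacks.
    "extra-index-url": "https://pypi.tuna.tsinghua.edu.cn/simple",
}

# On Linux the real file lives at ~/.config/pip/pip.conf (or the older ~/.pip/pip.conf).
with open("pip.conf", "w") as f:
    config.write(f)
```

With this in place, plain `pip install` hits the company mirror first, so no special client tooling is strictly required.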
3.Team members run different Python versions — will this affect later development?
4.How do you delete a file with a garbled name on Linux?
find ./ -inum 2236429 -exec rm -f {} \;
find ./ -inum 2236429 — find the file under the current directory whose inode number is 2236429
-exec rm — run the rm command on each match
{} — placeholder for the file find matched
\ — escapes the following character for the shell
; — terminates the -exec command
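The same inode-based deletion can be done from Python, which avoids typing the garbled name entirely (get the inode first with `ls -i`, or `os.stat(path).st_ino`):

```python
# Delete a file by its inode number from Python, mirroring
# `find ./ -inum 2236429 -exec rm -f {} \;` without typing the garbled name.
import os

def remove_by_inode(root, inum):
    """Walk `root` and delete the first regular file whose inode is `inum`."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.stat(path).st_ino == inum:
                os.remove(path)
                return True
    return False
```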
5.Python packaging workflow and the role of requirements.txt
Download packages from the external network → finish the project and package it → other users download the same packages when setting up their environment.
The concerns are speed and accuracy (reproducibility).
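The usual cycle is `pip freeze > requirements.txt` on the finished project, then `pip install -r requirements.txt` on the other user's machine. A rough sketch of what freeze produces, using the standard library:

```python
# Sketch of what `pip freeze` writes to requirements.txt: one pinned
# "name==version" line per installed distribution, which
# `pip install -r requirements.txt` later replays on another machine.
from importlib.metadata import distributions

def freeze():
    """Return sorted 'name==version' pins for every installed distribution."""
    return sorted(
        "{}=={}".format(dist.metadata["Name"], dist.version)
        for dist in distributions()
    )

with open("requirements.txt", "w") as f:
    f.write("\n".join(freeze()) + "\n")
```

Pinning exact versions is what gives the reproducibility: every environment built from the file resolves to the same package set.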
6.Differences between dependency download sources — upstream PyPI vs. the Alibaba and Tsinghua mirrors — and their impact on a project
Guess 1: downloading dependency A from the upstream source also pulls in its dependency B, while a domestic mirror downloads only A itself.
Guess 2: when a dependency is removed, files from the old package are not cleaned up completely.
7.RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
Fix: CUDA 11.8 did not match the installed PyTorch build; look up the matching version on the official site and install with the command it provides.
8.AssertionError: no checkpoint files found in llama-2-7b-chat/
Fix: the command was run with wrong arguments; point it at the correct model under the correct path.
9.HFValidationError: Repo id must use alphanumeric chars or '-', '_', '.
Fix: a malformed path argument; change the format to something like "./models_hf/7B".
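This error (which also shows up as item 19 below) comes from the Hub's repo-id validation. A rough approximation of the rule, to show why local-path-looking strings trip it when the path does not resolve to an existing directory; the regex is my simplification, not the library's exact check:

```python
# Approximate sketch of the Hub repo-id rule behind HFValidationError:
# an id must look like "repo_name" or "namespace/repo_name", built from
# alphanumerics plus '-', '_', '.'. When a path argument does not resolve
# to a real local directory, transformers falls back to treating it as a
# repo id, and strings like "~/project/llama-7B_convert" fail this check.
import re

_REPO_ID = re.compile(r"^[\w.-]+(/[\w.-]+)?$")

def looks_like_repo_id(name):
    """Return True if `name` roughly matches the Hub repo-id format."""
    return bool(_REPO_ID.match(name))
```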
10.torch.cuda.OutOfMemoryError: CUDA out of memory
Fix: add
export CUDA_VISIBLE_DEVICES=0,1
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:32
11.mnt/sda/xuelin-3/anaconda3/envs/llama/bin/python: Error while finding module specification for 'llama_recipes.finetuning' (ModuleNotFoundError: No module named 'llama_recipes')
Fix: run pip install -e . from the top-level llama-recipes directory.
12.ImportError: Using `load_in_8bit=True` requires Accelerate: `pip install accelerate` and the latest version of bitsandbytes `pip install -i https://test.pypi.org/simple/ bitsandbytes` or pip install bitsandbytes`
Fix: the export CUDA_VISIBLE_DEVICES=1 line had been mistyped.
13.AssertionError: Unknown dataset: /mnt/md0/xuelin/project/llama_recipes/datasets/alpaca_dataset
Fix: the argument must be a registered dataset name, e.g. --dataset alpaca_dataset, not a filesystem path.
14.FileNotFoundError: [Errno 2] No such file or directory: 'src/llama_recipes/datasets/alpaca_data.json'
Fix: edit the corresponding dataset path in datasets.py under config; I changed it to an absolute path.
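The fix amounts to editing the matching dataclass in that config file. Roughly this shape — the field names below are from memory of llama-recipes' configs/datasets.py, so verify them against your checkout:

```python
# Rough shape of the alpaca entry in llama_recipes/configs/datasets.py;
# changing data_path to an absolute path is the fix described above.
from dataclasses import dataclass

@dataclass
class alpaca_dataset:
    dataset: str = "alpaca_dataset"
    train_split: str = "train"
    test_split: str = "val"
    # Original relative path; replace it with an absolute path, e.g.
    # "/mnt/md0/.../alpaca_data.json", so it resolves regardless of the cwd.
    data_path: str = "src/llama_recipes/datasets/alpaca_data.json"
```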
15.ConnectionError: Couldn't reach https://huggingface.co/datasets/samsum/resolve/main/data/corpus.7z (ConnectionError(MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /datasets/samsum/resolve/main/data/corpus.7z (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f9acc3b9b50>: Failed to establish a new connection: [Errno 101] Network is unreachable'))")))
Fix: changed the dataset in training.py to alpaca_data.
16.OSError: /mnt/md0/xuelin/project/llama_hf/7B does not appear to have a file named config.json. Checkout 'https://huggingface.co//mnt/md0/xuelin/project/llama_hf/7B/main' for available files.
17.ValueError: Non-consecutive added token '<unk>' found. Should have index 32000 but has index 0 in saved vocabulary
Fix: version mismatch — the transformers version should be 4.34.0.
19.huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '~/project/llama-7B_convert'. Use `repo_type` argument if needed.
Fix: usually a mistyped path; double-check it.
Unresolved & to be reproduced:
20.ImportError: Using `load_in_8bit=True` requires Accelerate: `pip install accelerate` and the latest version of bitsandbytes `pip install -i https://test.pypi.org/simple/ bitsandbytes` or pip install bitsandbytes
Unresolved: seems to need a distributed setup.
21.FileNotFoundError: [Errno 2] No such file or directory: '~/project/llama-recipes/src/llama_recipes/datasets/alpaca_data.json'
22.ImportError: cannot import name 'get_peft_model' from 'peft'
23.OSError: Can't load tokenizer for './PEFT_alpaca_model'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure './PEFT_alpaca_model' is the correct path to a directory containing all relevant files for a LlamaTokenizer tokenizer.
24.ImportError: cannot import name 'get_dataset' from partially initialized module 'llama_recipes.datasets.grammar_dataset.grammar_dataset'
25.ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set
26.torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
28.OSError: We couldn't connect to 'https://huggingface.co' to load this file, couldn't find it in the cached files and it looks like Salesforce/safety-flan-t5-base is not the path to a directory containing a file named config.json. Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.
30.Epoch 1: train_perplexity=nan, train_epoch_loss=nan, epoch time 12980.557973548071s. The loss comes out as nan.
Still to be resolved:
31.export CUDA_VISIBLE_DEVICES=0,1 failed to take effect: training still ran only on GPU 0
Fix: set the devices in code with os.environ["CUDA_VISIBLE_DEVICES"]="0,1" right after importing os, and it must happen before the torch package is imported
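The ordering is the whole fix: torch reads CUDA_VISIBLE_DEVICES when it first initializes CUDA, so setting it after `import torch` is too late. A minimal sketch of the required order (the torch import is left commented so the snippet runs without a GPU):

```python
# CUDA_VISIBLE_DEVICES is read when torch first initializes CUDA,
# so it must be set before torch is imported, not after.
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"  # must precede `import torch`

# import torch  # from here on, torch sees only GPUs 0 and 1 (as cuda:0 / cuda:1)
```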
32.CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
Fix: likely insufficient GPU memory; it worked again after killing every process holding GPU memory.
33.Garbled output from lccc when using llama-2-13b-hf
Fix: use a different name; "llama" appears to be a filtered keyword on lccc.
34.TypeError: generate() takes 1 positional argument but 2 were given
Fix: two approaches.
Option 1: change the argument input_ids to **input_ids, so the tokenizer output is unpacked into keyword arguments:
model.generate(
    **input_ids,             # input tensors, unpacked as keyword arguments
    max_new_tokens=1024,     # maximum number of new tokens to generate
    do_sample=True,          # sample instead of greedy decoding
    top_p=0.85,              # nucleus-sampling threshold (decoding strategy), at most 1
    temperature=1.0,         # higher means more random output
    repetition_penalty=1.0,  # repetition penalty; 1.0 means no penalty
    eos_token_id=2,          # end-of-sequence token id
    bos_token_id=1,          # beginning-of-sequence token id
    pad_token_id=0)          # padding token id
Option 2: merge the LoRA adapter before text generation with model = model.merge_and_unload()
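Option 1 is plain Python argument unpacking: the tokenizer returns a dict-like batch, and generate() only accepts its contents as keyword arguments. A toy stand-in function (not the real HF API) reproduces the error and the fix:

```python
# Toy illustration of why generate(batch) fails but generate(**batch) works:
# the tokenizer output is a mapping, and generate() takes its contents as
# keyword arguments, not as one positional dict.
def generate(**kwargs):
    """Stand-in for model.generate(); accepts keyword arguments only."""
    return kwargs

batch = {"input_ids": [[1, 2, 3]], "attention_mask": [[1, 1, 1]]}

# Passing the dict positionally raises the same class of error as item 34:
# TypeError: generate() takes 0 positional arguments but 1 was given.
try:
    generate(batch)
except TypeError:
    pass

out = generate(**batch)  # unpacking the mapping into keyword arguments works
```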
35.RuntimeError: CUDA error: an illegal memory access was encountered
Fix: set the two lines below. After it was fixed, I commented the two added lines back out and it still ran — presumably an environment variable, once set, stays in effect for the current connection:
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
torch.cuda.set_device(1)
The error reappeared after reconnecting.
Full error:
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Setting export CUDA_LAUNCH_BLOCKING=1 again
resolved it.
36.ImportError: Failed to load PyTorch C extensions:
Cause unclear; this appeared while creating a fresh llama2 environment.
37.RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
Fix: has not recurred since downgrading the torch package to 2.0.1.
38.ImportError: cannot import name 'get_dataset' from partially initialized module 'llama_recipes.datasets.grammar_dataset.grammar_dataset'
Fix: suspected a method-name conflict with another dependency; commented out the two datasets other than alpaca_dataset.
39.ValueError: `.to` is not supported for `4-bit` or `8-bit` bitsandbytes models.
Fix: 4-bit and 8-bit quantized models cannot be moved with .to(); quantization conflicts with casting to pure bfloat16, so I dropped the pure-bfloat16 cast.
40.ValueError: Inconsistent compute device and `device_id` on rank 1: cuda:0 vs cuda:1
Unresolved: the cause is that "FSDP is incompatible with LoRA"; switched to a new script.
41.terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
42.RuntimeError: CUDA error: invalid device ordinal. Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Fix: make sure the number of visible GPUs equals nproc_per_node; setting nproc_per_node=1 works. The machine's actual GPU count must match the number you specify.
43.ValueError: Inconsistent compute device and `device_id` on rank 3: cuda:0 vs cuda:3
cuda:3 kept erroring, so only GPUs 0, 1, and 2 are used.
44.ValueError: Integer parameters are unsupported
Fix: use pure bfloat16 only and drop the quantization.