bug
From: mobius
Published 2024-01-15 09:47:07

1. A local llama2 deployment needs many scattered Python packages. How can they all be installed in one clean, unified step?

2. How can a new environment be set up quickly?

Consider building a company-private package mirror and providing an in-house download command that points to it. Users can still download from upstream sources with pip, or pull dependencies from the company mirror, which works around slow downloads from overseas networks.
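As a sketch, pip can be pointed at such a mirror per command (`pip install -i <mirror-url> <package>`) or globally through pip.conf; the mirror URL below is a hypothetical placeholder, not a real host:

```ini
; ~/.pip/pip.conf on Linux/macOS (hypothetical company mirror URL)
[global]
index-url = https://pypi.example-company.internal/simple
trusted-host = pypi.example-company.internal
```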

3. Team members are on different Python versions; will that affect later development?
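One lightweight mitigation is a startup guard that fails fast on an unsupported interpreter. The minimum version below is an illustrative assumption, not something stated in this log:

```python
import sys

# Fail fast if the interpreter is older than the team's agreed minimum.
# (3, 8) is an assumed floor; use whatever the project actually requires.
MIN_VERSION = (3, 8)

if sys.version_info < MIN_VERSION:
    raise RuntimeError(
        f"Python {MIN_VERSION[0]}.{MIN_VERSION[1]}+ required, "
        f"found {sys.version.split()[0]}"
    )
print("Python version OK:", sys.version.split()[0])
```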

4. How do you delete a file with a garbled name in Linux?

find ./ -inum 2236429 -exec rm -f {} \;

find ./ -inum 2236429 finds the file under the current directory whose inode number is 2236429

-exec rm runs the rm command on each match

{} is the placeholder for the file(s) found by find

\ escapes the following character

; terminates the -exec command
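The same delete-by-inode idea can be sketched with the Python standard library; the "garbled" filename below is fabricated for the demo:

```python
import os
import tempfile

# Create a throwaway directory containing a file whose name is hard to type.
workdir = tempfile.mkdtemp()
garbled = os.path.join(workdir, "weird\u00ff\u00fename")
open(garbled, "w").close()

# Look up the inode, then delete by matching inode instead of by name,
# mirroring `find ./ -inum <n> -exec rm -f {} \;`.
target_inode = os.stat(garbled).st_ino
for entry in os.scandir(workdir):
    if entry.inode() == target_inode:
        os.remove(entry.path)

print(os.listdir(workdir))  # → []
```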

5. The Python packaging workflow, and what requirements.txt is for

Packages are downloaded from external sources -> the project is finished and packaged -> other users download the same packages when setting up their environment.

What matters here: speed and accuracy.
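requirements.txt is what makes the last step reproducible: `pip freeze > requirements.txt` records the exact versions used by the finished project, and other users restore them with `pip install -r requirements.txt`. A minimal example, with pins taken from versions mentioned later in this log:

```
torch==2.0.1
transformers==4.34.0
```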

6. Differences between dependency sources (the upstream index, the Aliyun mirror, the Tsinghua mirror) and their impact on a project

Guess 1: downloading dependency A from the upstream source also pulls in its dependency B, while a domestic mirror downloads only A itself.

Guess 2: when a dependency is removed, the old package is not cleaned up completely.

7.RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!

Fix: the CUDA version (11.8) did not match the installed PyTorch build. Look up the matching combination on the official PyTorch site and install with the command given there.
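A quick way to inspect the mismatch from Python (assuming PyTorch is installed):

```python
import torch

# Which CUDA toolkit was this PyTorch wheel built against, and can it
# actually see a GPU? If is_available() is False, NCCL init will fail.
print("torch version:", torch.__version__)       # e.g. 2.0.1+cu118
print("built for CUDA:", torch.version.cuda)     # e.g. 11.8
print("GPU visible:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU count:", torch.cuda.device_count())
```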

8.AssertionError: no checkpoint files found in llama-2-7b-chat/

Fix: the command arguments were set incorrectly; point the command at the correct model under the correct path.

9.HFValidationError: Repo id must use alphanumeric chars or '-', '_', '.

Fix: an argument-format problem; change the argument to a path of the form "./models_hf/7B".

10.torch.cuda.OutOfMemoryError: CUDA out of memory

Fix: add export CUDA_VISIBLE_DEVICES=0,1

export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:32

11.mnt/sda/xuelin-3/anaconda3/envs/llama/bin/python: Error while finding module specification for 'llama_recipes.finetuning' (ModuleNotFoundError: No module named 'llama_recipes')

Fix: run pip install -e . in the top-level llama-recipes directory.

12.ImportError: Using `load_in_8bit=True` requires Accelerate: `pip install accelerate` and the latest version of bitsandbytes `pip install -i https://test.pypi.org/simple/ bitsandbytes` or pip install bitsandbytes`

Fix: the line export CUDA_VISIBLE_DEVICES=1 had been written incorrectly.

13.AssertionError: Unknown dataset: /mnt/md0/xuelin/project/llama_recipes/datasets/alpaca_dataset

Fix: the argument must be a dataset name of the form --dataset alpaca_dataset, not a filesystem path.

14.FileNotFoundError: [Errno 2] No such file or directory: 'src/llama_recipes/datasets/alpaca_data.json'

Fix: edit the corresponding dataset path in datasets.py under configs; I changed it to an absolute path.
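For illustration, the dataset entry in configs/datasets.py is a small dataclass; the field names and the absolute path below are assumptions for the sketch, so verify them against your llama-recipes checkout:

```python
from dataclasses import dataclass


# Hypothetical sketch of the alpaca entry in llama-recipes' configs/datasets.py;
# check the real file for the exact field names.
@dataclass
class alpaca_dataset:
    dataset: str = "alpaca_dataset"
    train_split: str = "train"
    test_split: str = "val"
    # Use an absolute path so it resolves regardless of the working
    # directory (the path below is a placeholder).
    data_path: str = "/abs/path/to/llama-recipes/src/llama_recipes/datasets/alpaca_data.json"


print(alpaca_dataset().data_path)
```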

15.ConnectionError: Couldn't reach https://huggingface.co/datasets/samsum/resolve/main/data/corpus.7z (ConnectionError(MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /datasets/samsum/resolve/main/data/corpus.7z (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f9acc3b9b50>: Failed to establish a new connection: [Errno 101] Network is unreachable'))")))

Fix: change the dataset in training.py to alpaca_data.

16.OSError: /mnt/md0/xuelin/project/llama_hf/7B does not appear to have a file named config.json. Checkout 'https://huggingface.co//mnt/md0/xuelin/project/llama_hf/7B/main' for available files.

17.ValueError: Non-consecutive added token '<unk>' found. Should have index 32000 but has index 0 in saved vocabulary

Fix: a version problem; the transformers version should be 4.34.0.

19.huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '~/project/llama-7B_convert'. Use `repo_type` argument if needed.

Fix: usually a mistyped path; check it (note the ~ in the path is not expanded here, so an absolute path is safer).

Unresolved & awaiting reproduction:

20.ImportError: Using `load_in_8bit=True` requires Accelerate: `pip install accelerate` and the latest version of bitsandbytes `pip install -i https://test.pypi.org/simple/ bitsandbytes` or pip install bitsandbytes

Unresolved: appears to require a distributed setup.

21.FileNotFoundError: [Errno 2] No such file or directory: '~/project/llama-recipes/src/llama_recipes/datasets/alpaca_data.json'

22.ImportError: cannot import name 'get_peft_model' from 'peft'

23.OSError: Can't load tokenizer for './PEFT_alpaca_model'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure './PEFT_alpaca_model' is the correct path to a directory containing all relevant files for a LlamaTokenizer tokenizer.

24.ImportError: cannot import name 'get_dataset' from partially initialized module 'llama_recipes.datasets.grammar_dataset.grammar_dataset'

25.ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set

26.torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

28.OSError: We couldn't connect to 'https://huggingface.co' to load this file, couldn't find it in the cached files and it looks like Salesforce/safety-flan-t5-base is not the path to a directory containing a file named config.json. Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.

30.Epoch 1: train_perplexity=nan, train_epoch_loss=nan, epoch time 12980.557973548071s. The loss shows nan.

To be resolved:

31.export CUDA_VISIBLE_DEVICES=0,1 failed to select the GPUs; the job still ran only on GPU 0.

Fix: set the device when importing the os package, with os.environ["CUDA_VISIBLE_DEVICES"]="0,1", and do so before importing torch.
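A minimal sketch of the ordering requirement (the assumption being that torch records which GPUs it can see when it is first imported, so setting the variable afterwards is too late):

```python
import os

# Must run before the first `import torch`; after torch is imported the
# process has already decided which GPUs are visible.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

import torch  # imported only after the environment variable is set

print(os.environ["CUDA_VISIBLE_DEVICES"])  # → 0,1
```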

32.CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`

Fix: likely insufficient GPU memory; it worked again after killing every process occupying the GPU.

33. Garbled output from lccc when using llama-2-13b-hf

Fix: use a different name; "llama" appears to be treated as a sensitive word by lccc.

34.TypeError: generate() takes 1 positional argument but 2 were given

Fix: two options.

Option 1: change the positional argument input_ids to the keyword-unpacked **input_ids:

model.generate(
    **input_ids,              # input tensors, unpacked as keyword arguments
    max_new_tokens=1024,      # maximum number of new tokens to generate
    do_sample=True,           # sample instead of greedy decoding
    top_p=0.85,               # nucleus-sampling threshold (decoding strategy)
    temperature=1.0,          # higher is more random; 1 is the maximum here
    repetition_penalty=1.0,   # 1.0 means no repetition penalty
    eos_token_id=2,           # end-of-sequence token id
    bos_token_id=1,           # beginning-of-sequence token id
    pad_token_id=0)           # padding token id

Option 2: merge the adapter before generating, with model = model.merge_and_unload().

35.RuntimeError: CUDA error: an illegal memory access was encountered

Fix: set the following two lines. After the problem cleared I commented them out again and it still ran, presumably because an environment variable, once set, stays in effect for the current session:

os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

torch.cuda.set_device(1)

It reproduced after reconnecting.

Full error:

RuntimeError: CUDA error: an illegal memory access was encountered

CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.

For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Setting export CUDA_LAUNCH_BLOCKING=1 again resolved the problem.

36.ImportError: Failed to load PyTorch C extensions:

The cause is unclear; this appeared while creating a new llama2 environment.

37.RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`

Fix: the problem has not reappeared since downgrading torch to 2.0.1.

38.ImportError: cannot import name 'get_dataset' from partially initialized module 'llama_recipes.datasets.grammar_dataset.grammar_dataset'

Fix: suspect a method name in that module clashes with another dependency; I commented out the two datasets other than alpaca_dataset.

39.ValueError: `.to` is not supported for `4-bit` or `8-bit` bitsandbytes models.

Fix: 4-bit and 8-bit quantized models cannot be moved with .to; quantization conflicts with casting to pure bfloat16, so I dropped the pure-bfloat16 step.

40.ValueError: Inconsistent compute device and `device_id` on rank 1: cuda:0 vs cuda:1

Unresolved: the cause is that FSDP is incompatible with LoRA; switched to a new script.

41.terminate called after throwing an instance of 'c10::Error'

 what():  CUDA error: an illegal memory access was encountered

Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

42.RuntimeError: CUDA error: invalid device ordinal Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Fix: make sure the GPU count matches nproc_per_node (here, set nproc_per_node=1). The number of GPUs on the machine must equal the number you specify.

43.ValueError: Inconsistent compute device and `device_id` on rank 3: cuda:0 vs cuda:3

cuda:3 kept failing, so only GPUs 0, 1, and 2 are used.

44.ValueError: Integer parameters are unsupported

Fix: use pure bfloat16 only and remove the quantization.
