1.The Python packages needed for a local llama2 deployment are many and scattered — how can they be installed in one clean, unified way?
2.How can a new environment be set up quickly?
Consider hosting a private company mirror and providing an in-house download command that points at it (users can still download from overseas sources via pip, or pull dependencies from the company mirror via the in-house command, which works around slow downloads from external networks).
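One way to wire the company mirror in is a per-user pip config (usually ~/.config/pip/pip.conf on Linux). A minimal sketch of generating that file; the internal host name below is a made-up placeholder, not a real mirror:

```python
# Sketch: write a pip.conf that points pip at a company mirror.
# "pypi.example-corp.internal" is a hypothetical placeholder host.
import configparser

config = configparser.ConfigParser()
config["global"] = {
    # Primary index: the company's private mirror.
    "index-url": "https://pypi.example-corp.internal/simple",
    # Fallback public mirror for packages the internal index lacks.
    "extra-index-url": "https://pypi.tuna.tsinghua.edu.cn/simple",
}

# On Linux the real file lives at ~/.config/pip/pip.conf (or the older ~/.pip/pip.conf).
with open("pip.conf", "w") as f:
    config.write(f)
```

With this in place, plain `pip install` hits the company mirror first, so no special client tooling is strictly required.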
3.Team members run different Python versions — will this affect later development?
4.How do you delete a file with a garbled name on Linux?
find ./ -inum 2236429 -exec rm -f {} \;
find ./ -inum 2236429 — find the file under the current directory whose inode number is 2236429
-exec rm — run the rm command on each match
{} — placeholder for the file find matched
\ — escapes the following character for the shell
; — terminates the -exec command
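The same inode-based deletion can be done from Python, which avoids typing the garbled name entirely (get the inode first with `ls -i`, or `os.stat(path).st_ino`):

```python
# Delete a file by its inode number from Python, mirroring
# `find ./ -inum 2236429 -exec rm -f {} \;` without typing the garbled name.
import os

def remove_by_inode(root, inum):
    """Walk `root` and delete the first regular file whose inode is `inum`."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.stat(path).st_ino == inum:
                os.remove(path)
                return True
    return False
```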
5.Python packaging workflow and the role of requirements.txt
Download packages from the external network → finish the project and package it → other users download the same packages when setting up their environment.
The concerns are speed and accuracy (reproducibility).
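The usual cycle is `pip freeze > requirements.txt` on the finished project, then `pip install -r requirements.txt` on the other user's machine. A rough sketch of what freeze produces, using the standard library:

```python
# Sketch of what `pip freeze` writes to requirements.txt: one pinned
# "name==version" line per installed distribution, which
# `pip install -r requirements.txt` later replays on another machine.
from importlib.metadata import distributions

def freeze():
    """Return sorted 'name==version' pins for every installed distribution."""
    return sorted(
        "{}=={}".format(dist.metadata["Name"], dist.version)
        for dist in distributions()
    )

with open("requirements.txt", "w") as f:
    f.write("\n".join(freeze()) + "\n")
```

Pinning exact versions is what gives the reproducibility: every environment built from the file resolves to the same package set.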
6.Differences between dependency download sources — upstream PyPI vs. the Alibaba and Tsinghua mirrors — and their impact on a project
Guess 1: downloading dependency A from the upstream source also pulls in its dependency B, while a domestic mirror downloads only A itself.
Guess 2: when a dependency is removed, files from the old package are not cleaned up completely.
7.RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
Fix: CUDA 11.8 did not match the installed PyTorch build; look up the matching version on the official site and install with the command it provides.
8.AssertionError: no checkpoint files found in llama-2-7b-chat/
Fix: the command was run with wrong arguments; point it at the correct model under the correct path.
9.HFValidationError: Repo id must use alphanumeric chars or '-', '_', '.
Fix: a malformed path argument; change the format to something like "./models_hf/7B".
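This error (which also shows up as item 19 below) comes from the Hub's repo-id validation. A rough approximation of the rule, to show why local-path-looking strings trip it when the path does not resolve to an existing directory; the regex is my simplification, not the library's exact check:

```python
# Approximate sketch of the Hub repo-id rule behind HFValidationError:
# an id must look like "repo_name" or "namespace/repo_name", built from
# alphanumerics plus '-', '_', '.'. When a path argument does not resolve
# to a real local directory, transformers falls back to treating it as a
# repo id, and strings like "~/project/llama-7B_convert" fail this check.
import re

_REPO_ID = re.compile(r"^[\w.-]+(/[\w.-]+)?$")

def looks_like_repo_id(name):
    """Return True if `name` roughly matches the Hub repo-id format."""
    return bool(_REPO_ID.match(name))
```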
10.torch.cuda.OutOfMemoryError: CUDA out of memory
Fix: add
export CUDA_VISIBLE_DEVICES=0,1
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:32
11.mnt/sda/xuelin-3/anaconda3/envs/llama/bin/python: Error while finding module specification for 'llama_recipes.finetuning' (ModuleNotFoundError: No module named 'llama_recipes')
Fix: run pip install -e . from the top-level llama-recipes directory.
12.ImportError: Using `load_in_8bit=True` requires Accelerate: `pip install accelerate` and the latest version of bitsandbytes `pip install -i https://test.pypi.org/simple/ bitsandbytes` or pip install bitsandbytes`
Fix: the export CUDA_VISIBLE_DEVICES=1 line had been mistyped.
13.AssertionError: Unknown dataset: /mnt/md0/xuelin/project/llama_recipes/datasets/alpaca_dataset
Fix: the argument must be a registered dataset name, e.g. --dataset alpaca_dataset, not a filesystem path.
14.FileNotFoundError: [Errno 2] No such file or directory: 'src/llama_recipes/datasets/alpaca_data.json'
Fix: edit the corresponding dataset path in datasets.py under config; I changed it to an absolute path.
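The fix amounts to editing the matching dataclass in that config file. Roughly this shape — the field names below are from memory of llama-recipes' configs/datasets.py, so verify them against your checkout:

```python
# Rough shape of the alpaca entry in llama_recipes/configs/datasets.py;
# changing data_path to an absolute path is the fix described above.
from dataclasses import dataclass

@dataclass
class alpaca_dataset:
    dataset: str = "alpaca_dataset"
    train_split: str = "train"
    test_split: str = "val"
    # Original relative path; replace it with an absolute path, e.g.
    # "/mnt/md0/.../alpaca_data.json", so it resolves regardless of the cwd.
    data_path: str = "src/llama_recipes/datasets/alpaca_data.json"
```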
15.ConnectionError: Couldn't reach https://huggingface.co/datasets/samsum/resolve/main/data/corpus.7z (ConnectionError(MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /datasets/samsum/resolve/main/data/corpus.7z (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f9acc3b9b50>: Failed to establish a new connection: [Errno 101] Network is unreachable'))")))
Fix: changed the dataset in training.py to alpaca_data.
16.OSError: /mnt/md0/xuelin/project/llama_hf/7B does not appear to have a file named config.json. Checkout 'https://huggingface.co//mnt/md0/xuelin/project/llama_hf/7B/main' for available files.
17.ValueError: Non-consecutive added token '<unk>' found. Should have index 32000 but has index 0 in saved vocabulary
Fix: version mismatch — the transformers version should be 4.34.0.
19.huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '~/project/llama-7B_convert'. Use `repo_type` argument if needed.
Fix: usually a mistyped path; double-check it.
Unresolved & to be reproduced:
20.ImportError: Using `load_in_8bit=True` requires Accelerate: `pip install accelerate` and the latest version of bitsandbytes `pip install -i https://test.pypi.org/simple/ bitsandbytes` or pip install bitsandbytes
Unresolved: seems to need a distributed setup.
21.FileNotFoundError: [Errno 2] No such file or directory: '~/project/llama-recipes/src/llama_recipes/datasets/alpaca_data.json'
22.ImportError: cannot import name 'get_peft_model' from 'peft'
23.OSError: Can't load tokenizer for './PEFT_alpaca_model'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure './PEFT_alpaca_model' is the correct path to a directory containing all relevant files for a LlamaTokenizer tokenizer.
24.ImportError: cannot import name 'get_dataset' from partially initialized module 'llama_recipes.datasets.grammar_dataset.grammar_dataset'
25.ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set
26.torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
28.OSError: We couldn't connect to 'https://huggingface.co' to load this file, couldn't find it in the cached files and it looks like Salesforce/safety-flan-t5-base is not the path to a directory containing a file named config.json. Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.
30.Epoch 1: train_perplexity=nan, train_epoch_loss=nan, epoch time 12980.557973548071s. The loss comes out as nan.
Still to be resolved:
31.export CUDA_VISIBLE_DEVICES=0,1 failed to take effect: training still ran only on GPU 0
Fix: set the devices in code with os.environ["CUDA_VISIBLE_DEVICES"]="0,1" right after importing os, and it must happen before the torch package is imported
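The ordering is the whole fix: torch reads CUDA_VISIBLE_DEVICES when it first initializes CUDA, so setting it after `import torch` is too late. A minimal sketch of the required order (the torch import is left commented so the snippet runs without a GPU):

```python
# CUDA_VISIBLE_DEVICES is read when torch first initializes CUDA,
# so it must be set before torch is imported, not after.
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"  # must precede `import torch`

# import torch  # from here on, torch sees only GPUs 0 and 1 (as cuda:0 / cuda:1)
```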
32.CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
Fix: likely insufficient GPU memory; it worked again after killing every process holding GPU memory.
33.Garbled output from lccc when using llama-2-13b-hf
Fix: use a different name; "llama" appears to be a filtered keyword on lccc.
34.TypeError: generate() takes 1 positional argument but 2 were given
Fix: two approaches.
Option 1: change the argument input_ids to **input_ids, so the tokenizer output is unpacked into keyword arguments:
model.generate(
    **input_ids,             # input tensors, unpacked as keyword arguments
    max_new_tokens=1024,     # maximum number of new tokens to generate
    do_sample=True,          # sample instead of greedy decoding
    top_p=0.85,              # nucleus-sampling threshold (decoding strategy), at most 1
    temperature=1.0,         # higher means more random output
    repetition_penalty=1.0,  # repetition penalty; 1.0 means no penalty
    eos_token_id=2,          # end-of-sequence token id
    bos_token_id=1,          # beginning-of-sequence token id
    pad_token_id=0)          # padding token id
Option 2: merge the LoRA adapter before text generation with model = model.merge_and_unload()
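Option 1 is plain Python argument unpacking: the tokenizer returns a dict-like batch, and generate() only accepts its contents as keyword arguments. A toy stand-in function (not the real HF API) reproduces the error and the fix:

```python
# Toy illustration of why generate(batch) fails but generate(**batch) works:
# the tokenizer output is a mapping, and generate() takes its contents as
# keyword arguments, not as one positional dict.
def generate(**kwargs):
    """Stand-in for model.generate(); accepts keyword arguments only."""
    return kwargs

batch = {"input_ids": [[1, 2, 3]], "attention_mask": [[1, 1, 1]]}

# Passing the dict positionally raises the same class of error as item 34:
# TypeError: generate() takes 0 positional arguments but 1 was given.
try:
    generate(batch)
except TypeError:
    pass

out = generate(**batch)  # unpacking the mapping into keyword arguments works
```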
35.RuntimeError: CUDA error: an illegal memory access was encountered
Fix: set the two lines below. After it was fixed, I commented the two added lines back out and it still ran — presumably an environment variable, once set, stays in effect for the current connection:
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
torch.cuda.set_device(1)
The error reappeared after reconnecting.
Full error:
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Setting export CUDA_LAUNCH_BLOCKING=1 again
resolved it.
36.ImportError: Failed to load PyTorch C extensions:
Cause unclear; this appeared while creating a fresh llama2 environment.
37.RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
Fix: has not recurred since downgrading the torch package to 2.0.1.
38.ImportError: cannot import name 'get_dataset' from partially initialized module 'llama_recipes.datasets.grammar_dataset.grammar_dataset'
Fix: suspected a method-name conflict with another dependency; commented out the two datasets other than alpaca_dataset.
39.ValueError: `.to` is not supported for `4-bit` or `8-bit` bitsandbytes models.
Fix: 4-bit and 8-bit quantized models cannot be moved with .to(); quantization conflicts with casting to pure bfloat16, so I dropped the pure-bfloat16 cast.
40.ValueError: Inconsistent compute device and `device_id` on rank 1: cuda:0 vs cuda:1
Unresolved: the cause is that "FSDP is incompatible with LoRA"; switched to a new script.
41.terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
42.RuntimeError: CUDA error: invalid device ordinal. Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Fix: make sure the number of visible GPUs equals nproc_per_node; setting nproc_per_node=1 works. The machine's actual GPU count must match the number you specify.
43.ValueError: Inconsistent compute device and `device_id` on rank 3: cuda:0 vs cuda:3
cuda:3 kept erroring, so only GPUs 0, 1, and 2 are used.
44.ValueError: Integer parameters are unsupported
Fix: use pure bfloat16 only and drop the quantization.