
RedNet Reproduction Notes

news 2025/6/27 21:26:08

Download
Create a virtual environment
Prepare the data
Run the test
Run training

Download

git clone https://github.com/JindongJiang/RedNet.git
cd RedNet/

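Create a virtual environment

The official README says the project uses PyTorch 0.4.0, which is far too old; the PyTorch website does not even offer a wheel of that version that I can use. So I tried the latest release, PyTorch 2.4.0, first. My machine has CUDA 12.1, so I installed the CUDA 12.1 build: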

conda create -n REDNET python=3.10 -y
conda activate REDNET
pip3 install torch torchvision torchaudio
pip3 install empy==3.3.4 rospkg pyyaml catkin_pkg

Then install the remaining packages listed in the requirements:

pip3 install numpy imageio scipy tensorboardX matplotlib scikit-image h5py

Good, no errors.
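Before moving on, it can help to confirm which PyTorch build actually got installed and whether it sees the GPU. The following is only a minimal sketch: describe_torch_install is a hypothetical helper name, not part of RedNet, and it simply reports what the installed package exposes.

```python
def describe_torch_install():
    """Report the installed PyTorch version and its CUDA status, if any."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in the active environment.
        return {"installed": False}
    return {
        "installed": True,
        "torch": torch.__version__,
        # CUDA version the wheel was built against (None for CPU-only builds).
        "cuda_build": torch.version.cuda,
        # True only if a usable GPU and driver are visible at runtime.
        "cuda_available": torch.cuda.is_available(),
    }


info = describe_torch_install()
print(info)
```

If cuda_available is False inside the REDNET environment, the --cuda flag used later is unlikely to work, so this is worth checking first.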

Prepare the data

Following the README, I downloaded SUNRGBD.zip and SUNRGBDtoolbox.zip, extracted them onto my data drive, and symlinked them into the repository.

Run the test

Following the README:

python RedNet_inference.py --cuda --last-ckpt rednet_ckpt.pth -r SUMRGBD_Benchmark/SUNRGBD/kv1/NYUdata/NYU0001/image/NYU0001.jpg -d SUMRGBD_Benchmark/SUNRGBD/kv1/NYUdata/NYU0001/depth/NYU0001.png -o output.png

This produced an error:

/home/lcy-magic/Segment_TEST/RedNet/RedNet_inference.py:36: DeprecationWarning: Starting with ImageIO v3 the behavior of this function will switch to that of iio.v3.imread. To keep the current behavior (and make this warning disappear) use `import imageio.v2 as imageio` or call `imageio.v2.imread` directly.
  depth = imageio.imread(args.depth)
Traceback (most recent call last):
  File "/home/lcy-magic/anaconda3/envs/REDNET/lib/python3.10/site-packages/PIL/Image.py", line 3277, in fromarray
    mode, rawmode = _fromarray_typemap[typekey]
KeyError: ((1, 1, 3), '<f4')

A search showed that the image writer expects uint8 data, while the output here is f4, i.e. a 4-byte float (float32). Converting the array to uint8 is enough:

import numpy as np
# imageio.imsave(args.output, output.cpu().numpy().transpose((1, 2, 0)))
imageio.imsave(args.output, np.uint8(output.cpu().numpy().transpose((1, 2, 0))))
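A direct np.uint8 cast does not clip values outside 0 to 255 and drops fractional parts. As a minimal sketch, clipping and rounding first makes the conversion explicit. Here to_uint8_image is a hypothetical helper, not part of RedNet, and it assumes the array already holds colour values on a 0 to 255 scale.

```python
import numpy as np


def to_uint8_image(array):
    """Convert a float image on a 0-255 scale to uint8 for saving."""
    # Round to the nearest integer, then clamp into the valid uint8 range.
    clipped = np.clip(np.rint(array), 0, 255)
    return clipped.astype(np.uint8)


# float32 data with 3 channels, the dtype that Pillow rejected above.
demo = np.array([[[12.6, 255.0, 300.0]]], dtype=np.float32)
converted = to_uint8_image(demo)
print(converted.dtype, converted.tolist())
```

In the inference script this would wrap the transposed array in place of the plain np.uint8(...) call.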

With the uint8 conversion in place, the script ran without problems.

Run training

python RedNet_train.py --cuda --data-dir SUMRGBD_Benchmark/

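This raised an error. At first I suspected that the data being read was corrupted, but my MATLAB installation no longer starts, so I could not easily inspect the .mat files. Nobody in the project's issues had reported the problem either. Searching the web for the error message did turn up the cause: according to a blog post, newer versions of h5py removed the value attribute, and [:] is used instead. So I changed the code to: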

# label = np.array(self.SUNRGBD2Dseg[seglabel.value[i][0]].value.transpose(1, 0))
label = np.array(self.SUNRGBD2Dseg[seglabel[:][i][0]][:].transpose(1, 0))

This error no longer appears.
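The same access pattern can be tried on a tiny in-memory file. This is a minimal sketch under assumptions: the dataset names seg0 and seglabel are made up for the demo, the reference dataset is one-dimensional for simplicity, and it uses [()], which h5py recommends as the replacement for .value ([:] behaves the same for non-scalar datasets).

```python
import h5py
import numpy as np

# In-memory HDF5 file (nothing is written to disk) standing in for the .mat file.
with h5py.File("demo_sunrgbd.h5", "w", driver="core", backing_store=False) as f:
    seg = f.create_dataset("seg0", data=np.arange(6, dtype=np.uint8).reshape(2, 3))
    # MATLAB v7.3 cell arrays are stored as datasets of object references.
    refs = f.create_dataset("seglabel", (1,), dtype=h5py.ref_dtype)
    refs[0] = seg.ref

    # Index the dataset instead of using the removed `.value` attribute.
    ref = f["seglabel"][()][0]          # read all references, take the first
    label = np.array(f[ref][()].transpose(1, 0))

print(label.shape)
```

Dereferencing with f[ref] mirrors the self.SUNRGBD2Dseg[...] lookup in the patched line.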

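However, the run then showed no activity for a long time, possibly stuck in a loop. I added tqdm to visualize the loop progress.
First install it: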

pip3 install tqdm

Then import it in both RedNet_train.py and RedNet_data.py:

from tqdm import tqdm

Finally, wrap the loops in both files with tqdm:

for i, meta in tqdm(enumerate(SUNRGBDMeta)):
for epoch in tqdm(range(int(args.start_epoch), args.epochs)):
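One detail worth knowing: enumerate has no length, so tqdm(enumerate(...)) can only show a running count and rate, not a percentage or ETA. A minimal sketch of the alternative, wrapping the sequence itself, follows. The load_all function and its input list are illustrative stand-ins, not RedNet code.

```python
from tqdm import tqdm


def load_all(metas):
    """Iterate with an index while showing a bar with percentage and ETA."""
    loaded = []
    # Wrapping the sequence (not enumerate) lets tqdm read len(metas).
    for i, meta in enumerate(tqdm(metas, desc="loading")):
        loaded.append((i, meta))
    return loaded


result = load_all(["a", "b", "c"])
print(result)
```

Passing total=len(metas) to tqdm(enumerate(metas), ...) should give the same effect if the loop header has to stay as it is.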

With tqdm in place, progress became visible. The earlier silence turned out to be slow data loading. Once loading finished, the first training epoch failed with:

Original Traceback (most recent call last):
  File "/home/lcy-magic/anaconda3/envs/REDNET/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 309, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
  File "/home/lcy-magic/anaconda3/envs/REDNET/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/lcy-magic/anaconda3/envs/REDNET/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/lcy-magic/Segment_TEST/RedNet/RedNet_data.py", line 126, in __getitem__
    sample = self.transform(sample)
  File "/home/lcy-magic/anaconda3/envs/REDNET/lib/python3.10/site-packages/torchvision/transforms/transforms.py", line 95, in __call__
    img = t(img)
  File "/home/lcy-magic/Segment_TEST/RedNet/RedNet_data.py", line 273, in __call__
    depth = np.expand_dims(depth, 0).astype(np.float)
  File "/home/lcy-magic/anaconda3/envs/REDNET/lib/python3.10/site-packages/numpy/__init__.py", line 397, in __getattr__
    raise AttributeError(__former_attrs__[attr], name=None)
AttributeError: module 'numpy' has no attribute 'float'.
`np.float` was a deprecated alias for the builtin `float`. To avoid this error in existing code, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations

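The message is clear: np no longer has a float type. So I replaced np.float with np.float64 on the line of RedNet_data.py shown in the traceback.

Now there were no more code errors, but unfortunately CUDA ran out of memory. Painful; when will I get to use a big GPU?
For now the only option is to reduce the batch size. The default batch size is currently 5 (although the help text says 10):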

parser.add_argument('-b', '--batch-size', default=5, type=int,metavar='N', help='mini-batch size (default: 10)')

So I changed this parameter; with the batch size lowered to 3, the out-of-memory error went away.
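Since the script already exposes a -b / --batch-size flag, the value can presumably be passed on the command line (for example by appending -b 3 to the training command) rather than by editing the default. The sketch below reproduces only that one argument to show how argparse resolves it; build_parser is a hypothetical wrapper, and the help text is adjusted here to match the actual default.

```python
import argparse


def build_parser():
    """Parser containing only the batch-size flag quoted from the script."""
    parser = argparse.ArgumentParser(description="batch-size flag sketch")
    parser.add_argument('-b', '--batch-size', default=5, type=int, metavar='N',
                        help='mini-batch size (default: 5)')
    return parser


# Equivalent to appending "-b 3" to the command line.
args = build_parser().parse_args(['-b', '3'])
default_args = build_parser().parse_args([])
print(args.batch_size, default_args.batch_size)
```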

But there was a new error:

Traceback (most recent call last):
  File "/home/lcy-magic/Segment_TEST/RedNet/RedNet_train.py", line 161, in <module>
    train()
  File "/home/lcy-magic/Segment_TEST/RedNet/RedNet_train.py", line 122, in train
    loss = CEL_weighted(pred_scales, target_scales)
  File "/home/lcy-magic/anaconda3/envs/REDNET/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/lcy-magic/anaconda3/envs/REDNET/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/lcy-magic/Segment_TEST/RedNet/utils/utils.py", line 54, in forward
    losses.append(torch.sum(torch.masked_select(loss_all, mask)) / torch.sum(mask.float()))
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

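My guess was that in torch.sum(torch.masked_select(loss_all, mask)) / torch.sum(mask.float()), either the numerator was inf or the denominator was 0.
An issue in the project (the referenced issue) looked like the same problem, so I followed it and changed the 0 to 2e-9 as well. The same error persisted. I then changed the code to: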

targets = targets.long()
mask = targets > 0
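For context, a device-side assert during a loss computation is typically triggered by an out-of-range index, such as a class label outside [0, num_classes), rather than by a division by zero, which would normally just yield inf or nan. That would fit with an integer cast helping here. The following is a minimal sketch of a CPU-side sanity check on a label map; check_label_range and the class count of 37 are illustrative assumptions, not taken from the RedNet code.

```python
import numpy as np


def check_label_range(labels, num_classes):
    """Return a list of problems that would break an integer class-index loss."""
    labels = np.asarray(labels)
    problems = []
    if not np.issubdtype(labels.dtype, np.integer):
        problems.append(f"dtype is {labels.dtype}, expected an integer type")
    if labels.size and (labels.min() < 0 or labels.max() >= num_classes):
        problems.append(
            f"values span [{labels.min()}, {labels.max()}], "
            f"expected [0, {num_classes})"
        )
    return problems


good = np.array([[0, 3], [36, 1]], dtype=np.int64)
bad = np.array([[0.0, 37.0]], dtype=np.float32)
print(check_label_range(good, 37))
print(check_label_range(bad, 37))
```

Running such a check on a batch moved to the CPU, or rerunning with CUDA_LAUNCH_BLOCKING=1 as the error message suggests, can help locate the offending values.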

With this change the device-side assert went away, but a new error appeared:

Traceback (most recent call last):
  File "/home/lcy-magic/Segment_TEST/RedNet/RedNet_train.py", line 161, in <module>
    train()
  File "/home/lcy-magic/Segment_TEST/RedNet/RedNet_train.py", line 141, in train
    grid_image = make_grid(utils.color_label(torch.max(pred_scales[0][:3], 1)[1] + 1), 3, normalize=False,
  File "/home/lcy-magic/anaconda3/envs/REDNET/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
TypeError: make_grid() got an unexpected keyword argument 'range'

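This looked like the make_grid function had been updated, and a search confirmed it. According to a reference blog post, the range argument should be renamed to value_range. The pred_scales call shown in the traceback needs the same change as the target_scales call below: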

# grid_image = make_grid(utils.color_label(target_scales[0][:3]), 3, normalize=False, range=(0, 255))
grid_image = make_grid(utils.color_label(target_scales[0][:3]), 3, normalize=False, value_range=(0, 255))
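If the same script has to run against both older and newer torchvision releases, the keyword name can be chosen at runtime. This is only a sketch of that idea: pick_range_kwarg is a hypothetical helper, and old_make_grid / new_make_grid are stand-ins with simplified signatures, not the real torchvision functions.

```python
import inspect


def pick_range_kwarg(func):
    """Return the keyword name `func` accepts for the value range, or None."""
    params = inspect.signature(func).parameters
    # Prefer the newer name when both are accepted.
    for name in ("value_range", "range"):
        if name in params:
            return name
    return None


# Stand-ins for the old and new make_grid signatures (not the real functions).
def old_make_grid(tensor, nrow=8, normalize=False, range=None): ...
def new_make_grid(tensor, nrow=8, normalize=False, value_range=None): ...


print(pick_range_kwarg(old_make_grid), pick_range_kwarg(new_make_grid))
```

With the real function, the call could then be written as make_grid(images, 3, normalize=False, **{pick_range_kwarg(make_grid): (0, 255)}).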

After these fixes, training started without any major problems.