RedNet Reproduction Notes
- Download
- Create the virtual environment
- Prepare the data
- Run the test
- Run training
Download
git clone https://github.com/JindongJiang/RedNet.git
cd RedNet/
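Create the virtual environment
The official README says the project uses PyTorch 0.4.0, which is too old; the PyTorch website does not even offer a wheel for it that I can use. So I first tried the latest PyTorch, 2.4.0. My machine has CUDA 12.1, so I installed the CUDA 12.1 build: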
conda create -n REDNET python=3.10 -y
conda activate REDNET
pip3 install torch torchvision torchaudio
pip3 install empy==3.3.4 rospkg pyyaml catkin_pkg
Then install the remaining packages listed in the requirements:
pip3 install numpy imageio scipy tensorboardX matplotlib scikit-image h5py
Good, no errors.
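A quick way to confirm that the install really succeeded is to check that every package can be found from inside the new environment. The following is only a minimal sketch: the module list is my own reading of the pip commands above (note that scikit-image is imported as `skimage`), and `missing_modules` is a hypothetical helper name, not something from the RedNet repo.

```python
import importlib.util

# Import names for the packages installed above. Some pip names differ
# from the import name (scikit-image is imported as skimage).
REQUIRED_MODULES = [
    "torch", "torchvision", "numpy", "imageio", "scipy",
    "tensorboardX", "matplotlib", "skimage", "h5py",
]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("missing:", missing if missing else "none")
```

Run it with the REDNET environment activated; an empty result only means the packages are present, not that their versions are compatible with the RedNet code.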
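Prepare the data
Following the README, download SUNRGBD.zip and SUNRGBDtoolbox.zip, unzip them onto my data disk, and create a symlink from the project directory to that location.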
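The symlink step can be sketched in Python as below. The paths are placeholders: the demo uses temporary directories to stand in for the data disk and the repo checkout, and `link_dataset` is a hypothetical helper. The link name `SUMRGBD_Benchmark` is the one used by the commands later in these notes.

```python
import tempfile
from pathlib import Path


def link_dataset(data_root: Path, link_path: Path) -> Path:
    """Create link_path as a symlink to data_root unless something is already there."""
    if not (link_path.is_symlink() or link_path.exists()):
        link_path.symlink_to(data_root, target_is_directory=True)
    return link_path


if __name__ == "__main__":
    # Temporary directories stand in for the real data disk and repo checkout.
    with tempfile.TemporaryDirectory() as tmp:
        data_root = Path(tmp) / "data_disk"
        data_root.mkdir()
        link = link_dataset(data_root, Path(tmp) / "SUMRGBD_Benchmark")
        print(link, "->", link.resolve())
```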
Run the test
Following the README:
python RedNet_inference.py --cuda --last-ckpt rednet_ckpt.pth -r SUMRGBD_Benchmark/SUNRGBD/kv1/NYUdata/NYU0001/image/NYU0001.jpg -d SUMRGBD_Benchmark/SUNRGBD/kv1/NYUdata/NYU0001/depth/NYU0001.png -o output.png
This fails with an error:
/home/lcy-magic/Segment_TEST/RedNet/RedNet_inference.py:36: DeprecationWarning: Starting with ImageIO v3 the behavior of this function will switch to that of iio.v3.imread. To keep the current behavior (and make this warning disappear) use `import imageio.v2 as imageio` or call `imageio.v2.imread` directly.
  depth = imageio.imread(args.depth)
Traceback (most recent call last):
  File "/home/lcy-magic/anaconda3/envs/REDNET/lib/python3.10/site-packages/PIL/Image.py", line 3277, in fromarray
    mode, rawmode = _fromarray_typemap[typekey]
KeyError: ((1, 1, 3), '<f4')
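After searching, I learned that the image writer needs uint8 data, while the array here is `<f4`, i.e. a 4-byte float (float32). Converting the array to uint8 is enough: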
import numpy as np
# imageio.imsave(args.output, output.cpu().numpy().transpose((1, 2, 0)))
imageio.imsave(args.output, np.uint8(output.cpu().numpy().transpose((1, 2, 0))))
After this change, the script runs without problems.
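One caveat with a bare `np.uint8(...)` cast: it drops the fractional part and does not clip values outside 0 to 255. Whether RedNet's colorized output can ever leave that range is not something I checked, so the sketch below is just a more defensive variant. `to_uint8_image` is a hypothetical helper, and the small array stands in for `output.cpu().numpy().transpose((1, 2, 0))`.

```python
import numpy as np


def to_uint8_image(array):
    """Round, clip to [0, 255], and cast so image writers accept the array."""
    return np.clip(np.rint(array), 0, 255).astype(np.uint8)


if __name__ == "__main__":
    # Stand-in for the H x W x 3 float32 array produced by the inference script.
    fake = np.array([[[0.0, 127.6, 300.0]]], dtype=np.float32)
    print(to_uint8_image(fake).dtype, to_uint8_image(fake).tolist())
```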
Run training
python RedNet_train.py --cuda --data-dir SUMRGBD_Benchmark/
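This fails with an error.
At first I thought the data being read was corrupted. But my MATLAB no longer opens, so I could not easily inspect the .mat files, and I did not see anyone report this problem in the project's issues either. So I searched the web for the error message and did find the cause: according to a reference blog post, newer versions of h5py removed the `value` attribute, and `[:]` is used instead. I changed the code to: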
# label = np.array(self.SUNRGBD2Dseg[seglabel.value[i][0]].value.transpose(1, 0))
label = np.array(self.SUNRGBD2Dseg[seglabel[:][i][0]][:].transpose(1, 0))
That error no longer appears.
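To see the replacement for `.value` in isolation, here is a small self-contained sketch that builds an in-memory HDF5 file with one object reference and reads it back. It assumes h5py 2.10 or newer (for `h5py.ref_dtype`); the dataset names `seg0` and `seglabel` and the helper `read_referenced_label` are stand-ins I made up, not the real layout of the SUN RGB-D .mat files. Indexing the reference dataset with `refs[index]` instead of `refs[:][index]` avoids loading every reference on each lookup, though I did not measure whether that matters here.

```python
import h5py
import numpy as np


def read_referenced_label(h5file, ref_dataset_name, index):
    """Follow the object reference stored at [index][0] and return the transposed array."""
    refs = h5file[ref_dataset_name]
    ref = refs[index][0]          # an h5py.Reference stored in the dataset
    # dataset[()] reads the whole dataset, replacing the removed .value attribute.
    return np.array(h5file[ref][()].transpose(1, 0))


def demo():
    """Build a tiny in-memory file (nothing is written to disk) and read it back."""
    with h5py.File("demo.h5", "w", driver="core", backing_store=False) as f:
        seg = f.create_dataset("seg0", data=np.arange(6, dtype=np.uint8).reshape(2, 3))
        refs = f.create_dataset("seglabel", (1, 1), dtype=h5py.ref_dtype)
        refs[0, 0] = seg.ref
        return read_referenced_label(f, "seglabel", 0)


if __name__ == "__main__":
    print(demo().shape)
```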
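However, the script then ran for a long time with no output, possibly stuck in a loop. I added tqdm to show the loop progress.
First install it: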
pip3 install tqdm
Then import the module in both RedNet_train.py and RedNet_data.py:
from tqdm import tqdm
Finally, wrap the loops in both files with tqdm:
for i, meta in tqdm(enumerate(SUNRGBDMeta)):
for epoch in tqdm(range(int(args.start_epoch), args.epochs)):
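One detail about wrapping `enumerate(...)` directly: an `enumerate` object has no length, so tqdm can show a running count but no total or remaining-time estimate. Wrapping the sequence itself (or passing `total=`) gives a full progress bar. A minimal sketch with made-up data, where `squares_with_progress` is a hypothetical function:

```python
from tqdm import tqdm


def squares_with_progress(items):
    """Iterate with an index while tqdm still knows the total length."""
    results = []
    # tqdm wraps the list (which has a length); enumerate wraps tqdm.
    for i, item in enumerate(tqdm(items, desc="loading")):
        results.append((i, item * item))
    return results


if __name__ == "__main__":
    print(squares_with_progress([1, 2, 3]))
```

The equivalent form `tqdm(enumerate(items), total=len(items))` also works when the sequence supports `len`.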
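Now the progress is visible. It turns out the earlier silence was just the data loading being very slow. Once loading finished, the first training epoch failed with: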
Original Traceback (most recent call last):
  File "/home/lcy-magic/anaconda3/envs/REDNET/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 309, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
  File "/home/lcy-magic/anaconda3/envs/REDNET/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/lcy-magic/anaconda3/envs/REDNET/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/lcy-magic/Segment_TEST/RedNet/RedNet_data.py", line 126, in __getitem__
    sample = self.transform(sample)
  File "/home/lcy-magic/anaconda3/envs/REDNET/lib/python3.10/site-packages/torchvision/transforms/transforms.py", line 95, in __call__
    img = t(img)
  File "/home/lcy-magic/Segment_TEST/RedNet/RedNet_data.py", line 273, in __call__
    depth = np.expand_dims(depth, 0).astype(np.float)
  File "/home/lcy-magic/anaconda3/envs/REDNET/lib/python3.10/site-packages/numpy/__init__.py", line 397, in __getattr__
    raise AttributeError(__former_attrs__[attr], name=None)
AttributeError: module 'numpy' has no attribute 'float'.
`np.float` was a deprecated alias for the builtin `float`. To avoid this error in existing code, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
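The message is clear: NumPy no longer has a `float` attribute (the alias was deprecated in 1.20 and later removed). So I replaced `np.float` with `np.float64`.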
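The changed line can be checked on its own with a dummy depth map. This is only a sketch: `add_channel_axis` is a hypothetical wrapper around the one expression from RedNet_data.py, and the zero array stands in for a real depth image.

```python
import numpy as np


def add_channel_axis(depth):
    """Add a leading channel axis and cast to float64 (the former np.float)."""
    return np.expand_dims(depth, 0).astype(np.float64)


if __name__ == "__main__":
    depth = np.zeros((4, 5), dtype=np.uint16)  # stand-in for a depth map
    out = add_channel_axis(depth)
    print(out.shape, out.dtype)
    # The builtin float maps to the same dtype, as the error message says.
    print(np.dtype(float) == np.float64)
```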
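That did it: no more errors from the code itself. Unfortunately, the next failure was CUDA out of memory. Painful. When will I ever get a big GPU?
For now the only option is to reduce the batch size. The current default batch size is 5: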
parser.add_argument('-b', '--batch-size', default=5, type=int,
                    metavar='N', help='mini-batch size (default: 10)')
So I changed this argument and found that with the batch size set to 3, the out-of-memory error goes away.
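Since the batch size is an argparse option, it can also be overridden on the command line (for example by appending `-b 3` to the training command) instead of editing the default in the source. The sketch below reproduces only that one argument to show how the flag is parsed; `build_parser` is a hypothetical name and the rest of RedNet's parser is omitted.

```python
import argparse


def build_parser():
    """Parser containing only the batch-size option quoted above."""
    parser = argparse.ArgumentParser(description="batch-size flag sketch")
    parser.add_argument('-b', '--batch-size', default=5, type=int,
                        metavar='N', help='mini-batch size (default: 5)')
    return parser


if __name__ == "__main__":
    # An explicit argument list is used here instead of sys.argv.
    args = build_parser().parse_args(['-b', '3'])
    print(args.batch_size)
```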
But then a new error appeared:
Traceback (most recent call last):
  File "/home/lcy-magic/Segment_TEST/RedNet/RedNet_train.py", line 161, in <module>
    train()
  File "/home/lcy-magic/Segment_TEST/RedNet/RedNet_train.py", line 122, in train
    loss = CEL_weighted(pred_scales, target_scales)
  File "/home/lcy-magic/anaconda3/envs/REDNET/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/lcy-magic/anaconda3/envs/REDNET/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/lcy-magic/Segment_TEST/RedNet/utils/utils.py", line 54, in forward
    losses.append(torch.sum(torch.masked_select(loss_all, mask)) / torch.sum(mask.float()))
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
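My guess was that in torch.sum(torch.masked_select(loss_all, mask)) / torch.sum(mask.float()), either the numerator is inf or the denominator is 0.
An issue I found (the reference issue) may describe the same problem, so I also changed the 0 to 2e-9 as done there. The error stayed the same. I then changed the code to: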
targets = targets.long()
mask = targets > 0
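I did not pin down the exact root cause. As a general observation, device-side asserts inside a loss are often triggered by a class index outside the valid range rather than by a division, and the error message itself suggests rerunning with CUDA_LAUNCH_BLOCKING=1 to get an accurate stack trace. The sketch below is a hypothetical diagnostic, not part of RedNet, for inspecting a label map before it reaches the loss. It assumes labels use 0 for unlabeled pixels and 1..num_classes for classes (which is what `mask = targets > 0` suggests), and the value 37 in the demo is an assumption about the SUN RGB-D class count.

```python
import numpy as np


def describe_labels(target, num_classes):
    """Summarize a label map so out-of-range or non-integer values are visible."""
    target = np.asarray(target)
    as_int = target.astype(np.int64)
    return {
        "dtype": str(target.dtype),
        "min": int(as_int.min()),
        "max": int(as_int.max()),
        "all_integral": bool(np.all(target == as_int)),
        "in_range": bool(as_int.min() >= 0 and as_int.max() <= num_classes),
        # If this is 0, the masked average would divide by zero.
        "labeled_pixels": int(np.count_nonzero(as_int > 0)),
    }


if __name__ == "__main__":
    # Stand-in for one target map moved to the CPU, e.g. target.cpu().numpy().
    fake_target = np.array([[0.0, 1.0], [37.0, 2.0]])
    print(describe_labels(fake_target, num_classes=37))
```

Calling it on a few batches should show whether the labels are floats, contain fractional values, or exceed the class count.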
That error is gone. But there is a new one:
Traceback (most recent call last):
  File "/home/lcy-magic/Segment_TEST/RedNet/RedNet_train.py", line 161, in <module>
    train()
  File "/home/lcy-magic/Segment_TEST/RedNet/RedNet_train.py", line 141, in train
    grid_image = make_grid(utils.color_label(torch.max(pred_scales[0][:3], 1)[1] + 1), 3, normalize=False,
  File "/home/lcy-magic/anaconda3/envs/REDNET/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
TypeError: make_grid() got an unexpected keyword argument 'range'
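This looks like the make_grid function was updated, and a quick search confirmed it. According to a reference blog post, the argument should be renamed to value_range: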
# grid_image = make_grid(utils.color_label(target_scales[0][:3]), 3, normalize=False, range=(0, 255))
grid_image = make_grid(utils.color_label(target_scales[0][:3]), 3, normalize=False, value_range=(0, 255))
After that, training starts without any major problems.