编程技术网 (editcode.net)

Training PyTorch models on different machines leads to different results

mausolos · Deep Learning · 2022-5-7 12:38 · 10 views


I am training the same model on two different machines, but the trained models are not identical. I have taken the following measures to ensure reproducibility:

import random
import numpy as np
import torch
from torch.utils.data import DataLoader

# seed the random number generators
random.seed(0)
torch.cuda.manual_seed(0)
np.random.seed(0)
# make cuDNN deterministic
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
# load data in the main process only (the keyword is num_workers, not num_works)
DataLoader(dataset, num_workers=0)

When I train the same model multiple times on the same machine, the trained model is always the same. However, the trained models on two different machines are not the same. Is this normal? Are there any other tricks I can employ?

Solution

There are a number of areas that can additionally introduce randomness, e.g.:

PyTorch random number generator

You can use torch.manual_seed() to seed the RNG for all devices (both CPU and CUDA):
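For example, re-seeding replays the exact same stream of random numbers, which is the property you rely on for reproducibility:

```python
import torch

# One call seeds the RNG on the CPU and on every visible CUDA device
torch.manual_seed(0)
a = torch.rand(3)

# Re-seeding replays the same random stream, so the tensors match
torch.manual_seed(0)
b = torch.rand(3)
assert torch.equal(a, b)
```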

CUDA convolution determinism

While disabling CUDA convolution benchmarking (discussed above) ensures that CUDA selects the same algorithm each time an application is run, that algorithm itself may be nondeterministic, unless either torch.use_deterministic_algorithms(True) or torch.backends.cudnn.deterministic = True is set. The latter setting controls only this behavior, unlike torch.use_deterministic_algorithms() which will make other PyTorch operations behave deterministically, too.
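A minimal sketch of the two switches. Note that on CUDA 10.2+ `torch.use_deterministic_algorithms(True)` may additionally require the `CUBLAS_WORKSPACE_CONFIG` environment variable to be set for some ops:

```python
import torch

# Narrow switch: only forces cuDNN to pick deterministic convolution kernels
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# Broad switch: makes other PyTorch ops deterministic too, and raises an
# error when an op has no deterministic implementation
torch.use_deterministic_algorithms(True)
assert torch.are_deterministic_algorithms_enabled()
```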

CUDA RNN and LSTM

In some versions of CUDA, RNNs and LSTM networks may have non-deterministic behavior. See torch.nn.RNN() and torch.nn.LSTM() for details and workarounds.
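Per the notes in the `torch.nn.LSTM` documentation, the workaround is an environment variable set before any CUDA work starts; which value applies depends on your CUDA version:

```python
import os

# Workaround for non-deterministic cuDNN RNN/LSTM kernels:
#   CUDA 10.1:           set CUDA_LAUNCH_BLOCKING=1
#   CUDA 10.2 and later: set a cuBLAS workspace config (":4096:8" or ":16:8")
# This must happen before the first CUDA call in the process.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
```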

DataLoader

DataLoader will reseed its workers following the "Randomness in multi-process data loading" algorithm. Use worker_init_fn() to preserve reproducibility:
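A sketch following the pattern in the PyTorch reproducibility notes; the tiny `TensorDataset` here is a hypothetical stand-in for your real dataset:

```python
import random

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

def seed_worker(worker_id):
    # Re-seed NumPy's and Python's RNGs in each worker process from the base
    # seed the DataLoader already handed to PyTorch's RNG, so that per-worker
    # augmentation randomness is reproducible across runs.
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

# Hypothetical toy dataset, just to make the example runnable
dataset = TensorDataset(torch.arange(8).float())

g = torch.Generator()
g.manual_seed(0)

loader = DataLoader(
    dataset,
    batch_size=2,
    shuffle=True,
    num_workers=2,
    worker_init_fn=seed_worker,
    generator=g,       # fixes the shuffle order across runs
)
```

Passing an explicitly seeded `generator` pins the shuffle order, while `worker_init_fn` pins the randomness inside each worker; you generally need both.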
