TensorFlow model fails to train with a cuDNN initialization error
This post collects some troubleshooting notes in case you run into this UnknownError
when training a model: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
Environment:
- Windows 10
- Nvidia RTX GPU
- Python 3.8
- TensorFlow 2.4.1 with GPU support
While training a TensorFlow model, I got the following error:
UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node cnn/conv2d/Relu (defined at c:...\model_def\cnn.py:35) ]] [Op:__inference_train_function_625528]
Function call stack:
train_function
There was no warning log, so to troubleshoot I tried restarting the kernel, checking whether another Python application was using the GPU, and even restarting the computer. None of these worked reliably. In some cases another Python/TensorFlow instance was indeed still active, but at other times that was not the case, and a restart only partially helped.
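The "is something else using the GPU?" check can be scripted instead of done by hand. The following is a minimal sketch, assuming `nvidia-smi` is installed with the driver and available on the PATH; the helper names `parse_compute_apps` and `list_gpu_processes` are made up for this example. Note that this query only lists processes holding a compute context, so a tool that touches the GPU in some other way may not show up, and on Windows the memory column may be reported as not available.

```python
import csv
import io
import shutil
import subprocess

# nvidia-smi query for processes that currently hold a compute context.
QUERY = [
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader",
]


def parse_compute_apps(text):
    """Turn nvidia-smi CSV output into a list of {pid, name, memory} dicts."""
    rows = []
    for fields in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        pid, name, memory = (field.strip() for field in fields)
        if not pid.isdigit():
            continue  # skip anything that is not a process row
        rows.append({"pid": int(pid), "name": name, "memory": memory})
    return rows


def list_gpu_processes():
    """Return the processes nvidia-smi reports, or [] if it cannot be run."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return []  # assumption: nvidia-smi is on the PATH; it was not found
    result = subprocess.run(
        [exe, *QUERY], capture_output=True, text=True, check=False
    )
    if result.returncode != 0:
        return []
    return parse_compute_apps(result.stdout)


if __name__ == "__main__":
    for proc in list_gpu_processes():
        print(proc["pid"], proc["name"], proc["memory"])
```

Running this right before starting the training shows which other processes, if any, are reported on the GPU at that moment.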
Looking back at the error message: cuDNN is one of the Nvidia libraries you need to install to make efficient use of your graphics card when running deep learning frameworks (see https://developer.nvidia.com/cudnn).
In the end, I found out that one of the applications I was using to tweak my fan speed appeared to be using CUDA cores, so TensorFlow was not able to claim all the CUDA capacity it wanted, which resulted in the error above. Closing the application fixed the error.
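If closing the other application is not an option, one setting that is often suggested for this kind of GPU resource conflict is TensorFlow's memory growth option, which makes TensorFlow allocate GPU memory as needed instead of reserving most of it up front. This is not what I did, and I have not verified that it would have avoided the error in this setup, so treat the following as a sketch under that assumption. The function name `enable_memory_growth` and its optional `config` parameter are made up for this example; the `tf.config` calls are the standard TensorFlow 2.x ones.

```python
def enable_memory_growth(config=None):
    """Ask TensorFlow to grow GPU memory on demand; return the GPU count.

    `config` defaults to `tf.config`. It is a parameter only so the logic
    can be exercised without a GPU (an assumption of this sketch).
    """
    if config is None:
        import tensorflow as tf  # imported lazily, only when actually needed

        config = tf.config
    gpus = config.list_physical_devices("GPU")
    for gpu in gpus:
        # Must run before the GPU is initialized, otherwise TensorFlow
        # raises a RuntimeError.
        config.experimental.set_memory_growth(gpu, True)
    return len(gpus)
```

Call `enable_memory_growth()` once at the very top of the script or notebook, before building the model or running any other TensorFlow operation.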
Let me know if this helps you when facing this issue, or if you have better insight into the real cause of the error.
Full stacktrace
---> 30 self._history = self.model.fit(
31 training_data.train_generator,
32 epochs=epochs, steps_per_epoch=len(training_data.train_generator)-1,
~\AppData\Roaming\Python\Python38\site-packages\tensorflow\python\keras\engine\training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_batch_size, validation_freq, max_queue_size, workers, use_multiprocessing)
1098 _r=1):
1099 callbacks.on_train_batch_begin(step)
-> 1100 tmp_logs = self.train_function(iterator)
1101 if data_handler.should_sync:
1102 context.async_wait()
~\AppData\Roaming\Python\Python38\site-packages\tensorflow\python\eager\def_function.py in __call__(self, *args, **kwds)
826 tracing_count = self.experimental_get_tracing_count()
827 with trace.Trace(self._name) as tm:
--> 828 result = self._call(*args, **kwds)
829 compiler = "xla" if self._experimental_compile else "nonXla"
830 new_tracing_count = self.experimental_get_tracing_count()
~\AppData\Roaming\Python\Python38\site-packages\tensorflow\python\eager\def_function.py in _call(self, *args, **kwds)
886 # Lifting succeeded, so variables are initialized and we can run the
887 # stateless function.
--> 888 return self._stateless_fn(*args, **kwds)
889 else:
890 _, _, _, filtered_flat_args = \
~\AppData\Roaming\Python\Python38\site-packages\tensorflow\python\eager\function.py in __call__(self, *args, **kwargs)
2940 (graph_function,
2941 filtered_flat_args) = self._maybe_define_function(args, kwargs)
-> 2942 return graph_function._call_flat(
2943 filtered_flat_args, captured_inputs=graph_function.captured_inputs) # pylint: disable=protected-access
2944
~\AppData\Roaming\Python\Python38\site-packages\tensorflow\python\eager\function.py in _call_flat(self, args, captured_inputs, cancellation_manager)
1916 and executing_eagerly):
1917 # No tape is watching; skip to running the function.
-> 1918 return self._build_call_outputs(self._inference_function.call(
1919 ctx, args, cancellation_manager=cancellation_manager))
1920 forward_backward = self._select_forward_and_backward_functions(
~\AppData\Roaming\Python\Python38\site-packages\tensorflow\python\eager\function.py in call(self, ctx, args, cancellation_manager)
553 with _InterpolateFunctionError(self):
554 if cancellation_manager is None:
--> 555 outputs = execute.execute(
556 str(self.signature.name),
557 num_outputs=self._num_outputs,
~\AppData\Roaming\Python\Python38\site-packages\tensorflow\python\eager\execute.py in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
57 try:
58 ctx.ensure_initialized()
---> 59 tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
60 inputs, attrs, num_outputs)
61 except core._NotOkStatusException as e:
UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node cnn/conv2d/Relu (defined at c:...\model_def\cnn.py:35) ]] [Op:__inference_train_function_625528]
Function call stack:
train_function