TensorFlow model fails to train with a cuDNN initialization error

This post collects some troubleshooting notes in case you hit the following UnknownError when training a TensorFlow model: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.

Environment:

  • Windows 10
  • Nvidia RTX GPU
  • Python 3.8
  • TensorFlow 2.4.1 with GPU support

While training a TensorFlow model, I ran into the following error:

UnknownError:  Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node cnn/conv2d/Relu (defined at c:...\model_def\cnn.py:35) ]] [Op:__inference_train_function_625528]

Function call stack:
train_function

No warning was logged, so to troubleshoot I tried restarting the kernel, checking whether another Python application was using the GPU, and even restarting the computer. None of these worked reliably. In some cases another Python/TensorFlow instance was indeed still active, but at other times that was not the case, and a restart only helped some of the time.
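To make the "is something else using the GPU?" check less of a guessing game, here is a minimal sketch that asks nvidia-smi which processes currently hold a compute context. It assumes nvidia-smi is installed and on the PATH (it normally ships with the Nvidia driver). The helper names are my own, and the process names in the usage note are hypothetical. On Windows the memory column may come back as [N/A], and a tool that only touches the GPU in graphics mode may not be listed at all, so treat an empty list as a hint rather than proof.

```python
import subprocess


def parse_compute_apps(csv_text):
    """Parse the output of
    `nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader`
    into a list of dicts."""
    processes = []
    for line in csv_text.splitlines():
        if line.count(",") < 2:
            continue  # blank or unexpected line
        # The process path may itself contain commas, so split from both ends.
        pid_text, rest = line.split(",", 1)
        name, used_memory = rest.rsplit(",", 1)
        if not pid_text.strip().isdigit():
            continue
        processes.append({
            "pid": int(pid_text),
            "name": name.strip(),
            "used_memory": used_memory.strip(),
        })
    return processes


def list_gpu_processes():
    """Return the processes nvidia-smi reports as compute apps,
    or None if nvidia-smi could not be run."""
    try:
        result = subprocess.run(
            ["nvidia-smi",
             "--query-compute-apps=pid,process_name,used_memory",
             "--format=csv,noheader"],
            capture_output=True, text=True, check=True, timeout=30,
        )
    except (FileNotFoundError, subprocess.CalledProcessError,
            subprocess.TimeoutExpired):
        return None
    return parse_compute_apps(result.stdout)


if __name__ == "__main__":
    found = list_gpu_processes()
    if found is None:
        print("Could not run nvidia-smi.")
    else:
        for proc in found:
            print(proc["pid"], proc["name"], proc["used_memory"])
```

Running this right before starting the training gives a list to compare against what you expect to be there. Anything that is not your own Python process (for example a hypothetical fan_control.exe) is a candidate to close before retrying.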

Going back to the error message: cuDNN is one of the Nvidia libraries you need to install to make more efficient use of your graphics card when running deep learning frameworks (see https://developer.nvidia.com/cudnn).

In the end, I found that an application I was using to tweak my fan speed appeared to be using CUDA cores, so TensorFlow could not allocate all the CUDA capacity it wanted, which resulted in the error above. Closing the application fixed the error.
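If closing the other application is not an option, a commonly suggested mitigation for this error is to enable GPU memory growth. I did not test this for the case described here, so treat it as a sketch under assumptions. By default TensorFlow tries to reserve nearly all GPU memory up front, and with memory growth enabled it allocates only as needed, which may leave room for whatever else is on the card. The function name below is my own. The TensorFlow calls are tf.config.list_physical_devices and tf.config.experimental.set_memory_growth, and they have to run before any operation touches the GPU.

```python
def enable_gpu_memory_growth():
    """Ask TensorFlow to grow GPU memory on demand instead of reserving it
    all at startup. Returns the names of the GPUs that were configured
    (an empty list if TensorFlow is not installed or no GPU is visible)."""
    try:
        import tensorflow as tf
    except ImportError:
        return []

    configured = []
    for gpu in tf.config.list_physical_devices("GPU"):
        try:
            tf.config.experimental.set_memory_growth(gpu, True)
            configured.append(gpu.name)
        except RuntimeError as err:
            # Raised when the GPU has already been initialized, i.e. this
            # was called too late in the session.
            print(f"Could not set memory growth for {gpu.name}: {err}")
    return configured


if __name__ == "__main__":
    print("Memory growth enabled on:", enable_gpu_memory_growth())
```

Call it at the very top of the notebook or script, before building the model. If it prints the RuntimeError message, restart the kernel and run it first.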

Let me know if this helps you with this issue, or if you have better insight into the real cause of the error.

Full stack trace

---> 30         self._history = self.model.fit(
     31             training_data.train_generator,
     32             epochs=epochs, steps_per_epoch=len(training_data.train_generator)-1,

~\AppData\Roaming\Python\Python38\site-packages\tensorflow\python\keras\engine\training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_batch_size, validation_freq, max_queue_size, workers, use_multiprocessing)
   1098                 _r=1):
   1099               callbacks.on_train_batch_begin(step)
-> 1100               tmp_logs = self.train_function(iterator)
   1101               if data_handler.should_sync:
   1102                 context.async_wait()

~\AppData\Roaming\Python\Python38\site-packages\tensorflow\python\eager\def_function.py in __call__(self, *args, **kwds)
    826     tracing_count = self.experimental_get_tracing_count()
    827     with trace.Trace(self._name) as tm:
--> 828       result = self._call(*args, **kwds)
    829       compiler = "xla" if self._experimental_compile else "nonXla"
    830       new_tracing_count = self.experimental_get_tracing_count()

~\AppData\Roaming\Python\Python38\site-packages\tensorflow\python\eager\def_function.py in _call(self, *args, **kwds)
    886         # Lifting succeeded, so variables are initialized and we can run the
    887         # stateless function.
--> 888         return self._stateless_fn(*args, **kwds)
    889     else:
    890       _, _, _, filtered_flat_args = \

~\AppData\Roaming\Python\Python38\site-packages\tensorflow\python\eager\function.py in __call__(self, *args, **kwargs)
   2940       (graph_function,
   2941        filtered_flat_args) = self._maybe_define_function(args, kwargs)
-> 2942     return graph_function._call_flat(
   2943         filtered_flat_args, captured_inputs=graph_function.captured_inputs)  # pylint: disable=protected-access
   2944 

~\AppData\Roaming\Python\Python38\site-packages\tensorflow\python\eager\function.py in _call_flat(self, args, captured_inputs, cancellation_manager)
   1916         and executing_eagerly):
   1917       # No tape is watching; skip to running the function.
-> 1918       return self._build_call_outputs(self._inference_function.call(
   1919           ctx, args, cancellation_manager=cancellation_manager))
   1920     forward_backward = self._select_forward_and_backward_functions(

~\AppData\Roaming\Python\Python38\site-packages\tensorflow\python\eager\function.py in call(self, ctx, args, cancellation_manager)
    553       with _InterpolateFunctionError(self):
    554         if cancellation_manager is None:
--> 555           outputs = execute.execute(
    556               str(self.signature.name),
    557               num_outputs=self._num_outputs,

~\AppData\Roaming\Python\Python38\site-packages\tensorflow\python\eager\execute.py in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
     57   try:
     58     ctx.ensure_initialized()
---> 59     tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
     60                                         inputs, attrs, num_outputs)
     61   except core._NotOkStatusException as e:

UnknownError:  Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node cnn/conv2d/Relu (defined at c:...\model_def\cnn.py:35) ]] [Op:__inference_train_function_625528]

Function call stack:
train_function