Can't get Tensorflow working on macOS M1 Pro Chip

I've been trying to get into ML and I want to follow a course on it, but it requires TensorFlow, and I've been trying to get that working on my system. I have the 2021 14" 16GB MacBook Pro with the M1 Pro chip, running Ventura 13.1. I have been following this article, as well as digging around for other advice on getting TensorFlow working on M1, but to no avail. I managed to get tensorflow-macos installed in my environment as well as tensorflow-metal, but when I try to run some sample code in Jupyter, I get an error that I do not understand. In Jupyter, when I run:

import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))

I get

Num GPUs Available: 1

So it does seem like I have TensorFlow and the Metal plugin installed, but when I try to run the rest of the code, I get:

TensorFlow version: 2.11.0
Num GPUs Available:  1
Metal device set to: Apple M1 Pro
WARNING:tensorflow:AutoGraph could not transform <function normalize_img at 0x14a4cec10> and will run it as-is.
Cause: Unable to locate the source code of <function normalize_img at 0x14a4cec10>. Note that functions defined in certain environments, like the interactive Python shell, do not expose their source code. If that is the case, you should define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.experimental.do_not_convert. Original error: could not get source code
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
2022-12-13 13:54:33.658225: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2022-12-13 13:54:33.658309: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
WARNING:tensorflow:AutoGraph could not transform <function normalize_img at 0x14a4cec10> and will run it as-is.
Cause: Unable to locate the source code of <function normalize_img at 0x14a4cec10>. Note that functions defined in certain environments, like the interactive Python shell, do not expose their source code. If that is the case, you should define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.experimental.do_not_convert. Original error: could not get source code
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING: AutoGraph could not transform <function normalize_img at 0x14a4cec10> and will run it as-is.
Cause: Unable to locate the source code of <function normalize_img at 0x14a4cec10>. Note that functions defined in certain environments, like the interactive Python shell, do not expose their source code. If that is the case, you should define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.experimental.do_not_convert. Original error: could not get source code
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
Epoch 1/12
2022-12-13 13:54:34.162300: W tensorflow/tsl/platform/profile_utils/cpu_utils.cc:128] Failed to get CPU frequency: 0 Hz
2022-12-13 13:54:34.163015: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
2022-12-13 13:54:35.383325: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:418 : NOT_FOUND: could not find registered platform with id: 0x14a345660
2022-12-13 13:54:35.383350: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:418 : NOT_FOUND: could not find registered platform with id: 0x14a345660
2022-12-13 13:54:35.389028: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:418 : NOT_FOUND: could not find registered platform with id: 0x14a345660
2022-12-13 13:54:35.389049: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:418 : NOT_FOUND: could not find registered platform with id: 0x14a345660
2022-12-13 13:54:35.401250: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:418 : NOT_FOUND: could not find registered platform with id: 0x14a345660
2022-12-13 13:54:35.401274: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:418 : NOT_FOUND: could not find registered platform with id: 0x14a345660
2022-12-13 13:54:35.405004: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:418 : NOT_FOUND: could not find registered platform with id: 0x14a345660
2022-12-13 13:54:35.405025: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:418 : NOT_FOUND: could not find registered platform with id: 0x14a345660
---------------------------------------------------------------------------
NotFoundError                             Traceback (most recent call last)
File <timed exec>:45

File ~/conda/envs/mlp3/lib/python3.8/site-packages/keras/utils/traceback_utils.py:70, in filter_traceback.<locals>.error_handler(*args, **kwargs)
     67     filtered_tb = _process_traceback_frames(e.__traceback__)
     68     # To get the full stack trace, call:
     69     # `tf.debugging.disable_traceback_filtering()`
---> 70     raise e.with_traceback(filtered_tb) from None
     71 finally:
     72     del filtered_tb

File ~/conda/envs/mlp3/lib/python3.8/site-packages/tensorflow/python/eager/execute.py:52, in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
     50 try:
     51   ctx.ensure_initialized()
---> 52   tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
     53                                       inputs, attrs, num_outputs)
     54 except core._NotOkStatusException as e:
     55   if name is not None:

NotFoundError: Graph execution error:

Detected at node 'StatefulPartitionedCall_6' defined at (most recent call last):
    File "/Users/imigh/conda/envs/mlp3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
      return _run_code(code, main_globals, None,
    File "/Users/imigh/conda/envs/mlp3/lib/python3.8/runpy.py", line 87, in _run_code
      exec(code, run_globals)
    File "/Users/imigh/conda/envs/mlp3/lib/python3.8/site-packages/ipykernel_launcher.py", line 17, in <module>
      app.launch_new_instance()
    File "/Users/imigh/conda/envs/mlp3/lib/python3.8/site-packages/traitlets/config/application.py", line 992, in launch_instance
      app.start()
    File "/Users/imigh/conda/envs/mlp3/lib/python3.8/site-packages/ipykernel/kernelapp.py", line 711, in start
      self.io_loop.start()
    File "/Users/imigh/conda/envs/mlp3/lib/python3.8/site-packages/tornado/platform/asyncio.py", line 215, in start
      self.asyncio_loop.run_forever()
    File "/Users/imigh/conda/envs/mlp3/lib/python3.8/asyncio/base_events.py", line 570, in run_forever
      self._run_once()
    File "/Users/imigh/conda/envs/mlp3/lib/python3.8/asyncio/base_events.py", line 1859, in _run_once
      handle._run()
    File "/Users/imigh/conda/envs/mlp3/lib/python3.8/asyncio/events.py", line 81, in _run
      self._context.run(self._callback, *self._args)
    File "/Users/imigh/conda/envs/mlp3/lib/python3.8/site-packages/ipykernel/kernelbase.py", line 510, in dispatch_queue
      await self.process_one()
    File "/Users/imigh/conda/envs/mlp3/lib/python3.8/site-packages/ipykernel/kernelbase.py", line 499, in process_one
      await dispatch(*args)
    File "/Users/imigh/conda/envs/mlp3/lib/python3.8/site-packages/ipykernel/kernelbase.py", line 406, in dispatch_shell
      await result
    File "/Users/imigh/conda/envs/mlp3/lib/python3.8/site-packages/ipykernel/kernelbase.py", line 729, in execute_request
      reply_content = await reply_content
    File "/Users/imigh/conda/envs/mlp3/lib/python3.8/site-packages/ipykernel/ipkernel.py", line 411, in do_execute
      res = shell.run_cell(
    File "/Users/imigh/conda/envs/mlp3/lib/python3.8/site-packages/ipykernel/zmqshell.py", line 531, in run_cell
      return super().run_cell(*args, **kwargs)
    File "/Users/imigh/conda/envs/mlp3/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 2940, in run_cell
      result = self._run_cell(
    File "/Users/imigh/conda/envs/mlp3/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 2995, in _run_cell
      return runner(coro)
    File "/Users/imigh/conda/envs/mlp3/lib/python3.8/site-packages/IPython/core/async_helpers.py", line 129, in _pseudo_sync_runner
      coro.send(None)
    File "/Users/imigh/conda/envs/mlp3/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3194, in run_cell_async
      has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
    File "/Users/imigh/conda/envs/mlp3/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3373, in run_ast_nodes
      if await self.run_code(code, result, async_=asy):
    File "/Users/imigh/conda/envs/mlp3/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3433, in run_code
      exec(code_obj, self.user_global_ns, self.user_ns)
    File "/var/folders/k4/vgd34_w913ndkfkmvgssqgjr0000gn/T/ipykernel_16072/1016625245.py", line 1, in <module>
      get_ipython().run_cell_magic('time', '', 'import tensorflow as tf\nimport tensorflow_datasets as tfds\nprint("TensorFlow version:", tf.__version__)\nprint("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices(\'GPU\')))\ntf.config.list_physical_devices(\'GPU\')\n(ds_train, ds_test), ds_info = tfds.load(\n    \'mnist\',\n    split=[\'train\', \'test\'],\n    shuffle_files=True,\n    as_supervised=True,\n    with_info=True,\n)\ndef normalize_img(image, label):\n  """Normalizes images: `uint8` -> `float32`."""\n  return tf.cast(image, tf.float32) / 255., label\nbatch_size = 128\nds_train = ds_train.map(\n    normalize_img, num_parallel_calls=tf.data.experimental.AUTOTUNE)\nds_train = ds_train.cache()\nds_train = ds_train.shuffle(ds_info.splits[\'train\'].num_examples)\nds_train = ds_train.batch(batch_size)\nds_train = ds_train.prefetch(tf.data.experimental.AUTOTUNE)\nds_test = ds_test.map(\n    normalize_img, num_parallel_calls=tf.data.experimental.AUTOTUNE)\nds_test = ds_test.batch(batch_size)\nds_test = ds_test.cache()\nds_test = ds_test.prefetch(tf.data.experimental.AUTOTUNE)\nmodel = tf.keras.models.Sequential([\n  tf.keras.layers.Conv2D(32, kernel_size=(3, 3),\n                 activation=\'relu\'),\n  tf.keras.layers.Conv2D(64, kernel_size=(3, 3),\n                 activation=\'relu\'),\n  tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),\n#   tf.keras.layers.Dropout(0.25),\n  tf.keras.layers.Flatten(),\n  tf.keras.layers.Dense(128, activation=\'relu\'),\n#   tf.keras.layers.Dropout(0.5),\n  tf.keras.layers.Dense(10, activation=\'softmax\')\n])\nmodel.compile(\n    loss=\'sparse_categorical_crossentropy\',\n    optimizer=tf.keras.optimizers.Adam(0.001),\n    metrics=[\'accuracy\'],\n)\nmodel.fit(\n    ds_train,\n    epochs=12,\n    validation_data=ds_test,\n)\n')
    File "/Users/imigh/conda/envs/mlp3/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 2417, in run_cell_magic
      result = fn(*args, **kwargs)
    File "/Users/imigh/conda/envs/mlp3/lib/python3.8/site-packages/IPython/core/magics/execution.py", line 1321, in time
      out = eval(code_2, glob, local_ns)
    File "<timed exec>", line 45, in <module>
    File "/Users/imigh/conda/envs/mlp3/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/Users/imigh/conda/envs/mlp3/lib/python3.8/site-packages/keras/engine/training.py", line 1650, in fit
      tmp_logs = self.train_function(iterator)
    File "/Users/imigh/conda/envs/mlp3/lib/python3.8/site-packages/keras/engine/training.py", line 1249, in train_function
      return step_function(self, iterator)
    File "/Users/imigh/conda/envs/mlp3/lib/python3.8/site-packages/keras/engine/training.py", line 1233, in step_function
      outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/Users/imigh/conda/envs/mlp3/lib/python3.8/site-packages/keras/engine/training.py", line 1222, in run_step
      outputs = model.train_step(data)
    File "/Users/imigh/conda/envs/mlp3/lib/python3.8/site-packages/keras/engine/training.py", line 1027, in train_step
      self.optimizer.minimize(loss, self.trainable_variables, tape=tape)
    File "/Users/imigh/conda/envs/mlp3/lib/python3.8/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 527, in minimize
      self.apply_gradients(grads_and_vars)
    File "/Users/imigh/conda/envs/mlp3/lib/python3.8/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1140, in apply_gradients
      return super().apply_gradients(grads_and_vars, name=name)
    File "/Users/imigh/conda/envs/mlp3/lib/python3.8/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 634, in apply_gradients
      iteration = self._internal_apply_gradients(grads_and_vars)
    File "/Users/imigh/conda/envs/mlp3/lib/python3.8/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1166, in _internal_apply_gradients
      return tf.__internal__.distribute.interim.maybe_merge_call(
    File "/Users/imigh/conda/envs/mlp3/lib/python3.8/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1216, in _distributed_apply_gradients_fn
      distribution.extended.update(
    File "/Users/imigh/conda/envs/mlp3/lib/python3.8/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1211, in apply_grad_to_update_var
      return self._update_step_xla(grad, var, id(self._var_key(var)))
Node: 'StatefulPartitionedCall_6'
could not find registered platform with id: 0x14a345660
     [[{{node StatefulPartitionedCall_6}}]] [Op:__inference_train_function_1261]

Sorry for dumping the entire error output, but as you can see, something has gone awry. It only seems to get as far as the first epoch, and I'm not sure what's going wrong. I've followed everything in that guide as well as the instructions for tensorflow-metal. I've looked around seemingly everywhere, but this is as far as I've gotten after hours of battling. I just updated my Mac today, so the Xcode command line tools should be up to date. Any advice, or help deciphering the error, would be greatly appreciated. I just want to learn machine learning, but I can't even follow my course without this working.

I've uninstalled and reinstalled Conda (Miniforge for M1) several times. I've created a blank environment and tried the steps there. I've followed the steps listed in the guides linked above and gone through them multiple times. I was originally getting some issues with numpy, h5py, grpcio, and protobuf, but after tinkering with the versions I no longer get errors for them. I'm not sure whether that means they're all fine, but I don't see any explicit mentions of them anymore. I've also run

conda install -c conda-forge openblas

after looking at this page on Stack Overflow from someone with a similar issue, but I'm still getting this error.

Ludvig answered 14/12, 2022 at 0:12

A similar issue was raised on the Apple Developer Forums, where the suggested workaround is to use tf.keras.optimizers.legacy.Adam() instead of the new optimizer class, due to a gap in the PluggableDevice implementation for Metal.
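
Applied to the code in the question, the only change is the optimizer passed to model.compile. A minimal sketch (model and compile arguments copied from the question; only the optimizer line differs):

import tensorflow as tf

model = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(32, kernel_size=(3, 3), activation='relu'),
    tf.keras.layers.Conv2D(64, kernel_size=(3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])

model.compile(
    loss='sparse_categorical_crossentropy',
    # Only change from the question's code: the legacy Adam class, which
    # avoids the XLA-compiled update step that fails on the Metal plugin.
    optimizer=tf.keras.optimizers.legacy.Adam(0.001),
    metrics=['accuracy'],
)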

Alternatively, install the released versions mentioned in the "Get started with tensorflow-metal" instructions when installing with pip:

python -m pip install tensorflow-macos==2.9.0
python -m pip install tensorflow-metal==0.5.0
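
Either way, the same quick check from the question can confirm which build is active and that the Metal device is still visible (a sketch; the expected output assumes the pinned versions above):

import tensorflow as tf

print("TensorFlow version:", tf.__version__)  # expect 2.9.0 with the pins above
print("Num GPUs Available:", len(tf.config.list_physical_devices('GPU')))  # expect 1 (the Metal device)
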
Pierette answered 15/12, 2022 at 4:50
