A.I, Data and Software Engineering

Quick Benchmark Colab CPU GPU TPU (XLA-CPU)


If you ever wonder about the performance differences between CPU, GPU, and TPU for your machine learning project, this article shows a simple benchmark for these three.

Memory Subsystem Architecture

The Central Processing Unit (CPU), Graphics Processing Unit (GPU), and Tensor Processing Unit (TPU) are processors with different purposes and architectures.

  • CPU: A processor designed to handle any computational problem in a general fashion. Its cache and memory hierarchy are designed to be reasonable for arbitrary programming workloads.
  • GPU: A processor designed to accelerate the rendering of graphics.
  • TPU: A co-processor designed to accelerate deep learning tasks developed with TensorFlow (a programming framework). It is built for high volumes of low-precision computation (e.g. as little as 8-bit precision). However, there is no general-purpose compiler for the TPU, so using it outside of deep learning workloads requires significant effort.
Figure: CPU, GPU, and TPU memory subsystem architecture.

Compute primitives

Each processor operates on a different basic unit of data per instruction, as illustrated in the sketch after this list:

  • CPU: 1 X 1 data unit (scalar)
  • GPU: 1 X N data unit (vector)
  • TPU: N X N data unit (matrix)
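
As a rough illustration only (not part of the benchmark), the same kind of work can be expressed at the three granularities in TensorFlow; mapping each shape to a processor here is an assumption for intuition, not something the hardware enforces:

import tensorflow as tf

# Scalar work (1 x 1): one multiply at a time - the CPU's natural unit
a, b = tf.constant(2.0), tf.constant(3.0)
scalar = a * b

# Vector work (1 x N): element-wise over many values at once - the GPU's natural unit
v1 = tf.random.normal((1, 1024))
v2 = tf.random.normal((1, 1024))
vector = v1 * v2

# Matrix work (N x N): a full matrix multiply - the TPU's natural unit (systolic array)
m1 = tf.random.normal((128, 128))
m2 = tf.random.normal((128, 128))
matrix = tf.matmul(m1, m2)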

Benchmark CPU, GPU, TPU

I will generate some data, perform the same calculation on each piece of infrastructure, and log the execution times for comparison. The implementation runs on Google Colab, whose Google Compute Engine backend offers a limited TPU option. See this post for a quick intro to Google Colab. Specifically, we test on CPU, GPU, and XLA_CPU (Accelerated Linear Algebra on CPU).

Test on CPU

Firstly, we enable TensorFlow 2.0 and set log info.

%tensorflow_version 2.x
import tensorflow as tf
import timeit
tf.get_logger().setLevel('INFO')

Then we select the CPU. Note that you will need to configure the notebook to use no hardware accelerator:

  • Navigate to Edit→Notebook Settings (or Runtime→Change runtime type)
  • Select None from the Hardware Accelerator drop-down

Next, we’ll confirm that we can connect to the CPU with TensorFlow:

cpu = tf.config.experimental.list_physical_devices('CPU')[0]
print(cpu)
#PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')
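
You can also list every physical device the current runtime exposes. This is a quick way to check which accelerators (GPU, XLA_CPU, etc.) TensorFlow can see under a given Colab setting; a small sketch using the same experimental API as above:

# Print every device TensorFlow can see in the current runtime
for device in tf.config.experimental.list_physical_devices():
    print(device)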

After that, we define an operation to test: a 2D convolution.

testcpu = """
import tensorflow as tf
with tf.device('/cpu:0'):
  random_image_cpu = tf.random.normal((100, 100, 100, 3))
  net_cpu = tf.compat.v1.layers.conv2d(random_image_cpu, 32, 7)
  net_cpu = tf.math.reduce_sum(net_cpu)
"""

We run the operation 10 times and log the total execution time.

print('Time (s) to convolve 32x7x7x3 filter over random 100x100x100x3 images '
      '(batch x height x width x channel). Sum of ten runs.')
print('CPU (s):')
cpu_time = timeit.timeit(testcpu, number=10)
print(cpu_time)
#Time (s) to convolve 32x7x7x3 filter over random 100x100x100x3 images (batch x height x width x channel). Sum of ten runs.
#CPU (s):
#4.412778385999985

Test on GPU

Similarly, you need to set the runtime to use GPU before running the following code.

# See https://www.tensorflow.org/tutorials/using_gpu#allowing_gpu_memory_growth
gpus = tf.config.experimental.list_physical_devices('GPU')
gpu = gpus[0]
print(gpu)
tf.config.experimental.set_memory_growth(gpu, True)
testgpu = """
import tensorflow as tf
with tf.device('/device:GPU:0'):
  random_image_gpu = tf.random.normal((100, 100, 100, 3))
  net_gpu = tf.compat.v1.layers.conv2d(random_image_gpu, 32, 7)
  net_gpu = tf.math.reduce_sum(net_gpu)
"""
print('GPU (s):')
gpu_time = timeit.timeit(testgpu, number=10)
print(gpu_time)
print('GPU speedup over CPU: {}x'.format(int(cpu_time/gpu_time)))

The CPU vs GPU result.

PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')
GPU (s):
0.07790310600000794
GPU speedup over CPU: 56x

Test on TPU – XLA-CPU

Don’t forget to switch the runtime to TPU. Note that, under the TPU runtime, TensorFlow’s local device list only exposes an XLA_CPU device; the TPU cores themselves are attached remotely and are not exercised by this simple test.

tpus = tf.config.experimental.list_physical_devices('XLA_CPU')
tpu = tpus[0]
print(tpu)
#tf.config.experimental.set_memory_growth(tpu, True)
testtpu = """
import tensorflow as tf
with tf.device('/device:XLA_CPU:0'):
  random_image_tpu = tf.random.normal((100, 100, 100, 3))
  net_tpu = tf.compat.v1.layers.conv2d(random_image_tpu, 32, 7)
  net_tpu = tf.math.reduce_sum(net_tpu)
"""
print('TPU (s):')
tpu_time = timeit.timeit(testtpu, number=10)
print(tpu_time)
print('TPU speedup over CPU: {}x'.format(int(cpu_time/tpu_time)))

And the result:

PhysicalDevice(name='/physical_device:XLA_CPU:0', device_type='XLA_CPU')
TPU (s):
4.863093675999949
TPU speedup over CPU: 0x

Conclusion

There are a couple of other tests carried out in different settings. Nevertheless, under the current configuration of the Google Compute Engine backend, CPU and XLA_CPU performance are very similar, which is expected since the XLA_CPU device still executes on the CPU. The GPU outperforms both in this test (~56x faster).

In the future, we may conduct another test with a CNN project, which TPUs are optimized for. For your interest, you can read this paper for a more structured benchmark.
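
For such a follow-up test to exercise the actual TPU cores (rather than the XLA_CPU device used above), the computation has to go through a TPU distribution strategy. Below is a minimal sketch, assuming TensorFlow 2.1+ on a Colab TPU runtime; the example model is illustrative only:

import os
import tensorflow as tf

# Resolve the Colab TPU address (COLAB_TPU_ADDR is set by the TPU runtime)
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(
    tpu='grpc://' + os.environ['COLAB_TPU_ADDR'])
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)

# Replicate computation across the TPU cores
strategy = tf.distribute.experimental.TPUStrategy(resolver)
print('TPU replicas:', strategy.num_replicas_in_sync)

# Anything built inside this scope (e.g. a Keras CNN) runs on the TPU
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 7, activation='relu', input_shape=(100, 100, 3)),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(10),
    ])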
