# Why is (x / y)[i] faster than x[i] / y[i]?

I'm new to `CuPy` and CUDA/GPU computing. Can someone explain why `(x / y)[i]` is faster than `x[i] / y[i]`?

When taking advantage of GPU-accelerated computation, are there any guidelines that would let me quickly determine which operation is faster, without having to benchmark every operation?

```
# In VSCode Jupyter Notebook
import cupy as cp
from cupyx.profiler import benchmark
x = cp.arange(1_000_000)
y = (cp.arange(1_000_000) + 1) / 2
i = cp.random.randint(2, size=1_000_000) == 0
x, y, i
# Output:
(array([ 0, 1, 2, ..., 999997, 999998, 999999], shape=(1000000,), dtype=int32),
array([5.000000e-01, 1.000000e+00, 1.500000e+00, ..., 4.999990e+05,
4.999995e+05, 5.000000e+05], shape=(1000000,), dtype=float64),
array([ True, False, True, ..., True, False, True], shape=(1000000,), dtype=bool))
```

```
def test1(x, y, i):
    return (x / y)[i]

def test2(x, y, i):
    return x[i] / y[i]
print(benchmark(test1, (x, y, i)))
print(benchmark(test2, (x, y, i)))
# Output:
test1: CPU: 175.164 us +/- 61.250 (min: 125.200 / max: 765.100) us GPU-0: 186.001 us +/- 67.314 (min: 134.144 / max: 837.568) us
test2: CPU: 342.364 us +/- 130.840 (min: 223.000 / max: 1277.600) us GPU-0: 368.133 us +/- 136.911 (min: 225.504 / max: 1297.408) us
```

## Solution

The ratio of arithmetic capability to bytes retrieval capability on a GPU (at least, maybe also CPU) is generally lopsided in favor of arithmetic capability. Therefore algorithm performance can often be predicted by the number of memory operations required. The first step in the `x[i]/y[i]` realization is to create `x[i]` and `y[i]`, which is purely memory operations - stream compaction. Then the reduced amount of arithmetic occurs. In the other realization, a larger amount of arithmetic occurs first, followed by a 50% reduction in memory ops.

A high-end data center GPU such as the A100 has ~10 TF of FP64 throughput and ~2 TB/s of memory bandwidth, so the ratio is 5 FP64 FLOPs per byte of data read or written. For an 8-byte FP64 quantity, that is 40 FLOPs per FP64 quantity read or written. (A division requires multiple FLOPs.)
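That ratio is simple to write out with the quoted figures (approximate marketing numbers, not measurements):

```python
a100_fp64_flops = 10e12   # ~10 TF FP64 throughput (approximate)
a100_bandwidth = 2e12     # ~2 TB/s memory bandwidth (approximate)

flops_per_byte = a100_fp64_flops / a100_bandwidth   # 5.0
flops_per_fp64 = flops_per_byte * 8                 # 40.0 for an 8-byte value
```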

Let's assume that the `i` array is composed of 50% True and 50% False. Now let's count all the steps required.

In the `x[i]/y[i]` realization, we must:

- read the entire `i` array - 1,000,000 reads
- use that to read a portion of the `x` and `y` arrays; but on a GPU, you don't get to read individual elements. You read memory in chunks of 32 bytes at a time, so realistically, with an evenly distributed True/False mask, you may end up reading the entire `x` and `y` arrays - 2,000,000 reads
- perform stream compaction on `x` and `y` according to `i`
- perform the division arithmetic - 500,000 division ops
- write the results - 500,000 writes
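The claim that you may end up reading the entire `x` and `y` arrays can be sanity-checked with a quick probability sketch (my own illustration, assuming 4-byte elements, so eight per 32-byte chunk, and an independent 50% mask):

```python
chunk_bytes = 32
elem_bytes = 4                                 # e.g. int32 elements of x
elems_per_chunk = chunk_bytes // elem_bytes    # 8 elements per chunk
p_true = 0.5

# probability that a 32-byte chunk contains at least one selected element
p_chunk_touched = 1 - (1 - p_true) ** elems_per_chunk
print(p_chunk_touched)   # 0.99609375 -> nearly every chunk must be read
```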

In the `(x/y)[i]` realization:

- read the entire `i` array - 1,000,000 reads
- read the entire `x` and `y` arrays - 2,000,000 reads
- perform the division arithmetic - 1,000,000 division ops
- perform stream compaction on the division result according to `i`
- write the results - 500,000 writes

The only difference is the positioning of the stream compaction operation(s) and the number of division operations required. Stream compaction is basically a purely memory bound activity, so the difference is in the cost of accessing memory. The first approach requires twice as much stream compaction.
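The tallies can be written down as a rough bookkeeping sketch (my own, assuming a 50% True mask and ignoring intermediate buffers and kernel-launch overhead):

```python
n = 1_000_000
true_frac = 0.5   # assumed fraction of True in i

# x[i] / y[i]: compact first, divide second
xi_yi_reads = n + 2 * n             # read i, then read x and y
xi_yi_compactions = 2               # x[i] and y[i]
xi_yi_divs = int(n * true_frac)     # 500,000 divisions
xi_yi_writes = int(n * true_frac)   # 500,000 result writes

# (x / y)[i]: divide first, compact second
x_div_y_reads = n + 2 * n             # read i, then read x and y
x_div_y_compactions = 1               # one compaction of the quotient
x_div_y_divs = n                      # 1,000,000 divisions
x_div_y_writes = int(n * true_frac)   # 500,000 result writes
```

The raw read and write totals come out the same; what differs is that `x[i]/y[i]` performs half the divisions but twice the stream compaction, and since compaction is memory bound, the extra compaction dominates on an A100-class GPU.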

On a GPU, at least, flops are often significantly less "expensive" than reading/writing data.

The ratio of True to False in `i` doesn't really affect this analysis for the A100. Other GPUs like the T4 have relatively low FP64 throughput compared to FP32. The T4 pairs 320 GB/s of memory bandwidth with ~8 TF of FP32 throughput, but only 1/8 TF (125 GF) of FP64 throughput. At high ratios of True to False this won't matter, but at low ratios of True to False, on a T4, it's possible that the cost of the divisions is a factor. This distinction would probably be wiped out by switching to an FP32 datatype. Another way to say this could be that the T4, for FP64 ops, doesn't necessarily satisfy the initial premise of my answer here: "The ratio of arithmetic capability to bytes retrieval capability on a GPU (at least, maybe also CPU) is generally lopsided in favor of arithmetic capability."
