I had some code that random-initialized some numpy arrays with:
rng = np.random.default_rng(seed=seed)
new_vectors = rng.uniform(-1.0, 1.0, target_shape).astype(np.float32) # [-1.0, 1.0)
new_vectors /= vector_size
And all was working well, all project tests passing.
Unfortunately, uniform() returns np.float64, though downstream steps only want np.float32, and in some cases this array is very large (think millions of 400-dimensional word-vectors). So the temporary np.float64 return value momentarily uses 3X the RAM necessary.
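To make that overhead concrete, here's a rough back-of-the-envelope sketch (the shape is made up, but representative of the sizes involved):
import numpy as np

# Hypothetical sizes: 4 million 400-dimensional word-vectors.
target_shape = (4_000_000, 400)
n_values = np.prod(target_shape)

float64_temp = n_values * np.dtype(np.float64).itemsize  # uniform()'s return value
float32_kept = n_values * np.dtype(np.float32).itemsize  # what we actually keep

print(float64_temp / 2**30)  # ~11.9 GiB for the float64 temporary
print(float32_kept / 2**30)  # ~6.0 GiB for the float32 result
# Peak usage during the astype() is ~17.9 GiB: 3X the ~6 GiB actually needed.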
Thus, I replaced the above with what definitionally should be equivalent:
rng = np.random.default_rng(seed=seed)
new_vectors = rng.random(target_shape, dtype=np.float32) # [0.0, 1.0)
new_vectors *= 2.0 # [0.0, 2.0)
new_vectors -= 1.0 # [-1.0, 1.0)
new_vectors /= vector_size
And after this change, all closely-related functional tests still pass, but a single distant, fringe test, relying on far-downstream calculations from the vectors so initialized, has started failing. And failing in a very reliable way. It's a stochastic test that passes with a large margin-for-error with the first (uniform-based) code above, but always fails with the second (random-based) code. So: something has changed, but in some very subtle way.
The superficial values of new_vectors seem properly and similarly distributed in both cases. And again, all the "close-up" tests of functionality still pass.
So I'd love theories for what non-intuitive changes this 3-line change may have made that could show up far-downstream.
(I'm still trying to find a minimal test that detects whatever's different. If you'd enjoy doing a deep-dive into the affected project, the exact close-up tests that succeed, the one fringe test that fails, and the commits with/without the tiny change are all linked at https://github.com/RaRe-Technologies/gensim/pull/2944#issuecomment-704512389. But really, I'm just hoping a numpy expert might recognize some tiny corner-case where something non-intuitive happens, or offer some testable theories of same.)
Any ideas, proposed tests, or possible solutions?
I ran your code with the following values:
seed = 0
target_shape = [100]
vector_size = 3
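For reference, here's a sketch of the comparison I ran (just your two snippets with those values plugged in):
import numpy as np

seed = 0
target_shape = [100]
vector_size = 3

# First solution: uniform() in float64, then cast to float32.
rng = np.random.default_rng(seed=seed)
first = rng.uniform(-1.0, 1.0, target_shape).astype(np.float32)
first /= vector_size

# Second solution: random() directly in float32, rescaled in place.
rng = np.random.default_rng(seed=seed)
second = rng.random(target_shape, dtype=np.float32)
second *= 2.0
second -= 1.0
second /= vector_size

print(first[:5])
print(second[:10])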
I noticed that the code of your first solution generated a different new_vectors than your second solution.
Specifically, it looks like uniform() keeps only every other value of the sequence that random() produces with the same seed. This is probably an implementation detail of numpy's underlying generator: each np.float64 draw consumes 64 bits of random output, while each np.float32 draw consumes only 32.
In the following snippet (your first solution's values on top, your second solution's below), I only inserted spaces to align similar values. There is probably also some float rounding going on, making the results appear not quite identical.
[ 0.09130779, -0.15347552, -0.30601767, -0.32231492, 0.20884682, ...]
[0.23374946, 0.09130772, 0.007424275, -0.1534756, -0.12811375, -0.30601773, -0.28317323, -0.32231498, -0.21648853, 0.20884681, ...]
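You can reproduce this interleaving directly. Here's a minimal sketch (the exact alignment is an implementation detail, so it may differ across numpy versions):
import numpy as np

seed = 0

# 50 float64 draws via uniform() ...
rng = np.random.default_rng(seed=seed)
vals64 = rng.uniform(-1.0, 1.0, 50)

# ... versus 100 float32 draws via random(), rescaled the same way.
rng = np.random.default_rng(seed=seed)
vals32 = rng.random(100, dtype=np.float32) * 2.0 - 1.0

# Every uniform() value lines up, to float32 precision, with every
# other random() value, starting at index 1.
print(np.allclose(vals64, vals32[1::2], atol=1e-6))  # True on the numpy build I tested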
Based on this, I speculate that your stochastic test case only exercises the code with a single seed; because the new solution generates a different sequence from that same seed, the far-downstream result shifts enough to make that one test fail.
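If the goal is just to avoid the large float64 temporary while keeping the exact sequence your tests were tuned against, one possible solution is to keep using uniform() but fill the float32 array in chunks, so only a bounded float64 temporary exists at any moment. A hypothetical sketch (the helper name and chunk size are mine, not from your project, and it assumes a 2D (num_vectors, vector_size) shape):
import numpy as np

def init_vectors_chunked(seed, num_vectors, vector_size, rows_per_chunk=100_000):
    rng = np.random.default_rng(seed=seed)
    out = np.empty((num_vectors, vector_size), dtype=np.float32)
    for start in range(0, num_vectors, rows_per_chunk):
        stop = min(start + rows_per_chunk, num_vectors)
        # Same draws, in the same order, as one big uniform() call ...
        chunk = rng.uniform(-1.0, 1.0, (stop - start, vector_size))
        # ... but the float64 temporary is at most rows_per_chunk rows.
        out[start:stop] = chunk  # cast to float32 on assignment
    out /= vector_size
    return out
Since the generator is consumed sequentially, the chunked draws should reproduce the original one-call sequence exactly, just with a capped peak memory footprint.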