How to hash a large object (dataset) in Python?
Thanks to John Montgomery I think I have found a solution, and I think it has less overhead than converting every number in possibly huge arrays to strings:
I can create a byte view of the arrays and use these to update the hash. And somehow this seems to give the same digest as directly updating using the array:
    >>> import hashlib
    >>> import numpy
    >>> a = numpy.random.rand(10, 100)
    >>> b = a.view(numpy.uint8)
    >>> print(a.dtype, b.dtype)  # a and b have different data types
    float64 uint8
    >>> hashlib.sha1(a).hexdigest()  # sha1 of the array
    '794de7b1316b38d989a9040e6e26b9256ca3b5eb'
    >>> hashlib.sha1(b).hexdigest()  # sha1 of the byte view
    '794de7b1316b38d989a9040e6e26b9256ca3b5eb'
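The agreement between the two digests is not a coincidence: hashlib reads the array's raw memory through the buffer protocol, and the uint8 view exposes exactly the same bytes. A minimal sketch, assuming Python 3 and a C-contiguous array (the seed and variable names are only illustrative):

```python
import hashlib
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, chosen only for illustration
a = rng.random((10, 100))        # float64, C-contiguous
b = a.view(np.uint8)             # same memory, reinterpreted as single bytes

digest_array = hashlib.sha1(a).hexdigest()
digest_view = hashlib.sha1(b).hexdigest()
digest_bytes = hashlib.sha1(a.tobytes()).hexdigest()  # explicit copy of the raw bytes

# All three calls hash the same 10 * 100 * 8 bytes, so the digests match.
print(digest_array == digest_view == digest_bytes)
```

Note that `a.tobytes()` copies the data, while passing `a` or `b` directly does not, which matters for very large arrays.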
What's the format of the data in the arrays? Couldn't you just iterate through the arrays, convert them into a string (via some reproducible means) and then feed that into your hash via update?
    import hashlib

    m = hashlib.md5()  # or sha1 etc.
    for value in array:  # array contains the data
        # hashlib needs bytes, so encode the string form of each value
        m.update(str(value).encode("utf-8"))
Don't forget though that numpy arrays won't provide __hash__() because they are mutable. So be careful not to modify the arrays after you have calculated your hash (as it will no longer be the same).
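To illustrate the warning about mutation, here is a small sketch (assuming Python 3; the array contents are arbitrary). It also shows one possible safeguard, marking the array read-only with `setflags(write=False)`, which is an addition here and not something the answer itself proposes:

```python
import hashlib
import numpy as np

a = np.arange(10, dtype=np.int64)
before = hashlib.sha1(a.tobytes()).hexdigest()

a[0] = 99                        # in-place modification after hashing
after = hashlib.sha1(a.tobytes()).hexdigest()
print(before != after)           # the stored digest no longer describes the array

# Optional safeguard: make the array read-only once it has been hashed.
a.setflags(write=False)
try:
    a[0] = 0
except ValueError:
    print("array is read-only")
```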
With NumPy 1.10.1 and Python 2.7.6, you can now simply hash numpy arrays using hashlib if the array is C-contiguous (use numpy.ascontiguousarray() if not), e.g.
    >>> h = hashlib.md5()
    >>> arr = numpy.arange(101)
    >>> h.update(arr)
    >>> print(h.hexdigest())
    e62b430ff0f714181a18ea1a821b0918
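For the non-contiguous case mentioned above, a sketch of the `numpy.ascontiguousarray()` route might look like this (assuming Python 3; the strided slice is just one convenient way to obtain a non-contiguous array):

```python
import hashlib
import numpy as np

arr = np.arange(20).reshape(4, 5)
cols = arr[:, ::2]                    # strided slice, not C-contiguous
print(cols.flags["C_CONTIGUOUS"])     # False

# ascontiguousarray copies only when the input is not already C-contiguous.
packed = np.ascontiguousarray(cols)
h = hashlib.md5()
h.update(packed)
digest = h.hexdigest()

# Hashing an explicit C-order byte copy gives the same digest.
print(digest == hashlib.md5(cols.tobytes()).hexdigest())
```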
Here is how I do it in jug (git HEAD at the time of this answer):
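    import hashlib
    import pickle

    e = some_array_object
    M = hashlib.md5()
    M.update(b'np.ndarray')  # bytes literal, as hashlib requires on Python 3
    M.update(pickle.dumps(e.dtype))
    M.update(pickle.dumps(e.shape))
    try:
        # Works when the array exposes a contiguous buffer
        buffer = e.data
        M.update(buffer)
    except Exception:
        # Non-contiguous array: hash a contiguous copy instead
        M.update(e.copy().data)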
The reason is that e.data is only available for some arrays (contiguous arrays). Same thing with a.view(np.uint8) (which fails with a non-descriptive type error if the array is not contiguous).
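The same idea can be wrapped in a small function. This is only a sketch, not jug's actual code: `array_digest` is a hypothetical name, and it encodes the dtype and shape with `str()`/`repr()` rather than `pickle.dumps`, and uses `np.ascontiguousarray` instead of a try/except, both of which are assumptions made here for brevity.

```python
import hashlib
import numpy as np

def array_digest(e):
    """Hypothetical helper: md5 over dtype, shape, and C-ordered raw bytes."""
    m = hashlib.md5()
    m.update(b"np.ndarray")
    m.update(str(e.dtype).encode("utf-8"))
    m.update(repr(e.shape).encode("utf-8"))
    m.update(np.ascontiguousarray(e))  # copies only if e is not C-contiguous
    return m.hexdigest()

a = np.arange(12, dtype=np.float64)
print(array_digest(a) == array_digest(a.copy()))            # True
print(array_digest(a) == array_digest(a.reshape(3, 4)))     # False: shape differs
print(array_digest(a[::2]) == array_digest(a[::2].copy()))  # True: strided input is handled
```

Including the dtype and shape means two arrays only share a digest when they agree on layout metadata as well as on raw bytes.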
Fastest by some margin seems to be:
a is a numpy ndarray.
Obviously not secure hashing, but it should be good for caching etc.
array.data is always hashable, because it's a buffer object. Easy :) (unless you care about the difference between differently-shaped arrays with the exact same data, etc.; i.e. this is suitable unless shape, byteorder, and other array 'parameters' must also figure into the hash).
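The caveat about shape can be seen in a short sketch (assuming Python 3, where `array.data` is a memoryview that hashlib accepts for C-contiguous arrays; the variable names are illustrative):

```python
import hashlib
import numpy as np

a = np.arange(6, dtype=np.int32)
b = a.reshape(2, 3)              # same memory, different shape

# Hashing only the raw buffer cannot tell the two arrays apart.
same_data = hashlib.md5(a.data).hexdigest() == hashlib.md5(b.data).hexdigest()
print(same_data)

# Prefixing the shape makes the digests differ.
digest_a = hashlib.md5(repr(a.shape).encode("utf-8") + a.tobytes()).hexdigest()
digest_b = hashlib.md5(repr(b.shape).encode("utf-8") + b.tobytes()).hexdigest()
print(digest_a == digest_b)
```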