A few thingz


Joseph Basquin


28/03/2024

#bigdata


N-dimensional array data store (with labeled indexing)

What am I trying to do?

I'm currently looking for the perfect data structure for an ongoing R&D task.

I need to work with a data store as an n-dimensional array x (of dimension 4 or more) such that:

Possible solutions

I'm looking for a good and lightweight solution.
To keep things simple, I deliberately avoid (for now):

method | ragged | non-consecutive indexing | numpy arithmetic | random access for 100 GB data store | notes
xarray ? no
sparse ? no
Pandas DataFrame + Numpy ndarray ? ? (*) (**)
Tensorflow tf.ragged.constant ? ? ?
Sqlite + Numpy ndarray ? ? ? ? to be tested

(*) Serialization with Parquet does not accept 2D or 3D arrays as cell values:

import numpy as np
import pandas as pd

# Each cell holds a whole ndarray: 2D in column 'a', 1D in column 'b'.
x = pd.DataFrame(columns=['a', 'b'])
for i in range(100):
    x.loc['t%i' % i] = [np.random.rand(100, 100), np.random.rand(2000)]
x.to_parquet('test.parquet')  # fails on the 2D arrays in column 'a'
# pyarrow.lib.ArrowInvalid: ('Can only convert 1-dimensional array values', 'Conversion failed for column a with type object')
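
Since the error only complains about non-1-dimensional values, one possible workaround is to ravel each array to 1D and keep its shape in a companion column. This is a minimal sketch, not something tested on a 100 GB store; the helper names `flatten_cells` and `restore_cells` and the `_shape` column suffix are hypothetical, chosen for illustration only.

```python
import numpy as np
import pandas as pd

def flatten_cells(df):
    """Ravel every ndarray cell to 1-D and record its shape in '<col>_shape'."""
    out = {}
    for col in df.columns:
        out[col] = pd.Series([np.asarray(a).ravel() for a in df[col]],
                             index=df.index, dtype=object)
        out[col + "_shape"] = pd.Series([list(np.shape(a)) for a in df[col]],
                                        index=df.index, dtype=object)
    return pd.DataFrame(out)

def restore_cells(flat):
    """Inverse of flatten_cells: reshape each 1-D cell back to its recorded shape."""
    out = {}
    for col in flat.columns:
        if col.endswith("_shape"):
            continue
        out[col] = pd.Series(
            [np.asarray(v).reshape(tuple(s))
             for v, s in zip(flat[col], flat[col + "_shape"])],
            index=flat.index, dtype=object)
    return pd.DataFrame(out)

# Small demo frame: ragged 2-D arrays in 'a', 1-D arrays in 'b'.
df = pd.DataFrame({
    "a": pd.Series([np.random.rand(4, 5), np.random.rand(2, 3)],
                   index=["t0", "t1"], dtype=object),
    "b": pd.Series([np.random.rand(7), np.random.rand(3)],
                   index=["t0", "t1"], dtype=object),
})
flat = flatten_cells(df)        # every array cell is now 1-dimensional
restored = restore_cells(flat)  # original shapes recovered in memory
```

The flattened frame is what would be passed to `to_parquet`; the round trip above is done in memory only, so whether it scales to a large store remains to be checked.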

(**) Serialization with HDF5 currently does not work:

import numpy as np
import pandas as pd

store = pd.HDFStore("store.h5")
df = pd.DataFrame(columns=['a', 'b'])
# One row whose cells are whole ndarrays (object dtype columns).
df.loc['t1'] = {'a': np.random.rand(100, 100), 'b': np.random.rand(2000)}
store.append('test', df)  # append() writes in table format and fails on column 'a'
store.close()
# TypeError: Cannot serialize the column [a] because its data contents are not [string] but [mixed] object dtype
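
For the "Sqlite + Numpy ndarray" row of the table, which is still to be tested, here is a minimal sketch of what such a store could look like: each array is serialized with `np.save` into a BLOB and looked up by a (label, field) key, so only the requested cells are read. The table name `cells` and the helpers `put` and `get` are hypothetical, and an in-memory database stands in for a real file.

```python
import io
import sqlite3
import numpy as np

def to_blob(arr):
    """Serialize an ndarray (dtype and shape included) to bytes in .npy format."""
    buf = io.BytesIO()
    np.save(buf, arr, allow_pickle=False)
    return buf.getvalue()

def from_blob(blob):
    """Rebuild an ndarray from bytes produced by to_blob."""
    return np.load(io.BytesIO(blob), allow_pickle=False)

# ":memory:" is for the demo only; a real store would use a file path.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE cells (label TEXT, field TEXT, data BLOB, "
    "PRIMARY KEY (label, field))"
)

def put(label, field, arr):
    """Insert or overwrite the array stored under (label, field)."""
    con.execute("INSERT OR REPLACE INTO cells VALUES (?, ?, ?)",
                (label, field, to_blob(arr)))

def get(label, field):
    """Load a single array by key; other rows are not deserialized."""
    row = con.execute(
        "SELECT data FROM cells WHERE label = ? AND field = ?",
        (label, field),
    ).fetchone()
    if row is None:
        raise KeyError((label, field))
    return from_blob(row[0])

# Labels need not be consecutive, and shapes may differ per cell (ragged).
put("t0", "a", np.arange(6.0).reshape(2, 3))
put("t7", "a", np.ones((4, 4)))
put("t7", "b", np.zeros(5))

# Plain numpy arithmetic on the loaded arrays.
total = get("t0", "a").sum() + get("t7", "a").sum()
```

This covers labeled lookup and ragged shapes, but slicing inside a stored array still means loading the whole BLOB first, so it is only a starting point.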

Contact me if you have ideas!

Links

https://stackoverflow.com/questions/72733385/data-structure-for-sparse-n-dimensional-array-tensor-such-a0-and-a1
https://stackoverflow.com/questions/72737525/pandas-rows-containing-numpy-ndarrays-various-shapes
https://stackoverflow.com/questions/72742007/pandas-dataframe-containing-numpy-ndarray-and-mean
https://stackoverflow.com/questions/72742843/100gb-data-store-pandas-dataframe-of-numpy-ndarrays-loading-only-a-small-part
