A few thingz


Joseph Basquin


21/11/2024

#bigdata


N-dimensional array data store (with labeled indexing)

What am I trying to do?

I'm currently looking for the perfect data structure for an ongoing R&D task.

I need to work with a data store as an n-dimensional array x (of dimension 4 or more) such that:

Possible solutions

I'm looking for a good and lightweight solution.
To keep things simple, I deliberately avoid (for now):

method ragged non-consecutive indexing numpy arithm. random access for 100 GB data store notes
xarray ? no
sparse ? no
Pandas DataFrame + Numpy ndarray ? ? (*) (**)
Tensorflow tf.ragged.constant ? ? ?
Sqlite + Numpy ndarray ? ? ? ? to be tested

(*) serialization with parquet: doesn't accept 2D or 3D arrays:

import numpy as np
import pandas as pd

# Each row holds a 2D array in column 'a' and a 1D array in column 'b'
x = pd.DataFrame(columns=['a', 'b'])
for i in range(100):
    x.loc['t%i' % i] = [np.random.rand(100, 100), np.random.rand(2000)]
x.to_parquet('test.parquet')
# pyarrow.lib.ArrowInvalid: ('Can only convert 1-dimensional array values', 'Conversion failed for column a with type object')
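
Since the error message says that only 1-dimensional values can be converted, one possible workaround is to store each array flattened, together with its shape, and to reshape it after loading. The sketch below only shows the flatten/restore round trip with NumPy. The helper names `pack` and `unpack` are hypothetical, and it has not been checked here how well this behaves with parquet on a large store.

```python
import numpy as np


def pack(a):
    """Return (flat, shape): the data as a 1-D array plus the shape needed to restore it."""
    a = np.asarray(a)
    return a.ravel(), list(a.shape)


def unpack(flat, shape):
    """Rebuild the original n-dimensional array from its flattened form."""
    return np.asarray(flat).reshape(tuple(shape))


original = np.random.rand(100, 100)
flat, shape = pack(original)      # flat is 1-D, shape is [100, 100]
restored = unpack(flat, shape)    # same values and shape as original
```

Each row would then hold `flat` in column 'a' and `shape` in an extra column (for example a hypothetical 'a_shape'), at the cost of one reshape per access.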

(**) serialization with hdf5: currently not working:

import numpy as np
import pandas as pd

store = pd.HDFStore("store.h5")
df = pd.DataFrame(columns=['a', 'b'])
# One row holding a 2D array in column 'a' and a 1D array in column 'b'
df.loc['t1'] = {'a': np.random.rand(100, 100), 'b': np.random.rand(2000)}
store.append('test', df)
store.close()
# TypeError: Cannot serialize the column [a] because its data contents are not [string] but [mixed] object dtype
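
As a starting point for the "Sqlite + Numpy ndarray" option listed as "to be tested" in the table, a minimal sketch could look like the following. It assumes a hypothetical schema with one row per (label, field) pair and stores each array as a BLOB in NumPy's .npy format, so dtype and shape travel with the data. The table name `store` and the helpers `array_to_blob`, `blob_to_array` and `get` are made up for illustration, and nothing here has been benchmarked on a 100 GB store.

```python
import io
import sqlite3

import numpy as np


def array_to_blob(a):
    """Serialize an ndarray (dtype and shape included) to bytes in .npy format."""
    buf = io.BytesIO()
    np.save(buf, a, allow_pickle=False)
    return buf.getvalue()


def blob_to_array(blob):
    """Inverse of array_to_blob."""
    return np.load(io.BytesIO(blob), allow_pickle=False)


# Hypothetical schema: one row per (label, field) pair.
db = sqlite3.connect(":memory:")  # use a file path for an on-disk store
db.execute(
    "CREATE TABLE store (label TEXT, field TEXT, data BLOB, "
    "PRIMARY KEY (label, field))"
)

rng = np.random.default_rng(0)
for i in range(100):
    label = "t%i" % i
    # Ragged: the shape of 'a' varies from one label to the next.
    a = rng.random((100 + i, 100))
    b = rng.random(2000)
    db.execute("INSERT INTO store VALUES (?, ?, ?)", (label, "a", array_to_blob(a)))
    db.execute("INSERT INTO store VALUES (?, ?, ?)", (label, "b", array_to_blob(b)))
db.commit()


def get(label, field):
    """Load a single array by label, without reading the rest of the store."""
    row = db.execute(
        "SELECT data FROM store WHERE label = ? AND field = ?", (label, field)
    ).fetchone()
    return blob_to_array(row[0])


x = get("t42", "a")   # random access by label
y = x * 2 + 1         # ordinary NumPy arithmetic on the loaded array
```

With this layout, updating one array is a single-row `UPDATE` or `INSERT OR REPLACE` rather than a rewrite of the whole file. Slicing inside one array still requires loading that whole BLOB first.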

Contact me if you have ideas!

Links

Data structure for n-dimensional array / tensor such A[0, :, :] and A[1, :, :] can have different shapes
Pandas rows containing numpy ndarrays various shapes
Pandas Dataframe containing Numpy ndarray and mean
100GB data store: Pandas dataframe of numpy ndarrays: loading only a small part + avoid rewriting the whole file when doing small modifications
