11/08/2024

# #bigdata

## N-dimensional array data store (with labeled indexing)

### What am I trying to do?

I'm currently looking for the perfect data structure for an ongoing R&D task.

I need to work with a data store as an n-dimensional array `x` (of dimension 4 or more) such that:

• (1): "Ragged" array

It should be possible for `x[0, 0, :, :]` to have shape (100, 100) and for `x[0, 1, :, :]` to have shape (10000, 10000), without wasting memory by padding the last two dimensions to the largest shape (10000, 10000).
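As a naive illustration of requirement (1), a ragged structure can be sketched as a plain dict keyed by the leading indices, each value being an independent 2-D array with its own shape (smaller shapes than above, for illustration). No padding is needed, but of course this loses Numpy slicing and arithmetic across the leading dimensions, which is exactly what the rest of this post is looking for:

```python
import numpy as np

# Naive sketch of a ragged 4-D array: a dict keyed by the two leading
# indices; each value is an independent 2-D array with its own shape.
x = {
    (0, 0): np.zeros((100, 100)),
    (0, 1): np.zeros((1000, 1000)),  # different shape, no padding
}

assert x[(0, 0)].shape == (100, 100)
assert x[(0, 1)].shape == (1000, 1000)
```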

• (2): Labeled indexing instead of positional consecutive indexing

I would also like to be able to work with `x[19700101000000, :, :, :]` or `x[20220623180800, :, :, :]`, i.e. one dimension would be indexed by a numerical timestamp in YYYYMMDDhhmmss format, or more generally by an arbitrary integer label (not a consecutive `0...n-1` range).
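For requirement (2) alone (ignoring dimensionality), pandas already makes the label/position distinction explicit: `.loc` selects by label and `.iloc` by position, so integer timestamps can serve directly as labels. A minimal sketch with illustrative values:

```python
import pandas as pd

# Integer timestamps used as index labels, not positions.
s = pd.Series([1.0, 2.0], index=[19700101000000, 20220623180800])

assert s.loc[20220623180800] == 2.0   # label-based lookup
assert s.iloc[1] == 2.0               # same element, positional lookup
```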

• (3): Easy Numpy-like arithmetic

All of this should preserve (as much as possible) the standard Numpy operations: basic arithmetic, array slicing, useful functions such as `x.mean(axis=0)` to average the data over a dimension, etc.

• (4): Random access

I would like this data store to be able to grow to 100 GB. This means it should be possible to work with it without loading the whole dataset into memory.

We should be able to open the data store, modify some values and save, without rewriting the whole 100 GB file:

```python
x = datastore.open('datastore.dat')                              # open the data store, *without* loading everything in memory
x[20220624000000, :, :, :] = 0                                   # modify some values
x[20220510120000, :, :, :] -= x[20220510120000, :, :, :].mean()  # modify other values
x.close()                                                        # only a few bytes written to disk
```
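The random-access part of this in isolation can be sketched with `numpy.memmap` (file name and a small shape chosen for illustration): the file is memory-mapped rather than loaded, and assigning to a slice only dirties the touched pages, so a flush writes back a few pages, not the whole file:

```python
import numpy as np

# The file backing the array is mapped, not loaded into memory.
x = np.memmap('datastore.dat', dtype='float64', mode='w+', shape=(10, 100, 100))

x[3, :, :] = 7.0                  # modify one slice in place
x[3, :, :] -= x[3, :, :].mean()   # Numpy arithmetic works on the mapped data
x.flush()                         # only the modified pages are written back

assert x[3].sum() == 0.0
```

Note that `memmap` gives random access and Numpy semantics, but neither ragged shapes nor labeled indexing, so it covers (3) and (4) only.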

### Possible solutions

I'm looking for a good and lightweight solution.
To keep things simple, I deliberately avoid (for now):

• BigQuery
• PySpark ("Note that PySpark requires Java 8 or later with ...")
• and more generally all cloud solutions or client/server solutions: I'd like a solution that runs on a single computer without networking
| method | ragged | non-consecutive indexing | numpy arithm. | random access for 100 GB data store | notes |
|---|---|---|---|---|---|
| `xarray` | ? | no | | | |
| `sparse` | ? | no | | | |
| Pandas `DataFrame` + Numpy `ndarray` | ? | ? | | (*) (**) | |
| Tensorflow `tf.ragged.constant` | ? | ? | ? | | |
| Sqlite + Numpy `ndarray` | ? | ? | ? | ? | to be tested |
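For the "Sqlite + Numpy" row ("to be tested"), one possible sketch is to store each labeled slice as a BLOB keyed by an integer timestamp, with its shape stored alongside so slices can be ragged; since SQLite updates pages in place, rewriting one row does not rewrite the whole file. The schema and helper names below are assumptions, not an established recipe:

```python
import sqlite3
import numpy as np

# Hypothetical schema: one row per labeled 2-D slice, shape stored explicitly
# so each slice can have its own dimensions (ragged).
con = sqlite3.connect(':memory:')        # use a file path for a real store
con.execute('CREATE TABLE slices (ts INTEGER PRIMARY KEY, rows INT, cols INT, data BLOB)')

def put(ts, arr):
    # Serialize the array to raw bytes; REPLACE overwrites an existing label.
    con.execute('REPLACE INTO slices VALUES (?, ?, ?, ?)',
                (ts, arr.shape[0], arr.shape[1], arr.astype('float64').tobytes()))

def get(ts):
    # Rebuild the array from its stored bytes and shape.
    rows, cols, blob = con.execute(
        'SELECT rows, cols, data FROM slices WHERE ts = ?', (ts,)).fetchone()
    return np.frombuffer(blob, dtype='float64').reshape(rows, cols)

put(19700101000000, np.zeros((100, 100)))
put(20220623180800, np.ones((10, 10)))    # ragged: a different shape

assert get(20220623180800).shape == (10, 10)
assert get(19700101000000).mean() == 0.0
```

This covers (1), (2) and (4), but Numpy arithmetic (3) only applies per slice after a `get`, not across the store.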

(*) serialization with `parquet`: doesn't accept 2D or 3D arrays:

```python
import numpy as np, pandas as pd
x = pd.DataFrame(columns=['a', 'b'])
for i in range(100):
    x.loc['t%i' % i] = [np.random.rand(100, 100), np.random.rand(2000)]
x.to_parquet('test.parquet')
# pyarrow.lib.ArrowInvalid: ('Can only convert 1-dimensional array values', 'Conversion failed for column a with type object')
```

(**) serialization with `hdf5`: currently not working:

```python
import numpy as np, pandas as pd
store = pd.HDFStore("store.h5")
df = pd.DataFrame(columns=['a', 'b'])
df.loc['t1'] = {'a': np.random.rand(100, 100), 'b': np.random.rand(2000)}
store.append('test', df)
store.close()
# TypeError: Cannot serialize the column [a] because its data contents are not [string] but [mixed] object dtype
```

Contact me if you have ideas!