A few thingz
Joseph Basquin
07/10/2024
#bigdata
N-dimensional array data store (with labeled indexing)
What am I trying to do?
I'm currently looking for the perfect data structure for an ongoing R&D task.
I need to work with a data store as a n-dimensional array x
(of dimension 4 or more) such that:
-
(1) "Ragged" array
It should be possible that
x[0, 0, :, :]
is of shape (100, 100), andx[0, 1, :, :]
is of shape (10000, 10000) without wasting memory by making the two last dimensions always fixed to the largest value (10000, 10000). -
(2) Labeled indexing instead of positional consecutive indexing
I also would like to be able to work with
x[19700101000000, :, :, :]
,x[202206231808, :, :, :]
, i.e. one dimension would be a numerical timestamp in format YYYYMMDDhhmmss or more generally an integer label (not a continuous0...n-1
range). -
(3): Easy Numpy-like arithmetic
All of this should keep (as much as possible) all the standard Numpy basic operations, such as basic arithmetic, array slicing, useful functions such as
x.mean(axis=0)
to average the data over a dimension, etc. -
(4): Random access
I would like this data store to be possibly 100 GB large. This means it should be possible to work with it without loading the whole dataset in memory.
We should be able to open the data store, modify some values and save, without rewriting the whole 100 GB file:
x = datastore.open('datastore.dat') # open the data store, *without* loading everything in memory x[20220624000000, :, :, :] = 0 # modify some values x[20220510120000, :, :, :] -= x[20220510120000, :, :, :].mean() # modify other values x.close() # only a few bytes written to disk
Possible solutions
I'm looking for a good and lightweight solution.
To keep things simple, I deliberately avoid (for now):
- BigQuery
- PySpark ("Note that PySpark requires Java 8 or later with ...")
- and more generally all cloud solutions or client/server solutions: I'd like a solution that runs on a single computer without networking
method | ragged | non-consecutive indexing | numpy arithm. | random access for 100 GB data store | notes |
---|---|---|---|---|---|
xarray |
? | ✓ | ✓ | no | |
sparse |
? | ✓ | ✓ | no | |
Pandas DataFrame + Numpy ndarray |
✓ | ✓ | ? | ? | (*) (**) |
Tensorflow tf.ragged.constant |
✓ | ? | ? | ? | |
Sqlite + Numpy ndarray |
? | ? | ? | ? | to be tested |
(*) serialization with parquet
: doesn't accept 2D or 3D arrays:
import numpy as np, pandas as pd
x = pd.DataFrame(columns=['a', 'b'])
for i in range(100):
x.loc['t%i' % i] = [np.random.rand(100, 100), np.random.rand(2000)]
x.to_parquet('test.parquet')
# pyarrow.lib.ArrowInvalid: ('Can only convert 1-dimensional array values', 'Conversion failed for column a with type object')
(**) serialization with hdf5
: currently not working:
import numpy as np, pandas as pd
store = pd.HDFStore("store.h5")
df = pd.DataFrame(columns=['a', 'b'])
df.loc['t1'] = {'a': np.random.rand(100, 100), 'b': np.random.rand(2000)}
store.append('test', df)
store.close()
# TypeError: Cannot serialize the column [a] because its data contents are not [string] but [mixed] object dtype
Contact me if you have ideas!
Links
Data structure for n-dimensional array / tensor such A[0, :, :] and A[1, :, :] can have different shapes
Pandas rows containing numpy ndarrays various shapes
Pandas Dataframe containing Numpy ndarray and mean
100GB data store: Pandas dataframe of numpy ndarrays: loading only a small part + avoid rewriting the whole file when doing small modifications