A few thingz


Joseph Basquin


25/04/2024

#datascience


Working on PDF files with Python

There are many solutions to work on PDF files with Python. Depending on whether you need to read, parse data, extract tables, modify (split, merge, crop...), or create a new PDF, you will need different tools.

Here is a quick diagram of some common tools I have used:

If you need to extract data from image PDF files, it's a whole different story, and you might need to use OCR libraries like (Py)Tesseract or other tools.

Have some specific data conversion / extraction needs? Please contact me for consulting - a little script can probably automate hours of manual processing in a few seconds!

N-dimensional array data store (with labeled indexing)

What am I trying to do?

I'm currently looking for the perfect data structure for an ongoing R&D task.

I need to work with a data store as a n-dimensional array x (of dimension 4 or more) such that:

Possible solutions

I'm looking for a good and lightweight solution.
To keep things simple, I deliberately avoid (for now):

method ragged non-consecutive indexing numpy arithm. random access for 100 GB data store notes
xarray ? no
sparse ? no
Pandas DataFrame + Numpy ndarray ? ? (*) (**)
Tensorflow tf.ragged.constant ? ? ?
Sqlite + Numpy ndarray ? ? ? ? to be tested

(*) serialization with parquet: doesn't accept 2D or 3D arrays:

import numpy as np, pandas as pd
x = pd.DataFrame(columns=['a', 'b'])
for i in range(100):
    x.loc['t%i' % i] = [np.random.rand(100, 100), np.random.rand(2000)]
x.to_parquet('test.parquet')
# pyarrow.lib.ArrowInvalid: ('Can only convert 1-dimensional array values', 'Conversion failed for column a with type object')

(**) serialization with hdf5: currently not working:

import numpy as np, pandas as pd
store = pd.HDFStore("store.h5")
df = pd.DataFrame(columns=['a', 'b'])
df.loc['t1'] = {'a': np.random.rand(100, 100), 'b': np.random.rand(2000)}
store.append('test', df)
store.close()
# TypeError: Cannot serialize the column [a] because its data contents are not [string] but [mixed] object dtype

Contact me if you have ideas!

Links

https://stackoverflow.com/questions/72733385/data-structure-for-sparse-n-dimensional-array-tensor-such-a0-and-a1, https://stackoverflow.com/questions/72737525/pandas-rows-containing-numpy-ndarrays-various-shapes, https://stackoverflow.com/questions/72742007/pandas-dataframe-containing-numpy-ndarray-and-mean, https://stackoverflow.com/questions/72742843/100gb-data-store-pandas-dataframe-of-numpy-ndarrays-loading-only-a-small-part

My personal blog.

twitter
email
github

Data / AI / Python consulting and freelancing.

Articles about:
#all
#music
#photo
#opensource
#python