A few thingz


Joseph Basquin


19/03/2024

Working on PDF files with Python

There are many ways to work with PDF files in Python. Depending on whether you need to read, parse data, extract tables, modify (split, merge, crop...), or create a new PDF, you will need different tools.

Here is a quick diagram of some common tools I have used:

If you need to extract data from image PDF files, it's a whole different story, and you might need to use OCR libraries like (Py)Tesseract or other tools.
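As an illustration of the "modify (split, merge, crop...)" case, here is a minimal sketch that copies a selection of pages into a new PDF. It assumes the third-party pypdf library (my choice for the example, not necessarily one of the tools from the diagram); the helper names parse_page_ranges and extract_pages are made up for this sketch.

```python
def parse_page_ranges(spec):
    """Turn a string like '1-3,5' into 0-based page indices [0, 1, 2, 4]."""
    pages = []
    for part in spec.split(","):
        first, _, last = part.strip().partition("-")
        start = int(first)
        stop = int(last) if last else start
        # Ranges are 1-based and inclusive in the spec, 0-based in the result.
        pages.extend(range(start - 1, stop))
    return pages


def extract_pages(src, dest, spec):
    """Write the pages of `src` selected by `spec` to a new PDF `dest`."""
    # Assumption: pypdf is installed (pip install pypdf).
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src)
    writer = PdfWriter()
    for index in parse_page_ranges(spec):
        writer.add_page(reader.pages[index])
    with open(dest, "wb") as f:
        writer.write(f)
```

Usage would look like extract_pages("input.pdf", "output.pdf", "1-3,5"), where both file names are placeholders.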

Have some specific data conversion / extraction needs? Please contact me for consulting - a little script can probably automate hours of manual processing in a few seconds!

N-dimensional array data store (with labeled indexing)

What am I trying to do?

I'm currently looking for the perfect data structure for an ongoing R&D task.

I need to work with a data store as an n-dimensional array x (of dimension 4 or more) such that:

Possible solutions

I'm looking for a good and lightweight solution.
To keep things simple, I deliberately avoid (for now):

method ragged non-consecutive indexing numpy arithm. random access for 100 GB data store notes
xarray ? no
sparse ? no
Pandas DataFrame + Numpy ndarray ? ? (*) (**)
Tensorflow tf.ragged.constant ? ? ?
Sqlite + Numpy ndarray ? ? ? ? to be tested

(*) serialization with parquet: doesn't accept 2D or 3D arrays:

import numpy as np, pandas as pd
x = pd.DataFrame(columns=['a', 'b'])
for i in range(100):
    x.loc['t%i' % i] = [np.random.rand(100, 100), np.random.rand(2000)]
x.to_parquet('test.parquet')
# pyarrow.lib.ArrowInvalid: ('Can only convert 1-dimensional array values', 'Conversion failed for column a with type object')

(**) serialization with hdf5: currently not working:

import numpy as np, pandas as pd
store = pd.HDFStore("store.h5")
df = pd.DataFrame(columns=['a', 'b'])
df.loc['t1'] = {'a': np.random.rand(100, 100), 'b': np.random.rand(2000)}
store.append('test', df)
store.close()
# TypeError: Cannot serialize the column [a] because its data contents are not [string] but [mixed] object dtype
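For the "Sqlite + Numpy ndarray" row, which is still marked "to be tested", here is a minimal sketch of what such a store could look like. It is only an illustration under my own assumptions: the ArrayStore class, the table layout and the (label, field) key are invented for this example, and nothing here has been tried on a 100 GB store.

```python
import io
import sqlite3

import numpy as np


def array_to_blob(a):
    """Serialize an ndarray (any shape, numeric dtype) to bytes in .npy format."""
    buf = io.BytesIO()
    np.save(buf, a, allow_pickle=False)
    return buf.getvalue()


def blob_to_array(blob):
    """Inverse of array_to_blob."""
    return np.load(io.BytesIO(blob), allow_pickle=False)


class ArrayStore:
    """Hypothetical key-value store: (label, field) -> ndarray of any shape."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS arrays "
            "(label TEXT, field TEXT, data BLOB, PRIMARY KEY (label, field))"
        )

    def __setitem__(self, key, a):
        label, field = key
        self.db.execute(
            "INSERT OR REPLACE INTO arrays VALUES (?, ?, ?)",
            (label, field, array_to_blob(np.asarray(a))),
        )
        self.db.commit()

    def __getitem__(self, key):
        label, field = key
        hit = self.db.execute(
            "SELECT data FROM arrays WHERE label = ? AND field = ?",
            (label, field),
        ).fetchone()
        if hit is None:
            raise KeyError(key)
        return blob_to_array(hit[0])


store = ArrayStore()                       # in-memory; pass a file path to persist
store["t1", "a"] = np.arange(6.0).reshape(2, 3)
store["t1", "b"] = np.arange(4.0)          # ragged: a different shape is fine
store["t7", "a"] = np.ones((5, 5))         # non-consecutive labels are fine
print(store["t1", "a"].mean())
```

Each lookup only reads the requested blob, and the result is a regular ndarray, so numpy arithmetic works on it. Slicing along the labels (for example "all rows from t1 to t7") is not covered by this sketch.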

Contact me if you have ideas!

Links

https://stackoverflow.com/questions/72733385/data-structure-for-sparse-n-dimensional-array-tensor-such-a0-and-a1
https://stackoverflow.com/questions/72737525/pandas-rows-containing-numpy-ndarrays-various-shapes
https://stackoverflow.com/questions/72742007/pandas-dataframe-containing-numpy-ndarray-and-mean
https://stackoverflow.com/questions/72742843/100gb-data-store-pandas-dataframe-of-numpy-ndarrays-loading-only-a-small-part

Python + TensorFlow + GPU + CUDA + CUDNN setup with Ubuntu

Every time I set up Python + TensorFlow on a new machine with a fresh Ubuntu install, I have to spend time on this topic again and go through some trial and error (yes, I'm speaking about such issues). So here is a little HOWTO, once and for all.

Important fact: we need to install the exact versions of CUDA and CUDNN that match a particular version of TensorFlow, otherwise it will fail with errors like libcudnn.so.7: cannot open shared object file: No such file or directory.

For example, for TensorFlow 2.3, we have to use CUDA 10.1 and CUDNN 7.6 (see here).

Here is how to install on Ubuntu 18.04:

pip3 install --upgrade pip   # it was mandatory to upgrade for me
pip3 install keras tensorflow==2.3.0

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/ /"
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt-get update
sudo apt install cuda-10-1 nvidia-driver-430

To test if the NVIDIA driver is properly installed, you can run nvidia-smi (I noticed a reboot was necessary).

Then download "cuDNN v7.6.5 (November 5th, 2019), for CUDA 10.1" from https://developer.nvidia.com/rdp/cudnn-archive (you need to create an account there), and then:

sudo dpkg -i libcudnn7_7.6.5.32-1+cuda10.1_amd64.deb     

That's it! Reboot the computer, launch Python 3 and do:

import tensorflow
tensorflow.test.gpu_device_name()     # also, tensorflow.test.is_gpu_available() should give True

The last line should display the right GPU device name. If you get an empty string instead, it means your GPU isn't used by TensorFlow!
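If the device name comes back empty, one quick check is whether the shared libraries can be loaded at all, independently of TensorFlow. The sketch below uses only the standard library. libcudnn.so.7 is the name from the error message quoted earlier; libcudart.so.10.1 is my assumption for the CUDA 10.1 runtime library name, and the can_load helper is made up for this example.

```python
import ctypes


def can_load(soname):
    """Return True if the dynamic loader can find and open `soname`."""
    try:
        ctypes.CDLL(soname)
        return True
    except OSError:
        return False


# Assumed library names for TensorFlow 2.3 (CUDA 10.1 + cuDNN 7.6).
REQUIRED = ["libcudart.so.10.1", "libcudnn.so.7"]

for name in REQUIRED:
    print(name, "OK" if can_load(name) else "MISSING")
```

A "MISSING" line points to a version mismatch or an incomplete install rather than a TensorFlow problem.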

Notes:

Quick tip: Reboot a Livebox with a Python script

A handy little trick to reboot a Livebox Play in 4 lines of code (LEMOTDEPASSEICI stands for your admin password):

import requests
r = requests.post("http://192.168.1.1/authenticate?username=admin&password=LEMOTDEPASSEICI")
h = {'Content-Type': 'application/json; charset=UTF-8', 'X-Context': r.json()['data']['contextID']}
s = requests.post("http://192.168.1.1/sysbus/NMC:reboot", headers=h, cookies=r.cookies)

With a Livebox 4 or 5, here is the method:

import requests
session = requests.Session()
auth = '{"service":"sah.Device.Information","method":"createContext","parameters":{"applicationName":"so_sdkut","username":"admin","password":"LEMOTDEPASSEICI"}}'
r = session.post('http://192.168.1.1/ws', data=auth, headers={'Content-Type': 'application/x-sah-ws-1-call+json', 'Authorization': 'X-Sah-Login'})
h = {'X-Context': r.json()['data']['contextID'], 'X-Prototype-Version': '1.7', 'Content-Type': 'application/x-sah-ws-1-call+json; charset=UTF-8', 'Accept': 'text/javascript'}
s = session.post("http://192.168.1.1/sysbus/NMC:reboot", headers=h, data='{"parameters":{}}')
print(s.json())

Inspired by this post using curl, by this project (the same thing in ... 99 lines of code ;)) and by the sysbus library.

NB: this reboot method changes the Livebox's IP address on restart.

"Since"

A song I made a few months ago.

Join/Leave · Since

nFreezer, a secure remote backup tool

So you make backups of your sensitive data on a remote server. How can you be sure that it is really safe on the destination server?

By safe, I mean "safe even if a malicious user gains access" on the destination server; here we're looking for a solution such that, even if a hacker attacks your server (and installs compromised software on it), they cannot read your data.

You might think that using SFTP/SSH (and/or rsync, or sync programs) and using an encrypted filesystem on the server is enough. In fact, no: there will be a short time during which the data will be processed unencrypted on the remote server (at the output of the SSH layer, and before arriving at the filesystem encryption layer).

How to solve this problem? By using an encrypted-at-rest backup program: the data is encrypted locally, and is never decrypted on the remote server.

I created nFreezer for this purpose.

Main features:

More about this on nFreezer.




By the way, I just published another (local) backup tool on PyPI: backupdisk, which you can install with pip install diskbackup. It lets you quickly back up your disk to an external USB HDD in one line:

diskbackup.backup(src=r'D:\Documents', dest=r'I:\Documents', exclude=['.mp4'])




Update: many thanks to @Korben for his article nFreezer – De la sauvegarde chiffrée de bout en bout (December 12, 2020).

Get organized with your stuff – all you need is a 5-character identifier

 

After years of music production, photography, electronics, programming, <name your favorite creative field here>, or whatever, we probably all end up with the same situation: we accumulate a lot of gear.

Most of these items are (thankfully) working, some of them are broken (but we keep them just in case), and some others, well ... we don't really know, probably because we never properly identified them.

I'm speaking about USB cables, phone chargers/PSUs (good and not-so-good ones), external hard drives that all look the same, microphones, XLR microphone cables, audio interfaces, etc.

Usually it's OK to use one item or another, but for special occasions (an important recording session / photo shoot / whatever), you don't want your work to be spoiled because, among 5 units, you picked the wrong laptop power supply, the only one that produces an annoying 50 Hz buzz when recording audio.

Here is an easy rule to circumvent this problem:

All you need is to label your items with a 5-character ID

with a pen, some tape

and to make an inventory with your (tested) items:

 

But why 5 random alphanumeric characters? Because every time you want to label a new object, you won't have to worry about "Was this ID already taken or not in my inventory?"
With a high enough probability, it will not already be taken.

To be more precise, if you label 1000 objects in your life with these IDs made of 5 random alphanumeric characters, there is a probability of about 0.8% that at least two objects get the same label. I think that's OK. This is a classic application of the birthday problem. These figures correspond to a 36-symbol alphabet (26 case-insensitive letters plus 10 digits). I personally don't care if, once in my life, two items have the same ID in my inventory, but if I did care, I would just use a 6-character alphanumeric ID (in this case the probability of at least one collision is about 0.02%).
Ok, this is just UUID applied to real life.
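The figures above can be checked with a few lines of Python. This is a small sketch under one assumption: the alphabet is 26 uppercase letters plus 10 digits (36 symbols), which is the size that reproduces the 0.8% figure. The function names new_label and collision_probability are just for illustration.

```python
import math
import secrets
import string

# Assumption: 36 symbols (A-Z and 0-9), case-insensitive labels.
ALPHABET = string.ascii_uppercase + string.digits


def new_label(length=5):
    """Draw a random label, e.g. to write on a piece of tape."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def collision_probability(n_items, length=5, alphabet_size=len(ALPHABET)):
    """Birthday problem: probability that at least two of n_items labels match."""
    n_labels = alphabet_size ** length
    if n_items > n_labels:
        return 1.0
    # P(no collision) = prod_{k=0}^{n-1} (1 - k/N), computed in log space.
    log_no_collision = sum(math.log1p(-k / n_labels) for k in range(n_items))
    return -math.expm1(log_no_collision)


print(new_label())
print(f"{collision_probability(1000, 5):.2%}")  # 0.82%
print(f"{collision_probability(1000, 6):.3%}")  # 0.023%
```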

I can hear you saying:

"Well that's nonsense, I can just number the items #0001, #0002, and so on. Why a random alphanumeric ID?"

Reason #1: Let's say you have 5 cables around you. You label them #0001, #0002, ..., #0005. Two months later you have a new cable with no label, and you don't have the inventory handy. "Where did I stop in the numbering last time? I think I stopped at #0004, so let's label this one #0005." (1 hour later) "Oops, no, #0005 was already taken. But maybe #0006 as well? Well, no problem, let's label it #9999." (2 months later) "How to label this new cable? Did I already have a #9998 or not?"
As we can see, using an increasing sequence requires us to remember where we stopped the previous time, which is not convenient.

Reason #2: If you have multiple item types (cables, PSU, hard drives), you will have many objects numbered #0001, so it's not easy to find them in an inventory. Here you can have a single inventory file with all your stuff. Once again, it's unlikely that two items in your life will have the same label.

 

Interested in this kind of useless thing?

Vversioning, a quick and dirty code versioning system

For some projects you need a real code versioning system (like git or similar tools).

But for some others, typically micro-size projects, you sometimes don't want to use a complex tool. You might want to move to git later when the project gets bigger, or when you want to publish it online, but using git for any small project you begin creates a lot of friction for some users like me. Examples:

User A: "I rarely use git. If I use it for this project, will I remember which commands to use in 5 years when I want to reopen this project? Or will I get stuck with git wtf and be unable to quickly see the different versions?".

User B: "I want to be able to see the different versions of my code even if no software like git is installed (ex: using my parents' computer)."

User C: "My project is just a single file. I don't want to use a complex versioning system for this. How can I archive the versions?"

For this reason, I just made this:

vversioning

It is a (quick and dirty) versioning system, done in less than 50 lines of Python code.
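To give an idea of how little code such a tool needs, here is a sketch of the general idea for the single-file case (User C). It is not the actual vversioning code: the snapshot and versions functions, the versions.zip archive name and the timestamped naming scheme are all assumptions made for this example.

```python
import time
import zipfile
from pathlib import Path


def snapshot(source, archive="versions.zip"):
    """Store a timestamped copy of `source` in `archive`.

    Returns the name stored in the archive, or None if the file has not
    changed since its last snapshot.
    """
    source = Path(source)
    data = source.read_bytes()
    with zipfile.ZipFile(archive, "a", compression=zipfile.ZIP_DEFLATED) as z:
        # Stored names look like 20240319_101500_0003_myscript.py
        previous = [n for n in z.namelist() if n.split("_", 3)[3] == source.name]
        if previous and z.read(previous[-1]) == data:
            return None
        stamp = time.strftime("%Y%m%d_%H%M%S")
        name = f"{stamp}_{len(previous):04d}_{source.name}"
        z.writestr(name, data)
    return name


def versions(archive="versions.zip"):
    """List the stored versions, oldest first."""
    with zipfile.ZipFile(archive) as z:
        return z.namelist()
```

Calling snapshot("myscript.py") after each meaningful change (the file name is a placeholder) builds up a plain zip file, so the versions stay readable on any computer, even without Python installed.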

"Comme un ciel sans nuage"

Here is some 80s-cheeeeesy French pop I made with Gaëlle W. :

FastReply – Lightweight template system for your emails

 

Install it here: FastReply Chrome extension

 

Note:

Interested in future developments and other (smarter) autoreply email tools?
(several other hour-saving tools in progress)
