Data Preparation

Input files must be in .tsv format: plain-text files with tab-separated columns.

Screenshot

Requirements:

A well-formatted, working example input file can be downloaded from here.
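Before uploading your own file, it can help to confirm that it really is tab-separated and has a consistent shape. The snippet below is only a minimal sketch using Python's standard `csv` module. The helper name `check_tsv` is hypothetical, and the 2-or-3-column rule is taken from the two options described in this section, not from any Cleora validation routine.

```python
import csv
import io

def check_tsv(text: str) -> int:
    """Return the column count if every row has 2 or 3 tab-separated fields."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    if not rows:
        raise ValueError("input is empty")
    # Collect the distinct row widths; a well-formed file has exactly one.
    widths = {len(row) for row in rows}
    if len(widths) != 1:
        raise ValueError(f"inconsistent column counts: {sorted(widths)}")
    width = widths.pop()
    if width not in (2, 3):
        raise ValueError(f"expected 2 or 3 columns, found {width}")
    return width

# Made-up sample content: a header row plus two events.
sample = "client_id\tentity_id\n1\t10\n2\t10\n"
print(check_tsv(sample))  # 2
```

To check a file on disk, read its contents first, for example with `open(path, newline="").read()`, and pass the resulting string to the helper.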

Option 1: Database export without timestamps

In the simplest case, you need a 2-column extract from your database. The IDs in both columns can (and probably should) be anonymized. Each row expresses an event: the fact that a certain user ID interacted with a certain product ID at a given point in time.

Screenshot

Input file creation with Pandas:

>>> df.columns
Index(['client_id', 'entity_id'], dtype='object')

>>> df.dtypes
client_id    int64
entity_id    int64
dtype: object

>>> df.to_csv('cleora_input.tsv', sep='\t', header=True, index=False)

Column 1 gives Cleora a key for grouping, which is used to create edges in the graph. This will usually be user IDs, session IDs, shop IDs, invoice IDs, or similar entities that define how the entities in Column 2 are grouped.

Column 2 is the column that gets embedded: you will obtain an embedding for each entity that appears in it. These will usually be product IDs, brand IDs, category IDs, or other entities your users interact with.
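As a rough illustration of the anonymization suggested earlier, raw IDs can be replaced with opaque integer codes before export. The DataFrame `raw` and its values are made up for this sketch, the column names follow the `client_id` / `entity_id` naming from the example above, and `pd.factorize` is just one possible way to anonymize, not something Cleora requires.

```python
import pandas as pd

# Hypothetical raw events: one row per (user, product) interaction.
raw = pd.DataFrame({
    "client_id": ["alice@example.com", "bob@example.com", "alice@example.com"],
    "entity_id": ["SKU-100", "SKU-100", "SKU-200"],
})

# Replace each distinct value with an integer code (0, 1, 2, ...).
# Within a column, the same original ID always maps to the same code.
df = pd.DataFrame({
    "client_id": pd.factorize(raw["client_id"])[0],
    "entity_id": pd.factorize(raw["entity_id"])[0],
})

# With no path given, to_csv returns the tab-separated text as a string;
# pass a file name such as 'cleora_input.tsv' to write to disk instead.
tsv_text = df.to_csv(None, sep="\t", header=True, index=False)
print(tsv_text)
```

`pd.factorize` also returns the original values as its second result. If you anonymize this way, keep that mapping for Column 2 so the resulting embeddings can be matched back to the real entities.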

Option 2: Database export with timestamps

Optionally, you can use a 3-column format. The logic is the same as in the 2-column format, but a 3rd column holds a timestamp for each event.

Timestamps allow Cleora to use more detailed temporal relations for edge creation in the graph, potentially increasing embedding quality.

Your input file should look like this:

Screenshot

Input file creation with Pandas:

>>> df.columns
Index(['client_id', 'entity_id', 'timestamp'], dtype='object')

>>> df.dtypes
client_id     int64
entity_id     int64
timestamp    object
dtype: object

>>> df.to_csv('cleora_input.tsv', sep='\t', header=True, index=False)

Supported datetime formats:

Datetime Format    Example
Y-m                2024-02
Y-m-d              2024-02-01
d/m/Y              01/02/2024
m/d/Y              02/01/2024
b d, Y             Feb 01, 2024
B d, Y             February 01, 2024
Y-m-d H:M          2024-02-01 07:00
Y-m-d H:M:S        2024-02-01 07:00:00
Y-m-dTH:M:S        2024-02-01T07:00:00
Y-m-dTH:M:SZ       2024-02-01T07:00:00Z
Y-m-d H:M:Sz       2024-02-01 07:00:00+00:00

If your datetime format is not in this list, Cleora will still try to parse it, but success is not guaranteed.
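If your export uses a format that is not listed, one option is to convert it yourself before writing the file. The sketch below uses Python's standard `datetime` module to turn a hypothetical day-first source format (`01.02.2024 07:00`) into the listed `Y-m-d H:M:S` form. The helper name `normalize_timestamp` and the source format are assumptions made for illustration.

```python
from datetime import datetime

# Assumed format of the raw export: day.month.year hour:minute
SOURCE_FORMAT = "%d.%m.%Y %H:%M"
# Target format, matching the "Y-m-d H:M:S" entry listed above
TARGET_FORMAT = "%Y-%m-%d %H:%M:%S"

def normalize_timestamp(raw: str) -> str:
    """Parse a timestamp in SOURCE_FORMAT and re-emit it in TARGET_FORMAT."""
    return datetime.strptime(raw, SOURCE_FORMAT).strftime(TARGET_FORMAT)

print(normalize_timestamp("01.02.2024 07:00"))  # 2024-02-01 07:00:00
```

With a DataFrame, the same helper could be applied to the timestamp column, for example `df['timestamp'] = df['timestamp'].map(normalize_timestamp)`, before calling `to_csv`.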