Relational Deep Learning: Graph Representation Learning on Relational Databases

Joshua Robinson1∗, Rishabh Ranjan1∗, Weihua Hu2∗, Kexin Huang1∗,
Jiaqi Han1, Alejandro Dobles1, Matthias Fey2, Jan E. Lenssen2,3,
Yiwen Yuan2, Zecheng Zhang2, Xinwei He2, Jure Leskovec1,2
1 Stanford University, 2 Kumo.AI, 3 Max Planck Institute for Informatics

https://relbench.stanford.edu

Abstract

Much of the world’s most valuable data is stored in relational databases, where data
is organized into tables connected by primary-foreign key relationships. Building
machine learning models on this data is challenging because existing algorithms
cannot directly learn from multiple connected tables. Current methods require
manual feature engineering, which involves joining and aggregating tables into a
single format, a labor-intensive and error-prone process. We introduce Relational
Deep Learning (RDL), an end-to-end learning framework that eliminates the need
for manual feature engineering by representing relational databases as temporal,
heterogeneous graphs. In this representation, rows become nodes, and primary-
foreign key links define edges. Graph Neural Networks (GNNs) are then used to
learn representations from all available data. We benchmark RDL on RelBench,
evaluating 30 predictive tasks across seven relational databases, and demonstrate
superior performance compared to traditional methods. In a user study, RDL
significantly outperforms an experienced data scientist’s manual feature engineering
approach, reducing human effort by more than 90%. These results highlight the
potential of deep learning for predictive tasks in relational databases.

1 Introduction
The information age is fueled by data stored in relational databases, which are foundational to nearly
all modern technology stacks. These databases store data across multiple tables, linked by primary
and foreign keys, and are managed using query languages like SQL (Codd, 1970; Chamberlin and
Boyce, 1974). As a result, they are at the core of systems in sectors like e-commerce, social media,
healthcare, and scientific repositories (Johnson et al., 2016; PubMed, 1996).
Many predictive tasks over relational databases hold significant implications for decision making.
For example, a hospital might predict patient discharge risk, or a company may forecast product sales.
Each of these tasks relies on rich relational schemas, and machine learning models are often built
from this data (Kaggle, 2022).
However, existing machine learning paradigms, especially tabular learning, cannot directly operate
on relational data. Instead, a manual feature engineering process is often required, where data from
multiple tables is aggregated into a single table format for learning. For instance, in an e-commerce
schema, a data scientist might extract features like “number of purchases in the last 30 days” to
predict future customer behavior. These features are manually constructed and stored in a single table
for training tabular models.

∗ Equal contribution, order chosen randomly. First authors may swap the ordering for professional purposes.

3rd Table Representation Learning Workshop at NeurIPS 2024.


[Figure 1 panels: (a) Rel. Tables with Training Table (Products, Transactions, Customers, Training Table); (b) Entities Linked by Foreign Keys; (c) Relational Entity Graph; (d) Graph Neural Network]

Figure 1: Relational Deep Learning Pipeline. (a) Relational tables with a predictive task generate a
training table containing supervised labels. (b) Entities are linked by foreign-primary key relations.
(c) The data forms a Relational Entity Graph, with nodes for each entity and edges from key links.
(d) Features are extracted for each entity, and a GNN computes embeddings. A model head produces
predictions, and errors are backpropagated.

This manual approach has significant limitations: it is labor-intensive, suboptimal, and often fails to
capture fine-grained signals. Moreover, temporal features need frequent recomputation, leading to
additional computational cost and potential time-leakage bugs (Kapoor and Narayanan, 2023). These
challenges are similar to those faced in early computer vision, where hand-engineered features were
eventually replaced by end-to-end deep learning systems (He et al., 2016; Russakovsky et al., 2015).
Here we introduce Relational Deep Learning (RDL), a framework for end-to-end deep learning
on relational databases. RDL fully leverages relational data by representing it as a heterogeneous
Relational Entity Graph, where rows become nodes, columns become node features, and primary-
foreign key links form edges. Graph Neural Networks (GNNs) (Gilmer et al., 2017; Hamilton et al.,
2017) are then applied to learn predictive models directly from the graph structure.
RDL consists of four main steps (Fig. 1): (1) A training table containing labels is computed from
historic data, (2) entity-level features are extracted as node features, (3) node embeddings are learned
via a GNN that exchanges information across primary-foreign key links, and (4) a task-specific model
produces predictions, with errors backpropagated for optimization.
RDL naturally handles temporal data by ensuring entities only receive messages from earlier
timestamps, preventing information leakage. This allows for efficient model updating and avoids common
time-travel bugs.
We evaluate RDL on 30 tasks across 7 databases in the RelBench benchmark. Our method
outperforms baselines, including a strong “data scientist” approach where features are manually
engineered for each task. RDL models match or exceed these models in accuracy while reducing
human labor by 96% and lines of code by 94%. This demonstrates RDL’s promise as the first end-to-
end deep learning solution for relational data, offering significant improvements in both accuracy and
efficiency.

2 Problem Formulation
This section defines predictive tasks on relational tables, focusing on data structure and task
specification. We lay the groundwork for Section 3, where we present our GNN-based approach.

A relational database $(\mathcal{T}, \mathcal{L})$ consists of tables $\mathcal{T} = \{T_1, \dots, T_n\}$ and links $\mathcal{L} \subseteq \mathcal{T} \times \mathcal{T}$, which
connect tables through primary-foreign key relations (Fig. 1a). Each table $T$ is a set of rows, or
entities $v$, with a primary key, foreign keys, attributes, and an optional timestamp. For instance, in
Fig. 1a, TRANSACTIONS has a primary key (TRANSACTIONID), two foreign keys, attributes, and a
timestamp.
Attributes are tuples of values $x_v = (x_v^1, \dots, x_v^{d_T})$; all entities in the same table share the
same attribute columns. For example, PRODUCTS contains text, image, and numerical attributes, each
processed by specific encoders (Sec. 3.4.3).
Predictive tasks often involve forecasting future events. Ground truth labels for training are generated
from historical data (e.g., summing customer purchases over a future period). To hold these labels,
we introduce a training table $T_{\text{train}}$, where each row has a foreign key, a timestamp, and a label.
The training table links to the main database via foreign keys, and timestamps ensure temporal
consistency, preventing data leakage. This setup supports node-level prediction, link prediction, and
both temporal and static tasks, providing a versatile framework for various machine learning problems on
relational data.
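
As a minimal sketch of how such a training table could be materialized, the snippet below builds rows of (foreign key, seed time, label) for a hypothetical "total spend in the next 90 days" task. The toy `transactions` rows, the 90-day horizon, and the helper name `build_training_table` are illustrative assumptions, not part of RelBench.

```python
from datetime import date, timedelta

# Toy Transactions rows: (customer_id, timestamp, price). Hypothetical data.
transactions = [
    ("c1", date(2024, 1, 5), 20.0),
    ("c1", date(2024, 2, 10), 30.0),
    ("c2", date(2024, 1, 20), 15.0),
    ("c1", date(2024, 5, 1), 99.0),
]

def build_training_table(transactions, customer_ids, seed_times, horizon_days=90):
    """Return training rows (customer_id, seed_time, label).

    The label sums the prices of transactions that fall in the window
    (seed_time, seed_time + horizon]; only events strictly after the seed
    time contribute to the label.
    """
    horizon = timedelta(days=horizon_days)
    rows = []
    for seed in seed_times:
        for cid in customer_ids:
            label = sum(
                price
                for c, ts, price in transactions
                if c == cid and seed < ts <= seed + horizon
            )
            rows.append((cid, seed, label))
    return rows

train_table = build_training_table(transactions, ["c1", "c2"], [date(2024, 1, 1)])
```

Inputs to the model for a row with seed time t would then be restricted to database rows with timestamps at or before t, which is what keeps the label out of the features.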

3 Methods
Here, we formulate a generic graph neural network architecture for solving predictive tasks on
relational databases. The following section will first introduce three important graph concepts: (a)
The schema graph (cf. Sec. 3.1), a table-level graph in which each table corresponds to one node. (b) The
relational entity graph (cf. Sec. 3.2), an entity-level graph with a node for each entity in each table
and edges defined via foreign-primary key connections between entities. (c) The time-consistent
computation graph (cf. Sec. 3.3), which acts as an explicit training example for graph neural networks.
We describe generic procedures to map between graph types, and finally introduce our GNN blueprint
for end-to-end learning on relational databases (cf. Sec. 3.4).

3.1 Schema Graph


The first graph in our blueprint is the schema graph, which describes the table-level
structure of data. Given a relational database $(\mathcal{T}, \mathcal{L})$ as defined in Sec. 2, we let
$\mathcal{L}^{-1} = \{(T_{\text{pkey}}, T_{\text{fkey}}) \mid (T_{\text{fkey}}, T_{\text{pkey}}) \in \mathcal{L}\}$ denote its inverse set of links. Then, the schema graph
is the graph $(\mathcal{T}, \mathcal{R})$ that arises from the relational database, with node set $\mathcal{T}$ and edge set $\mathcal{R} = \mathcal{L} \cup \mathcal{L}^{-1}$.
Inverse links ensure that all tables are reachable within the schema graph. The schema graph nodes
serve as type definitions for the heterogeneous relational entity graph, which we define next.
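
The construction above can be sketched in a few lines of Python. The table names follow Fig. 1a; the helper names `schema_graph` and `reachable` are hypothetical and only serve to illustrate why the inverse links matter for reachability.

```python
# Tables and foreign-key -> primary-key links, named after Fig. 1a.
tables = {"Customers", "Products", "Transactions"}
links = {
    ("Transactions", "Customers"),  # Transactions.CustomerID -> Customers
    ("Transactions", "Products"),   # Transactions.ProductID -> Products
}

def schema_graph(tables, links):
    """Return (nodes, edges) with edge set R = L ∪ L^{-1}."""
    inverse = {(pkey, fkey) for (fkey, pkey) in links}
    return set(tables), set(links) | inverse

def reachable(edges, start):
    """Tables reachable from `start` by following directed edges."""
    seen, frontier = {start}, [start]
    while frontier:
        table = frontier.pop()
        for src, dst in edges:
            if src == table and dst not in seen:
                seen.add(dst)
                frontier.append(dst)
    return seen

nodes, edges = schema_graph(tables, links)
```

With only the original links, nothing is reachable from `Customers` because it has no outgoing foreign keys; with the inverse links added, every table is reachable.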

3.2 Relational Entity Graph


To formulate a graph suitable for processing with GNNs, we introduce the relational entity graph,
which has entity-level nodes and serves as the basis of the proposed framework.
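Our relational entity graph is a heterogeneous graph $G = (\mathcal{V}, \mathcal{E}, \phi, \psi)$, with node set $\mathcal{V}$, edge set
$\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$, and type mapping functions $\phi : \mathcal{V} \to \mathcal{T}$ and $\psi : \mathcal{E} \to \mathcal{R}$, where each node $v \in \mathcal{V}$ belongs
to a node type $\phi(v) \in \mathcal{T}$ and each edge $e \in \mathcal{E}$ belongs to an edge type $\psi(e) \in \mathcal{R}$. Specifically, the
sets $\mathcal{T}$ and $\mathcal{R}$ from the schema graph define the node and edge types of our relational entity graph.
Given a schema graph $(\mathcal{T}, \mathcal{R})$ with tables $T = \{v_1, \dots, v_{n_T}\} \in \mathcal{T}$ as defined in Sec. 2, we define the
node set in our relational entity graph as the union of all entries in all tables, $\mathcal{V} = \bigcup_{T \in \mathcal{T}} T$. Its edge
set is then defined as
$$\mathcal{E} = \{(v_1, v_2) \in \mathcal{V} \times \mathcal{V} \mid p_{v_2} \in \mathcal{K}_{v_1} \text{ or } p_{v_1} \in \mathcal{K}_{v_2}\}, \tag{3.1}$$
i.e. the entity-level pairs that arise from the primary-foreign key relationships in the database, where
$p_v$ denotes the primary key of entity $v$ and $\mathcal{K}_v$ its set of foreign keys. We
equip the relational entity graph with the following key information: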
• Type mapping functions $\phi : \mathcal{V} \to \mathcal{T}$ and $\psi : \mathcal{E} \to \mathcal{R}$, mapping nodes and edges
to respective elements of the schema graph, making the graph heterogeneous. We set
$\phi(v) = T$ for all $v \in T$ and $\psi(v_1, v_2) = (\phi(v_1), \phi(v_2)) \in \mathcal{R}$ if $(v_1, v_2) \in \mathcal{E}$.
• Time mapping function $\tau : \mathcal{V} \to \mathcal{D}$, mapping each node to its timestamp, $\tau : v \mapsto t_v$,
which introduces time as a central component and establishes the temporality of the graph. The
value $\tau(v)$ denotes the point in time at which the table row $v$ became available, or $-\infty$ in
the case of non-temporal rows.
• Embedding vectors $h_v \in \mathbb{R}^{d_{\phi(v)}}$ for each node $v \in \mathcal{V}$. Initial embeddings are obtained via
multi-modal column encoders, cf. Sec. 3.4.3. Final embeddings are computed via GNNs outlined in Section 3.4.

The graph contains a node for each row in the database tables. Two nodes are connected if the foreign
key entry in one table row links to the primary key entry of another table row. Node and edge types
are defined by the schema graph. Nodes resulting from temporal tables carry the timestamp from the
respective row, allowing temporal message passing, which is described next.
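
A minimal sketch of this construction, assuming a toy in-memory database: each row becomes a node keyed by (table, primary key), each foreign key produces an edge in both directions, and non-temporal rows receive the timestamp −∞. The data, the dictionary layout, and the function name `build_entity_graph` are illustrative assumptions; the column names follow Fig. 1a.

```python
# Toy database: table name -> list of rows (hypothetical data).
database = {
    "Customers": [{"CustomerID": "c1", "Name": "Ada"}],
    "Products": [{"ProductID": "p1", "Price": 9.5}],
    "Transactions": [
        {"TransactionID": "t1", "CustomerID": "c1", "ProductID": "p1", "Timestamp": 3},
        {"TransactionID": "t2", "CustomerID": "c1", "ProductID": "p1", "Timestamp": 7},
    ],
}
primary_key = {
    "Customers": "CustomerID",
    "Products": "ProductID",
    "Transactions": "TransactionID",
}
# (table, foreign-key column) -> referenced table
foreign_keys = {
    ("Transactions", "CustomerID"): "Customers",
    ("Transactions", "ProductID"): "Products",
}

def build_entity_graph(database, primary_key, foreign_keys):
    """Return (node_type, node_time, edges) for the relational entity graph."""
    node_type, node_time, edges = {}, {}, set()
    for table, rows in database.items():
        for row in rows:
            v = (table, row[primary_key[table]])
            node_type[v] = table                                 # phi(v)
            node_time[v] = row.get("Timestamp", float("-inf"))   # tau(v)
    for (table, column), target in foreign_keys.items():
        for row in database[table]:
            v = (table, row[primary_key[table]])
            w = (target, row[column])
            edges.add((v, w))  # foreign key -> primary key
            edges.add((w, v))  # inverse direction
    return node_type, node_time, edges

node_type, node_time, edges = build_entity_graph(database, primary_key, foreign_keys)
```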

3.3 Time-Consistent Computational Graphs

Given a relational entity graph and a training table, we need to be able to query the graph at specific
points in time; the resulting subgraphs serve as explicit training examples used as input to the model.
In particular, for each training example in the training table $T_{\text{train}}$, we create a subgraph of the
relational entity graph induced by the example’s set of foreign keys $\mathcal{K}_v$ and its timestamp $t_v$. This
subgraph then acts as a local and time-consistent computation graph to predict the ground-truth label $y_v$.
The computational graphs are obtained via neighbor sampling (Hamilton et al., 2017), which allows
our approach to scale to modern large-scale relational data with billions of table rows while
respecting the temporal constraints (Wang et al., 2021).
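
The idea can be sketched as a breadth-first neighbor sampler that discards any neighbor whose timestamp lies after the seed time. This is a simplified illustration, not the sampler used in the paper: the function name, the `fanout` and `num_hops` parameters, and the toy graph are assumptions.

```python
import random

def sample_computation_graph(neighbors, node_time, seed_node, seed_time,
                             num_hops=2, fanout=2, rng=None):
    """Sample a subgraph around `seed_node` containing only nodes whose
    timestamp is <= seed_time (non-temporal nodes use -inf and always pass)."""
    rng = rng or random.Random(0)
    kept_nodes, kept_edges = {seed_node}, set()
    frontier = [seed_node]
    for _ in range(num_hops):
        next_frontier = []
        for v in frontier:
            valid = [w for w in neighbors.get(v, []) if node_time[w] <= seed_time]
            for w in rng.sample(valid, min(fanout, len(valid))):
                kept_edges.add((w, v))  # messages flow from w to v
                if w not in kept_nodes:
                    kept_nodes.add(w)
                    next_frontier.append(w)
        frontier = next_frontier
    return kept_nodes, kept_edges

# Toy graph: customer c1 with transactions at times 3 and 7, both on product p1.
neighbors = {
    "c1": ["t1", "t2"],
    "t1": ["c1", "p1"],
    "t2": ["c1", "p1"],
    "p1": ["t1", "t2"],
}
node_time = {"c1": float("-inf"), "p1": float("-inf"), "t1": 3, "t2": 7}

nodes, edges = sample_computation_graph(neighbors, node_time, "c1", seed_time=5)
```

With seed time 5, the transaction at time 7 never enters the computation graph, so no information from after the seed time can reach the prediction for `c1`.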

3.4 Task-Specific Temporal Graph Neural Networks

Given a time-consistent computational graph and its future label to predict, we define a generic
multi-stage deep learning architecture as follows:
1. Table-level column encoders that encode table row data into initial node embeddings $h_v^{(0)}$
(cf. Sec. 3.4.3).
2. A stack of L relational-temporal message passing layers (cf. Sec. 3.4.1).
3. A task-specific model head, mapping final node embeddings to a prediction (cf. Sec. 3.4.2).

The whole architecture, consisting of table-level encoders, message passing layers and task specific
model heads can be trained end-to-end to obtain a model for the given task.

3.4.1 Relational-Temporal Message Passing


A message passing operator in the given relational framework needs to respect the heterogeneous
nature as well as the temporal properties of the graph. We adopt common heterogeneous message
passing (Gilmer et al., 2017; Fey and Lenssen, 2019; Schlichtkrull et al., 2018; Hu et al., 2020) and
extend it by a temporal filtering mechanism. Given a relational entity graph, initial node embeddings
$\{h_v^{(0)}\}_{v \in \mathcal{V}}$, and an example-specific seed time $t \in \mathbb{R}$, we obtain a set of node embeddings $\{h_v^{(L)}\}_{v \in \mathcal{V}}$
by $L$ consecutive applications of message passing, where information flow between nodes can only
go forward in time, ensured by a temporal neighbor sampler.
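
To make the mechanism concrete, here is a deliberately simplified single layer in plain Python. It is a sketch under several assumptions that the text does not specify: messages are mean-aggregated per edge type and then summed with the node’s own embedding, learned weight matrices are omitted, and all node types share one embedding dimension.

```python
def message_passing_layer(h, node_type, node_time, edges, seed_time):
    """One simplified heterogeneous, time-filtered message passing step.

    h: node -> list[float]; edges: iterable of (source, target) pairs.
    """
    dim = len(next(iter(h.values())))
    # Group incoming messages by (target node, edge type).
    inbox = {}
    for src, dst in edges:
        if node_time[src] > seed_time:
            continue  # temporal filter: drop messages from the future
        edge_type = (node_type[src], node_type[dst])
        inbox.setdefault((dst, edge_type), []).append(h[src])
    # Start from the node's own embedding, then add one mean per edge type.
    new_h = {v: list(vec) for v, vec in h.items()}
    for (dst, _edge_type), msgs in inbox.items():
        for i in range(dim):
            new_h[dst][i] += sum(m[i] for m in msgs) / len(msgs)
    return new_h

node_type = {"c1": "Customers", "t1": "Transactions", "t2": "Transactions"}
node_time = {"c1": float("-inf"), "t1": 3, "t2": 7}
h0 = {"c1": [0.0, 0.0], "t1": [1.0, 2.0], "t2": [10.0, 20.0]}
edges = [("t1", "c1"), ("t2", "c1"), ("c1", "t1"), ("c1", "t2")]

h1_early = message_passing_layer(h0, node_type, node_time, edges, seed_time=5)
h1_late = message_passing_layer(h0, node_type, node_time, edges, seed_time=10)
```

The customer embedding only sees transaction `t2` once the seed time has passed its timestamp, which is the behaviour the temporal filter is meant to guarantee.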

3.4.2 Prediction with Model Heads


The model described so far is task-agnostic and simply propagates information through the relational
entity graph to produce node embeddings. We obtain a task-specific model by combining our graph
with a training table, leading to specific model heads and loss functions. We distinguish between (but
are not limited to) two types of tasks: node-level prediction and link-level prediction.
Node-level Model Head. Consider a batch of $N$ node-level training table examples $\{(\mathcal{K}, t, y)_i\}_{i=1}^{N}$,
where $\mathcal{K} = \{k\}$ contains the primary key of node $v \in \mathcal{V}$ in the relational entity graph, $t \in \mathbb{R}$ is
the seed time, and $y \in \mathbb{R}^d$ is the target value. The node-level model head maps node-level
embeddings $h_v^{(L)}$ to a prediction $\hat{y}$, i.e.

$$f : \mathbb{R}^{d_v} \to \mathbb{R}^d, \quad f : h_v^{(L)} \mapsto \hat{y}. \tag{3.2}$$

Table 1: Entity classification results (AUROC, higher is better) on RelBench. Best values are in
bold. See Table 4 in Appendix B for standard deviations.

Dataset     Task             Split  LightGBM  RDL    Rel. Gain of RDL
rel-amazon  user-churn       Val    52.05     70.45  35.35 %
                             Test   52.22     70.42  34.86 %
            item-churn       Val    62.39     82.39  32.06 %
                             Test   62.54     82.81  32.40 %
rel-avito   user-visits      Val    53.31     69.65  30.66 %
                             Test   53.05     66.20  24.78 %
            user-clicks      Val    55.63     64.73  16.35 %
                             Test   53.60     65.90  22.96 %
rel-event   user-repeat      Val    67.76     71.25  5.15 %
                             Test   68.04     76.89  13.02 %
            user-ignore      Val    87.96     91.70  4.25 %
                             Test   79.93     81.62  2.12 %
rel-f1      driver-dnf       Val    68.42     71.36  4.31 %
                             Test   68.56     72.62  5.93 %
            driver-top3      Val    67.76     77.64  14.57 %
                             Test   73.92     75.54  2.20 %
rel-hm      user-churn       Val    56.05     70.42  25.63 %
                             Test   55.21     69.88  26.59 %
rel-stack   user-engagement  Val    65.12     90.21  38.53 %
                             Test   63.39     90.59  42.91 %
            user-badge       Val    65.39     89.86  37.43 %
                             Test   63.43     88.86  40.08 %
rel-trial   study-outcome    Val    68.30     68.18  −0.19 %
                             Test   70.09     68.60  −2.13 %
Average                      Val    64.18     76.49  20.34 %
                             Test   63.66     75.83  20.48 %

Link-level Model Head. Similarly, we can define a link-level model head for training examples
$\{(\mathcal{K}, t, y)_i\}_{i=1}^{N}$ with $\mathcal{K} = \{k_1, k_2\}$ containing primary keys of two different nodes $v_1, v_2 \in \mathcal{V}$ in the
relational entity graph. A function maps node embeddings $h_{v_1}^{(L)}, h_{v_2}^{(L)}$ to a prediction, i.e.

$$f : \mathbb{R}^{d_{v_1}} \times \mathbb{R}^{d_{v_2}} \to \mathbb{R}^d, \quad f : (h_{v_1}^{(L)}, h_{v_2}^{(L)}) \mapsto \hat{y}. \tag{3.3}$$

A task-specific loss L(ŷ, y) provides gradient signals to all trainable parameters. The presented
approach can be generalized to |K| > 2 to specify subgraph-level tasks.
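
The two head types can be illustrated with very small stand-ins. The choices below are assumptions for illustration only: a linear map with a scalar output for the node-level head, an inner product for the link-level head, and binary cross-entropy as one possible task-specific loss. The text leaves the concrete form of f and of the loss open.

```python
import math

def node_head(h_v, weights, bias):
    """Linear node-level head f: R^{d_v} -> R (scalar output for simplicity)."""
    return sum(w * x for w, x in zip(weights, h_v)) + bias

def link_head(h_v1, h_v2):
    """Inner-product link-level head f: R^d x R^d -> R."""
    return sum(a * b for a, b in zip(h_v1, h_v2))

def bce_loss(logit, y):
    """Binary cross-entropy on a logit, one possible task-specific loss."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

node_logit = node_head([1.0, 2.0], weights=[0.5, -0.5], bias=0.5)
link_score = link_head([1.0, 2.0], [3.0, 4.0])
loss = bce_loss(node_logit, y=1)
```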

3.4.3 Multi-Modal Node Encoders


The final piece of the pipeline is to obtain the initial entity-level node embeddings $h_v^{(0)}$ from the
multi-modal input attributes $x_v = (x_v^1, \dots, x_v^{d_T})$. Due to the nature of tabular data, each column
element $x_v^i$ lies in its own modality space such as image, text, categorical, and numerical values.
Therefore, we use pre-trained modality-specific encoders to embed each attribute,
and fuse the column-level embeddings into a single embedding per row.
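
A toy sketch of the encode-then-fuse pattern follows. The encoders here are simple stand-ins (standardization for numerical values, one-hot for categorical values) rather than the pre-trained encoders the text refers to, and fusion by concatenation is an assumption; the column names `Price` and `Size` are hypothetical.

```python
def encode_numerical(value, mean, std):
    """Standardize a numerical column value into a 1-d embedding."""
    return [(value - mean) / std]

def encode_categorical(value, vocabulary):
    """One-hot encode a categorical value over a fixed vocabulary."""
    return [1.0 if value == item else 0.0 for item in vocabulary]

def encode_row(row, column_encoders):
    """Encode each column with its own encoder, then fuse by concatenation."""
    embedding = []
    for column, encoder in column_encoders.items():
        embedding.extend(encoder(row[column]))
    return embedding

# One encoder per column of a hypothetical Products table.
column_encoders = {
    "Price": lambda v: encode_numerical(v, mean=10.0, std=2.0),
    "Size": lambda v: encode_categorical(v, vocabulary=["S", "M", "L"]),
}
h0 = encode_row({"Price": 14.0, "Size": "M"}, column_encoders)
```

In a trainable pipeline the fused vector would typically pass through a learned projection so that every table yields embeddings of the dimension its node type expects.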

4 Experiments
We evaluate RDL on RelBench. Tasks are grouped into three task types: entity classification
(Section 4.1), entity regression (Section 4.2), and entity link prediction (Section 4.3). Tasks differ
significantly in the number of train/val/test entities, number of unique entities (the same entity may
appear multiple times at different timestamps), and the proportion of test entities seen during training.
Note this is not data leakage, since entity predictions are timestamp dependent, and can change over
time. Tasks with no overlap are pure inductive tasks, whilst other tasks are (partially) transductive.

4.1 Entity Classification


The first task type is entity-level classification. The task is to predict binary labels of a given entity at
a given seed time. We use the ROC-AUC (Hanley and McNeil, 1983) metric for evaluation (higher is
better). We compare to a LightGBM classifier baseline over the raw entity table features. Note that
here only information from the single entity table is used.

Table 2: Entity regression results (MAE, lower is better) on RelBench. Best values are in bold. See
Table 5 in Appendix B for standard deviations.

Dataset     Task             Split  Global  Global  Global  Entity  Entity  LightGBM  RDL     Rel. Gain
                                    Zero    Mean    Median  Mean    Median                    of RDL
rel-amazon  user-ltv         Val    14.141  20.740  14.141  17.685  15.978  14.141    12.132  14.21 %
                             Test   16.783  22.121  16.783  19.055  17.423  16.783    14.313  14.72 %
            item-ltv         Val    72.096  78.110  59.471  80.466  68.922  55.741    45.140  19.02 %
                             Test   77.126  81.852  64.234  78.423  66.436  60.569    50.053  17.36 %
rel-avito   ad-ctr           Val    0.048   0.048   0.040   0.044   0.044   0.037     0.037   2.21 %
                             Test   0.052   0.051   0.043   0.046   0.046   0.041     0.041   −0.18 %
rel-event   user-attendance  Val    0.262   0.457   0.262   0.296   0.268   0.262     0.255   2.65 %
                             Test   0.264   0.470   0.264   0.304   0.269   0.264     0.258   1.97 %
rel-f1      driver-position  Val    11.083  4.334   4.136   7.181   7.114   3.450     3.193   7.44 %
                             Test   11.926  4.513   4.399   8.501   8.519   4.170     4.022   3.56 %
rel-hm      item-sales       Val    0.086   0.142   0.086   0.117   0.086   0.086     0.065   24.50 %
                             Test   0.076   0.134   0.076   0.111   0.078   0.076     0.056   26.90 %
rel-stack   post-votes       Val    0.062   0.146   0.062   0.102   0.064   0.062     0.059   4.19 %
                             Test   0.068   0.149   0.068   0.106   0.069   0.068     0.065   4.11 %
rel-trial   study-adverse    Val    57.083  75.008  56.786  57.083  57.083  45.774    46.290  −1.13 %
                             Test   57.930  73.781  57.533  57.930  57.930  44.011    44.473  −1.05 %
            site-success     Val    0.475   0.462   0.475   0.447   0.450   0.417     0.401   3.87 %
                             Test   0.462   0.468   0.462   0.448   0.441   0.425     0.400   5.86 %
Average                      Val    17.260  19.939  15.051  18.158  16.668  13.330    11.952  8.55 %
                             Test   18.299  20.393  15.985  18.325  16.801  14.045    12.631  8.14 %

Experimental results. Results are given in Table 1, with RDL outperforming or matching baselines
in all cases. Notably, LightGBM achieves similar performance to RDL on the study-outcome
task from rel-trial. This task has extremely rich features in the target table (28 columns total),
giving LightGBM many potentially useful features even without feature engineering. It is an
interesting research question how to design RDL models better able to extract these features and
unify them with cross-table information in order to outperform the LightGBM model on this dataset.

4.2 Entity Regression


Entity-level regression tasks involve predicting numerical labels of an entity at a given seed time.
We use Mean Absolute Error (MAE) as our metric (lower is better). We consider the following
baselines: Entity mean/median calculates the mean/median label value for each entity in training
data and predicts the mean/median value for the entity. Global mean/median calculates the global
mean/median label value over the training data and predicts the same mean/median value across all
entities. Global zero predicts zero for all entities. LightGBM learns a LightGBM (Ke et al., 2017)
regressor over the raw entity features to predict the numerical targets. Note that only information
from the single entity table is used.
Experimental results. Results in Table 2 show our RDL implementation outperforms or matches
baselines in all cases. A number of tasks, such as driver-position and study-adverse,
have matching performance up to statistical significance, suggesting some room for improvement.
We analyze this further in Appendix C and identify one potential cause, which points to an opportunity
for improved performance on regression tasks.
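
The non-learned baselines described above are simple enough to sketch directly. The snippet below is an illustration on made-up data; in particular, falling back to the global statistic for entities that never appear in the training data is an assumption, since the text does not say how unseen entities are handled.

```python
from statistics import mean, median

def fit_baselines(train_rows):
    """train_rows: list of (entity_id, label). Returns name -> predictor."""
    labels = [y for _, y in train_rows]
    per_entity = {}
    for entity, y in train_rows:
        per_entity.setdefault(entity, []).append(y)
    g_mean, g_median = mean(labels), median(labels)
    return {
        "global_zero": lambda e: 0.0,
        "global_mean": lambda e: g_mean,
        "global_median": lambda e: g_median,
        # Assumed fallback: use the global statistic for unseen entities.
        "entity_mean": lambda e: mean(per_entity[e]) if e in per_entity else g_mean,
        "entity_median": lambda e: median(per_entity[e]) if e in per_entity else g_median,
    }

def mae(predict, rows):
    """Mean absolute error of a predictor over (entity_id, label) rows."""
    return sum(abs(predict(e) - y) for e, y in rows) / len(rows)

train = [("a", 1.0), ("a", 3.0), ("b", 10.0)]
test = [("a", 2.0), ("b", 8.0), ("c", 4.0)]
baselines = fit_baselines(train)
scores = {name: mae(predict, test) for name, predict in baselines.items()}
```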

4.3 Recommendation
Finally, we also introduce recommendation tasks on pairs of entities. The task is to predict a list of
top K target entities given a source entity at a given seed time. The metric we use is Mean Average
Precision at K (MAP@K), where K is set per task (higher is better). We consider the following baselines:
Global popularity computes the top K most popular target entities (by count) across the entire
training table and predicts these K globally popular target entities for all source entities. Past visit
computes the top K most visited target entities for each source entity within the training table and
predicts those past-visited target entities for each entity. LightGBM learns a LightGBM (Ke et al.,
2017) classifier over the raw features of the source and target entities (concatenated) to predict the
link. Additionally, global popularity and past visit ranks are also provided as inputs.
For recommendation, it is also important to ensure a certain density of links in the training data in
order for there to be sufficient predictive signal. In Appendix A we report statistics on the average
number of destination entities each source entity links to. For most tasks the density is ≥ 1, with the
exception of rel-stack, which is sparser but is included to test extremely sparse settings.
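
For reference, a common way to compute MAP@K is sketched below. Several normalizations of average precision exist; this version divides by min(K, number of relevant targets), which is an assumption and may differ from the exact variant used in the benchmark. Predicted lists are assumed to contain no duplicates.

```python
def average_precision_at_k(predicted, relevant, k):
    """AP@K for one source entity; `predicted` is a ranked list of targets."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits, score = 0, 0.0
    for rank, item in enumerate(predicted[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank  # precision at this rank
    return score / min(k, len(relevant))

def map_at_k(predictions, ground_truth, k):
    """Mean of AP@K over all source entities in `ground_truth`."""
    return sum(
        average_precision_at_k(predictions.get(src, []), targets, k)
        for src, targets in ground_truth.items()
    ) / len(ground_truth)

predictions = {"u1": ["a", "b", "c"], "u2": ["x", "y"]}
ground_truth = {"u1": ["a", "c"], "u2": ["z"]}
score = map_at_k(predictions, ground_truth, k=3)
```

Here `u1` has hits at ranks 1 and 3, giving AP = (1/1 + 2/3) / 2 = 5/6, while `u2` has no hits, so the mean is 5/12.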

Table 3: Recommendation results (MAP, higher is better) on RelBench. Best values are in bold.
See Table 6 in Appendix B for standard deviations.

Dataset     Task                   Split  Global      Past   LightGBM  RDL          RDL       Rel. Gain
                                          Popularity  Visit            (GraphSAGE)  (ID-GNN)  of RDL
rel-amazon  user-item-purchase     Val    0.31        0.07   0.18      1.53         0.13      397.55 %
                                   Test   0.24        0.06   0.16      0.74         0.10      204.74 %
            user-item-rate         Val    0.16        0.09   0.22      1.42         0.15      550.12 %
                                   Test   0.15        0.07   0.17      0.87         0.12      395.92 %
            user-item-review       Val    0.18        0.05   0.14      1.03         0.11      476.06 %
                                   Test   0.11        0.04   0.09      0.47         0.09      313.07 %
rel-avito   user-ad-visit          Val    0.01        3.66   0.17      0.09         5.40      47.37 %
                                   Test   0.00        1.95   0.06      0.02         3.66      87.09 %
rel-hm      user-item-purchase     Val    0.36        1.07   0.44      0.92         2.64      145.60 %
                                   Test   0.30        0.89   0.38      0.80         2.81      214.49 %
rel-stack   user-post-comment      Val    0.03        2.05   0.04      0.43         15.17     640.05 %
                                   Test   0.02        1.42   0.04      0.11         12.72     795.15 %
            post-post-related      Val    0.47        0.00   1.62      0.00         7.76      378.26 %
                                   Test   1.46        1.74   2.00      0.07         10.83     440.27 %
rel-trial   condition-sponsor-run  Val    2.63        8.58   4.88      3.12         11.33     32.05 %
                                   Test   2.52        8.42   4.82      2.89         11.36     34.89 %
            site-sponsor-run       Val    4.91        15.90  10.92     14.09        17.43     9.65 %
                                   Test   3.75        17.31  8.40      10.70        19.00     9.74 %
Average                            Val    1.01        3.50   2.07      2.51         6.68      297.41 %
                                   Test   0.95        3.55   1.79      1.85         6.74      277.26 %

Experimental results. Results are given in Table 3. We find that either the RDL implementation
using GraphSAGE (Hamilton et al., 2017), or ID-GNN (You et al., 2021) as the GNN component
performs best, often by a very significant margin. ID-GNN excels in cases where predictions are
entity-specific (i.e., the Past Visit baseline outperforms Global Popularity), whilst the plain GNN (GraphSAGE) excels
in the reverse case. This reflects the inductive biases of each model, with GraphSAGE being able to
learn structural features, and ID-GNN able to take into account the specific node ID.
5 Expert Data Scientist User Study
To rigorously test RDL, we conducted a human trial where a data scientist manually engineered
features and used methods like LightGBM or XGBoost (Chen and Guestrin, 2016; Ke et al., 2017).
This represents the prior standard for building predictive models on relational databases (Heaton,
2016), providing a key comparison for RDL.
The study follows five workflow steps:
1. EDA: Exploring the dataset to understand its characteristics, including feature columns and missing data.
2. Feature ideation: Proposing entity-level features that may contain predictive signals.
3. Feature engineering: Using SQL to compute and add features to the target table.
4. Tabular ML: Running LightGBM or XGBoost on the table with engineered features and recording performance.
5. Post-hoc feature analysis (optional): Using tools like SHAP and LIME to explain feature contributions.
For example, in the rel-hm dataset, additional features such as time since last purchase are computed
to predict customer churn. A detailed walkthrough of the data scientist’s process is provided in
Appendix C.
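
To give a flavor of the feature engineering step, the sketch below computes two hand-crafted features with SQL, run through Python’s built-in `sqlite3` module. The schema, the integer day timestamps, and the feature names are hypothetical and much simpler than the rel-hm database; this is not the data scientist’s actual code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transactions (customer_id TEXT, t_day INTEGER, price REAL);
    INSERT INTO transactions VALUES
        ('c1', 10, 20.0), ('c1', 40, 30.0), ('c1', 95, 5.0), ('c2', 50, 15.0);
    CREATE TABLE train (customer_id TEXT, seed_day INTEGER);
    INSERT INTO train VALUES ('c1', 60), ('c2', 60), ('c3', 60);
""")

# Only transactions with t_day <= seed_day are joined, so the features never
# use information from after the seed time.
rows = conn.execute("""
    SELECT tr.customer_id,
           tr.seed_day - MAX(tx.t_day) AS days_since_last_purchase,
           COALESCE(SUM(tx.t_day > tr.seed_day - 30), 0) AS purchases_last_30_days
    FROM train AS tr
    LEFT JOIN transactions AS tx
      ON tx.customer_id = tr.customer_id AND tx.t_day <= tr.seed_day
    GROUP BY tr.customer_id, tr.seed_day
    ORDER BY tr.customer_id
""").fetchall()
conn.close()
```

Each new feature needs another hand-written aggregate like these, and the temporal condition in the join has to be repeated correctly every time, which is where time-leakage bugs tend to creep in.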
Limitations of Manual Feature Engineering. This process is labor-intensive, misses potential
signals, and limits feature complexity. Every new task requires repeating these steps, adding hours of
human labor and SQL code (Zheng and Casari, 2018). RDL models avoid these issues.
Data Scientist. We recruited a data scientist with a Stanford CS MSc, 4.0 GPA, and five years of
experience building ML models, who followed the five steps outlined above.
User Study Protocol. The study protocol standardizes the time spent at each step:
1. EDA: Capped at 4 hours to understand the dataset’s schema and relationships.
2. Feature ideation: Limited to 1 hour, with features proposed manually.
3. Feature engineering: SQL queries are used to generate features, with no time limit for code writing. The time spent is recorded.
4. Tabular ML: A standardized LightGBM script is used, with time recorded for preprocessing SQL query results.
5. Post-hoc analysis: Conducted for sanity checks, taking just a few minutes (not included in total time).
Results. We compare RDL to the data scientist on three metrics: (i) predictive power, (ii) hours of
human work, and (iii) lines of code. Marginal effort was measured, excluding reusable infrastructure.
Figures 2, 3, and 4 show that RDL matches or outperforms the data scientist in 11 of 15 tasks while reducing
hours worked by 96% and lines of code by 94%. On average, the data scientist spent 12.3 hours per
task, while RDL took just 30 minutes. This demonstrates the potential of RDL to transform predictive
tasks on relational databases, replacing manual feature engineering with end-to-end learnable models,
a key insight from the last 15 years of AI research. RDL outperforms the data scientist in classification
tasks but struggles more on regression. Improvements in output heads for regression could enhance
RDL’s performance. Much of RDL’s code is reusable, while the data scientist must write task-specific
code for each problem, highlighting RDL’s efficiency advantage.

[Figure 2 plot: per-task comparison of RDL and Data Scientist. Left axis label: AUROC; right axis label: Normalized MAE.]
Figure 2: RDL vs. Data Scientist. RDL matches or outperforms the data scientist in 11 of 15 tasks.
Left: AUROC for classification, right: MAE (normalized) for regression.

[Figure 3 plot: per-task comparison of RDL and Data Scientist. Axis label (both panels): Hours Human Labor.]
Figure 3: RDL vs. Data Scientist. RDL reduces the human work required to solve a task by 96% on
average. Left: classification, right: regression.

6 Conclusion
This work introduces RelBench, a benchmark to facilitate research on relational deep learning (Fey
et al., 2024). RelBench provides diverse and realistic relational databases and defines practical
predictive tasks that cover both entity-level prediction and entity link prediction. In addition, we
provide the first open-source implementation of relational deep learning and validate its effectiveness
over the common practice of manual feature engineering by an experienced data scientist. We hope
RelBench will catalyze further research on relational deep learning to achieve highly accurate
prediction over complex multi-tabular datasets without manual feature engineering.

Acknowledgments and Disclosure of Funding

We thank Shirley Wu, Kaidi Cao, Rok Sosic, Yu He, Qian Huang, Bruno Ribeiro and Michi Yasunaga
for discussions and for providing feedback on our manuscript. We also gratefully acknowledge the
support of NSF under Nos. OAC-1835598 (CINES), CCF-1918940 (Expeditions), DMS-2327709
(IHBEM); Stanford Data Applications Initiative, Wu Tsai Neurosciences Institute, Stanford Institute
for Human-Centered AI, Chan Zuckerberg Initiative, Amazon, Genentech, GSK, Hitachi, SAP, and
UCB. The content is solely the responsibility of the authors and does not necessarily represent the
official views of the funding entities.

[Figure 4 plot: per-task comparison of RDL and Data Scientist. Axis label (both panels): Lines of Code.]

Figure 4: RDL vs. Data Scientist. RDL reduces new lines of code by 94%. Left: classification,
right: regression.

References
Donald D Chamberlin and Raymond F Boyce. SEQUEL: A structured English query language. In
Proceedings of the 1974 ACM SIGFIDET (now SIGMOD) workshop on Data description, access
and control, pages 249–264, 1974.
Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In ACM SIGKDD
Conference on Knowledge Discovery and Data Mining (KDD), pages 785–794, 2016.
Edgar F Codd. A relational model of data for large shared data banks. Communications of the ACM,
13(6):377–387, 1970.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep
bidirectional transformers for language understanding. In North American Chapter of the Association
for Computational Linguistics (NAACL), 2018.
Matthias Fey and Jan Eric Lenssen. Fast graph representation learning with PyTorch Geometric. ICLR
2019 (RLGM Workshop), 2019.
Matthias Fey, Weihua Hu, Kexin Huang, Jan Eric Lenssen, Rishabh Ranjan, Joshua Robinson, Rex
Ying, Jiaxuan You, and Jure Leskovec. Relational deep learning: Graph representation learning on
relational databases. ICML Position Paper, 2024.
Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural
message passing for quantum chemistry. In International Conference on Machine Learning (ICML),
pages 1263–1272, 2017.
Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In
Advances in Neural Information Processing Systems (NeurIPS), 2017.
James A Hanley and Barbara J McNeil. A method of comparing the areas under receiver operating
characteristic curves derived from the same cases. Radiology, 148(3):839–843, 1983.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image
recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages
770–778, 2016.
Jeff Heaton. An empirical analysis of feature engineering for predictive modeling. In SoutheastCon
2016, pages 1–6. IEEE, 2016.
Ziniu Hu, Yuxiao Dong, Kuansan Wang, and Yizhou Sun. Heterogeneous graph transformer. In
Proceedings of The Web Conference 2020, pages 2704–2710, 2020.
Alistair EW Johnson, Tom J Pollard, Lu Shen, Li-wei H Lehman, Mengling Feng, Mohammad
Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. Mimic-iii, a
freely accessible critical care database. Scientific data, 3(1):1–9, 2016.

Kaggle. Kaggle Data Science & Machine Learning Survey, 2022. Available:
https://www.kaggle.com/code/paultimothymooney/kaggle-survey-2022-all-results/notebook.
Sayash Kapoor and Arvind Narayanan. Leakage and the reproducibility crisis in machine-learning-
based science. Patterns, 4(9), 2023.
Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-
Yan Liu. Lightgbm: A highly efficient gradient boosting decision tree. In Advances in Neural
Information Processing Systems (NeurIPS), volume 30, 2017.
Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. Advances in
neural information processing systems, 30, 2017.
Jianmo Ni, Jiacheng Li, and Julian McAuley. Justifying recommendations using distantly-labeled
reviews and fine-grained aspects. In Proceedings of the 2019 conference on empirical methods
in natural language processing and the 9th international joint conference on natural language
processing (EMNLP-IJCNLP), pages 188–197, 2019.
Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word
representation. In Proceedings of the 2014 conference on empirical methods in natural language
processing (EMNLP), pages 1532–1543, 2014.
PubMed. National Center for Biotechnology Information, U.S. National Library of Medicine, 1996.
Available: https://www.ncbi.nlm.nih.gov/pubmed/.
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang,
Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition
challenge. International journal of computer vision, 115(3):211–252, 2015.
Michael Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max
Welling. Modeling relational data with graph convolutional networks. In Aldo Gangemi, Roberto
Navigli, Maria-Esther Vidal, Pascal Hitzler, Raphaël Troncy, Laura Hollink, Anna Tordai, and
Mehwish Alam, editors, The Semantic Web, pages 593–607, Cham, 2018. Springer International
Publishing.
Yiwei Wang, Yujun Cai, Yuxuan Liang, Henghui Ding, Changhu Wang, and Bryan Hooi. Time-aware
neighbor sampling for temporal graph networks. In arXiv pre-print, 2021.
Jiaxuan You, Jonathan M Gomes-Selman, Rex Ying, and Jure Leskovec. Identity-aware graph neural
networks. In Proceedings of the AAAI conference on artificial intelligence, pages 10737–10745,
2021.
Alice Zheng and Amanda Casari. Feature engineering for machine learning: principles and
techniques for data scientists. O’Reilly Media, Inc., 2018.

A Additional Task Information
For reference, the following list documents all the predictive tasks in RelBench.

1. rel-amazon
Node-level tasks:
(a) user-churn: For each user, predict 1 if the customer does not review any product in
the next 3 months, and 0 otherwise.
(b) user-ltv: For each user, predict the $ value of the total number of products they
buy and review in the next 3 months.
(c) item-churn: For each product, predict 1 if the product does not receive any reviews
in the next 3 months.
(d) item-ltv: For each product, predict the $ value of the total number of purchases and
reviews it receives in the next 3 months.
Link-level tasks:
(a) user-item-purchase: Predict the list of distinct items each customer will pur-
chase in the next 3 months.
(b) user-item-rate: Predict the list of distinct items each customer will purchase and
give a 5 star review in the next 3 months.
(c) user-item-review: Predict the list of distinct items each customer will purchase
and give a detailed review in the next 3 months.
2. rel-avito
Node-level tasks:
(a) user-visits: Predict whether each customer will visit more than one Ad in the
next 4 days.
(b) user-clicks: Predict whether each customer will click on more than one Ad in
the next 4 days.
(c) ad-ctr: Assuming the Ad will be clicked in the next 4 days, predict the Click-
Through-Rate (CTR) for each Ad.
Link-level tasks:
(a) user-ad-visit: Predict the list of ads a user will visit in the next 4 days.
3. rel-f1
Node-level tasks:
(a) driver-position: Predict the average finishing position of each driver across all races
in the next 2 months.
(b) driver-dnf: For each driver, predict if they will DNF (did not finish) a race in
the next 1 month.
(c) driver-top3: For each driver predict if they will qualify in the top-3 for a race in
the next 1 month.
4. rel-hm
Node-level tasks:
(a) user-churn: Predict the churn for a customer (no transactions) in the next week.
(b) item-sales: Predict the total sales for an article (the sum of prices of the associated
transactions) in the next week.
Link-level tasks:
(a) user-item-purchase: Predict the list of articles each customer will purchase in
the next seven days.
5. rel-stack
Node-level tasks:
(a) user-engagement: For each user, predict if they will make any votes, posts, or
comments in the next 3 months.
(b) post-votes: For each user post, predict how many votes it will receive in the next 3
months.
(c) user-badge: For each user, predict if they will receive a new badge in the next
3 months.
Link-level tasks:
(a) user-post-comment: Predict a list of existing posts that a user will comment on in
the next two years.
(b) post-post-related: Predict a list of existing posts that users will link a given
post to in the next two years.
6. rel-trial
Node-level tasks:
(a) study-outcome: Predict if a trial will achieve its primary outcome in the next
1 year.
(b) study-adverse: Predict the number of patients affected by severe adverse
events/death for the trial in the next 1 year.
(c) site-success: Predict the success rate of a trial site in the next 1 year.
Link-level tasks:
(a) condition-sponsor-run: Predict which sponsors each condition will have.
(b) site-sponsor-run: Predict whether this sponsor will have a trial in a facility.
7. rel-event
Node-level tasks:
(a) user-attendance: Predict how many events each user will respond yes or maybe to
in the next seven days.
(b) user-repeat: Predict whether a user will attend an event (by responding yes or
maybe) in the next 7 days if they have already attended an event in the last 14 days.
(c) user-ignore: Predict whether a user will ignore more than 2 event invitations in
the next 7 days.
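
The node-level tasks above all share one pattern: a label is computed from events that fall in a fixed window after a seed time. As a minimal sketch for the rel-amazon user-churn definition, the function below labels a user 1 if they have no review in the window. The function name, the toy data, the 90-day stand-in for "3 months", and the half-open window convention are assumptions for illustration and are not taken from the RelBench code.

```python
from datetime import datetime, timedelta


def churn_labels(user_ids, reviews, seed_time, horizon=timedelta(days=90)):
    """Return {user_id: 1 if the user has no review in (seed_time, seed_time + horizon], else 0}."""
    window_end = seed_time + horizon
    # Users with at least one review inside the prediction window are "active".
    active = {uid for uid, ts in reviews if seed_time < ts <= window_end}
    return {uid: 0 if uid in active else 1 for uid in user_ids}


# Hypothetical toy data: (user_id, review_timestamp) pairs.
reviews = [
    ("u1", datetime(2020, 1, 15)),  # inside the window
    ("u2", datetime(2019, 12, 1)),  # before the seed time, does not count
    ("u3", datetime(2020, 6, 1)),   # after the window closes, does not count
]
labels = churn_labels(["u1", "u2", "u3"], reviews, seed_time=datetime(2020, 1, 1))
print(labels)
```

Only events strictly after the seed time enter the label, which is what keeps the label from overlapping with the information available for prediction.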

B Experiment Details and Additional Results


B.1 Detailed Results
Tables 4, 5 and 6 show mean and standard deviations over 5 runs for the entity classification, entity
regression and link prediction results respectively.

B.2 Hyperparameter Choices


All our RDL experiments were run based on a single set of default task-specific hyperparameters,
i.e., we did not perform exhaustive hyperparameter tuning, cf. Table 7. This verifies the stability
and robustness of RDL solutions, even against expert data scientist baselines. Specifically, all task
types use a shared GNN configuration (a two-layer GNN with a hidden feature size of 128 and
“sum” aggregation) and sample subgraphs identically (disjoint subgraphs of 512 seed entities with a
maximum of 128 neighbors for each foreign key). Across task types, we only vary the learning rate
and the maximum number of epochs to train for.
Notably, we found that our default set of hyperparameters heavily underperformed on the node-level
tasks on the rel-trial dataset. On this dataset, we used a learning rate of 0.0001, a “mean”
neighborhood aggregation scheme, 64 sampled neighbors, and trained for a maximum of 20 epochs.
For the ID-GNN link-prediction experiments on rel-trial, it was necessary to use a four-layer
deep GNN in order to ensure that destination nodes are part of source node-centric subgraphs.
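
The defaults and the rel-trial exception described in this section can be written down as a small configuration lookup. This is only a sketch: the dictionary keys and the `get_config` helper are hypothetical names, and the values are copied from the text and Table 7. The four-layer ID-GNN setting for link prediction on rel-trial is not modeled here.

```python
# Settings shared by all task types (values from the text and Table 7).
SHARED = {
    "batch_size": 512,
    "hidden_size": 128,
    "aggregation": "sum",
    "num_layers": 2,
    "num_neighbors": 128,
    "temporal_sampling": "uniform",
}

# Only the learning rate and the maximum number of epochs vary across task types.
TASK_DEFAULTS = {
    "node_classification": {"learning_rate": 0.005, "max_epochs": 10},
    "node_regression": {"learning_rate": 0.005, "max_epochs": 10},
    "link_prediction": {"learning_rate": 0.001, "max_epochs": 20},
}

# Exception reported for the node-level tasks on rel-trial.
REL_TRIAL_NODE = {
    "learning_rate": 0.0001,
    "aggregation": "mean",
    "num_neighbors": 64,
    "max_epochs": 20,
}


def get_config(dataset, task_type):
    """Merge shared settings, task-type defaults, and the rel-trial override (hypothetical helper)."""
    config = {**SHARED, **TASK_DEFAULTS[task_type]}
    if dataset == "rel-trial" and task_type.startswith("node_"):
        config.update(REL_TRIAL_NODE)
    return config


print(get_config("rel-hm", "link_prediction"))
print(get_config("rel-trial", "node_regression"))
```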

B.3 Ablations
We also report additional results ablating parts of our relational deep learning implementation. All
experiments are designed to be data-centric, aiming to validate basic properties of the chosen datasets

Table 4: Entity classification results (AUROC mean±std over 5 runs, higher is better) on RelBench.
Best values are in bold along with those not statistically different from it.
Dataset     Task             Split  LightGBM    RDL
rel-amazon  user-churn       Val    52.05±0.06  70.45±0.06
rel-amazon  user-churn       Test   52.22±0.06  70.42±0.05
rel-amazon  item-churn       Val    62.39±0.20  82.39±0.02
rel-amazon  item-churn       Test   62.54±0.18  82.81±0.03
rel-avito   user-visits      Val    53.31±0.09  69.65±0.04
rel-avito   user-visits      Test   53.05±0.32  66.20±0.10
rel-avito   user-clicks      Val    55.63±0.31  64.73±0.32
rel-avito   user-clicks      Test   53.60±0.59  65.90±1.95
rel-event   user-repeat      Val    67.76±0.97  71.25±2.53
rel-event   user-repeat      Test   68.04±1.82  76.89±1.59
rel-event   user-ignore      Val    87.96±0.28  91.70±0.33
rel-event   user-ignore      Test   79.93±0.49  81.62±1.11
rel-f1      driver-dnf       Val    68.42±1.14  71.36±1.54
rel-f1      driver-dnf       Test   68.56±3.89  72.62±0.27
rel-f1      driver-top3      Val    67.76±2.75  77.64±3.16
rel-f1      driver-top3      Test   73.92±5.75  75.54±0.63
rel-hm      user-churn       Val    56.05±0.05  70.42±0.09
rel-hm      user-churn       Test   55.21±0.12  69.88±0.21
rel-stack   user-engagement  Val    65.12±0.25  90.21±0.07
rel-stack   user-engagement  Test   63.39±0.26  90.59±0.09
rel-stack   user-badge       Val    65.39±0.05  89.86±0.08
rel-stack   user-badge       Test   63.43±0.12  88.86±0.08
rel-trial   study-outcome    Val    68.30±0.53  68.18±0.49
rel-trial   study-outcome    Test   70.09±1.41  68.60±1.01

Table 5: Entity regression results (MAE mean±std over 5 runs, lower is better) on RelBench. Best
values are in bold along with those not statistically different from it.
Dataset     Task             Split  Global Zero  Global Mean  Global Median  Entity Mean  Entity Median  LightGBM       RDL
rel-amazon  user-ltv         Val    14.141       20.740       14.141         17.685       15.978         14.141±0.000   12.132±0.007
rel-amazon  user-ltv         Test   16.783       22.121       16.783         19.055       17.423         16.783±0.000   14.313±0.013
rel-amazon  item-ltv         Val    72.096       78.110       59.471         80.466       68.922         55.741±0.049   45.140±0.068
rel-amazon  item-ltv         Test   77.126       81.852       64.234         78.423       66.436         60.569±0.047   50.053±0.163
rel-avito   ad-ctr           Val    0.048        0.048        0.040          0.044        0.044          0.037±0.000    0.037±0.000
rel-avito   ad-ctr           Test   0.052        0.051        0.043          0.046        0.046          0.041±0.000    0.041±0.001
rel-event   user-attendance  Val    0.262        0.457        0.262          0.296        0.268          0.262±0.000    0.255±0.007
rel-event   user-attendance  Test   0.264        0.470        0.264          0.304        0.269          0.264±0.000    0.258±0.006
rel-f1      driver-position  Val    11.083       4.334        4.136          7.181        7.114          3.450±0.030    3.193±0.024
rel-f1      driver-position  Test   11.926       4.513        4.399          8.501        8.519          4.170±0.137    4.022±0.119
rel-hm      item-sales       Val    0.086        0.142        0.086          0.117        0.086          0.086±0.000    0.065±0.000
rel-hm      item-sales       Test   0.076        0.134        0.076          0.111        0.078          0.076±0.000    0.056±0.000
rel-stack   post-votes       Val    0.062        0.146        0.062          0.102        0.064          0.062±0.000    0.059±0.000
rel-stack   post-votes       Test   0.068        0.149        0.068          0.106        0.069          0.068±0.000    0.065±0.000
rel-trial   study-adverse    Val    57.083       75.008       56.786         57.083       57.083         45.774±1.191   46.290±0.304
rel-trial   study-adverse    Test   57.930       73.781       57.533         57.930       57.930         44.011±0.998   44.473±0.209
rel-trial   site-success     Val    0.475        0.462        0.475          0.447        0.450          0.417±0.003    0.401±0.009
rel-trial   site-success     Test   0.462        0.468        0.462          0.448        0.441          0.425±0.003    0.400±0.020

Table 6: Link prediction results (MAP mean±std over 5 runs, higher is better) on RelBench. Best
values are in bold along with those not statistically different from it.
Dataset     Task                   Split  Global Popularity  Past Visit  LightGBM    RDL (GraphSAGE)  RDL (ID-GNN)
rel-amazon  user-item-purchase     Val    0.31               0.07        0.18±0.07   1.53±0.05        0.13±0.00
rel-amazon  user-item-purchase     Test   0.24               0.06        0.16±0.05   0.74±0.08        0.10±0.00
rel-amazon  user-item-rate         Val    0.16               0.09        0.22±0.02   1.42±0.06        0.15±0.00
rel-amazon  user-item-rate         Test   0.15               0.07        0.17±0.01   0.87±0.05        0.12±0.00
rel-amazon  user-item-review       Val    0.18               0.05        0.14±0.03   1.03±0.03        0.11±0.00
rel-amazon  user-item-review       Test   0.11               0.04        0.09±0.01   0.47±0.05        0.09±0.00
rel-avito   user-ad-visit          Val    0.01               3.66        0.17±0.01   0.09±0.01        5.40±0.02
rel-avito   user-ad-visit          Test   0.00               1.95        0.06±0.01   0.02±0.00        3.66±0.02
rel-hm      user-item-purchase     Val    0.36               1.07        0.44±0.03   0.92±0.04        2.64±0.00
rel-hm      user-item-purchase     Test   0.30               0.89        0.38±0.02   0.80±0.03        2.81±0.01
rel-stack   user-post-comment      Val    0.03               2.05        0.04±0.02   0.43±0.08        15.17±0.15
rel-stack   user-post-comment      Test   0.02               1.42        0.04±0.03   0.11±0.05        12.72±0.22
rel-stack   post-post-related      Val    0.47               0.00        1.62±0.36   0.00±0.01        7.76±0.20
rel-stack   post-post-related      Test   1.46               1.74        2.00±0.43   0.07±0.08        10.83±0.22
rel-trial   condition-sponsor-run  Val    2.63               8.58        4.88±0.13   3.12±0.24        11.33±0.04
rel-trial   condition-sponsor-run  Test   2.52               8.42        4.82±0.20   2.89±0.39        11.36±0.08
rel-trial   site-sponsor-run       Val    4.91               15.90       10.92±0.67  14.09±0.77       17.43±0.07
rel-trial   site-sponsor-run       Test   3.75               17.31       8.40±0.70   10.70±1.10       19.00±0.12

Table 7: Task-specific RDL default hyperparameters.
                                                 Task type
Hyperparameter              Node classification  Node regression  Link prediction
Learning rate               0.005                0.005            0.001
Maximum epochs              10                   10               20
Batch size                  512                  512              512
Hidden feature size         128                  128              128
Aggregation                 summation            summation        summation
Number of layers            2                    2                2
Number of neighbors         128                  128              128
Temporal sampling strategy  uniform              uniform          uniform

and tasks. Examples include confirming that the graph structure, node features, and temporal
awareness all play important roles in achieving optimal performance, which also underscores the
unique challenges our RelBench dataset and tasks present.
Graph structure. We first investigate the role of the graph structure we adopt for GNNs on
RelBench. Specifically, we compare the following two approaches to constructing the edges: 1.
Primary-foreign key (pkey-fkey), where the entities from two tables that share the same primary key
and foreign key are connected through an edge; 2. Randomly permuted, where, for each edge type,
we apply a random permutation to the destination nodes in the primary-foreign key graph while
keeping the source nodes untouched. From Fig. 5 we observe that randomly permuting the
primary-foreign key edges makes the performance of the GNN much worse, verifying the critical
role of carefully constructing the graph structure through, e.g., primary-foreign key links as proposed
in Fey et al. (2024).
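
The randomly permuted baseline can be sketched as follows, assuming edges are stored per edge type as (source, destination) index pairs. The storage format, the edge type names, and the per-type seeds are illustrative assumptions and do not reflect the actual implementation.

```python
import random


def permute_destinations(edges, seed=0):
    """Shuffle the destination nodes of one edge type while keeping the sources fixed."""
    rng = random.Random(seed)
    sources = [src for src, _ in edges]
    destinations = [dst for _, dst in edges]
    rng.shuffle(destinations)
    return list(zip(sources, destinations))


# Hypothetical pkey-fkey edges, grouped by edge type.
edges_by_type = {
    ("transaction", "customer"): [(0, 100), (1, 100), (2, 101), (3, 102)],
    ("transaction", "article"): [(0, 7), (1, 8), (2, 8), (3, 9)],
}

# Apply an independent permutation to each edge type.
permuted = {
    edge_type: permute_destinations(edges, seed=i)
    for i, (edge_type, edges) in enumerate(edges_by_type.items())
}
print(permuted)
```

The permutation preserves the number of edges and the degree of every destination node, so any drop in performance reflects the loss of the correct pairing rather than a change in graph size.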
Node features and text embeddings. Here we study the effect of node features used in RelBench.
In the experiments depicted in Fig. 6, we compare GNN (w/ node feature) with its variant where
the node features are all masked by zeros (i.e., w/o node feature). We find that utilizing the rich node
features incorporated in our RelBench dataset is crucial for the GNN. Moreover, we also investigate, in
particular, how to encode the text that constitutes part of the node features. In Fig. 7,
we compare GloVe text embedding (Pennington et al., 2014) and BERT text embedding (Devlin
et al., 2018) with w/o text embedding, where the text embeddings are masked by zeros. We observe
that encoding the rich texts in RelBench with GloVe or BERT embedding consistently yields better
performance compared with using no text features. We also find that BERT embedding is usually
better than GloVe embedding, especially for node classification tasks, which suggests that enhancing
the quality of text embedding will potentially help achieve better performance.
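
Both ablations in this paragraph amount to replacing part of the feature matrix with zeros. A minimal sketch, assuming node features are stored as lists of floats and that the text embedding occupies known column positions (the helper name and the column layout are hypothetical):

```python
def mask_columns(rows, columns=None):
    """Return a copy of `rows` with the given column indices set to 0.0 (all columns if None)."""
    masked = []
    for row in rows:
        target = range(len(row)) if columns is None else columns
        new_row = list(row)
        for c in target:
            new_row[c] = 0.0
        masked.append(new_row)
    return masked


# Hypothetical features: two numeric columns followed by a 2-d text embedding.
features = [[0.5, 1.2, -0.3, 0.8], [1.0, 0.0, 0.7, -0.1]]

no_features = mask_columns(features)             # "w/o node feature"
no_text = mask_columns(features, columns=[2, 3]) # "w/o text embedding"
print(no_features, no_text)
```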
Temporal awareness. We also investigate the importance of injecting temporal awareness into the
GNN by ablating on the time embedding. To be specific, in the implementation we add a relative

Figure 5: Investigation on the role of leveraging primary-foreign key (pkey-fkey) edges for the GNN.
At the top row are three node classification tasks with metric AUROC (higher is better) while at the
bottom are three node regression tasks with metric MAE (lower is better), evaluated on the test set.
We find that our proposal of using pkey-fkey edges for message passing is vital for GNN to achieve
desirable performance on RelBench. Error bars correspond to 95% confidence interval.

Figure 6: Investigation on the role of node features. At the top row are three node classification
tasks with metric AUROC (higher is better) while at the bottom are three node regression tasks with
metric MAE (lower is better), evaluated on the test set. We observe that leveraging node features is
important for GNN. Error bars correspond to 95% confidence interval.

time embedding when deriving the node features using the relative time span between the timestamp
of the entity and the querying seed time. Results are exhibited in Fig. 8. We discover that adding the
time embedding significantly enhances the performance across a diverse range of tasks, demonstrating
the efficacy and importance of building temporal awareness into the model.
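
As a rough illustration of a relative time embedding, the sketch below computes the time span between an entity timestamp and the seed time and maps it to a fixed-size vector. The text does not specify the encoding, so the sinusoidal form, the unit of days, and the dimension used here are assumptions chosen for illustration only.

```python
import math
from datetime import datetime


def relative_time_days(entity_time, seed_time):
    """Time span (in days) between an entity's timestamp and the seed time."""
    return (seed_time - entity_time).total_seconds() / 86400.0


def sinusoidal_encoding(value, dim=8, max_period=10000.0):
    """Encode a scalar as sin/cos features at geometrically spaced frequencies (assumed form)."""
    half = dim // 2
    freqs = [max_period ** (-i / half) for i in range(half)]
    return [math.sin(value * f) for f in freqs] + [math.cos(value * f) for f in freqs]


span = relative_time_days(datetime(2020, 1, 1), datetime(2020, 1, 11))
time_embedding = sinusoidal_encoding(span)
print(span, len(time_embedding))
```

Using the span relative to the seed time, rather than the absolute timestamp, means the same entity receives a different encoding for each prediction time.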

Figure 7: Investigation on the role of text embedding. At the top row are three node classification
tasks with metric AUROC (higher is better) while at the bottom are three node regression tasks with
metric MAE (lower is better), evaluated on the test set. We observe that adding text embedding
using GloVe (Pennington et al., 2014) or BERT (Devlin et al., 2018) generally helps improve the
performance. Error bars correspond to 95% confidence interval.

Figure 8: Investigation on the role of time embedding. At the top row are three node classification
tasks with metric AUROC (higher is better) while at the bottom are three node regression tasks with
metric MAE (lower is better), evaluated on the test set. We find that adding time embedding to the
GNN consistently boosts the performance. Error bars correspond to 95% confidence interval.

C User Study Additional Details

C.1 Data Scientist Example Workflow

In this section we provide a detailed description of the data scientist workflow for the user-churn
task of the rel-hm dataset. The purpose of this is to exemplify the efforts undertaken by the
data scientist to solve RelBench tasks. For data scientist solutions to all tasks, see
https://github.com/snap-stanford/relbench-user-study.
Recall that the main data science workflow steps are:

1. Exploratory data analysis (EDA).

2. Feature ideation.
3. Feature engineering.
4. Tabular ML.
5. Post-hoc analysis of feature importance (optional).

C.1.1 Exploratory Data Analysis


During the exploratory data analysis (EDA) the data scientist familiarizes themselves with a new
dataset. It is typically carried out in a Jupyter notebook, where the data scientist first loads the dataset
or establishes a connection to it and then systematically explores it. The data scientist may:
• Visualize the database schema, looking at the fields of different tables and the relationships
between them.
• Closely analyze the label sets:
– Look at the relative sizes and temporal split of the training, validation and test subsets.
– Look at label statistics such as the mean, the standard deviation and various quantiles.
– For classification tasks, understand class (im)balance: how much bigger is the modal
class than the rest? For example, in the user-churn task roughly 82% of the samples
have label 1, so there is a good amount of imbalance but not enough to strictly require
up-sampling techniques.
– For regression tasks, understand the label distribution: are the labels concentrated
around a typical value or do they follow a power law wherein the labels span several
orders of magnitude? In extreme cases, this exploration will point to a need for
specialized handling of the label space for model training.
• Plot distributions and aggregations of interesting columns/fields. For example, in Figure 9
we can see three such plots. From left to right:
– The first plot shows the distribution of age among customers. We see two distinct peaks,
one in the mid-twenties and another in the mid-fifties, suggesting different customer
“archetypes”, which may have different spending patterns.
– The second plot shows the number of sales per month over a two year period. We can
see some seasonality with summer months being particularly good for overall sales.
This suggests date related features could be useful.
– The third plot shows a Lorenz curve of sales per article, showcasing the canonical
Pareto Principle: 20% of the articles account for 80% of the sales.
• Run custom queries to look at interesting quantities and/or relationships between different
columns. For instance, in the EDA for rel-hm, an interesting quantity to look at is
the variability in item prices across the year. This reveals that most of the variability is
downward, representing temporary discounts.
• Investigate outliers or odd-looking patterns in the data. These usually will have some
real-world explanation that may inform how the data scientist chooses to pre-process the
data and construct features.
In all, this process takes on the order of a few hours (3-4 for most datasets in the user study).
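
Two of the checks listed above, class balance and the Lorenz-curve reading of sales concentration, reduce to a few lines of code. The sketch below uses toy numbers chosen to mirror the figures quoted in the text (82% modal class, 20% of articles giving 80% of sales); the function names and data are hypothetical.

```python
from collections import Counter


def modal_class_fraction(labels):
    """Fraction of samples that belong to the most frequent class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


def top_share(sales_per_item, top_fraction=0.2):
    """Fraction of total sales contributed by the best-selling `top_fraction` of items."""
    ordered = sorted(sales_per_item, reverse=True)
    k = max(1, round(top_fraction * len(ordered)))
    return sum(ordered[:k]) / sum(ordered)


# Toy label set with 82 positives out of 100 samples.
labels = [1] * 82 + [0] * 18
# Toy sales counts for 10 articles; the top 2 account for 80 of 100 sales.
sales = [50, 30, 5, 5, 4, 3, 1, 1, 1, 0]

print(modal_class_fraction(labels))
print(top_share(sales))
```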

C.1.2 Feature Ideation


Having explored the dataset in the EDA, the data scientist will then brainstorm features that, to their
judgement, will provide valuable signal to a model for a specific learning task. In the case of the
user-churn task, a rather simple feature would be the customer’s age, which is a field directly
available in one of the tables. A slightly more complex feature would be the total amount spent by the
customer so far. Finally, an example of a fairly complex feature is the average monthly sales volume
of items purchased by the customer in the past week. A high value for this feature may indicate that
the customer has been shopping trendy items lately, whereas a low value for this feature may indicate
that the customer has been interested in more arcane or specific items.
In practice, the ideation phase consists of writing down all of these feature ideas in a file or a piece of
paper. It is the quickest part of the whole process and in this user study took between 30 minutes and
one hour.

[Figure 9 panels, left to right: Age Distribution (x: age, y: percent); Number of Sales by Month (x: month, y: num_sales); Lorenz Curve (articles) (x: Fraction of articles, y: Fraction of sales).]

Figure 9: EDA Plots. Each plot explores different characteristics of the dataset. Understanding
the data and identifying relationships between different quantities is an essential prerequisite to
meaningful feature engineering.

C.1.3 Feature Engineering

With a list of features in hand, the data scientist then proceeds to actually write code to generate all
the features for each sample in the train, validation and test subsets. In this user study, this was
carried out using DuckDB SQL² with some Jinja templating³ for convenience.
Revisiting the example features from the previous section, the conceptual complexity of the features
closely tracks the technical complexity of implementing them. For customer age, all that is
required is a simple join. The total amount spent by the customer can be calculated using a group
by clause and a couple of joins. Lastly, calculating the average monthly sales volume of items
purchased by the customer in the past week requires multiple group bys, joins, and window functions
distributed across multiple common table expressions (CTEs).
A key consideration during feature engineering is the prevention of leakage. The data scientist must
ensure that none of the features accidentally include information from after the sample timestamp.
This is especially true for complex features like the third example above, where special care must be
taken to ensure that each join has the appropriate filters to comply with the sample timestamp.
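
To make the leakage filter concrete, here is a minimal sketch of the "total amount spent so far" feature. It uses Python's built-in sqlite3 module as a stand-in for DuckDB so that it runs without extra dependencies. The table and column names are hypothetical, and the strict `<` comparison against the sample timestamp is an assumed convention.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transactions (customer_id TEXT, price REAL, t_dat TEXT);
CREATE TABLE samples (customer_id TEXT, sample_time TEXT);
INSERT INTO transactions VALUES
  ('c1', 10.0, '2020-01-05'),
  ('c1', 20.0, '2020-02-10'),  -- after the sample timestamp, must be excluded
  ('c2',  5.0, '2020-01-20');
INSERT INTO samples VALUES
  ('c1', '2020-02-01'),
  ('c2', '2020-02-01'),
  ('c3', '2020-02-01');        -- customer with no transactions
""")

# The time filter sits inside the join condition, so only transactions that
# happened before each sample's timestamp contribute to the feature.
query = """
SELECT s.customer_id, s.sample_time, COALESCE(SUM(t.price), 0.0) AS total_spent
FROM samples AS s
LEFT JOIN transactions AS t
  ON t.customer_id = s.customer_id
 AND t.t_dat < s.sample_time
GROUP BY s.customer_id, s.sample_time
ORDER BY s.customer_id
"""
rows = conn.execute(query).fetchall()
print(rows)
```

Dropping the `t.t_dat < s.sample_time` condition would silently add the later 20.0 purchase to c1's feature, which is exactly the kind of leak described above.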
For some tasks, e.g., study-outcome, the initial features did leak information from the validation
set into the training set. Thanks to the RelBench testing setup, leaking test data into the training
data is hard to do by accident, since test data is hidden. Leaking information from validation to train
(but not test to train) led to extremely high validation performance and very low test performance
(test was significantly lower than LightGBM with no feature engineering). The large discrepancy
between validation and test performances alerted the data scientist to the mistake, and the features
were eventually fixed. This example illustrates another complexity that feature engineering introduces,
with special care needed to ensure leakage does not happen.
Other considerations that the data scientist must keep in mind during development and implementation
of the features are parsing issues, runtime constraints and memory load. For example, during the
user study we identified a parsing issue arising from special characters in user posts/comments in the
rel-stack dataset. The backslash character, widely used in LaTeX, can trip up certain text parsers if
not handled with care. Furthermore, runtime and memory constraints are important to keep in mind
when working with larger datasets and computing features that require nested joins and aggregations.
During the user study, there were some cases where we had to refactor SQL queries to make them
more efficient, increasing the overall implementation time. For some tasks we had to implement
sub-sampling of the training set to reduce the burden on compute resources.
Finally, once the features have been generated for each data subset, the data scientist will usually
inspect the generated features looking for anomalies (e.g. an unusual prevalence of NULL values). In
this user study we also implemented some automated sanity checks to validate the generated features
beyond manual inspection.
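
One simple automated sanity check of the kind mentioned above is to flag feature columns with an unusually high share of missing values. The sketch below is illustrative only: the helper names, the toy rows, and the 50% threshold are assumptions, not the checks used in the user study.

```python
def null_fractions(rows):
    """Fraction of None values per feature column, given a non-empty list of dict rows."""
    columns = rows[0].keys()
    return {c: sum(r[c] is None for r in rows) / len(rows) for c in columns}


def flag_sparse_features(rows, threshold=0.5):
    """Names of columns whose share of None values exceeds `threshold`."""
    return sorted(c for c, frac in null_fractions(rows).items() if frac > threshold)


# Hypothetical generated features for four samples.
feature_rows = [
    {"age": 25, "total_spent": 10.0, "avg_item_volume": None},
    {"age": 54, "total_spent": None, "avg_item_volume": None},
    {"age": 31, "total_spent": 42.5, "avg_item_volume": None},
    {"age": 47, "total_spent": 7.0, "avg_item_volume": 3.0},
]
print(null_fractions(feature_rows))
print(flag_sparse_features(feature_rows))
```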

² See https://duckdb.org/.
³ See https://jinja.palletsprojects.com/en/3.1.x/intro/.

Figure 10: Feature Importances. SHAP values of top 30 features ranked by importance. Note:
week_of_year feature shows little variability because the validation set is temporally concentrated in
a few weeks.

C.1.4 Tabular Machine Learning


The output of the Feature Engineering phase is a DuckDB table with engineered features for each data
subset. There is some non-trivial amount of work required to go from those tables to the numerical
arrays used for training by most Tabular ML models (LightGBM in this case). This is implemented
in a Python script that loads the data, transforms it into arrays and carries out hyperparameter tuning.
In this user study we ran 5 hyperparameter optimization runs, with 10 trials each, reporting the mean
and standard deviation over the 5 runs. For the user-churn task this took one to two hours.
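
The evaluation protocol (5 optimization runs of 10 trials each, then mean and standard deviation over runs) can be sketched as below. The search method is not stated in this section, so plain random search is used as an assumption. `validation_score` is a hypothetical stand-in for training LightGBM and scoring it, and the parameter ranges are illustrative.

```python
import random
import statistics


def validation_score(params):
    """Hypothetical stand-in for training a model and scoring it on the validation set."""
    return 1.0 - (params["learning_rate"] - 0.05) ** 2 - 0.001 * abs(params["num_leaves"] - 64)


def run_search(seed, num_trials=10):
    """One optimization run: sample `num_trials` configurations and keep the best one."""
    rng = random.Random(seed)
    best = None
    for _ in range(num_trials):
        params = {
            "learning_rate": rng.uniform(0.01, 0.3),
            "num_leaves": rng.randint(16, 256),
        }
        score = validation_score(params)
        if best is None or score > best[0]:
            best = (score, params)
    return best


# Five independent runs, each with its own seed; report mean and std of the best scores.
scores = [run_search(seed)[0] for seed in range(5)]
mean_score = statistics.mean(scores)
std_score = statistics.stdev(scores)
print(f"{mean_score:.4f} ± {std_score:.4f}")
```

Here `statistics.stdev` computes the sample standard deviation; whether the reported numbers use the sample or population form is not stated in the text.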

C.1.5 Post-hoc Analysis


The last step in the process is to look at a trained model and analyze its performance and feature
importance. To this end we used SHAP values (Lundberg and Lee, 2017) and the corresponding
Python package⁴. Figure 10 shows the top 30 most important features in the user-churn task. The
individual violin plots show the distribution of SHAP values for a subset of the validation set; the
color indicates the value of the feature. For the user-churn task, the most predictive features were
primarily (1) all-time statistics of user behavior pattern, and (2) temporal information that allows the
model to be aware of seasonality.

C.2 Regression Output Head Analysis


By default, our RDL implementation uses a simple linear output head on top of the GNN embeddings.
However, we found that on regression tasks this sometimes led to lower than desirable performance.
⁴ See https://shap.readthedocs.io/en/latest/.

Table 8: Entity regression results (MAE, lower is better) on selected RelBench datasets. Training a
LightGBM model on features extracted by a trained GNN leads to performance lift. This is evidence
that the linear layer output head of the base GNN is suboptimal.
Dataset Task Split GNN GNN+LightGBM
rel-f1 driver-position Test 4.173±0.178 4.05±0.09
rel-stack post-votes Test 0.065±0.00 0.062±0.00
rel-amazon item-ltv Test 14.31±0.028 14.10±0.02

Table 9: Entity classification results (AUROC, higher is better, numbers bolded if within standard
deviation of best result) on selected RelBench tasks. Training a LightGBM model on features
extracted by a trained GNN does not lead to performance lift, and can even hurt performance slightly.
This is evidence that output head limitations hold for regression tasks only. Note, study-outcome
uses default GNN parameters for simplicity, differing from the performance reported in the main
paper.
Dataset Task Split GNN GNN+LightGBM
rel-f1 driver-dnf Test 72.3±1.67 71.8±1.30
rel-trial study-outcome Test 68.8±1.10 68.2±0.44
rel-stack user-badge Test 88.3±0.04 88.4±0.04

We found that performance on many regression tasks could be improved by modifying this output
head. Instead of a linear layer, we took the output from the GNN, and fed these embeddings into a
LightGBM model, which is trained in a second separate training phase from the GNN model.
The resulting model still uses an end-to-end learned GNN for cross-table feature engineering, showing
that the GNN is learning useful features. Instead, we attribute the weaker performance to the linear
output head. We believe that further attention to the regression output head is an interesting direction
for further study, with the goal of designing an output head that is performant and can be trained
jointly with the GNN (unlike our LightGBM modification).
We run three experiments to study this phenomenon, and attempt to isolate the output head as a
problematic component for regression tasks.
1. LightGBM trained on GNN-learned entity-level features on regression tasks. We find that
this model performs better than the original GNN, suggesting that the linear output head of
the GNN is suboptimal.
2. LightGBM trained on GNN-learned entity-level features on classification tasks. We find
no performance improvement, and even some degradation, compared to the original GNN
model, suggesting that the observed performance boost of (1) comes not from an overall
better architecture but from the correction of an innate shortcoming of the linear output head
vis-a-vis regression tasks. In other words, using a LightGBM on top of the GNN is only
helpful insofar as it provides a more flexible output head for regression tasks.
3. Evaluate GNN performance after converting regression tasks to binary classification tasks
with label $y = \mathbb{1}\{y_{\mathrm{regression}} > 0\}$. We find that the performance gap between the data
scientist models and the GNN narrows. This suggests that the GNN can learn the relevant
predictive signal, but performance is affected by how the task is formulated (classification
vs. regression).
See Tables 8, 9, 10 for the results of each of these experiments.
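
The conversion used in the third experiment, together with the AUROC metric used to score it, can be sketched in a few lines. The pairwise AUROC below is a simple quadratic-time formulation for illustration; the toy labels and scores are made up, and this is not the evaluation code used for the tables.

```python
def binarize(y_regression):
    """Binary label 1 if the regression target is positive, else 0."""
    return [1 if y > 0 else 0 for y in y_regression]


def auroc(labels, scores):
    """Probability that a random positive is scored above a random negative (ties count 1/2)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy regression targets and model scores.
y_regression = [0.0, 3.5, 0.0, 1.2]
model_scores = [0.1, 0.9, 0.4, 0.3]

labels = binarize(y_regression)
print(labels, auroc(labels, model_scores))
```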
In Figure 2, for regression tasks we report the RDL results using GNN-learned features with a
LightGBM output head. In Table 2 we report results for the basic GNN in order to avoid creating
confusion for other researchers when comparing different GNN methods. We believe that Tables 8, 9,
10 provide clear evidence that there is an opportunity for improvements and simplifications, which
we leave to future work.

D Dataset Origins and Licenses


This section details the sources for all data used in RelBench. In all cases, the data providers
consent for their data to be used freely for non-commercial and research purposes. The only

Table 10: Entity classification results (AUROC, higher is better) on selected RelBench regression
tasks, converted into classification tasks with binary label $y = \mathbb{1}\{y_{\mathrm{regression}} > 0\}$. Training a
LightGBM model on features extracted by a trained GNN leads to performance lift. This is evidence
that the linear layer output head of the base GNN is suboptimal.
Dataset Task Split GNN Data Scientist
rel-f1 driver-position Test 81.96±1.18 86.63±0.40
rel-stack post-votes Test 80.5±0.18 78.3±0.05
rel-amazon item-ltv Test 70.61±0.06 70.29±0.06

database with potentially personally identifiable information is rel-stack, which draws from the
Stack Exchange site, which sometimes has individuals’ names as their username. This information
shared with consent, as all users must agree to the Stack Exchange privacy policy, see: https:
//stackoverflow.com/legal/privacy-policy.
rel-amazon. Data was obtained from the Amazon Review Data Dump from Ni et al. (2019). See
the website: https://cseweb.ucsd.edu/~jmcauley/datasets/amazon_v2/. The data
license is not specified.
rel-avito. Data is obtained from Kaggle:
https://www.kaggle.com/competitions/avito-context-ad-clicks. All RelBench users must download
the data from Kaggle themselves, a part of which is accepting the data usage terms. These terms
include use only for non-commercial and academic purposes. Note that after data download, we
further downsample the avito dataset by randomly selecting approximately 100,000 data points
from the user table and sampling all other tables that have connections to the sampled users.
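The downsampling step can be illustrated as follows. This is a simplified sketch under stated assumptions, not the preprocessing code used for the benchmark: the function `downsample` is hypothetical, tables are plain lists of dictionaries, and only tables that reference the user table directly through a `UserID` column are filtered (tables reached through a further link would need one more filtering pass).

```python
import random


def downsample(users, linked_tables, n_users, user_key="UserID", seed=0):
    """Keep a random subset of users and only the rows that reference them."""
    rng = random.Random(seed)
    kept_users = rng.sample(users, min(n_users, len(users)))
    kept_ids = {u[user_key] for u in kept_users}
    kept_tables = {
        name: [row for row in rows if row[user_key] in kept_ids]
        for name, rows in linked_tables.items()
    }
    return kept_users, kept_tables


# Toy stand-ins for the user table and two tables that reference it.
users = [{"UserID": i} for i in range(10)]
linked = {
    "VisitStream": [{"UserID": i % 10, "AdID": i} for i in range(30)],
    "SearchInfo": [{"UserID": i % 5, "SearchID": i} for i in range(20)],
}

small_users, small_tables = downsample(users, linked, n_users=3)
print(len(small_users), {name: len(rows) for name, rows in small_tables.items()})
```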
rel-stack. Data was obtained from The Internet Archive, whose stated mission is to provide
“universal access to all knowledge.” We downloaded our data from
https://archive.org/download/stackexchange in November 2023. The data license is not specified.
rel-f1. Data was sourced from the Ergast API (https://ergast.com/mrd/) in February
2024. The Ergast Developer API is an experimental web service which provides a historical record of
motor racing data for non-commercial purposes. As far as we are able to determine, the data is public
and the license is not specified.
rel-trial. Data was downloaded from the ClinicalTrials.gov website in January 2024.
This data is provided by the NIH, an official branch of the US Government. The terms of use state
that data are available to all requesters, both within and outside the United States, at no charge. Our
rel-trial database is a snapshot from January 2024, and will not be updated with newer trial
results.
rel-hm. Data is obtained from Kaggle:
https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations.
All RelBench users must download the data from Kaggle themselves, a part of which is accepting
the data usage terms. These terms include use only for non-commercial and academic purposes.
rel-event. The dataset employed in this research was initially released on Kaggle for the Event
Recommendation Engine Challenge, which can be accessed at
https://www.kaggle.com/c/event-recommendation-engine-challenge/data. We have obtained explicit
consent from the creators of this dataset to use it within RelBench. We extend our sincere
gratitude to Allan Carroll for his support and generosity in sharing the data with the academic
community.

E Additional Training Table Statistics

We report additional training table statistics for all tasks, separated into entity classification
(cf. Table 11), entity regression (cf. Table 12), and link prediction (cf. Table 13).
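For readers who want to reproduce statistics of this kind on their own training tables, the sketch below shows one plausible way to compute the per-split quantities reported for classification and regression targets. The function names and toy inputs are assumptions for illustration; the link prediction statistics are left out because their exact definitions are not spelled out here.

```python
from statistics import mean, median


def classification_stats(labels):
    """Counts and percentages of positive and negative binary labels."""
    total = len(labels)
    positives = sum(labels)
    negatives = total - positives
    return {
        "positives": positives,
        "negatives": negatives,
        "pct_positive": round(100 * positives / total, 2),
        "pct_negative": round(100 * negatives / total, 2),
    }


def regression_stats(targets):
    """Minimum, median, mean and maximum of a list of regression targets."""
    return {
        "minimum": min(targets),
        "median": median(targets),
        "mean": mean(targets),
        "maximum": max(targets),
    }


# Toy label and target columns standing in for one split of a training table.
print(classification_stats([1, 0, 0, 1, 1, 0, 0, 0]))
print(regression_stats([0.0, 0.0, 2.0, 10.0]))
```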

Table 11: RelBench entity classification training table target statistics.
Dataset     Task             Split  Positives           Negatives
rel-amazon  user-churn       Train  2,956,658 (62.47%)  1,775,897 (37.53%)
                             Val    263,098 (64.2%)     146,694 (35.8%)
                             Test   213,400 (60.64%)    138,485 (39.36%)
rel-amazon  item-churn       Train  1,113,863 (43.52%)  1,445,401 (56.48%)
                             Val    73,242 (41.22%)     104,447 (58.78%)
                             Test   61,647 (36.95%)     105,195 (63.05%)
rel-f1      driver-dnf       Train  1,365 (11.96%)      10,046 (88.04%)
                             Val    125 (22.08%)        441 (77.92%)
                             Test   207 (29.49%)        495 (70.51%)
rel-f1      driver-top3      Train  231 (17.07%)        1,122 (82.93%)
                             Val    119 (20.24%)        469 (79.76%)
                             Test   128 (17.63%)        598 (82.37%)
rel-hm      user-churn       Train  3,170,367 (81.89%)  701,043 (18.11%)
                             Val    62,225 (81.28%)     14,331 (18.72%)
                             Test   61,609 (82.61%)     12,966 (17.39%)
rel-stack   user-engagement  Train  68,020 (5.0%)       1,292,830 (95.0%)
                             Val    2,411 (2.81%)       83,427 (97.19%)
                             Test   2,411 (2.74%)       85,726 (97.26%)
rel-stack   user-badge       Train  163,048 (4.81%)     3,223,228 (95.19%)
                             Val    7,301 (2.95%)       240,097 (97.05%)
                             Test   6,735 (2.64%)       248,625 (97.36%)
rel-trial   study-outcome    Train  7,647 (63.76%)      4,347 (36.24%)
                             Val    561 (58.44%)        399 (41.56%)
                             Test   483 (58.55%)        342 (41.45%)
rel-event   user-repeat      Train  1,882 (48.98%)      1,960 (51.02%)
                             Val    130 (48.51%)        138 (51.49%)
                             Test   110 (44.72%)        136 (55.28%)
rel-event   user-ignore      Train  3,247 (16.88%)      15,992 (83.12%)
                             Val    441 (10.54%)        3,744 (89.46%)
                             Test   450 (11.40%)        3,499 (88.60%)
rel-avito   user-clicks      Train  2,302 (3.87%)       57,152 (96.13%)
                             Val    745 (3.52%)         20,438 (96.48%)
                             Test   740 (1.54%)         47,256 (98.46%)
rel-avito   user-visits      Train  78,467 (90.59%)     8,152 (9.41%)
                             Val    27,086 (90.35%)     2,893 (9.65%)
                             Test   30,731 (85.06%)     5,398 (14.94%)

Table 12: RelBench entity regression training table target statistics.
Dataset     Task             Split  Minimum  Median  Mean   Maximum
rel-amazon  user-ltv         Train  0.0      0.0     16.93  9,511.46
                             Val    0.0      0.0     14.14  7,259.91
                             Test   0.0      0.0     16.78  10,329.86
                             Total  0.0      0.0     16.71  10,329.86
rel-amazon  item-ltv         Train  0.0      20.78   67.57  198,419.8
                             Val    0.0      22.44   72.10  75,901.55
                             Test   0.0      23.72   77.13  206,663.58
                             Total  0.0      20.97   68.38  206,663.58
rel-f1      driver-position  Train  1.0      13.33   13.90  39.0
                             Val    1.0      11.4    11.08  22.0
                             Test   1.0      12.18   11.93  24.0
                             Total  1.0      13.0    13.57  39.0
rel-hm      item-sales       Train  0.0      0.0     0.076  87.16
                             Val    0.0      0.0     0.086  40.36
                             Test   0.0      0.0     0.076  38.31
                             Total  0.0      0.0     0.076  87.16
rel-stack   post-votes       Train  0.0      0.0     0.093  78.0
                             Val    0.0      0.0     0.062  36.0
                             Test   0.0      0.0     0.068  26.0
                             Total  0.0      0.0     0.090  78.0
rel-trial   study-adverse    Train  0.0      2.0     39.84  28,085.0
                             Val    0.0      2.0     57.08  17,245.0
                             Test   0.0      3.0     57.93  5,978.0
                             Total  0.0      2.0     42.20  28,085.0
rel-trial   site-success     Train  0.0      0.0     0.44   1.0
                             Val    0.0      0.4     0.47   1.0
                             Test   0.0      0.17    0.4    1.06
                             Total  0.0      0.0     0.45   1.0
rel-event   user-attendance  Train  0.0      0.0     0.37   16.0
                             Val    0.0      0.0     0.28   5.0
                             Test   0.0      0.0     0.26   8.0
                             Total  0.0      0.0     0.34   16.0
rel-avito   ad-ctr           Train  0.00052  0.018   0.045  1.0
                             Val    0.00091  0.018   0.048  1.0
                             Test   0.00085  0.019   0.052  1.0
                             Total  0.00052  0.018   0.047  1.0

Table 13: RelBench link prediction training table link statistics.
Dataset     Task                   Split  #Links      Avg #links per    %Repeated links
                                                      entity/timestamp
rel-amazon  user-item-purchase     Train  11,759,844  2.18              -
                                   Val    802,540     2.28              0.18
                                   Test   918,919     2.33              0.15
rel-amazon  user-item-rate         Train  7,146,115   1.8               -
                                   Val    519,496     2.01              0.19
                                   Test   599,867     2.05              0.15
rel-amazon  user-item-review       Train  5,138,184   2.19              -
                                   Val    268,651     2.3               0.18
                                   Test   305,476     2.4               0.15
rel-hm      user-item-purchase     Train  13,191,321  3.38              -
                                   Val    237,152     3.18              3.51
                                   Test   207,996     3.10              3.76
rel-stack   user-post-comment      Train  43,337      2.08              -
                                   Val    1,603       1.94              3.43
                                   Test   1,517       2.0               4.09
rel-stack   post-post-related      Train  7,162       1.2               -
                                   Val    294         1.3               0.0
                                   Test   359         1.39              1.39
rel-trial   condition-sponsor-run  Train  503,176     12.51             -
                                   Val    30,448      14.63             34.48
                                   Test   25,694      12.49             38.37
rel-trial   site-sponsor-run       Train  1,485,360   2.27              -
                                   Val    80,103      2.16              20.91
                                   Test   50,635      1.85              23.29
rel-avito   user-ad-visit          Train  2,738,733   31.53             -
                                   Val    877,441     29.27             6.79
                                   Test   712,985     19.73             4.73

F Dataset Schema

customer
    customer_id    numerical  [1]
    customer_name  text

product
    product_id   numerical  [1]
    brand        text
    title        text
    description  text
    price        numerical
    category     varchar

review
    review_text  text
    summary      text
    review_time  timestamp
    rating       numerical
    verified     categorical
    customer_id  numerical  [*]
    product_id   numerical  [*]

Figure 11: rel-amazon database diagram.
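To make the structure of the diagram concrete, the schema can be written down as SQL table definitions. The sketch below uses Python's built-in `sqlite3` module; the mapping from the diagram's column types to SQLite types (for example, timestamps and categoricals stored as text) and the inserted rows are assumptions made for illustration, not part of the benchmark.

```python
import sqlite3

# rel-amazon tables as in Figure 11; the SQLite column types are assumed.
SCHEMA = """
CREATE TABLE customer (
    customer_id   INTEGER PRIMARY KEY,
    customer_name TEXT
);
CREATE TABLE product (
    product_id  INTEGER PRIMARY KEY,
    brand       TEXT,
    title       TEXT,
    description TEXT,
    price       REAL,
    category    TEXT
);
CREATE TABLE review (
    review_text TEXT,
    summary     TEXT,
    review_time TEXT,  -- timestamp stored as ISO-8601 text
    rating      REAL,
    verified    TEXT,  -- categorical
    customer_id INTEGER REFERENCES customer(customer_id),
    product_id  INTEGER REFERENCES product(product_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
conn.executescript(SCHEMA)

# Made-up rows, only to show the two pkey-fkey links being followed.
conn.execute("INSERT INTO customer VALUES (1, 'example customer')")
conn.execute(
    "INSERT INTO product VALUES (10, 'brand', 'example title', 'desc', 9.99, 'books')"
)
conn.execute(
    "INSERT INTO review VALUES ('text', 'summary', '2015-01-01', 5.0, 'True', 1, 10)"
)
rows = conn.execute(
    "SELECT c.customer_name, p.title, r.rating FROM review r "
    "JOIN customer c ON c.customer_id = r.customer_id "
    "JOIN product p ON p.product_id = r.product_id"
).fetchall()
print(rows)
```

Each review row links one customer to one product, which is the kind of link that RDL turns into edges of the heterogeneous graph.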

users
    Id            numerical  [1]
    AccountId     numerical
    CreationDate  timestamp
    AboutMe       text

posts
    Id                numerical  [1]
    PostTypeId        numerical
    AcceptedAnswerId  numerical  [*]
    ParentId          numerical  [*]
    CreationDate      timestamp
    Body              text
    OwnerUserId       numerical  [*]
    Title             text
    Tags              text

postLinks
    Id             numerical
    RelatedPostId  numerical  [*]
    PostId         numerical  [*]
    LinkTypeId     numerical
    CreationDate   timestamp

votes
    Id            numerical
    PostId        numerical  [*]
    VoteTypeId    numerical
    UserId        numerical  [*]
    CreationDate  timestamp

comments
    Id            numerical
    PostId        numerical  [*]
    Text          text
    CreationDate  timestamp
    UserId        numerical  [*]

postHistory
    Id                 numerical
    PostId             numerical  [*]
    UserId             numerical  [*]
    PostHistoryTypeId  numerical
    ContentLicense     categorical
    Text               text
    CreationDate       timestamp

badges
    Id        numerical
    UserId    numerical  [*]
    Class     categorical
    Date      timestamp
    TagBased  categorical

Figure 12: rel-stack database diagram.

races
    raceId     numerical  [1]
    year       numerical
    round      numerical
    circuitId  numerical  [*]
    name       text
    date       timestamp
    time       timestamp

circuits
    circuitId   numerical  [1]
    circuitRef  text
    name        text
    location    text
    country     categorical
    lat         numerical
    lng         numerical
    alt         numerical

drivers
    driverId     numerical  [1]
    driverRef    text
    code         text
    forename     text
    surname      text
    dob          timestamp
    nationality  categorical

constructors
    constructorId   numerical  [1]
    constructorRef  text
    name            text
    nationality     categorical

results
    resultId       numerical
    raceId         numerical  [*]
    driverId       numerical  [*]
    constructorId  numerical  [*]
    statusId       categorical
    number         numerical
    grid           numerical
    position       numerical
    positionOrder  numerical
    points         numerical
    laps           numerical
    milliseconds   numerical
    fastestLap     numerical
    rank           numerical
    date           timestamp

standings
    driverStandingsId  numerical
    raceId             numerical  [*]
    driverId           numerical  [*]
    points             numerical
    position           numerical
    wins               numerical
    date               timestamp

constructor_results
    constructorResultsId  numerical
    raceId                numerical  [*]
    constructorId         numerical  [*]
    points                numerical
    date                  timestamp

constructor_standings
    constructorStandingsId  numerical
    raceId                  numerical  [*]
    constructorId           numerical  [*]
    points                  numerical
    position                numerical
    wins                    numerical
    date                    timestamp

qualifying
    qualifyId      numerical
    raceId         numerical  [*]
    driverId       numerical  [*]
    constructorId  numerical  [*]
    number         numerical
    position       numerical

Figure 13: rel-f1 database diagram.

studies
    nct_id                        numerical  [1]
    start_date                    timestamp
    target_duration               text
    study_type                    categorical
    acronym                       text
    baseline_population           text
    brief_title                   text
    official_title                text
    phase                         categorical
    enrollment                    numerical
    enrollment_type               categorical
    source                        text
    number_of_arms                numerical
    number_of_groups              numerical
    has_dmc                       categorical
    is_fda_regulated_drug         categorical
    is_fda_regulated_device       categorical
    is_unapproved_device          categorical
    is_ppsd                       text
    is_us_export                  categorical
    biospec_retention             categorical
    biospec_description           text
    source_class                  categorical
    baseline_type_units_analyzed  text
    fdaaa801_violation            categorical
    plan_to_share_ipd             categorical
    detailed_descriptions         text
    brief_summaries               text

outcomes
    id               numerical  [1]
    nct_id           numerical  [*]
    outcome_type     categorical
    title            text
    description      text
    time_frame       text
    population       text
    units            text
    units_analyzed   text
    dispersion_type  text
    param_type       categorical
    date             timestamp

outcome_analyses
    id                           numerical
    nct_id                       numerical  [*]
    outcome_id                   numerical  [*]
    non_inferiority_type         categorical
    non_inferiority_description  text
    param_type                   text
    param_value                  numerical
    dispersion_type              categorical
    dispersion_value             numerical
    p_value_modifier             text
    p_value                      numerical
    ci_n_sides                   categorical
    ci_percent                   numerical
    ci_lower_limit               numerical
    ci_upper_limit               numerical
    ci_upper_limit_na_comment    text
    p_value_description          text
    method                       text
    method_description           text
    estimate_description         text
    groups_description           text
    other_analysis_description   text
    date                         timestamp

drop_withdrawals
    id      numerical
    nct_id  numerical  [*]
    period  text
    reason  text
    count   numerical
    date    timestamp

interventions
    intervention_id  numerical  [1]
    mesh_term        text

interventions_studies
    id               numerical
    nct_id           numerical  [*]
    intervention_id  numerical  [*]
    date             timestamp

designs
    id                              numerical
    nct_id                          numerical  [*]
    allocation                      categorical
    intervention_model              categorical
    observational_model             categorical
    primary_purpose                 text
    time_perspective                categorical
    masking                         categorical
    masking_description             text
    intervention_model_description  text
    subject_masked                  categorical
    caregiver_masked                categorical
    investigator_masked             categorical
    outcomes_assessor_masked        categorical
    date                            timestamp

facilities
    facility_id  numerical  [1]
    name         text
    city         text
    state        text
    zip          text
    country      text

facilities_studies
    id           numerical
    nct_id       numerical  [*]
    facility_id  numerical  [*]
    date         timestamp

sponsors
    sponsor_id    numerical  [1]
    name          text
    agency_class  categorical

sponsors_studies
    id                    numerical
    nct_id                numerical  [*]
    sponsor_id            numerical  [*]
    lead_or_collaborator  categorical
    date                  timestamp

eligibilities
    id                  numerical
    nct_id              numerical  [*]
    sampling_method     categorical
    gender              categorical
    minimum_age         text
    maximum_age         text
    healthy_volunteers  categorical
    population          text
    criteria            text
    gender_description  text
    gender_based        categorical
    adult               categorical
    child               categorical
    older_adult         categorical
    date                timestamp

reported_event_totals
    id                 numerical
    nct_id             numerical  [*]
    event_type         categorical
    classification     categorical
    subjects_affected  numerical
    subjects_at_risk   numerical
    date               timestamp

conditions
    condition_id  numerical  [1]
    mesh_term     text

conditions_studies
    id            numerical
    nct_id        numerical  [*]
    condition_id  numerical  [*]
    date          timestamp

Figure 14: rel-trial database diagram.

article
    article_id                    numerical  [1]
    product_code                  numerical
    prod_name                     text
    product_type_no               numerical
    product_type_name             categorical
    product_group_name            categorical
    graphical_appearance_no       categorical
    graphical_appearance_name     categorical
    colour_group_code             categorical
    colour_group_name             categorical
    perceived_colour_value_id     categorical
    perceived_colour_value_name   categorical
    perceived_colour_master_id    numerical
    perceived_colour_master_name  categorical
    department_no                 numerical
    department_name               categorical
    index_code                    categorical
    index_name                    categorical
    index_group_no                categorical
    index_group_name              categorical
    section_no                    numerical
    section_name                  text
    garment_group_no              categorical
    garment_group_name            categorical
    detail_desc                   text

customer
    customer_id             text  [1]
    FN                      categorical
    Active                  categorical
    club_member_status      categorical
    fashion_news_frequency  categorical
    age                     numerical
    postal_code             categorical

transactions
    t_dat             timestamp
    price             numerical
    sales_channel_id  categorical
    customer_id       numerical  [*]
    article_id        numerical  [*]

Figure 15: rel-hm database diagram.

users
    user_id    numerical  [1]
    locale     text
    birthyear  numerical
    gender     categorical
    joinedAt   timestamp
    location   text
    timezone   numerical

events
    event_id      numerical  [1]
    user_id       numerical  [*]
    start_time    timestamp
    city          text
    state         text
    zip           text
    country       text
    lat           numerical
    lng           numerical
    c_1_to_c_100  numerical
    c_other       numerical

event_attendees
    event       numerical  [*]
    status      categorical
    user_id     numerical  [*]
    start_time  timestamp

event_interest
    user            numerical  [*]
    event           numerical  [*]
    invited         categorical
    timestamp       timestamp
    interested      categorical
    not_interested  categorical

user_friends
    user    numerical  [*]
    friend  numerical  [*]

Figure 16: rel-event database diagram.

AdsInfo
    AdID        numerical  [1]
    LocationID  numerical  [*]
    CategoryID  numerical  [*]
    Price       numerical
    Title       text
    IsContext   categorical

Category
    CategoryID         numerical  [1]
    Level              categorical
    ParentCategoryID   numerical
    SubcategoryID      numerical
    __index_level_0__  numerical

Location
    LocationID  numerical  [1]
    Level       categorical
    RegionID    numerical
    CityID      numerical

UserInfo
    UserID             numerical  [1]
    UserAgentID        numerical
    UserAgentOSID      numerical
    UserDeviceID       numerical
    UserAgentFamilyID  numerical

SearchInfo
    UserID          numerical  [*]
    SearchID        numerical  [1]
    SearchDate      timestamp
    IPID            numerical
    IsUserLoggedOn  categorical
    SearchQuery     text
    LocationID      numerical  [*]
    CategoryID      numerical  [*]

SearchStream
    SearchID    numerical  [*]
    AdID        numerical  [*]
    Position    categorical
    ObjectType  categorical
    HistCTR     numerical
    IsClick     categorical
    SearchDate  timestamp

VisitStream
    UserID    numerical  [*]
    IPID      numerical
    AdID      numerical  [*]
    ViewDate  timestamp

PhoneRequestsStream
    UserID            numerical  [*]
    IPID              numerical
    AdID              numerical  [*]
    PhoneRequestDate  timestamp

Figure 17: rel-avito database diagram.

G Broader Impact
Relational deep learning broadens the applicability of graph machine learning to include relational
databases. Whilst the blueprint is general, and can be applied to a wide variety of tasks, including
potentially hazardous ones, we have taken steps to focus attention on potential positive use cases.
Specifically, the beta version of RelBench considers two databases, Amazon products and Stack
Exchange, that are designed to highlight the usefulness of RDL for driving online commerce and
online social networks. Future releases of RelBench will continue to expand the range of databases
into domains we reasonably expect to be positive, such as biomedical data and sports fixtures. We
hope these concrete steps ensure the adoption of RDL for purposes broadly beneficial to society.
Whilst we strongly believe that RelBench has all the ingredients needed to be a long-term benchmark
for relational deep learning, there are also possibilities for improvement and extension. Two such
possibilities are: (1) RDL at scale: currently our implementation must load the entire database
into working memory during training. For very large datasets this is not viable. Instead, a custom
batch sampler is needed that accesses the database via queries to sample specific entities and their
pkey-fkey neighbors; (2) Fully inductive link prediction: our current link prediction implementation
supports predicting links for test-time pairs (head, tail) where head is potentially new (unseen during
training) and tail is seen in the training data. Extending this formulation to be fully inductive (i.e., tail
unseen during training) is possible, but out of the scope of this work for now.
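As a rough illustration of possibility (1), a query-based sampler could look something like the sketch below. It is only a sketch under stated assumptions and not part of the current implementation: the function `sample_batch`, the two-table toy schema, and the use of SQLite through Python's `sqlite3` module are all chosen for the example, and a real sampler would also need multi-hop and time-aware sampling.

```python
import sqlite3


def sample_batch(conn, batch_size, max_neighbors):
    """Sample seed customers and up to `max_neighbors` linked reviews for each.

    Only the sampled rows are pulled into Python; the tables stay in the database.
    """
    seeds = [
        row[0]
        for row in conn.execute(
            "SELECT customer_id FROM customer ORDER BY RANDOM() LIMIT ?",
            (batch_size,),
        )
    ]
    batch = {}
    for customer_id in seeds:
        # Follow the pkey-fkey link customer.customer_id <- review.customer_id.
        batch[customer_id] = conn.execute(
            "SELECT product_id, rating FROM review WHERE customer_id = ? LIMIT ?",
            (customer_id, max_neighbors),
        ).fetchall()
    return batch


# Toy database: 100 customers with 10 reviews each (made-up data).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY);
    CREATE TABLE review (customer_id INTEGER, product_id INTEGER, rating REAL);
    CREATE INDEX review_by_customer ON review(customer_id);
    """
)
conn.executemany("INSERT INTO customer VALUES (?)", [(i,) for i in range(100)])
conn.executemany(
    "INSERT INTO review VALUES (?, ?, ?)",
    [(i % 100, i, float(i % 5 + 1)) for i in range(1000)],
)

batch = sample_batch(conn, batch_size=8, max_neighbors=4)
print({customer_id: len(reviews) for customer_id, reviews in batch.items()})
```

The index on the foreign-key column is what keeps each neighbor lookup cheap. `ORDER BY RANDOM()` still scans the seed table inside the database, so a production sampler would likely want a cheaper way to draw seeds.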
