Relational Deep Learning
https://relbench.stanford.edu
Abstract
Much of the world’s most valuable data is stored in relational databases, where data
is organized into tables connected by primary-foreign key relationships. Building
machine learning models on this data is challenging because existing algorithms
cannot directly learn from multiple connected tables. Current methods require
manual feature engineering, which involves joining and aggregating tables into a
single format, a labor-intensive and error-prone process. We introduce Relational
Deep Learning (RDL), an end-to-end learning framework that eliminates the need
for manual feature engineering by representing relational databases as temporal,
heterogeneous graphs. In this representation, rows become nodes, and primary-
foreign key links define edges. Graph Neural Networks (GNNs) are then used to
learn representations from all available data. We benchmark RDL on RelBench,
evaluating 30 predictive tasks across seven relational databases, and demonstrate
superior performance compared to traditional methods. In a user study, RDL
significantly outperforms an experienced data scientist’s manual feature engineering
approach, reducing human effort by more than 90%. These results highlight the
potential of deep learning for predictive tasks in relational databases.
1 Introduction
The information age is fueled by data stored in relational databases, which are foundational to nearly
all modern technology stacks. These databases store data across multiple tables, linked by primary
and foreign keys, and are managed using query languages like SQL (Codd, 1970; Chamberlin and
Boyce, 1974). As a result, they are at the core of systems in sectors like e-commerce, social media,
healthcare, and scientific repositories (Johnson et al., 2016; PubMed, 1996).
Many predictive tasks over relational databases hold significant implications for decision making.
For example, a hospital might predict patient discharge risk, or a company may forecast product sales.
Each of these tasks relies on rich relational schemas, and machine learning models are often built
from this data (Kaggle, 2022).
However, existing machine learning paradigms, especially tabular learning, cannot directly operate
on relational data. Instead, a manual feature engineering process is often required, where data from
multiple tables is aggregated into a single table format for learning. For instance, in an e-commerce
schema, a data scientist might extract features like “number of purchases in the last 30 days” to
predict future customer behavior. These features are manually constructed and stored in a single table
for training tabular models.
∗ Equal contribution, order chosen randomly. First authors may swap the ordering for professional purposes.
Figure 1: Relational Deep Learning Pipeline. (a) Relational tables with a predictive task generate a
training table containing supervised labels. (b) Entities are linked by foreign-primary key relations.
(c) The data forms a Relational Entity Graph, with nodes for each entity and edges from key links.
(d) Features are extracted for each entity, and a GNN computes embeddings. A model head produces
predictions, and errors are backpropagated.
This manual approach has significant limitations: it is labor-intensive, suboptimal, and often fails to
capture fine-grained signals. Moreover, temporal features need frequent recomputation, leading to
additional computational cost and potential time-leakage bugs (Kapoor and Narayanan, 2023). These
challenges are similar to those faced in early computer vision, where hand-engineered features were
eventually replaced by end-to-end deep learning systems (He et al., 2016; Russakovsky et al., 2015).
Here we introduce Relational Deep Learning (RDL), a framework for end-to-end deep learning
on relational databases. RDL fully leverages relational data by representing it as a heterogeneous
Relational Entity Graph, where rows become nodes, columns become node features, and primary-
foreign key links form edges. Graph Neural Networks (GNNs) (Gilmer et al., 2017; Hamilton et al.,
2017) are then applied to learn predictive models directly from the graph structure.
RDL consists of four main steps (Fig. 1): (1) A training table containing labels is computed from
historic data, (2) entity-level features are extracted as node features, (3) node embeddings are learned
via a GNN that exchanges information across primary-foreign key links, and (4) a task-specific model
produces predictions, with errors backpropagated for optimization.
RDL naturally handles temporal data by ensuring entities only receive messages from earlier times-
tamps, preventing information leakage. This allows for efficient model updating and avoids common
time-travel bugs.
We evaluate RDL on 30 tasks across 7 databases in the RelBench benchmark. Our method
outperforms baselines, including a strong “data scientist” approach where features are manually
engineered for each task. RDL models match or exceed these models in accuracy while reducing
human labor by 96% and lines of code by 94%. This demonstrates RDL’s promise as the first end-to-
end deep learning solution for relational data, offering significant improvements in both accuracy and
efficiency.
2 Problem Formulation
This section defines predictive tasks on relational tables, focusing on data structure and task specifica-
tion. We lay the groundwork for Section 3, where we present our GNN-based approach.
A relational database (T, L) consists of a set of tables T = {T1, …, Tn} and links L ⊆ T × T, which
connect tables through primary-foreign key relations (Fig. 1a). Each table T is a set of rows, or
entities v, with a primary key, foreign keys, attributes, and an optional timestamp. For instance, in
Fig. 1a, Transactions has a primary key (TransactionID), two foreign keys, attributes, and a
timestamp.
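To make the (T, L) formalism concrete, a minimal sketch in plain Python follows. The table names, columns, and rows are hypothetical illustrations in the spirit of Fig. 1a, not part of any actual RDL API.

```python
# Sketch of a relational database (T, L) as plain Python structures.
# Table names, columns, and rows below are hypothetical illustrations.
from dataclasses import dataclass, field

@dataclass
class Table:
    name: str
    primary_key: str
    foreign_keys: dict          # column name -> referenced table name
    rows: list = field(default_factory=list)  # each row is a dict

customers = Table("Customers", "CustomerID", {}, rows=[
    {"CustomerID": 1, "Name": "Alice"},
])
transactions = Table(
    "Transactions", "TransactionID",
    {"CustomerID": "Customers", "ProductID": "Products"},
    rows=[
        {"TransactionID": 10, "CustomerID": 1, "ProductID": 7,
         "Timestamp": "2024-01-05"},
    ],
)

# Links L ⊆ T × T are induced by the foreign-key declarations.
links = [(transactions.name, ref) for ref in transactions.foreign_keys.values()]
```

Here each link simply records which table references which, mirroring the schema-level view of the database.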
Attributes are tuples of values xv = (x1v , . . . , xdvT ) shared across entities in the same table. For
example, P RODUCTS contains text, image, and numerical attributes, each processed by specific
encoders (Sec. 3.4.3).
Predictive tasks often involve forecasting future events. Ground truth labels for training are generated
from historical data (e.g., summing customer purchases over a future period). To hold these labels,
we introduce a training table Ttrain , where each row has a foreign key, a timestamp, and a label.
The training table links to the main database via foreign keys, and timestamps ensure temporal
consistency, preventing data leakage. This setup supports node-level, link prediction, and both
temporal and static tasks, providing a versatile framework for various machine learning problems on
relational data.
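The label-generation step above can be sketched as follows. This is a simplified illustration with hypothetical data, assuming the label is a customer's total spend in the 90 days after each seed time:

```python
# Sketch: generate a training-table row from historical data. Each row holds
# a foreign key, a timestamp (seed time), and a label computed from the
# future window. Transaction data below is hypothetical.
from datetime import date, timedelta

transactions = [
    {"CustomerID": 1, "Timestamp": date(2024, 1, 10), "Price": 20.0},
    {"CustomerID": 1, "Timestamp": date(2024, 5, 1),  "Price": 15.0},
]

def make_training_row(customer_id, seed_time, window_days=90):
    horizon = seed_time + timedelta(days=window_days)
    # Only events strictly after the seed time and within the horizon count.
    label = sum(t["Price"] for t in transactions
                if t["CustomerID"] == customer_id
                and seed_time < t["Timestamp"] <= horizon)
    return {"CustomerID": customer_id, "SeedTime": seed_time, "Label": label}

row = make_training_row(1, date(2024, 1, 1))
# only the January transaction falls inside the 90-day window
```

Because the label aggregates only events after the seed time, while model inputs are restricted to events at or before it, the temporal split prevents leakage by construction.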
3 Methods
Here, we formulate a generic graph neural network architecture for solving predictive tasks on
relational databases. The following section first introduces three important graph concepts: (a) the
schema graph (cf. Sec. 3.1), a table-level graph in which one table corresponds to one node; (b) the
relational entity graph (cf. Sec. 3.2), an entity-level graph with a node for each entity in each table,
whose edges are defined via foreign-primary key connections between entities; (c) the time-consistent
computation graph (cf. Sec. 3.3), which acts as an explicit training example for graph neural networks.
We describe generic procedures to map between these graph types, and finally introduce our GNN
blueprint for end-to-end learning on relational databases (cf. Sec. 3.4).
• A timestamp value τ(v), denoting the point in time at which table row v became available, or −∞
in the case of non-temporal rows.
• Embedding vectors h_v ∈ R^{d_φ(v)} for each v ∈ V, i.e., one embedding vector per node in the
graph. Initial embeddings are obtained via multi-modal column encoders (cf. Sec. 3.4.3); final
embeddings are computed via the GNNs outlined in Section 3.4.
The graph contains a node for each row in the database tables. Two nodes are connected if the foreign
key entry in one table row links to the primary key entry of another table row. Node and edge types
are defined by the schema graph. Nodes resulting from temporal tables carry the timestamp from the
respective row, allowing temporal message passing, which is described next.
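The row-to-node and key-to-edge construction just described can be sketched in a few lines. The table names and rows are hypothetical; a real implementation would operate over database exports rather than in-memory dicts:

```python
# Sketch: build relational-entity-graph edges from foreign keys. Each row
# becomes a node identified by a (table, primary-key) pair; an edge connects
# a row to the row its foreign key references. Data below is hypothetical.
transactions = [
    {"TransactionID": 10, "CustomerID": 1, "ProductID": 7},
    {"TransactionID": 11, "CustomerID": 1, "ProductID": 8},
]

def fk_edges(rows, table, pk, fks):
    """Yield ((table, pk_value), (ref_table, fk_value)) edges."""
    for r in rows:
        for col, ref_table in fks.items():
            yield ((table, r[pk]), (ref_table, r[col]))

edges = list(fk_edges(transactions, "Transactions", "TransactionID",
                      {"CustomerID": "Customers", "ProductID": "Products"}))
# e.g. (("Transactions", 10), ("Customers", 1)) is one edge
```

Node and edge types fall out of the schema graph for free: the edge type is simply the (source table, foreign-key column, target table) triple.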
Given a relational entity graph and a training table, we need to be able to query the graph at specific
points in time; these queries then serve as explicit training examples used as input to the model. In
particular, for each training example in Ttrain we create the subgraph of the relational entity graph
induced by its set of foreign keys Kv and its timestamp tv. This subgraph then acts as a local,
time-consistent computation graph for predicting the ground-truth label yv.
The computation graphs obtained via neighbor sampling (Hamilton et al., 2017) allow our proposed
approach to scale to modern large-scale relational data with billions of table rows, while respecting
the temporal constraints (Wang et al., 2021).
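A minimal sketch of the time-consistency constraint during neighbor sampling follows. The data structures (adjacency dict, timestamp dict) are hypothetical simplifications of what a real sampler would use:

```python
# Sketch of time-consistent neighbor sampling: starting from a seed entity
# at seed time t, only neighbors whose timestamp is <= t are eligible, which
# prevents information leaking from the future into the computation graph.
import random

def sample_neighbors(adj, timestamps, seed, seed_time, fanout, rng=random):
    """Return up to `fanout` neighbors of `seed` observed no later than seed_time."""
    eligible = [n for n in adj.get(seed, [])
                if timestamps.get(n, float("-inf")) <= seed_time]
    if len(eligible) <= fanout:
        return eligible
    return rng.sample(eligible, fanout)

adj = {"cust1": ["txn1", "txn2", "txn3"]}
timestamps = {"txn1": 5, "txn2": 12, "txn3": 7}
valid = sample_neighbors(adj, timestamps, "cust1", seed_time=10, fanout=10)
# txn2 (timestamp 12) is excluded from a query at time 10
```

Applying this filter recursively per hop yields a subgraph in which every message travels from the past toward the seed time, never the other way.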
Given a time-consistent computational graph and its future label to predict, we define a generic
multi-stage deep learning architecture as follows:
1. Table-level column encoders that encode table row data into initial node embeddings h_v^{(0)}
(cf. Sec. 3.4.3).
2. A stack of L relational-temporal message passing layers (cf. Sec. 3.4.1).
3. A task-specific model head, mapping final node embeddings to a prediction (cf. Sec. 3.4.2).
The whole architecture, consisting of table-level encoders, message passing layers and task specific
model heads can be trained end-to-end to obtain a model for the given task.
f : R^{d_v} → R^d,  f : h_v^{(L)} ↦ ŷ.   (3.2)
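The three-stage forward pass above can be sketched abstractly. Every function here (`encode`, the layers, `head`) is a hypothetical stand-in, not the paper's exact architecture; the point is only the data flow:

```python
# Sketch of the three-stage forward pass: column encoders -> L message-passing
# layers over key-based edges -> task-specific head. All callables are
# hypothetical stand-ins for learned modules.
def forward(graph, feats, encode, layers, head):
    # 1. Table-level column encoders produce initial embeddings h^(0).
    h = {v: encode(v, feats[v]) for v in graph}
    # 2. L rounds of message passing; each round reads the previous h.
    for layer in layers:
        h = {v: layer(h[v], [h[u] for u in graph[v]]) for v in graph}
    # 3. Task-specific head maps final embeddings h^(L) to predictions.
    return {v: head(h[v]) for v in graph}

# Toy run: identity encoder, one mean-aggregation layer, identity head.
h_out = forward(
    {"a": ["b"], "b": ["a"]},
    {"a": 1.0, "b": 3.0},
    encode=lambda v, x: x,
    layers=[lambda h_self, h_nbrs: (h_self + sum(h_nbrs)) / (1 + len(h_nbrs))],
    head=lambda h: h,
)
```

Because the head sits on top of differentiable encoders and layers, the loss gradient reaches every stage, which is what makes the pipeline end-to-end trainable.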
Table 1: Entity classification results (AUROC, higher is better) on RelBench. Best values are in
bold. See Table 4 in Appendix B for standard deviations.
Columns: Dataset | Task | Split | LightGBM | RDL | Rel. Gain of RDL
Link-level Model Head. Similarly, we can define a link-level model head for training examples
{(K, t, y)_i}_{i=1}^N with K = {k1, k2} containing the primary keys of two different nodes
v1, v2 ∈ V in the relational entity graph. A function then maps the final node embeddings
h_{v1}^{(L)}, h_{v2}^{(L)} to a prediction ŷ.
A task-specific loss L(ŷ, y) provides gradient signals to all trainable parameters. The presented
approach can be generalized to |K| > 2 to specify subgraph-level tasks.
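One simple instantiation of such a link-level head is a dot-product scorer over the pair of final embeddings; a learned MLP over their concatenation is an equally valid choice. This is an illustrative sketch, not the paper's specific head:

```python
# Sketch of a link-level model head: score a (source, destination) pair by
# the dot product of their final GNN embeddings, then rank candidates.
def link_score(h_src, h_dst):
    return sum(a * b for a, b in zip(h_src, h_dst))

def top_k(h_src, candidates, k):
    """Rank candidate destinations (name -> embedding) for one source entity."""
    ranked = sorted(candidates,
                    key=lambda c: link_score(h_src, candidates[c]),
                    reverse=True)
    return ranked[:k]

cands = {"p1": [1.0, 0.0], "p2": [0.0, 1.0], "p3": [0.5, 0.5]}
best = top_k([1.0, 0.2], cands, k=2)
```

A pairwise or cross-entropy ranking loss over such scores then provides the gradient signal mentioned above.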
4 Experiments
We evaluate RDL on RelBench. Tasks are grouped into three task types: entity classification
(Section 4.1), entity regression (Section 4.2), and entity link prediction (Section 4.3). Tasks differ
significantly in the number of train/val/test entities, number of unique entities (the same entity may
appear multiple times at different timestamps), and the proportion of test entities seen during training.
Note this is not data leakage, since entity predictions are timestamp dependent, and can change over
time. Tasks with no overlap are pure inductive tasks, whilst other tasks are (partially) transductive.
Table 2: Entity regression results (MAE, lower is better) on RelBench. Best values are in bold. See
Table 5 in Appendix B for standard deviations.
Columns: Dataset | Task | Split | Global Zero | Global Mean | Global Median | Entity Mean | Entity Median | LightGBM | RDL | Rel. Gain of RDL
Experimental results. Results are given in Table 1, with RDL outperforming or matching the
baselines in all cases. Notably, LightGBM achieves performance similar to RDL on the study-outcome
task from rel-trial. This task has extremely rich features in the target table (28 columns in total),
giving the LightGBM model many potentially useful features even without feature engineering. How
to design RDL models that better extract these in-table features and unify them with cross-table
information, in order to outperform the LightGBM model on this dataset, remains an interesting
research question.
4.3 Recommendation
Finally, we also introduce recommendation tasks on pairs of entities. The task is to predict a list of
the top K target entities for a given source entity at a given seed time. The metric we use is Mean
Average Precision (MAP) @K, where K is set per task (higher is better). We consider the following
baselines:
Global popularity computes the top K most popular target entities (by count) across the entire
training table and predicts these K globally popular target entities for all source entities. Past visit
computes the top K most visited target entities for each source entity within the training table and
predicts those past-visited target entities for each source entity. LightGBM learns a LightGBM (Ke
et al., 2017) classifier over the (concatenated) raw features of the source and target entities to predict
the link; additionally, the global-popularity and past-visit ranks are provided as inputs.
For recommendation, it is also important to ensure a certain density of links in the training data so
that there is sufficient predictive signal. In Appendix A we report statistics on the average number of
destination entities each source entity links to. For most tasks the density is ≥ 1; the exception is
rel-stack, which is sparser but is included to test extremely sparse settings.
Table 3: Recommendation results (MAP, higher is better) on RelBench. Best values are in bold.
See Table 6 in Appendix B for standard deviations.
Columns: Dataset | Task | Split | Global Popularity | Past Visit | LightGBM | RDL (GraphSAGE) | RDL (ID-GNN) | Rel. Gain of RDL
Experimental results. Results are given in Table 3. We find that either the RDL implementation
using GraphSAGE (Hamilton et al., 2017) or the one using ID-GNN (You et al., 2021) as the GNN
component performs best, often by a very significant margin. ID-GNN excels in cases where
predictions are entity-specific (i.e., where the Past Visit baseline outperforms Global Popularity),
whilst the plain GNN excels in the reverse case. This reflects the inductive biases of each model:
GraphSAGE is able to learn structural features, while ID-GNN can additionally take the specific
node identity into account.
5 Expert Data Scientist User Study
To rigorously test RDL, we conducted a human trial where a data scientist manually engineered
features and used methods like LightGBM or XGBoost (Chen and Guestrin, 2016; Ke et al., 2017).
This represents the prior standard for building predictive models on relational databases (Heaton,
2016), providing a key comparison for RDL.
The study follows five workflow steps. EDA: exploring the dataset to understand its characteristics,
including feature columns and missing data. Feature ideation: proposing entity-level features that
may contain predictive signal. Feature engineering: using SQL to compute the features and add them
to the target table. Tabular ML: running LightGBM or XGBoost on the table with engineered features
and recording performance. Post-hoc feature analysis (optional): tools like SHAP and LIME are used
to explain feature contributions.
For example, in the rel-hm dataset, additional features such as time since last purchase are computed
to predict customer churn. A detailed walkthrough of the data scientist’s process is provided in
Appendix C.
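A feature of the kind described above can be computed with a short SQL query; the sketch below uses an in-memory SQLite table with hypothetical data, computing days since each customer's last purchase relative to a cutoff date:

```python
# Sketch of one hand-engineered feature: days since each customer's last
# purchase, relative to a cutoff date. Table and data are hypothetical.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE transactions (CustomerID INT, Timestamp TEXT)")
con.executemany("INSERT INTO transactions VALUES (?, ?)",
                [(1, "2024-03-01"), (1, "2024-03-20"), (2, "2024-01-15")])

cutoff = "2024-04-01"
rows = con.execute(
    """
    SELECT CustomerID,
           CAST(julianday(?) - julianday(MAX(Timestamp)) AS INT)
             AS days_since_last_purchase
    FROM transactions
    WHERE Timestamp <= ?          -- never look past the cutoff (no leakage)
    GROUP BY CustomerID
    ORDER BY CustomerID
    """, (cutoff, cutoff)).fetchall()
```

Note that even this single feature must be recomputed for every new cutoff date, which is exactly the recomputation burden the manual pipeline incurs and RDL avoids.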
Limitations of Manual Feature Engineering. This process is labor-intensive, misses potential
signals, and limits feature complexity. Every new task requires repeating these steps, adding hours of
human labor and SQL code (Zheng and Casari, 2018). RDL models avoid these issues.
Data Scientist. We recruited a data scientist with a Stanford CS MSc (4.0 GPA) and five years of
experience building ML models, who followed the five steps outlined above.
User Study Protocol. The study protocol standardizes the time spent at each step: EDA: Capped at 4
hours to understand the dataset’s schema and relationships. Feature ideation: Limited to 1 hour, with
features proposed manually. Feature engineering: SQL queries are used to generate features, with
no time limit for code writing. The time spent is recorded. Tabular ML: A standardized LightGBM
script is used, with time recorded for preprocessing SQL query results. Post-hoc analysis: Conducted
for sanity checks, taking just a few minutes (not included in total time).
Results. We compare RDL to the data scientist on three metrics: (i) predictive power, (ii) hours of
human work, and (iii) lines of code. Marginal effort was measured, excluding reusable infrastructure.
Figures 2, 3, and 4 show that RDL outperforms the data scientist in 11 of 15 tasks while reducing
Figure 2: RDL vs. Data Scientist. RDL matches or outperforms the data scientist in 11 of 15 tasks.
Left: AUROC for classification, right: MAE (normalized) for regression.
Figure 3: RDL vs. Data Scientist. RDL reduces the human work required to solve a task by 96% on
average. Left: classification, right: regression.
hours worked by 96% and lines of code by 94%. On average, the data scientist spent 12.3 hours per
task, while RDL took just 30 minutes. This demonstrates the potential of RDL to transform predictive
tasks on relational databases, replacing manual feature engineering with end-to-end learnable models,
a key insight from the last 15 years of AI research. RDL outperforms the data scientist on classification
tasks but struggles more on regression; improvements in output heads for regression could enhance
RDL's performance. Moreover, much of RDL's code is reusable, while the data scientist must write
task-specific code for each problem, highlighting RDL's efficiency advantage.
6 Conclusion
This work introduces RelBench, a benchmark to facilitate research on relational deep learning (Fey
et al., 2024). RelBench provides diverse and realistic relational databases and defines practical
predictive tasks that cover both entity-level prediction and entity link prediction. In addition, we
provide the first open-source implementation of relational deep learning and validate its effectiveness
against the common practice of manual feature engineering by an experienced data scientist. We hope
RelBench will catalyze further research on relational deep learning, towards highly accurate
prediction over complex multi-tabular datasets without manual feature engineering.
Acknowledgments
We thank Shirley Wu, Kaidi Cao, Rok Sosic, Yu He, Qian Huang, Bruno Ribeiro and Michi Yasunaga
for discussions and for providing feedback on our manuscript. We also gratefully acknowledge the
support of NSF under Nos. OAC-1835598 (CINES), CCF-1918940 (Expeditions), DMS-2327709
(IHBEM); Stanford Data Applications Initiative, Wu Tsai Neurosciences Institute, Stanford Institute
for Human-Centered AI, Chan Zuckerberg Initiative, Amazon, Genentech, GSK, Hitachi, SAP, and
UCB. The content is solely the responsibility of the authors and does not necessarily represent the
official views of the funding entities.
Figure 4: RDL vs. Data Scientist, comparing lines of code per task. RDL reduces lines of code by
94% on average.