Deep Learning Matching Market
Sai Srivatsa Ravindranath^a, Zhe Feng^a, Shira Li^c, Jonathan Ma^b, Scott D. Kominers^c, and David C. Parkes^a

^a John A. Paulson School of Engineering and Applied Sciences, Harvard University
saisr, zhe_feng, parkes@g.harvard.edu
^b Harvard College
jonathan.q.ma@gmail.com
^c Harvard Business School
shirali@g.harvard.edu, kominers@fas.harvard.edu
Abstract
We initiate the study of deep learning for the automated design of two-sided matching
mechanisms. What is of most interest is to use machine learning to understand the possibility
of new tradeoffs between strategy-proofness and stability. These properties cannot be achieved
simultaneously, but the efficient frontier is not understood. We introduce novel differentiable
surrogates for quantifying ordinal strategy-proofness and stability and use them to train differen-
tiable matching mechanisms that map discrete preferences to valid randomized matchings. We
demonstrate that the efficient frontier characterized by these learned mechanisms is substantially
better than that achievable through a convex combination of baselines of deferred acceptance
(stable and strategy-proof for only one side of the market), top trading cycles (strategy-proof
for one side, but not stable), and randomized serial dictatorship (strategy-proof for both sides,
but not stable). This gives a new target for economic theory and opens up new possibilities for
machine learning pipelines in matching market design.
1 Introduction
Two-sided matching markets, classically used for settings such as high-school matching, medical
residents matching, and law clerk matching, and more recently used in online platforms such as
Uber, Lyft, Airbnb, and dating apps, play a significant role in today's world. As a result, there is considerable interest in designing better mechanisms for two-sided matching.
The seminal work of Gale and Shapley [1962] introduces a simple mechanism for stable, one-to-one
matching in two-sided markets—deferred-acceptance (DA)—which has been applied in many settings,
including doctor-hospital matching [Roth and Peranson, 1999], school choice [Abdulkadiroğlu and
Sönmez, 2003, Pathak and Sönmez, 2008, Abdulkadiroğlu et al., 2009], and cadet-matching [Sönmez
and Switzer, 2013, Sönmez, 2013]. The DA mechanism is stable, i.e., no pair of participants prefer
each other to their match (or to being unmatched, if they are unmatched in the outcome). However,
the DA mechanism is not strategy-proof (SP), and a participant can sometimes misreport their
preferences to obtain a better outcome (although it is SP for participants on one side of the market).
Although the DA mechanism is widely used, its failure of SP presents a challenge for two main reasons. First, it can lead to unfairness, where better-informed participants gain an advantage by knowing which misreport strategies can be helpful. Second, strategic behavior can lead to lower-quality and unintended outcomes, and to outcomes that are unstable with respect to true preferences.
In general, it is well-known that there must necessarily be a tradeoff between stability and
strategy-proofness: it is provably impossible for a mechanism to achieve both stability and strategy-
proofness [Dubins and Freedman, 1981, Roth, 1982]. A second example of a matching mechanism is
random serial dictatorship (RSD) [Abdulkadiroglu and Sönmez, 1998], which is typically adopted
for one-sided assignment problems rather than two-sided matching. When adapted to two-sided
matching, RSD is SP but not stable. In fact, a participant may even prefer to remain unmatched rather than accept their assigned match. A third example of a matching mechanism is the
top trading cycles (TTC) mechanism [Shapley and Scarf, 1974], also typically adopted for one-sided
assignment problems rather than problems of two-sided matching. In application to two-sided
matching, TTC is neither SP nor stable (although it is SP for participants on one side of the market).
There have been various research efforts to circumvent this impossibility result. Some relax the
definition of strategyproofness [Mennle and Seuken, 2021] while others characterize the constraints
under which stability and strategyproofness are achieved simultaneously [Kamada and Kojima,
2018, Hatfield et al., 2021, Hatfield and Milgrom, 2005]. The tradeoff between these desiderata
remains poorly understood beyond the existing point solutions of DA, RSD, and TTC. However, we
argue that real-world scenarios demand a more nuanced approach that considers both properties.
The case of the Boston school choice mechanism highlights the negative consequences of a lack of strategy-proofness, with some parents gaining an unfair advantage through manipulation [Abdulkadiroglu et al., 2006].
At the same time, the importance of stability in matching markets is well understood [Roth, 1991].
Recognizing this, and inspired by the success of deep learning in the study of revenue-optimal
auction design [Duetting et al., 2019, Curry et al., 2020, Shen et al., 2019, Rahme et al., 2020],
we initiate the study of deep learning for the design of two-sided matching mechanisms. We ask whether deep learning frameworks can enable a systematic study of this tradeoff. By answering this question affirmatively, we open up the possibility of using machine learning pipelines to create new opportunities for economic theory—seeking theoretical characterizations of mechanisms that
can strike a new balance between strategyproofness and stability.
We use a neural network to represent the rules of a matching mechanism, mapping preference
reports to a distribution over feasible matchings, and show how we can use an unsupervised learning
pipeline to characterize the efficient frontier for the design tradeoff between stability and SP. The
main methodological challenge in applying neural networks to two-sided matching comes from
handling the ordinal preference inputs (the corresponding inputs are cardinal in auction design) and
identifying suitable, differentiable surrogates for approximate strategy-proofness and approximate
stability.
We work with randomized matching mechanisms, for which the strongest SP concept is ordinal
strategy-proofness. This aligns incentives with truthful reporting, whatever an agent’s utility function
(i.e., for any cardinal preferences consistent with an agent’s ordinal preferences). Ordinal SP is
equivalent to the property of first-order stochastic dominance (FOSD) [Erdil, 2014], which suitably
defines the property that an agent has a better chance of getting their top, top-two, top-three, and
so forth choices when they report truthfully. As a surrogate for SP, we quantify during training the
degree to which FOSD is violated. For this, we adopt an adversarial learning approach, augmenting
the training data with defeating misreports that reveal the violation of FOSD. We also define a
suitable surrogate to quantify the degree to which stability is violated. This surrogate aligns with
the notion of ex ante stability—the strongest stability concept for randomized matching.
We train neural network representations of matching mechanisms by adopting stochastic gradient
descent (SGD) for loss functions that are defined on different convex combinations of the two
surrogate quantities, and construct the efficient frontier for stability and strategy-proofness for
different market settings. A further challenge with ordinal preference inputs as opposed to cardinal
inputs arises because ordinal preferences are discrete. In the present work, we simply enumerate the
possible misreports of an agent as a step when evaluating the derivative of error for a particular
training example. In contrast, the use of deep-learning approaches for the design of revenue-optimal
auctions works with a continuous space of agent valuations and gradient-ascent in this adversarial
step of identifying useful misreports [Duetting et al., 2019]. As discussed at the end of the paper,
the challenge that this presents in scaling to large numbers of agents can be resolved by assuming a
suitable structure on preference orderings in the domain, or on the language that is made available
to agents in reporting their preferences to the mechanisms.
Our main experimental results demonstrate that this novel use of deep learning can strike a
much better trade-off between stability and SP than that achieved by a convex combination of
the DA, TTC, and RSD mechanisms. Taken as a whole, these results suggest that deep learning
pipelines can be used to identify new opportunities for matching theory. For example, we identify
mechanisms that are provably almost as stable as DA and yet considerably more strategy-proof. We
also identify mechanisms that are provably almost as strategy-proof as RSD and yet considerably
more stable. These discoveries raise opportunities for future work in economic theory, in regard to
understanding the structure of these two-sided matching mechanisms as well as characterizing the
preference distributions for which this is possible.
2 Related Work
Dubins and Freedman [1981] and Roth [1982] show the impossibility of achieving both stability and
SP in two-sided matching. Alcalde and Barberà [1994] also show the impossibility of individually
rational, Pareto efficient, and SP allocation rules, and this work has been extended to randomized
matching [Alva and Manjunath, 2020]. RSD is SP but may not be stable or even individually
rational (IR) [Abdulkadiroglu and Sönmez, 1998]. We will see that the top trading cycles (TTC)
mechanism [Shapley and Scarf, 1974], when applied in a two-sided context, is only SP for one side,
and is neither stable nor IR. The DA mechanism [Gale and Shapley, 1962] is stable but not SP; see
also [Roth et al., 1993], who study the polytope of stable matchings. The stable improvement cycles
mechanism [Erdil and Ergin, 2008] achieves as much efficiency as possible on top of stability but
fails to be SP even for one side of the market. Finally, a series of results show that DA becomes SP
for both sides of the market in large-market limit contexts [Immorlica and Mahdian, 2015, Kojima
and Pathak, 2009, Lee, 2016].
Aziz and Klaus [2019] discuss different stability and no envy concepts. We focus on ex ante
stability [Kesten and Ünver, 2015], also discussed by [Roth et al., 1993] as strong stability. Mennle and
Seuken [2021] discuss different notions of approximate strategy-proofness in the context of matching
and allocation problems. In this work, we focus on ordinal SP and its analog of FOSD [Erdil, 2014].
This is a strong and widely used SP concept in the presence of ordinal preferences. Other desiderata, such as efficiency, are also incompatible with strategy-proofness. Mennle and Seuken [2017] study this trade-off through hybrid mechanisms, which are convex combinations of a mechanism with good incentive properties and another that is efficient. In the context of social choice, other work studies the trade-off between approximate SP and other desiderata for voting rules such as plurality and veto [Mennle and Seuken, 2016].
Conitzer and Sandholm [2002, 2004] introduced the automated mechanism design (AMD) ap-
proach that framed problems as a linear program. However, this approach faces severe scalability
issues as the formulation scales exponentially in the number of agents and items [Guo and Conitzer,
2010]. Overcoming this limitation, more recent work seeks to use deep neural networks to address
problems of economic design [Duetting et al., 2019, Feng et al., 2018, Golowich et al., 2018, Curry
et al., 2020, Shen et al., 2019, Rahme et al., 2020, Duan et al., 2022, Ivanov et al., 2022], but not
until now to matching problems. As discussed in the introduction, two-sided matching brings about
new challenges, most notably in regard to working with discrete, ordinal preferences and adopting
the right surrogate loss functions for approximate SP and approximate stability. Other work has
made use of support vector machines to search for stable mechanisms, but without considering
strategy-proofness [Narasimhan et al., 2016]. A different line of research considers stable matching together with bandit problems, where agent preferences are unknown a priori [Das
and Kamenica, 2005, Liu et al., 2020, Dai and Jordan, 2021, Liu et al., 2022, Basu et al., 2021,
Sankararaman et al., 2021, Jagadeesan et al., 2021, Cen and Shah, 2022, Min et al., 2022].
There have also been other recent efforts that leverage deep learning for matching (in the context
of online bipartite matching [Alomrani et al., 2022]) and other related combinatorial optimization
problems [Bengio et al., 2021]. Most of these papers adopt a reinforcement learning based approach
to compute their solutions. Our approach, on the other hand, is not sequential but rather end-to-end
differentiable, and our parameter weights are updated through a single backward pass. Additionally,
the focus of our work is on matching markets and mechanism design, and is concerned with capturing
core economic concepts within a machine learning framework and balancing the trade-offs between
stability and strategy-proofness.
3 Preliminaries
Let W denote a set of n workers and F denote a set of m firms. A feasible matching, µ, is a
set of (worker, firm) pairs, with each worker and firm participating in at most one match. Let B
denote the set of all matchings. If (w, f ) ∈ µ, then µ matches w to f , and we write µ(w) = f
and µ(f ) = w. If a worker or firm remains unmatched, we say it is matched to ⊥. We also
write (w, ⊥) ∈ µ (resp. (⊥, f) ∈ µ). Each worker has a strict preference order, ≻_w, over the set F̄ = F ∪ {⊥}. Each firm has a strict preference order, ≻_f, over the set W̄ = W ∪ {⊥}. Worker w (firm f) prefers remaining unmatched to being matched with a firm (worker) ranked below ⊥ (the agents ranked below ⊥ are said to be unacceptable). If worker w prefers firm f to f′, then we write f ≻_w f′, and similarly for a firm's preferences. Let P denote the set of all preference profiles, with ≻ = (≻_1, …, ≻_n, ≻_{n+1}, …, ≻_{n+m}) ∈ P denoting a preference profile comprising the preference orders of the n workers followed by those of the m firms.
A pair (w, f ) forms a blocking pair for matching µ if w and f prefer each other to their partners
in µ (or ⊥ in the case that one or both are unmatched). A matching µ is stable if and only if there
are no blocking pairs. A matching µ is individually rational (IR) if and only if it is not blocked by
any individual; i.e., no agent finds its match unacceptable and prefers ⊥.¹
For notational simplicity, we write g_{wf}(≻) for the marginal probability of matching worker w (or ⊥) and firm f (or ⊥).

¹ Stability precludes empty matchings. For example, if a matching µ leaves a worker w and a firm f unmatched, where w finds f acceptable and f finds w acceptable, then (w, f) is a blocking pair to µ.
Theorem 1 (Birkhoff–von Neumann). Given any randomized matching r, there exists a distribution
on matchings, ∆(B), with marginal probabilities equal to r.
The following definition is standard [Budish et al., 2013], and generalizes stability to randomized
matchings.
Definition 2 (Ex ante justified envy). A randomized matching r causes ex ante justified envy if:
(1) some worker w prefers firm f over some firm f′ to which w is fractionally matched (including f′ = ⊥), and firm f prefers w over some worker w′ to which f is fractionally matched (including w′ = ⊥) (“w has envy towards w′” and “f has envy towards f′”), or
(2) some worker w finds a firm f′ ∈ F to which it is fractionally matched unacceptable, i.e., r_{wf′} > 0 and ⊥ ≻_w f′, or some firm f finds a worker w′ ∈ W to which it is fractionally matched unacceptable, i.e., r_{w′f} > 0 and ⊥ ≻_f w′.
A randomized matching r is ex ante stable if and only if it does not cause any ex ante justified
envy. Ex ante stability reduces to the standard concept of stability for deterministic matching.
Part (1) of the definition includes non-wastefulness: for any worker w, we should have r_{w⊥} = 0 if there exists some firm f′ ∈ F for which r_{w′f′} > 0 and w ≻_{f′} w′, and for any firm f, we need r_{⊥f} = 0 if there exists some worker w′ ∈ W for which r_{w′f′} > 0 and f ≻_{w′} f′. Part (2) of the definition captures IR: for any worker w, we should have r_{wf′} = 0 for all f′ ∈ F for which ⊥ ≻_w f′, and for any firm f, we need r_{w′f} = 0 for all w′ ∈ W for which ⊥ ≻_f w′.
To define strategy-proofness, we say that u_w : F̄ → R is a ≻_w-utility for worker w if u_w(f) > u_w(f′) if and only if f ≻_w f′, for all f, f′ ∈ F̄. We similarly define a ≻_f-utility for a firm f.
The following concept of ordinal SP is standard [Erdil, 2014], and generalizes SP to randomized matchings.

Definition 3 (Ordinal strategy-proofness). A matching mechanism g is ordinal strategy-proof (ordinal SP) if, for every agent, every preference profile ≻, every utility function consistent with the agent's preference order, and every possible misreport, the agent's expected utility is weakly greater when reporting truthfully than when misreporting, fixing the reports of the other agents.

By this definition, no worker or firm can improve their expected utility (for any utility function consistent with their preference order) by misreporting their preference order. For a deterministic mechanism, ordinal SP reduces to standard SP. Erdil [2014] shows that first-order stochastic dominance (FOSD) is equivalent to ordinal SP.
Definition 4 (First-order stochastic dominance (FOSD)). A matching mechanism g satisfies FOSD if, for every worker w, every preference profile ≻, every misreport ≻′_w, and every firm f′ with f′ ≻_w ⊥,

    Σ_{f : f ⪰_w f′} g_{wf}(≻_w, ≻_{−w}) ≥ Σ_{f : f ⪰_w f′} g_{wf}(≻′_w, ≻_{−w}),

and similarly for every firm f.

FOSD states that, whether looking at its most preferred firm, its two most preferred firms, and so forth, worker w achieves a weakly higher probability of matching on that set of firms for its true report than for any misreport. We make use of a quantification of the violation of this condition to provide a surrogate for the failure of SP during learning.
Theorem 5 ([Erdil, 2014]). A two-sided matching mechanism is ordinal SP if and only if it satisfies
FOSD.
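To make the FOSD condition concrete, the following sketch checks it for a single worker against one fixed misreport; the function name and the list-based preference encoding are illustrative choices of ours, not notation from the paper.

```python
def fosd_violation(true_order, p_truth, p_misreport):
    """Max amount by which a misreport beats truth on some top-k prefix.

    true_order: firms in decreasing order of the worker's true preference,
                truncated at the outside option (acceptable firms only).
    p_truth[f], p_misreport[f]: marginal match probability for firm f under
                the truthful report and under one fixed misreport.
    Returns 0.0 if FOSD holds against this misreport.
    """
    violation, cum_truth, cum_mis = 0.0, 0.0, 0.0
    for f in true_order:  # sweep prefixes: top-1, top-2, ...
        cum_truth += p_truth[f]
        cum_mis += p_misreport[f]
        violation = max(violation, cum_mis - cum_truth)
    return violation

# Example: truth gives the top choice w.p. 0.5, a misreport w.p. 0.6.
print(fosd_violation([2, 0], {2: 0.5, 0: 0.3}, {2: 0.6, 0: 0.1}))  # 0.1
```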
3.2 Deferred Acceptance, RSD, and TTC
We consider three benchmark mechanisms: the stable but not SP deferred-acceptance (DA) mecha-
nism, the SP but not stable randomized serial dictatorship (RSD) mechanism, and the Top Trading
Cycles (TTC) mechanism, which is neither SP nor stable. The DA and TTC mechanisms are ordinal
SP for the proposing side of the market but not for agents on both sides of the market.
Definition 6 (Deferred acceptance (DA)). In the worker-proposing version, proceed in rounds until no new proposals are made:

• ∀w ∈ W: if w is not tentatively accepted by any firm, w proposes to its most preferred acceptable firm that has not yet rejected it (if any such firm remains).

• ∀f ∈ F: f tentatively accepts its best proposal (if any), including any proposal it currently holds, and rejects the rest.

On termination, tentative matches become final and all other agents remain unmatched. The firm-proposing version is defined symmetrically.
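For reference, a minimal sketch of worker-proposing DA, assuming an encoding in which each agent's preference list contains only its acceptable partners in decreasing order of preference (the encoding and function name are ours):

```python
def worker_proposing_da(worker_prefs, firm_prefs):
    """worker_prefs[w] / firm_prefs[f]: acceptable partners, best first.
    Returns a dict mapping matched workers to firms."""
    # rank[f][w]: position of worker w in firm f's list; absent if unacceptable
    rank = [{w: i for i, w in enumerate(fp)} for fp in firm_prefs]
    next_prop = [0] * len(worker_prefs)  # next list position each worker tries
    held = {}                            # firm -> tentatively accepted worker
    free = list(range(len(worker_prefs)))
    while free:
        w = free.pop()
        while next_prop[w] < len(worker_prefs[w]):
            f = worker_prefs[w][next_prop[w]]
            next_prop[w] += 1
            if w not in rank[f]:
                continue                 # f finds w unacceptable: rejection
            if f not in held:
                held[f] = w              # f tentatively accepts its proposal
                break
            if rank[f][w] < rank[f][held[f]]:
                free.append(held[f])     # f rejects its current worker for w
                held[f] = w
                break
        # if w exhausts its list, it remains matched to ⊥
    return {w: f for f, w in held.items()}
```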
Theorem 7 (see [Roth and Sotomayor, 1990]). DA is stable but not ordinal SP.
Definition 8 (Randomized serial dictatorship (RSD)). In the two-sided version of RSD, we first sample a priority order, π, on the set W ∪ F, uniformly at random, such that π = (π_1, π_2, …, π_{m+n}) is a permutation on W ∪ F in decreasing order of priority. For the one-sided version, we sample a priority order π on either W or F.
Proceed as follows:

• In round k = 1, …, |π|:
  – If π_k is not yet matched in µ, then add to µ the match between π_k and its most preferred unmatched agent, or ⊥ if all remaining agents are unacceptable to π_k.
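A minimal sketch of the two-sided version under the same illustrative list encoding; note that the selected partner's own preferences are never consulted, which is exactly why RSD can violate IR:

```python
import random

def two_sided_rsd(worker_prefs, firm_prefs, rng=random):
    """One draw of two-sided RSD; returns a set of (worker, firm) matches."""
    n, m = len(worker_prefs), len(firm_prefs)
    order = [("W", w) for w in range(n)] + [("F", f) for f in range(m)]
    rng.shuffle(order)                     # uniformly random priority order pi
    matched_w, matched_f, mu = set(), set(), set()
    for side, i in order:
        if (i in matched_w) if side == "W" else (i in matched_f):
            continue                       # already matched in an earlier round
        prefs = worker_prefs[i] if side == "W" else firm_prefs[i]
        taken = matched_f if side == "W" else matched_w
        for j in prefs:                    # most preferred unmatched acceptable agent
            if j not in taken:             # j's own preferences are ignored here
                w, f = (i, j) if side == "W" else (j, i)
                mu.add((w, f))
                matched_w.add(w)
                matched_f.add(f)
                break                      # if no j is found, i stays matched to ⊥
    return mu
```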
Theorem 9. RSD satisfies FOSD—and thus is ordinal SP by Theorem 5—but is not stable.
Definition 10 (Top trading cycles (TTC)). Proceed in rounds while unmatched agents remain:

• Form a directed graph with each unmatched agent pointing to its most preferred option. An agent points at itself if there are no acceptable options available. Every worker that is part of a cycle is matched to the firm it points to (or remains unmatched, if the worker is pointing at itself). The unmatched agents remove from their lists every agent matched in this round.
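The cycle step can be implemented by following pointers: since every node has out-degree one, every forward walk must eventually repeat a node. The sketch below (our own encoding, with agents as keys of a pointer dictionary) returns the agents matched in one round.

```python
def ttc_cycles(pointer):
    """pointer[i]: agent i's most preferred remaining agent, or i itself.
    Returns the set of agents lying on cycles; in TTC, each worker on a
    cycle is matched to the firm it points to."""
    on_cycle = set()
    for start in pointer:
        path, seen = [], set()
        node = start
        while node not in seen and node not in on_cycle:
            seen.add(node)
            path.append(node)
            node = pointer[node]           # out-degree one: must hit a repeat
        if node in seen:                   # the repeat closes a new cycle
            on_cycle.update(path[path.index(node):])
    return on_cycle

# Round 1 of Example 17 (Appendix B): w1->f1->w2->f2->w1 is a cycle.
print(ttc_cycles({"w1": "f1", "w2": "f2", "f1": "w2", "f2": "w1"}))
```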
Theorem 11. TTC is neither strategy-proof nor stable when considering both sides of the market.

Proof. Example 17 in Appendix B shows that TTC is neither strategy-proof nor stable.
Remark 12. TTC, like RSD, is usually used in one-sided assignment problems, where it is SP, and where the notion of stability, which is an important consideration in two-sided matching, is not a concern.
(Figure 1, schematic: the inputs p, q feed R hidden layers h^(1), …, h^(R); two softplus (σ_+) heads output score matrices s̄ (with an added ⊥ row) and s̄′ (with an added ⊥ column); each is multiplied by the mask β and then normalized column-wise and row-wise, respectively, giving ŝ and ŝ′; the output is r_{wf} = min{ŝ_{wf}, ŝ′_{wf}}.)
Figure 1: Matching network g for a set of n workers and m firms. Given inputs p, q ∈ Rn×m the matching
network is a feed-forward network with R hidden layers that uses softplus activation to generate non-negative
scores and normalization to compute the randomized matching. We additionally generate a Boolean mask
matrix, β, and multiply it with the score matrix before normalization to ensure IR by making the probability
of matches that are unacceptable zero.
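As a concrete sketch, the following PyTorch module implements the computation in Figure 1; the hidden width, the number of layers R, and the ReLU trunk activation are illustrative choices of ours (the figure specifies only the softplus output activation, the mask, the two normalizations, and the final element-wise min):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MatchingNetwork(nn.Module):
    def __init__(self, n, m, hidden=256, R=4):
        super().__init__()
        self.n, self.m = n, m
        layers, d = [], 2 * n * m
        for _ in range(R):                        # R hidden layers h^(1)..h^(R)
            layers += [nn.Linear(d, hidden), nn.ReLU()]
            d = hidden
        self.trunk = nn.Sequential(*layers)
        self.head_s = nn.Linear(d, (n + 1) * m)   # scores with an extra ⊥ row
        self.head_t = nn.Linear(d, n * (m + 1))   # scores with an extra ⊥ column

    def forward(self, p, q):                      # p, q: (batch, n, m)
        b = p.shape[0]
        # mask beta: positive p and q entries mark mutually acceptable pairs
        beta = ((p > 0) & (q > 0)).float()
        x = self.trunk(torch.cat([p.flatten(1), q.flatten(1)], dim=-1))
        s = F.softplus(self.head_s(x)).view(b, self.n + 1, self.m)
        t = F.softplus(self.head_t(x)).view(b, self.n, self.m + 1)
        s = s * torch.cat([beta, beta.new_ones(b, 1, self.m)], dim=1)
        t = t * torch.cat([beta, beta.new_ones(b, self.n, 1)], dim=2)
        s_hat = s / s.sum(dim=1, keepdim=True)    # column-wise normalization
        t_hat = t / t.sum(dim=2, keepdim=True)    # row-wise normalization
        # element-wise min keeps rows and columns weakly doubly stochastic
        return torch.min(s_hat[:, : self.n, :], t_hat[:, :, : self.m])
```

Because the ⊥ row and column are never masked and softplus scores are strictly positive, the normalizations are well defined, and masked pairs receive zero probability, consistent with the IR construction.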
The inputs p^≻ and q^≻ give a cardinal representation of the preference order of a worker and firm, respectively. We define p^≻_{w⊥} = 0 and q^≻_{⊥f} = 0.
Formally, we have:

    p^≻_{wj} = (1/m) ( 1_{j ≻_w ⊥} + Σ_{j′=1}^{m} [ 1_{j ≻_w j′} − 1_{⊥ ≻_w j′} ] ),    (3)

    q^≻_{if} = (1/n) ( 1_{i ≻_f ⊥} + Σ_{i′=1}^{n} [ 1_{i ≻_f i′} − 1_{⊥ ≻_f i′} ] ).    (4)
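A direct transcription of (3) for a single worker (the firm-side representation (4) is symmetric); the list encoding with None standing for ⊥ is our illustrative choice:

```python
import numpy as np

def worker_representation(pref, m):
    """Equation (3): pref lists all m firms best-to-worst, with None marking
    the position of ⊥. Entries are positive for acceptable firms and
    negative otherwise."""
    pos = {a: i for i, a in enumerate(pref)}          # rank of each firm and of ⊥
    p = np.zeros(m)
    for j in range(m):
        above_bot = 1.0 if pos[j] < pos[None] else 0.0               # 1{j ≻_w ⊥}
        worse = sum(1.0 for jp in range(m) if pos[j] < pos[jp])      # Σ 1{j ≻_w j′}
        unacc = sum(1.0 for jp in range(m) if pos[None] < pos[jp])   # Σ 1{⊥ ≻_w j′}
        p[j] = (above_bot + worse - unacc) / m
    return p

# f0 ≻ f1 ≻ ⊥ ≻ f2 gives the equi-spaced values [2/3, 1/3, -1/3]
print(worker_representation([0, 1, None, 2], m=3))
```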
We have r_{wf} = 0 whenever β_{wf} = 0, ensuring that every matching in the support of the distribution will be IR. Based on our construction, the allocation matrix r is weakly doubly stochastic, with rows and columns summing to at most 1. Budish et al. [2013] show that any weakly doubly stochastic matrix can be decomposed into a convex combination of 0-1, weakly doubly stochastic matrices.
This captures the first kind of ex ante justified envy in Definition 2. We can omit the second
kind of ex ante justified envy because the learned mechanisms satisfy IR through the use of the
Boolean mask matrix (and thus, there are no violations of the second kind).
The average stability violation (or just stability violation) of mechanism g^θ on profile ≻ is

    stv(g^θ, ≻) = (1/2) (1/n + 1/m) Σ_{w=1}^{n} Σ_{f=1}^{m} stv_{wf}(g^θ, ≻).

We define the expected stability violation as STV(g^θ) = E_≻ [stv(g^θ, ≻)]. We also write stv(g^θ) to denote the average stability violation on the training data.

Theorem 13. A matching mechanism g^θ is ex ante stable, except on zero-measure events, if and only if STV(g^θ) = 0.
θ θ
Ordinal SP violation. We turn now to quantifying the degree of approximation to ordinal SP. Let ≻_{−i} = (≻_1, …, ≻_{i−1}, ≻_{i+1}, …, ≻_{n+m}). For a preference profile ≻ ∈ P and a mechanism g^θ, let Δ_{wf}(g^θ, ≻′_w, ≻) = g^θ_{wf}(≻′_w, ≻_{−w}) − g^θ_{wf}(≻_w, ≻_{−w}). The regret to worker w (firm f) is defined as:

    regret_w(g^θ, ≻) = max_{≻′_w ∈ P}  max_{f′ : f′ ≻_w ⊥}  Σ_{f : f ⪰_w f′} Δ_{wf}(g^θ, ≻′_w, ≻),    (6)

    regret_f(g^θ, ≻) = max_{≻′_f ∈ P}  max_{w′ : w′ ≻_f ⊥}  Σ_{w : w ⪰_f w′} Δ_{wf}(g^θ, ≻′_f, ≻).

The expected regret is RGT(g^θ) = E_≻ [regret(g^θ, ≻)]. We also write rgt(g^θ) to denote the average regret on training data.
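In small markets the regret in (6) can be computed exactly by brute-force enumeration of the (m + 1)! misreports; the helper below is a sketch, where g is assumed to map a full preference profile to the (n, m) matrix of marginal match probabilities:

```python
from itertools import permutations

def regret_worker(g, w, profile):
    """Brute-force evaluation of (6) for worker w. profile[i] is agent i's
    reported order over the other side plus None (standing for ⊥)."""
    truth = profile[w]
    p_true = g(profile)[w]                   # w's marginals under truthful report
    acceptable = truth[: truth.index(None)]  # firms ranked above ⊥, best first
    best = 0.0
    for rep in permutations(truth):          # all (m + 1)! possible misreports
        p_mis = g(profile[:w] + [list(rep)] + profile[w + 1:])[w]
        prefix, worst = 0.0, 0.0
        for f in acceptable:                 # sweep top-k sets of the TRUE order
            prefix += p_mis[f] - p_true[f]
            worst = max(worst, prefix)
        best = max(best, worst)
    return best
```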
Theorem 14. The regret to a worker (firm) for a given preference profile is the maximum amount by which the worker (firm) can increase their expected normalized utility through a misreport, fixing the reports of others.

Proof. The increase in expected utility for worker w under misreport ≻′_w, when the utility function is u_w, is given by Σ_{f∈F} u_w(f) Δ_{wf}(g^θ, ≻′_w, ≻). The maximum amount by which worker w can increase their expected normalized utility through misreport ≻′_w is given by the objective max_{u_w ∈ U_w} Σ_{f∈F} u_w(f) Δ_{wf}(g^θ, ≻′_w, ≻), where U_w denotes the set of normalized ≻_w-utilities and f_1 ≻_w ⋯ ≻_w f_k ≻_w ⊥ ≻_w f_{k+1} ≻_w ⋯ ≻_w f_m indexes the firms in w's true preference order, with f_1, …, f_k being the acceptable firms. Thus, we can simplify our search space by only considering u_w ∈ U_w where u_w(f_{k+1}) = ⋯ = u_w(f_m) = 0.
Define δ_k = u_w(f_k), δ_{k−1} = u_w(f_{k−1}) − u_w(f_k), …, δ_1 = u_w(f_1) − u_w(f_2). This objective can thus be rewritten as:

    max_{δ}  Σ_{f=1}^{k} Σ_{i=f}^{k} δ_i Δ_{wf}(g^θ, ≻′_w, ≻)    (9)

    such that  Σ_{i=1}^{k} δ_i ≤ 1  and  δ_1, …, δ_k ≥ 0.    (10)

This objective is of the form max_{∥x∥_1 ≤ 1} xᵀy, and its solution is given by ∥y∥_∞. Thus, the solution to the above maximization problem is max_{i∈[k]} Σ_{f=1}^{i} Δ_{wf}(g^θ, ≻′_w, ≻). But this is the same as max_{f′ : f′ ≻_w ⊥} Σ_{f : f ⪰_w f′} Δ_{wf}(g^θ, ≻′_w, ≻). Computing the maximum possible increase over all such misreports gives max_{≻′_w ∈ P} max_{f′ : f′ ≻_w ⊥} Σ_{f : f ⪰_w f′} Δ_{wf}(g^θ, ≻′_w, ≻). This quantity is exactly regret_w(g^θ, ≻). The proof follows similarly for any firm f. ∎

Theorem 15. A matching mechanism g^θ is ordinal SP, except on zero-measure events, if and only if RGT(g^θ) = 0.
Given a training set of preference profiles {≻^{(1)}, …, ≻^{(L)}}, we train the network parameters θ by minimizing a convex combination of the two surrogate violations,

    min_θ  λ · stv(g^θ) + (1 − λ) · rgt(g^θ),    (13)

where λ ∈ [0, 1] controls the tradeoff between approximate stability and approximate SP. We use SGD to solve (13). The gradient of the degree of violation of stability with respect to the network parameters is straightforward to calculate. The gradient of regret is complicated by the nested maximization in the definition of regret. In order to compute the gradient, we first solve the inner maximization by checking possible misreports. Let ≻̂_i^{(ℓ)} denote the defeating preference report for agent i (a worker or firm) at preference profile ≻^{(ℓ)} that maximizes regret_i(g^θ, ≻^{(ℓ)}). Given this, we obtain the derivative of regret for agent i with respect to the network parameters, fixing the misreport to the defeating report and adopting truthful reports for the others.
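Schematically, one SGD step on (13) looks as follows; encode, stability_violation, and regret_with_defeating_misreports are hypothetical helpers standing in for the representations (3)-(4) and the two surrogates, and are assumed to be written with differentiable torch operations:

```python
def training_step(net, optimizer, profiles, lam):
    p, q = encode(profiles)            # hypothetical: build (3)-(4) representations
    r = net(p, q)
    stv = stability_violation(r, p, q).mean()
    # inner maximization: enumerate misreports to find each agent's defeating
    # report, then differentiate the induced regret w.r.t. the network weights
    rgt = regret_with_defeating_misreports(net, profiles).mean()
    loss = lam * stv + (1.0 - lam) * rgt
    optimizer.zero_grad()
    loss.backward()                    # a single backward pass
    optimizer.step()
    return float(loss)
```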
For the case of TTC and RSD, we also need to quantify the IR violation (this is necessarily zero for the other mechanisms). We define the IR violation at profile ≻ as:

    irv(g, ≻) = (1/(2m)) Σ_{w=1}^{n} Σ_{f=1}^{m} g_{wf}(≻) · max{−q^≻_{wf}, 0}  +  (1/(2n)) Σ_{w=1}^{n} Σ_{f=1}^{m} g_{wf}(≻) · max{−p^≻_{wf}, 0}.    (14)
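A direct transcription of (14), with g_prob, p, and q as (n, m) arrays (the names are ours):

```python
import numpy as np

def ir_violation(g_prob, p, q):
    """Equation (14): expected normalized utility lost to matches that one
    side ranks below ⊥ (negative entries of p or q mark unacceptability)."""
    n, m = g_prob.shape
    firm_term = (g_prob * np.maximum(-q, 0.0)).sum() / (2 * m)
    worker_term = (g_prob * np.maximum(-p, 0.0)).sum() / (2 * n)
    return firm_term + worker_term
```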
Figure 2: Comparing stability violation and strategy-proofness violation from the learned mechanisms for
different choices of λ (red dots) with the best of worker- and firm-proposing DA, as well as TTC, and RSD, in
4 × 4 two-sided matching, and considering uncorrelated preference orders as well as markets with increasing
correlation (pcorr ∈ {0.25, 0.5, 0.75}). The stability violation for TTC and RSD includes IR violations.
For the case of RSD, we also include the average IR violation on test data when reporting the
stability violation.
5 Experimental Results
We report the results on a test set of 204,800 preference profiles, and use the Adam optimizer to
train our models for 50,000 mini-batch iterations, with mini-batches of 1024 profiles. We use the
PyTorch deep learning library, and all experiments are run on a cluster of NVIDIA GPUs.
We study the following market settings:
• For uncorrelated preferences, for each worker or firm, we sample uniformly at random from
all preference orders, and then, with probability 0.2 (truncation probability), we choose at
random a position at which to truncate this agent’s preference order.
Specifically, we consider matching problems with n = 4 workers and m = 4 firms, with uncorrelated preferences as well as correlated preferences with varying correlation probability pcorr ∈ {0.25, 0.5, 0.75}.
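For concreteness, a sketch of the uncorrelated sampler follows; the distribution over truncation positions is not pinned down in the text, so the uniform choice below is our assumption, and the correlated sampler is analogous:

```python
import random

def sample_uncorrelated_preference(m, p_trunc=0.2, rng=random):
    """One agent's preference order: a uniformly random order over the m
    partners, truncated at a random position with probability p_trunc.
    Encoding: None marks ⊥; partners after None are unacceptable."""
    order = list(range(m))
    rng.shuffle(order)
    cut = rng.randrange(m + 1) if rng.random() < p_trunc else m
    return order[:cut] + [None] + order[cut:]
```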
We compare the performance of our mechanisms, varying parameter λ between 0 and 1, with the best of worker- and firm-proposing DA and TTC (as determined by average SP violation over the test data) and RSD.³ We also compare against convex combinations of DA, TTC, and RSD. We plot the resulting frontier on stability violation (stv(g^θ)) and SP violation (rgt(g^θ)) in Figure 2.

³ We only plot the performance of one-sided RSD, as it achieves lower stability violation than the two-sided version.
Figure 3: Left and Middle: Comparing welfare per agent of the learned mechanisms for different values
of the tradeoff parameter λ with the best of the firm- and worker- proposing DA, as well as TTC, and RSD.
Right, Top: Comparing the average instance-wise max similarity scores (sim(g θ )) of the learned mechanisms
with worker- and firm-proposing DA. Right, Bottom: Normalized entropy of the learned mechanisms for
different values of the tradeoff parameter λ. The results are shown for uncorrelated preferences as well as an
increasing correlation between preferences (pcorr ∈ {0.25, 0.5, 0.75}).
As explained above, because TTC and RSD mechanisms do not guarantee IR, we include the IR
violations in the reported stability violation (none of the other mechanisms fail IR).
At λ = 0.0, we learn a mechanism that has very low regret (≈ 0) but poor stability. This
performance is similar to that of RSD. For large values of λ, we learn a mechanism that approximates
DA. For intermediate values, we find solutions that dominate the convex combination of DA, TTC,
and RSD, and find novel and interesting tradeoffs between SP and stability. Notably, for lower levels of correlation, we see substantially better SP than DA along with very little loss in stability. Given the importance of stability in practice, this is a very intriguing discovery. For higher levels of correlation, we see substantially better stability than RSD along with very little loss in SP. It is
also interesting to see that TTC itself has intermediate properties, between those of DA and RSD.
Comparing the scale of the y-axes, we can also see that increasing correlation tends to reduce the
opportunity for strategic behavior across both the DA and the learned mechanisms.
Figure 3 shows the expected welfare for the learned mechanisms, measured here for the equi-
spaced utility function (the function used in the input representation). We define the welfare of a
mechanism g (for the equi-spaced utility function) on a profile ≻ as:
    welfare(g, ≻) = (1/2) (1/n + 1/m) Σ_{w∈W} Σ_{f∈F} g_{wf}(≻) ( p^≻_{wf} + q^≻_{wf} ).    (15)
We compare against the maximum of the expected welfare achieved by the worker- and firm-proposing DA and TTC mechanisms, as well as that from RSD. As we increase λ, and the learned
mechanisms come closer to DA, the welfare of the learned mechanisms improves. It is notable that for
choices of λ in the range 0.8 and higher, i.e., the choices of λ that provide interesting opportunities
for improving SP relative to DA, we also see good welfare. We also see that TTC, and especially RSD, have comparably lower welfare. It bears emphasis that when a mechanism is not fully SP,
as is the case for all mechanisms except RSD, this is an idealized view of welfare since it assumes
truthful reports. In fact, we should expect welfare to be reduced through strategic misreports and
the substantially improved SP properties of the learned mechanisms relative to DA (Figure 2) would
be expected to further work in favor of improving the welfare in the learned mechanisms relative to
DA.⁴ Lastly, we observe that for small values of λ the learned mechanisms have relatively low welfare compared to RSD. This is interesting and suggests that achieving IR together with SP (recall that RSD is not IR!) is very challenging in two-sided markets.

⁴ In fact, the same is true for the stability of a non-SP mechanism such as DA, but it has become standard to assume truthful reports to DA in considering the stability properties of DA.

In interpreting the rules of the learned
mechanisms, and considering the importance of DA, we can also compare their functional similarity
with DA. For this, let w-DA and f -DA denote the worker- and firm-proposing DA, respectively. For
a given preference profile, we compute the similarity of the learned rule with DA as
    sim(g^θ, ≻) = max_{M ∈ {w-DA, f-DA}}  [ Σ_{(w,f) : g^M_{wf}(≻)=1} g^θ_{wf}(≻) ]  /  [ Σ_{(w,f) : g^M_{wf}(≻)=1} 1 ].    (16)
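In code, (16) amounts to summing the learned marginals over the pairs matched by each DA variant; the set-of-pairs encoding for the DA matchings is ours:

```python
def similarity(r, mu_wda, mu_fda):
    """Equation (16): overlap of the learned marginals r (an (n, m) array)
    with the better of the two DA matchings, each a set of (w, f) pairs."""
    def overlap(mu):
        # sum of learned probability on DA's pairs, normalized by DA's size
        return sum(r[w][f] for (w, f) in mu) / max(len(mu), 1)
    return max(overlap(mu_wda), overlap(mu_fda))
```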
This calculates the agreement between the two mechanisms, normalized by the size of the DA
matching, and taking the best of w-DA or f -DA. Let sim(g θ ) denote the average similarity score
on test data. As we increase λ, i.e., penalize stability violations more, we see in Figure 3 that the
learned matchings get increasingly close to the DA matchings, as we might expect. We also quantify
the degree of randomness of the learned mechanisms, by computing the normalized entropy per agent,
taking the expectation over all preference profiles (the normalized entropy is 0 for a deterministic mechanism).
Figure 3 shows how the entropy changes with λ. As we increase λ and the mechanisms come closer to DA, the allocations of the learned mechanisms also become less stochastic.
6 Discussion
The methodology and results in this paper give a first but crucial step towards using machine learning to understand the structure of mechanisms that achieve nearly the same stability as DA while surpassing DA in terms of strategy-proofness. This is an interesting observation, given the
practical and theoretical importance of the DA mechanism. Our experimental results also suggest
that achieving IR together with SP (recall that RSD is not IR) is challenging in two-sided markets, and this is reflected in the lower welfare achieved by our learned mechanisms in the part of the
frontier that emphasizes SP. The experimental results also show that for larger weights assigned
to the importance of stability, the learned mechanisms recover the DA mechanism, whereas when
more emphasis is given to the importance of strategy-proofness, the learned mechanisms are more
randomized. There are other interesting questions waiting to be addressed. For instance, can we use
this kind of framework to understand other tradeoffs, such as tradeoffs between strategy-proofness
and efficiency?
As discussed in the introduction, a challenge in scaling to larger problems is the need to find
defeating misreports, as exhaustively enumerating all misreports for an agent becomes intractable as
the number of agents on the other side of the market increases (and thus the preference domain
increases). A simple remedy is to restrict the language available to agents in making preference
reports; e.g., it is commonplace to only allow for “top-k preferences" to be reported. Another remedy
is to work in domains where there exists some structure on the preference domain, so that not all
possible preference orders exist; e.g., single-peaked preferences are an especially stark example [Black,
1948]. It will also be interesting to study complementary approaches that relax the discrete set
of preference orderings to a continuous convex hull such as the Birkhoff polytope. Confining these
preference orderings to the Birkhoff polytope can be accomplished using differentiable operations, such as the Sinkhorn operator [Adams and Zemel, 2011]. Such a continuous relaxation allows misreports
to be identified through gradient ascent rather than sampling (and would be similar to working with
continuous valuations when learning revenue-optimal auctions [Duetting et al., 2019]). Despite this
limitation, our current approach scales much further than other existing methods for automated design, which are not well suited for this problem. For instance, methods that use linear programs or integer programs do not scale well because of the number of variables⁵ required to make explicit the input and output structure of the functional that must be optimized over.

⁵ (m + 1)!^n · (n + 1)!^m · nm output variables.
A second challenge is that we have not been able to find a suitable, publicly available dataset to
test our approach. As a fallback, we have endeavored to capture some real-world structures by varying
the correlation between agent preferences and the truncation probabilities of preferences. Using such
stylized, probabilistic models and simulations for validating approaches is a well-established and
prevalent practice, consistently utilized when investigating two-sided matching markets [Chen and
Sönmez, 2006, Echenique and Yariv, 2013, Das and Kamenica, 2005, Liu et al., 2020, Dai and Jordan,
2021]. For instance, Chen and Sönmez [Chen and Sönmez, 2006] design an environment for school
choice where they consider six different schools with six seats each and where the students’ preferences
are simulated to depend on proximity, quality, and a random factor. Echenique and Yariv [Echenique
and Yariv, 2013] use a simulation study with eight participants on each side of the market, with the
payoff matrix designed such that there are one, two, or three stable matches. Further, recent papers
on bandit models for stable matching model agent preferences through synthetic datasets [Das and
Kamenica, 2005, Liu et al., 2020, Dai and Jordan, 2021].
In closing, we see exciting work ahead in advancing the design of matching mechanisms that
strike the right balance between stability, strategyproofness, and other considerations that are critical
to real-world applications. As an example, it will be interesting to extend the learning framework
to encompass desiderata such as capacity limitations or fairness considerations. Also relevant is to
explore additional kinds of neural network architectures, such as attention mechanisms [Vaswani
et al., 2017] or architectures that incorporate permutation equivariance [Rahme et al., 2020, Duan
et al., 2022], with the advantage that they can significantly reduce the search space.
Acknowledgements
The source code for all experiments along with the instructions to run it is available from Github at
https://github.com/saisrivatsan/deep-matching/. This work is supported in part through an
AWS Machine Learning Research Award.
References
A. Abdulkadiroglu and T. Sönmez. Random serial dictatorship and the core from random endowments
in house allocation problems. Econometrica, 66(3):689–702, 1998.
A. Abdulkadiroglu, P. Pathak, A. E. Roth, and T. Sonmez. Changing the Boston School Choice
Mechanism. NBER Working Papers 11965, National Bureau of Economic Research, Inc, Jan. 2006.
URL https://ideas.repec.org/p/nbr/nberwo/11965.html.
R. P. Adams and R. S. Zemel. Ranking via sinkhorn propagation. arXiv preprint arXiv:1106.1925,
2011.
J. Alcalde and S. Barberà. Top dominance and the possibility of strategy-proof stable solutions to
matching problems. Economic Theory, 4(3):417–435, 1994.
M. A. Alomrani, R. Moravej, and E. B. Khalil. Deep policies for online bipartite matching: A
reinforcement learning approach. Transactions on Machine Learning Research, 2022. ISSN
2835-8856. URL https://openreview.net/forum?id=mbwm7NdkpO.
S. Alva and V. Manjunath. The impossibility of strategy-proof, Pareto efficient, and individually
rational rules for fractional matching. Games and Economic Behavior, 119:15–29, 2020.
H. Aziz and B. Klaus. Random matching under priorities: stability and no envy concepts. Social
Choice and Welfare, 53(2):213–259, 2019. ISSN 01761714, 1432217X. URL http://www.jstor.
org/stable/45212389.
S. Basu, K. A. Sankararaman, and A. Sankararaman. Beyond log^2(T) regret for decentralized bandits
in matching markets. In M. Meila and T. Zhang, editors, Proceedings of the 38th International
Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages
705–715. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr.press/v139/basu21a.html.
Y. Bengio, A. Lodi, and A. Prouvost. Machine learning for combinatorial optimization: A method-
ological tour d’horizon. European Journal of Operational Research, 290(2):405–421, 2021. ISSN
0377-2217. doi: https://doi.org/10.1016/j.ejor.2020.07.063. URL https://www.sciencedirect.
com/science/article/pii/S0377221720306895.
E. Budish, Y.-K. Che, F. Kojima, and P. Milgrom. Designing random allocation mechanisms: Theory
and applications. American Economic Review, 103(2):585–623, April 2013. doi: 10.1257/aer.103.2.
585. URL https://www.aeaweb.org/articles?id=10.1257/aer.103.2.585.
S. H. Cen and D. Shah. Regret, stability and fairness in matching markets with bandit learners.
In G. Camps-Valls, F. J. R. Ruiz, and I. Valera, editors, Proceedings of The 25th International
Conference on Artificial Intelligence and Statistics, volume 151 of Proceedings of Machine Learning
Research, pages 8938–8968. PMLR, 28–30 Mar 2022. URL https://proceedings.mlr.press/
v151/cen22a.html.
Y. Chen and T. Sönmez. School choice: an experimental study. Journal of Economic Theory,
127(1):202–231, 2006. ISSN 0022-0531. doi: https://doi.org/10.1016/j.jet.2004.10.006. URL
https://www.sciencedirect.com/science/article/pii/S0022053104002418.
V. Conitzer and T. Sandholm. Complexity of mechanism design. In Proceedings of the 18th Conference
on Uncertainty in Artificial Intelligence, pages 103–110, 2002.
X. Dai and M. Jordan. Learning in multi-stage decentralized matching markets. Advances in Neural
Information Processing Systems, 34:12798–12809, 2021.
S. Das and E. Kamenica. Two-sided bandits and the dating market. In IJCAI, volume 5, page 19.
Citeseer, 2005.
Z. Duan, J. Tang, Y. Yin, Z. Feng, X. Yan, M. Zaheer, and X. Deng. A context-integrated transformer-
based neural network for auction design. In K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari,
G. Niu, and S. Sabato, editors, Proceedings of the 39th International Conference on Machine
Learning, volume 162 of Proceedings of Machine Learning Research, pages 5609–5626. PMLR,
17–23 Jul 2022. URL https://proceedings.mlr.press/v162/duan22a.html.
A. Erdil. Strategy-proof stochastic assignment. Journal of Economic Theory, 151:146 – 162, 2014.
ISSN 0022-0531.
A. Erdil and H. Ergin. What’s the matter with tie-breaking? Improving efficiency in school choice.
American Economic Review, 98(3):669–89, 2008.
Z. Feng, H. Narasimhan, and D. C. Parkes. Deep learning for revenue-optimal auctions with budgets.
In Proceedings of the 17th Conference on Autonomous Agents and Multi-Agent Systems, pages
354–362, 2018.
D. Gale and L. S. Shapley. College admissions and the stability of marriage. The American
Mathematical Monthly, 69(1):9–15, 1962.
N. Golowich, H. Narasimhan, and D. C. Parkes. Deep learning for multi-facility location mechanism
design. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pages
261–267, 2018.
M. Guo and V. Conitzer. Computationally feasible automated mechanism design: General approach
and case studies. In Proceedings of the 24th AAAI Conference on Artificial Intelligence, 2010.
J. W. Hatfield and P. R. Milgrom. Matching with contracts. American Economic Review, 95(4):
913–935, September 2005. doi: 10.1257/0002828054825466. URL https://www.aeaweb.org/
articles?id=10.1257/0002828054825466.
N. Immorlica and M. Mahdian. Incentives in large random two-sided markets. ACM Transactions
on Economics and Computation, 3:#14, 2015.
Y. Kamada and F. Kojima. Stability and strategy-proofness for matching with constraints: A
necessary and sufficient condition. Theoretical Economics, 13(2):761–793, 2018.
F. Kojima and P. A. Pathak. Incentives and stability in large two-sided matching markets. American
Economic Review, 99:608–627, 2009.
S. Lee. Incentive compatibility of large centralized matching markets. The Review of Economic
Studies, 84(1):444–463, 2016.
L. T. Liu, F. Ruan, H. Mania, and M. I. Jordan. Bandit learning in decentralized matching markets.
J. Mach. Learn. Res., 22(1), jul 2022. ISSN 1532-4435.
T. Mennle and S. Seuken. The pareto frontier for random mechanisms. In Proceedings of the 2016
ACM Conference on Economics and Computation, EC ’16, page 769, New York, NY, USA, 2016.
Association for Computing Machinery.
T. Mennle and S. Seuken. Hybrid mechanisms: Trading off strategyproofness and efficiency of
random assignment mechanisms, 2017.
T. Mennle and S. Seuken. Partial strategyproofness: Relaxing strategyproofness for the random
assignment problem. Journal of Economic Theory, 191:105–144, 2021.
Y. Min, T. Wang, R. Xu, Z. Wang, M. Jordan, and Z. Yang. Learn to match with no regret:
Reinforcement learning in markov matching markets. In A. H. Oh, A. Agarwal, D. Belgrave,
and K. Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https:
//openreview.net/forum?id=R3JMyR4MvoU.
H. Narasimhan, S. Agarwal, and D. C. Parkes. Automated mechanism design without money
via machine learning. In Proceedings of the 25th International Joint Conference on Artificial
Intelligence, pages 433–439, 2016.
P. A. Pathak and T. Sönmez. Leveling the playing field: Sincere and sophisticated players in the
Boston mechanism. American Economic Review, 98(4):1636–1652, 2008.
J. Rahme, S. Jelassi, J. Bruna, and S. M. Weinberg. A permutation-equivariant neural network
architecture for auction design. CoRR, abs/2003.01497, 2020. URL https://arxiv.org/abs/
2003.01497.
A. E. Roth. The economics of matching: Stability and incentives. Mathematics of Operations
Research, 7(4):617–628, 1982.
A. E. Roth. A natural experiment in the organization of entry-level labor markets: Regional markets
for new physicians and surgeons in the united kingdom. The American Economic Review, 81(3):
415–440, 1991. ISSN 00028282. URL http://www.jstor.org/stable/2006511.
A. E. Roth and E. Peranson. The redesign of the matching market for american physicians: Some
engineering aspects of economic design. American Economic Review, 89(4):748–780, September
1999.
A. E. Roth and M. Sotomayor. Two-Sided Matching: A Study in Game-Theoretic Modeling and
Analysis, volume 18 of Econometric Society Monographs. Cambridge University Press, 1990.
A. E. Roth, U. G. Rothblum, and J. H. V. Vate. Stable matchings, optimal assignments, and linear
programming. Mathematics of Operations Research, 18(4):803–828, 1993.
A. Sankararaman, S. Basu, and K. Abinav Sankararaman. Dominate or delete: Decentralized
competing bandits in serial dictatorship. In A. Banerjee and K. Fukumizu, editors, Proceedings
of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of
Proceedings of Machine Learning Research, pages 1252–1260. PMLR, 13–15 Apr 2021. URL
https://proceedings.mlr.press/v130/sankararaman21a.html.
L. Shapley and H. Scarf. On cores and indivisibility. Journal of Mathematical Economics, 1
(1):23–37, 1974. ISSN 0304-4068. doi: https://doi.org/10.1016/0304-4068(74)90033-0. URL
https://www.sciencedirect.com/science/article/pii/0304406874900330.
W. Shen, P. Tang, and S. Zuo. Automated mechanism design via neural networks. In Proceedings of
the 18th International Conference on Autonomous Agents and Multiagent Systems, 2019.
T. Sönmez. Bidding for army career specialties: Improving the ROTC branching mechanism. Journal
of Political Economy, 121:186–219, 2013.
T. Sönmez and T. B. Switzer. Matching with (branch-of-choice) contracts at United States Military
Academy. Econometrica, 81:451–488, 2013.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin.
Attention is all you need. Advances in neural information processing systems, 30, 2017.
A RSD is ordinal SP but not Stable
We first show RSD satisfies FOSD and is thus ordinal SP. Consider agent i in some position k in the
order. The agent’s report has no effect on the choices of preceding agents, whether workers or firms
(including whether agent i is selected by an agent on the other side). Reporting its true preference
ensures, in the event that it remains unmatched by position k, that it is matched with its most
preferred agent of those remaining. For the one-sided version, the same argument holds for agents that are in the priority order. If an agent is not on the side included in the priority order, then that agent's report has no effect at all.
In the following example, we show that the RSD mechanism is not stable (the example also illustrates the failure of strategy-proofness for DA).
Example 16. Consider n = 3 workers and m = 3 firms with the following preference orders:
w1 : f2 , f3 , f1 , ⊥ f1 : w1 , w2 , w3 , ⊥
w2 : f2 , f1 , f3 , ⊥ f2 : w2 , w3 , w1 , ⊥
w3 : f1 , f3 , f2 , ⊥ f3 : w3 , w1 , w2 , ⊥
The matching found by worker-proposing DA is (w1 , f3 ), (w2 , f2 ), (w3 , f1 ). This is a stable match-
ing. If f1 truncates and misreports its preference as f1 : w1 , w2 , ⊥, w3 , the matching found is
(w1 , f1 ), (w2 , f2 ), (w3 , f3 ). Firm f1 is matched with a more preferred worker, and hence the mech-
anism is not strategy-proof. Now consider the matching under RSD. The marginal matching
probabilities r are given by:

    r = [ 11/24   1/4    7/24
           1/6    3/4    1/12
           3/8     0     5/8  ]
Worker w2 and firm f2 are each other's most preferred options, and they would always prefer to be matched with each other rather than being only fractionally matched. Here (w2, f2) is a blocking pair, and thus RSD is not stable.
B TTC is neither SP nor Stable

Example 17. Consider n = 4 workers and m = 4 firms with the following preference orders:

w1 : f1, ⊥        f1 : w2, w3, w4, ⊥
w2 : f2, ⊥        f2 : w1, ⊥
w3 : f1, ⊥        f3 : w3, ⊥
w4 : f3, ⊥        f4 : ⊥
If all agents report truthfully, w1 is matched with f1. This violates IR, as ⊥ ≻_{f1} w1, and thus the matching is not ex ante stable. If f1 misreports its preference as f1 : w4, w3, ⊥, then w3 is matched with f1 (see Figure 4). Since f1 is matched with a more preferred worker w3, with w3 ≻_{f1} ⊥ ≻_{f1} w1, TTC is not strategy-proof.
Figure 4: Round 1 of TTC. The solid lines represent workers and firms pointing to their top preferred agent
truthfully. The dashed line represents a misreport by f1.
C Proof of Theorem 13
Proof. Since stv(g^θ, ≻) ≥ 0, we have STV(g^θ) = E_≻[stv(g^θ, ≻)] = 0 if and only if stv(g^θ, ≻) = 0 except on zero-measure events. Moreover, stv(g^θ, ≻) = 0 implies stv_{wf}(g^θ, ≻) = 0 for all w ∈ W and all f ∈ F. This is equivalent to no justified envy. For firm f, this means that for all w′ ≠ w, q^≻_{wf} ≤ q^≻_{w′f} if g^θ_{w′f} > 0, and q^≻_{wf} ≤ 0 if g^θ_{⊥f} > 0. Then there is no justified envy for a firm f. Analogously, there is no justified envy for any worker w, and thus g^θ is ex ante stable. ∎
D Proof of Theorem 15
Proof. Since regret(g^θ, ≻) ≥ 0, we have RGT(g^θ) = E_≻[regret(g^θ, ≻)] = 0 if and only if regret(g^θ, ≻) = 0 except on zero-measure events. Moreover, regret(g^θ, ≻) = 0 implies regret_w(g^θ, ≻) = 0 for any worker w and regret_f(g^θ, ≻) = 0 for any firm f. Thus, the maximum utility increase from misreporting is at most zero, and hence g^θ is ordinal SP. Conversely, if g^θ is ordinal SP, then it also satisfies FOSD by Theorem 5, and it is straightforward to show that regret(g^θ, ≻) = 0. ∎