-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Monotone constraint support for LightGBM #2330
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
LightGBM. This is done through the LightGBM Options class via the MonotoneConstraints member. To handle the monotone constraints, this adds the ability to specify whether to use a positive constraint or a negative constraint along with a range. Multiple ranges can be specified. This checkin also includes tests for the parsing of the ranges to validate the expected value that will be passed to LightGBM. This fixes dotnet#1651.
|
||
var transformedDataView = pipe.Fit(dataView).Transform(dataView); | ||
var model = trainer.Train(transformedDataView, transformedDataView); | ||
Done(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you check if prediction values are really monotone? For example, feeding 10 examples such as [0, 0, 1, 0], [0, 0, 2, 0], [0, 0, 3, 0], ... into the trained model and then examine their prediction values. #Resolved
var subArguments = argument.Split(':'); | ||
if (subArguments.Length != 2) | ||
// Invalid argument, skip to the next argument | ||
continue; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it's reasonable to force a user to know their slot numbers. The exact contents of the slots are generally a blackbox to ML.NET users (exception is for a purely numeric dataset without featurization which alters the slots).
Perhaps we could add support for just a plain "pos", "neg" keyword which then applies to all slots. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree - so I added this feature along with tests.
In reply to: 252441134 [](ancestors = 252441134)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@singlis: Is the change visible in the PR? Or is it yet to be committed? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! #Resolved
@Ivanidzo4ka brought up in #1992 (comment):
Would the Random Forest support be a good addition to this PR? #Resolved |
"specified followed by a range. For example, pos:0-2 neg:3,5 will apply a positive constraint to the " + | ||
"first three features and a negative constraint to the 4th and 6th feature. If feature index is not specified, " + | ||
"then no constraint will be applied.")] | ||
public string[] MonotoneConstraints; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public string[] MonotoneConstraints; [](start = 8, length = 36)
A) please unblock and run test RegenerateEntryPointCatalog on your local machine and push changes in core_ep-list.tsv.
B) How it would handle null value? I don't see test covering that.
C) How it would handle gibberish values? Like Bubba, TabbyCat
? What kind of exception it throw, was that exception be enough to understand something is wrong with this parameter? #Closed
} | ||
|
||
[ConditionalFact(typeof(Environment), nameof(Environment.Is64BitProcess))] // LightGBM is 64-bit only | ||
public void LightGBMBinaryEstimatorMonotoneThroughOptions() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LightGBMBinaryEstimatorMonotoneThroughOptions [](start = 20, length = 45)
Can we also have baseline tests?
we have tests like this: RegressorLightGBMTest
var regPredictors = new[] { TestLearners.LightGBMReg };
you can define new learner like
public static PredictorAndArgs LightGBMRegWithMonotone = new PredictorAndArgs { Trainer = new SubComponent("LightGBMR", "nt=1 iter=50 v=+ booster=gbdt{l1=0.2 l2=0.2} lr=0.2 mil=10 nl=20 MonotoneConstraints=pos:1-3 MonotoneConstraints=neg:4"), Tag = "LightGBMReg", BaselineProgress = true, };
and add it to regPredictors.
#Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same for all types of trainers, but it shouldn't be hard I believe. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[TlcModule.SweepableDiscreteParam("CatL2", new object[] { 0.1, 0.5, 1, 5, 10 })] | ||
public double CatL2 = 10; | ||
|
||
[Argument(ArgumentType.Multiple, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Argument [](start = 9, length = 8)
please add ShortName, if someone would ever need to use that in command line, he would curse you for typing this words again and again. #Closed
@singlis thank you for doing this! I left few comments regarding tests and small improvements. #Resolved |
- Added Predictor tests that use the monotone constraints for LightGBM, LightGBMMC, and LightGBMR. - Added test to confirm that the monotonic trends work as expected - Added additional tests based upon review feedback.
// set the same constraint for all features. | ||
if (subArguments.Length == 1) | ||
{ | ||
for(int i = 0; i < featureCount; i++) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for(int i = 0; i < featureCount; i++) | |
for (int i = 0; i < featureCount; i++) | |
``` #Resolved |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks Justin - Ill format the files.
#Resolved
|
||
// Process each range | ||
foreach(var indexRange in indexRanges) | ||
for(int i = indexRange.min; i <= indexRange.max; ++i) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for(int i = indexRange.min; i <= indexRange.max; ++i) | |
for (int i = indexRange.min; i <= indexRange.max; ++i) | |
``` #Resolved |
} | ||
|
||
// Process each range | ||
foreach(var indexRange in indexRanges) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
foreach(var indexRange in indexRanges) | |
foreach (var indexRange in indexRanges) | |
``` #Resolved |
var negativeTestData = GenerateTestFeatures(1, numFeatures); | ||
|
||
var lastValue = float.MinValue; | ||
foreach(var x in positiveTestData) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
foreach(var x in positiveTestData) | |
foreach (var x in positiveTestData) | |
``` #Resolved |
} | ||
|
||
lastValue = float.MaxValue; | ||
foreach(var x in negativeTestData) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
foreach(var x in negativeTestData) | |
foreach (var x in negativeTestData) | |
``` #Resolved |
// neg:6-7 | ||
// Since the argument is a string array, mulitple ranges can be specified. | ||
// In the event of the same index being specified twice, the last one wins. | ||
// The end result is a string array of 1s,0s, and -1s that should be equal |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// The end result is a string array of 1s,0s, and -1s that should be equal | |
// The end result is a string array of 1s, 0s, and -1s that should be equal | |
``` #Resolved |
var subArguments = argument.Split(':'); | ||
if (subArguments.Length > 2) | ||
// Invalid argument, skip to the next argument | ||
continue; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're currently silently ignoring the user's (bad) input:
if (subArguments.Length > 2)
// Invalid argument, skip to the next argument
continue;
Do we generally throw for bad input? #Resolved
else if (keyword.Equals(negativeKeyword, StringComparison.OrdinalIgnoreCase)) | ||
constraint = -1; | ||
else | ||
continue; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're also silently ignore the user input here. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Justin, I am now throwing. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks for adding!
Minor feedback above.
Codecov Report
@@ Coverage Diff @@
## master #2330 +/- ##
=========================================
Coverage ? 71.3%
=========================================
Files ? 785
Lines ? 141185
Branches ? 16128
=========================================
Hits ? 100679
Misses ? 36036
Partials ? 4470
|
Not training a calibrator because it is not needed. | ||
TEST POSITIVE RATIO: 0.3702 (134.0/(134.0+228.0)) | ||
Confusion table | ||
||====================== |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
||====================== [](start = 10, length = 24)
Are they really different between release and debug, or we can move them to Common? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are not different. I have another PR (not out yet) that moves all LightGBM tests to common, so I put these in release and debug for now and planned on moving on the next PR.
#Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will go ahead and move these to Common
#Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's not hard for you. It just if you add them here, and then move them to Common directory, we just increase size of our git repo, without any reason whatsoever.
In reply to: 254080535 [](ancestors = 254080535)
- Now throws when an invalid argument is parsed. - Updated tests
"first three features and a negative constraint to the 4th and 6th feature. If feature index is not specified, " + | ||
"then no constraint will be applied. The keyword of 'pos' or 'neg' without a range will apply the constraint to all features.", | ||
ShortName="mc")] | ||
public string[] MonotoneConstraints; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MonotoneConstraints [](start = 24, length = 19)
Sorry what I bring it on 9th iteration, but just curious, wouldn't it be easier to have two options, one is positive monotone constraints, and other negative with types int[]?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or array of Range, or what class we use in textloader to specify range of columns.
In reply to: 254096866 [](ancestors = 254096866)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using the TextLoader's text-based ranges offers interesting abilities, like "~" and "*
". For example setting "pos:3-*" so the user doesn't need to know the total number of slots.
The current method is interesting as it allows for setting everything positive, then some negative "pos neg:5-10". The gain here is the user doesn't need to know the total number of slots.
I don't see a shortcut to set positive on all except for a few: "pos neutral:5-10". If you know the total number of slots, you could do "pos:0-4,11-547542".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So initially I did not use TextLoader Range as it looked to contain properties specific to Text Loading -- however, your comments resonated with me so I have made some changes:
- I created a new Range class in Microsoft.ML.Core - this is meant to be a more generic Range class that can be used by other Arguments if a range is needed. Tests were added as well.
- I removed MonotonicConstraints as an argument and added MonotonicPositive and MonotonicNegative - these are now Range arrays.
- I updated tests and baseline files to reflect these changes.
Overall I like this better as it follows the range format that is already used for text loader. If you want to select all features you can use the * and set the range to be 0-*. Short names for these arguments are mp and mn.
Take a look - Im curious to what you think. There maybe an opportunity to merge the Microsoft.ML.Core Range and the TextLoader Range, but I think that should be a different discussion.
In reply to: 254097097 [](ancestors = 254097097,254096866)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't expect you to take me seriously on this proposal, I was just thinking out loud.
I'm fine with changes, but I would like @TomFinley to give his opinion regarding new Range
class, and how it fit in our API story.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Ivanidzo4ka, it looks this PR fails .netcore30 builds because there is now a System.Range and that is causing some compiler confusion as to which one is being referenced. So I am curious as to what @TomFinley has to say.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No I think exposing Random Forest should be done in a separate PR. In reply to: 459224733 [](ancestors = 459224733) |
MonotonicNegative (based upon feedback). - Added a Range class to support specifying ranges similar to TextLoader. - Updated unit tests, updated entry points.
|
||
[Fact(Skip = "Execute this test if you want to regenerate the core_manifest and core_ep_list files")] | ||
// [Fact(Skip = "Execute this test if you want to regenerate the core_manifest and core_ep_list files")] | ||
[Fact] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ill revert this.
var mlContext = new MLContext(); | ||
var dataView = GetRegressionPipeline(); | ||
var trainer = mlContext.Regression.Trainers.LightGbm(new Options() { | ||
MonotonicPositive = new Range[] { new Range(0,null) }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
null [](start = 66, length = 4)
Adding null here was a test, I will revert this.
public void LightGBMMonotonicArgumentExtendToEnd() | ||
{ | ||
Dictionary<string, object> options = new Dictionary<string, object>(); | ||
options["monotone_constraints"] = (positive: new Range[] { new Range(0, null) }, negative: new Range[] { }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
negative: new Range[] { } [](start = 93, length = 25)
it looks weird to specify empty range. Is it necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am short circuiting things here for testing. This sets the monotone_constraints parameter in the options dictionary so that I can call ExpandMonotoneConstraints to test the output. But as for the user - they dont have to specify both, neither from command line usage or api usage.
{ | ||
public class Range | ||
{ | ||
public Range() { } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Range [](start = 15, length = 5)
How it's related to Range defined in TextLoader?
|
||
namespace Microsoft.ML.Data | ||
{ | ||
public class Range |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't be adding a public type named Range
. It conflicts with the new System.Range
.
This would be equivalent to naming one of our new classes String
or Object
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please rename the Range
type, or move it to be nested in some other class, like we do elsewhere.
This add the ability to set the monotone_constraints argument for
LightGBM. This is done through the LightGBM Options class via the
MonotoneConstraints member. To handle the monotone constraints, this
adds the ability to specify whether to use a positive constraint or a
negative constraint along with a range. Multiple ranges can be
specified.
This checkin also includes tests for the parsing of the ranges to
validate the expected value that will be passed to LightGBM.
This fixes #1651.