KEMBAR78
[python] [dask] add initial dask integration by SfinxCZ · Pull Request #3515 · microsoft/LightGBM · GitHub
Skip to content

Conversation

SfinxCZ
Copy link
Contributor

@SfinxCZ SfinxCZ commented Nov 1, 2020

As agreed with @jameslamb in dask/community#104 this pull request moves all functionality from https://github.com/dask/dask-lightgbm repo directly into main LightGBM repo.

Note that this is WIP and most of the code should be changed to match LightGBM standards.

Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @SfinxCZ ! I left a small suggestion for setup.py.

Please also fix the the linting issues (https://travis-ci.org/github/microsoft/LightGBM/jobs/740698092). You can replicate them by running this from the root of the repo:

pycodestyle --ignore=E501,W503 --exclude=./compute,./.nuget .

pydocstyle --convention=numpy --add-ignore=D105 --match-dir="^(?!^compute|test|example).*" --match="(?!^test_|setup).*\.py" .

@jameslamb
Copy link
Collaborator

jameslamb commented Nov 2, 2020

@SfinxCZ I want to be sure to respect your time...would you like me to open a pull request into this branch that tries to fix the other CI jobs?

Once CI is working, then we can have reviews that discuss the API, documentation, etc.

@jameslamb
Copy link
Collaborator

As you look at CI, you can safely ignore these:

  • GitHub Actions: R GitHub Actions / r-package (windows-latest, MINGW, R 3.6, cmake) (pull_request)
  • Travis: check-docs

We have open PRs trying to fix some issues with them right now.

@jameslamb
Copy link
Collaborator

As you look at CI, you can safely ignore these:

  • GitHub Actions: R GitHub Actions / r-package (windows-latest, MINGW, R 3.6, cmake) (pull_request)
  • Travis: check-docs

We have open PRs trying to fix some issues with them right now.

ok, these PRs I referenced have been merged. So if you merge the latest master into this branch, it should help with CI.

@jameslamb
Copy link
Collaborator

One thing I noticed that I want to bring to your attention @SfinxCZ ....I notice that all of your commits look anonymous (not connected to your GitHub account)

image

This probably means you are pushing from a place where you git config has an email that is not linked to your GitHub. I want to be sure that you get credit in the contributors list when we merge this, so I think you should overwrite these commits and be sure to configure git with an email tied to your GitHub account. I've documented how to do this here, linking in case it helps: uptake/pkgnet#284 (review)

@SfinxCZ SfinxCZ force-pushed the master branch 10 times, most recently from c3eef74 to 9c1db8c Compare November 7, 2020 21:27
@jameslamb
Copy link
Collaborator

jameslamb commented Nov 13, 2020

@SfinxCZ how can I help with this?

I see the Dask tests are failing on Windows. I think it's absolutely ok to skip the Dask tests on Windows for the sake of this PR. When we add lightgbm.dask it will be marked as "experimental" anyway, and we could try adding Windows support (or fixing Windows CI, if it's just a CI issue) in a separate PR.

xgboost has also had issues with Dask on Windows: https://github.com/dmlc/xgboost/blob/4ccf92ea344ac3704c9b501ae88bb106d82f71d6/python-package/xgboost/dask.py#L92

@SfinxCZ SfinxCZ marked this pull request as ready for review November 14, 2020 20:29
@SfinxCZ
Copy link
Contributor Author

SfinxCZ commented Nov 14, 2020

I've fixed the CI scripts, the only problem is in https://travis-ci.org/github/microsoft/LightGBM/jobs/743647169 which I assume is some random fail, however I cannot rerun the pipeline. @jameslamb, can you please rerun the pipeline to check whether the fail was indeed some random fluctuation?

The code is ready for review. There are some restrictions that I needed to set. First, only linux is supported for now (can be extended in different PR, as suggested by @jameslamb). And second is that only python >= 3.6 is supported. The reason to drop support for python 2.7 is that dask does not support python 2.7 either since version 2.0. If you really need the support I can try to add it but I believe that supporting python 2.7 is not necessary these days.

@StrikerRUS
Copy link
Collaborator

If you really need the support I can try to add it but I believe that supporting python 2.7 is not necessary these days.

We are going to drop Python 2.7 support in a next few days with 3.1.0 release: #3484 (comment). So, don't worry about it.

@jameslamb
Copy link
Collaborator

@jameslamb, can you please rerun the pipeline to check whether the fail was indeed some random fluctuation?

Yep, no problem! I just restarted it. That task (Travis, TASK=check-docs) scrapes all the links from our documentation and checks that they're still active. It fails sometimes due to random network issues.

Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks very much for all the efforts and for being open to this, @SfinxCZ !

I left a few suggestions. I tried to only focus my review on things like the public API of this new module. We can consider other changes to the internals of these functions / classes through normal maintenance in later PRs.

I'd like to make two additional requests, related to governance over this code after this PR:

  1. Can you please add a line to CODEOWNERS (https://github.com/microsoft/LightGBM/blob/master/.github/CODEOWNERS#L32) that makes sure I'm added as a reviewer on any PRs that touch the Dask module? I committed to the other LightGBM maintainers that if we bring dask-lightgbm into this project, I'll take on the primary maintenance burden (#2791 (comment))

  2. Can you please let me know how involved you want to be after this PR is merged? I want to be respectful of all the awesome work you and the other dask-lightgbm maintainers have done, but I also want to give you the opportunity to eventually step away from it if you'd like. It sounds like @striajan is ok not being involved (dask/community#104 (comment)).


Notes for other LightGBM maintainers

I think we should not merge this until 3.1.0 is released (#3484 ), and then try to target including it in an eventual 3.2.0 release. That way, we can make breaking changes if necessary, add documentation with examples, try to add support for Mac and Windows, etc.

@SfinxCZ SfinxCZ requested a review from guolinke as a code owner November 15, 2020 11:17
@jameslamb jameslamb changed the title [python] add initial dask integration [python] [dask] add initial dask integration Dec 22, 2020
@jameslamb
Copy link
Collaborator

thanks very much for all your hard work @SfinxCZ ! And to the other dask-lightgbm maintainers for starting us on this path.

I'm going to merge this, and later tonight I'll add some issues to #2302 based on the review comments here and some other ideas I have for the Dask integration.

@jameslamb jameslamb merged commit d90a16d into microsoft:master Dec 22, 2020
@jameslamb
Copy link
Collaborator

Sorry, got busy with other things today. Will add more Dask issues to the backlog tomorrow.

@pseudotensor
Copy link

pseudotensor commented Dec 30, 2020

FYI:

Fix for early stopping: h2oai/dask-lightgbm@8cc8e83
Fix for pred_contribs: h2oai/dask-lightgbm@c19000b
Don't listen to worker num_threads, unrelated to n_jobs for lightgbm in typical 1 worker 1 process mode: h2oai/dask-lightgbm@d413698 (ignore DAI specific things)

@SfinxCZ @jameslamb

Early stopping would fail and I noticed dask-lightgbm package basically had no real dask support for it. So the above adds. It also fixes bug in pred_contribs and a poor setting for num_threads. These are pretty critical changes to dask-lightgbm.

@jameslamb
Copy link
Collaborator

@pseudotensor this is the first time I've seen h2oai/dask-lightgbm. Could you please explain the purpose of that project?

I'd prefer that you create pull requests here instead of making changes like that on a fork, so that LightGBM users have a single place to go for LightGBM-on-Dask, and so we don't duplicate efforts (the goal of dask/community#104).

The changes you linked above look general enough that they could be contributed directly to LightGBM.

@pseudotensor
Copy link

pseudotensor commented Dec 30, 2020

@jameslamb Purpose is to actually use dask-lightgbm. I wasn't getting nearly any responses for bugs I was reporting to the dask-lightgbm repo, so I didn't have much hope for making PRs. I only right now became aware that dask-lightgbm is being put into lightgbm, which is excellent, so I'm posting those changes so someone can make a PR to lightgbm directly.

@pseudotensor
Copy link

The changes you linked above look general enough that they could be contributed directly to LightGBM.

Ya, apart from the 3rd commit that has some import line, yes, all 3 should be applicable to lightgbm. I can make a PR in a few days if that is requested.

@jameslamb
Copy link
Collaborator

Thanks very much!

I'm posting those changes so someone can make a PR to lightgbm directly.
I can make a PR in a few days if that is requested.

I'd be happy to take these changes and PR them, if you'd like! Then you don't have to deal with tests and CI. But if you'd rather contribute them I'm also happy to let you do it.

I wasn't getting nearly any responses for bugs I was reporting to the dask-lightgbm repo, so I didn't have much hope for making PRs.

You should expect much more active development on the integration now that the code is here in LightGBM. So if you're using it and run into problems or feature gaps, we'd very much appreciate you creating issues and pull requests so we know what's missing.

Thanks for helping us improve LightGBM!

@pseudotensor
Copy link

Yes please make the relevant PRs if you can, thanks! I maybe didn't do the cleanest/shorted code, but I didn't like the indexing by raw index that wouldn't work in general for the parts.

@ffineis
Copy link
Contributor

ffineis commented Jan 2, 2021

Yes please make the relevant PRs if you can, thanks! I maybe didn't do the cleanest/shorted code, but I didn't like the indexing by raw index that wouldn't work in general for the parts.

Hey @pseudotensor heads up, I've opened #3708, noticed your changes to support validation sets. Happy to take these over for you so that we both get what we want - support for validation sets and group/pairwise ranking. Lmk!

@pseudotensor
Copy link

Ya, go for it. Just coordinate with @jameslamb since he also mentioned he would do some stuff.

@jameslamb
Copy link
Collaborator

I've added the suggestions from #3515 (comment) as individual issues, so let's continue the conversation there if either of you have further thoughts.

I'll start working on these this week and will review @ffineis 's ranker PR

@lostmygithubaccount
Copy link

@jameslamb apologies for the ping on the PR, not sure where a better location would be

I'm trying to build a distributed example on Azure Machine Learning using lightgbm.dask.

But more importantly, an internal Bing team is looking to move a large training workload from an internal tool to the public LightGBM package. We are prototyping the move now, and hoping to use an official release of lightgbm which contains the dask submodule.

When can we expect the release? Is there a place this is documented?

@jameslamb
Copy link
Collaborator

jameslamb commented Jan 11, 2021

@lostmygithubaccount no problem! That's so exciting!

Short answer: there is no place where we've documented when the next release (3.2.0) will come out.

Long answer: We've recently been trying to do LightGBM releases at least once every two months. This time of year, with people out for various holidays, you can expect it to be a little longer. The most recent release (3.1.1) was in early December, so I think it's reasonable to expect that 3.2.0 will be out in February. But that's only a best guess, not a commitment...it depends on the availability of maintainers and on discussions about what we want to be in that release, which have not been had yet.

I know it's not ideal, but if you want to start testing the new Dask integration you can try to adopt my testing code from #3515 (review). I'd actually really really appreciate any feedback and bug reports your team could offer on it. I'm putting a lot of attention into it over the next few weeks so we can feel confident in it when we release next.

If you have additional questions, please open a new issue and ask there. Other maintainers and contributors will see it better in an issue than on a closed PR.

@jameslamb
Copy link
Collaborator

@lostmygithubaccount in preparation for some meetup talks, I recently put together an example for using LightGBM with Dask that I think is a lot easier than the comment I linked you to above.

https://github.com/jameslamb/talks/tree/main/recent-developments-in-lightgbm

I'd love to hear more about what your team is working on and what features really matter for you in the new Dask module. Feel free to contact me at any of the places documented in https://github.com/jameslamb, and maybe we could set up some time to talk.

@github-actions
Copy link
Contributor

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 24, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants