How To Succeed With A ML Project

Uploaded by

Efrat Magidov

How to succeed with a Machine Learning project
Eddie Smolyansky eddiesmo@gmail.com
CEO & cofounder @ Connected Papers
---
Previously, Head of Research @ Alibaba Israel Machine Intelligence Lab
Lecture goals
Pass on knowledge and tips, based on previous experience with projects (both in
and out of Y-data), to improve your experience and chances of success.

Disclaimers:

- People of various backgrounds here
- Some of the slides will be trivial
  - I still believe it’s good to hear this explicitly at least once

Let’s make this a discussion


Today’s agenda
- Motivation for a Y-data project
- What is a project? We’ll discuss the various stages:
project definition, literature review, iterative implementation, presentation.
- How to define a project well: goals, metrics, data
- Clarify team roles
- Suggest tools and habits for managing a project
- Set expectations for project introduction presentations
- Give an example project introduction presentation from Y-data 2019
About me
- Computer Vision engineer @CorePhotonics (acquired by Samsung)
- Head of Research @Visualead (acquired by Alibaba)
- Built the research groups (CV, AutoML) @Alibaba Israel Machine Intelligence Lab
- Thinking about improving research
- Researching an idea: can we have a Goodreads/IMDB for papers?
- Guest lectures & mentoring @Yandex Data Science Academy (Deep Learning)
- Launched Connected Papers as a side project

* Disclaimer: biased towards deep learning and computer vision problems


Learn about the audience [results in brackets]
- Have you ever done a team project in the tech industry? [~50% yes]
- Have you ever worked with GIT? [~70% yes]
- Remote computers? [~70% yes]
- Python beyond this course? [~70% yes]
- How many project team meetings have you had by now? [~10%: 0, ~45%: 1, ~45%: 2+]
- Did you get at least your first vaccination? [~80% yes]


Motivation: Why do a project
1. Practice in a safe environment - guided end to end
2. Learn what you enjoy
3. Collaborate in a group with real tools
4. Optimize for learning
5. Add something to your resume
a. Network (students, mentors, industry)
Rough project timeline
1. Team<>Project assignment
2. Team introductions
3. Project in depth definition → Project intro presentation
4. Iterative implementation
a. Literature search
b. Implement
c. Evaluate and analyze errors
5. Final presentation (final week, if you’ve done the rest well)
Roles - students
Students: radical responsibility for the project - take ownership and make it
happen.

That means you make sure meetings happen and everyone’s there, prepare
meeting agendas, summaries and action items, and keep everyone in the loop. It’s
very little actual work and makes a huge difference in results.

One possible way to divide work between students is to have one take on more
technical responsibilities (accessing and preprocessing the data, setting up the
technical environment) and the other take on management responsibilities (meeting
and task management, planning, documenting).
Roles - mentor
- Beginning
  - Help jump-start the project
  - Direct initial learning
- Middle
  - Nudge students in the right direction
  - Help avoid obvious pitfalls (experience, intuition) and direct to solid choices
  - Help when stuck on technical/implementation difficulties
- End
  - Help prepare for the final presentation

Give tips and guidance relevant to all aspects of the project: planning, research,
implementation.

Note: mentors are allocated ~1 hour a week to help, so use their time wisely.
Roles - industry partner
Provide well annotated, relevant and sufficient data

Make access to data as easy as possible

Provide special software infrastructure and guidance as needed

Define success (hopefully with tiers)

Continue with the team in the process, serve as a second mentor and help them
solve problems
Project milestones
Project definition
When a project starts you get a project page with a short description and goals.
Many times, as you start working on the project, you realize things are missing or
badly defined. For example:

- The data is irrelevant, badly annotated, or not enough


- The suggested metrics don’t make sense for the task/goal
- The goal is unrealistic
Project definition
Your job in the first ~month is to OWN the project description:

- Learn and understand the relevant parts of the project domain and all the
keywords in the project description
- Make sure you understand the basic data pipeline and formats (what type the
input/output is) and how metrics are calculated
- Review existing projects/literature, to get a feel for possible performance
- Explore and play with the data, get a feel for it
- Discuss with your mentors
- Finally - review the project description page - does it still make sense? You’re
now ready to make a rough plan
Data exploration
- Statistical metadata exploration:
  - Describe the dataset and find patterns. Try to visualize.
  - Pay attention to data formats and types, data ranges
  - If there are multiple sources of data, pay attention to the differences
  - Find anomalies and outliers, nulls/missing values
- Experience the data
  - Take a few hours and browse through the data. Make sure you “feel” it.
  - Try to annotate it yourself, or as close to it as possible. How easy is it?
    Are the annotations well defined?
  - Find edge cases where things are unclear or break down
- Consider tweaking/pre-processing the data to better fit your problem
- What kind of augmentations would work well for this data?
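A minimal sketch of the statistical metadata pass described above, using only the Python standard library. The `records` list, its field names, and the "5x the median" outlier rule are illustrative assumptions, not part of any project's real data; a real project would load its own annotation metadata and choose its own thresholds.

```python
import statistics
from collections import Counter

# Hypothetical per-image metadata (assumption: made up for illustration).
records = [
    {"source": "lab_a", "width": 640, "height": 480, "num_objects": 1},
    {"source": "lab_a", "width": 640, "height": 480, "num_objects": 2},
    {"source": "lab_b", "width": 300, "height": 300, "num_objects": None},
    {"source": "lab_b", "width": 4000, "height": 3000, "num_objects": 1},
    {"source": "lab_b", "width": 320, "height": 300, "num_objects": 40},
]


def summarize(records, field):
    """Count missing values and report min, max and median of a numeric field."""
    values = [r[field] for r in records if r[field] is not None]
    return {
        "missing": len(records) - len(values),
        "min": min(values),
        "max": max(values),
        "median": statistics.median(values),
    }


def flag_outliers(records, field, factor=5):
    """Return values larger than `factor` times the median (a crude heuristic)."""
    values = [r[field] for r in records if r[field] is not None]
    median = statistics.median(values)
    return [v for v in values if v > factor * median]


# How many records come from each data source?
per_source = Counter(r["source"] for r in records)

object_summary = summarize(records, "num_objects")
object_outliers = flag_outliers(records, "num_objects")

print(per_source)
print(object_summary)
print(object_outliers)
```

Numbers like these are only a starting point: a flagged value may be an annotation error or a legitimate hard case, which is why browsing the data by hand still matters.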
Example - find branching points
- Unclear where to mark exactly
- Different branch types - can’t always recognize
- How to annotate cut branches?

Literature review
- Why it is important
  - Don’t start from zero - start from the closest possible solution
  - Learn about different approaches
  - Get a feeling for the state of the art
- Tools
  - Google, Semantic Scholar, Papers with Code, Connected Papers
Discover (days to weeks)
- Result: long list of papers
- How: keyword search, alerts, social groups, crawl references, blogs, newsletters

Skim (minutes)
- Result: aware of paper; short list of papers
- How: abstract, figures and tables, results, metadata (authors, lab), popularity,
  external sources (blogs, videos)

Read (hours)
- Result: understand the paper
- How: hard, non-linear, iterative process

Own (days to weeks)
- Result: able to reproduce the paper, find mistakes, improve on it
- How: code, experiments
Literature review - example
Example 1:
Make a plan
- Listen to your mentors!
  - Depends on mentor recommendations and strengths
- Consider what you’re optimizing for
  - In depth understanding → implement simple things yourself
  - State of the art → find open-source implementations and fine-tune them
- Break into small tasks/milestones and celebrate success.
  - Move one step at a time and verify it works. Otherwise, when things break
    down, you won’t know where the problem is.
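One way to practice "move one step at a time and verify it works" is to check the output of every pipeline stage before moving to the next. The sketch below is a toy illustration only: the stages (`normalize`, `threshold`) and the helper `check` are hypothetical stand-ins for real preprocessing and model code.

```python
def check(condition, message):
    """Fail loudly at the step that broke, instead of somewhere downstream."""
    if not condition:
        raise ValueError(message)


def normalize(pixels):
    """Scale 8-bit pixel values to the range [0, 1]."""
    return [p / 255 for p in pixels]


def threshold(values, cutoff=0.5):
    """Toy stand-in for a model: label a pixel 1 if it is brighter than cutoff."""
    return [1 if v > cutoff else 0 for v in values]


def run_pipeline(pixels):
    # Step 1: verify the raw input before touching it.
    check(len(pixels) > 0, "empty input")
    check(all(0 <= p <= 255 for p in pixels), "raw pixels outside 0..255")

    # Step 2: verify the preprocessing output.
    scaled = normalize(pixels)
    check(all(0.0 <= v <= 1.0 for v in scaled), "normalized values outside 0..1")

    # Step 3: verify the prediction has the expected size and labels.
    mask = threshold(scaled)
    check(len(mask) == len(pixels), "mask size differs from input size")
    check(set(mask) <= {0, 1}, "mask contains labels other than 0 and 1")
    return mask


mask = run_pipeline([0, 64, 128, 255])
print(mask)
```

When a check fails, the error message names the stage, so the search for the bug starts in the right place.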
Tools and habits
Project communication - slack
Guidelines:

- Smooth and constant communication - keep everyone in the loop


- Receiver should manage notifications
- Use and respond to @
Meetings
- Set a weekly meeting in the calendar
- Send an agenda before each meeting

During the meeting:

- Review previous action items
- Take meeting notes
- Write down action items with @ and soft due dates

The students are responsible for all the above.
Document research, decisions and development
Why document?

- Provides clarity to everyone, including the writer.


- Everyone’s in sync, can easily catch up, knows where the project stands.
- No miscommunications.
- Builds confidence in past decisions - no need to revisit decisions 10 times
because no one remembers the reasoning.
- Automatically tracks progress.
- At the end of the project, have your final presentation 80% prepared. No need
to hunt for certain images, rerun experiments, try to remember decisions, etc.

Recommended platforms: Slite (free up to 100 docs), Notion (free for personal use + guests)
Bonus slide: writing is thinking
Writing forces clarity on your own thoughts.

- Writing well is a super important skill for engineers


- Even more important with remote work
- Even more important in larger organizations (but useful even when solo)

(1) Gergely Orosz on Twitter: "Writing is one of the best things you can invest in, as a software engineer. The more experienced people
become, the more they tend to realize this. Here's a thread on the 6 best writing resources I've found - both to "convince" you to write
more and to help you "level up":" / Twitter
Bonus: task management platforms
Good free options:

- Asana
- Monday
Your first hour+ as a team
- Do personal introductions: 1 hour
  - Students:
    - Talk about your strengths and weaknesses
      - Suggested items: time availability, programming ability, experience with data
        science projects, domain knowledge, enthusiasm, etc.
    - Talk about your personal goals for the project - what do you most hope to improve at?
  - Industry partner:
    - Overview of the project, project importance, background and goals.
    - Discuss logistical issues: how students can get access to the data, what platforms they
      need to be familiar with
  - Mentor: availability, domain expertise, perhaps help jump-start the project
- Discuss a rough plan and milestones for the near future
- Set up a weekly meeting and a platform to accumulate knowledge and
  manage tasks
Consider software infrastructure
GIT

PyCharm vs Jupyter lab

Virtual environments

Remote computation?

Etc
GIT
Learn: Git-Book online
Just chapters 1-3 and you can start working. That’s ~23 pages.

Free client for Windows and Mac: Sourcetree

GIT can be confusing and frightening for years, or you can invest a few hours
once and actually *get it* - it’s really not hard and it’s everywhere - befriend it.
Time management tips
Time management has been a recurring problem in projects from past semesters.

- Work consistently from the start - don’t leave most of the work to the end. It
causes stress and reduces learning, which is the whole point.

- Use the first month to prepare everything which is not deep learning.

- Document as you go: decisions, experiment results, images and graphics. All
will be useful when you work on the final presentation.

- In the final weeks, think about which experiments you may need to run or
rerun and start them in advance - training neural nets takes time!

- Break the work into small milestones. The task will look more manageable
and you will feel the progress. Acknowledge small victories!
Project intro presentations
Instructions for your project intro presentations
Goals of the presentation:

- Make sure that you really understand your project and have a rough plan.
- Expose weaknesses such as unclear goals, unavailable data, unrealistic
timelines or various miscommunications.
- Get everybody on board (students, mentors, company)

Listeners: other students, at least 1 mentor

Your goal is to get the other students to really understand your project in X
minutes.

Presentation time: roughly 10 min (don’t obsess over it)


Instructions for your presentations - 2
- Introduce the project context and goals ~1 minute
- Explain the basic concepts and keywords in the problem domain
- Present your data exploration summary, with examples
- Explain the basic data pipeline: input, output, etc.
- Discuss the metrics
  - How are they calculated?
  - Are they realistic? How do they compare to SOTA?
- What are the expected challenges in the project?
- Present a rough plan for moving forward

No need for: self introductions (other than name), company introductions, etc.
A presentation is successful if the listeners really understand your project now.
Example project - leaf segmentation
Example 2019 project:
Leaf Segmentation
Data:
Several datasets provided, one public.

Goals:

- Basic: Achieve “AP75>0.85” on easy dataset: 1 leaf per image, no background.
- Advanced 1: Expand to harder dataset, with multiple leaves and real
  background, no defined target.
- Advanced 2: Apply to video, possibly with tracking, no defined target.
Project definition: what is segmentation

- Classification: a classification of the main object in the image
- Detection: classification and bounding box
- Segmentation: pixel-level multiclass classification of objects/background
Semantic vs Instance segmentation

Input: RGB image

Semantic segmentation output: each pixel gets a value 0 or 1

Instance segmentation output: each pixel gets a value in [0, 1, 2, 3, …, N],
with N being the number of instances
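To make the two output formats concrete, here is a small sketch with a hand-written mask. It assumes that label 0 means background and that labels 1..N are instance ids; the mask values and the helper names `to_semantic` and `count_instances` are made up for illustration.

```python
# Hypothetical 3x5 instance mask: 0 = background, 1 and 2 = two separate leaves.
instance_mask = [
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 2],
    [0, 0, 0, 2, 2],
]


def to_semantic(instance_mask):
    """Collapse instance ids into a binary leaf/background mask."""
    return [[1 if label > 0 else 0 for label in row] for row in instance_mask]


def count_instances(instance_mask):
    """Count the distinct non-background labels."""
    return len({label for row in instance_mask for label in row if label > 0})


semantic_mask = to_semantic(instance_mask)
num_instances = count_instances(instance_mask)

print(semantic_mask)
print(num_instances)
```

An instance mask can always be reduced to a semantic mask this way, but not the other way around: the semantic mask no longer says which pixels belong to which leaf.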
Metrics: what is “AP75>0.85”?
To understand average precision for segmentation, we must first understand
another metric: IoU - Intersection over Union

Average Precision 75 in this case basically means:

How many images got IoU > 75%?

Consideration: is AP75 the best metric? Why not just average IoU?

Another problem: the metric is unbalanced with regard to object size.
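The sketch below follows the simplified reading on this slide (the share of images whose IoU exceeds 0.75) and compares it with a plain average IoU. It is not the full benchmark definition of average precision, which is built on precision-recall curves. The masks are hypothetical, and treating two empty masks as a perfect match is an assumption.

```python
def iou(pred, target):
    """Intersection over Union of two binary masks (equal-shaped 2D lists)."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Assumption: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


def fraction_above(ious, threshold=0.75):
    """Share of images whose IoU is strictly above the threshold."""
    return sum(1 for value in ious if value > threshold) / len(ious)


# Hypothetical (prediction, ground truth) pairs, one per image.
pairs = [
    ([[1, 1], [1, 1]], [[1, 1], [1, 1]]),    # IoU = 4/4
    ([[1, 1, 1, 1, 0]], [[1, 1, 1, 1, 1]]),  # IoU = 4/5
    ([[0, 1, 1, 0]], [[1, 1, 0, 0]]),        # IoU = 1/3
]
ious = [iou(pred, target) for pred, target in pairs]
hit_rate = fraction_above(ious)    # 2 of the 3 images pass the 0.75 threshold
mean_iou = sum(ious) / len(ious)

# Object-size imbalance: missing one pixel costs a small object far more.
small_object_iou = iou([[1, 0]], [[1, 1]])            # 1/2
large_object_iou = iou([[1] * 19 + [0]], [[1] * 20])  # 19/20

print(ious, hit_rate, mean_iou)
print(small_object_iou, large_object_iou)
```

The thresholded score hides how badly the third image fails, while the mean IoU hides how many images are acceptable, which is one reason to discuss the choice of metric with the mentors early.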


Data

Variable image resolutions, minimum 300x300
Data - Visualized better
Data
- Multiple different datasets
- Annotated by different people with different instructions and precision levels
- Imbalanced number of annotations per dataset
- Different environments
- Not all leaves annotated
- Some leaves very small

Bottom line: even a human can’t get a good result!


Literature review - example
Challenges
1. Different goals may require different models.
2. Clarify the right metric
3. Unclear how different datasets will interact and transfer learned knowledge -
many experiments required
Rough plan
1. Literature review for segmentation and instance segmentation solutions.
2. Start with a simple, easy to implement semantic segmentation network to
practice the pipeline and set a benchmark. (~3 weeks)
a. Experiment with training from scratch or fine-tuning from COCO.
3. Select a SOTA algorithm for Instance Segmentation task and try to fine-tune it
to our task. (~5 weeks)
4. For video: (~3 weeks)
a. Start by naively analyzing frame by frame
b. Research object trackers and add them
Let’s talk!
- What are your biggest fears going into the projects?
- Now that you’ve been assigned a project, what are your biggest challenges?
- Questions about intro presentations?
