5 Tips for public data science research

GPT-4 prompt: create a picture for working in a research group of GitHub and Hugging Face. Second iteration: Can you make the logos bigger and less crowded.

Intro

Why should you care?
Having a steady job in data science is demanding enough, so what's the incentive to invest more time in public research?

For the same reasons people contribute code to open source projects (becoming rich and famous is not among them).
It's a great way to practice different skills: writing an appealing blog post, (trying to) write readable code, and overall giving back to the community that nurtured us.

Personally, sharing my work creates a commitment and a relationship with whatever I'm working on. Feedback from others might seem daunting (oh no, people will look at my scribbles!), but it can also prove highly motivating. We tend to appreciate people who take the time to create public discourse, so demoralizing comments are rare.

Also, some work can go unnoticed even after sharing. There are ways to optimize outreach, but my main focus is working on projects that interest me, while hoping that my material has educational value and potentially lowers the entry barrier for other practitioners.

If you're interested in following my research: I'm currently developing a Flan-T5-based intent classifier. The model (and tokenizer) is available on Hugging Face, and the training code is fully available on GitHub. This is an ongoing project with lots of open features, so feel free to send me a message (Hacking AI Discord) if you're interested in contributing.

Without further ado, here are my tips for public research.

TL;DR

  1. Upload the model and tokenizer to Hugging Face
  2. Use Hugging Face model commits as checkpoints
  3. Maintain a GitHub repository
  4. Create a GitHub project for task management and issues
  5. Training pipeline and notebooks for sharing reproducible results

Upload the model and tokenizer to the same Hugging Face repo

The Hugging Face platform is great. So far I had used it for downloading various models and tokenizers, but never for sharing resources, so I'm glad I took the plunge: it's straightforward and comes with a lot of benefits.

How do you upload a model? Here's a snippet based on the official HF tutorial.
You need to get an access token and pass it to the push_to_hub method.
You can get an access token through the Hugging Face CLI (huggingface-cli login) or by copying it from your HF settings.

from transformers import AutoModel, AutoTokenizer

# push to the hub (token="" is a placeholder for your access token)
model.push_to_hub("my-awesome-model", token="")
# my contribution: push the tokenizer to the same repo
tokenizer.push_to_hub("my-awesome-model", token="")
# reload
model_name = "username/my-awesome-model"
model = AutoModel.from_pretrained(model_name)
# my contribution
tokenizer = AutoTokenizer.from_pretrained(model_name)
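
If you prefer not to paste the token into every call, here's a minimal sketch of logging in once first (assuming the huggingface_hub package and a token copied from your HF settings):

# one-time login via the CLI (stores the token locally):
# huggingface-cli login
# or programmatically, e.g. inside a notebook:
from huggingface_hub import login
login(token="hf_...")  # placeholder: your access token from HF settings

# after logging in, push_to_hub works without an explicit token argument
model.push_to_hub("my-awesome-model")
tokenizer.push_to_hub("my-awesome-model")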

Benefits:
1. Similarly to how you pull the model and tokenizer using the same model_name, uploading both lets you keep the same pattern and thus simplify your code.
2. It's easy to swap your model for another one by changing a single parameter, which lets you test other options with ease (see the sketch below).
3. You can use Hugging Face commit hashes as checkpoints. More on this in the next section.
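
To illustrate point 2, here's a minimal sketch; the Flan-T5 checkpoints are just examples of public models you could swap in, not a recommendation:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# swapping models is a one-parameter change: only model_name differs
model_name = "google/flan-t5-small"  # e.g. try "google/flan-t5-base" instead
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)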

Use Hugging Face model commits as checkpoints

Hugging Face repos are basically Git repositories. Whenever you upload a new model version, HF creates a new commit with that change.

You are probably already familiar with saving model versions at work, however your team decided to do it: saving models in S3, or using the W&B model registry, ClearML, DagsHub, Neptune.ai or any other platform. You're not in Kansas anymore, so you need a public way to do it, and Hugging Face is just perfect for that.

By saving model versions, you create the perfect research setting and make your improvements reproducible. Uploading a new version doesn't actually require anything other than executing the code I attached in the previous section. But if you're going for best practice, you should add a commit message or a tag to signify the change.

Here’s an example:

commit_message = "Add another dataset to training"
# pushing
model.push_to_hub("my-awesome-model", commit_message=commit_message)
# pulling (commit_hash is a placeholder, copied from the repo's commit history)
commit_hash = ""
model = AutoModel.from_pretrained(model_name, revision=commit_hash)

You can find the commit hash in the commits section of the model repo; it looks like this:

2 people hit the like button on my model 🥲

How did I use different model revisions in my research?
I trained two versions of the intent classifier: one without a certain public dataset (ATIS intent classification), which served as the zero-shot example, and another version after I added a small portion of that dataset's train split and retrained. By using model versions, the results are reproducible forever (or until HF breaks).
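
If you prefer to grab revisions programmatically instead of copying hashes from the UI, here's a minimal sketch using the huggingface_hub client (the repo id is a placeholder, and I'm assuming a reasonably recent huggingface_hub version):

from huggingface_hub import HfApi

api = HfApi()
repo_id = "username/my-awesome-model"  # placeholder

# list the repo's commits to pick a revision
for commit in api.list_repo_commits(repo_id):
    print(commit.commit_id, commit.title)

# optionally, tag a specific commit so it gets a human-readable name
api.create_tag(repo_id, tag="zero-shot-baseline", revision="")  # fill in a commit hash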

Maintain a GitHub repository

Uploading the model wasn't enough for me; I wanted to share the training code as well. Training Flan-T5 might not be the most fashionable thing right now, due to the surge of new LLMs (small and big) uploaded on a weekly basis, but it's damn useful (and relatively simple: text in, text out).

Whether your purpose is to educate or to collaboratively improve your research, uploading the code is a must-have. Plus, it has the bonus of giving you a basic project management setup, which I'll describe below.

Create a GitHub project for task management

Task management.
Just by reading those words you are filled with joy, right?
For those of you who do not share my excitement, let me give you a small pep talk.

Aside from being a must for collaboration, task management is useful first and foremost to the main maintainer. In research there are so many possible avenues that it's hard to focus. What better focusing method is there than adding a few tasks to a Kanban board?

There are two different ways to manage tasks in GitHub. I'm not an expert in this, so please delight me with your insights in the comments section.

GitHub Issues is a well-known feature. Whenever I'm interested in a project, I always head there first to check how borked it is. Here's a snapshot of the intent classifier repo's issues page.

Not borked at all!

There's a new task management option in town: opening a GitHub project, a Jira look-alike (not trying to hurt anyone's feelings).

They look so appealing, it just makes you want to pop open PyCharm and start working on it, don't ya?

Training pipeline and notebooks for sharing reproducible results

Shameless plug — I wrote a piece about a project structure that I like for data science.

The gist of it: have a script for each important task of the usual pipeline.
Preprocessing, training, running a model on raw data or files, going over prediction results and outputting metrics, plus a pipeline file to connect the different scripts into a pipeline (a minimal sketch follows below).
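
As an illustration (not the actual layout of my repo), here's a minimal pipeline.py sketch that chains hypothetical step scripts together:

# pipeline.py: run the individual step scripts in order
# the script names below are hypothetical placeholders
import subprocess

STEPS = [
    ["python", "preprocess.py"],
    ["python", "train.py"],
    ["python", "predict.py", "--input", "data/raw.csv"],
    ["python", "evaluate.py"],
]

for step in STEPS:
    print("Running:", " ".join(step))
    subprocess.run(step, check=True)  # stop the pipeline if a step fails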

Notebooks are for sharing a specific result: for instance, a notebook for an EDA, a notebook for an interesting dataset, and so forth.

This way, we separate the things that need to persist (notebook research results) from the pipeline that creates them (scripts). This separation allows others to collaborate on the same repository fairly easily.

I've attached an example from the intent_classification project: https://github.com/SerjSmor/intent_classification

Summary

I hope this tip list has pushed you in the right direction. There is a notion that data science research is something done only by experts, whether in academia or in industry. Another notion I want to oppose is that you shouldn't share work in progress.

Sharing research work is a muscle that can be trained at any step of your career, and it shouldn't be one of your last ones. Especially considering the special time we're in, when AI agents pop up, CoT and Skeleton papers are being updated, and so much exciting groundbreaking work is being done. Some of it is complex, and some of it is pleasantly within reach, conceived by mere mortals like us.
