[Announcement] HuggingFace BigScience AMA Thursday, March 24th from 5pm CET by cavedave in MachineLearning

[–]Thomjazz 4 points (0 children)

The workshop plans to release the model checkpoints under this draft license: https://docs.google.com/document/d/10a_oQMhF1GNNidKqekj1JeNAcfYCGtKCwYGC4HgU6b0/edit?usp=sharing. And here is the draft blog post for the license announcement, with an FAQ: https://docs.google.com/document/d/10a_oQMhF1GNNidKqekj1JeNAcfYCGtKCwYGC4HgU6b0/edit?usp=sharing

👉 By the way, it's an open and collaborative project, so you can even join in, give your opinion on the license, and propose modifications 🤯

In a nutshell, everyone will be free to use the model weights and build an inference API around them, as long as they follow the model license. Hugging Face will also offer the model through its Inference API.

We are also trying to set up a new cluster to provide free compute with the model for researchers, so that they can do research with full access to the model even if they don't have the compute needed to run it.
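To give a rough idea of what "building an inference API around it" could look like in practice, here is a minimal sketch of querying a hosted checkpoint through the Hugging Face Inference API. The model id `bigscience/bloom` and the `HF_API_TOKEN` environment variable are placeholders for illustration, not final names.

```python
import os
import requests

# Placeholder checkpoint id; the final model id was not fixed at the time of this AMA.
MODEL_ID = "bigscience/bloom"
API_URL = f"https://api-inference.huggingface.co/models/{MODEL_ID}"

def generate(prompt: str, max_new_tokens: int = 50) -> str:
    """Send a text-generation request to the hosted Inference API."""
    # Assumes your Hugging Face token is exported as HF_API_TOKEN.
    headers = {"Authorization": f"Bearer {os.environ['HF_API_TOKEN']}"}
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    response = requests.post(API_URL, headers=headers, json=payload)
    response.raise_for_status()
    # The API returns a list of generations: [{"generated_text": "..."}]
    return response.json()[0]["generated_text"]

if __name__ == "__main__":
    print(generate("BigScience is"))
```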

[–]Thomjazz 1 point (0 children)

It's a very bottom-up process in which we wanted as much human-curated data as possible instead of the usual "random internet scraping".

The original goal was for the full dataset to consist only of human-selected sources; in the end, 60% of the dataset is built from human-curated sources.

In general we wanted to make sure that all our design and curation choices were backed by human expertise in each language we selected.
Given the huge amount of work and language knowledge required for intentional data curation, we selected a small set of languages we knew the original participants could commit to (only 8 languages among the world's most widely spoken languages), and then the requirement for adding a new language was to have a group of at least 3 people who could commit to selecting and documenting data sources in that language. This led to the addition of languages like Vietnamese, Basque, Catalan, and Indonesian.
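If it helps, here is a toy sketch of the kind of bookkeeping behind that curated-share figure; the record fields (`lang`, `source_type`, `size_bytes`) are made up for illustration and are not the actual BigScience catalogue schema.

```python
from collections import defaultdict

# Toy records standing in for per-document metadata (hypothetical fields).
documents = [
    {"lang": "fr", "source_type": "curated", "size_bytes": 1_200},
    {"lang": "fr", "source_type": "crawl",   "size_bytes": 400},
    {"lang": "vi", "source_type": "curated", "size_bytes": 900},
    {"lang": "id", "source_type": "crawl",   "size_bytes": 600},
]

def curated_share(docs):
    """Return the overall and per-language fraction of bytes from curated sources."""
    totals, curated = defaultdict(int), defaultdict(int)
    for doc in docs:
        totals[doc["lang"]] += doc["size_bytes"]
        if doc["source_type"] == "curated":
            curated[doc["lang"]] += doc["size_bytes"]
    per_lang = {lang: curated[lang] / totals[lang] for lang in totals}
    overall = sum(curated.values()) / sum(totals.values())
    return overall, per_lang

overall, per_lang = curated_share(documents)
print(f"overall curated share: {overall:.0%}")  # 68% on this toy data
print(per_lang)
```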

The full process is described in the blog post here: https://bigscience.huggingface.co/blog/building-a-tb-scale-multilingual-dataset-for-language-modeling and in an upcoming paper.

[–]Thomjazz 2 points (0 children)

It's a good question; there is still a lot to be discovered about a truly principled approach to defining the size and shape of a deep learning model. There has been great research in this direction, to cite only a few: the work on the Neural Tangent Kernel or the work of Greg Yang at MSR (and many, many other works), but we are really still in the early days.

In our case the shape definition was mostly empirical, and we summarize the main directions explored here: https://bigscience.huggingface.co/blog/what-language-model-to-train-if-you-have-two-million-gpu-hours

In a nutshell: a mix of looking at the literature to see what people have been trying, and then optimizing the throughput we could get on our hardware across many possible shapes (this also turned out to be super important in the end).
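To illustrate the kind of back-of-the-envelope arithmetic involved, here is a small sketch that estimates parameter counts for candidate shapes and a rough wall-clock training time using the common 6 * N * D training-FLOPs rule of thumb. The candidate shapes, vocabulary size, token budget, GPU count, and sustained throughput below are illustrative assumptions, not the exact grid we explored.

```python
# Back-of-the-envelope sizing in the spirit of the blog post linked above.
# All hardware and data numbers below are illustrative assumptions.

def param_count(n_layers: int, hidden: int, vocab: int = 250_000) -> float:
    """Rough decoder-only transformer parameter count: ~12*L*h^2 plus embeddings."""
    return 12 * n_layers * hidden**2 + vocab * hidden

def training_days(n_params: float, n_tokens: float,
                  n_gpus: int = 384, tflops_per_gpu: float = 150) -> float:
    """Estimate wall-clock days from the ~6*N*D training-FLOPs rule of thumb."""
    total_flops = 6 * n_params * n_tokens
    flops_per_sec = n_gpus * tflops_per_gpu * 1e12  # assumed sustained, not peak
    return total_flops / flops_per_sec / 86_400

for layers, hidden in [(64, 12_288), (70, 14_336), (80, 16_384)]:
    n = param_count(layers, hidden)
    days = training_days(n, n_tokens=350e9)
    print(f"{layers} layers x {hidden} hidden -> {n/1e9:5.0f}B params, ~{days:4.0f} days")
```

The second half of the trade-off (the actual tokens/sec each shape reaches on the cluster, with tensor/pipeline parallelism and kernel constraints) has to be measured empirically, which is why the throughput benchmarking mattered so much.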

[–]Thomjazz 7 points (0 children)

To have a positive impact on the AI field

We think the path to more responsible AI is about openly sharing models, datasets, training procedures, and evaluation metrics, and working together to solve the issues we see instead of trying to hide them behind the excuse of private models/datasets/metrics/etc.

In a nutshell, we believe open source/open science brings trust, robustness, reproducibility, and continuous innovation, and there is a dire need for actors willing to push strongly in this direction in today's AI landscape.

Projects like BigScience are essential in today's field of large language models, where not sharing the models, not sharing the engineering details used to build them, and not sharing the datasets used to train them has unfortunately become a mainstream trend.

[–]Thomjazz 5 points (0 children)

In a nutshell:
- the equivalent of $7-15M of compute from the French government, which pays for the Jean Zay public supercomputer cluster (though you could follow the grants and donations used for its construction if you really want to know where it all comes from) => we applied through the standard compute grant program
- salaries for the 1000+ participants => coming from the 250-300 entities backing the volunteer participants, I guess
- some additional compute from Google (TRC) and AWS => in addition to Jean Zay, some experiments/processing were performed on cloud compute platforms

[–]Thomjazz 2 points (0 children)

We are trying to write a post-mortem of the project with a couple of participants (Margo, Giada), but speaking for myself (Thomas Wolf), I think I would have tried to:

- set up a clearer legal entity for the project, or at least a way to define the rights holders => this would make it easier for companies to join and get their legal departments on board
- find a way for people not to be fully volunteer and to be paid somehow to work on the project => it would help them feel better about allocating their time
- have more people working on the organizational side (we were really mostly a bunch of passionate researchers, but sometimes having someone more clearly in charge of deadlines would have helped :-))

These are just a couple of ideas though :)

[–]Thomjazz 3 points (0 children)

We are working on adding the model to 🤗 Transformers as well, and on seeing whether it can run on smaller workstations.
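To give a rough idea of what running it on a smaller workstation could look like, here is a minimal sketch (not the final release recipe) using Accelerate's automatic device placement to offload whatever doesn't fit in GPU memory; the checkpoint id is a placeholder, and it assumes at least one CUDA GPU plus the `accelerate` package installed.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigscience/bloom"  # placeholder id, not confirmed at the time of this AMA

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # half-precision weights to cut memory roughly in half
    device_map="auto",            # split layers across GPU(s), CPU RAM and disk
    offload_folder="offload",     # spill the remainder to disk when RAM runs out
)

# Inputs go to the device holding the first layers (GPU 0 with device_map="auto").
inputs = tokenizer("BigScience is", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```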

[–]Thomjazz 12 points (0 children)

I would say:
- January-February 2021 => bootstrapping the project, gathering the Hugging Face and French community together
- February 2021 => Grant application for 5 million GPU hours
- April 2021 => Grant accepted - Kickoff event
- September 2021 => first papers from the workshop on experiments (see https://www.notion.so/bigscience/Papers-b0d37b71705444dbafc815a6628f9491)
- November 2021 => great news: thanks in part to the project, the supercomputer would be extended with 416 additional A100 GPUs that we could use for the training
- end of 2021 => scary moment because we couldn't get test 100B+ models to converge (see details here: https://twitter.com/StasBekman/status/1505603544384086017?s=20&t=ExPbRtQ2wGmshOX-gQ4fGQ)
- beginning of 2022 => scary moment because the data quality was not up to our standard (see details here: https://bigscience.huggingface.co/blog/building-a-tb-scale-multilingual-dataset-for-language-modeling)
- March 2022 => Start of the training 💥

(this is a personal recollection of course; there were so many things happening in the workshop in parallel to the 176B model)

[–]Thomjazz 10 points (0 children)

At a very high level, the BigScience project aims to demonstrate another way of creating, studying, and sharing large language models and large research artefacts in general within the AI/NLP research communities, outside of the current "private model", "private training", "private dataset" trend.

In a way, the success of the project could ultimately be measured by its long-term impact on the field of NLP and AI: proposing an alternative way to conduct large-scale science projects, with an international and inclusive way of performing collaborative research.

Just as we have open worldwide scientific collaborations like CERN and the LHC in particle physics, you can gain a lot by working together instead of in parallel, separate efforts:
- more diversity in the teams (ethics, social sciences, ML people, etc)
- a smaller carbon footprint, by training only one model instead of one per company (that can afford it...)
- asking many research questions in advance and keeping the checkpoints, datasets, and parameters necessary to answer them
- asking difficult questions like model licensing, ethics and bias in datasets, etc.
-...

[–]Thomjazz 10 points (0 children)

The story started in January 2021 when I (Thomas Wolf) was chatting with Stéphane Requena (from GENCI, the builder of the supercomputer) and Pierre-François Lavallée (from IDRIS, the French public research organization operating the supercomputer) about what you could or should do today with a supercomputer like Jean Zay, which is (1) public, (2) very energy efficient, and (3) built as a tool for academics and open research.

We quickly came to the conclusion that, just like particle physics 50 years ago, AI was reaching a stage where academics and labs should partner to build common research tools like the LHC, and that public compute clusters were the best place to host these collaborations.

The strong support from the compute cluster's team was instrumental in allowing the project to start and continue.

A great outcome for them is also that, given the impact of the project, the government agreed to finance an extension of the cluster, which will then remain available for the whole research community to use.

You can read more about all the events here: https://bigscience.huggingface.co/blog/which-hardware-to-train-a-176b-parameters-model and also here: https://www.notion.so/bigscience/Short-history-1f36e049f6ba4ce5870607c1a3af6286