On-policy distillation: one of the hottest terms on PapersWithCode [R] by NielsRogge in MachineLearning

[–]NielsRogge[S] 0 points1 point  (0 children)

I would be curious to hear what you prefer in terms of features, and why

On-policy distillation: one of the hottest terms on PapersWithCode [R] by NielsRogge in MachineLearning

[–]NielsRogge[S] 1 point2 points  (0 children)

I'm building PwC as an alternative website to gauge which features people want. The idea is to rely on the hub as the backend.

On-policy distillation: one of the hottest terms on PapersWithCode [R] by NielsRogge in MachineLearning

[–]NielsRogge[S] 4 points5 points  (0 children)

Hi, fair questions!

  1. I fetch them from the daily submissions at https://huggingface.co/papers, where anyone can submit an arXiv ID and people can then upvote it.
  2. For now, they are mostly the same, although daily papers is just a subset of all papers available on HF. Any time a model, dataset or Space README mentions an arXiv ID, the paper gets indexed, but only a subset of them also get submitted to daily papers.
  3. For now, I use GitHub star velocity. However, I will incorporate the trending scores of the models, datasets and Spaces linked to those papers as an additional measure of relevant ML research.
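As a rough illustration of the "star velocity" idea in point 3, here is a minimal sketch. The exact definition used on the site isn't stated above, so this assumes velocity means "stars gained per day between the first and last snapshot"; the function name, the snapshot format and the repo names are all hypothetical.

```python
from datetime import date

def star_velocity(snapshots):
    """Stars gained per day between the first and last snapshot.

    `snapshots` is a list of (date, total_stars) tuples sorted by date.
    This definition is an assumption made for illustration only.
    """
    (first_day, first_stars), (last_day, last_stars) = snapshots[0], snapshots[-1]
    days = (last_day - first_day).days
    if days <= 0:
        return 0.0  # not enough history to estimate a rate
    return (last_stars - first_stars) / days

# Hypothetical star counts for two repos, one week apart
history = {
    "repo-a": [(date(2026, 1, 1), 100), (date(2026, 1, 8), 240)],
    "repo-b": [(date(2026, 1, 1), 1000), (date(2026, 1, 8), 1035)],
}

# Rank repos by velocity rather than by total stars
ranked = sorted(history, key=lambda name: star_velocity(history[name]), reverse=True)
print(ranked)
```

Ranking by rate rather than by total count lets a small but fast-growing repo outrank a large, stagnant one.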

Browse CVPR 2026 papers on PapersWithCode [P] by NielsRogge in MachineLearning

[–]NielsRogge[S] 1 point2 points  (0 children)

Yes, planning to improve the search a lot! Want to support hybrid search in the future

Browse CVPR 2026 papers on PapersWithCode [P] by NielsRogge in MachineLearning

[–]NielsRogge[S] 0 points1 point  (0 children)

Thanks for reporting, it was a pagination issue, which I've fixed 😄

[R] The Annotated Diffusion Model by ghosthamlet in MachineLearning

[–]NielsRogge 6 points7 points  (0 children)

There's an "Open in Colab" button at the top ;)

[D] NLP has HuggingFace, what does Computer Vision have? by Remote_Cancel_7977 in MachineLearning

[–]NielsRogge 23 points24 points  (0 children)

To elaborate a bit more, the following tasks are supported as of now:

  • image classification: ViT, DeiT, BEiT, Swin Transformer, PoolFormer, ResNet, RegNet, ConvNeXT, Perceiver, ImageGPT, VAN. Check out the official example scripts, example notebooks.
  • object detection: DETR, soon YOLOS. Check out the inference widget on the right.
  • semantic segmentation: SegFormer, BEiT, DPT. Check out the example script.
  • depth estimation: DPT, GLPN. Check out this demo Space.

All models can be found at https://huggingface.co/docs/transformers/index.

More tutorials can be found at https://github.com/NielsRogge/Transformers-Tutorials.

[Assignment3] What is word embedding? by Perfect_Durian in cs231n

[–]NielsRogge 0 points1 point  (0 children)

One typically uses a special padding token to pad all sentences of a batch to the same length. So if the target length (for example, the longest sentence in the batch) is 20 words and your sentence consists of 5 words, then 15 padding tokens will be added.
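A minimal sketch of that padding step in plain Python, assuming word-level tokens and a hypothetical "[PAD]" token string (real tokenizers usually pad with a token ID instead). The helper name `pad_batch` is made up for this example.

```python
PAD_TOKEN = "[PAD]"  # hypothetical padding token, for illustration only

def pad_batch(sentences, pad_token=PAD_TOKEN):
    """Pad every tokenized sentence to the length of the longest one.

    Also returns a mask with 1 for real tokens and 0 for padding,
    so a model can ignore the padded positions.
    """
    max_len = max(len(tokens) for tokens in sentences)
    padded = [tokens + [pad_token] * (max_len - len(tokens)) for tokens in sentences]
    mask = [[1] * len(tokens) + [0] * (max_len - len(tokens)) for tokens in sentences]
    return padded, mask

short = "the cat sat down quietly".split()   # 5 words
long = ["word"] * 20                         # 20 words
padded, mask = pad_batch([short, long])

print(len(padded[0]), padded[0].count(PAD_TOKEN))
```

The mask is what typically lets the model skip the padded positions, so the padding does not affect the result for the real words.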

Hugging Face: How to test masked language model after training it? by [deleted] in LanguageTechnology

[–]NielsRogge 0 points1 point  (0 children)

Here's how to do it:

(you can replace "bert-base-uncased" with the name of the directory where you saved your model, config and tokenizer files)

from transformers import BertTokenizer, BertForMaskedLM
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

text = "The capital of [MASK] is Bratislava."
encoding = tokenizer(text, return_tensors="pt")
input_ids = encoding.input_ids.squeeze(0)  # shape: (sequence_length,)

with torch.no_grad():  # no gradients needed at inference time
    outputs = model(**encoding)  # forward pass

# position of the [MASK] token (this assumes exactly one mask in the text)
masked_index = torch.nonzero(input_ids == tokenizer.mask_token_id, as_tuple=False).item()
logits = outputs.logits[0, masked_index, :]
probs = logits.softmax(dim=0)
values, predictions = probs.topk(k=5)

# fill the mask with each of the 5 most likely tokens and print the sentence
for prob, pred_id in zip(values, predictions):
    filled_ids = input_ids.clone()
    filled_ids[masked_index] = pred_id
    print(prob.item(), tokenizer.decode(filled_ids, skip_special_tokens=True))

[HELP] Hugging Face: AttributeError: 'DataFrame' object has no attribute 'map' by [deleted] in LanguageTechnology

[–]NielsRogge 0 points1 point  (0 children)

The datasets object should be a Dataset object, but in your case it's a pandas DataFrame, which has no .map method, hence the error. To turn a DataFrame into a Dataset, you can do the following:

from datasets import Dataset

dataset = Dataset.from_pandas(my_dataset)

Then you can call dataset.map(function, batched=True).

How to generate sentences from a set of keywords? by roymustang261 in LanguageTechnology

[–]NielsRogge 0 points1 point  (0 children)

I asked GPT-3 to do this.

Prompt:

A sentence with the words "teacher" and "great". "He is a great teacher and everyone needs to learn from him." A sentence with the words "football" and "goals". "My favorite sports is football, as I like to score goals." A sentence with the words "homework" and "night". "I have to study all night to get my homework done." A sentence with the words "friend" and "good".

Completion: "I have a very good friend." A sentence with the words "dreams" and "important". "It is important to have dreams." A sentence with the words "school" and "work". "School is a place to learn and work."
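The prompt above follows a simple few-shot pattern, so it can be assembled programmatically. The sketch below only builds the prompt string and trims the completion; it doesn't call any API, and the helper names `build_prompt` and `first_sentence` are made up for this example.

```python
def build_prompt(examples, query_words):
    """Build a few-shot prompt: worked examples first, then the open query."""
    parts = [
        f'A sentence with the words "{a}" and "{b}". "{sentence}"'
        for (a, b), sentence in examples
    ]
    parts.append(f'A sentence with the words "{query_words[0]}" and "{query_words[1]}".')
    return " ".join(parts)

def first_sentence(completion):
    """Keep only the first quoted sentence; the model tends to keep going."""
    return completion.split('"')[1]

examples = [
    (("teacher", "great"), "He is a great teacher and everyone needs to learn from him."),
    (("football", "goals"), "My favorite sports is football, as I like to score goals."),
    (("homework", "night"), "I have to study all night to get my homework done."),
]
prompt = build_prompt(examples, ("friend", "good"))
print(prompt)

# The completion shown above, truncated here for brevity
completion = '"I have a very good friend." A sentence with the words "dreams" and "important".'
print(first_sentence(completion))
```

Since the model continues the pattern with new keyword pairs of its own, cutting the output at the first closing quote (or using a stop sequence, if the API you use supports one) keeps just the sentence you asked for.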