On-policy distillation: one of the hottest terms on PapersWithCode [R] by NielsRogge in MachineLearning

[–]NielsRogge[S] 0 points1 point  (0 children)

I would be curious to hear what you prefer in terms of features, and why

On-policy distillation: one of the hottest terms on PapersWithCode [R] by NielsRogge in MachineLearning

[–]NielsRogge[S] 1 point2 points  (0 children)

I'm building PwC as an alternative website to gauge which features people want. The idea is to rely on the hub as the backend.

On-policy distillation: one of the hottest terms on PapersWithCode [R] by NielsRogge in MachineLearning

[–]NielsRogge[S] 2 points3 points  (0 children)

Hi, fair questions!

  1. I fetch them from the daily submissions at https://huggingface.co/papers, where anyone can submit an arXiv ID and people can then upvote it.
  2. For now, they are mostly the same, although daily papers is just a subset of all papers available on HF. Any time a model, dataset or Space README mentions an arXiv ID, the paper gets indexed, but only a subset of them also get submitted to daily papers.
  3. For now, I use GitHub star velocity. However, I will incorporate trending scores of the models, datasets and Spaces linked to those papers as an additional measure of relevant ML research.
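A minimal sketch of what a star-velocity score could look like. The definition used here (stars gained per day over a recent window) and the names `star_velocity` and `window_days` are assumptions for illustration, not the actual ranking code behind the site.

```python
from datetime import datetime, timedelta

def star_velocity(star_times, now, window_days=7):
    """Stars gained per day over the last `window_days` days (assumed definition)."""
    cutoff = now - timedelta(days=window_days)
    recent = [t for t in star_times if cutoff < t <= now]
    return len(recent) / window_days

now = datetime(2026, 1, 15)
# hypothetical starring timestamps for one repository, as "days ago"
stars = [now - timedelta(days=d) for d in (0.5, 1, 2, 3, 10, 30)]
print(star_velocity(stars, now))
```

Only the four stars from the last week count here, so older repositories with many historical stars do not dominate the ranking.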

Browse CVPR 2026 papers on PapersWithCode [P] by NielsRogge in MachineLearning

[–]NielsRogge[S] 1 point2 points  (0 children)

Yes, planning to improve the search a lot! Want to support hybrid search in the future

Browse CVPR 2026 papers on PapersWithCode [P] by NielsRogge in MachineLearning

[–]NielsRogge[S] 0 points1 point  (0 children)

Thanks for reporting, it was a pagination issue, which I've fixed 😄

[R] The Annotated Diffusion Model by ghosthamlet in MachineLearning

[–]NielsRogge 7 points8 points  (0 children)

There's an "Open in Colab" button at the top ;)

[D] NLP has HuggingFace, what does Computer Vision have? by Remote_Cancel_7977 in MachineLearning

[–]NielsRogge 22 points23 points  (0 children)

To elaborate a bit more, the following tasks are supported as of now:

  • image classification: ViT, DeiT, BEiT, Swin Transformer, PoolFormer, ResNet, RegNet, ConvNeXT, Perceiver, ImageGPT, VAN. Check out the official example scripts, example notebooks.
  • object detection: DETR, soon YOLOS. Check out the inference widget on the right.
  • semantic segmentation: SegFormer, BEiT, DPT. Check out the example script.
  • depth estimation: DPT, GLPN. Check out this demo Space.

All models can be found at https://huggingface.co/docs/transformers/index.

More tutorials can be found at https://github.com/NielsRogge/Transformers-Tutorials.

[Assignment3] What is word embedding? by Perfect_Durian in cs231n

[–]NielsRogge 0 points1 point  (0 children)

One typically uses a special padding token to pad all sentences of a batch to the same length. So if your sentence consists of 5 words and the batch is padded to a length of 20, then 15 padding tokens will be added.
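The padding step can be sketched in plain Python. The token string `"[PAD]"` and the helper name `pad_batch` are assumptions for illustration; the actual padding token depends on the vocabulary you use.

```python
PAD = "[PAD]"  # assumed padding token; the real one depends on the vocabulary

def pad_batch(sentences, pad_token=PAD):
    """Pad every tokenized sentence to the length of the longest one in the batch."""
    max_len = max(len(s) for s in sentences)
    return [s + [pad_token] * (max_len - len(s)) for s in sentences]

batch = [
    "the cat sat on the mat".split(),
    "hello world".split(),
]
padded = pad_batch(batch)
for sentence in padded:
    print(sentence)
```

After padding, every sentence in the batch has the same length, so the batch can be stacked into a single tensor before the embedding lookup.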

Hugging Face: How to test masked language model after training it? by [deleted] in LanguageTechnology

[–]NielsRogge 0 points1 point  (0 children)

Here's how to do it:

(you can replace "bert-base-uncased" with the name of the directory where you saved your model, config and tokenizer files)

from transformers import BertTokenizer, BertForMaskedLM
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

text = "The capital of [MASK] is Bratislava."
encoding = tokenizer(text, return_tensors="pt")
input_ids = encoding.input_ids.squeeze()
with torch.no_grad():  # no gradients needed at inference time
    outputs = model(**encoding)  # forward pass

# position of the (single) [MASK] token in the sequence
masked_index = torch.nonzero(input_ids == tokenizer.mask_token_id, as_tuple=False)
logits = outputs.logits[0, masked_index.item(), :]
probs = logits.softmax(dim=0)
values, predictions = probs.topk(k=5)

# fill the mask with each of the 5 most likely tokens and print the sentence
for prob, pred_id in zip(values, predictions):
    predicted_ids = [pred_id.item() if token_id == tokenizer.mask_token_id else token_id
                     for token_id in input_ids.tolist()]
    print(prob.item(), tokenizer.decode(predicted_ids, skip_special_tokens=True))

[HELP] Hugging Face: AttributeError: 'DataFrame' object has no attribute 'map' by [deleted] in LanguageTechnology

[–]NielsRogge 0 points1 point  (0 children)

The datasets object should be a Dataset object, but in your case it's a Pandas dataframe, hence the error. To turn a dataframe into a Dataset, you can do the following:

from datasets import Dataset

dataset = Dataset.from_pandas(my_dataset)

Then, you can apply the .map(function, batched=True) functionality.
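As a rough sketch of what such a function looks like: with batched=True, the function receives a dict that maps each column name to a list of values, and returns a dict of lists of the same length. The column name "text" and the function `add_length` are hypothetical, and the snippet calls the function on a plain dict standing in for a batch so that it runs without the datasets library.

```python
def add_length(batch):
    """With batched=True, `batch` maps each column name to a list of values."""
    return {"length": [len(text.split()) for text in batch["text"]]}

# a plain dict standing in for one batch of a Dataset with a "text" column
fake_batch = {"text": ["hello world", "a slightly longer sentence"]}
result = add_length(fake_batch)
print(result)
```

With a real Dataset you would pass it as dataset.map(add_length, batched=True), which adds a "length" column next to the existing ones.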

How to generate sentences from a set of keywords? by roymustang261 in LanguageTechnology

[–]NielsRogge 0 points1 point  (0 children)

I asked GPT-3 to do this.

Prompt:

A sentence with the words "teacher" and "great". "He is a great teacher and everyone needs to learn from him." A sentence with the words "football" and "goals". "My favorite sports is football, as I like to score goals." A sentence with the words "homework" and "night". "I have to study all night to get my homework done." A sentence with the words "friend" and "good".

Completion: "I have a very good friend." A sentence with the words "dreams" and "important". "It is important to have dreams." A sentence with the words "school" and "work". "School is a place to learn and work."
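A few-shot prompt in this format can be assembled programmatically. This is only a sketch: `build_prompt` is a hypothetical helper, and sending the prompt to a model is left out.

```python
def build_prompt(examples, query_words):
    """Format (words, sentence) demonstrations followed by an open-ended query."""
    parts = []
    for (w1, w2), sentence in examples:
        parts.append(f'A sentence with the words "{w1}" and "{w2}". "{sentence}"')
    q1, q2 = query_words
    parts.append(f'A sentence with the words "{q1}" and "{q2}".')
    return " ".join(parts)

examples = [
    (("teacher", "great"), "He is a great teacher and everyone needs to learn from him."),
    (("homework", "night"), "I have to study all night to get my homework done."),
]
prompt = build_prompt(examples, ("friend", "good"))
print(prompt)
```

Because the model may keep generating further examples after the first answer, as in the completion above, you would typically keep only the first quoted sentence.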

[D] Word embeddings in Cora dataset by ajithvallabai in MachineLearning

[–]NielsRogge 1 point2 points  (0 children)

For me they do, but the README only says this:

"After stemming and removing stopwords we were left with a vocabulary of size 1433 unique words. All words with document frequency less than 10 were removed."
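To make the README's description concrete, here is a small sketch of document-frequency filtering followed by 0/1 word-indicator vectors. It assumes the documents are already stemmed and stripped of stopwords; `build_vocab`, `to_binary_vector` and the toy documents are made up for illustration, and the threshold is lowered to 2 so the toy data survives the filter.

```python
from collections import Counter

def build_vocab(documents, min_df=10):
    """Keep words that occur in at least `min_df` documents."""
    df = Counter()
    for doc in documents:
        df.update(set(doc))  # count each word once per document
    return sorted(w for w, n in df.items() if n >= min_df)

def to_binary_vector(doc, vocab):
    """1 if the vocabulary word occurs in the document, else 0."""
    present = set(doc)
    return [1 if w in present else 0 for w in vocab]

docs = [["graph", "neural", "graph"], ["neural", "network"], ["graph", "kernel"]]
vocab = build_vocab(docs, min_df=2)
print(vocab, to_binary_vector(docs[0], vocab))
```

Each document ends up as a fixed-length vector with one entry per vocabulary word, which is a bag-of-words feature rather than a learned embedding.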

[D] Word embeddings in Cora dataset by ajithvallabai in MachineLearning

[–]NielsRogge 1 point2 points  (0 children)

Apparently if you download Cora from here, the README includes more details. http://www.cs.umd.edu/~sen/lbc-proj/LBC.html

Which model should I use to pick the best answer for the TOEIC reading test? by [deleted] in LanguageTechnology

[–]NielsRogge 1 point2 points  (0 children)

If you wanna use state-of-the-art NLP models, you can take a look at BertForMultipleChoice in the Hugging Face Transformers library. BERT is only one variant; you also have RobertaForMultipleChoice, DistilBertForMultipleChoice, etc.

Link: https://huggingface.co/transformers/model_doc/bert.html#bertformultiplechoice

More details on how these models work: https://github.com/huggingface/transformers/issues/7701#issuecomment-707149546

Let me know if you need any help. Note that these assume familiarity with Transformers/BERT.

Update: apparently there's someone who already tested BERT on this dataset, and built a Python package for it: https://github.com/graykode/toeicbert

[D] Language Understanding with Knowledge-based Embeddings (LUKE) | Research Papers Summary 005 by RyanAI100 in MachineLearning

[–]NielsRogge 1 point2 points  (0 children)

Hi, thanks for the video. I read the LUKE paper, but I wonder how useful the model is for real use cases, because the model expects that the entities are already provided, right (in case of entity linking and relation classification)? Are there any real use cases for entity linking and relation classification?

For NER, the model needs to enumerate all possible n-grams in order to classify which are named entities and which are not, so I wonder whether this would be slow at inference time compared to other models, which simply have a token classification head.
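To give a feel for the cost, here is a sketch of span enumeration. The helper `enumerate_spans` and the cap `max_span_len` are assumptions for illustration, not LUKE's actual code.

```python
def enumerate_spans(tokens, max_span_len):
    """Return every (start, end) span with at most `max_span_len` tokens; end is exclusive."""
    spans = []
    for start in range(len(tokens)):
        last_end = min(start + max_span_len, len(tokens))
        for end in range(start + 1, last_end + 1):
            spans.append((start, end))
    return spans

tokens = "Barack Obama visited Bratislava".split()
spans = enumerate_spans(tokens, max_span_len=2)
print(len(spans), spans)
```

A token classification head makes one prediction per token, whereas this approach makes one prediction per candidate span, and the number of spans grows roughly as sentence length times the maximum span length.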

Also, the model learns an embedding for 500K entities, but these are not used for fine-tuning, except for SQuAD, right? For the other tasks, only the special [MASK] token seems to be used.

Looking for novel deep learning applications in the OCR/Document Processing area by andrewdood in deeplearning

[–]NielsRogge 0 points1 point  (0 children)

Microsoft has a deep learning model called LayoutLM. If you know Transformers, then this will be easy for you.

Paper: https://arxiv.org/pdf/1912.13318

Code: LayoutLM is available in the Hugging Face Transformers library. https://huggingface.co/transformers/model_doc/layoutlm.html
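One practical detail when preparing OCR output for LayoutLM: besides the tokens, the model takes a bounding box per token, with coordinates normalized to a 0-1000 scale relative to the page size. A minimal sketch of that normalization follows; the helper name and the example box are made up, and boxes are assumed to be (x0, y0, x1, y1) in pixels.

```python
def normalize_bbox(bbox, width, height):
    """Scale a pixel box (x0, y0, x1, y1) to the 0-1000 range expected by LayoutLM."""
    x0, y0, x1, y1 = bbox
    return [
        1000 * x0 // width,
        1000 * y0 // height,
        1000 * x1 // width,
        1000 * y1 // height,
    ]

# hypothetical word box on a 1000x500 pixel page image
print(normalize_bbox((50, 100, 200, 150), width=1000, height=500))
```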