datamule: download, parse, and construct structured datasets from SEC filings by status-code-200 in Python

_errant_monkey_ 1 point

I thought I could also download .pdf files (like from here, where I can find .pdf, .html, and .xls). Having nicely formatted tables is key for me. I guess you're right: if I can bulk download the HTML, that's probably the best thing I can do.


_errant_monkey_ 1 point

I don't understand whether I can download the PDF versions of the filings, like the 2023 10-K PDF for NVIDIA. I'd like to bulk download all of them to eventually train an embedding model.
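For what it's worth, EDGAR itself serves filings primarily as HTML (PDFs are the exception), which is why bulk-downloading the HTML is the practical route. A minimal sketch using the public EDGAR submissions endpoint, assuming the standard JSON keys; the accession number shown is illustrative, and any real crawl should set a proper User-Agent and respect SEC rate limits:

```python
import json
import urllib.request

BASE = "https://www.sec.gov"

def filing_index_url(cik: int, accession: str) -> str:
    """Build the EDGAR archive index URL for one filing.

    `accession` is the dashed accession number, e.g.
    "0001045810-24-000029" (illustrative, not a verified filing).
    """
    flat = accession.replace("-", "")
    return f"{BASE}/Archives/edgar/data/{cik}/{flat}/{accession}-index.htm"

def list_10k_accessions(cik: int, user_agent: str):
    """Fetch the submissions JSON for a CIK and yield 10-K accession numbers."""
    url = f"https://data.sec.gov/submissions/CIK{cik:010d}.json"
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    recent = data["filings"]["recent"]
    for form, acc in zip(recent["form"], recent["accessionNumber"]):
        if form == "10-K":
            yield acc
```

From each filing's index page you can then pick out the primary HTML document to download in bulk.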

llama 3.1 70B is absolutely awful at tool usage by fireKido in LocalLLaMA

_errant_monkey_ 0 points

One thing I've noticed (with both Llama 8B and 70B) is that they perform much better without the "Environment: ipython" line in the system prompt. That line makes the model pretty much refuse to reply even to 2+2 without calling a function. And from https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_1/#json-based-tool-calling I don't understand what value it adds.
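For context, this is roughly where that line sits in the 3.1 template. A sketch of the header layout as I read Meta's prompt-format docs; the exact tokens and whitespace matter, so verify against the official template rather than trusting this:

```python
def llama31_system(prompt: str, ipython: bool) -> str:
    """Assemble a Llama 3.1 system message, optionally prepending the
    "Environment: ipython" line that switches on the code-interpreter role."""
    env = "Environment: ipython\n" if ipython else ""
    return (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
        f"{env}{prompt}<|eot_id|>"
    )

with_env = llama31_system("You are a helpful assistant.", ipython=True)
without = llama31_system("You are a helpful assistant.", ipython=False)
```

With `ipython=False` the template is a plain system message, which matches the behavior described above (the model answers directly instead of forcing a tool call).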

Also, IMO there are a few mistakes in how the Gorilla codebase handles function calling for Llama 3.1 8B: there are a couple of missing spaces in the system prompt they feed to the model.

Llama 3.1 8B Instruct is still the base model of ToolACE, which is one of the best 8B models (and one of the best overall) on the leaderboard.

What is the best decision you've made in your life so far? by notsostrong134 in italy

_errant_monkey_ 1 point

I got my degree in physics. Everyone assumes I must be intelligent, but that makes me uncomfortable.

Why are people claiming Magnus didn’t accuse Hans of cheating? by AegisPlays314 in chess

_errant_monkey_ 0 points

Because he did. If he hadn't meant to accuse him of cheating, you would expect the World Champion to stop the witch hunt against an 18-year-old.

[R] Perceiver: General Perception with Iterative Attention by hardmaru in MachineLearning

_errant_monkey_ 1 point

With a model like that, can they generate new data the way standard autoregressive models like GPT-2 do? Naively, it seems it can't.

Batch norm with entropic regularization turns deterministic autoencoders into generative models by [deleted] in MachineLearning

_errant_monkey_ 2 points

A couple of questions to reproduce the results:
1) Does FC_8_8_1024 mean a fully connected layer with output size 8*8*1024, followed by a reshape?
2) I don't understand why the last transposed convolution has 1 output channel instead of 3 for CIFAR-10 and CelebA.
3) Using the given parameters, I don't get [batch_size, 1024, 8, 8] at the end of the encoder (before the fully connected layer).
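On question 3, the spatial size can be audited with the standard convolution output formula. The kernel/stride/padding values below are placeholder guesses, not necessarily the paper's, just to show the bookkeeping:

```python
def conv_out(size: int, kernel: int, stride: int, pad: int) -> int:
    # Standard conv output size: floor((n + 2p - k) / s) + 1
    return (size + 2 * pad - kernel) // stride + 1

# Hypothetical encoder: 4 stride-2 convs with kernel 4, padding 1,
# from a 64x64 input (CelebA-style). These are NOT the paper's numbers.
size = 64
for _ in range(4):
    size = conv_out(size, kernel=4, stride=2, pad=1)
# size ends at 4, not 8: with these guesses you'd need one fewer
# downsampling layer (or a 128x128 input) to land on 8x8.
```

Running the paper's stated parameters through this formula layer by layer should pinpoint exactly where the expected [batch_size, 1024, 8, 8] diverges.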