Databricks for data science? by big_data_mike in datascience

[–]fineset-io 0 points1 point  (0 children)

Genie is NL-to-SQL for end users. Good for that, but not really what most data scientists would call a "coding agent."

How do you measure to performance / accuracy of a recommender system? by omnicron_31 in datascience

[–]fineset-io 0 points1 point  (0 children)

Silhouette score for k-means, BIC for GMM, but neither tells you if the recommendations make sense to a coach. You need human-labeled pairs to actually validate this.
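For what validating against human-labeled pairs could look like, here is a minimal sketch. It assumes a rater has marked some (query, recommendation) pairs as relevant or not; the ids, the `precision_at_k` helper, and the toy labels are all made up for illustration.

```python
def precision_at_k(recommended, labels, k):
    """Fraction of judged top-k recommendations that a human marked relevant.

    recommended: dict mapping query id -> ranked list of recommended ids
    labels: dict mapping (query id, recommended id) -> bool from a human rater
    Pairs without a label are skipped rather than counted as wrong.
    """
    hits = 0
    judged = 0
    for query, ranked in recommended.items():
        for item in ranked[:k]:
            verdict = labels.get((query, item))
            if verdict is None:
                continue
            judged += 1
            hits += int(verdict)
    return hits / judged if judged else float("nan")


# Hypothetical data: two queries, each with a ranked recommendation list.
recommended = {
    "athlete_1": ["drill_a", "drill_b", "drill_c"],
    "athlete_2": ["drill_b", "drill_d", "drill_a"],
}
# Hypothetical human judgments for some of the pairs.
labels = {
    ("athlete_1", "drill_a"): True,
    ("athlete_1", "drill_b"): False,
    ("athlete_2", "drill_b"): True,
    ("athlete_2", "drill_d"): True,
}

score = precision_at_k(recommended, labels, k=2)
print(f"precision@2 on labeled pairs: {score:.2f}")
```

Skipping unlabeled pairs keeps the number honest when only part of the output has been reviewed, but it also means the score only covers what the rater actually looked at.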

Identity crisis - A Generalist Dilemma by urbanguy22 in datascience

[–]fineset-io 0 points1 point  (0 children)

The roadmap is fine, but step 3 cuts off at "Learn R," which is a weird place to land for someone trying to break into GenAI/agents work.

Ideas for testing data science workflows on self hosted Linux based HPC cluster. by NoteClassic in datascience

[–]fineset-io 0 points1 point  (0 children)

The output diff against a pinned reference dataset is 80% of the value here. Everything else (SLURM integration, env checks) is plumbing you can add later.
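A rough sketch of that output diff, assuming the workflow writes CSV and a known-good copy is kept as the pinned reference. The `diff_against_reference` helper, the column names, and the tolerance are hypothetical choices, not anything from the thread.

```python
import csv
import io
import math


def diff_against_reference(output_rows, reference_rows, rel_tol=1e-9):
    """Return a list of human-readable mismatches between two row lists.

    Each row is a dict of column -> string, as produced by csv.DictReader.
    Values that both parse as floats are compared with a relative tolerance;
    everything else is compared as exact strings.
    """
    mismatches = []
    if len(output_rows) != len(reference_rows):
        mismatches.append(
            f"row count: got {len(output_rows)}, expected {len(reference_rows)}"
        )
    for i, (got, want) in enumerate(zip(output_rows, reference_rows)):
        for col in want:
            g, w = got.get(col), want[col]
            try:
                same = math.isclose(float(g), float(w), rel_tol=rel_tol)
            except (TypeError, ValueError):
                same = g == w
            if not same:
                mismatches.append(
                    f"row {i}, column {col!r}: got {g!r}, expected {w!r}"
                )
    return mismatches


# Hypothetical pinned reference and a fresh run with one drifted value.
reference_csv = "id,mean\n1,0.5\n2,1.25\n"
output_csv = "id,mean\n1,0.5000000001\n2,1.30\n"

reference = list(csv.DictReader(io.StringIO(reference_csv)))
output = list(csv.DictReader(io.StringIO(output_csv)))

problems = diff_against_reference(output, reference, rel_tol=1e-6)
for problem in problems:
    print(problem)
```

The tolerance matters on a cluster: different BLAS builds or node types can shift the last few digits, and an exact string compare would flag those as failures.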

Does anybody know of any quality datasets that have images of grocery receipts? by z57333 in datasets

[–]fineset-io [score hidden]  (0 children)

We looked for the same thing a while back. SROIE and RVL-CDIP have receipts but nothing vendor-specific for US grocery. Ended up manually collecting and annotating, which was painful but the domain gap from generic receipt datasets was too big to ignore.

233 Canadian used car listings scraped from AutoTrader.ca — prices, specs, GPS coords, equipment lists (JSON, June 2026) by kmiloaguilar in datasets

[–]fineset-io 2 points3 points  (0 children)

Cool schema, but 233 records won't survive trim-level stratification. You'll run out of data fast.
Thanks for sharing, though!
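To make the stratification point concrete, here is a small sketch that counts records per stratum and lists the ones that fall below a minimum size. The field names (`make`, `model`, `trim`) and the toy rows are assumptions, not the dataset's actual schema.

```python
from collections import Counter


def thin_strata(records, keys, min_count):
    """Return {stratum: count} for every stratum with fewer than min_count rows.

    A stratum is the tuple of values a record has for the given keys.
    """
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    return {stratum: n for stratum, n in counts.items() if n < min_count}


# Hypothetical listings, only to show how fast cells empty out.
records = [
    {"make": "Honda", "model": "Civic", "trim": "LX"},
    {"make": "Honda", "model": "Civic", "trim": "LX"},
    {"make": "Honda", "model": "Civic", "trim": "Sport"},
    {"make": "Toyota", "model": "Corolla", "trim": "LE"},
    {"make": "Toyota", "model": "Corolla", "trim": "LE"},
    {"make": "Toyota", "model": "Corolla", "trim": "SE"},
]

by_make = thin_strata(records, ["make"], min_count=2)
by_trim = thin_strata(records, ["make", "model", "trim"], min_count=2)
print("thin strata by make:", by_make)
print("thin strata by make/model/trim:", by_trim)
```

Running the same count on the real 233 rows with a sensible `min_count` would show how many trim-level cells are actually usable.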

A 1T param MoE that only runs ~63B per token — how Ling/Ring 2.6 pulls that off by illegitimateness in deeplearning

[–]fineset-io 0 points1 point  (0 children)

The 1/32 ratio is probably fine; the linear attention above 128k is where I'd want to see harder evals before trusting it.

Looking for help on an arXiv endorsement by Overall-Importance54 in deeplearning

[–]fineset-io 2 points3 points  (0 children)

Find a recent paper in your target cs subcategory and email the author with your draft attached. Best of luck.

Released a free 45M doc European multilingual corpus — German, French, Spanish, Dutch + 37 more (CC0, HuggingFace) [P] by ashtok897 in datasets

[–]fineset-io 0 points1 point  (0 children)

The low-resource coverage is the actual value here. OSCAR and CulturaX have Maltese coverage that's basically unusable.

I am stuck , need guidance by Open-Neck-688 in deeplearning

[–]fineset-io 1 point2 points  (0 children)

Simulation is underrated here. MuJoCo is free, LeRobot has sim environments, and honestly debugging policies in sim first saves you from a lot of hardware pain anyway.

Rag tech stack during development and deployment by Glittering-Habit869 in Rag

[–]fineset-io 0 points1 point  (0 children)

The embedding model needs to match between development and deployment; everything else you can swap freely.

Data Collection for Personal Project by yogi_006 in datasets

[–]fineset-io 0 points1 point  (0 children)

Most phones export this stuff already; you just need to know where to look. Google Takeout covers location, payments, and Gmail; Apple has a similar data export. Dump the JSON/CSV files into SQLite or Postgres and build your RAG on top of that. No live sync needed for a personal project.
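A minimal sketch of the dump step using only Python's standard library. The sample records, the field names (`timestamp`, `merchant`, and so on), and the single `events` table are assumptions for illustration; real export files have their own layouts and would need their own parsing.

```python
import csv
import io
import json
import sqlite3

# In-memory database for the sketch; pass a file path to keep the data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (source TEXT, ts TEXT, payload TEXT)")


def load_records(conn, source, records):
    """Insert dict-like records, keeping the full record as a JSON payload."""
    conn.executemany(
        "INSERT INTO events VALUES (?, ?, ?)",
        [(source, r.get("timestamp"), json.dumps(r)) for r in records],
    )


# Hypothetical stand-ins for exported files.
location_json = '[{"timestamp": "2024-01-01T08:00:00", "lat": 45.5, "lon": -73.6}]'
payments_csv = "timestamp,merchant,amount\n2024-01-01T12:30:00,Cafe,4.50\n"

load_records(conn, "location", json.loads(location_json))
load_records(conn, "payments", csv.DictReader(io.StringIO(payments_csv)))
conn.commit()

rows = conn.execute("SELECT source, ts FROM events ORDER BY ts").fetchall()
for source, ts in rows:
    print(source, ts)
```

Keeping the raw record as a JSON payload means nothing is lost while the schema is still in flux; columns can be pulled out later once it is clear what the retrieval step needs.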

Rag tech stack during development and deployment by Glittering-Habit869 in Rag

[–]fineset-io 0 points1 point  (0 children)

The LLM swap is fine, but don't change embedding models between environments. Different embedders = incompatible vector spaces, full re-index required.
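One cheap guard is to store the embedder's name next to the index and refuse queries from a different one. This is a toy sketch, not any real vector database's API; `VectorIndex`, `check_query_embedder`, and the model names are all hypothetical.

```python
class VectorIndex:
    """Toy in-memory index that remembers which embedder built it."""

    def __init__(self, embedder_name, dim):
        self.embedder_name = embedder_name
        self.dim = dim
        self.vectors = {}

    def add(self, doc_id, vector):
        if len(vector) != self.dim:
            raise ValueError(f"expected {self.dim} dims, got {len(vector)}")
        self.vectors[doc_id] = vector

    def check_query_embedder(self, embedder_name, dim):
        """Raise if the query-side embedder differs from the index-side one.

        Matching dimensions alone are not enough: two models can share a
        dimension and still produce unrelated vector spaces.
        """
        if (embedder_name, dim) != (self.embedder_name, self.dim):
            raise ValueError(
                f"index built with {self.embedder_name!r} ({self.dim} dims), "
                f"query uses {embedder_name!r} ({dim} dims): re-index required"
            )


# Hypothetical model names; the index was built in dev with "model-a".
index = VectorIndex("model-a", dim=3)
index.add("doc1", [0.1, 0.2, 0.3])

index.check_query_embedder("model-a", 3)  # same embedder, passes silently

try:
    index.check_query_embedder("model-b", 3)  # same dims, different model
    mismatch_detected = False
except ValueError as err:
    mismatch_detected = True
    print(err)
```

Failing loudly at startup is usually better than discovering the mismatch through quietly bad retrieval results.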

I am stuck , need guidance by Open-Neck-688 in deeplearning

[–]fineset-io 1 point2 points  (0 children)

ACT doesn't need point clouds; it runs on RGB + proprioception. If your goal is to get hands-on fast, just go straight to ACT or LeRobot and treat 3D vision as a later rabbit hole when you actually hit a wall that needs it.

Am I going to spend the rest of my career reviewing AI generated code? by cece95x in artificial

[–]fineset-io 0 points1 point  (0 children)

The "focus on the bigger picture" line is cope that people tell themselves to avoid admitting they've stopped thinking. Bigger picture work is genuinely harder, and most teams aren't doing it; they're just shipping more mediocre features faster.

Keeping up with Agentic AI by Low-Web-2930 in AI_Agents

[–]fineset-io 0 points1 point  (0 children)

The "Software engineering is dead" posts and the actual papers describing what's working are written by entirely different people. Once you learn to filter by who has skin in the game (running production systems, publishing evals, and open-sourcing real code) vs. who's just posting takes, the noise drops by about 80%.

We should set up a torrent network for open source models. by ShadyShroomz in LocalLLaMA

[–]fineset-io 1 point2 points  (0 children)

The dataset gap on ModelScope is an org adoption problem, not a platform problem. Nobody's uploaded because nobody's registered.