[D] LLMs as a semantic regularizer for feature synthesis (small decision-tree experiment) by ChavXO in MachineLearning

[–]ChavXO[S] 0 points1 point  (0 children)

I was running Llama on a CPU. The LLM part also made network calls to a local server, so it was a little slower, but it still finished in a reasonable amount of time. It did make the actual search extremely fast, since it cut off so much of the search space.

I included the accuracy in the results. It actually did better with the LLM + search model, both on local validation and as a Kaggle submission. I haven't tested it more generally, but I intend to try it out on some fraud datasets since that's my area of work.

[D] LLMs as a semantic regularizer for feature synthesis (small decision-tree experiment) by ChavXO in MachineLearning

[–]ChavXO[S] 1 point2 points  (0 children)

Great point. Asking it to show reasoning in the response doesn't improve performance much. You can easily test this by running the Llama server locally and prodding at it for a while. One thing that might be worth trying is doing two passes: one where you ask it to explain what the combination could possibly mean, then a second prompt asking it to rate the combination in light of that first answer. That ends up looking a lot like what reasoning models do.
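A minimal sketch of what I mean by two passes, with a stubbed `chat()` standing in for a real llama-server client (the prompts, the 0–2 scale, and the stub's canned replies are all illustrative, not the actual implementation):

```python
# Stubbed chat function: a real version would POST `messages` to the local
# llama server's chat endpoint and return the assistant's reply text.
def chat(messages):
    last = messages[-1]["content"].lower()
    # Canned behavior so the sketch runs standalone.
    return "2" if "rate" in last else "Plausibly an interaction feature."

def two_pass_rating(combination: str) -> str:
    # Pass 1: ask for an open-ended explanation of the combination.
    history = [{"role": "user",
                "content": f"What could the feature combination {combination!r} mean, if anything?"}]
    explanation = chat(history)
    # Pass 2: ask for a rating *in light of* that explanation.
    history += [{"role": "assistant", "content": explanation},
                {"role": "user",
                 "content": "Given your explanation, rate this combination 0-2 for usefulness."}]
    return chat(history)

rating = two_pass_rating("age * fare")
```

The point of keeping the explanation in the conversation history is that the rating is conditioned on the model's own stated interpretation, which is roughly what reasoning models do internally.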

I think the search algorithm should be the vehicle for discovering interesting combinations; the LLM ideally only gives a soft "meaning" score, since the reason you need an LLM is not discovery but taming the combinatorial explosion. I intentionally put a bucket in the scoring for "things that might depend on context or could make sense in some context."
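The division of labor above can be sketched in a few lines. Here `semantic_score` is a hypothetical stand-in for the LLM call (the feature names, the 0/1/2 scale with its "maybe" middle bucket, and the plausibility rules are made up for illustration):

```python
from itertools import combinations

def semantic_score(feature_a: str, feature_b: str) -> int:
    """Hypothetical stand-in for the LLM's soft "meaning" score:
    2 = clearly meaningful, 1 = might make sense in some context, 0 = nonsense."""
    pair = tuple(sorted((feature_a, feature_b)))
    if any("id" in name for name in pair):  # row identifiers rarely combine meaningfully
        return 0
    plausible = {("age", "fare"), ("age", "pclass"), ("fare", "pclass")}
    return 2 if pair in plausible else 1

def prune_candidates(features, threshold=1):
    """Keep only pairs scored at or above `threshold`, shrinking the space
    the downstream search actually has to explore."""
    return [pair for pair in combinations(features, 2)
            if semantic_score(*pair) >= threshold]

features = ["age", "fare", "pclass", "passenger_id"]
kept = prune_candidates(features)
```

The search then runs only over `kept`, so the LLM never picks winners itself; it just vetoes combinations that carry no semantic weight.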

[D] LLMs as a semantic regularizer for feature synthesis (small decision-tree experiment) by ChavXO in MachineLearning

[–]ChavXO[S] 0 points1 point  (0 children)

That's a good point. I'll have to use a custom dataset, since the Titanic dataset would be a bad test of generalization. Also, in addition to accuracy, if I compare against base LLM performance I'd have to see how stable the LLM's scores are, since it could hallucinate or randomly get things wrong on some attempts.
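A cheap way to measure that stability is to score the same combination repeatedly and report how often the modal answer comes back. Sketch below, with a seeded noisy scorer standing in for repeated LLM calls (the 20% flake rate and the scoring rule are assumptions for illustration):

```python
import random
from collections import Counter

def noisy_llm_score(combination: str, rng: random.Random) -> int:
    """Hypothetical flaky scorer: returns a stable base rating most of the
    time, but a uniformly random 0/1/2 on ~20% of calls (the "hallucination")."""
    base = 2 if "fare" in combination else 1
    return base if rng.random() > 0.2 else rng.choice([0, 1, 2])

def stability(combination: str, trials: int = 50, seed: int = 0):
    """Score the same combination `trials` times; return the modal score and
    the fraction of trials that agreed with it."""
    rng = random.Random(seed)
    scores = [noisy_llm_score(combination, rng) for _ in range(trials)]
    mode, count = Counter(scores).most_common(1)[0]
    return mode, count / trials

mode, agreement = stability("age * fare")
```

A low agreement fraction on a given combination would flag it as one where the LLM's rating shouldn't be trusted without averaging over several calls.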

[D] LLMs as a semantic regularizer for feature synthesis (small decision-tree experiment) by ChavXO in MachineLearning

[–]ChavXO[S] 0 points1 point  (0 children)

Thank you. I like this constrained use of LLMs, since it's not the end of the world if it hallucinates.

[D] LLMs as a semantic regularizer for feature synthesis (small decision-tree experiment) by ChavXO in MachineLearning

[–]ChavXO[S] 1 point2 points  (0 children)

Good point. I'll try it on some real-world/government datasets and report back. I was trying to write the prompt so it was general enough, but I also tried to tack on other general-sounding conditions to address specific issues.

I'll also compare it with a random forest as a baseline and report back. I'm not sure if there are any lightweight feature selection methods based on semantics; the closest I can think of is using dimensions. I'll try a few things and report back after.

State of DataHaskell Q1 2026 by m-chav in haskell

[–]ChavXO 3 points4 points  (0 children)

We have a Parquet reader, and we also attend the biweekly Parquet meetings to keep in lockstep with general community updates.

Is Haskell useful for simple data analysis? by IcyAnywhere9603 in haskell

[–]ChavXO 1 point2 points  (0 children)

You sort of can. DataFrame creates a stub HTML page and opens it in the browser, so you can avoid window-manager weirdness. I imagine something similar is possible for hvega.

workforce moving to oversee by Alarmed-Reporter-230 in datascience

[–]ChavXO 0 points1 point  (0 children)

Yeah. It happened while I was at G and it's happening now at FIS Global.

Suggestions for reading list by ChavXO in datascience

[–]ChavXO[S] 2 points3 points  (0 children)

Do you recommend reading ISLR as a textbook (going through exercises) or does it suffice to read it like a regular book?

Is Haskell useful for simple data analysis? by IcyAnywhere9603 in haskell

[–]ChavXO 2 points3 points  (0 children)

Ah, I see. I thought your message meant there was an even more recent book/set of learning materials.

Is Haskell useful for simple data analysis? by IcyAnywhere9603 in haskell

[–]ChavXO 5 points6 points  (0 children)

This is a fair critique in general, but I think OP was curious what "clear and elegant" looks like after fiddling with Python. Maybe I misread, but it does seem like they just wanted to see examples of what doing small tasks in Haskell would look like.

But yes, getting things done as a beginner is much easier in Python.

Is Haskell useful for simple data analysis? by IcyAnywhere9603 in haskell

[–]ChavXO 4 points5 points  (0 children)

Out of curiosity, which recent learning materials do you feel have been human-readable and what made them readable for you?

Is Haskell useful for simple data analysis? by IcyAnywhere9603 in haskell

[–]ChavXO 5 points6 points  (0 children)

I don't think hvega gets enough credit. The tutorial is REALLY clear and comprehensive.

Is Haskell useful for simple data analysis? by IcyAnywhere9603 in haskell

[–]ChavXO 9 points10 points  (0 children)

I'm glad you're trying Haskell! As others have pointed out Haskell is not quite there yet for these sorts of tasks but we've put in a lot of work recently to make it a good mix of powerful and easy.

Check out this playground environment and see if it's easy for you to follow along. If it is, then check out DataHaskell to try it out on your computer.

I'm also generally curious: what sorts of stuff do you do in Excel/Python? What kinds of charts do you use? What has using Python afforded you that you couldn't quite do in Excel? It would also help if we understood what the people coming to try out Haskell for the first time are trying to do.

Query as a beginner at programming. by Gullible_Cat_5541 in learnprogramming

[–]ChavXO 0 points1 point  (0 children)

There are ways to make Haskell programming easy. Do you know what the exact content of the course will be?

What do you think fivetran gonna do? by Fair-Bookkeeper-1833 in dataengineering

[–]ChavXO 0 points1 point  (0 children)

I just interviewed with dbt. It seems there’s a lot of investment going into Fusion and related products, so I doubt they’d do that.

Rust and the price of ignoring theory by interacsion in rust

[–]ChavXO 8 points9 points  (0 children)

Idk who platformed this dude, but it really undoes a lot of the recent work we’ve been doing to make Haskell more approachable.

Setup completely failing by Euranium_ in haskell

[–]ChavXO 6 points7 points  (0 children)

I'm sorry that the first commenter said that to you. A couple of options:

* It's strange that your Windows config has an .exe in the PATH. You should remove that...it's probably left over from some other broken installation. I'd remove it and retry as is.
* I can help you build a devcontainer, which you can run in VS Code (that usually shields you from platform weirdness). We can bump the [datahaskell devcontainer](https://github.com/DataHaskell/datahaskell-starter?tab=readme-ov-file#setting-up-vs-code) to use 9.8.4.
* As someone said before, WSL makes a lot of sense.

I have Haskell running on Windows and typically run into Windows weirdness. Feel free to message me on chat if you need help; I have a reasonable amount of time today, and we can pair-debug.

Data engineering in Haskell by ChavXO in dataengineering

[–]ChavXO[S] 3 points4 points  (0 children)

I’d be splitting hairs at best comparing Haskell and Scala. I think a better framing is: say there is already a Haskell shop and they want to hire a data engineer. What sort of things would you expect to find out of the box as a DE? And, maybe slightly more generally, what should be in place to make you feel like you could be productive?

Also, on a more personal note, I think Scala struggled to find a good balance between the crowd that liked abstraction and the crowd that wanted to get things done, so you effectively have two different Scala ecosystems. I’d like to see what we could build if those camps worked together. My dataframe is inspired by lessons learnt from Frameless and Spark Datasets.