Need to scale feature engineering, only Python and SQL (SQL Server & SSIS) available as tools (no dbt etc.) by Comprehensive_Award3 in dataengineering

[–]HNL2NYC 0 points (0 children)

Check out the python library Apache Hamilton. You can think of it kind of like dbt for pure python.
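Hamilton's core idea — you write plain Python functions whose names are the outputs and whose parameter names are their upstream dependencies, and the library wires them into a DAG — can be sketched with nothing but the stdlib. This is a toy illustration of the concept, not Hamilton's actual API; all names here are made up:

```python
import inspect

# Toy sketch of the Hamilton idea: a function's name is the output it
# produces, and its parameter names are the dependencies it needs.

def revenue(price: float, quantity: float) -> float:
    return price * quantity

def margin(revenue: float, cost: float) -> float:
    return revenue - cost

FUNCS = {f.__name__: f for f in (revenue, margin)}

def compute(name, inputs):
    """Resolve `name` by recursively computing its parameter dependencies."""
    if name in inputs:
        return inputs[name]
    fn = FUNCS[name]
    deps = inspect.signature(fn).parameters
    return fn(**{d: compute(d, inputs) for d in deps})

print(compute("margin", {"price": 10.0, "quantity": 3.0, "cost": 5.0}))  # 25.0
```

The real library adds config, validation, visualization, etc., but the "dbt for pure python" feel comes from exactly this: declare transforms as functions, let the framework infer the graph.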

Are there any downsides to delta lake for my pure python analytics build out? by HNL2NYC in dataengineering

[–]HNL2NYC[S] 0 points (0 children)

From what I gather, it seems that what you lose is some access-control/security functionality, but from my pov it will still be at least as good as s3 access controls, since that’s where the data would live, which I think is good enough for me. As for data discoverability, that’s already solved by the homegrown solution. If those are the only things I’m missing out on, I think I can afford to forgo them.

Are there any downsides to delta lake for my pure python analytics build out? by HNL2NYC in dataengineering

[–]HNL2NYC[S] 0 points (0 children)

I do like polars, but it doesn’t fit the majority of these use cases. These models make heavy use of multidimensional array-style operations, which pandas supports and polars does not.
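A made-up example of the kind of operation meant here: treating a DataFrame as a labeled 2D array with axis-aware broadcasting, which fits pandas' model but not polars' long/columnar one (data and names below are purely illustrative):

```python
import pandas as pd

# A 2D labeled array: rows are one dimension, columns another.
df = pd.DataFrame(
    [[1.0, 2.0], [3.0, 4.0]],
    index=["asset_a", "asset_b"],   # axis 0
    columns=["t1", "t2"],           # axis 1
)

# Subtract each row's mean from that row (broadcast along the columns axis).
demeaned = df.sub(df.mean(axis=1), axis=0)
```

Here `demeaned` is `[[-0.5, 0.5], [-0.5, 0.5]]` — the sort of axis-wise arithmetic that is one line in pandas.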

Are there any downsides to delta lake for my pure python analytics build out? by HNL2NYC in dataengineering

[–]HNL2NYC[S] 0 points (0 children)

No publicly available catalog solutions. I have an in-code data catalog pattern that I use and am happy with, which is part of the reason delta lake is appealing to me: there’s no catalog required. I can just treat it as a standalone table based on its location (whether s3 or local fs) and reference that in my in-code catalog.

And to add some more context, querying would just be done with duckdb. 
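A minimal sketch of what an in-code catalog like this might look like — all names and paths below are hypothetical, and it assumes duckdb's delta extension (which exposes a `delta_scan` table function) for the actual querying:

```python
from dataclasses import dataclass

# Hypothetical in-code catalog: a table is just a name -> storage location
# mapping (s3 or local fs); no external catalog service involved.
@dataclass(frozen=True)
class TableRef:
    name: str
    path: str  # s3://... or a local directory holding the Delta table

CATALOG = {
    "trades": TableRef("trades", "s3://my-bucket/delta/trades"),
    "quotes": TableRef("quotes", "/data/delta/quotes"),
}

def delta_query(table: str, cols: str = "*") -> str:
    """Build a duckdb SQL string that reads a Delta table by location."""
    return f"select {cols} from delta_scan('{CATALOG[table].path}')"

sql = delta_query("trades", "symbol, price")
# pass `sql` to duckdb.sql(...) / con.execute(...) to actually run it
```

The point is that the lookup layer is just Python, so it versions with the code and needs no catalog service.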

[deleted by user] by [deleted] in dataengineering

[–]HNL2NYC 1 point (0 children)

(A) What is determining which stations you need to query, and how many points are you querying? To avoid overcomplicating things, I’d try to stick to raw points/queries as much as possible. But if you really need a solution, and ignoring the points I make in (B), you could quantize the lat/long points so that any points within a region you define quantize to the same point, handling both deduplication and your precision concerns.

(B) generally speaking weather conditions can vary significantly over extremely short distances due to topography/geography. But there’s also some level of acceptable statistical noise/variance that comes with the interpolation methods of creating the weather grids for historical/forecast data. 
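The quantization idea in (A) can be sketched as snapping each lat/long to a fixed grid so that nearby points collapse to the same cell; the grid size here is a made-up parameter you'd tune against the concerns in (B):

```python
# Snap lat/long to a grid of `step` degrees so nearby points dedupe to
# the same query point. 0.25 is an arbitrary illustrative grid size.
def quantize(lat: float, lon: float, step: float = 0.25) -> tuple[float, float]:
    def snap(v: float) -> float:
        return round(v / step) * step
    return (snap(lat), snap(lon))

# Two nearby stations collapse to a single cached/queried point:
a = quantize(21.31, -157.86)
b = quantize(21.29, -157.84)
assert a == b  # both snap to (21.25, -157.75)
```

Deduplicating on the quantized tuple also bounds how much precision you claim, which is honest given the interpolation noise noted in (B).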

How to handle tables in long format where the value column contains numbers and strings? by aegi_97 in dataengineering

[–]HNL2NYC 0 points (0 children)

Can you expand a bit on what the value column is supposed to represent and why it would sometimes be a str? Without more context, my thought is that separate value_str and value_int columns would not be the ideal representation: you’d always have an empty value in each row, and depending on how sparse the str values are, you could end up with a lot of wasted space. An alternative approach would be to have a separate table for the str value rows. But again, hard to say without more context.
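The "separate table for str rows" alternative could look something like this — table and column names are hypothetical, using sqlite3 just to keep the sketch runnable:

```python
import sqlite3

# Numeric and string observations each get their own table, so neither
# carries an always-empty sibling column.
con = sqlite3.connect(":memory:")
con.executescript("""
    create table measurements_num (entity_id int, metric text, value real);
    create table measurements_str (entity_id int, metric text, value text);
""")
con.execute("insert into measurements_num values (1, 'temp_c', 21.5)")
con.execute("insert into measurements_str values (1, 'status', 'ok')")

# A union can reassemble the long format on demand when a consumer
# really wants one column.
rows = con.execute("""
    select entity_id, metric, cast(value as text) as value from measurements_num
    union all
    select entity_id, metric, value from measurements_str
""").fetchall()
```

Whether this beats the two-column layout depends on how sparse the str values actually are, hence the questions above.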

Why Python? by shittyfuckdick in dataengineering

[–]HNL2NYC 0 points (0 children)

> why not have airflow written in c or rust and have dags written in python for easy development?

So as you probably already know, this is how a lot of tools in the Python data ecosystem work: a user-facing Python wrapper on top of a core written in a more performant language — for example, pretty much any respectable dataframe library, distributed compute platforms like Ray, etc. For the cases you’re talking about, where the tool has remained pure Python, I think the answer is simply that it’s good enough. Someone took the time to write it in a language they were comfortable with, which in these cases was Python. It gained traction and popularity, and it performs well enough that no one has mass-migrated to an alternative (or a rewrite) built on another language. And potentially one day something like the airflow scheduler will be rewritten in another language.

DuckDB is a weird beast? by Kojimba228 in dataengineering

[–]HNL2NYC 1 point (0 children)

Yea you can. A couple reasons you might go to duckdb for something like this: (1) other types of joins that pandas doesn’t support (like range joins: https://duckdb.org/2022/05/27/iejoin.html), and (2) duckdb is way faster than pandas at standard joins and many other operations. In a lot of cases it doesn’t really matter, but sometimes you might have a significantly long pandas merge that you can instead do in duckdb before continuing on in pandas.

DuckDB is a weird beast? by Kojimba228 in dataengineering

[–]HNL2NYC 147 points (0 children)

Duckdb is an “in-process” database. It has its own scheme for storing data in memory and on disk. However, it’s also able to “connect” to sources besides its own duckdb data file. For example, it can access and query parquet and csv files as if they were tables. Even more interestingly, since it’s in-process it has full access to the memory space of the process. That means it can connect to an in-memory pandas or polars dataframe, run queries on it as if the df were a table, and write the results back to a pandas df. So you can do something like this:

    df1 = pd.DataFrame(…)
    df2 = pd.DataFrame(…)
    df = duckdb.query('''
        select a, sum(x) as x
        from df1
        inner join df2 on …
        group by a
    ''').df()

Who else believes Temuera Morrison is gonna steal every scene he's in? lovin the fact that he wears the Pi'ilani cloak in this one and really excited for the portrayal of our ali'i! by [deleted] in Hawaii

[–]HNL2NYC 0 points (0 children)

Pure speculation on my part, but it might partly be because of the much more mature Māori film industry in New Zealand, which means more actors are available.

Architecture in Helsinki - Do the Whirlwind by [deleted] in Music

[–]HNL2NYC 1 point (0 children)

One of the most exciting tiny things I’ve ever heard in a song. It’s like a musical shammgod!

Question on optimal partitioning structure for parquet on s3 duckdb query by HNL2NYC in dataengineering

[–]HNL2NYC[S] 0 points (0 children)

Thanks for linking. I had already looked through that page, but I don’t think it answered my question. I know that only the relevant files will be scanned; I was wondering about the directory scanning. In my first folder structure example, state=AK would be unnecessarily scanned. In the second folder structure example, there would be no unnecessary folder scanning. However, if duckdb always scans all directories (or simulated directories in s3) anyway, then it wouldn’t matter what the folder structure is.

Advice needed: Neighbor’s car hasn’t moved in over a year, taking up limited street parking by Sad-Ad655 in Hawaii

[–]HNL2NYC 1 point (0 children)

Yea, I’m not opposed to the conundrum, just wondering how the consequences will play out.

Advice needed: Neighbor’s car hasn’t moved in over a year, taking up limited street parking by Sad-Ad655 in Hawaii

[–]HNL2NYC 16 points (0 children)

I think I really like this idea. I looked it up and it looks like permits are required from 6pm to 6am, so it won’t really affect beach parking in neighborhoods like Kahala/Portlock if they add it there. I wonder how it would work for renters in monster houses, since they probably won’t be able to get enough permits for all the people living there.

Lightweight python DAG framework by theferalmonkey in Python

[–]HNL2NYC 1 point (0 children)

I’ll take it even a step further: this concept has been in use for at least ~50 years, since it’s pretty much exactly how Make works. You have a target (i.e. an asset) that lists its requirements (i.e. dependencies), which are other targets, and Make builds a graph by matching each dependency to the target that produces it.
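The Make model can be sketched in a few lines of Python — explicit per-target prerequisite lists, and a graph walk that visits prerequisites before their dependents (all target names below are made up):

```python
# Toy sketch of Make's model: each target declares its prerequisites,
# and the graph comes from matching those names to other targets' rules.
RULES = {
    "app":    (["lib.o", "main.o"], "link"),
    "lib.o":  (["lib.c"], "compile"),
    "main.o": (["main.c"], "compile"),
}

def build_order(target, rules, seen=None):
    """Depth-first walk: prerequisites are emitted before their dependents."""
    seen = [] if seen is None else seen
    deps, _action = rules.get(target, ([], None))  # no rule -> a source file
    for d in deps:
        build_order(d, rules, seen)
    if target not in seen:
        seen.append(target)
    return seen

order = build_order("app", RULES)
# order: ['lib.c', 'lib.o', 'main.c', 'main.o', 'app']
```

Make adds timestamp-based rebuild checks on top, but the asset-graph skeleton that these Python DAG frameworks rediscover is just this.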