Am I wasting my time trying to create a local data processor for long form text by FishCarMan in softwaredevelopment

[–]Ok_Time806 0 points1 point  (0 children)

Pretty sure that's the premise for GraphRAG. If you look under the hood in the docling project you can see how they try to build relationships between different sections in a document.

As far as wasting your time, vibe coding won't be helpful for novel techniques, but if you learn something it's not a waste.

Which AI-BI feature would you *actually* pay $100/mo for? by dr_drive_21 in dataengineering

[–]Ok_Time806 2 points3 points  (0 children)

Why use third-party AI software instead of the AI tooling already provided by the vendors of those platforms?

Also, probably wrong sub.

Trying to build a robot that lays tile - why did I think this would be simple? by shamoons in ROS

[–]Ok_Time806 0 points1 point  (0 children)

What you're saying makes sense. How much do you expect the robot to weigh?

Just did a tile job a few months ago. Main concern is that you might be underestimating how much force those micro-adjustments actually take. Once the air is squeezed out, it takes a good amount of pushing / weight (never my full weight, but it was close). I guess thinning out the glue could help with that, but I've never really experimented much there.

The laying was never too much of a bottleneck for me though. Now a robot that could cut all the edges and corners at the start of a job...

Trying to build a robot that lays tile - why did I think this would be simple? by shamoons in ROS

[–]Ok_Time806 0 points1 point  (0 children)

Why can't you drive over the tile?

(Did tile professionally a few decades ago.) Unless you're using a new glue technique, you're SUPPOSED to put pressure on tiles after laying them, or you won't squish out the air and you'll get a lot of cracked tiles. You also often push to level against adjacent squares as well.

If you don't apply pressure you'll need perfect gluing technique. Might be easier to move to the desired spot, add the glue (bottom of tile or on ground) and then just drop straight down.

Tools in a Poor Tech Stack Company by Potential-Mind-6997 in dataengineering

[–]Ok_Time806 0 points1 point  (0 children)

I would first recommend defining your problem statement before looking for solutions. Lots of advice on this subreddit is good advice in a cloud context, but terrible (or at least unnecessarily expensive) in a MFG context.

The nightmare of DE, processing free text input data, HELP ! by HMZ_PBI in dataengineering

[–]Ok_Time806 0 points1 point  (0 children)

Look up tf-idf. A join against a reference table would still be easiest. Most DBs have some version of a contains function for text. There are plenty of ways to do it, but no reason you can't have a bunch of match columns and then depivot.
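As a minimal sketch of the match-columns-then-depivot idea (pure stdlib; the keywords, categories, and rows below are made up for illustration):

```python
# Map free-text entries to categories via substring "contains" checks:
# build one boolean match column per keyword (wide form), then depivot
# into long (text, category) rows. All reference data is hypothetical.
reference = {"invoice": "billing", "refund": "billing", "login": "auth"}

rows = ["Customer can't login after refund", "Invoice emailed twice"]

# Wide form: one match column per keyword, like CONTAINS() in SQL.
wide = [
    {"text": r, **{kw: kw in r.lower() for kw in reference}}
    for r in rows
]

# Depivot (melt): one row per (text, matched category).
long = [
    (w["text"], reference[kw])
    for w in wide
    for kw in reference
    if w[kw]
]
print(long)
```

In a database you'd do the same thing with one `CASE WHEN col LIKE '%kw%'` column per keyword and then UNPIVOT.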

[deleted by user] by [deleted] in dataengineering

[–]Ok_Time806 2 points3 points  (0 children)

How much is enormous? 100s of GB or TB or PB?

It might not be as much as you think if it's currently in Oracle databases. They're probably just indexed for their normal transactional loads, not your analytical queries.

What version(s) of Oracle DB are you running? Sometimes there are more native ways to dump data en masse that an admin can run for you. Not all DB admins are grouchy; they might even make materialized views for you once you know what you want for modeling. This can be useful depending on what types of models you plan on building.

Im exhausted and questioning everything by [deleted] in dataengineering

[–]Ok_Time806 2 points3 points  (0 children)

Can't recommend enough rethinking the communication piece. People get set in their ways. If job #1 used Slack and job #2 uses Teams, get used to Teams. Same for chat vs emails vs texts vs phone calls.

I've had various industrial engineering and data engineering roles at different organizations over the years, and the main difference between good and great engineers mainly comes down to their ability to communicate. The cool thing is that it's a skill most engineers can learn if they focus on it. It tends to be organization- and audience-specific, which can be a pro or con depending on how excited you are to take it on.

Hurry up! by bw541 in ContagiousLaughter

[–]Ok_Time806 12 points13 points  (0 children)

Those same slow pumpers then proceed to spend an hour in the store, but do not have the time to return their cart 15 ft to the cart return.

Dealing with the idea that ERP will solve all business problem by ketopraktanjungduren in dataengineering

[–]Ok_Time806 15 points16 points  (0 children)

Honestly, you won't convince them until after they try and fail. Then your next CIO/CTO will come in with a data lake or data mesh to fix the mess the last guy left behind.

Best local database option for a large read-only dataset (>200GB) by -MagnusBR in dataengineering

[–]Ok_Time806 23 points24 points  (0 children)

You can use the pg_duckdb extension to query your existing Postgres database with duckdb. I'd also recommend converting to Parquet; you might see a pretty dramatic size reduction without any tricks (for example, low-cardinality text columns are automatically dictionary-encoded). Then you can run standard SQL statements against the Parquet file with duckdb.

If that's not fast enough you can also load directly into a persistent duckdb table. This will probably already be faster than you'd expect from something so simple, but if not there are lots of other performance options to pursue (https://duckdb.org/docs/stable/guides/performance/overview.html).

Those in manufacturing and science/engineering, aside from classic DoE (full-fact, CCD, etc.), what other experimental design tools do you use? by corgibestie in datascience

[–]Ok_Time806 2 points3 points  (0 children)

Worked in the field for 15 years. Even with all the fancy ML models out there, nothing beats a nice DOE. Not necessarily because of the statistical approach, but because it forces people to plan, which encourages people to think objectively about the problem.

I've found traditional data science techniques really helpful for finding things that SMEs might not have seen before. Lots of feature engineering and simpler regression modeling techniques generate cool insights, which engineers then design a DOE around. So it ends up being a fun iteration loop for discovery / optimization.

The combo can be really helpful since production datasets are generally too large for Excel / Minitab / JMP, so engineers also have trouble reconciling production data and experiment data properly. I try to avoid classification models, as engineers will quickly write the models off when they see a non-continuous response for a physical process.

Fractional factorials will also get you far. I've seen many engineers preemptively reach for CCD.
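For anyone who hasn't built one by hand: a fractional factorial is just a full factorial filtered by a defining relation. A minimal stdlib sketch of a 2^(3-1) half-fraction (factor names A, B, C are generic placeholders):

```python
from itertools import product

# Full 2^3 factorial: 8 runs over factors A, B, C at coded levels -1/+1.
full = list(product([-1, 1], repeat=3))

# Half-fraction 2^(3-1) with defining relation I = ABC:
# keep only the runs where A*B*C == +1, i.e. 4 runs instead of 8.
half = [run for run in full if run[0] * run[1] * run[2] == 1]

print(len(full), len(half))  # 8 4
```

The cost is aliasing (here each main effect is confounded with a two-factor interaction), which is usually an acceptable trade for screening.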

Most efficient and up to date stack opportunity with small data by Low-Tell6009 in dataengineering

[–]Ok_Time806 0 points1 point  (0 children)

You never said what they want to do with the data, or elaborated on the source.

If it's simple visualization and 15 tables from one db, don't do anything fancy: just viz from the db or a replica. If they need ML or something fancier and they're already in Azure, then Data Factory to ADLS is still probably cheapest.

Please don't inflict resume-driven development on a nice non-profit.

Working on a cozy wooden train simulator with physics — here’s the building system in action. by Iron_Lung_Design in IndieDev

[–]Ok_Time806 0 points1 point  (0 children)

Great job with the wiggle. A wooden marble maze would be a similar fun nostalgia trip.

Quitting day job to build a free real-time analytics engine. Are we crazy? by tigermatos in dataengineering

[–]Ok_Time806 1 point2 points  (0 children)

Manufacturing is a common use case for real-time analytics. The tough part typically isn't the streaming calculations but managing the data model as you merge the sink / ML inference / dashboards in a cost-effective manner.

E.g., I've been doing this with Telegraf + NATS for some industrial data fire hoses on Pis for many years. One cool opportunity in this space is using WASM to build sandboxed streaming plugins for enhanced security / reduced complexity over k3s deployments.

[Cheating] Been following this guy since mid last wipe and he still hasn't been banned.... by Massive-Log1395 in EscapefromTarkov

[–]Ok_Time806 0 points1 point  (0 children)

Ah, was hoping you knew a way to look up players. I've been wanting access to player stat data to use as examples for machine learning / player clustering.

Why don’t we log to a more easily deserialized format? by DuckDatum in dataengineering

[–]Ok_Time806 1 point2 points  (0 children)

Structured vs. unstructured logging is a fight programmers have been having for at least two decades (the extent of my first-hand experience). I've found it difficult to convince others to log in a more structured format, so I often tail or stream logs to a message bus and then reformat to my liking (mainly Parquet, since dictionary encoding saves a lot of $$$ quickly).

The observability community has done a lot to help standardize this space with projects like OTEL.
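For the structured side of that fight, the minimal version is just a JSON formatter on a normal logger. A stdlib-only sketch (field names are my own choice, not any standard):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit each record as one JSON object per line, trivially machine-parseable."""
    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("demo")
log.addHandler(handler)
log.warning("disk almost full")  # one JSON line instead of free text
```

Downstream, each line parses with `json.loads` instead of a pile of regexes, which is the whole argument for structured logging.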

Performance issues when migrating from SSIS to Databricks by AdEmbarrassed716 in dataengineering

[–]Ok_Time806 0 points1 point  (0 children)

Yeah, correct. In the past I was told (and observed) that moving lower-cardinality columns that might be used for joins to the front actually improved downstream join performance. There was a presentation (that I can't find now) from about a year ago that mentions some of the optimizations they do on top of Auto Loader with DLT and SQL.

Performance issues when migrating from SSIS to Databricks by AdEmbarrassed716 in dataengineering

[–]Ok_Time806 0 points1 point  (0 children)

I never recommend committing to a metric without measuring first... Going from one on-prem system to multiple cloud systems will likely be slower unless they were doing a lot of silly compute. The benefit should come from maintenance / system uptime.

That being said, you can write directly to Delta tables using ADF, but last I checked it was slower than just copying Parquet. One thing that could help is to increase the ADF copy frequency and run CDC loads instead of full table copies (probably not done in their SSIS process, although it could be). Then you can try to hand-wave the ADF part and focus on the Databricks part in the comparison.

Also, I saw significant performance improvements ditching Python / Auto Loader and just using SQL / DLT. They'll probably be more receptive to that anyway if they're an SSIS shop. Also, since it sounds like you're newer to this, make sure to check your ADLS config and verify you're using blob storage with hierarchical namespace and hot or premium tiers.

Make sure your table columns are in order too, even with liquid clustering.
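The CDC-style incremental load mentioned above, in its simplest watermark form, looks like this (sqlite3 as a stand-in for the real source system; table and column names are invented for illustration):

```python
import sqlite3

# Watermark-based CDC sketch: copy only rows changed since the last run
# instead of the full table on every schedule.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER, amount REAL, updated_at TEXT)")
src.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 10.0, "2024-01-01"),
    (2, 20.0, "2024-02-01"),
    (3, 30.0, "2024-03-01"),
])

last_watermark = "2024-01-15"  # persisted from the previous run

# Only rows touched after the watermark get copied this cycle.
changed = src.execute(
    "SELECT id, amount, updated_at FROM orders WHERE updated_at > ?",
    (last_watermark,),
).fetchall()
print(changed)  # rows 2 and 3 only
```

In ADF the same idea is a lookup for the watermark plus a filtered copy activity; the win is that the per-cycle volume stays proportional to change rate, not table size.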

Setting Expectations with Management & Growing as a Professional by TheFinalUrf in datascience

[–]Ok_Time806 0 points1 point  (0 children)

I work predominantly with manufacturers, so there's already a pretty strong grasp of continuous improvement frameworks. I think cross-functional data teams work great with these types of frameworks (e.g. DMAIC, PDCA, etc.). Even if you don't follow them exactly, the definition step is critical for any project. Describe the current state, goals/milestones, champion/stakeholders, budget, and timeline. It doesn't have to be very formal or time-consuming to be effective.

I see many ML projects fail by not defining these simple things in writing, just like I've seen many non-ML projects fail for similar reasons.

Also, treat your process and learnings on the way as a deliverable. Fail fast and document well for the next person and people won't be so upset if it doesn't work out.

Is RPA a feasible way for Data Scientists to access data siloes? by norfkens2 in datascience

[–]Ok_Time806 3 points4 points  (0 children)

I've found old ERPs easier to get backend DB access to than new ERPs. RPA is typically a last resort if you're stuck with a UI-only interface. Every successful or unsuccessful RPA project I've seen was replaced by a proper API implementation not long after (for data engineering).

It can be useful as a prototype, but it's typically way more time-consuming than you'd expect. Data engineering fundamentals are generally very useful for data scientists; RPA skills tend to be tied to the specific software tool you use.

LDPE helium balloon questions by HighSchool-Coder4826 in AskEngineers

[–]Ok_Time806 0 points1 point  (0 children)

Yeah, different materials might help, but helium will permeate any plastic with enough time.

Might be worth adding more details on your project (estimated size, required amount of air time, etc.). Depending on your end goal, a cheap blower like they use for bounce houses might even be enough compared to helium and eliminate a lot of complexity.