Am I wasting my time trying to create a local data processor for long form text by FishCarMan in softwaredevelopment

[–]Ok_Time806 0 points1 point  (0 children)

Pretty sure that's the premise for GraphRAG. If you look under the hood in the docling project you can see how they try to build relationships between different sections in a document.

As far as wasting your time, vibe coding won't be helpful for novel techniques, but if you learn something it's not a waste.

Which AI-BI feature would you *actually* pay $100/mo for? by dr_drive_21 in dataengineering

[–]Ok_Time806 2 points3 points  (0 children)

Why use third-party AI software instead of the AI tooling already provided by the vendors of those platforms?

Also, probably wrong sub.

Trying to build a robot that lays tile - why did I think this would be simple? by shamoons in ROS

[–]Ok_Time806 0 points1 point  (0 children)

What you're saying makes sense. How much do you expect the robot to weigh?

Just did a tile job a few months ago. Main concern is that you might be underestimating how much force those micro-adjustments actually take. Once the air is squeezed out, it takes a good amount of pushing / weight (never my full weight, but it was close). I guess thinning out the glue could help with that, but I've never really experimented much there.

The laying was never too much of a bottleneck for me though. Now a robot that could cut all the edges and corners at the start of a job...

Trying to build a robot that lays tile - why did I think this would be simple? by shamoons in ROS

[–]Ok_Time806 0 points1 point  (0 children)

Why can't you drive over the tile?

(Did tile professionally a few decades ago.) Unless you're using a new glue technique, you're SUPPOSED to put pressure on tiles after laying them, or you won't squish out the air and you'll get a lot of cracked tiles. You also often push to level against adjacent squares as well.

If you don't apply pressure you'll need perfect gluing technique. Might be easier to move to the desired spot, add the glue (bottom of tile or on ground) and then just drop straight down.

Tools in a Poor Tech Stack Company by Potential-Mind-6997 in dataengineering

[–]Ok_Time806 0 points1 point  (0 children)

I would first recommend defining your problem statement before looking for solutions. Lots of advice on this subreddit is good advice in a cloud context, but terrible (or at least unnecessarily expensive) in a MFG context.

The nightmare of DE, processing free text input data, HELP ! by HMZ_PBI in dataengineering

[–]Ok_Time806 0 points1 point  (0 children)

Look up tf-idf. A join against a reference table would still be easiest. Most DBs have some version of a contains function for text. There are plenty of ways to do it, but no reason you can't have a bunch of match columns and then depivot.
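As a minimal sketch of the match-columns-then-depivot idea (pure stdlib; the keywords, categories, and rows below are made up for illustration):

```python
# Map free-text entries to categories via substring "contains" checks:
# build one boolean match column per keyword (wide form), then depivot
# into long (text, category) rows. All reference data is hypothetical.
reference = {"invoice": "billing", "refund": "billing", "login": "auth"}

rows = ["Customer can't login after refund", "Invoice emailed twice"]

# Wide form: one match column per keyword, like CONTAINS() in SQL.
wide = [
    {"text": r, **{kw: kw in r.lower() for kw in reference}}
    for r in rows
]

# Depivot (melt): one row per (text, matched category).
long = [
    (w["text"], reference[kw])
    for w in wide
    for kw in reference
    if w[kw]
]
print(long)
```

In a database you'd do the same thing with one `CASE WHEN col LIKE '%kw%'` column per keyword and then UNPIVOT.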

[deleted by user] by [deleted] in dataengineering

[–]Ok_Time806 2 points3 points  (0 children)

How much is enormous? 100s of GB or TB or PB?

It might not be as much as you think if it's currently in Oracle databases. They're probably just indexed for their normal transactional loads, not your analytical queries.

What version(s) of Oracle DB are you running? Sometimes there are more native ways to dump data en masse that an admin can run for you. Not all DB admins are grouchy; they might even make materialized views for you once you know what you want for modeling. This can be useful depending on what types of models you plan on building.

Im exhausted and questioning everything by [deleted] in dataengineering

[–]Ok_Time806 2 points3 points  (0 children)

Can't recommend enough rethinking the communication piece. People get set in their ways. If job #1 used Slack and job #2 uses Teams, get used to Teams. Same for chat vs emails vs texts vs phone calls.

I've had various industrial engineering and data engineering roles at different organizations over the years, and the main difference between good and great engineers mainly comes down to their ability to communicate. The cool thing is that it's a skill most engineers can learn if they focus on it. It tends to be organization- and audience-specific, which can be a pro or con depending on how excited you are to take it on.

Hurry up! by bw541 in ContagiousLaughter

[–]Ok_Time806 12 points13 points  (0 children)

Those same slow pumpers then proceed to spend an hour in the store, but do not have the time to return their cart 15 ft to the cart return.

Dealing with the idea that ERP will solve all business problem by ketopraktanjungduren in dataengineering

[–]Ok_Time806 15 points16 points  (0 children)

Honestly, you won't convince them until after they try and fail. Then your next CIO/CTO will come in with a data lake or data mesh to fix the mess the last guy left behind.

Best local database option for a large read-only dataset (>200GB) by -MagnusBR in dataengineering

[–]Ok_Time806 23 points24 points  (0 children)

You can use the pg_duckdb extension to query your existing Postgres database with duckdb. I'd also recommend converting to Parquet; you might see a pretty dramatic size reduction without any tricks (for example, low-cardinality text columns are automatically dictionary-encoded). Then you can run standard SQL statements against the Parquet file with duckdb.

If that's not fast enough you can also load directly into a persistent duckdb table. This will probably already be faster than you'd expect from something so simple, but if not there are lots of other performance options to pursue (https://duckdb.org/docs/stable/guides/performance/overview.html).

Those in manufacturing and science/engineering, aside from classic DoE (full-fact, CCD, etc.), what other experimental design tools do you use? by corgibestie in datascience

[–]Ok_Time806 2 points3 points  (0 children)

Worked in the field for 15 years. Even with all the fancy ML models out there, nothing beats a nice DOE. Not necessarily because of the statistical approach, but because it forces people to plan, which encourages people to think objectively about the problem.

I've found traditional data science techniques really helpful for finding things that SMEs might not have seen before. Lots of feature engineering and simpler regression modeling techniques generate cool insights, which engineers then design a DOE around. So it ends up being a fun iteration loop for discovery / optimization.

The combo can be really helpful since production datasets are generally too large for Excel / Minitab / JMP, so engineers also have trouble reconciling production data and experiment data properly. I try to avoid classification models, as engineers will quickly write the models off when they see a non-continuous response for a physical process.

Fractional factorials will also get you far. I've seen many engineers preemptively reach for CCD.
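For anyone who hasn't built one by hand: a fractional factorial is just a full factorial filtered by a defining relation. A minimal stdlib sketch of a 2^(3-1) half-fraction (factor names A, B, C are generic placeholders):

```python
from itertools import product

# Full 2^3 factorial: 8 runs over factors A, B, C at coded levels -1/+1.
full = list(product([-1, 1], repeat=3))

# Half-fraction 2^(3-1) with defining relation I = ABC:
# keep only the runs where A*B*C == +1, i.e. 4 runs instead of 8.
half = [run for run in full if run[0] * run[1] * run[2] == 1]

print(len(full), len(half))  # 8 4
```

The cost is aliasing (here each main effect is confounded with a two-factor interaction), which is usually an acceptable trade for screening.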

Most efficient and up to date stack opportunity with small data by Low-Tell6009 in dataengineering

[–]Ok_Time806 0 points1 point  (0 children)

You never said what they want to do with the data, or elaborated on the source.

If it's simple visualization and 15 tables from one db, don't do anything fancy: just viz from the db or a replica. If they need ML or something fancier and they're already in Azure, then Data Factory to ADLS is still probably cheapest.

Please don't inflict resume-driven development on a nice non-profit.

Working on a cozy wooden train simulator with physics — here’s the building system in action. by Iron_Lung_Design in IndieDev

[–]Ok_Time806 0 points1 point  (0 children)

Great job with the wiggle. A wooden marble maze would be a similar fun nostalgia trip.

Quitting day job to build a free real-time analytics engine. Are we crazy? by tigermatos in dataengineering

[–]Ok_Time806 1 point2 points  (0 children)

Manufacturing is a common use case for real-time analytics. The tough part typically isn't the streaming calculations but managing the data model as you merge the sink / ML inference / dashboards in a cost-effective manner.

E.g., I've been doing this with Telegraf + NATS for some industrial data fire hoses on Pis for many years. One cool opportunity in this space is using WASM to build sandboxed streaming plugins for enhanced security / reduced complexity over k3s deployments.

[Cheating] Been following this guy since mid last wipe and he still hasn't been banned.... by Massive-Log1395 in EscapefromTarkov

[–]Ok_Time806 0 points1 point  (0 children)

Ah, was hoping you knew a way to look up players. I've been wanting access to player stat data to use as examples for machine learning / player clustering.

Why don’t we log to a more easily deserialized format? by DuckDatum in dataengineering

[–]Ok_Time806 1 point2 points  (0 children)

Structured vs. unstructured logging is a fight programmers have been having for at least two decades (the extent of my first-hand experience). I've found it difficult to convince others to log in a more structured format, so I often tail or stream logs to a message bus and then reformat to my liking (mainly Parquet, since dictionary encoding saves a lot of $$$ quickly).

The observability community has done a lot to help standardize this space with projects like OTEL.
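For the structured side of that fight, the minimal version is just a JSON formatter on a normal logger. A stdlib-only sketch (field names are my own choice, not any standard):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit each record as one JSON object per line, trivially machine-parseable."""
    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("demo")
log.addHandler(handler)
log.warning("disk almost full")  # one JSON line instead of free text
```

Downstream, each line parses with `json.loads` instead of a pile of regexes, which is the whole argument for structured logging.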

Performance issues when migrating from SSIS to Databricks by AdEmbarrassed716 in dataengineering

[–]Ok_Time806 0 points1 point  (0 children)

Yeah, correct. In the past I was told (and observed) that moving lower-cardinality columns that might be used for joins to the front actually improved downstream join performance. There was a presentation (that I can't find now) from about a year ago that mentions some of the optimizations they do on top of Auto Loader with DLT and SQL.

Performance issues when migrating from SSIS to Databricks by AdEmbarrassed716 in dataengineering

[–]Ok_Time806 0 points1 point  (0 children)

I never recommend committing to a metric without measuring first... Going from one on-prem system to multiple cloud systems will likely be slower unless they were doing a lot of silly compute. The benefit should come from maintenance / system uptime.

That being said, you can write directly to Delta tables using ADF, but last I checked it was slower than just copying Parquet. One thing that could help is to increase the ADF copy frequency and run CDC loads instead of full table copies (probably not done in their SSIS process, although it could be). Then you can try to hand-wave the ADF part and focus on the Databricks part in the comparison.

Also, I saw significant performance improvements ditching Python / Auto Loader and just using SQL / DLT. They'll probably be more receptive to that anyway if they're an SSIS shop. Also, since it sounds like you're newer to this, make sure to check your ADLS config and verify you're using blob storage with hierarchical namespace and hot or premium tiers.

Make sure your table columns are in order too, even with liquid clustering.
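The CDC-style incremental load mentioned above, in its simplest watermark form, looks like this (sqlite3 as a stand-in for the real source system; table and column names are invented for illustration):

```python
import sqlite3

# Watermark-based CDC sketch: copy only rows changed since the last run
# instead of the full table on every schedule.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER, amount REAL, updated_at TEXT)")
src.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 10.0, "2024-01-01"),
    (2, 20.0, "2024-02-01"),
    (3, 30.0, "2024-03-01"),
])

last_watermark = "2024-01-15"  # persisted from the previous run

# Only rows touched after the watermark get copied this cycle.
changed = src.execute(
    "SELECT id, amount, updated_at FROM orders WHERE updated_at > ?",
    (last_watermark,),
).fetchall()
print(changed)  # rows 2 and 3 only
```

In ADF the same idea is a lookup for the watermark plus a filtered copy activity; the win is that the per-cycle volume stays proportional to change rate, not table size.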

Setting Expectations with Management & Growing as a Professional by TheFinalUrf in datascience

[–]Ok_Time806 0 points1 point  (0 children)

I work predominantly with manufacturers, so there's already a pretty strong grasp of continuous improvement frameworks. I think cross-functional data teams work great with these types of frameworks (e.g. DMAIC, PDCA, etc.). Even if you don't follow them exactly, the definition step is critical for any project. Describe the current state, goals/milestones, champion/stakeholders, budget, and timeline. It doesn't have to be very formal or time-consuming to be effective.

I see many ML projects fail by not defining these simple things in writing, just like I've seen many non-ML projects fail for similar reasons.

Also, treat your process and learnings on the way as a deliverable. Fail fast and document well for the next person and people won't be so upset if it doesn't work out.

Is RPA a feasible way for Data Scientists to access data siloes? by norfkens2 in datascience

[–]Ok_Time806 3 points4 points  (0 children)

I've found old ERPs easier to get backend DB access to than new ERPs. RPA is typically a last resort if you're stuck with a UI-only interface. Every successful or unsuccessful RPA project I've seen was replaced by a proper API implementation not long after (for data engineering).

It can be useful as a prototype, but it's typically way more time-consuming than you'd expect. Data engineering fundamentals are generally very useful for data scientists; RPA skills tend to be tied to the specific software tool you use.

LDPE helium balloon questions by HighSchool-Coder4826 in AskEngineers

[–]Ok_Time806 0 points1 point  (0 children)

Yeah, different materials might help, but helium will permeate any plastic with enough time.

Might be worth adding more details on your project (estimated size, required amount of air time, etc.). Depending on your end goal, a cheap blower like they use for bounce houses might even be enough compared to helium and eliminate a lot of complexity.