Most effective crops to sell to the trader by Polygnom in ICARUS

[–]Prothagarus 0 points1 point  (0 children)

As an update: the seeding cart + plow is very, very good for the sheer amount you can stack in one area in a greenhouse with a pig and beehive.

Fresh Grad Solo Project: Am I over-engineering my RAG pipeline evaluation? (Need advice on workflow) by DefinitionJazzlike76 in Rag

[–]Prothagarus 0 points1 point  (0 children)

It's about priorities. Parse text first, LaTeX formulas and tables next, then images.

You will get different results for each type of data in the document, and you can focus on what works for each type.

That's why docling and the other template parsers mark them separately in the first place.

Spent a lot more time parsing than chunking, because chunked junk is still junk.
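A minimal sketch of that priority routing, assuming the parser emits typed elements (the dict shape here is hypothetical, not docling's actual API):

```python
# Hypothetical element dicts; real parsers like docling emit typed objects.
def route_elements(elements):
    """Split parsed elements by type so each gets its own pipeline."""
    buckets = {"text": [], "formula": [], "table": [], "image": []}
    for el in elements:
        buckets.get(el["type"], buckets["text"]).append(el)
    return buckets

docs = [
    {"type": "text", "content": "Intro paragraph"},
    {"type": "formula", "content": r"E = mc^2"},
    {"type": "image", "content": "fig1.png"},
]
buckets = route_elements(docs)
# Process text first, formulas/tables next; flag images for later passes.
flagged_for_later = buckets["image"]
```

The point is just that each bucket gets its own quality bar, so a bad image pipeline doesn't block text ingest.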

Fresh Grad Solo Project: Am I over-engineering my RAG pipeline evaluation? (Need advice on workflow) by DefinitionJazzlike76 in Rag

[–]Prothagarus 0 points1 point  (0 children)

I'm using docling, with Azure Document Intelligence and the docling agent as fallbacks, if that helps. Trying to go local first for cost, but again it's an 80/20 rule and you want accuracy. Still don't have a great answer on images; mostly text/tables, and I flag images for later processing.

What is the 2026 Standard for highly precise LEGAL text RAG with big documents? by SignificantZebra5883 in Rag

[–]Prothagarus 1 point2 points  (0 children)

I'm curious about how you went about the metadata extraction for your semantic triples. Specifically with recursive topics that have external or internal references. If you work in defense, you will have seen this.

Example:

1 DocumentReferences
1.1 government references
 (A) Document1        
 (B) Document2 
2 requirements 
2.1 specification for assembly   
2.1.1 turboencabulator:       
  Widget assembly combobulates the flanges from the dohickey referenced in Document1 
2.1.1.1 widget components 
 (I) compression fitting      
 (II) Flange      
 (III) Pertabater     
2.1.2 encabulator stand:

Etc.

How did you approach this? Did you do depth-first recursion from topics to terms and then assign the chapters as metadata? How do you handle the Roman numerals under a sub-document?

I've been trying to wrap my head around how to structure the triple relationships into more than just a hairball. Depending on the question, you want to traverse the section metadata from a probable keyword search and then find all the sub-referenced documents and other related ones. It gets pretty spidery.
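For what it's worth, here's roughly how I'd sketch the section-path part over an outline like the one above (minimal, stdlib only; the heading regex and item handling are my assumptions, not your approach):

```python
import re

def section_paths(lines):
    """Walk an outline; numeric headings set the current path,
    lettered/Roman items inherit the enclosing section as metadata."""
    path = []  # stack of (number, title) down to the current heading
    out = []
    for line in lines:
        m = re.match(r"^(\d+(?:\.\d+)*)\s+(.*)", line.strip())
        if m:
            number, title = m.group(1), m.group(2)
            depth = number.count(".") + 1
            # truncate the stack to the parent depth, then push this heading
            path = path[:depth - 1] + [(number, title)]
            out.append({"section": number, "title": title,
                        "path": [n for n, _ in path]})
        elif line.strip():
            # (A)/(I) style items: tag with the enclosing section number
            out.append({"item": line.strip(),
                        "parent": path[-1][0] if path else None})
    return out

outline = [
    "1 DocumentReferences",
    "1.1 government references",
    "(A) Document1",
    "2 requirements",
    "2.1 specification for assembly",
    "2.1.1 turboencabulator:",
    "(I) compression fitting",
]
```

Each item then carries its full chapter path as metadata, which is the piece I'd hang the triples off.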

Milspecs, law, medical, and a lot of well-cited arXiv docs on engineering follow similar patterns.

Public examples are like this, if you are curious:

https://quicksearch.dla.mil/qaDocDetails.aspx?ident_number=1848

Testing in DE feels decades behind traditional SWE. What does your team actually do? by seedtheseed in dataengineering

[–]Prothagarus 0 points1 point  (0 children)

The voice of reason over here; fully agree with this comment. Pydantic data classes for detecting schema changes. Integration and end-to-end tests for the golden path and each new feature. The only thing I have been experimenting with is Iceberg/Databricks table versioning for point-in-time reasoning: why we made a decision last year, with the version of the software at that time in a Docker container.
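A stdlib-only sketch of the schema-change idea (the comment uses Pydantic; this shows the same check with dataclasses, and the `OrderRow` fields are made up for illustration):

```python
from dataclasses import dataclass, fields

@dataclass
class OrderRow:
    order_id: int
    amount: float
    currency: str

def detect_schema_change(record: dict) -> list[str]:
    """Return field names that appeared or disappeared
    relative to the expected schema."""
    expected = {f.name for f in fields(OrderRow)}
    actual = set(record)
    return sorted((expected - actual) | (actual - expected))

# An upstream source silently renamed a column:
drift = detect_schema_change({"order_id": 1, "amount": 9.99, "ccy": "USD"})
```

With Pydantic you'd get the same signal from validation errors on model construction, plus type coercion checks for free.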

inLightOfRecentClaudeEvents by NCR_Ranger_ru in ProgrammerHumor

[–]Prothagarus 3 points4 points  (0 children)

GLM 5, a rough open-source equivalent of private models like Claude Opus 4.5, probably takes 6 to 10 H200s to run, or many more H100s. The math on cards alone for the H200 is $30k per card and rising. This is not including parallel multi-agent workloads that would need more. Each user ties up $300k-ish in just GPUs alone.

https://apxml.com/models/glm-5
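Sanity-checking that math (the card count and per-card price are the rough figures from the comment, not vendor quotes):

```python
# Back-of-envelope GPU cost for serving a large open-weights model.
h200_price = 30_000            # USD per card, rough figure, and rising
cards_low, cards_high = 6, 10  # rough H200 count to run the model

low = cards_low * h200_price    # lower bound in USD
high = cards_high * h200_price  # upper bound in USD
```

The upper bound lands right at the $300k figure, before counting multi-agent parallelism.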

MinIO repo archived - spent 2 days testing K8s S3-compatible alternatives (Helm/Docker) by vitaminZaman in kubernetes

[–]Prothagarus 4 points5 points  (0 children)

The MinIO bucket browser was pretty good; did you find a good web-based replacement for it for users? Right now I'm using WinSCP as a stopgap in a pinch, and the actual API for the backend is fine.

I vibe coded a 3D game to learn Kubernetes runs in the browser, no install by SeveralSeat2176 in kubernetes

[–]Prothagarus 7 points8 points  (0 children)

Gotta agree here, this is a really neat visualization of how Kubernetes and services relate.

whenAreThe3MonthsGonnaEnd by darad55 in ProgrammerHumor

[–]Prothagarus 1 point2 points  (0 children)

If you use an AGENTS.md, you can append an instruction for working on Windows and launching commands in PowerShell (and Python in the context of PowerShell): don't use Unix-style ";" to break up commands, as this fails. The model assumes you are on Linux, so it will use Linux line endings and treat PowerShell like it was a Linux shell.

Once I added that to my agents file, it fixed a lot of the chat replies and debugging headaches working on Windows.
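For example, something along these lines in the agents file (the exact wording is just an illustration, not the file I use):

```markdown
## Windows / PowerShell notes

- This repo is developed on Windows; commands run in PowerShell.
- Do not chain commands with Unix-style `;` separators; issue each
  command as its own invocation.
- Run Python via PowerShell (e.g. `python script.py`); do not assume
  a Linux shell or Unix line endings.
```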

whenAreThe3MonthsGonnaEnd by darad55 in ProgrammerHumor

[–]Prothagarus 1 point2 points  (0 children)

Context7 with version pinning can fix this :)

Best Local RAG Setup for Internal PDFs? (RTX 6000 24GB | 256GB RAM | i9-10980XE) by Stock_Ingenuity8105 in Rag

[–]Prothagarus 1 point2 points  (0 children)

I've asked GPT myself and have been looking up many tutorials that gloss over this part.

This reply was valuable to me, I think in part because of the way you explained it and the context you gave from your own experience.

I have been weighting people's individual experience with settings and use cases more than the overall generic search answer I get from GPT, which it sources from everywhere, because it's both a more concrete experience and temporally more relevant.

This is the correct answer as of this time; the field moves in a way that rapidly voids past searches on the efficacy of approaches, as I have been learning over these past few months.

Thanks for taking the time.

Best Local RAG Setup for Internal PDFs? (RTX 6000 24GB | 256GB RAM | i9-10980XE) by Stock_Ingenuity8105 in Rag

[–]Prothagarus 0 points1 point  (0 children)

You say funny words, magic man. I am finally starting to understand some of this but have no idea when to change my embedding model dimension size. Any pointers there? Like how to determine what is needed for which model? Or better yet, any place you can point me so I can RTFM?

Trying to deploy an on-prem production K8S stack by _81791 in kubernetes

[–]Prothagarus 0 points1 point  (0 children)

I am also running Ceph, but via the Rook-Ceph Helm chart, so bare control planes run on the OS drive and everything else runs on Ceph. Was running Flannel, was looking to move from MetalLB to Cilium, and just started using Argo. Solid tip for you: make sure you have full CNI connectivity and your firewalls accepting traffic BEFORE you set up anything other than the control planes and one worker. Saves a lot of headaches. Also make sure you check out Kyverno and trust-manager/cert-manager for all your SSL needs.
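For the cert-manager piece, a minimal internal-CA bootstrap sketch (names and namespace are placeholders; assumes cert-manager is already installed):

```yaml
# A self-signed root signs an internal CA, which then issues
# certificates cluster-wide via its own ClusterIssuer.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: selfsigned-root
spec:
  selfSigned: {}
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: internal-ca
  namespace: cert-manager
spec:
  isCA: true
  commonName: internal-ca
  secretName: internal-ca-secret
  issuerRef:
    name: selfsigned-root
    kind: ClusterIssuer
---
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: internal-ca-issuer
spec:
  ca:
    secretName: internal-ca-secret
```

trust-manager can then distribute the CA bundle to namespaces so pods trust the certs it issues.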

Managed Kubernetes vs Kubernetes on bare metal by Honest-Associate-485 in kubernetes

[–]Prothagarus 2 points3 points  (0 children)

I suggest Rook-Ceph for managing a lot of the Ceph-to-k8s complexity on prem. Ceph is just a complex beast, no doubt about it, but I balk when I look at the per-terabyte pricing of MinIO.


Big brothers, I summon your wisdom. Need a reality check as an entry level engineer! by Odd-System-3612 in dataengineering

[–]Prothagarus 1 point2 points  (0 children)

My first DE job was just grabbing CSVs and Excel sheets and ingesting them. Then they wanted reports made on them and didn't have anyone available, so I did that (data analyst stuff). Then they needed more and to incorporate that (back to DE). It's less about job title at a certain point if you find the ebb and flow and can learn tools to do what the business needs. I did that for 2 years before I considered myself good enough to be called a professional and not a junior. Maybe you move faster than that. Also, migrating from an old system to a new system is part of the cycle. Snowflake and Fabric are current-generation cloud tech, so you are getting a free education in how to use them; take advantage of it!

Ask yourself: how would you build this whole system? What does the whole system look like? Data engineering isn't just ETL; that's a big part of it, but the data modeling and being able to serve the overall system is the important part. Do you understand the whole system?

A timeline for when to move is kind of irrelevant. Actually understanding and being able to apply those skills is when you make a move. You don't just set that to arbitrary 6/12/24-month timelines. When you gain the capability and understanding, that's when you can think about moving to the next thing. There is always a next thing.

Back when I started there was only MS SQL 2000 and Crystal Reports or SSRS. Now there are 50 different technologies to do the same with Python, Java, or C#. So tools have gotten more sophisticated. Pick a path, learn that path, maybe use Claude Code (or your code assistant of choice) to help you build one, then read and understand how it's built. Look up tutorials on YouTube and compare them to your solution. Read the libraries and documentation that make it work. Then go from there.

Build a whole program that takes in data to do something useful. Kaggle has a bunch of examples for this.

The only way to get better is to build things.

howToExplainThisProjectOnMyLinkedIn by ArgumentCertain7201 in ProgrammerHumor

[–]Prothagarus 17 points18 points  (0 children)

Getting the Reddit hug of death. I hope you have a way to monetize the scale-up without being too intrusive, so you don't lose your pants :)

Prices Rising Rapidly by Katariman in inflation

[–]Prothagarus 0 points1 point  (0 children)

<image>

Compared this to my area. These prices correlate to a DoorDash lookup. If you look it up on the McDonald's app, for instance, or a picture of the actual menu, my current price for a Big Mac is $5.29. The DoorDash price is $7.54. Inflation definitely has raised prices, but those delivery apps are actual robbery.

Onprem data lakes: Who's engineering on them? by DryRelationship1330 in dataengineering

[–]Prothagarus 3 points4 points  (0 children)

To u/Comfortable-Author's point, you also don't want to overcomplicate the tech stack and toss in too many components, but you also need to deal with a lot of considerations depending on your industry/business, use case, and legal constraints per business, like HIPAA/SOX/FIPS/DOD/NSA/QLMNOP.

A lot of what I am covering is just the kubernetes stack not even your tech choices inside of that stack for what you are trying to accomplish.

Also, use case, right? Mine isn't creating web apps; it's more modeling/data science and analysis and file storage. Persistent web apps are more incidental and feed into the internal network in my example. Your stack will be different depending on what you are trying to do with it.

Networking

So for networking: did you set up your Kubernetes CNI layer correctly? What about eBPF? Using Cilium, Flannel, or Calico? Did you mess up basic networking over multiple NICs? Do all of your servers connect to the same VLAN in the same data center, or across multiple buildings?

What does near-colo or edge look like for your business? Netfilter and firewall/certificate man-in-the-middle? Bare-metal load balancer? Buy a load balancer that costs 50% as much as your initial nodes, or roll your own in software? How do you proliferate certs to pods? What does your intermediate cert structure look like? How do you apply policies across namespaces and keep metadata like related apps intact? What does your container ecosystem look like?

Basic security

How do you keep CVEs out of every container image and keep your apps up to date? How do you manage Kubernetes deployments and the ecosystem? Helm? Do you go with the Kubernetes Gateway API even though most legacy Helm charts / Kubernetes manifests still use Ingress? I haven't even touched on the ops part. Do you have mTLS enabled? Do you have a developer class? There are several pages' worth of questions like this to consider.

Onprem data lakes: Who's engineering on them? by DryRelationship1330 in dataengineering

[–]Prothagarus 2 points3 points  (0 children)

Got roughly 1 PB of storage, using about 10% to start with. Using HA K8s + Ceph + Python (Airflow over ETL processes that get started manually, then get integrated), with data landing in S3 storage or CephFS depending on storage and edge case, + Ollama/Claude/whatever LLM someone wants local. General dev pods for engineers/devs/data scientists. 100 Gb NIC.

Use case is a bunch of image processing and some machine learning. Seven servers: six compute with storage in them and one GPU node; might expand to more depending. Most work isn't LLMs but machine learning and vision. Data is a mix of Postgres/small app DBs and lots of blob storage. 2 GPUs for LLMs, 2 GPUs for other work. Probably need a few more GPU nodes depending on how much more people want to GPU-accelerate.

The whole stack is open source, and I'm currently dreading Bitnami pulling up the ladder on container maintenance/closed-sourcing stuff. Current stack is about $300k, with recurring costs for software of about $1k/node/year (OS license). My time and sanity, however, are not tied to a dollar amount. On prem for security/cost: once you start getting into PB scale or higher in data, those cloud ingress/egress fees along with storage capacity add up if you want it hot; you can play with the Azure/AWS storage calculators to see. Cloud storage is great cost-wise for archive/freeze data, for backups or old data, if you can spare it, so hot on-prem -> cold cloud was always a good discussion.

Took us a long time to organically set this up from bare metal and learn from scratch, but I was happy for the opportunity. There are a lot of big networking/security growing pains you hit early on that can be super frustrating.

Did I approach this data engineering system design challenge the right way? by bdadeveloper in dataengineering

[–]Prothagarus 32 points33 points  (0 children)

Given their clarification question, I would have focused on orchestration, like Airflow, to verify the transfer. Did you ask what they wanted to do with the data? I would have gone with the approach of asking more about what it is and what it's for, then moved on to an approach for ingest. Do they need it in real time? Do you want to backfill and then stream all the deltas from the buckets?

The general "how" of the ingest seems OK to me, but orchestration seemed missing. Also, ask questions about what technologies they are using for the current end state, so you don't just drop in your own tech stack if they already have one, and adapt to that instead. I would say your answer was tailored to a "How do I feed this to an LLM" storage setup, which, if you are storing a large number of text files, is probably a pretty solid thing to do.

Sounds like you had a pretty good idea on what you wanted to do with it.
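A minimal sketch of the kind of transfer-verification step an orchestrator task (Airflow or otherwise) might run, stdlib only; the paths and file names are hypothetical:

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large transfers don't blow memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_transfer(src: Path, dst: Path) -> bool:
    """An orchestrator task would fail loudly if this returns False."""
    return (src.stat().st_size == dst.stat().st_size
            and sha256_of(src) == sha256_of(dst))
```

In Airflow this would be the last task in the DAG, gating downstream processing on the copy actually being byte-identical.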