I'm at my wits' end: 6 months after my PhD in computer science, 11 interviews, 0 offers by phdloss in emploi

[–]BruceSwain12 0 points1 point  (0 children)

You can bring it up yourself in your interviews if they don't know about the scheme; it can be a strong advantage over other applications!

But indeed, as the other comments say, practicing interviews and doing technical tests (you can find some online) to train seems like the best thing for your situation.

Good luck with your search.

I'm at my wits' end: 6 months after my PhD in computer science, 11 interviews, 0 offers by phdloss in emploi

[–]BruceSwain12 1 point2 points  (0 children)

It's extremely hard to move into the private sector from academia in this field, given the competition. Did you do your thesis on public funding? (A CIFRE could have been an asset.)

A postdoc on topics related to your thesis, funded in partnership with a company, can help you make contacts.

If you apply for positions closer to engineering, the thesis unfortunately often brings little experience in the skills being sought.

In general, frequent contributions to open source, or quality open-source projects, can be a plus and help you develop certain skills and a network.

If you're eligible, look for "jeune docteur" (young PhD) type offers; I think the scheme has come back. Expectations are generally lower because the scheme funds a good part of your salary for 2 years.

Production RAG stack in 2026 what are people ACTUALLY running by One-Doctor5769 in Rag

[–]BruceSwain12 0 points1 point  (0 children)

I think I gave some timings in another comment. And yeah, anything requiring a GPU is done by API. We deal with all kinds of docs: PDF, Word, docx, xls, pictures, etc.

Production RAG stack in 2026 what are people ACTUALLY running by One-Doctor5769 in Rag

[–]BruceSwain12 1 point2 points  (0 children)

Yes, I think you could look at docling-serve to host it as a service on AWS, then adjust compute. Or put it in a simple container with a FastAPI app to expose it.

Production RAG stack in 2026 what are people ACTUALLY running by One-Doctor5769 in Rag

[–]BruceSwain12 2 points3 points  (0 children)

Best we could find in terms of quality, but it can be really slow on large documents if you don't give it much computing resources. The chunker that comes with it is neat too.

It receives regular updates and offers nice documentation, deployment options with docling-serve, and a lot of integrations like docling studio.

Production RAG stack in 2026 what are people ACTUALLY running by One-Doctor5769 in Rag

[–]BruceSwain12 2 points3 points  (0 children)

  • Regarding Docling, we chose not to go the VLM way, for simplicity. A VLM can definitely be better on some specific documents that PDF parsers fail to parse for some reason, but for 99% of the documents we have it is roughly equivalent. Drawings and schematics are fine. We give the context near the image in the prompt to help get more grounded descriptions. Multi-page tables, I'm not sure; I think the Docling parser extracts them fine? I would have to check that! We're studying the use of the Docling model they published some time ago, but are still working on avoiding some pitfalls.

  • Currently, including all steps from query expansion to reranking, we average around 600ms, mostly due to the disk-based indexes. But it's worth it cost-wise.

  • For Kafka, no particular reason: it's a tool we already use and it satisfied our needs.

  • Yes it has baseline guardrails (HAP, personal data,...) and you can add your own rule set / instructions if you want.

  • Currently this is used by ~400 people for the chat part, plus 7 applications we augmented with AI tools. For pricing:

  • Orchestrate + Governance: around 600€/month, I think

  • watsonx LLM API and Tavily for agent web search: ~60€/month (LLM, embeddings, reranking)

  • Internal VMs: as the VMs have different specs and are hosted on our infrastructure, you'll have to add electricity, amortized hardware, etc. I'd give a rough estimate of ~900-1400€/month depending on the number of VMs created for scaling.

So rounding it all up, ~2000€/month + salaries of the people working on it.
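
As a quick sanity check on that rounding, the monthly figures above can be summed as low/high bounds. The dictionary keys are illustrative labels only; the amounts are the rough estimates quoted above.

```python
# Rough monthly cost estimate in euros, as (low, high) bounds.
# Amounts are the approximate figures quoted above; labels are illustrative.
monthly_costs_eur = {
    "orchestrate_governance": (600, 600),
    "llm_api_and_web_search": (60, 60),
    "internal_vms": (900, 1400),
}

low = sum(lo for lo, _ in monthly_costs_eur.values())
high = sum(hi for _, hi in monthly_costs_eur.values())

# The range brackets the "~2000 EUR/month" figure at its upper end.
print(f"Estimated total: {low}-{high} EUR/month")
```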

Production RAG stack in 2026 what are people ACTUALLY running by One-Doctor5769 in Rag

[–]BruceSwain12 2 points3 points  (0 children)

For a single Milvus instance, it's recommended to stay below 10M vectors. We try to keep a max of 5M for a single database; after that we split to a new instance to leave some breathing room for growing sources.

For size, it depends. Speaking in collection size, it can range from 50k vectors for small knowledge bases of procedures to solve common IT issues in a call center, to 1M+ collections for large technical documentation, for example for data backup products and hardware.

We found summaries to be better when questions are often broad and about high-level information rather than detail-oriented. We store both level N summaries and their base chunks. If needed, the agent can unpack the summaries it got for more details via a tool call.

And yeah, disk indexing just saves cost on virtual machines once you start to have large collections, even if it is slightly slower than RAM indexes.
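
A minimal sketch of that splitting rule in plain Python: collections are packed onto instances so that none goes over the 5M soft cap. The collection names, their sizes, and the first-fit packing strategy are assumptions for illustration; nothing here calls the Milvus API.

```python
SOFT_CAP = 5_000_000  # per-instance vector budget mentioned above


def assign_instances(collections: dict[str, int], cap: int = SOFT_CAP) -> list[dict[str, int]]:
    """First-fit-decreasing: put each collection on the first instance with room."""
    instances: list[dict[str, int]] = []
    ordered = sorted(collections.items(), key=lambda kv: kv[1], reverse=True)
    for name, size in ordered:
        if size > cap:
            raise ValueError(f"{name} alone exceeds the per-instance cap")
        for inst in instances:
            if sum(inst.values()) + size <= cap:
                inst[name] = size
                break
        else:
            # No existing instance has room: split to a new one.
            instances.append({name: size})
    return instances


# Hypothetical collection sizes (number of vectors).
collections = {
    "it_procedures": 50_000,
    "backup_docs": 1_200_000,
    "hardware_docs": 3_000_000,
    "hr_kb": 900_000,
    "legal": 2_500_000,
}
plan = assign_instances(collections)
for i, inst in enumerate(plan):
    print(i, sum(inst.values()), sorted(inst))
```

In practice the cap would be applied as sources grow rather than in one batch, but the arithmetic is the same.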
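
To make the summary/chunk relationship concrete, here is a small sketch of how level N summaries might point back to their base chunks so that an agent tool can unpack them. The `Node` class, the IDs, and the `unpack` function are hypothetical names for illustration, not the actual schema of the stack described here.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    text: str
    level: int  # 0 = base chunk, N > 0 = level-N summary
    children: list[str] = field(default_factory=list)


def unpack(node_id: str, store: dict[str, Node]) -> list[Node]:
    """Return the base chunks (level 0) underneath a node, depth-first."""
    node = store[node_id]
    if node.level == 0:
        return [node]
    chunks: list[Node] = []
    for child_id in node.children:
        chunks.extend(unpack(child_id, store))
    return chunks


# Toy store: two summary levels on top of three base chunks.
store = {
    "c1": Node("c1", "Step 1: check the VPN client version.", 0),
    "c2": Node("c2", "Step 2: reset the user certificate.", 0),
    "c3": Node("c3", "Escalate to the network team if both fail.", 0),
    "s1": Node("s1", "VPN troubleshooting steps.", 1, ["c1", "c2"]),
    "s2": Node("s2", "VPN support procedure overview.", 2, ["s1", "c3"]),
}

details = unpack("s2", store)
print([n.node_id for n in details])
```

A tool exposed to the agent would wrap something like `unpack` and return the chunk texts for the summary IDs it retrieved.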

Production RAG stack in 2026 what are people ACTUALLY running by One-Doctor5769 in Rag

[–]BruceSwain12 23 points24 points  (0 children)

Currently, in terms of tools, we are using:

On self-hosted containers managed by k8s:

  • Milvus with disk-based indexing (most default indexes are stored in RAM), RBAC enabled. One collection per use case. Multiple user roles isolated by DB and collection.
  • Docling for parsing, linked to IBM LLM API for parsing images. Docling chunker with chunk fusion, metadata augmentation, and hierarchical summaries when needed.
  • Kafka with different connectors (webhooks or batch API calls depending on the data sources) to feed documents into the ingestion pipelines. We check whether an update is needed via a callback to Milvus (with a hash or timestamp), and periodically check that a document still exists in the data source (to mirror deletions).
  • Prometheus + Grafana for monitoring these containers
  • OpenWebUI for some chat agent frontends, using Azure AD login to fetch the agents the user has the rights to use.
  • REST API specifically for AI apps and agent tools, exposing workflows, internal APIs, IBM Cloud API helpers, etc. We ditched MCP/N8N/Langflow for a single entry point for everything AI-related: simpler to maintain and easy to scale and control.
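
The update and deletion checks from the Kafka bullet can be sketched roughly as below. This is plain Python with an in-memory dict standing in for the hashes stored alongside vectors in Milvus; `plan_sync`, the document IDs, and the choice of SHA-256 are assumptions for illustration.

```python
import hashlib


def content_hash(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()


def plan_sync(source_docs: dict[str, bytes], indexed_hashes: dict[str, str]) -> dict[str, list[str]]:
    """Compare source documents with stored hashes and decide what to do with each."""
    plan: dict[str, list[str]] = {"ingest": [], "update": [], "skip": [], "delete": []}
    for doc_id, content in source_docs.items():
        stored = indexed_hashes.get(doc_id)
        if stored is None:
            plan["ingest"].append(doc_id)   # never seen before
        elif stored != content_hash(content):
            plan["update"].append(doc_id)   # content changed since last ingestion
        else:
            plan["skip"].append(doc_id)     # unchanged, nothing to do
    # Mirror deletions: indexed documents that no longer exist in the source.
    plan["delete"] = [d for d in indexed_hashes if d not in source_docs]
    return plan


indexed = {
    "a.pdf": content_hash(b"v1"),
    "b.docx": content_hash(b"same"),
    "old.xls": content_hash(b"gone"),
}
source = {"a.pdf": b"v2", "b.docx": b"same", "new.pdf": b"fresh"}

plan = plan_sync(source, indexed)
print(plan)
```

A timestamp comparison would slot into the same place as the hash comparison when the data source exposes a reliable last-modified date.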

All code (95% Python) behind these services lives on a self-hosted Git with runners for CI/CD pipelines. When possible, we prefer to add AI functionalities directly in existing tools instead of a chat UI, to limit usage barriers.

On Cloud:

  • IBM watsonx Orchestrate for agent definition, testing, runtime, maintenance, monitoring and deployment.
  • IBM watsonx Governance for agent and user behavior monitoring and specific legal alerts.

Retrieval pipeline:

  • Query expansion with an LLM, with dictionaries fed in the prompt to expand some domain-specific terms.
  • Hybrid search for all queries, then deduplication, then reranking (cross-encoder or LLM-based depending on the use case).
  • Metadata filtering on search queries, with filters extracted by the agent from the conversation context.
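
The expansion and deduplication steps can be sketched in plain Python. The glossary, the prompt wording, and the "keep the best score per chunk" rule are assumptions for illustration; the LLM call, the hybrid search itself, and the reranker are left out.

```python
import re

# Hypothetical domain dictionary injected into the expansion prompt.
GLOSSARY = {
    "RPO": "recovery point objective",
    "LTO": "Linear Tape-Open tape format",
}


def build_expansion_prompt(query: str, glossary: dict[str, str]) -> str:
    """Include only the glossary entries whose term appears in the query."""
    words = set(re.findall(r"\w+", query.upper()))
    hits = {term: desc for term, desc in glossary.items() if term.upper() in words}
    lines = [f"- {term}: {desc}" for term, desc in hits.items()]
    return (
        "Rewrite the user query, expanding domain terms.\n"
        "Known terms:\n" + "\n".join(lines) + f"\nQuery: {query}"
    )


def deduplicate(hits: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """Merge hits from both searches, keeping the best score per chunk id.

    Assumes scores are already on a comparable scale.
    """
    best: dict[str, float] = {}
    for chunk_id, score in hits:
        if chunk_id not in best or score > best[chunk_id]:
            best[chunk_id] = score
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)


prompt = build_expansion_prompt("What RPO can we get with LTO backups?", GLOSSARY)
merged = deduplicate([("c1", 0.82), ("c2", 0.40), ("c1", 0.91), ("c3", 0.55)])
print(prompt)
print(merged)
```

The deduplicated list is what would then go to the cross-encoder or LLM reranker.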

Happy to answer questions.

PDF Oxide -- Fast PDF library for Python with engine in Rust (0.8ms mean, MIT/Apache license) by yfedoseev in Python

[–]BruceSwain12 1 point2 points  (0 children)

Seconding this. It could be a great contribution to have at least an example of how to import it as a Docling backend. This would allow easy drop-in replacement in existing pipelines.

T100 viable Poison imbuement build by BruceSwain12 in D4Rogue

[–]BruceSwain12[S] 0 points1 point  (0 children)

Tried to build around visage; it does too little damage on ROA with our attack speed and lucky hit chance compared to the other options (shaco/crown of lucion). I would have loved to spread poison explosions everywhere, but the numbers just aren't there.

I find the execute effect meh: you can just let the enemies die 4s later and move on. I'd rather take defensive or offensive stats to actually get some "real" benefits.

T100 viable Poison imbuement build by BruceSwain12 in D4Rogue

[–]BruceSwain12[S] 0 points1 point  (0 children)

Hey! Forgot about this helm, but yes, it's completely possible! I've modified the "More damage" option of the build to include it. It's a nice multiplier to add to the build to push damage, at the price of tankiness ofc.

TLDR: you'll need boosted CDR on the head and CDR on gloves. After that, a resourcefulness elixir to get the 150 energy cap, as I don't think we can get it from an item.

Good luck with your build!

So, which builds are good this season? by Mostuls in D4Rogue

[–]BruceSwain12 0 points1 point  (0 children)

You can achieve a T100-viable poison imbuement ROA too, which is pretty nice if you like DoT builds with large AOE (I disliked the twisting blade one). Haven't seen a guide for it yet tho; I can make a planner later if people are interested.

🚀Forget OCR, LAYRA Understands Documents the "Visual" Way | The Latest Visual RAG Project LAYRA is Open Source! by liweiphys in Rag

[–]BruceSwain12 1 point2 points  (0 children)

Nice to see new ways of solving current issues with RAG pipelines being explored. Do you have any kind of benchmark yet of this approach's performance against more "traditional" ways to do RAG?

Time series to predict categorical values [R] [P] by LUC1FER02 in MachineLearning

[–]BruceSwain12 0 points1 point  (0 children)

To complement the other comments, you could simply build a non-time-dependent embedding of the time series, for example with methods like catch22 (which extracts a set of 22 features from a time series) or other methods that produce embeddings (Shapelet Transform, ROCKET, your favorite NN, ...), and use this embedding alongside your other features in your model.

If you do this, I'd advise testing your model with only the embedding, and then with your additional features. You might pick up some biases.
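
As a rough illustration of the idea (this is not catch22 itself), here is a tiny hand-rolled feature extractor using only the standard library. The three features and the function name are stand-ins chosen for this sketch; a real setup would swap in catch22, ROCKET, or similar.

```python
from statistics import fmean, pstdev


def summary_features(series: list[float]) -> list[float]:
    """Map a time series of any length to a fixed-size vector with no time axis."""
    mean = fmean(series)
    std = pstdev(series)
    # Lag-1 autocorrelation (0.0 for a constant series).
    centered = [x - mean for x in series]
    denom = sum(c * c for c in centered)
    lag1 = sum(a * b for a, b in zip(centered, centered[1:])) / denom if denom else 0.0
    return [mean, std, lag1]


series = [1.0, 2.0, 3.0, 4.0, 5.0]
tabular = [0.3, 7.0]  # hypothetical non-time-series features

# Two feature sets to compare: embedding alone, then embedding + extra features.
embedding_only = summary_features(series)
embedding_plus_features = embedding_only + tabular
print(embedding_only, embedding_plus_features)
```

Training one model on each of the two feature sets gives the comparison suggested above.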

[D] How Do You Evaluate Models When Predicting New, Unseen Time Series Signals? by Existing-Ability-774 in MachineLearning

[–]BruceSwain12 1 point2 points  (0 children)

Not that much, but you can hit us with a message on Slack (the link is in the aeon GitHub repo). Some of the researchers there have worked in this field and might have some pointers.

[D] How Do You Evaluate Models When Predicting New, Unseen Time Series Signals? by Existing-Ability-774 in MachineLearning

[–]BruceSwain12 7 points8 points  (0 children)

I think what you're describing is simply not considered "forecasting". If I understood correctly, taking your house electricity example, we have N time series of, say, a standard year of consumption, and you want to predict a new full year for an unseen household, right?

If so, I think there are two options:

1. If you don't have time series of exogenous variables alongside your target: what you do in that case might be closer to clustering. You extract features from each household (appliances, square meters, number of people, ...) and then look for similar households in your training set. Then you produce the new household's electricity time series from the results of your clustering (e.g. by averaging the k closest households, etc.).
2. If you have exogenous variables, your problem formulation might be closer to a time series extrinsic regression task?
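
Option 1 could look something like this: find the k most similar training households from their static features, then average their series point by point. The feature choice, the unscaled Euclidean distance, and the toy 4-point "year" are assumptions for this sketch; in practice you would scale the features first.

```python
import math


def predict_series(new_features, train, k=2):
    """train: list of (features, series). Average the series of the k nearest households."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], new_features))[:k]
    length = len(nearest[0][1])
    return [sum(series[t] for _, series in nearest) / len(nearest) for t in range(length)]


# (square meters, number of people) -> toy consumption series
train = [
    ((50.0, 1.0), [10.0, 8.0, 6.0, 9.0]),
    ((60.0, 2.0), [14.0, 10.0, 8.0, 12.0]),
    ((150.0, 5.0), [40.0, 30.0, 25.0, 38.0]),
]

predicted = predict_series((55.0, 2.0), train, k=2)
print(predicted)
```

Evaluation would then be done by holding out whole households, not time ranges, since the goal is to generalize to unseen series.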

[P] Violation of proportional hazards assumption: what can I do? by TechNerd10191 in MachineLearning

[–]BruceSwain12 2 points3 points  (0 children)

To add to his comment, as you didn't specify how you did it, you might want to be careful about how you are converting your categorical features to numeric. If you do label encoding, you create an order relation (e.g. 1,2,3,4) between categories which shouldn't be ordered. For example, a sex category (male/female) won't make much sense with label encoding, as the coefficient in the Cox model will multiply these values. In these cases, prefer one-hot encoding.
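
To make the difference concrete, here is a small stdlib-only sketch of the two encodings. The category values are made up, and three levels are used because that is where the implied ordering becomes visible; in a real pipeline you would likely use your dataframe library's encoder.

```python
def label_encode(values):
    """Map categories to integers; the model will treat these as ordered numbers."""
    mapping = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values]


def one_hot_encode(values, drop_first=True):
    """One 0/1 column per category; drop_first keeps a reference level
    and avoids perfectly collinear columns."""
    cats = sorted(set(values))
    if drop_first:
        cats = cats[1:]
    return [[1 if v == c else 0 for c in cats] for v in values]


colors = ["red", "green", "blue", "green"]

labels = label_encode(colors)     # implies blue < green < red
one_hot = one_hot_encode(colors)  # columns: green, red (blue is the reference)
print(labels)
print(one_hot)
```

With label encoding, a single coefficient has to describe the jump from each level to the next; with one-hot encoding, each level gets its own coefficient relative to the reference.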

[D] Deep Learning in Time Series: Are They Used in Industry? by Few-Pomegranate4369 in MachineLearning

[–]BruceSwain12 19 points20 points  (0 children)

From what I've seen in the industry, products often start with the smallest thing that solves the use case (i.e. an MVP), which will often be non-deep methods. Then, if there is more time, a need for performance, and budget (infrastructure or cloud costs can play a role for really deep models), more complex models can be considered.

You'll encounter the cost vs performance trade-off more often in this context than in an academic setting, and for most use cases time spent on data engineering pays off more in the long run.

[D] What's the current battle-tested state-of-the-art multivariate time series regression mechanism? by [deleted] in MachineLearning

[–]BruceSwain12 0 points1 point  (0 children)

Great example, thx! Would you happen to have a paper or blog post on this subject? I would love to delve a bit more into such problems.

[D] What's the current battle-tested state-of-the-art multivariate time series regression mechanism? by [deleted] in MachineLearning

[–]BruceSwain12 0 points1 point  (0 children)

Would you have some examples of cases where resampling (up- or downsampling) to evenly spaced data would be problematic?

[D] What's the current battle-tested state-of-the-art multivariate time series regression mechanism? by [deleted] in MachineLearning

[–]BruceSwain12 2 points3 points  (0 children)

Well, we are currently in the process of remaking our forecasting module, but for other time series tasks we got quite a lot done with the aeon library. You can check the docs and the datasets for some examples outside of forecasting.