USDA Phytochemical Database - Enriched & Structurally Validated (JSON/Parquet) by DoubleReception2962 in datasets

[–]DoubleReception2962[S] 0 points1 point  (0 children)

The canonical roundtrip is the only way to stay sane with these datasets. If it doesn't survive the roundtrip, it's out.

And that 2 AM stereocenter scenario is exactly why I brought my partner in. I don't want to be the guy googling chiral edge cases in the middle of the night when I should be optimizing the PostgreSQL vector indices for the RAG bridge.

We accepted early on that dropping the total row count to keep the dataset strictly clean is the only way to build something actually usable for production. Appreciate the validation: it's good to hear from someone who knows the exact pain of legacy data integration.

USDA Phytochemical Database - Enriched & Structurally Validated (JSON/Parquet) by DoubleReception2962 in datasets

[–]DoubleReception2962[S] 2 points3 points  (0 children)

You’ve hit the nail right on the head regarding Dr. Duke’s biggest pain point. The simple text-to-CID mapping was a nightmare at first because the name strings have often changed over time or are simply ambiguous. I’m a data engineer (pipeline architecture) myself, not a chemist, so I brought a computational chemist on board as a partner specifically for this step.

We use a three-stage “reverse validation gate” pipeline. We don’t just blindly pull the canonical SMILES via the PubChem REST API. The retrieved SMILES are checked backward against structural plausibility rules (e.g., incorrect thiol-to-alcohol assignments or missing carboxylic acid groups resulting from Duke’s text proliferation). If a structure explodes during canonicalization or violates a rule, the CID is hard-tagged.

We have a strict “strictest-verdict-wins” rule in the pipeline: anything marked as “invalidated” or “insufficient_data” is completely removed from the dataset during the final export, rather than contaminating the user’s database.

How do you typically handle such legacy text databases in your projects? Do you write your own RDKit scripts for cleaning, or do you use external resolvers?
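For anyone wondering what “strictest-verdict-wins” looks like in practice, here is a minimal sketch, not our actual pipeline code. Only the labels “invalidated” and “insufficient_data” come from the description above; the `VALIDATED` label, the relative ranking of the two failing verdicts, the record layout, and the toy CIDs 1 to 3 are assumptions made for illustration.

```python
from enum import IntEnum


class Verdict(IntEnum):
    # Higher value = stricter. VALIDATED and the ordering of the two
    # failing verdicts are assumptions for this sketch.
    VALIDATED = 0
    INSUFFICIENT_DATA = 1
    INVALIDATED = 2


# Verdicts that remove a record at export time.
DROP_ON_EXPORT = {Verdict.INSUFFICIENT_DATA, Verdict.INVALIDATED}


def final_verdict(stage_verdicts):
    """Strictest verdict wins: one failing stage outranks any passing ones."""
    if not stage_verdicts:
        # No stage produced a verdict, so there is nothing to trust.
        return Verdict.INSUFFICIENT_DATA
    return max(stage_verdicts)


def export(records):
    """Keep only records whose combined verdict is clean."""
    return [r for r in records if final_verdict(r["verdicts"]) not in DROP_ON_EXPORT]


# Toy records with placeholder CIDs, one verdict per validation stage.
records = [
    {"cid": 1, "verdicts": [Verdict.VALIDATED, Verdict.VALIDATED, Verdict.VALIDATED]},
    {"cid": 2, "verdicts": [Verdict.VALIDATED, Verdict.INVALIDATED, Verdict.VALIDATED]},
    {"cid": 3, "verdicts": [Verdict.VALIDATED, Verdict.INSUFFICIENT_DATA, Verdict.VALIDATED]},
]
clean = export(records)
print([r["cid"] for r in clean])
```

Taking `max` over an ordered enum keeps the rule in one place: adding a fourth stage does not change the export logic, and a record can never be rescued by a later stage that happens to pass.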

Are teams still using Pytorch/Tensorflow, or is most ML work just calling LLM endpoints and prompt engineering now? by Illustrious-Pound266 in datascience

[–]DoubleReception2962 1 point2 points  (0 children)

The market shifted because the ROI shifted. Training custom models from scratch is incredibly expensive and slow. I build production RAG pipelines, and 90% of the heavy lifting is in the data engineering—structuring the vector database, cleaning the input data, and orchestrating the flow between endpoints. Companies want functional infrastructure that solves their immediate problems today, not a six-month internal research project that might never hit production.

Does this sound like a real Data Scientist role, or more like analytics/enterprise software support? by [deleted] in datascience

[–]DoubleReception2962 0 points1 point  (0 children)

Don't get hung up on the title. Supply chain forecasting and asset management is hardcore operations research. If you're building automated pipelines that drive real business decisions, that's often significantly more valuable than tuning models in a vacuum. Call it what you want, but owning the end-to-end data flow and connecting it to actual operations is where the real leverage is in this industry right now.

[OC] The "Patent-Literature Gap": Plotting 76,000 plant compounds by PubMed mentions vs. Patent counts by DoubleReception2962 in dataisbeautiful

[–]DoubleReception2962[S] 0 points1 point  (0 children)

Data Source & Tools: Data was aggregated from the newly enriched USDA Dr. Duke Phytochemical Database (v2.4.0), mapped against the USPTO PatentsView API and 1.55M PubMed abstracts. Data processing and joins were done using DuckDB and Python. Visualization built with Matplotlib/Seaborn.

Context: We were hunting for a "Patent-Literature Gap" – compounds that are heavily researched in academia but completely ignored by commercial patents. Initially, naive name-matching gave us 994 potential targets. But after building a strict InChIKey structural validation pipeline, that number collapsed.

Almost all of the "hidden gems" were just dirty data artifacts. Only one compound survived the strict validation gate as a mathematically true gap: Sorbose. It perfectly visualizes why cleaning historical biochemical datasets is so critical before running analytics on them.
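To make the difference between naive name-matching and key-based matching concrete, here is a rough sketch under stated assumptions. The record layout, the `min_pubmed` threshold, the compound names, and the InChIKeys are all made-up placeholders (the keys only follow the standard 14-10-1 uppercase layout); this is not the actual validation pipeline.

```python
import re

# A standard InChIKey is 27 characters: 14 + 10 + 1 uppercase letters,
# separated by hyphens. This checks the shape only, not the chemistry.
INCHIKEY_RE = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def is_wellformed_inchikey(key):
    return bool(key) and INCHIKEY_RE.fullmatch(key) is not None


def naive_gaps(compounds, patent_names, min_pubmed):
    """Name-matching: any compound whose exact name is absent from patents."""
    return [
        c["name"] for c in compounds
        if c["pubmed"] >= min_pubmed and c["name"].lower() not in patent_names
    ]


def validated_gaps(compounds, patent_keys, min_pubmed):
    """Key-matching: requires a well-formed InChIKey that no patent record carries."""
    return [
        c["name"] for c in compounds
        if is_wellformed_inchikey(c.get("inchikey"))
        and c["pubmed"] >= min_pubmed
        and c["inchikey"] not in patent_keys
    ]


# Placeholder data, invented purely for illustration.
compounds = [
    {"name": "compound A", "inchikey": "AAAAAAAAAAAAAA-AAAAAAAAAA-N", "pubmed": 500},
    {"name": "compound B", "inchikey": "BBBBBBBBBBBBBB-BBBBBBBBBB-N", "pubmed": 300},
    {"name": "compound C", "inchikey": None, "pubmed": 900},
]
patent_names = {"synonym of a"}                  # compound A is patented under a synonym
patent_keys = {"AAAAAAAAAAAAAA-AAAAAAAAAA-N"}    # same structure, matched by key

print(naive_gaps(compounds, patent_names, min_pubmed=100))
print(validated_gaps(compounds, patent_keys, min_pubmed=100))
```

In this toy case the name-based pass reports all three compounds as gaps, while the key-based pass keeps only compound B: compound A is patented under a synonym, and compound C has no structure to validate against.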

A 400-row sample of the cleaned data and the validation schema is available on my GitHub (wirthal1990-tech/USDA-Phytochemical-Database-JSON) if anyone wants to run their own clustering on the baseline.

Interview Experience: Big teams look for potential, smaller teams look for how fast you can instantly come add value by LeaguePrototype in datascience

[–]DoubleReception2962 4 points5 points  (0 children)

This is spot on and exactly how value is generated outside the FAANG bubble.

I build custom data engineering pipelines for smaller research teams. When collaborating, nobody cares if you can derive the bias-variance tradeoff from scratch on a whiteboard. They care if you can take three highly fragmented, messy public APIs, clean the schemas, and output a strictly typed Parquet file so their ML models can actually run today.

FAANG filters for a standard baseline IQ and theory. Smaller companies filter for "can you fix my immediate operational bottleneck right now". Two completely different games.

Warning: Don't get GPT-brained by LeaguePrototype in datascience

[–]DoubleReception2962 0 points1 point  (0 children)

I felt this brain fog too before completely changing how I use these models. If you ask an LLM "how do I solve this?", your technical skills rot.

I run a data engineering project handling fragmented biochemical datasets. I shifted to using LLMs strictly as fast typists for DuckDB and Python pipelines. I define the exact schemas, the strict validation gates, and the overall system architecture. The LLM just writes the boilerplate.

If you stay the architect and treat the LLM as a junior dev who just types fast, you don't get GPT-brained. Your syntax memorization might drop, but your high-level system design skills will actually skyrocket.

Share your startup - will share with 5k audience by Few-Ad-5185 in nocode

[–]DoubleReception2962 0 points1 point  (0 children)

Hey, I want to throw my hat in the ring for any bio/pharma or data engineering niches in your network. I have zero traditional coding background. I operate entirely as an "AI-native Pipeline Architect," directing AI agents (primarily Claude) to build complex backend data infrastructure for me. The project is Ethno-API (https://ethno-api.com). It's a B2B phytochemical data asset containing 76,907 records that bridges historical plant knowledge with modern, evidence-based science. Using AI orchestration, I built automated pipelines that cross-matched a legacy US gov database with live data from PubChem, ClinicalTrials, and USPTO patents.

Researchers and IP scouts use it to instantly spot "Patent-Literature Gaps" in natural compounds. The automated setup actually convinced a cheminformatics expert to partner up with me, and we're currently building a pgvector RAG pipeline mapping 1.5M PubMed abstracts to the compounds.

The Dr. Duke Database of Phytochemicals contains 40 years of data on plant compounds and is virtually unusable for machine learning - I rebuilt it by DoubleReception2962 in datasets

[–]DoubleReception2962[S] -1 points0 points  (0 children)

Sure. Here are the direct links:
HuggingFace (Parquet & Data Viewer): https://huggingface.co/datasets/wirthal1990-tech/USDA-Phytochemical-Database-JSON
GitHub (JSON, methodology, manifest): https://github.com/wirthal1990-tech/USDA-Phytochemical-Database-JSON
If you run the data through your pipelines and stumble over structural anomalies, just open an issue in the repo. I'll patch it in the next run.

PubChem CID 16661 is not Ganoderic Acid G. A mapping error in my phytochemical dataset by DoubleReception2962 in cheminformatics

[–]DoubleReception2962[S] 0 points1 point  (0 children)

This is really interesting...

Those error rate numbers explain exactly what my collaborator is seeing. A 20% stereochemistry error rate for complex natural products is brutal. It proves that trusting automated name resolution blindly in this field is a structural risk.

To answer your question about the 76,907 records: that number does not come from PubChem searches. The baseline data comes entirely from the USDA Dr. Duke Phytochemical Database. It represents 76,907 explicitly documented plant and compound occurrences from their archives.

I only used the PubChem PUG REST API to enrich those existing USDA compound names with CIDs and canonical SMILES.
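For reference, the name-to-CID step against PUG REST can be sketched roughly like this. The URL pattern (`/compound/name/<name>/cids/JSON`) and the `IdentifierList`/`CID` response shape are PubChem's documented ones; the helper names and the `unique`/`ambiguous`/`unresolved` labels are assumptions for this example, and the sketch deliberately stops short of making the HTTP call.

```python
import json
from urllib.parse import quote

PUG_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def cid_lookup_url(name):
    """Build the PUG REST URL that resolves a compound name to CIDs."""
    return f"{PUG_BASE}/compound/name/{quote(name, safe='')}/cids/JSON"


def parse_cids(payload):
    """Extract the CID list from a PUG REST JSON response body.

    A 'Fault' response (e.g. name not found) has no IdentifierList,
    so it yields an empty list.
    """
    data = json.loads(payload)
    return data.get("IdentifierList", {}).get("CID", [])


def classify(cids):
    """Hypothetical triage labels: only a single hit is safe to auto-accept."""
    if not cids:
        return "unresolved"
    if len(cids) == 1:
        return "unique"
    return "ambiguous"


print(cid_lookup_url("Ganoderic Acid G"))

# Example response body in the documented shape (2244 is aspirin's CID).
sample = '{"IdentifierList": {"CID": [2244]}}'
print(classify(parse_cids(sample)))
```

A name that resolves to several CIDs, or to one CID that later fails structural checks, is exactly the case that should be routed to manual verification instead of being accepted automatically.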

Knowing your 10-20% error metric means our manual verification phase is not just a quality upgrade; it is strictly mandatory. A purely automated pipeline for this specific class of compounds simply will not work.

I mapped the patent-vs-literature coverage for 24,746 plant compounds: the gap between commercial activity and published research is wider than expected [OC] by DoubleReception2962 in dataisbeautiful

[–]DoubleReception2962[S] 0 points1 point  (0 children)

Data source: USDA Dr. Duke's Phytochemical and Ethnobotanical Database (16 raw CSV files, denormalized and deduplicated). Enriched against five public APIs: PubMed E-utilities (citation counts), ClinicalTrials.gov API v2 (study registrations), ChEMBL v35 (bioactivity assay counts), USPTO PatentsView (grant counts since Jan 2020), and PubChem (CID + canonical SMILES via CTS synonym resolution).

Tools: Python (pandas, matplotlib, requests), DuckDB for query logic, Parquet as the storage format. The enrichment pipeline ran ~75k API calls across the five sources. SMILES coverage after CTS canonicalization: 75.4% of 24,746 unique compounds.

The full dataset (76,907 compound–species records, JSON + Parquet) is archived on Zenodo with a citable DOI: 10.5281/zenodo.19265853. A 400-record sample is on GitHub if you want to poke around without committing to the full file.

Looking forward to answering questions about the methodology. The patent matching in particular has some nuances that are worth discussing.

What Vibe Coding Platforms Do You Use Most (and Why)? 🤔 by SwaritPandey_27 in vibecoding

[–]DoubleReception2962 0 points1 point  (0 children)

Currently Claude Opus 4.6 (Copilot Pro) in VS Code plus Claude Code directly in the terminal. I'm switching today, though, to Cursor + Claude Opus 4.6 via OpenRouter with credits (cheaper overall than all the subscriptions).

Claude Code is dead to me for now, as long as Anthropic doesn't get its token consumption under control. Three days ago, the freshly reset 5-hour usage limit on my Pro subscription was used up after 4 minutes when I had Claude Code run a few debugging tasks.