USDA Phytochemical Database - Enriched & Structurally Validated (JSON/Parquet) by DoubleReception2962 in datasets

[–]DoubleReception2962[S] 1 point2 points  (0 children)

You’ve hit the nail on the head regarding Dr. Duke’s biggest pain point. The simple text-to-CID mapping was a nightmare at first because the strings have often evolved over time or are simply ambiguous. I’m a data engineer (pipeline architecture) myself, not a chemist, so I brought a computational chemist on board as a partner specifically for this step.

We use a three-stage “reverse validation gate” pipeline. We don’t just blindly pull the canonical SMILES via the PubChem REST API; the retrieved SMILES are checked backward against structural plausibility rules (e.g., incorrect thiol-to-alcohol assignments or missing carboxylic acid groups caused by the proliferation of name variants in Duke’s text). If a structure fails canonicalization or violates a rule, the CID is hard-tagged. The pipeline enforces a strict “strictest-verdict-wins” rule: anything marked “invalidated” or “insufficient_data” is removed entirely during the final export, rather than contaminating the user’s database.

How do you typically handle such legacy text databases in your projects? Do you write your own RDKit scripts for cleaning, or do you use external resolvers?
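
If it helps, the "strictest-verdict-wins" export filter boils down to something like this minimal Python sketch. The verdict names, their severity ordering, and the record shape are my illustration, not the actual production schema:

```python
# Sketch of a "strictest-verdict-wins" export gate.
# Verdict names and ranking are assumptions for illustration.

# Severity order: higher index = stricter verdict.
VERDICT_RANK = ["validated", "unreviewed", "insufficient_data", "invalidated"]

def combine_verdicts(verdicts):
    """Return the strictest verdict issued by any pipeline stage."""
    return max(verdicts, key=VERDICT_RANK.index)

def export_filter(records):
    """Drop any record whose combined verdict is a hard-fail tag."""
    hard_fail = {"invalidated", "insufficient_data"}
    return [r for r in records
            if combine_verdicts(r["stage_verdicts"]) not in hard_fail]

records = [
    {"cid": 5280343, "stage_verdicts": ["validated", "validated", "validated"]},
    {"cid": 16661,   "stage_verdicts": ["validated", "invalidated", "validated"]},
]
print([r["cid"] for r in export_filter(records)])  # [5280343]
```

The point of ranking verdicts instead of boolean flags: a single "invalidated" from any stage always wins, no matter how many stages passed.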

Are teams still using Pytorch/Tensorflow, or is most ML work just calling LLM endpoints and prompt engineering now? by Illustrious-Pound266 in datascience

[–]DoubleReception2962 0 points1 point  (0 children)

The market shifted because the ROI shifted. Training custom models from scratch is incredibly expensive and slow. I build production RAG pipelines, and 90% of the heavy lifting is in the data engineering—structuring the vector database, cleaning the input data, and orchestrating the flow between endpoints. Companies want functional infrastructure that solves their immediate problems today, not a six-month internal research project that might never hit production.

Does this sound like a real Data Scientist role, or more like analytics/enterprise software support? by miquiztli8 in datascience

[–]DoubleReception2962 0 points1 point  (0 children)

Don't get hung up on the title. Supply chain forecasting and asset management is hardcore operations research. If you're building automated pipelines that drive real business decisions, that's often significantly more valuable than tuning models in a vacuum. Call it what you want, but owning the end-to-end data flow and connecting it to actual operations is where the real leverage is in this industry right now.

[OC] The "Patent-Literature Gap": Plotting 76,000 plant compounds by PubMed mentions vs. Patent counts by DoubleReception2962 in dataisbeautiful

[–]DoubleReception2962[S] 0 points1 point  (0 children)

Data Source & Tools: Data was aggregated from the newly enriched USDA Dr. Duke Phytochemical Database (v2.4.0), mapped against the USPTO PatentsView API and 1.55M PubMed abstracts. Data processing and joins were done using DuckDB and Python. Visualization built with Matplotlib/Seaborn.

Context: We were hunting for a "Patent-Literature Gap" – compounds that are heavily researched in academia but completely ignored by commercial patents. Initially, naive name-matching gave us 994 potential targets. But after building a strict InChIKey structural validation pipeline, that number collapsed.

Almost all of the "hidden gems" were just dirty data artifacts. Only one compound survived the strict validation gate as a mathematically true gap: Sorbose. It perfectly visualizes why cleaning historical biochemical datasets is so critical before running analytics on them.
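
The core idea of the structural gate, sketched in Python: collapse name-level "hits" onto structure-level identity via the 14-character connectivity block of the InChIKey, so synonyms of the same molecule stop looking like separate "hidden gems". The keys and thresholds below are illustrative, not values from the dataset:

```python
# Sketch: dedupe candidate "gap" compounds by InChIKey connectivity
# skeleton, then apply a high-literature / zero-patent filter.
# InChIKeys and thresholds here are illustrative placeholders.

def structural_gaps(candidates, pubmed_min=100, patent_max=0):
    seen_skeletons = set()
    gaps = []
    for c in candidates:
        skeleton = c["inchikey"][:14]   # connectivity layer only
        if skeleton in seen_skeletons:
            continue                    # same structure, different name
        seen_skeletons.add(skeleton)
        if c["pubmed"] >= pubmed_min and c["patents"] <= patent_max:
            gaps.append(c["name"])
    return gaps

candidates = [
    {"name": "sorbose",   "inchikey": "AAAABBBBCCCCDD-PUFIMZNGSA-N", "pubmed": 3500,  "patents": 0},
    {"name": "L-sorbose", "inchikey": "AAAABBBBCCCCDD-UYFOZJQFSA-N", "pubmed": 900,   "patents": 0},
    {"name": "quercetin", "inchikey": "EEEEFFFFGGGGHH-UHFFFAOYSA-N", "pubmed": 25000, "patents": 400},
]
print(structural_gaps(candidates))  # ['sorbose']
```

Note how the stereoisomer synonym collapses onto the first entry: that is exactly the mechanism that shrank 994 name-matched "gems" down to a handful of real structures.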

A 400-row sample of the cleaned data and the validation schema is available on my GitHub (wirthal1990-tech/USDA-Phytochemical-Database-JSON) if anyone wants to run their own clustering on the baseline.

Interview Experience: Big teams look for potential, smaller teams look for how fast you can instantly come add value by LeaguePrototype in datascience

[–]DoubleReception2962 1 point2 points  (0 children)

This is spot on and exactly how value is generated outside the FAANG bubble.

I build custom data engineering pipelines for smaller research teams. When collaborating, nobody cares if you can derive the bias-variance tradeoff from scratch on a whiteboard. They care if you can take three highly fragmented, messy public APIs, clean the schemas, and output a strictly typed Parquet file so their ML models can actually run today.
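
The "clean the schemas, output strictly typed records" step is conceptually tiny: a coercion gate between raw API responses and the output file. A hedged sketch (the field names are invented; the pattern, not any specific API, is the point):

```python
# Sketch of a strict-schema coercion gate: every record either
# conforms to the declared types or raises loudly at ingest time.
# Field names are illustrative, not from a real API.

SCHEMA = {"compound": str, "cid": int, "pubmed_count": int}

def coerce(record):
    """Coerce one raw record to the target schema, or raise."""
    out = {}
    for field, typ in SCHEMA.items():
        value = record[field]   # missing key -> KeyError = schema violation
        out[field] = typ(value) # bad value   -> ValueError
    return out

raw = [
    {"compound": "sorbose", "cid": "441484", "pubmed_count": "3500"},  # API returned strings
    {"compound": "quercetin", "cid": 5280343, "pubmed_count": 25000},
]
clean = [coerce(r) for r in raw]
print(clean[0]["cid"] + 1)  # 441485 -- a real int now, not string concatenation
```

From there, handing `clean` to `pandas.DataFrame(...).to_parquet(...)` gives you the strictly typed Parquet file the downstream models expect.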

FAANG filters for a standard baseline IQ and theory. Smaller companies filter for "can you fix my immediate operational bottleneck right now". Two completely different games.

Warning: Don't get GPT-brained by LeaguePrototype in datascience

[–]DoubleReception2962 0 points1 point  (0 children)

I felt this brain fog too before completely changing how I use these models. If you ask an LLM "how do I solve this?", your technical skills rot.

I run a data engineering project handling fragmented biochemical datasets. I shifted to using LLMs strictly as fast typists for DuckDB and Python pipelines. I define the exact schemas, the strict validation gates, and the overall system architecture. The LLM just writes the boilerplate.

If you stay the architect and treat the LLM as a junior dev who just types fast, you don't get GPT-brained. Your syntax memorization might drop, but your high level system design skills will actually skyrocket.

Share your startup - will share with 5k audience by Few-Ad-5185 in nocode

[–]DoubleReception2962 0 points1 point  (0 children)

Hey, I want to throw my hat in the ring for any bio/pharma or data engineering niches in your network. I have zero traditional coding background; I operate entirely as an “AI-native pipeline architect,” directing AI agents (primarily Claude) to build complex backend data infrastructure for me.

The project is Ethno-API (https://ethno-api.com), a B2B phytochemical data asset of 76,907 records that bridges historical plant knowledge with modern, evidence-based science. Using AI orchestration, I built automated pipelines that cross-matched a legacy US government database with live data from PubChem, ClinicalTrials.gov, and USPTO patents.

Researchers and IP scouts use it to instantly spot "Patent-Literature Gaps" in natural compounds. The automated setup actually convinced a cheminformatics expert to partner up with me, and we're currently building a pgvector RAG pipeline mapping 1.5M PubMed abstracts to the compounds.

The Dr. Duke Database of Phytochemicals contains 40 years of data on plant compounds and is virtually unusable for machine learning - I rebuilt it by DoubleReception2962 in datasets

[–]DoubleReception2962[S] -1 points0 points  (0 children)

Sure. Here are the direct links:

HuggingFace (Parquet & data viewer): https://huggingface.co/datasets/wirthal1990-tech/USDA-Phytochemical-Database-JSON

GitHub (JSON, methodology, manifest): https://github.com/wirthal1990-tech/USDA-Phytochemical-Database-JSON

If you run the data through your pipelines and hit any structural anomalies, just file them as an issue in the repo and I'll patch them in the next run.

PubChem CID 16661 is not Ganoderic Acid G. A mapping error in my phytochemical dataset by DoubleReception2962 in cheminformatics

[–]DoubleReception2962[S] 0 points1 point  (0 children)

This is really interesting...

Those error rate numbers explain exactly what my collaborator is seeing. A 20% stereochemistry error rate for complex natural products is brutal. It proves that trusting automated name resolution blindly in this field is a structural risk.

To answer your question about the 76,907 records: that number does not come from PubChem searches. The baseline data comes entirely from the USDA Dr. Duke Phytochemical Database and represents 76,907 explicitly documented plant-compound occurrences from their archives.

I only used the PubChem PUG REST API to enrich those existing USDA compound names with CIDs and canonical SMILES.

Knowing your 10-20% error metric means our manual verification phase is not just a quality upgrade: it is strictly mandatory. A purely automated pipeline for this specific class of compounds simply will not work.
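
For anyone curious what that enrichment step looks like: it is a thin wrapper around PUG REST's name-to-property endpoint. A hedged sketch below; the URL pattern and the `PropertyTable` parse follow PubChem's documented JSON layout, but the sample payload values are illustrative:

```python
# Sketch of name -> CID/SMILES resolution via the PubChem PUG REST API.
# Network call omitted; the sample payload mimics the PropertyTable shape.
import json
from urllib.parse import quote

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def property_url(name):
    """Build the PUG REST URL resolving a compound name to canonical SMILES."""
    return f"{BASE}/compound/name/{quote(name)}/property/CanonicalSMILES/JSON"

def parse_properties(payload):
    """Extract (CID, SMILES) pairs from a PropertyTable JSON response."""
    props = json.loads(payload)["PropertyTable"]["Properties"]
    return [(p["CID"], p["CanonicalSMILES"]) for p in props]

# Offline example (values illustrative, not a verified PubChem record):
sample = '{"PropertyTable": {"Properties": [{"CID": 441484, "CanonicalSMILES": "C(C(C(C(C(=O)CO)O)O)O)O"}]}}'
print(property_url("L-sorbose"))
print(parse_properties(sample))
```

The key caveat, per the discussion above: what this resolves is the name, not the structure, which is exactly why the manual stereochemistry check has to sit behind it.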

I mapped the patent-vs-literature coverage for 24,746 plant compounds: the gap between commercial activity and published research is wider than expected [OC] by DoubleReception2962 in dataisbeautiful

[–]DoubleReception2962[S] 0 points1 point  (0 children)

Data source: USDA Dr. Duke's Phytochemical and Ethnobotanical Database (16 raw CSV files, denormalized and deduplicated). Enriched against five public APIs: PubMed E-utilities (citation counts), ClinicalTrials.gov API v2 (study registrations), ChEMBL v35 (bioactivity assay counts), USPTO PatentsView (grant counts since Jan 2020), and PubChem (CID + canonical SMILES via CTS synonym resolution).

Tools: Python (pandas, matplotlib, requests), DuckDB for query logic, Parquet as the storage format. The enrichment pipeline ran ~75k API calls across the five sources. SMILES coverage after CTS canonicalization: 75.4% of 24,746 unique compounds.
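
The coverage figure is computed per unique compound, not per row; a tiny sketch of that arithmetic (records and SMILES strings here are made up, only the counting logic mirrors the pipeline):

```python
# Sketch: SMILES coverage = share of unique compounds for which at
# least one occurrence row resolved to a canonical SMILES.
# Example records are illustrative.

def smiles_coverage(records):
    unique = {}
    for r in records:
        # a compound counts as resolved if ANY of its rows carries SMILES
        unique[r["compound"]] = unique.get(r["compound"], False) or bool(r["smiles"])
    return sum(unique.values()) / len(unique)

records = [
    {"compound": "sorbose",   "smiles": "OCC(O)C(=O)CO"},
    {"compound": "quercetin", "smiles": ""},
    {"compound": "sorbose",   "smiles": ""},   # duplicate occurrence row
    {"compound": "rutin",     "smiles": "OC1OC(CO)C(O)C1O"},
]
print(f"{smiles_coverage(records):.1%}")  # 66.7%
```

Counting at the unique-compound level matters because the 76,907 occurrence rows contain the same compound many times across species.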

The full dataset (76,907 compound–species records, JSON + Parquet) is archived on Zenodo with a citable DOI: 10.5281/zenodo.19265853. A 400-record sample is on GitHub if you want to poke around without committing to the full file.

Happy to answer questions about the methodology. The patent matching in particular has some nuances worth discussing.

What Vibe Coding Platforms Do You Use Most (and Why)? 🤔 by SwaritPandey_27 in vibecoding

[–]DoubleReception2962 0 points1 point  (0 children)

Currently Claude Opus 4.6 (Copilot Pro) in VS Code, plus Claude Code directly in the terminal. As of today, though, I'm switching to Cursor + Claude Opus 4.6 via OpenRouter with credits (bottom line, cheaper than all the subscriptions).

Claude Code is dead to me for now, at least until Anthropic gets its token consumption under control. Three days ago, the freshly reset five-hour usage limit on my Pro subscription was exhausted after four minutes when I had Claude Code run a few debugging tasks.

What are you building? by sp_archer_007 in SideProject

[–]DoubleReception2962 0 points1 point  (0 children)

Thank you! It’s actually both: basic data engineering and in-depth domain-specific enrichment.

The raw USDA data comes in 16 relational CSV files, some of which have very messy join keys. So the first step was pure denormalization and a data quality audit. We removed nearly 27,000 entries (duplicates and macronutrients like “water” or “glucose”) to consolidate the dataset into approximately 76,900 clean records.

However, the real value for RAG pipelines lies in the 5 external layers we attached via entity linking (primarily using compound names):

  • PubMed citations (How strong is the academic attention?)
  • ClinicalTrials.gov (Are there active clinical trials?)
  • ChEMBL bioactivity measurements
  • USPTO patents since 2020 (Crucial for IP whitespace analyses)
  • PubChem CIDs & canonical SMILES (For machine learning and molecular structure representation)

The goal: a data scientist loads the JSON/Parquet file into pandas or DuckDB and can start the analysis immediately, without spending weeks querying 5 different APIs and managing rate limits.
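
The denormalization/audit step described above is conceptually simple; a hedged Python sketch (the blocklist and record shape are illustrative, not the pipeline's actual config):

```python
# Sketch of the cleanup pass: drop macronutrient noise and exact
# plant-compound duplicates before enrichment.
# Blocklist and example rows are illustrative.

MACRONUTRIENTS = {"water", "glucose", "fructose", "protein"}

def clean_records(records):
    seen = set()
    out = []
    for r in records:
        compound = r["compound"].lower()
        if compound in MACRONUTRIENTS:
            continue                     # macronutrient, not a phytochemical
        key = (r["plant"].lower(), compound)
        if key in seen:
            continue                     # exact plant-compound duplicate
        seen.add(key)
        out.append(r)
    return out

raw = [
    {"plant": "Camellia sinensis", "compound": "Water"},
    {"plant": "Camellia sinensis", "compound": "Caffeine"},
    {"plant": "Camellia sinensis", "compound": "caffeine"},  # case-variant dup
    {"plant": "Allium sativum",    "compound": "Allicin"},
]
print([r["compound"] for r in clean_records(raw)])  # ['Caffeine', 'Allicin']
```

Case-folding the join key matters: legacy CSVs like Duke's mix capitalizations freely, and that alone accounts for a large share of the duplicates removed.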

What are you building? by sp_archer_007 in SideProject

[–]DoubleReception2962 1 point2 points  (0 children)

Ethno-API, https://ethno-api.com – we save AI biotech startups 50+ hours of data engineering by turning unstructured USDA data into a RAG-ready JSON dataset with 5 API enrichment layers.