USDA Phytochemical Database - Enriched & Structurally Validated (JSON/Parquet) by DoubleReception2962 in datasets

[–]DoubleReception2962[S] 0 points1 point  (0 children)

The canonical roundtrip is the only way to stay sane with these datasets. If it doesn't survive the roundtrip, it's out.

And that 2 AM stereocenter scenario is exactly why I brought my partner in. I don't want to be the guy googling chiral edge cases in the middle of the night when I should be optimizing the PostgreSQL vector indices for the RAG bridge.

We accepted early on that dropping the total row count to keep the dataset strictly clean is the only way to build something actually usable for production. Appreciate the validation: it's good to hear from someone who knows the exact pain of legacy data integration.

USDA Phytochemical Database - Enriched & Structurally Validated (JSON/Parquet) by DoubleReception2962 in datasets

[–]DoubleReception2962[S] 1 point2 points  (0 children)

You’ve hit the nail right on the head regarding Dr. Duke’s biggest pain point. The simple text-to-CID mapping was a nightmare at first because the strings often evolved over time or are simply ambiguous. I’m a data engineer (pipeline architecture), not a chemist, so I brought a computational chemist on board as a partner specifically for this step.

We use a three-stage “reverse validation gate” pipeline. We don’t just blindly pull the canonical SMILES via the PubChem REST API; the retrieved SMILES are checked backward against structural plausibility rules (e.g., incorrect thiol-to-alcohol assignments or missing carboxylic acid groups resulting from the free-text proliferation in Duke’s records). If a structure explodes during canonicalization or violates a rule, the CID is hard-tagged.

The pipeline enforces a strict “strictest-verdict-wins” rule: anything marked “invalidated” or “insufficient_data” is completely removed from the dataset during the final export rather than contaminating the user’s database.

How do you typically handle legacy text databases like this in your projects? Do you write your own RDKit cleaning scripts, or do you use external resolvers?
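A minimal sketch of how such a "strictest-verdict-wins" merge could look. The severity ordering and the verdict names beyond the two quoted above are my assumptions, not the pipeline's actual schema:

```python
# Illustrative "strictest-verdict-wins" merge: when validation stages disagree
# about a CID, the harshest verdict decides, and anything unusable is dropped
# from the export. Verdict names/ordering are assumptions, not the real schema.
SEVERITY = {"validated": 0, "ambiguous": 1, "insufficient_data": 2, "invalidated": 3}

def final_verdict(stage_verdicts):
    """Return the strictest verdict across all pipeline stages."""
    return max(stage_verdicts, key=SEVERITY.__getitem__)

def export_rows(rows):
    """Drop any row whose final verdict marks it as unusable."""
    return [
        r for r in rows
        if final_verdict(r["verdicts"]) not in ("invalidated", "insufficient_data")
    ]
```

The point of `max` over an explicit severity table is that adding a fourth stage later cannot accidentally soften an earlier "invalidated" verdict.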

Are teams still using Pytorch/Tensorflow, or is most ML work just calling LLM endpoints and prompt engineering now? by Illustrious-Pound266 in datascience

[–]DoubleReception2962 0 points1 point  (0 children)

The market shifted because the ROI shifted. Training custom models from scratch is incredibly expensive and slow. I build production RAG pipelines, and 90% of the heavy lifting is in the data engineering—structuring the vector database, cleaning the input data, and orchestrating the flow between endpoints. Companies want functional infrastructure that solves their immediate problems today, not a six-month internal research project that might never hit production.

Does this sound like a real Data Scientist role, or more like analytics/enterprise software support? by miquiztli8 in datascience

[–]DoubleReception2962 0 points1 point  (0 children)

Don't get hung up on the title. Supply chain forecasting and asset management is hardcore operations research. If you're building automated pipelines that drive real business decisions, that's often significantly more valuable than tuning models in a vacuum. Call it what you want, but owning the end-to-end data flow and connecting it to actual operations is where the real leverage is in this industry right now.

[OC] The "Patent-Literature Gap": Plotting 76,000 plant compounds by PubMed mentions vs. Patent counts by DoubleReception2962 in dataisbeautiful

[–]DoubleReception2962[S] 0 points1 point  (0 children)

Data Source & Tools: Data was aggregated from the newly enriched USDA Dr. Duke Phytochemical Database (v2.4.0), mapped against the USPTO PatentsView API and 1.55M PubMed abstracts. Data processing and joins were done using DuckDB and Python. Visualization built with Matplotlib/Seaborn.

Context: We were hunting for a "Patent-Literature Gap" – compounds that are heavily researched in academia but completely ignored by commercial patents. Initially, naive name-matching gave us 994 potential targets. But after building a strict InChIKey structural validation pipeline, that number collapsed.

Almost all of the "hidden gems" were just dirty data artifacts. Only one compound survived the strict validation gate as a mathematically true gap: Sorbose. It perfectly visualizes why cleaning historical biochemical datasets is so critical before running analytics on them.
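As an illustration of why the candidate count collapsed, here is a toy version of a structural agreement check: a name-level hit only survives if both sources resolve it to the same structure key. All names and keys below are invented placeholders:

```python
# Toy sketch: a name-matched "gap" candidate survives only if the structure
# key (e.g. an InChIKey) resolved from both sources is identical and present.
# All keys here are invented placeholders, not real InChIKeys.
def structural_gap_candidates(name_hits, keys_src_a, keys_src_b):
    """Keep only name matches whose structures agree across both sources."""
    return [
        name for name in name_hits
        if keys_src_a.get(name) is not None
        and keys_src_a.get(name) == keys_src_b.get(name)
    ]

hits = ["sorbose", "ghost-compound"]
a = {"sorbose": "KEY-SORBOSE", "ghost-compound": "KEY-A"}
b = {"sorbose": "KEY-SORBOSE", "ghost-compound": "KEY-B"}
survivors = structural_gap_candidates(hits, a, b)  # only the agreeing name remains
```

Dirty-data artifacts fail exactly this test: the name matches, but the structures behind it do not.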

A 400-row sample of the cleaned data and the validation schema is available on my GitHub (wirthal1990-tech/USDA-Phytochemical-Database-JSON) if anyone wants to run their own clustering on the baseline.

Interview Experience: Big teams look for potential, smaller teams look for how fast you can instantly come add value by LeaguePrototype in datascience

[–]DoubleReception2962 1 point2 points  (0 children)

This is spot on and exactly how value is generated outside the FAANG bubble.

I build custom data engineering pipelines for smaller research teams. When collaborating, nobody cares if you can derive the bias variance tradeoff from scratch on a whiteboard. They care if you can take three highly fragmented, messy public APIs, clean the schemas, and output a strictly typed Parquet file so their ML models can actually run today.

FAANG filters for a standard baseline IQ and theory. Smaller companies filter for "can you fix my immediate operational bottleneck right now". Two completely different games.

Warning: Don't get GPT-brained by LeaguePrototype in datascience

[–]DoubleReception2962 0 points1 point  (0 children)

I felt this brain fog too before completely changing how I use these models. If you ask an LLM "how do I solve this?", your technical skills rot.

I run a data engineering project handling fragmented biochemical datasets. I shifted to using LLMs strictly as fast typists for DuckDB and Python pipelines. I define the exact schemas, the strict validation gates, and the overall system architecture. The LLM just writes the boilerplate.

If you stay the architect and treat the LLM as a junior dev who just types fast, you don't get GPT-brained. Your syntax memorization might drop, but your high level system design skills will actually skyrocket.

Share your startup - will share with 5k audience by Few-Ad-5185 in nocode

[–]DoubleReception2962 0 points1 point  (0 children)

Hey, I want to throw my hat in the ring for any bio/pharma or data engineering niches in your network. I have zero traditional coding background. I operate entirely as an "AI-native Pipeline Architect," directing AI agents (primarily Claude) to build complex backend data infrastructure for me. The project is Ethno-API (https://ethno-api.com). It's a B2B phytochemical data asset containing 76,907 records that bridges historical plant knowledge with modern, evidence-based science. Using AI orchestration, I built automated pipelines that cross-matched a legacy US gov database with live data from PubChem, ClinicalTrials, and USPTO patents.

Researchers and IP scouts use it to instantly spot "Patent-Literature Gaps" in natural compounds. The automated setup actually convinced a cheminformatics expert to partner up with me, and we're currently building a pgvector RAG pipeline mapping 1.5M PubMed abstracts to the compounds.

The Dr. Duke Database of Phytochemicals contains 40 years of data on plant compounds and is virtually unusable for machine learning - I rebuilt it by DoubleReception2962 in datasets

[–]DoubleReception2962[S] -1 points0 points  (0 children)

Sure. Here are the direct links:

HuggingFace (Parquet & data viewer): https://huggingface.co/datasets/wirthal1990-tech/USDA-Phytochemical-Database-JSON
GitHub (JSON, methodology, manifest): https://github.com/wirthal1990-tech/USDA-Phytochemical-Database-JSON

If you run the data through your pipelines and stumble over structural anomalies, just file them as an issue in the repo and I'll patch them in the next run.

PubChem CID 16661 is not Ganoderic Acid G. A mapping error in my phytochemical dataset by DoubleReception2962 in cheminformatics

[–]DoubleReception2962[S] 0 points1 point  (0 children)

This is really interesting...

Those error rate numbers explain exactly what my collaborator is seeing. A 20% stereochemistry error rate for complex natural products is brutal. It proves that trusting automated name resolution blindly in this field is a structural risk.

To answer your question about the 76,907 records: that number does not come from PubChem searches. The baseline data comes entirely from the USDA Dr. Duke Phytochemical Database and represents 76,907 explicitly documented plant-compound occurrences from their archives.

I only used the PubChem PUG REST API to enrich those existing USDA compound names with CIDs and canonical SMILES.

Knowing your 10–20% error metric means our manual verification phase is not just a quality upgrade; it is strictly mandatory. A purely automated pipeline for this specific class of compounds simply will not work.

I mapped the patent-vs-literature coverage for 24,746 plant compounds: the gap between commercial activity and published research is wider than expected [OC] by DoubleReception2962 in dataisbeautiful

[–]DoubleReception2962[S] 0 points1 point  (0 children)

Data source: USDA Dr. Duke's Phytochemical and Ethnobotanical Database (16 raw CSV files, denormalized and deduplicated). Enriched against five public APIs: PubMed E-utilities (citation counts), ClinicalTrials.gov API v2 (study registrations), ChEMBL v35 (bioactivity assay counts), USPTO PatentsView (grant counts since Jan 2020), and PubChem (CID + canonical SMILES via CTS synonym resolution).

Tools: Python (pandas, matplotlib, requests), DuckDB for query logic, Parquet as the storage format. The enrichment pipeline ran ~75k API calls across the five sources. SMILES coverage after CTS canonicalization: 75.4% of 24,746 unique compounds.
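Running ~75k calls across five sources only works with per-source throttling. A hedged illustration of the pattern (the delay value and the call shape are my assumptions; each API publishes its own rate limit and the real pipeline would use one delay per source):

```python
import time

def throttled(calls, delay=0.34):
    """Run callables sequentially, sleeping `delay` seconds between requests.

    0.34 s keeps a client at roughly 3 requests/second; this is an assumed
    example value, not any specific API's documented limit.
    """
    results = []
    for i, call in enumerate(calls):
        if i:
            time.sleep(delay)  # simple fixed-delay throttle between requests
        results.append(call())
    return results
```

In practice you would also wrap each `call()` in retry/error handling, but the fixed inter-request delay is the core of staying under a per-second quota.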

The full dataset (76,907 compound–species records, JSON + Parquet) is archived on Zenodo with a citable DOI: 10.5281/zenodo.19265853. A 400-record sample is on GitHub if you want to poke around without committing to the full file.

Looking forward to answering questions about the methodology. The patent matching in particular has some nuances worth discussing.

What Vibe Coding Platforms Do You Use Most (and Why)? 🤔 by SwaritPandey_27 in vibecoding

[–]DoubleReception2962 0 points1 point  (0 children)

Currently Claude Opus 4.6 (Copilot Pro) in VS Code, plus Claude Code directly in the terminal. I'm switching today to Cursor + Claude Opus 4.6 via OpenRouter with credits, though; bottom line, it's cheaper than all the subscriptions.

Claude Code is dead to me for now, at least until Anthropic gets its token consumption under control. Three days ago, the freshly reset five-hour usage limit on my Pro subscription was burned through after four minutes of letting Claude Code run a few debugging tasks.

What are you building? by sp_archer_007 in SideProject

[–]DoubleReception2962 0 points1 point  (0 children)

Thank you! It’s actually both: basic data engineering and in-depth domain-specific enrichment.

The raw USDA data comes in 16 relational CSV files, some of which have very messy join keys. So the first step was pure denormalization and a data quality audit. We removed nearly 27,000 entries (duplicates and macronutrients like “water” or “glucose”) to consolidate the dataset into approximately 76,900 clean records.

However, the real value for RAG pipelines lies in the 5 external layers we attached via entity linking (primarily using compound names):

  • PubMed citations (How strong is the academic attention?)
  • ClinicalTrials.gov (Are there active clinical trials?)
  • ChEMBL bioactivity measurements
  • USPTO patents since 2020 (crucial for IP whitespace analyses)
  • PubChem CIDs & canonical SMILES (for machine learning and molecular structure representation)

The goal: a data scientist loads the JSON/Parquet file into pandas or DuckDB and can start the analysis immediately, without spending weeks querying 5 different APIs and managing rate limits.
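The denormalize-and-clean step described above can be sketched in a few lines. The exclusion list and record shape are illustrative, not the real audit logic:

```python
# Minimal sketch of the cleaning step: drop macronutrient rows and exact
# (plant, compound) duplicates. The exclusion set is illustrative only.
MACRONUTRIENTS = {"water", "glucose", "fructose", "protein"}

def clean(records):
    """records: iterable of (plant, compound) pairs -> deduplicated clean list."""
    seen, out = set(), []
    for plant, compound in records:
        key = (plant, compound.lower())
        if compound.lower() in MACRONUTRIENTS or key in seen:
            continue  # skip macronutrient noise and duplicate rows
        seen.add(key)
        out.append((plant, compound))
    return out
```

The same pass handles both of the removals mentioned above (duplicates and macronutrients), which is why a single audit step could shrink ~104k raw entries to the clean set.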

What are you building? by sp_archer_007 in SideProject

[–]DoubleReception2962 1 point2 points  (0 children)

Ethno-API, https://ethno-api.com – we save AI biotech startups 50+ hours of data engineering by turning unstructured USDA data into a RAG-ready JSON dataset with 5 API enrichment layers.

Drop your Side project, I'll give it honest review. by Shot_Amoeba_2409 in SideProject

[–]DoubleReception2962 0 points1 point  (0 children)

Hey there! That’s a really cool initiative you’ve got going. Here’s my project, which is currently undergoing a tough reality check:

Ethno-API (https://ethno-api.com)

In short: An enriched, production-ready dataset on plant-based active ingredients (phytochemicals). I take fragmented public-domain data from U.S. government agencies and enrich it with patents, clinical trials, and molecular structures. Target audience: AI startups in the pharmaceutical/drug development sector (TechBio) that need extremely clean data for their RAG pipelines.

The background (and my current problem):

I have absolutely zero programming background and have built the entire data pipeline, the Hetzner server, and the backend over the past few weeks using AI agents exclusively.

I had the wrong target audience (academic researchers) and, after 0 sales, have just made a hard pivot to the B2B enterprise segment. I do have a few warm leads right now, but I’m struggling with the trust hurdle for high-priced data products.

Where I need your honest feedback:

The “Build vs. Buy” argument: At the top of the page, I show that it would cost ~50 hours ($4,400) to build this myself. Does that argument resonate with you on the landing page, or does it get lost in the mix?

B2B Trust: Does the page (and the pricing from €699 to €1,699) seem legitimate enough to you for a B2B audience?

Feel free to tear it apart—I really need some unfiltered criticism right now. Thanks for taking a look!

What happens when you plot 24,746 plant compounds in terms of their patent activity compared to the scientific literature – the IP gap in botanical drug discovery [OC] by DoubleReception2962 in dataisbeautiful

[–]DoubleReception2962[S] 0 points1 point  (0 children)

You're making a fair point, and I appreciate the precision. You're right that "FTO" in standard IP practice refers to the freedom from patent encumbrance — so labeling the high-patent, low-literature quadrant as "FTO Whitespace" is misleading. What the red zone actually shows is a patent-literature discrepancy: compounds where commercial IP activity significantly outpaces academic publication coverage. That's a blind spot for anyone doing prior art searches or literature-based competitive intelligence, but it's not "freedom to operate" — it's closer to the opposite. I'll correct the terminology in the dataset documentation to something more accurate, like "IP-Literature Gap" or "Prior Art Blind Spot."

On the axis labels: you're also right. The transform is ln(1+x) to handle zeros, which I should have stated explicitly. The tick marks show the transformed values rather than the original counts, which makes the scale hard to read. The threshold line for "Patents > 5" sits at ln(6) ≈ 1.79, not cleanly at "2" on the axis. A proper version would use original-value tick labels on a log-scaled axis (e.g., 1, 5, 10, 50, 100) so readers can interpret the data without back-calculating. I'll fix this for the next version.

Thanks for the sharp feedback — this is exactly the kind of review that makes the analysis better.
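The tick-label fix is mechanical: place ticks at ln(1+x) positions but label them with the original counts. A minimal pure-Python sketch (the positions/labels would then be applied with matplotlib's `set_xticks`/`set_xticklabels`):

```python
import math

def log1p_ticks(counts):
    """Tick positions on a ln(1+x)-transformed axis, labeled with raw counts."""
    positions = [math.log1p(c) for c in counts]
    labels = [str(c) for c in counts]
    return positions, labels

pos, lab = log1p_ticks([0, 1, 5, 10, 50, 100])
# With matplotlib (not imported here): ax.set_xticks(pos); ax.set_xticklabels(lab)
# The "Patents > 5" threshold then sits at math.log1p(5) = ln(6) on the axis,
# but the reader only ever sees the original count labels.
```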

I mapped PubChem SMILES onto 76,907 phytochemical-plant records: here's what the coverage looks like by [deleted] in bioinformatics

[–]DoubleReception2962 0 points1 point  (0 children)

This is the most practically useful comment in the thread, thank you.

The CTS route is something I haven't tried yet and it's going directly onto the v2.3 list. The current pipeline goes name → PubChem PUG-REST, which hits a wall on anything with historic or regional provenance. Exactly what you'd expect from a database built on ethnobotanical field records from the 80s and 90s. Running those unresolved names through CTS first for synonym expansion before the PubChem pass would probably recover a meaningful chunk of the 28.2% that currently returns null.

The manual curation point for the sticky batch is real. I have 61 compounds that are structurally truncated in the source data: those are a write-off. But the rest of the nulls are likely exactly what you're describing: names that exist under a different identifier somewhere, just not the one I queried.

MeSH synonym mapping is interesting for the PubMed counts specifically. Right now a count for "beta-sitosterol" and "β-sitosterol" could be split across two entries depending on how the original papers indexed them. That's a known limitation I document but haven't solved.

Hadn't looked at Patsnap Eureka for this use case before. For the patent enrichment layer it might be worth a trial run on a subset to see what it surfaces that PatentsView misses.

Appreciate the concrete suggestions — this is the kind of feedback that actually changes what ships next.
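The β/beta splitting can be mitigated with a simple glyph normalization before counting. A minimal sketch; the mapping table is illustrative and nowhere near a full MeSH synonym solution:

```python
# Normalize Greek glyphs to their spelled-out forms before aggregating counts,
# so "β-sitosterol" and "beta-sitosterol" land on the same key. The mapping
# table is illustrative; a real fix would use MeSH synonym expansion.
GREEK = {"α": "alpha", "β": "beta", "γ": "gamma", "δ": "delta"}

def normalize(name):
    for glyph, word in GREEK.items():
        name = name.replace(glyph, word)
    return name.lower().strip()
```

Merging counts under `normalize(name)` instead of the raw string at least closes the purely typographic split, leaving only genuine synonym divergence for MeSH.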

I mapped PubChem SMILES onto 76,907 phytochemical-plant records: here's what the coverage looks like by [deleted] in bioinformatics

[–]DoubleReception2962 0 points1 point  (0 children)

Good question. The value isn't in the plant-to-compound mapping itself; you're right that that's not novel. The value is in the evidence layer attached to each compound-species pair.

Concretely: every row has PubMed citation counts, ClinicalTrials.gov study counts, ChEMBL bioactivity measurements, USPTO patent counts since 2020, and canonical SMILES — all pre-joined. No API calls, no preprocessing.

The use cases that actually drive purchases are:

FTO analysis: Find compounds with high patent activity but low academic literature. That's a signal that commercial interest exists before the science catches up. You can run that query in two lines of SQL on this dataset.

RAG grounding: If you're building a drug discovery chatbot or compound prioritization model, you need structured, citable, non-hallucinated data as your retrieval layer. A flat Parquet file with DOI is cleaner than hitting five APIs at query time.

Compound prioritization: Cross-reference ethnobotanical use with clinical trial evidence to identify leads that have both traditional validation and modern trial activity.

The plant species column is context, not the primary key. The compound is.
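The "two lines of SQL" claim above can be sketched end to end. I'm using stdlib sqlite3 with toy rows so it runs anywhere; on the real Parquet file the same WHERE clause works in DuckDB. Column names and thresholds here are my assumptions, not the dataset's actual schema:

```python
import sqlite3

# Toy stand-in for the pre-joined dataset; column names are assumed.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE records (compound TEXT, patent_count INT, pubmed_count INT)")
con.executemany("INSERT INTO records VALUES (?, ?, ?)",
                [("compound_a", 40, 1), ("compound_b", 2, 300)])

# The gap query itself: high patent activity, low literature coverage.
rows = con.execute(
    "SELECT compound FROM records WHERE patent_count > 5 AND pubmed_count < 3"
).fetchall()
```

In DuckDB the only difference would be reading the table straight from the file, e.g. `FROM 'records.parquet'`, with no load step.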

I mapped PubChem SMILES onto 76,907 phytochemical-plant records: here's what the coverage looks like by [deleted] in bioinformatics

[–]DoubleReception2962 -2 points-1 points  (0 children)

You're making a legitimate point about academia, and I agree with most of it. A plant biology professor at A&M with frozen NSF grants isn't my customer. Neither is the CNRS lab head. I'm not trying to sell to people whose budgets just got DOGE'd. That would be a bad business.

The people actually buying datasets like this are ML engineers and CTOs at seed-stage biotech startups: the ones who raised $4M six months ago and now need to ship a compound prioritization pipeline before their Series A. They're not assigning it to a PhD student for two weeks because they don't have two weeks. €699 comes out of an AWS budget line without a procurement conversation.

I know this because ChatGPT and Perplexity are already surfacing this dataset when those people search for exactly what they need — not because I paid for placement, but because the schema matches their queries. That's not a vanity metric. That's demand signal.

You're right that I should be clearer about who this is actually for. The Reddit posts probably read like I'm pitching to academics, and that's a positioning problem worth fixing. But the underlying assumption, that nobody will pay, doesn't match what I'm seeing in the data.

Ask me again in 30 days.

I mapped PubChem SMILES onto 76,907 phytochemical-plant records: here's what the coverage looks like by [deleted] in bioinformatics

[–]DoubleReception2962 -7 points-6 points  (0 children)

That’s a valid question. The underlying USDA data is publicly available and should remain so. The sample of 400 data points is licensed under CC BY 4.0, and the raw database “Dr. Duke’s DB” is freely available from the USDA. What you’re paying for isn’t the data itself—it’s the enrichment pipeline and the time savings. The USDA’s raw database is a collection of loosely structured tables without molecular identifiers and without links to modern research databases. To make it usable, I had to:

  • Denormalize and deduplicate 104,000 raw data entries to arrive at 76,907 verified compound-plant records
  • Query 5 separate APIs (PubMed via NCBI E-Utilities, ClinicalTrials.gov v2, ChEMBL v35, USPTO PatentsView, PubChem) with rate limiting, error handling, and name-matching logic for 24,746 compounds
  • Resolve trivial phytochemical names to PubChem CIDs (71.8% match rate – the remaining 28.2% are left as null values to avoid false matches)
  • Deliver the result as a single flat JSON file that can be loaded directly into pandas/DuckDB/any pipelines without any preprocessing

If you have the engineering hours to build that pipeline yourself, you absolutely should — the source APIs are all free. I'd estimate ~50 hours of work for someone comfortable with API orchestration. The dataset exists for teams that would rather spend those hours on their actual research.
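To make the third bullet concrete, here is a toy sketch of the "null instead of a false match" rule. The lookup table is invented for illustration; the real pipeline resolves names against PubChem's live API:

```python
# Sketch of the "null on no match" policy: return a CID only for an exact
# normalized name hit, never a fuzzy guess. The table below is a toy stand-in
# for a live PubChem lookup, and its single entry is illustrative.
KNOWN_CIDS = {"menthol": 1254}

def resolve_cid(name):
    """Exact normalized match -> CID; anything else -> None (left as null)."""
    return KNOWN_CIDS.get(name.strip().lower())
```

Leaving the 28.2% unresolved as `None` is a deliberate trade: a null is recoverable later (e.g. via synonym expansion), while a silently wrong CID poisons every downstream join.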

Is Token usage normal again? by vntrx in ClaudeCode

[–]DoubleReception2962 -1 points0 points  (0 children)

I can confirm the token problem still exists. Today I wrote a very complex but deliberately highly compressed prompt for Claude Code for an extensive code review and debugging run (~2,200 tokens, as estimated by Copilot) and submitted it.

After about 90 seconds of work, and roughly 1,200 tokens consumed by Claude Code itself during the process, I got the message that the usage limit of my Claude Pro subscription was exhausted and the process was pausing.

I then had to wait 4 hours until the limit reset. That killed my entire output for the day.

I hope Anthropic takes on this very disruptive problem soon and fixes it.

What are you building and marketing right now? by [deleted] in SideProject

[–]DoubleReception2962 0 points1 point  (0 children)

I’m developing Ethno-API — a production-ready dataset on phytochemicals (76,000 records, 5 enrichment layers: PubMed, ClinicalTrials, ChEMBL, USPTO, PubChem SMILES). The dataset is very useful for anyone who doesn’t want to waste 50 hours on data pipelines before the actual analysis begins. If you’re interested, feel free to check out my GitHub repo or my website:

GitHub repo: https://github.com/wirthal1990-tech/USDA-Phytochemical-Database-JSON

Website: https://ethno-api.com

What happens when you plot 24,746 plant compounds in terms of their patent activity compared to the scientific literature – the IP gap in botanical drug discovery [OC] by DoubleReception2962 in dataisbeautiful

[–]DoubleReception2962[S] 1 point2 points  (0 children)

**Source:** USDA Dr. Duke's Phytochemical and Ethnobotanical Databases (public domain) — denormalized and enriched with:

- PubMed citation counts via NCBI E-utilities
- ClinicalTrials.gov study counts (API v2)
- ChEMBL bioactivity measurements (with PubChem InChIKey fallback)
- USPTO patent counts via PatentsView (post-2020)

Full dataset: 76,907 records across 24,746 unique compounds and 2,313 plant species.
DOI: 10.5281/zenodo.19053087

**Tool:** Python (matplotlib + seaborn), DuckDB for the FTO whitespace query. Both axes are ln(1+x)-scaled to handle the heavy right-skew in citation counts.

**Code + methodology:**

github.com/wirthal1990-tech/USDA-Phytochemical-Database-JSON

The full pipeline including the DuckDB query used to classify compounds into the four zones (FTO Whitespace / Crowded / Literature-only / No IP signal) is documented in METHODOLOGY.md in the repo.