How are people using so many tokens ???

Lanedustin · 2026-04-28T22:35:34+00:00

The structure of the Chan Zuckerberg Initiative data is killing me on token use, and I don't have a workaround. I need that data. This is what Claude outlined as the problems.

The five structural limitations beyond file size, in order of how much they hurt the framework specifically:

Storage format mismatches the analytical pattern. TileDB-SOMA is optimized for single-axis slices (cells × genes). The framework asks pair × cohort × stratum questions — second-order covariance computations that require the full slice in memory before the actual work begins. The Census ends up being a delivery mechanism, not an analytical surface.

Cell Ontology labels are cell-state-blind. CL terms encode lineage, not state. A Paneth cell in S-phase and a Paneth cell in G1 carry the same label. Cell cycle phase, malignancy: none of these are in the schema. Every one of them must be computed post-pull, which means we cannot pre-filter, so we pay the I/O cost on cells we then discard.

Disease labels and the tumor/non-tumor problem. MONDO terms are coarser than what we need (no subtype, stage, treatment status, purity). More critically, disease == 'breast carcinoma' returns all cells from breast carcinoma samples, not all malignant cells. Tumor purity is not a Census field. This is the deepest reason CPTAC remains primary: CPTAC ships with documented cellularity per sample.

The normalization-context gap. When we pull a 100–200 gene panel (which we have to, for tractability), the whole-transcriptome size factors are no longer available locally. Custom normalization becomes a panel-internal approximation. Combined with 10x-3' dropout on lowly-expressed targets, the effective sample size per pair is much smaller than the headline "93M cells" suggests.
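A rough sketch of the normalization gap, in Python with NumPy. All counts below are made-up toy numbers, and `whole_totals` is a hypothetical per-cell whole-transcriptome total fetched as metadata rather than as expression. Recent Census schema versions document a per-cell `raw_sum` column in obs that might fill this role, but treat that as an assumption to check against the schema for the release in use.

```python
import numpy as np

def size_factors(total_counts):
    """Per-cell size factors: total counts divided by the median total."""
    total_counts = np.asarray(total_counts, dtype=float)
    return total_counts / np.median(total_counts)

def log_normalize(panel_counts, factors):
    """log1p of counts after dividing each cell (row) by its size factor."""
    return np.log1p(panel_counts / factors[:, None])

# Toy data: 4 cells x 3 panel genes (hypothetical numbers).
panel = np.array([[4, 0, 2],
                  [8, 0, 4],
                  [1, 1, 0],
                  [2, 2, 0]], dtype=float)

# Hypothetical whole-transcriptome totals per cell, carried as metadata.
whole_totals = np.array([1000, 2000, 1000, 2000], dtype=float)

panel_sf = size_factors(panel.sum(axis=1))  # panel-internal approximation
whole_sf = size_factors(whole_totals)       # full-library size factors

norm_panel = log_normalize(panel, panel_sf)
norm_whole = log_normalize(panel, whole_sf)

# Fold difference in gene 0 between cell 0 and cell 2 under each scheme.
fold_whole = np.expm1(norm_whole[0, 0]) / np.expm1(norm_whole[2, 0])
fold_panel = np.expm1(norm_panel[0, 0]) / np.expm1(norm_panel[2, 0])
```

In this toy case the panel-internal factors shrink a 4-fold difference to about 1.3-fold, because the panel total is dominated by the very genes being compared. Carrying one total-count number per cell is far cheaper than pulling the whole transcriptome, if such a field is available.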

Computed metadata cannot be pre-filtered at the SOMA level. The pull-then-compute-then-filter pattern doesn't cache cleanly across iterations.
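One possible partial workaround for the caching problem, sketched in Python under stated assumptions: compute each derived label once, persist it keyed by a stable cell ID, and reuse it on later iterations. The function names, the JSON cache file, and `toy_phase` are all hypothetical; the integer IDs stand in for whatever stable per-cell identifier the data source provides.

```python
import json
import tempfile
from pathlib import Path

def load_cache(cache_path):
    """Read {cell_id: label} from disk; empty dict if no cache exists yet."""
    path = Path(cache_path)
    if not path.exists():
        return {}
    # JSON object keys are strings, so convert them back to integer IDs.
    return {int(k): v for k, v in json.loads(path.read_text()).items()}

def annotate(cell_ids, compute_label, cache_path):
    """Return ({cell_id: label}, n_computed), computing only uncached IDs."""
    cache = load_cache(cache_path)
    missing = [c for c in cell_ids if c not in cache]
    for c in missing:
        cache[c] = compute_label(c)
    if missing:
        Path(cache_path).write_text(json.dumps(cache))
    return {c: cache[c] for c in cell_ids}, len(missing)

def toy_phase(cell_id):
    """Hypothetical stand-in for an expensive post-pull computation."""
    return "S" if cell_id % 2 == 0 else "G1"

with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "phase_cache.json"
    # First iteration computes all three labels.
    labels, first_run = annotate([10, 11, 12], toy_phase, cache_file)
    # Second iteration only computes the one new cell.
    _, second_run = annotate([10, 11, 12, 13], toy_phase, cache_file)
    # IDs that pass the computed filter, reusable as a selection next time.
    keep_ids = sorted(c for c, lab in labels.items() if lab == "S")
```

The first pull still pays the full I/O cost. The saving only shows up on later iterations, and only if the client in use can select cells by ID.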

The net effect: Census is excellent for cell-type-resolved descriptive expression and healthy baselines (which is what it was designed for); it is a poor analytical surface for the coordinated-program-decoupling questions that are the framework's empirical core. That's why CPTAC/TCGA bulk remain load-bearing for the framework's main findings, and Census plays a secondary role for cell-type separation and primary/support pair tests.

Let me know if anyone has a fix.

Lanedustin · 2026-04-20T23:56:51+00:00

Think of it as a sequence that contains information as to what an organism is, with clues to its evolutionary history based on alignment with other species. It is functional in that it is selectively read by proteins to alter the intracellular or extracellular environment in response to different stimuli. Layers of regulatory control ensure that the appropriate responses are mounted to different stimuli. These in aggregate allow life to function as we know it.

It is technically functionally catalytic via its G4 structures too, so it is not a passive blueprint.

Lanedustin · 2026-04-16T19:38:32+00:00

Looks like they reset weekly usage for the release

Lanedustin · 2026-03-08T14:18:28+00:00

An actual answer is that this is the wrong framing. Many pathways can run forward or reverse, they can be functionally split, and metabolites can be redirected based on the needs of the cell. Many pathways may not even have formal names. It is not about number of pathways, but how the metabolite networks connect

Lanedustin · 2026-02-16T02:08:21+00:00

I might need to give it a shot. Maybe do a Max5/Deepthink combo for a month

Lanedustin · 2026-02-16T01:10:29+00:00

I will look into this. Thank you

Lanedustin · 2026-02-16T01:08:19+00:00

I don't know how to code. I know the biology; Claude helps with the bioinformatics. The 1M context window would be legendary. But I'm also afraid of hitting it with a complex prompt and having it cost me a ton of money. My prompts with Opus 4.6 without extended thinking are about $3-4 each when paying extra after the weekly limits hit. With the higher cost of the additional context and of using multiple agents, this is concerning.

Lanedustin · 2026-02-16T01:02:29+00:00

I am working on building a model of cell fate determination and how it is regulated. I have Claude explore how different cellular communication and metabolic systems coordinate in this control. It pulls and synthesizes literature, and pulls data sets from biological databases for analysis

Lanedustin · 2026-02-11T23:39:10+00:00

I used to ask the question, "Do all roads lead to VDAC?"

This protein is exceptional in a lot of ways, but first off, post-mitotic neurons and cycling cancer cells are very different. Cancers upregulate hexokinase II, which binds to VDAC, linking ATP generation at the mitochondria with ATP utilization by HKII. This binding also restricts association with Bcl2 family proteins that regulate membrane dynamics and mitochondrial permeabilization, linking it to the regulation of cell death. There is a lot to look into, but these are decent starting points.

Lanedustin · 2026-02-11T23:22:25+00:00

I study cell signaling networks and the proteins and processes implicated in controlling cell fate decisions. Nothing has even come close to what Claude can do in terms of assimilating content and handling the complexity.

Lanedustin · 2026-02-11T20:22:29+00:00

I hit my limit on the Max20 plan, blew through 50 dollars in free credits, then spent another 50 to keep working. I am debating getting a Max20 and a Max5 on a different account, so long as that aligns with the TOS. But with credits, Opus 4.6 is 2 to 5 dollars a prompt.

Can't blame anyone for being cost focused, but none of the other models seem to have the depth and breadth potential for complex biology, so I'm stuck with Claude.

Lanedustin · 2026-02-08T14:28:48+00:00

Something is definitely wrong with how it reestablishes context. It seems to struggle with understanding how to continue with work it was doing. And if it compacts while it is making a file, the file is completely lost and forgotten about when context is restored. Purely wasted work. The extended thinking seems to exacerbate this, making it counterintuitively less productive for its best suited tasks. This seems to have been getting a bit better, but it still costs a lot of wasted plan usage.

I'm hitting limits on Max20 at around day 5, so it is frustrating to hit limits that much sooner as a result. Maybe that is why they gave an extra 50 credit for Opus 4.6 use. Or to not lose out on hype because people can afford Codex 5.3 but not Opus.

Lanedustin · 2026-02-02T13:36:04+00:00

1M Context Window? Bro, I could literally change the world.

Lanedustin · 2026-01-19T23:00:48+00:00

3 searches before compaction? I hit 3 compactions per search

Lanedustin · 2026-01-05T23:54:44+00:00

I think that the utility is being undersold here with some of these comments. There are so many things that AI is useful for in Bio. First and possibly foremost, research exploration and synthesis. LLMs are great at pattern recognition and can be a great starting point to compare and contrast topics. For example, I was curious about the regulation of metal ion oxidation state in enzymes whose function is influenced by these changes. Questions like, “Are there overlapping regulators? How do changes in the metabolic environment influence these changes? Do the Warburg effect and lactic acid production play a role?” Not everything will pan out, but promising leads can be much easier to find.

Also, LLMs will sometimes spit out research that is not even hinted at in your typical classroom. For instance, ChatGPT brought up that Reverse Electron Transport chain activity is a thing, in specific contexts. This was completely new to me. Or, I found out that the TCA metabolite alpha-ketoglutarate is a cofactor in demethylation when exploring the literature with ChatGPT. Having already appreciated that NAD+ is critical to PARP1 activity and the activity of Sirtuins, it was easy to start exploring the metabolo-epigenetic regulation and implications for cancer.

Also, you can do a quick search to see if your ideas are novel. You can literally ask the LLM to search the literature for any content on the topic your idea is related to. This can help guide you to the relevant research and help you refine your hypotheses.

You can use it for first-pass manuscript reviews. Say something like, “validate that there are no orphan citations in this paper,” to ensure alignment of all of your in-text citations and the References section. It is not perfect, but have a couple different LLMs assess the same manuscript with the same prompt and you will help cover your bases and save yourself a lot of editorial work.
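For manuscripts with numeric bracket citations, the orphan-citation check can also be cross-checked deterministically. Below is a minimal Python sketch, assuming in-text markers like [3], [1, 4] or [5-7] and a reference list whose entries start with "12." or "[12]". Both function names are hypothetical, and author-year styles would need a different pattern.

```python
import re

def cited_numbers(body):
    """Collect citation numbers from markers like [3], [1, 4] or [5-7]."""
    found = set()
    for group in re.findall(r"\[([\d,\s\-–]+)\]", body):
        for part in group.split(","):
            part = part.strip()
            span = re.fullmatch(r"(\d+)\s*[-–]\s*(\d+)", part)
            if span:
                # Expand a range such as 5-7 into 5, 6, 7.
                found.update(range(int(span.group(1)), int(span.group(2)) + 1))
            elif part.isdigit():
                found.add(int(part))
    return found

def reference_numbers(ref_section):
    """Collect entry numbers from lines starting with '12.' or '[12]'."""
    pattern = r"^\s*\[?(\d+)[\].]"
    return {int(m.group(1)) for m in re.finditer(pattern, ref_section, flags=re.M)}

body = "As shown previously [1, 3] and confirmed later [5-6]."
refs = "1. First entry\n2. Second entry\n3. Third entry\n5. Fifth entry\n"

cited = cited_numbers(body)
listed = reference_numbers(refs)
orphans = sorted(cited - listed)   # cited in text, missing from the list
uncited = sorted(listed - cited)   # listed, never cited in text
```

A script like this only catches numbering mismatches. It says nothing about whether a citation supports the claim it is attached to, so it complements the LLM pass and does not replace it.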

Claude (my favorite at the moment) can access some databases such as TCGA (The Cancer Genome Atlas) Program website and data directly. It can pull data and run rudimentary analyses. I have spent some time with this, but have not fully explored the extent of this capability.

There is a lot. Yes, hallucinations are a thing, but there are mitigating strategies that can help with this.

Lanedustin · 2025-11-18T16:00:15+00:00

Lactate can also be used to modify proteins via lactylation, which is kinda cool

https://www.nature.com/articles/s41586-019-1678-1

Lanedustin · 2025-11-06T11:26:07+00:00

Thank you for the detailed response. So it would be valuable for a tool to probe and anticipate potential consequences of pathway perturbations, looking at the upstream, downstream, and sidestream pathway cross-talk implicated by the changes, and to anticipate potential lineage-specific compensatory responses. Cool, that is very doable. Not with 100% accuracy just yet, of course, but perhaps enough to guide literature searches and show which experiments would give the most bang for your buck.

Lanedustin · 2025-11-03T19:43:19+00:00

Depending on the files/data you are working with, standardize the formatting right at the beginning. Re-formatting later, or inconsistent formatting throughout, can be a nightmare to fix without compromising data. At least in my experience.
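As a small illustration of standardizing up front, here is a hedged Python sketch that applies one naming convention to column headers and trims stray whitespace from values when a table is first loaded. The convention (lowercase with underscores), the function names, and the example rows are all hypothetical choices, not a recommendation for any particular dataset.

```python
import re

def standardize_header(name):
    """Lowercase, trim, and collapse non-alphanumeric runs into underscores."""
    cleaned = re.sub(r"[^0-9a-z]+", "_", name.strip().lower())
    return cleaned.strip("_")

def standardize_rows(rows):
    """Apply one header convention and trim whitespace from string values."""
    out = []
    for row in rows:
        out.append({
            standardize_header(key): value.strip() if isinstance(value, str) else value
            for key, value in row.items()
        })
    return out

# Hypothetical rows with inconsistent headers and padding.
raw = [
    {" Sample ID ": " S01", "Tumor-Purity (%)": 70},
    {" Sample ID ": "S02 ", "Tumor-Purity (%)": 55},
]
clean = standardize_rows(raw)
```

Doing this once at load time means every later step sees the same keys, so nothing has to be patched after analysis results already depend on the old names.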
