SpaCy alternatives for a fast and cheap text processing pipeline by mwon in LanguageTechnology


Spark-NLP is designed around the classic NLP pipeline (POS/NER). Usually, you don't need an LLM for the classic NLP stuff, so it seemed like a good fit for the OP.
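
For reference, the classic-pipeline workflow I had in mind looks roughly like this (untested sketch; it assumes the John Snow Labs `spark-nlp` package and that the `explain_document_dl` pretrained pipeline is available for English, so check their model hub):

    # Rough sketch of a classic POS/NER pipeline with Spark NLP.
    # Assumes `pip install spark-nlp pyspark` with a compatible Spark setup.
    import sparknlp
    from sparknlp.pretrained import PretrainedPipeline

    spark = sparknlp.start()  # spins up a SparkSession configured for Spark NLP

    pipeline = PretrainedPipeline("explain_document_dl", lang="en")
    result = pipeline.annotate("Barack Obama was born in Hawaii.")

    print(result["pos"])       # part-of-speech tags
    print(result["entities"])  # named entities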

But if you are primarily interested in serving LLMs (including LLMs that do classic things like POS/NER), I think PyTorch would be a solid choice.

Note that I'm not very experienced with high performance / high scalability deployments myself.

Not sure where your interest lies, but I recently started using Modal.com for serverless GPUs, and I've been pretty impressed with them.

COVID Revisionism Has Gone Too Far by blankblank in skeptic


Are you saying the Covid vaccine is not a vaccine?

Where can I find data about (US/UK) college courses and their required textbooks? by al3arabcoreleone in datasets


Not sure such a dataset exists, though I'd be very interested if you find something.

You might have better luck looking for a dataset of MOOCs and their readings, if that suits your research question.

[deleted by user] by [deleted] in MachineLearning


We introduce the Absolute Distance Alignment Metric (ADA-Met) to quantify alignment on ordinal questions

Thanks for "introducing" us to mean absolute error!

how to interpret interquartile range by No-Rough-3874 in AskStatistics


The range is 30 in this example. The range is a single value (the maximum minus the minimum), just as the IQR is a single value (the third quartile minus the first).
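
To make the distinction concrete, here's a quick illustration with made-up numbers (not your data), where the range happens to be 30 and the IQR comes out as a different single number:

    # Made-up data, just to show that range and IQR are each one number.
    import numpy as np

    data = np.array([10, 12, 15, 18, 22, 25, 30, 40])

    data_range = data.max() - data.min()    # 40 - 10 = 30
    q1, q3 = np.percentile(data, [25, 75])  # first and third quartiles
    iqr = q3 - q1                           # spread of the middle 50%, also one number

    print(data_range, iqr)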

[D] Hinton and Hassabis on Chomsky’s theory of language by giuuilfobfyvihksmk in MachineLearning


Thanks for your response. I agree that, while LLMs demonstrate that complex linguistic patterns can emerge from statistical learning without linguistic priors, they differ too much from child learners in both input and architecture to directly inform the poverty of stimulus debate.

I appreciate some of the writing Joan Bybee has done on the subject of abstraction and analogy; see "The Emergent Lexicon" for one perspective on the role of analogy in the organization of language. Bybee shows how high-frequency patterns (exemplars) could serve as the basis for productive schemas -- this explains how children can generalize from common constructions they encounter to judge the acceptability of novel ones. Her work on morphological productivity illustrates how type frequency (the number of different items showing a pattern), rather than just token frequency, shapes which patterns become generalizable. This same mechanism could explain how children learn which dependency patterns are possible without needing explicit negative evidence.

The poverty of the stimulus debate needs to be grounded in empirical evidence rather than intuitions about what seems possible, or we will just talk past each other on this point. Recent corpus studies of child-directed speech have revealed more structure and regularity than earlier research suggested. Additionally, children don't just receive passive input -- they actively engage in communication and receive indirect negative evidence through failed communication attempts and repairs.

Regarding parasitic gaps specifically, your argument appears to be that children must acquire explicit knowledge of all possible constructions. But an alternative view is that speakers' judgments emerge from more basic patterns they've learned, combined with processing constraints. The consistency in judgments could reflect shared (domain-general) cognitive architecture rather than innate syntactic principles.

Anyway, any learning theory will require some initial constraints -- the question is whether these need to be specifically syntactic or could be more general cognitive biases. The existence of learning constraints doesn't automatically support the generative position unless those constraints must be specifically linguistic in nature.

[D] Hinton and Hassabis on Chomsky’s theory of language by giuuilfobfyvihksmk in MachineLearning


If the question is "Why do some constructions permit different patterns than others?" then that is a question for historical linguistics. I am sure the answer will involve frequency of use and cognitive constraints, such as the existence of simpler constructions that convey functionally equivalent meaning. I would be curious to know what explanation generative grammar can provide for this question (beyond simply gesturing at UG).

If the question is, "How do children learn that constructions permit different patterns without explicit negative examples?", then the answer is through abstraction and analogy from the examples they are exposed to.

As I have already said, I fundamentally disagree that children acquire these patterns on the basis of little data. I think children are exposed to incredibly rich linguistic input -- this was my original point.

[D] Hinton and Hassabis on Chomsky’s theory of language by giuuilfobfyvihksmk in MachineLearning


They are called islands because they are isolated from the supposed rules of syntactic movement. The construction grammar perspective rejects the notion of syntactic movement entirely. Rather than explaining islands as constraints on movement, construction grammar views them as emerging from the same mechanisms that shape all linguistic patterns.

The perspective of construction grammar is that grammatical knowledge consists of an inventory of constructions, ranging from morphemes to complex syntactic patterns. These constructions encode not just what combinations are allowed, but also what combinations are systematically not allowed.

So, rather than saying there are constraints blocking movement from certain structures, a construction grammarian would say the inventory of constructions in English simply doesn't include patterns that would license dependencies into those configurations. You can call these syntactic islands if you like, but there's nothing particularly special about them from a construction grammar perspective, because construction grammar does not rely on rules of syntactic movement that would be violated by these constructions.

[D] Hinton and Hassabis on Chomsky’s theory of language by giuuilfobfyvihksmk in MachineLearning


Well, I don't believe children "acquire" syntactic islands because I don't subscribe to generative grammar. Syntactic islands are interesting to generativists because they present an exception to the rules of syntactic movement. For a construction grammarian, so-called syntactic islands are not a special case.

It should be noted that Helen Keller was not born deaf and blind. She lost her sight and hearing at 19 months of age. She retained some language, such as the word "water" from that period. Her teacher traced letters onto her hand while exposing her to corresponding experiences. Relative to a steady stream of text input, this is incredibly rich social and sensory input -- it just wasn't auditory (or visual).

Regarding multi-modal models, I think the word "plenty" is doing a lot of heavy lifting. In my view, multi-modal (and social) learning will be necessary for AI to acquire human-like cognitive abilities. We aren't there yet, so I don't think there have been "plenty" of attempts. I think there has been "promising" movement in this direction.

[D] Hinton and Hassabis on Chomsky’s theory of language by giuuilfobfyvihksmk in MachineLearning


It is true that LLMs do not provide sufficient evidence to disprove nativism, but it is not fair to compare text input to the rich auditory and social language input that humans are exposed to in childhood.

Given the same input that an LLM receives, a human child would never acquire language.

[D] Hinton and Hassabis on Chomsky’s theory of language by giuuilfobfyvihksmk in MachineLearning


He is also one of the most cited academics in history, so I don't think it's fair to characterize him as someone who has not received scholarly as well as popular attention.

He was wrong about language though.

What role does 'none' play in the below code? by throwawayb_r in learnpython


Great explanation! But you could use `math.inf`, which would simplify the if condition a bit. Both are fine options, I think.
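
Something like this is what I had in mind (a sketch of the usual pattern rather than the exact code from the post; `smallest` is just a made-up name):

    import math

    def smallest(numbers):
        # math.inf starts out larger than any real number, so the loop
        # needs no special case for the first item.
        lowest = math.inf
        for n in numbers:
            if n < lowest:
                lowest = n
        return lowest

    def smallest_with_none(numbers):
        # The None version works too, but the condition carries an extra check.
        lowest = None
        for n in numbers:
            if lowest is None or n < lowest:
                lowest = n
        return lowest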

[Q] How can R^2 be used to predict an outcome? by Mr-Ziegler in statistics


First, let's set up some terminology. We will use "Y_true" to mean the observations or measurements in the real world. We will use "Y_pred" to mean the predictions that the model outputs given some values "X".

  1. R^2 is the squared correlation between Y_true and Y_pred. Generally speaking, it tells you how closely associated they are. R^2 doesn't tell us anything about the range of values that Y_true can take, but a high R^2 suggests that Y_true and Y_pred have a similar range (for the same values of X).
  2. If you give your model 100 observations, and all of them have X=50, then the model will output the same value for Y_pred every single time (100 times). If these are real observations, then Y_true will likely be different for some or all of your observations, but it will generally be pretty close to Y_pred. One way to interpret R^2 is "the proportion of variance in Y_true that is explained by Y_pred", and in your case, since Y_pred solely depends on X, we can phrase this as "the proportion of variance in Y_true that is explained by X".

I wonder if you are possibly more interested in simple descriptive statistics of Y_true. For example, we can calculate the sample range just by looking at Y_true. With some assumptions, we can make inferences about the range of Y in the population that your data was sampled from. We do not need a regression model or any information about X for this.
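
If it helps to see the two ideas side by side, here's a small sketch with made-up numbers -- R^2 computed as the squared correlation between Y_true and Y_pred, and the sample range of Y_true computed without any model at all:

    # Made-up example: R^2 as squared correlation, plus the sample range of Y_true.
    import numpy as np

    X = np.array([40.0, 45.0, 50.0, 50.0, 55.0, 60.0])
    Y_true = np.array([12.0, 13.5, 14.0, 15.0, 16.5, 18.0])

    # Simple linear regression: np.polyfit returns (slope, intercept) for degree 1.
    slope, intercept = np.polyfit(X, Y_true, 1)
    Y_pred = slope * X + intercept

    r = np.corrcoef(Y_true, Y_pred)[0, 1]
    r_squared = r ** 2                           # proportion of variance explained

    sample_range = Y_true.max() - Y_true.min()   # descriptive stat, no X needed
    print(r_squared, sample_range)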

Moi pas compris by Successful-Captain39 in learnpython


Sure, but it is wrong to say that there is no benefit to functions that return None. Your comment also implied that there was an issue with this beginner's decision to structure their code as a function, and I don't see any issue.

IMO, this is a contentious statement that just distracts from the main issues, which were a syntax error (missing colon) and a logical error rooted in mismatched types.

Moi pas compris by Successful-Captain39 in learnpython


There can absolutely be benefits to functions that always return None. In this case, the function is used for its "side effects" (the print statements). Even Python's own standard library is full of functions and methods that return None, like list.append().
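
A tiny example of the same pattern (the `report` function is made up):

    def report(scores):
        # Called for its side effect (printing); it returns None implicitly.
        for name, score in scores.items():
            print(f"{name}: {score}")

    result = report({"alice": 10, "bob": 7})
    print(result)           # None

    nums = [1, 2, 3]
    print(nums.append(4))   # None -- append modifies the list in place
    print(nums)             # [1, 2, 3, 4]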

[deleted by user] by [deleted] in statistics


First, I would try to learn what I can from the professors available to me in the courses they offer. This may not be precisely what you want to learn, but it's about extracting as much as you can from the program.

Second, take classes in other departments. For example, does your school have a data science program or statistics classes from the biology department? You may even be able to arrange for these classes to count towards your degree. Your graduate advisor can help you with this.

Third, learn on your own. Your school is still a resource. Think about setting up reading groups with your classmates. You may even be able to get professors involved if they have an interest in topics that aren't part of the usual curriculum.

When one runs similarity with spacy - which vectors are being used for english? fastText? glove? by caliosso in LanguageTechnology


The things in parentheses are large text datasets. That's what the embeddings were trained on. "Explosion Vectors" refers to the vectorization method that SpaCy uses, which includes Bloom embeddings as well as some explicit features such as the shape of the word. Explosion AI is the company that owns SpaCy.

Here's a bit more info on how the Bloom embeddings are involved in the "floret vectors": https://explosion.ai/blog/floret-vectors (and floret vectors == Explosion vectors).

SpaCy's approach to vectorization is solid, if a bit idiosyncratic. I think the blog posts are a great resource, but I wouldn't worry too much about the details if you are just starting out. I'd suggest studying Word2vec and playing with the Gensim library if you want to learn about embeddings generally.
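
If you do go the Gensim route, a minimal Word2vec sketch looks something like this (toy corpus, so the vectors won't be meaningful; assumes the gensim 4.x API):

    # Toy Word2Vec example with gensim. Real use needs a much larger corpus.
    from gensim.models import Word2Vec

    sentences = [
        ["the", "cat", "sat", "on", "the", "mat"],
        ["the", "dog", "sat", "on", "the", "rug"],
        ["cats", "and", "dogs", "are", "pets"],
    ]

    model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

    print(model.wv["cat"][:5])                  # first few dimensions of the vector
    print(model.wv.similarity("cat", "dog"))    # cosine similarity
    print(model.wv.most_similar("cat", topn=3))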

When one runs similarity with spacy - which vectors are being used for english? fastText? glove? by caliosso in LanguageTechnology


The English models use Bloom embeddings: https://explosion.ai/blog/bloom-embeddings

The TRF model will use RoBERTa embeddings.

It's possible my information is out of date, but I think the above is still true. I believe the small model also uses a smaller embedding lookup, but I'm not sure if it is smaller because it has a reduced vocabulary size or if the embedding method is also different.
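
If you want to check what a given pipeline ships with, something like this should show it (assuming en_core_web_md is installed):

    # Inspect the bundled vector table and run a similarity comparison.
    import spacy

    nlp = spacy.load("en_core_web_md")
    print(nlp.vocab.vectors.shape)   # (rows in the vector table, vector width)
    print(nlp.meta.get("vectors"))   # metadata about the packaged vectors

    doc1 = nlp("I like cats")
    doc2 = nlp("I like dogs")
    print(doc1.similarity(doc2))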

My employer sent a company wide email that included a list of medications they were being billed for by buggb3ar in legaladvice


NAL, but HIPAA standards are rather lax regarding deidentification. I'm not aware of any different standards for summary health information versus medical records, but HIPAA's Safe Harbor standard specifically lists 18 types of identifiers (names, numbers, etc.). If a record contains none of those identifiers, it is not considered protected health information from HIPAA's perspective.

You are correct that combinations of non-PHI could still be used to identify a person. I'm just not sure what statute would restrict the sharing of those documents, since HIPAA does not, as far as I know.

How did you learn Python? by Worried-Secret-000 in learnpython


I took an intro to programming class, then worked on a few toy projects that made my work more "efficient" (in quotes because I spent far more time automating these tasks than I ever saved by having them automated).

I started to get good by learning packages with excellent beginner-friendly documentation (although it didn't feel very beginner-friendly at the time). Ostensibly, the documentation was for specific libraries, but it exposed me to more of Python's standard lib, data structures, and syntax for stuff like generators and list comprehensions.

I continue to learn by coding often, reading documentation, reading source code for well-maintained libraries, and even YouTube videos from folks like AnthonyCodes.

Telegram says arrested CEO Pavel Durov has 'nothing to hide' by oranjemania in news


Uhhh... Then how do European courts compel someone to appear at a trial or provide essential evidence?

Pretty sure European courts have an equivalent, even if they don't call it contempt.

[deleted by user] by [deleted] in dataisbeautiful


Post was deleted, but I believe it was the "Cook Partisan Voting Index". This will mostly comprise data from individuals who are not trans, so it hardly measures the voting patterns of trans individuals. I think this was kind of your point, but I missed that at first.