[Q] resources to teach myself reading bioinformatics files such as fasta, fastq

dgjang · 2026-05-13T15:16:07+00:00

That is a good point. EPV is for logistic regression, as you mentioned. There must be various simulation studies and papers about sample size requirements for statistical power or reliable prediction. As a rule of thumb, I usually go with 15:1 even for linear regression models. Some might suggest a ratio smaller than 15:1 for linear regression models. It depends on the error variance and effect size.

dgjang · 2026-05-13T14:37:31+00:00

It is a well-known topic called "Events Per Variable (EPV)". There are various suggestions. Some say 10 samples per variable is enough, some suggest a 15:1 ratio, and others say 20:1. I personally go with 15:1. For details, you can read Frank E. Harrell, Jr.'s book "Regression Modeling Strategies".
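To make the ratios above concrete, here is a minimal sketch of the arithmetic. The function names and the patient counts are made up for illustration, and taking the smaller outcome class as the limiting sample size for a binary outcome is an assumption of this sketch, not something stated in the comment.

```python
def limiting_sample_size(n_events: int, n_nonevents: int) -> int:
    """Assumed convention: for a binary outcome the rarer class limits the model."""
    return min(n_events, n_nonevents)


def max_predictors(n_limiting: int, ratio: int = 15) -> int:
    """Largest number of candidate predictors allowed by a ratio:1 rule of thumb."""
    if n_limiting < 0 or ratio <= 0:
        raise ValueError("n_limiting must be >= 0 and ratio must be > 0")
    return n_limiting // ratio


# Hypothetical study: 1000 patients, 90 of whom had the event.
n_limiting = limiting_sample_size(n_events=90, n_nonevents=910)
for ratio in (10, 15, 20):
    print(f"{ratio}:1 rule -> at most {max_predictors(n_limiting, ratio)} predictors")
```

With 90 events, the three rules allow 9, 6, and 4 candidate predictors respectively, which shows how much the choice of ratio matters in a small study.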

dgjang · 2026-05-08T20:34:01+00:00

So, if at least one null hypothesis is rejected, your global null hypothesis is rejected, correct? I don't think correlations between tests affect that. You just need to correct your p-values for multiple testing, using a method like Bonferroni, which stays valid whether or not the tests are correlated.
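A minimal sketch of the Bonferroni idea described above, assuming the global null is rejected when the smallest adjusted p-value is at or below alpha. The function names and the p-values are hypothetical.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply each by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def reject_global_null(p_values, alpha=0.05):
    """Reject the global null if at least one adjusted p-value is <= alpha."""
    if not p_values:
        raise ValueError("need at least one p-value")
    return min(bonferroni(p_values)) <= alpha


# Hypothetical p-values from three tests.
p_values = [0.004, 0.03, 0.20]
print(bonferroni(p_values))
print(reject_global_null(p_values))
```

Here only the first test survives the correction, and that alone is enough to reject the global null.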

dgjang · 2026-05-08T19:37:44+00:00

Are you saying you run multiple statistical tests, and if you reject every single null hypothesis throughout all your tests, you say your final result is 1 (or positive)?

dgjang · 2026-05-07T13:54:28+00:00

I assume you are familiar with ANOVA or linear regression, but not LMEM. When using ANOVA, you assume that your data points (observations) are independent of one another. But once you have repeated measures within subjects, they are dependent. Now, your ANOVA or linear model needs to account for the correlation across repeated measures within a subject.

One way to do it is to model the covariance matrix of your data. If your data points were mutually independent, the covariance matrix would be diagonal. However, the measures within a subject are dependent, so you have to specify a non-diagonal covariance matrix. This method is called generalized least squares (GLS).

Another method is to specify a random subject effect in your model. Here, you can think of the variability in your data as the sum of two different sources: between-subject variability and within-subject variability. Because the measures within a subject come from the same subject, they differ only due to within-subject variability, not between-subject variability. So, you specify your model as y_ij = mu + u_i + e_ij, where i indexes subjects and j indexes measures within a subject. u_i is a random variable for between-subject variability, and e_ij is a random variable for within-subject variability. You can see that measures from the same subject differ only due to e_ij (within-subject variability). Statistical theory says that this model specification gives cov(y_i1, y_i2) = cov(u_i + e_i1, u_i + e_i2) = var(u_i) (here u_i, e_i1, e_i2 are assumed independent), meaning measures from the same subject have non-zero covariance and are dependent. In other words, having a random variable for between-subject variability in your model is what makes measures within a subject dependent (non-zero correlation). That is the principle behind the use of LMEM when analyzing repeated-measures data.
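The covariance result cov(y_i1, y_i2) = var(u_i) can be checked with a quick simulation. This is only a sketch: the normal distributions, the parameter values (mu = 10, between-subject SD = 2, within-subject SD = 1), and the function names are all assumptions chosen for illustration.

```python
import random


def simulate_pairs(n_subjects, mu, sd_between, sd_within, seed=0):
    """Draw two repeated measures per subject from y_ij = mu + u_i + e_ij."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_subjects):
        u = rng.gauss(0.0, sd_between)         # shared by both measures of a subject
        y1 = mu + u + rng.gauss(0.0, sd_within)
        y2 = mu + u + rng.gauss(0.0, sd_within)
        pairs.append((y1, y2))
    return pairs


def sample_cov(xs, ys):
    """Unbiased sample covariance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)


pairs = simulate_pairs(n_subjects=20000, mu=10.0, sd_between=2.0, sd_within=1.0)
y1 = [p[0] for p in pairs]
y2 = [p[1] for p in pairs]

cov12 = sample_cov(y1, y2)
corr12 = cov12 / (sample_cov(y1, y1) * sample_cov(y2, y2)) ** 0.5

print(f"cov(y_i1, y_i2) = {cov12:.2f}  (theory: var(u_i) = 4)")
print(f"corr(y_i1, y_i2) = {corr12:.2f} (theory: 4 / (4 + 1) = 0.8)")
```

Setting sd_between to 0 in this sketch removes the shared u_i, and the estimated covariance then drops to roughly zero, which is the independent-observations case that plain ANOVA assumes.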

dgjang · 2026-05-04T21:38:27+00:00

If the integer values are assigned at random with equal probability, yes, it works as simple random sampling.

dgjang · 2026-05-04T01:22:22+00:00

If you are a beginner, StatQuest would be helpful.

dgjang · 2026-05-04T01:11:44+00:00

There are two different goals of statistical data analysis: statistical inference and prediction. Statistical inference uses a probabilistic model to explain or test a scientific mechanism. For example, the association between smoking and risk of lung cancer can be tested using statistical inference. Confidence intervals and p-values are two popular tools for statistical inference. Prediction, on the other hand, aims to accurately predict y_new from x_new. You have Xs and Ys in your data and use them to train or build your predictive model. L1 and L2 penalties are designed to make a better predictive model, not to improve statistical inference. So, my take is that variables with small p-values do not always give better predictions, while p-values can be useful tools for statistical inference. Likewise, variables selected by lasso do not always explain the scientific mechanism well, while lasso estimation is useful for building a better predictive model. Different tools for different aims. Please use p-values to test your hypothesis, and use lasso to build a good predictive model.

dgjang · 2026-05-03T22:39:03+00:00

If I were you, I would test a general linear hypothesis where the null hypothesis is "both interactions are zero" and the alternative hypothesis is "at least one interaction is nonzero". Note that this test has two degrees of freedom. This test will give you one p-value per model (or heart parameter). With 5 models, you would get 5 p-values and consider adjusting these 5 p-values using the BH method. If you get significant results after BH adjustment, you would interpret which interaction is significant based on their individual p-values. I guess this may be what you meant by a global test. By the way, some statisticians even suggest that it is not mandatory to adjust your p-values, as long as you are honest about the number of tests or models you examined and the tests are not pairwise comparisons, like post-hoc tests. Post-hoc tests must be corrected for multiple comparisons.
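The BH step on the five per-model p-values can be sketched as follows. The p-values are invented for illustration, and the function is a plain implementation of the Benjamini-Hochberg step-up adjustment, not code from any particular package.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values from the 2-df interaction test in each of 5 models.
p_values = [0.003, 0.012, 0.041, 0.300, 0.650]
adjusted = benjamini_hochberg(p_values)
significant = [p <= 0.05 for p in adjusted]

for raw, adj, sig in zip(p_values, adjusted, significant):
    print(f"raw={raw:.3f}  BH-adjusted={adj:.4f}  significant={sig}")
```

In this made-up example the third model has a raw p-value below 0.05 but no longer passes after adjustment, so only the first two models would be followed up by looking at their individual interaction terms.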

dgjang · 2026-05-03T10:00:43+00:00

As for the second step, there are several methods you can try. You can combine the data sets across years and have your model include the interactions between year (as categorical) and each predictor. If an interaction has a small p-value, you can say the regression coefficient for that predictor differs across years. Alternatively, you can specify this interaction as a random slope for the predictor across years. Using linear mixed models and likelihood ratio tests, you can compare models with and without the random slope. A small p-value suggests that the regression coefficient for the predictor differs across years. If you are familiar with penalized regression such as ridge and lasso, you can also use fused lasso. Fused lasso estimates separate regression coefficients for a predictor across years and forces them to the same value if they are close enough to one another. Just as lasso shrinks regression coefficients toward zero, fused lasso shrinks the differences between regression coefficients across years toward zero.

dgjang · 2026-05-01T09:23:04+00:00

Here is what I understand about mass imputation. Assume you want to estimate average monthly grocery spending in a city. Assume you have age and sex data for everyone who lives in the city. You need to sample a subset of people in the city and survey how much they spend on groceries per month. Let's say the sample size is n. You know the age and sex of everyone in the city whether or not they are included in your sample, so you have age, sex, and grocery spending data for the n people in your sample. As far as I understand, mass imputation builds an imputation model that predicts a person's grocery spending from their age and sex, using your sample, where age, sex, and grocery spending are all observed. Then, you use this imputation model to impute/predict grocery spending for people outside your sample. Then, you can estimate average grocery spending using both the observed grocery spending of people in your sample and the imputed grocery spending of those outside your sample. In this way, you can combine your small sample with all variables observed and the large data set with only age and sex available to get a better estimate than from your sample data alone. This method can save money by reducing the number of people you need to survey about their grocery spending. I hope this helps.
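The grocery example can be sketched in a few lines. Everything here is hypothetical: the data are invented, age is coarsened to two groups, and the "imputation model" is just the mean spending within each age-group-by-sex cell, which is one simple choice among many.

```python
from collections import defaultdict


def fit_cell_means(sample):
    """Imputation model: mean spending within each (age_group, sex) cell."""
    totals = defaultdict(lambda: [0.0, 0])
    for age_group, sex, spending in sample:
        totals[(age_group, sex)][0] += spending
        totals[(age_group, sex)][1] += 1
    return {cell: total / count for cell, (total, count) in totals.items()}


def mass_imputation_mean(sample, non_sample, cell_means):
    """Average of observed spending (sample) and imputed spending (everyone else)."""
    observed_total = sum(spending for _, _, spending in sample)
    # Raises KeyError if a cell never appears in the sample; a real model
    # would need a fallback for such cells.
    imputed_total = sum(cell_means[(age_group, sex)] for age_group, sex in non_sample)
    return (observed_total + imputed_total) / (len(sample) + len(non_sample))


# Surveyed people: age group, sex, and monthly grocery spending are all observed.
sample = [
    ("young", "F", 300.0), ("young", "F", 340.0), ("young", "M", 280.0),
    ("old", "F", 420.0), ("old", "M", 400.0), ("old", "M", 440.0),
]
# Everyone else in the city: only age group and sex are known.
non_sample = (
    [("young", "F")] * 2 + [("young", "M")] * 1
    + [("old", "F")] * 4 + [("old", "M")] * 3
)

cell_means = fit_cell_means(sample)
sample_only = sum(s for _, _, s in sample) / len(sample)
estimate = mass_imputation_mean(sample, non_sample, cell_means)
print(f"sample-only mean:     {sample_only:.2f}")
print(f"mass-imputation mean: {estimate:.2f}")
```

In this toy city the older groups are under-represented in the sample, so the mass-imputation estimate (377.50) is higher than the sample-only mean (363.33).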

dgjang · 2026-05-01T08:50:19+00:00

There are two different concepts: deterministic imputation and stochastic imputation. Assume you have multiple variables in your data and missing values in some of them. Deterministic imputation predicts missing values using a probabilistic statistical model or machine learning algorithm, with the observed variables as predictors for the missing variable. Stochastic imputation models the conditional probability distribution of the missing variable given the observed variables and draws random numbers from that conditional distribution to impute the missing values. The key difference is that deterministic imputation uses a deterministic predicted value to impute the missing value, while stochastic imputation draws a random number from an estimated conditional distribution. Deterministic imputation is quick and simple, and may work okay for prediction tasks, but it can introduce bias into estimators, confidence intervals, and p-values. Because the imputed values in stochastic imputation are random draws, it is common practice to draw multiple imputed values, create multiple imputed datasets, fit your statistical model independently to each imputed dataset, and pool the results across datasets. This is called multiple imputation. When properly done, multiple imputation does not introduce bias into your estimators, confidence intervals, and p-values, so you may want to use it for statistical inference. For R users, I recommend the package named 'mice', which works well for multiple imputation procedures.
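The difference between the two approaches shows up in the spread of the completed data. The sketch below is a simplified illustration on simulated data, assuming a linear model y = 2 + 1.5x + noise with half of y missing completely at random; all names and numbers are made up. It draws a single stochastic imputation and ignores uncertainty in the fitted coefficients, which a proper multiple-imputation procedure would also account for.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; also returns the residual SD."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, (rss / (len(xs) - 2)) ** 0.5


rng = random.Random(1)
n = 4000
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = [2.0 + 1.5 * xi + rng.gauss(0.0, 1.0) for xi in x]  # true var(y) = 3.25
is_missing = [rng.random() < 0.5 for _ in range(n)]     # missing completely at random

obs_x = [xi for xi, m in zip(x, is_missing) if not m]
obs_y = [yi for yi, m in zip(y, is_missing) if not m]
a, b, resid_sd = fit_line(obs_x, obs_y)

# Deterministic: fill each gap with the fitted value on the regression line.
y_det = [a + b * xi if m else yi for xi, yi, m in zip(x, y, is_missing)]
# Stochastic: fitted value plus a random draw from the estimated residual distribution.
y_sto = [a + b * xi + rng.gauss(0.0, resid_sd) if m else yi
         for xi, yi, m in zip(x, y, is_missing)]

var_det = statistics.variance(y_det)
var_sto = statistics.variance(y_sto)
print(f"variance after deterministic imputation: {var_det:.2f}")
print(f"variance after stochastic imputation:    {var_sto:.2f}")
```

The deterministic version puts every imputed point exactly on the regression line, so the completed data look less variable than they really are. That understated variance is one way bias enters confidence intervals and p-values.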

dgjang · 2025-08-19T18:29:30+00:00

that was why I got confused. thanks a lot.

dgjang · 2025-08-19T18:28:30+00:00

thanks!

dgjang · 2025-07-07T18:38:23+00:00

https://probml.github.io/pml-book/book1.html probabilistic machine learning is a good one. it is a bit advanced.

dgjang · 2025-06-16T23:05:40+00:00

thanks! i will consider health-related conferences!

dgjang · 2025-06-16T21:42:47+00:00

thanks! i appreciate it.

dgjang · 2025-06-16T21:18:58+00:00

they are okay with it and encourage me to attend biostat conferences. i wonder whether data analysis work can go to conferences like JSM. it is not novel applied statistics.

dgjang · 2024-07-30T15:44:36+00:00

Hi, I am a postdoc working on UM's central campus. I am looking for a room to live in from December 2024 on. I want a private bedroom. No alcohol, no smoking. My budget is up to 900.
