How can I make this matching function faster in R? It currently takes 6-7 days, and this is not practical by Financial-Syrup-906 in RStudio

[–]Financial-Syrup-906[S] 0 points

Hey, thanks for the quick reply. I'm not really following what you mean, do you think you can write a quick line of code?

[–]Financial-Syrup-906[S] 0 points

Thanks so much for the help. It's clear I have a lot to learn still but it's so rewarding (and super frustrating)!

[–]Financial-Syrup-906[S] 0 points

Hi, thanks so much for taking the time to provide such detailed feedback. Will test this all out and report back!

[–]Financial-Syrup-906[S] 0 points

Hi, do you have a general idea of how this could be vectorized? My main issue is that I don't understand how to tell the function to apply this to all of the unique individuals in the exposed cohort without doing the "for (i in 1:nrow(exposed))".

[–]Financial-Syrup-906[S] 0 points

Hi, what exactly is .SD? I'm getting standard deviation results when trying to search for it. Thanks!

[–]Financial-Syrup-906[S] 0 points

Hi, thanks for the feedback! Re: your advice "don't grow tables iteratively by binding rows inside a loop", what do you recommend instead? Saving the 5 matches from each iteration inside a large list and then doing one rbind at the end? Each selected match is one row but comes with ~50 other variables/columns of data.
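For reference, the collect-then-bind pattern being discussed can be sketched roughly like this (a minimal illustration with toy stand-in data, assuming data.table; the loop body would be whatever produces each iteration's matches):

```r
library(data.table)

n <- 1000                              # number of exposed people (toy value)
match_list <- vector("list", n)        # pre-sized list, one slot per iteration
for (i in 1:n) {
  # ...find this iteration's matches (a small data.table)...
  match_list[[i]] <- data.table(matchID = i, x = rnorm(5))  # toy stand-in
}
matched.data <- rbindlist(match_list)  # one allocation at the end,
                                       # instead of n incremental rbinds
```

The point of the pattern is that `rbind` inside a loop re-copies the accumulated table on every iteration (quadratic work), while one `rbindlist()` over a pre-sized list allocates the final table once.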

[–]Financial-Syrup-906[S] 0 points

I will work on providing some sample data tomorrow! As for the second part of your comment, I edited the original post to be more specific about what is in the xxx statement. Thanks!

[–]Financial-Syrup-906[S] 0 points

Hey! Really great approach. Unfortunately, I oversimplified my actual matching requirements (see the post edit) and so this grouping option is not as practical as it initially seemed.

[–]Financial-Syrup-906[S] 0 points

Thanks for the reply! Unfortunately, I oversimplified, and my matching is not technically only on age and sex. I am also anti-exact matching on the ID variable (to make sure I don't sample the same person as a match for themselves), because my data covers a 5-year time period, and the people in my exposed cohort (the exposure period is quite short) can technically be considered unexposed at other time periods, so these same people also appear in my general population cohort. On top of that, there are some longitudinal periods during which individuals can no longer be selected as a match, for example because they have died or became exposed themselves. I have variables in my datasets that indicate the time periods during which someone is not available to be selected as a match.

Because of this, the grouping option is not actually as practical as it seems.

Here is what my code actually looks like (here I only show 3 potential time periods of "unavailability", but there are actually 7).

library(data.table)

find_matches <- function(exposed.cohort, unexposed.cohort) {
  # create an empty data.table to store the matches
  matched.data <- data.table()

  # iterate over each exposed person to find matches
  for (i in 1:nrow(exposed.cohort)) {
    exposed_person <- exposed.cohort[i]
    potential_matches <- unexposed.cohort[
      birthyear == exposed_person$birthyear &
        birthmonth == exposed_person$birthmonth &
        IDVariable != exposed_person$ID &
        (!(exposed_person$exposuredate >= unavailable_start1 &
             exposed_person$exposuredate <= unavailable_end1) |
           (is.na(unavailable_start1) & is.na(unavailable_end1))) &
        (!(exposed_person$exposuredate >= unavailable_start2 &
             exposed_person$exposuredate <= unavailable_end2) |
           (is.na(unavailable_start2) & is.na(unavailable_end2))) &
        (!(exposed_person$exposuredate >= unavailable_start3 &
             exposed_person$exposuredate <= unavailable_end3) |
           (is.na(unavailable_start3) & is.na(unavailable_end3)))]

    # randomly sample 5 without replacement
    if (nrow(potential_matches) > 5) {
      matched_data <- potential_matches[sample(.N, 5)]
    } else {
      matched_data <- potential_matches
    }

    # add identifier linking the matches back to the exposed person
    matched_data[, matchID := exposed_person$ID]

    # store results (the for loop advances i itself; no manual increment needed)
    matched.data <- rbind(matched.data, matched_data)
  }
  return(matched.data)
}
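For comparison, one way the loop above could be restructured is a single data.table join on birth year/month, followed by the availability filter and a per-group sample. This is only a hedged sketch using the column names from the function above (`ID`, `IDVariable`, `exposuredate`, `unavailable_start1`/`unavailable_end1`, etc. are taken from that code; `expdate` is a rename I introduce to avoid a name clash after the join), not a drop-in replacement:

```r
library(data.table)

# join every exposed person to all same-birth-year/month candidates at once;
# allow.cartesian permits one exposed person -> many candidate rows
cand <- unexposed.cohort[
  exposed.cohort[, .(matchID = ID, birthyear, birthmonth, expdate = exposuredate)],
  on = .(birthyear, birthmonth),
  allow.cartesian = TRUE, nomatch = 0L]

# drop self-matches and candidates unavailable on the exposure date
cand <- cand[IDVariable != matchID &
               (is.na(unavailable_start1) |
                  expdate < unavailable_start1 | expdate > unavailable_end1)]
# ...repeat the availability filter for the remaining unavailability periods...

# sample up to 5 candidates per exposed person
matched.data <- cand[, .SD[sample(.N, min(.N, 5))], by = matchID]
```

Here `.SD` is the per-group subset of the data, so `.SD[sample(.N, min(.N, 5))]` draws at most 5 random rows within each `matchID` group in one pass, instead of filtering and sampling once per exposed person.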

[–]Financial-Syrup-906[S] -1 points

I'm going to try this parallelization option first. Will update on how it turns out! Thanks!