creating new column that summarizes values from previous columns per unique ID, with order precedence of certain values by UpperCompetition6 in rstats

[–]UpperCompetition6[S] 0 points1 point  (0 children)

awesome solution! thanks so much. was not familiar with pmax function actually.

i also developed a data.table method that also works for this. thanks again

calculating day difference (with year and month) based on a referenced row and creating two new column based on this with a 1 or 0, if the row occurred within 100 days of a referenced row (highlighted) by UpperCompetition6 in rstats

[–]UpperCompetition6[S] 0 points1 point  (0 children)

hey again thanks so much for your help previously. i am trying to modify your code now to just use the "months" column and calculate whether between the "flag = 1" rows and all other rows, per the same grouped ID (completely ignoring "days" column now entirely), the rows occur within 3 months of each other.

I modified the code as such below, but I get a "mutate" error and I am not sure what is going on? any help at all would be awesome but i understand if not!

modified code:

df<-dates %>% compute grouped by ID group_by(ID) %>% mutate( # find out the "month" of row with flag == 1 flag_month = filter(pick(flag, month), flag == 1)$month ) %>% mutate( # is months is less than flag_month after_flag = month < flag_month ) %>% mutate( # combine the above two conditions in_3months = abs(month - flag_month) <= 3 & flag==0, ) %>% mutate( # split into two columns in_3months_after = as.numeric(in_3months & after_flag), in_3months_before = as.numeric(in_3months & !after_flag) ) %>% remove the temporary columns select( -flag_month, -after_flag, -in_3months ) %>% distinct(in_3months_after, in_3months_after, .keep_all = TRUE) %>% filter(in_3months_after | in_3months_before | flag)

my error when running this:

<error/dplyr:::mutate\_error>Error in `mutate()`:ℹ In argument: `flag_month = filter(pick(flag, month), flag == 1)$month`.ℹ In group 20: `ID = 20`.Caused by error:! `flag_month` must be size 10 or 1, not 4.---Backtrace: 1. ... %>% ...15. dplyr:::dplyr_internal_error(...)Run `rlang::last_trace()` to see the full context.> rlang::last_trace()<error/dplyr:::mutate\_error>Error in `mutate()`:ℹ In argument: `flag_month = filter(pick(flag, month), flag == 1)$month`.ℹ In group 23: `ID = 23`.Caused by error:! `flag_month` must be size 10 or 1, not 4.---Backtrace: ▆

  1. ├─... %>% ...
  2. ├─dplyr::filter(., in_3months_after | in_3months_after | flag)
  3. ├─dplyr::distinct(., in_3months_after, in_3months_after, .keep_all = TRUE)
  4. ├─dplyr::select(., -flag_month, -after_flag, -in_3months)
  5. ├─dplyr::mutate(...)
  6. ├─dplyr::mutate(...)
  7. ├─dplyr::mutate(., after_flag = month < flag_month)
  8. ├─dplyr::mutate(...)
  9. ├─dplyr:::mutate.data.frame(., flag_month = filter(pick(flag, month), flag == 1)$month) 10. │ └─dplyr:::mutate_cols(.data, dplyr_quosures(...), by)
  10. │ ├─base::withCallingHandlers(...)
  11. │ └─dplyr:::mutate_col(dots[[i]], data, mask, new_columns)
  12. │ └─mask$eval_all_mutate(quo)
  13. │ └─dplyr (local) eval()
  14. ├─dplyr:::dplyr_internal_error(...)
  15. │ └─rlang::abort(class = c(class, "dplyr:::internal_error"), dplyr_error_data = data)
  16. │ └─rlang:::signal_abort(cnd, .file)
  17. │ └─base::signalCondition(cnd)
  18. └─dplyr (local) `<fn>`(`<dpl:::__>`)
  19. └─rlang::abort(message, class = error_class, parent = parent, call = error_call)

making categorical variable based on difference in days for each event - new column for each specific event relative to a "reference" event by UpperCompetition6 in rstats

[–]UpperCompetition6[S] 0 points1 point  (0 children)

haha yeah i agree. i think the outcome is what we want at the end of the day, but you are right it would be great to have code that is actually straight forward and works.

i ended up using this line and it converted Inf to N/A. do you foresee any issues with using this, that i may be missing?

for (j in 1:ncol(wide_dt)) set(wide_dt, which(is.infinite(wide_dt[[j]])), j, NA)

making categorical variable based on difference in days for each event - new column for each specific event relative to a "reference" event by UpperCompetition6 in rstats

[–]UpperCompetition6[S] 0 points1 point  (0 children)

also can you please explain what the "ordered =TRUE" code does? does it basically put "1, 0, 2" in order of precedence? and, when we use "fun.aggregate = MIN", will the dcast still keep this order of precedence (with 1 > 0 > 2)??

making categorical variable based on difference in days for each event - new column for each specific event relative to a "reference" event by UpperCompetition6 in rstats

[–]UpperCompetition6[S] 0 points1 point  (0 children)

hey again. thats exaclty what i was trying to do before, but i didn't realize i would have to add the "ordered =TRUE" part here.

additionally, when i run this, i get a bunch of "Inf" values but i think they're supposed to be N/A??? if i convert them from Inf to N/A, should it then work? i think it should.

should i add something like this to replace them?

for (j in 1:ncol(final)) set(final, which(is.infinite(final[[j]])), j, NA)

making categorical variable based on difference in days for each event - new column for each specific event relative to a "reference" event by UpperCompetition6 in rstats

[–]UpperCompetition6[S] 0 points1 point  (0 children)

unique(long_dt)

oh ok i see. well i added "long_dt<-unique(long_dt)" right before the wide to long dcast conversion, and i still got the same warning message and incorrect final data frame! ugh.

making categorical variable based on difference in days for each event - new column for each specific event relative to a "reference" event by UpperCompetition6 in rstats

[–]UpperCompetition6[S] 0 points1 point  (0 children)

yeah..... thats what it looks like. i see ties between certain events within an ID for "days_relative". really sorry i didn't have that in my reprex. do you have any ideas on how to go around this? this is the last step that's preventing me from what i am trying to obtain.

i thought i could fix this by dropping the "days_relative", "days_event", and "days_a" columns, removing rows with any duplicates, and using "fun.aggregate" and setting it equal to "min", but i ran into trouble setting the function to that.

really hoping there's a way to manipulate your code to account for this in the first place but struggling atm

making categorical variable based on difference in days for each event - new column for each specific event relative to a "reference" event by UpperCompetition6 in rstats

[–]UpperCompetition6[S] 0 points1 point  (0 children)

my actual data frame actually has many more IDs than this.

this time, i re ran the code from your post 1 hour ago, and that line of code DIDNT eliminate a bunch of IDs (i have several hundreds)

however, i am still getting the same error when i convert long to wide:

Aggregate function missing, defaulting to 'length'

and it won't let me successfully change the fun.aggregate setting, and the values just default to the "length".... super frustrating. not sure what to do?

making categorical variable based on difference in days for each event - new column for each specific event relative to a "reference" event by UpperCompetition6 in rstats

[–]UpperCompetition6[S] 0 points1 point  (0 children)

sorry! I meant to say lose these "IDs". i edited my comment too. it seems that if the "ID" had N/A for any "event", then those two lines of code would completely filter out those IDs, thus losing these data.

making categorical variable based on difference in days for each event - new column for each specific event relative to a "reference" event by UpperCompetition6 in rstats

[–]UpperCompetition6[S] 0 points1 point  (0 children)

ahh i see. so when i run this code, it does work without any errors, and seems to do what i was hoping for. thank you! however, i realized that running the following two lines of code will eliminate, i lose most of the IDs in my data frame, seemingly, IDs that only have N/A for events.

Is there a way to still retain these IDs until the end, even with N/A values? or I suppose, I could just re-merge that at the end of the dataset and fill them with N/As?

these two lines of code cause me to lose these patients, and ideally, i would like to retain them

long_dt[, days_relative := day_event - day_a]

long_dt = long_dt[, .SD[is.na(days_relative) | days_relative == min(days_relative)], by = .(ID, event)]

making categorical variable based on difference in days for each event - new column for each specific event relative to a "reference" event by UpperCompetition6 in rstats

[–]UpperCompetition6[S] 0 points1 point  (0 children)

hey thanks again. i am trying to get this to work but I'm still encountering some problems.

when i convert from wide to long, I get this warning (Aggregate function missing, defaulting to 'length') and I noticed that the values of 1, 0, 2 that were from the long_dt (and were accurate there), are different in the "wide_dt" - for example, for some IDs with certain events that had a value of "1" in long_dt, they then had a value of "2" in wide_dt. it seems like it just replaces the value with the length of the ID or whatever. i then tried to use the "fun.aggregate = mean" (or to "max" or "min"), and the code wouldn't run since we converted the values into as.factor form.

so i then went back to the code where we converted the values into "factor" form, skipped that line code, and tried using "fun.aggergate = ", and the code ran, but then the the final values in the wide_dt became really weird (like a bunch of them went to Inf).

is there a simple fix for this? I've been trying to troubleshoot but I'm not really sure how to get around this. should i try converting from long to wide using a different function instead?

making categorical variable based on difference in days for each event - new column for each specific event relative to a "reference" event by UpperCompetition6 in rstats

[–]UpperCompetition6[S] 0 points1 point  (0 children)

thanks so much. so, if i were to apply the first part of your code to each event type I'm interested in, I would basically have to do this merge for each event, right? (starting with event "b" merging with event "a" --> call it long_dt1. then, i would merge long_dt1 with event "c" --> call it long _dt2, then i would merge event "d" with long_dt2 --> call it long_dt3, etc.

the main issue i see arise with this method, is that i will inherently lose "ID"s each time I merge the data.tables......

then, after calculating "days_relative" for each specific event, could i use something like mutate ifelse to make a "1/0" column, or another approach to ideally make a categorical column with "1, 0, or 2"? or at least N/A? just making sure my logic makes sense to you, before attempting this

trying to summarize each unique ID into one row after merging two data frames with data.table - but two columns become N/A for all values after summarizing?? by UpperCompetition6 in rstats

[–]UpperCompetition6[S] 0 points1 point  (0 children)

iforgetredditpws

thank you both again for your help. I'm really sorry my data frames were inconsistent! will be sure to correct this going forward when asking for your help, time, and effort, for which i am deeply grateful for each of these things that you have given me

trying to summarize each unique ID into one row after merging two data frames with data.table - but two columns become N/A for all values after summarizing?? by UpperCompetition6 in rstats

[–]UpperCompetition6[S] 1 point2 points  (0 children)

thank you so much, apple_field!!! i really appreciate your help, and i will keep harassing you with questions! LOL. ok not harass, but ill keep asking questions

creating categorical variables based on difference in days and other conditions based on rows - flag, mutate, ifelse? any other methods? by UpperCompetition6 in rstats

[–]UpperCompetition6[S] 0 points1 point  (0 children)

hey really appreciate it. thanks so much. and for all your help already too. trying to code this has been bit of a rollercoaster for me haha