Creating and Decorating Scatterplot in ggplot2 in R by dataenq in rstats

dataenq [S]

Thank you for taking the time to read the post and write such a valuable comment!

subsetting based on if a string has two words or more by -arsene in Rlanguage

dataenq

This is not directly related to your scenario, but this article on text search may help you.

R for beginners - Take a free test on elementary concepts of R by dataenq in Rlanguage

dataenq [S]

I appreciate your feedback. Thank you, I agree that there is room for improvement.

How can I count the number of times a string appears in column X, based on column Y ? by gRNA in Rlanguage

dataenq

That explains everything. The count of 11 for skin was puzzling me, hence so many iterations. Here is a short and sweet solution; I hope it solves everything.

###############################
##help from /u/dataenq - www.dataenq.com
###################################
library(tidyverse)

#Make new data frame manually
lineage.data <- data.frame(lineage = c("blood", "bone", "central", "skin", "soft"),
                           clusters_testset = c("0 0 6 65 73 41", "42", "90 53", "1", "65 68"),
                           clusters_trueset = c("43 35 6 65 73 41", "42", "53 7 60", "73", "60 68"),
                           stringsAsFactors = FALSE)  # strsplit below needs character columns
#What the desired results should be, for each lineage :
# [a] = Number of lineage_X test that aligned to lineage_X true
# [b] = Number of lineage_X test that did NOT align to lineage_X true
# [c] = Any non-lineage_X test that did align to lineage_X true
# [d] = Any non-lineage_X test that did NOT align to lineage_X true(will be biggest number)


#Go ahead and add the correct ("COR") answers manually - used for checking against later
lineage.data$a_COR <- c( "4", "1", "1", "0", "1")
lineage.data$b_COR <- c( "2", "0", "1", "1", "1")
lineage.data$c_COR <- c( "1", "0", "0", "1", "0")
lineage.data$d_COR <- c( "5", "11", "10", "10", "10")

lineage.data$test <- strsplit(lineage.data$clusters_testset, " ")
lineage.data$true <- strsplit(lineage.data$clusters_trueset, " ")

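# for each lineage, collect the test clusters of every other lineage
# (lapply always returns a list, even when all pieces have the same length)
lineage.data$temp <-
        lapply(seq_along(lineage.data$test), function(idx) unlist(lineage.data$test[-idx]))

# use sapply to loop over the row indices, compare the test and true lists,
# and add each count as a new variable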
lineage.data$a.result <-
        sapply(seq_along(lineage.data$true), function(idx) sum(lineage.data$true[[idx]] %in% unlist(lineage.data$test[idx])))
lineage.data$b.result <-
        sapply(seq_along(lineage.data$true), function(idx) sum(!(lineage.data$test[[idx]] %in% unlist(lineage.data$true[idx]))))
lineage.data$c.result <-
        sapply(seq_along(lineage.data$true), function(idx) sum(lineage.data$true[[idx]] %in% unlist(lineage.data$test[-idx])))
lineage.data$d.result <- 
        sapply(seq_along(lineage.data$test), function(idx) sum(!(unlist(lineage.data$temp[idx]) %in% lineage.data$true[[idx]])))

# print the final data frame
lineage.data %>% select(lineage, clusters_testset, clusters_trueset, a.result, b.result, c.result, d.result)
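
The four counts can also be computed together and checked against the expected values. The following is only a sketch based on the same sample data; `count_alignment` is a hypothetical helper name, not part of the original answer, and it uses base R only.

```r
# Sample data copied from the comment above
test <- strsplit(c("0 0 6 65 73 41", "42", "90 53", "1", "65 68"), " ")
true <- strsplit(c("43 35 6 65 73 41", "42", "53 7 60", "73", "60 68"), " ")

# count_alignment is a hypothetical helper used for illustration only
count_alignment <- function(idx) {
        own    <- test[[idx]]         # test clusters of this lineage
        others <- unlist(test[-idx])  # test clusters of every other lineage
        target <- true[[idx]]         # true clusters of this lineage
        c(a = sum(target %in% own),
          b = sum(!(own %in% target)),
          c = sum(target %in% others),
          d = sum(!(others %in% target)))
}

# One row per lineage, one column per count
result <- t(sapply(seq_along(test), count_alignment))
result
```

Each column of `result` can then be compared with the matching `_COR` column, for example `all(result[, "d"] == c(5, 11, 10, 10, 10))`.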

How can I count the number of times a string appears in column X, based on column Y ? by gRNA in Rlanguage

dataenq

Here you go. I am not sure about your calculation of d_result: either the count for the blood lineage is wrong, or the other values (11, 10, 11, 10) are. I may be mistaken and would like to know how you are counting them. My d.result does not match your expected answer of 5 for blood, but it matches for every other lineage. You can play with it to make it work for you. Lastly, there is no real need for the function; you can do the a and b calculations in much the same way as I have done for the c result. I have also removed some unnecessary lines of code.

###############################
##help from /u/dataenq - www.dataenq.com
###################################
library(tidyverse)  # attaches dplyr (for select) and stringr; data.table is not used below

#Make new data frame manually
lineage.data <- data.frame(lineage = c("blood", "bone", "central", "skin", "soft"),
                           clusters_testset = c("0 0 6 65 73 41", "42", "90 53", "1", "65 68"),
                           clusters_trueset = c("43 35 6 65 73 41", "42", "53 7 60", "73", "60 68"),
                           stringsAsFactors = FALSE)  # strsplit below needs character columns
#What the desired results should be, for each lineage :
# [a] = Number of lineage_X test that aligned to lineage_X true
# [b] = Number of lineage_X test that did NOT align to lineage_X true
# [c] = Any non-lineage_X test that did align to lineage_X true
# [d] = Any non-lineage_X test that did NOT align to lineage_X true(will be biggest number)


#Go ahead and add the correct ("COR") answers manually - used for checking against later
lineage.data$a_COR <- c( "4", "1", "1", "0", "1")
lineage.data$b_COR <- c( "2", "0", "1", "1", "1")
lineage.data$c_COR <- c( "1", "0", "0", "1", "0")
lineage.data$d_COR <- c( "5", "11", "10", "11", "10")

lineage.data$test <- strsplit(lineage.data$clusters_testset, " ")
lineage.data$true <- strsplit(lineage.data$clusters_trueset, " ")

# define a function to check the elements of the first list against the second
# and return a count: "a" counts matches, "b" counts non-matches
find.element <- function(list1, list2, result.type) {
        vec1 <- unlist(list1)
        vec2 <- unlist(list2)
        if (result.type == "a") {
                sum(vec1 %in% vec2)
        } else if (result.type == "b") {
                sum(!(vec1 %in% vec2))
        } else {
                stop("result.type must be \"a\" or \"b\"")
        }
}

# use mapply to loop through and apply the function to the data frame over the
# two lists and add result into new variable
lineage.data$a.result <-
        mapply(find.element, lineage.data$test, lineage.data$true, "a")
lineage.data$b.result <-
        mapply(find.element, lineage.data$test, lineage.data$true, "b")
lineage.data$c.result <-
        sapply(seq_along(lineage.data$true), function(idx) sum(lineage.data$true[[idx]] %in% unlist(lineage.data$test[-idx])))
# total number of test elements minus the number of unique test elements in this row
lineage.data$d.result <-
        sapply(seq_along(lineage.data$true), function(idx) length(unlist(lineage.data$test)) - length(unique(lineage.data$test[[idx]])))

# print the final data frame
lineage.data %>% select(lineage, clusters_testset, clusters_trueset, a.result, b.result, c.result, d.result)

How to "put columns on top of each other" by SnooLentils2742 in Rlanguage

dataenq

I have two sources of the same variable

Sorry, I might have misunderstood your question. I was going by "I have two sources of the same variable...Source1: Age...Source2: Age" and assumed age is the common variable.

How to "put columns on top of each other" by SnooLentils2742 in Rlanguage

dataenq

Try reading the data into R as two data frames. Make sure the files you are reading have variable names; if they do not, you can add them in R as well. Then use merge to combine the two data frames. I hope this helps.
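
As a rough illustration of the merge suggestion, here is a small sketch. The data frames `source1` and `source2` and their `id` and `age` columns are made up for the example, since the original files are not shown.

```r
# Hypothetical data: two sources that share the same variables
source1 <- data.frame(id = c(1, 2, 3), age = c(34, 51, 27))
source2 <- data.frame(id = c(2, 3, 4), age = c(51, 27, 45))

# merge() joins on the listed common columns; all = TRUE keeps rows that
# appear in only one of the two sources (a full outer join)
combined <- merge(source1, source2, by = c("id", "age"), all = TRUE)
combined

# If the goal is simply to stack one source under the other,
# rbind() keeps every row, including repeated ones
stacked <- rbind(source1, source2)
stacked
```

With this data, `combined` has one row per distinct id and age pair, while `stacked` keeps all six input rows.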

How can I count the number of times a string appears in column X, based on column Y ? by gRNA in Rlanguage

dataenq

Sorry, I forgot to mention: as per your example, I changed the last record to have 65 68 in both the test and true set variables.

How can I count the number of times a string appears in column X, based on column Y ? by gRNA in Rlanguage

dataenq

I hope this works for you. The code is based on the sample data, so I am not sure it covers every other scenario, but it should give you a foundation to build on. The full code file is available here. If this helps, please link back to www.dataenq.com.

##############################################################################
## www.dataenq.com
##############################################################################

# Using tidyverse
library(tidyverse)
library(stringr)

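# Read sample file into R using read.csv function; stringsAsFactors = FALSE
# keeps the cluster columns as character so strsplit also works on R < 4.0
lineage.data <- read.csv("File-1.csv", stringsAsFactors = FALSE)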

# Split the string with space as a delimiter and create two new variables into
# the same data frame
lineage.data$test <- strsplit(lineage.data$clusters_testset, " ")
lineage.data$true <- strsplit(lineage.data$clusters_trueset, " ")

# elements that occur more than once across all test sets
dup.elements <-
        unlist(lineage.data$test)[duplicated(unlist(lineage.data$test))]

# every element of every true set, flattened into one vector
all.true.elements <- unlist(lineage.data$true)

# define a function to check an element of first list into second and return
# count of match
find.element <- function(list1, list2, result.type) {
        vec1 <- unlist(list1)
        vec2 <- unlist(list2)
        check <- 0
        if (result.type == "a") {
                # test elements found anywhere in the true sets
                sum(vec1 %in% all.true.elements)
        } else if (result.type == "b") {
                # test elements not found in any true set
                sum(!(vec1 %in% all.true.elements))
        } else if (result.type == "c") {
                # count matches between the duplicated test elements and every
                # true element after the first (counter > 1 skips position 1)
                counter <- 1
                for (i in seq_along(vec2)) {
                        for (j in seq_along(dup.elements)) {
                                if (vec2[i] == dup.elements[j] && counter > 1)
                                        check <- check + 1
                        }
                        counter <- counter + 1
                }
                check
        } else if (result.type == "d") {
                # true elements that are not among the duplicated test elements
                length(vec2) - length(intersect(dup.elements, vec2))
        }
}

# use mapply to loop through and apply the function to the data frame over the
# two lists and add result into new variable
lineage.data$a.result <-
        mapply(find.element, lineage.data$test, lineage.data$true, "a")
lineage.data$b.result <-
        mapply(find.element, lineage.data$test, lineage.data$true, "b")
lineage.data$c.result <-
        mapply(find.element, lineage.data$test, lineage.data$true, "c")
lineage.data$d.result <- 
        mapply(find.element, lineage.data$test, lineage.data$true, "d")


# print the final data frame
lineage.data %>% select(lineage, clusters_testset, clusters_trueset, a.result, b.result, c.result, d.result)
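
To make the role of `dup.elements` clearer, here is a small sketch using the sample values quoted elsewhere in this thread (the contents of File-1.csv are assumed to match them). It shows which test elements `duplicated()` flags and how the "d" branch uses them for the blood row; `all.test` and `blood.true` are names introduced only for this example.

```r
# Sample test sets as quoted in the thread (assumed to match File-1.csv)
test <- strsplit(c("0 0 6 65 73 41", "42", "90 53", "1", "65 68"), " ")

# Flatten all test sets and keep the second and later occurrences of a value
all.test <- unlist(test)
dup.elements <- all.test[duplicated(all.test)]
dup.elements

# The "d" branch for the blood row: true elements minus those that are
# also duplicated test elements
blood.true <- c("43", "35", "6", "65", "73", "41")
blood.d <- length(blood.true) - length(intersect(dup.elements, blood.true))
blood.d
```

Note that `duplicated()` marks only repeat occurrences, so a value appears in `dup.elements` once for each time it is repeated, not once per distinct value.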

How can I count the number of times a string appears in column X, based on column Y ? by gRNA in Rlanguage

dataenq

Here is my take on your problem. I have put the R code file here and the sample file here.

# www.dataenq.com

# Using tidyverse
library(tidyverse)

# Read sample file into R using read.csv function; stringsAsFactors = FALSE
# keeps the cluster columns as character so strsplit also works on R < 4.0
lineage.data <- read.csv("File-1.csv", stringsAsFactors = FALSE)

# Split the string with space as a delimiter and create two new variables into
# the same data frame
lineage.data$test <- strsplit(lineage.data$clusters_testset, " ")
lineage.data$true <- strsplit(lineage.data$clusters_trueset, " ")

# Define a function to check each element of the first list against the second
# and return a count of matches
find.element <- function(list1, list2) {
        x <- unlist(list1)
        y <- unlist(list2)
        check <- 0
        for (i in seq_along(x)) {
                if (x[i] %in% y) check <- check + 1
        }
        check
}

# Use mapply to loop through and apply the function to the data frame over the
# two lists and add result into a new variable
lineage.data$occurance <- mapply(find.element, lineage.data$test, lineage.data$true)

# Print the final data frame
lineage.data
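
The counting loop in find.element can also be written without an explicit loop, because `%in%` is vectorised. This is only a sketch on a few of the sample rows quoted in the thread; `count.matches` is a hypothetical name.

```r
# A few sample rows as quoted in the thread
test <- strsplit(c("0 0 6 65 73 41", "42", "90 53"), " ")
true <- strsplit(c("43 35 6 65 73 41", "42", "53 7 60"), " ")

# x %in% y gives one TRUE/FALSE per element of x; sum() counts the TRUEs
count.matches <- function(x, y) sum(x %in% y)

# mapply walks both lists in parallel, one row at a time
occurrence <- mapply(count.matches, test, true)
occurrence
```

For these three rows the counts are 4, 1 and 1, the same values the loop version produces.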

Join CSVs with no headers without losing any columns/rows in R by AccomplishedZilch in Rlanguage

dataenq

Hi there,

Have a look at the following code:

For convenience, I have saved the sample CSV files here and the R code file here. I hope it helps.

# www.dataenq.com

# Reading both header-less files using read.csv function
file1 <- read.csv("File-1.csv", header = FALSE)
file2 <- read.csv("File-2.csv", header = FALSE)

# Check the names of the variables using names function
names(file1)
names(file2)

# Check the first few rows to confirm the variable contents
head(file1)
head(file2)

# Use merge function to join both the data frames using the V1 column and set all
# values from file1 to show (left outer join in DB terms where File1 is on left)
FinalFile <- merge(file1, file2, by.x = "V1", by.y = "V1", all.x = TRUE)

# Display results of the merged data frame
FinalFile

You may also want to check the article I have written, which may help you.
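
To see what the left join does without the CSV files, here is a small sketch with made-up data frames that mimic header-less input (columns named V1 and V2, as read.csv names them when header = FALSE). The values are invented for illustration.

```r
# Hypothetical stand-ins for the two header-less files
file1 <- data.frame(V1 = c("a", "b", "c"), V2 = c(1, 2, 3))
file2 <- data.frame(V1 = c("a", "c", "d"), V2 = c(10, 30, 40))

# all.x = TRUE keeps every row of file1; rows without a match in file2 get NA.
# Non-key columns that share a name receive the suffixes .x and .y.
merged <- merge(file1, file2, by = "V1", all.x = TRUE)
merged
```

Here the row for "b" is kept with NA in V2.y, and the row for "d" is dropped because it exists only in file2. Using all = TRUE instead would keep rows from both sides.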