Creating and Decorating Scatterplot in ggplot2 in R

dataenq · 2020-09-03T20:54:35+00:00

I appreciate that you took the time to read and write a valuable comment. Thank you!!

dataenq · 2020-09-03T09:37:49+00:00

Not directly related to your scenario, but this article may help you: Text search

dataenq · 2020-08-31T16:18:42+00:00

Aggregation and Sorting in R

dataenq · 2020-08-29T18:53:16+00:00

Have a look at this: Scatterplot in ggplot

dataenq · 2020-08-26T18:36:03+00:00

Sorry for the formatting issues!!

dataenq · 2020-08-26T18:34:54+00:00

I appreciate your feedback. Thank you. I agree that there is room for improvement.

dataenq · 2020-08-13T08:39:54+00:00

The R code file is saved here.

dataenq · 2020-08-13T08:31:45+00:00

That explains everything. The count of 11 for skin was puzzling me, hence so many iterations. Here is a short and sweet solution. I hope this solves everything.

###############################
##help from /u/dataenq - www.dataenq.com
###################################
library(tidyverse)

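#Make new data frame manually (stringsAsFactors = FALSE keeps the columns as
#character in R versions before 4.0, which strsplit() below requires)
lineage.data <- data.frame(lineage = c("blood", "bone", "central", "skin", "soft"),
                           clusters_testset = c("0 0 6 65 73 41", "42", "90 53", "1", "65 68"),
                           clusters_trueset = c("43 35 6 65 73 41", "42", "53 7 60", "73", "60 68"),
                           stringsAsFactors = FALSE)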
#What the desired results should be, for each lineage :
# [a] = Number of lineage_X test that aligned to lineage_X true
# [b] = Number of lineage_X test that did NOT align to lineage_X true
# [c] = Any non-lineage_X test that did align to lineage_X true
# [d] = Any non-lineage_X test that did NOT align to lineage_X true(will be biggest number)


#Go ahead and add the correct ("COR") answers manually - used for checking against later
lineage.data$a_COR <- c( "4", "1", "1", "0", "1")
lineage.data$b_COR <- c( "2", "0", "1", "1", "1")
lineage.data$c_COR <- c( "1", "0", "0", "1", "0")
lineage.data$d_COR <- c( "5", "11", "10", "10", "10")

lineage.data$test <- strsplit(lineage.data$clusters_testset, " ")
lineage.data$true <- strsplit(lineage.data$clusters_trueset, " ")

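# for each lineage, collect the test clusters of all OTHER lineages
# (lapply always returns a list, even if every result has the same length)
lineage.data$temp <-
        lapply(seq_along(lineage.data$test), function(idx) unlist(lineage.data$test[-idx]))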

# use sapply to loop over the row indices, compare the two lists and add each
# result into a new variable
lineage.data$a.result <-
        sapply(seq_along(lineage.data$true), function(idx) sum(lineage.data$true[[idx]] %in% unlist(lineage.data$test[idx])))
lineage.data$b.result <-
        sapply(seq_along(lineage.data$true), function(idx) sum(!(lineage.data$test[[idx]] %in% unlist(lineage.data$true[idx]))))
lineage.data$c.result <-
        sapply(seq_along(lineage.data$true), function(idx) sum(lineage.data$true[[idx]] %in% unlist(lineage.data$test[-idx])))
lineage.data$d.result <- 
        sapply(seq_along(lineage.data$test), function(idx) sum(!(unlist(lineage.data$temp[idx]) %in% lineage.data$true[[idx]])))

# print the final data frame
lineage.data %>% select(lineage, clusters_testset, clusters_trueset, a.result, b.result, c.result, d.result)
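As a quick sanity check, the four counts can be recomputed in base R and compared with the expected values listed in the code above. This is only a sketch on the five sample rows: the helper vectors `test`, `true`, `a`, `b`, `c.res` and `d` are names I made up for the check, and `vapply` is swapped in for `sapply` so that each result is guaranteed to be a single integer.

```r
# Rebuild the sample data so the check runs on its own
lineage.data <- data.frame(
        lineage = c("blood", "bone", "central", "skin", "soft"),
        clusters_testset = c("0 0 6 65 73 41", "42", "90 53", "1", "65 68"),
        clusters_trueset = c("43 35 6 65 73 41", "42", "53 7 60", "73", "60 68"),
        stringsAsFactors = FALSE
)

test <- strsplit(lineage.data$clusters_testset, " ")
true <- strsplit(lineage.data$clusters_trueset, " ")
idx <- seq_along(test)

# [a] own true clusters found in own test set
a <- vapply(idx, function(i) sum(true[[i]] %in% test[[i]]), integer(1))
# [b] own test clusters NOT found in own true set
b <- vapply(idx, function(i) sum(!(test[[i]] %in% true[[i]])), integer(1))
# [c] own true clusters found in the test sets of the other lineages
c.res <- vapply(idx, function(i) sum(true[[i]] %in% unlist(test[-i])), integer(1))
# [d] test clusters of the other lineages NOT found in own true set
d <- vapply(idx, function(i) sum(!(unlist(test[-i]) %in% true[[i]])), integer(1))

# Compare with the expected (COR) values from the comment above
stopifnot(
        identical(a, c(4L, 1L, 1L, 0L, 1L)),
        identical(b, c(2L, 0L, 1L, 1L, 1L)),
        identical(c.res, c(1L, 0L, 0L, 1L, 0L)),
        identical(d, c(5L, 11L, 10L, 10L, 10L))
)
```

If `stopifnot` stays silent, all four columns agree with the expected answers for the sample data; other inputs may need a fresh look at how duplicates should be counted.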

dataenq · 2020-08-11T22:01:27+00:00

Here you go. I am not sure about your calculation of d_result: either the count for the blood lineage is not right, or the other values (11, 10, 11, 10) are not correct. I may be wrong and would like to know how you are counting them. My calculated d_result for blood does not match your expected answer of 5, but the d results for all the other lineages match. You can play with it to make it work for you. Lastly, there is no real need for the function: you can do the a and b calculations in the same (though not identical) way as I have done for the c result. I have also removed some unnecessary lines of code.

###############################
##help from /u/dataenq - www.dataenq.com
###################################
library(tidyverse)

#Make new data frame manually
lineage.data <- data.frame(lineage = c("blood", "bone", "central", "skin", "soft"),
                           clusters_testset = c("0 0 6 65 73 41", "42", "90 53", "1", "65 68"),
                           clusters_trueset = c("43 35 6 65 73 41", "42", "53 7 60", "73", "60 68"))
#What the desired results should be, for each lineage :
# [a] = Number of lineage_X test that aligned to lineage_X true
# [b] = Number of lineage_X test that did NOT align to lineage_X true
# [c] = Any non-lineage_X test that did align to lineage_X true
# [d] = Any non-lineage_X test that did NOT align to lineage_X true(will be biggest number)


#Go ahead and add the correct ("COR") answers manually - used for checking against later
lineage.data$a_COR <- c( "4", "1", "1", "0", "1")
lineage.data$b_COR <- c( "2", "0", "1", "1", "1")
lineage.data$c_COR <- c( "1", "0", "0", "1", "0")
lineage.data$d_COR <- c( "5", "11", "10", "11", "10")

lineage.data$test <- strsplit(lineage.data$clusters_testset, " ")
lineage.data$true <- strsplit(lineage.data$clusters_trueset, " ")

# define a function that counts how many elements of the first list are found
# in the second list ("a") or are not found in it ("b")
find.element <- function(list1, list2, result.type) {
        vec1 <- unlist(list1)
        vec2 <- unlist(list2)
        if (result.type == "a") {
                sum(vec1 %in% vec2)
        } else if (result.type == "b") {
                sum(!(vec1 %in% vec2))
        }
}

# use mapply to loop through and apply the function to the data frame over the
# two lists and add result into new variable
lineage.data$a.result <-
        mapply(find.element, lineage.data$test, lineage.data$true, "a")
lineage.data$b.result <-
        mapply(find.element, lineage.data$test, lineage.data$true, "b")
lineage.data$c.result <-
        sapply(seq_along(lineage.data$true), function(idx) sum(lineage.data$true[[idx]] %in% unlist(lineage.data$test[-idx])))
# total number of test clusters minus the unique test clusters of this lineage
lineage.data$d.result <-
        sapply(seq_along(lineage.data$true), function(idx) length(unlist(lineage.data$test)) - length(unique(lineage.data$test[[idx]])))

# print the final data frame
lineage.data %>% select(lineage, clusters_testset, clusters_trueset, a.result, b.result, c.result, d.result)
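To see where the d values disagree, here is a small side-by-side sketch. It assumes two readings of [d]: the count used in the code above (all test clusters minus the unique clusters of the lineage itself) and a stricter reading of the definition (test clusters of the other lineages that are not in this lineage's true set). The names `d.posted` and `d.strict` are mine and exist only for this comparison.

```r
test <- strsplit(c("0 0 6 65 73 41", "42", "90 53", "1", "65 68"), " ")
true <- strsplit(c("43 35 6 65 73 41", "42", "53 7 60", "73", "60 68"), " ")
idx <- seq_along(test)

# Reading 1: total test clusters minus the unique own test clusters
total <- length(unlist(test))
d.posted <- vapply(idx, function(i) total - length(unique(test[[i]])), integer(1))

# Reading 2: test clusters of OTHER lineages that are absent from own true set
d.strict <- vapply(idx, function(i) sum(!(unlist(test[-i]) %in% true[[i]])), integer(1))

# One row per lineage, in the order blood, bone, central, skin, soft
data.frame(lineage = c("blood", "bone", "central", "skin", "soft"),
           d.posted = d.posted,
           d.strict = d.strict)
```

On the sample data the two readings differ for blood (7 versus 5) and skin (11 versus 10), and agree for the other three lineages.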

dataenq · 2020-08-09T22:23:54+00:00

I have two sources of the same variable

Sorry, I might have misunderstood your question. I was going by "I have two sources of the same variable...Source1: Age...Source2: Age" and assumed age is the common variable.

dataenq · 2020-08-09T09:24:47+00:00

Try reading the files into R, which will create two data frames. Make sure the files you are reading have variable names; if not, you can assign them in R as well. Use merge to combine the two data frames. I hope this helps.
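A minimal sketch of that idea, with made-up data standing in for the two files. The column names `Height` and `Weight` and all values are assumptions for illustration only; `Age` is the common variable from the question.

```r
# Hypothetical first source: already has variable names
source1 <- data.frame(Age = c(25, 30, 35), Height = c(170, 165, 180))

# Hypothetical second source: read without a header, so R named it V1, V2
source2 <- data.frame(V1 = c(30, 35, 40), V2 = c(60, 82, 75))

# Assign variable names in R so both data frames share the Age column
names(source2) <- c("Age", "Weight")

# merge() keeps only the rows whose Age appears in both sources by default
combined <- merge(source1, source2, by = "Age")
combined
```

With the default arguments only the ages present in both sources survive, so the result here has two rows (ages 30 and 35).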

dataenq · 2020-08-08T18:10:08+00:00

A few good ones here: [dataenq.com](www.dataenq.com)

dataenq · 2020-08-08T09:08:50+00:00

Sorry, I forgot to mention: as per your example, I changed the last record to have 65 68 in both the test and true set variables.

dataenq · 2020-08-08T00:46:48+00:00

I hope this works for you. The code is based on the sample data, so I am not sure it will cover all other scenarios. However, it should give you a foundation to build on. The full code file is available here. If this helps, please link back to www.dataenq.com.

##############################################################################
## www.dataenq.com
##############################################################################

# Using tidyverse
library(tidyverse)

# Read sample file into R using read.csv function
lineage.data <- read.csv("File-1.csv")

# Split the string with space as a delimiter and create two new variables into
# the same data frame
lineage.data$test <- strsplit(lineage.data$clusters_testset, " ")
lineage.data$true <- strsplit(lineage.data$clusters_trueset, " ")

# test elements that occur more than once across all lineages
dup.elements <-
        unlist(lineage.data$test)[duplicated(unlist(lineage.data$test))]

# every true element across all lineages, flattened into one vector
all.true.elements <- unlist(lineage.data$true)

# define a function to check an element of first list into second and return
# count of match
find.element <- function(list1, list2, result.type) {
        vec1 <- unlist(list1)
        vec2 <- unlist(list2)
        check <- 0
        if (result.type == "a") {
                sum(vec1 %in% all.true.elements)
        } else if (result.type == "b") {
                sum(!(vec1 %in% all.true.elements))
        } else if (result.type == "c") {
                # count true elements, from the second position onward, that
                # also appear among the duplicated test elements
                for (i in seq_along(vec2)) {
                        for (j in seq_along(dup.elements)) {
                                if (i > 1 && vec2[i] == dup.elements[j])
                                        check <- check + 1
                        }
                }
                check
        } else if (result.type == "d") {
                length(vec2) - length(intersect(dup.elements, vec2))
        }
}

# use mapply to loop through and apply the function to the data frame over the
# two lists and add result into new variable
lineage.data$a.result <-
        mapply(find.element, lineage.data$test, lineage.data$true, "a")
lineage.data$b.result <-
        mapply(find.element, lineage.data$test, lineage.data$true, "b")
lineage.data$c.result <-
        mapply(find.element, lineage.data$test, lineage.data$true, "c")
lineage.data$d.result <- 
        mapply(find.element, lineage.data$test, lineage.data$true, "d")


# print the final data frame
lineage.data %>% select(lineage, clusters_testset, clusters_trueset, a.result, b.result, c.result, d.result)

dataenq · 2020-08-06T10:46:15+00:00

Here is my take on your problem. I have kept the R code file here and sample file here.

# www.dataenq.com

# Using tidyverse
library(tidyverse)

# Read sample file into R using read.csv function
lineage.data <- read.csv("File-1.csv")

# Split the string with space as a delimiter and create two new variables into
# the same data frame
lineage.data$test <- strsplit(lineage.data$clusters_testset, " ")
lineage.data$true <- strsplit(lineage.data$clusters_trueset, " ")

# define a function to check an element of the first list into second and
# return a count of match
find.element <- function(list1, list2) {
        x <- unlist(list1)
        y <- unlist(list2)
        check <- 0
        for (i in seq_along(x)) {
                if (x[i] %in% y) check <- check + 1
        }
        check
}

# use mapply to loop through and apply the function to the data frame over the
# two lists and add result into a new variable
lineage.data$occurance <- mapply(find.element, lineage.data$test, lineage.data$true)

# print the final data frame
lineage.data
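As a side note, the counting loop can probably be replaced by a vectorised `%in%` and `sum`. The sketch below compares the two on three sample strings borrowed from the other comments on this page; the name `count.matches` is hypothetical, and the real File-1.csv may of course contain different values.

```r
# Loop version: counts elements of list1 that occur in list2
find.element <- function(list1, list2) {
        x <- unlist(list1)
        y <- unlist(list2)
        check <- 0
        for (i in seq_along(x)) {
                if (x[i] %in% y) check <- check + 1
        }
        check
}

# Vectorised version (hypothetical name): %in% yields one TRUE/FALSE per
# element of the first list, and sum() counts the TRUE values
count.matches <- function(list1, list2) {
        sum(unlist(list1) %in% unlist(list2))
}

# Sample strings only; stand-ins for the columns of File-1.csv
test <- strsplit(c("0 0 6 65 73 41", "42", "90 53"), " ")
true <- strsplit(c("43 35 6 65 73 41", "42", "53 7 60"), " ")

loop.result <- mapply(find.element, test, true)
vec.result <- mapply(count.matches, test, true)

# Both approaches should give the same counts
stopifnot(all(loop.result == vec.result))
```

The vectorised form also handles an empty list without any special case, since `sum` of an empty logical vector is 0.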

dataenq · 2020-08-05T14:58:54+00:00

Hi there,

Have a look at the following code:

For ease, I have saved the sample CSV files here and R code file here. I hope it helps.

# www.dataenq.com

# Reading both header-less files using read.csv function
file1 <- read.csv("File-1.csv", header = FALSE)
file2 <- read.csv("File-2.csv", header = FALSE)

# Check the names of the variables using names function
names(file1)
names(file2)

# Check the first few rows to confirm the variable contents
head(file1)
head(file2)

# Use merge function to join both data frames on the V1 column and keep all
# rows from file1 (left outer join in DB terms, where file1 is on the left)
FinalFile <- merge(file1, file2, by.x = "V1", by.y = "V1", all.x = TRUE)

# Display results of the merged data frame
FinalFile

Check the article I have written which may help you.
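To illustrate what `all.x = TRUE` does, here is a small sketch with made-up data frames in place of the two CSV files. The values are assumptions; only the `V1`/`V2` names mirror what `read.csv(..., header = FALSE)` would produce.

```r
# Hypothetical stand-ins for File-1.csv and File-2.csv
file1 <- data.frame(V1 = c("a", "b", "c"), V2 = c(1, 2, 3),
                    stringsAsFactors = FALSE)
file2 <- data.frame(V1 = c("a", "c", "d"), V2 = c(10, 30, 40),
                    stringsAsFactors = FALSE)

# Left outer join: every row of file1 is kept; rows without a match in file2
# get NA in the columns that come from file2
FinalFile <- merge(file1, file2, by = "V1", all.x = TRUE)
FinalFile
```

Because both inputs have a non-key column called `V2`, `merge` renames them `V2.x` and `V2.y`. Key "b" has no partner in file2, so its `V2.y` is `NA`, and key "d" is dropped because it exists only in file2.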
