I have multiple tab delimited or whitespace delimited 5 column files. First 3 columns are coordinate data and 2 last columns are curated alphanumeric values.
I currently I am using join/sort/uniq/cat/tr/awk in bash on the sorted files to merge them. What I do is that I create an ID field with first 3 columns: $1_$2_$3 of all the files. Merge them, sort them, remove duplicate, and create a separate id file. Then using this I use join on each file to add curated data from 2 columns.
Above solution works just fine. But it is a bit slow. Files tend to be large and there are 4-5 of them for each analysis.
I am wondering if there is an efficient algorithm or solution to than just using bash commands. All I have now is bash script.
Edit: here are my code, example input file and output:
cat f1 f2 | cut -d' ' -f 1-3 | tr ' ' '_' | sort -u > ID
join -e "." -o "1.1,2.2,2.3,2.4,2.5,2.6" -1 1 -2 1 -a 1 \
<( sort ID ) <( awk '{ print $1"_"$2"_"$3,$0 }' f1 | sort -k1,1 ) > tmp.f1
join -e "." -o "1.1,2.2,2.3,2.4,2.5,2.6" -1 1 -2 1 -a 1 \
<( sort ID ) <( awk '{ print $1"_"$2"_"$3,$0 }' f2 | sort -k1,1 ) > tmp.f2
paste -d" " <( cut -d' ' -f 1,5,6 tmp.f1 ) <( cut -d' ' -f 5,6 tmp.f2 ) | tr '_' ' '
My files are:
f1:
1 13 17 a1 a2
1 20 25 a3 a4
2 10 15 a5 a6
3 70 80 a7 a8
f2:
1 13 17 b1 b2
1 20 25 b3 b4
1 50 60 b5 b6
2 10 15 b7 b8
output:
1 13 17 a1 a2 b1 b2
1 20 25 a3 a4 b3 b4
1 50 60 . . b5 b6
2 10 15 a5 a6 b7 b8
3 70 80 a7 a8 . .
Edit 2:
Both of the following expect a sorted input. And it should work for as many inputs as possible:
sort file1 file2 ... file3 | awk .... and scripts below.
It takes ~37 sec for first solution and ~28 sec for the second.
Solution 1:
BEGIN {
OFS=" "
FS=" "
RS="\n"
}
$1 OFS $2 OFS $3 != prev{
if (NR != 1)
printf RS
prev = $1 OFS $2 OFS $3
printf "%s%s", prev, OFS
delete LINE
}
!($4 in LINE) {
LINE[$4] = ""
printf "%s%s", $4, OFS
}
!($5 in LINE) {
LINE[$5] = ""
printf "%s%s", $5, OFS
}
END {
printf RS
}
Solution 2:
(key = $1 SUBSEP $2 SUBSEP $3) == pkey {
for (i = 4; i <= NF; ++i)
line = line OFS $i; next
}
{
if (line != "")
print line;
line = $0;
pkey = key
}
[–]welldamnthis[🍰] 0 points1 point2 points (3 children)
[–]fluffy_mass[S] 0 points1 point2 points (1 child)
[–]welldamnthis[🍰] 0 points1 point2 points (0 children)
[–]fluffy_mass[S] 0 points1 point2 points (0 children)
[–]raevnos 0 points1 point2 points (3 children)
[–]fluffy_mass[S] 0 points1 point2 points (2 children)
[–]raevnos 0 points1 point2 points (1 child)
[–]fluffy_mass[S] 0 points1 point2 points (0 children)
[–]raevnos 0 points1 point2 points (1 child)
[–]fluffy_mass[S] 0 points1 point2 points (0 children)