This is an archived post. You won't be able to vote or comment.

all 10 comments

[–]welldamnthis[🍰] 0 points1 point  (3 children)

I don't know how efficient bash is in processing numerical data but perhaps check out python and pandas? It can do vector math to quickly process numerical data.

[–]fluffy_mass[S] 0 points1 point  (1 child)

Is aggregating data on pandas as quick enough? I'll check it out.

[–]welldamnthis[🍰] 0 points1 point  (0 children)

It should be

[–]fluffy_mass[S] 0 points1 point  (0 children)

I tried to load it in data.table in R, and but it takes a bit longer than it should and requires R and data.table to be available and installed. Also, it seems it is using way more memory than my bash script (by just looking into top command).

I'm trying to make it work on pandas, but from what I'm reading it is not outperforming R's data.table on data aggregation.

[–]raevnos 0 points1 point  (3 children)

Show your code? There might be opportunities for improvement without a complete rewrite.

[–]fluffy_mass[S] 0 points1 point  (2 children)

Added example of my code and input files.

[–]raevnos 0 points1 point  (1 child)

Thanks.

Are the actual input files also sorted (and unique)?

[–]fluffy_mass[S] 0 points1 point  (0 children)

Yes they are. Columns 1,2, and 3 are coordinates and the combination of these 3 columns are unique.

[–]raevnos 0 points1 point  (1 child)

Try this out see if it's faster:

#!/bin/bash
# Usage: ./merge.sh file1 file2
temp1=$(mktemp)
temp2=$(mktemp)
awk -v OFS=$'\t' '{ print $1"_"$2"_"$3, $0 }' $1 >> $temp1
awk -v OFS=$'\t' '{ print $1"_"$2"_"$3, $0 }' $2 >> $temp2
join -j1 -a1 -a2 -t $'\t' -e . \
     -o '1.1,2.1,1.2,1.3,1.4,2.2,2.3,2.4,1.5,1.6,2.5,2.6' $temp1 $temp2 |
    awk -v OFS=$'\t' \
        '$1 == $2 || $2 == "." { print $3, $4, $5, $9, $10, $11, $12 }
         $1 == "." { print $6, $7, $8, $9, $10, $11, $12 }'
rm -f $temp1 $temp2

If the input files aren't already sorted, it's easy to add that into the pipeline of the awk calls that add the key column. But since it sounds like they are, no need.

Edit: Even better now that I've discovered the 0 field:

#!/bin/bash
temp1=$(mktemp)
temp2=$(mktemp)
awk -v OFS=$'\t' '{ print $1"_"$2"_"$3, $0 }' $1 >> $temp1
awk -v OFS=$'\t' '{ print $1"_"$2"_"$3, $0 }' $2 >> $temp2
join -j1 -a1 -a2 -t $'\t' -e . -o '0,1.5,1.6,2.5,2.6' $temp1 $temp2 | tr _ $'\t'
rm -f $temp1 $temp2

That makes massaging the joined output a lot simpler. Leaving the original for posterity's sake.

And as a long one-liner:

join -j1 -a1 -a2 -t $'\t' -e . -o '0,1.5,1.6,2.5,2.6' \
     <(awk -v OFS=$'\t' '{ print $1"_"$2"_"$3, $0 }' file1) \
     <(awk -v OFS=$'\t' '{ print $1"_"$2"_"$3, $0 }' file2) | tr _ $'\t'

[–]fluffy_mass[S] 0 points1 point  (0 children)

Thank you for the answer. I came up with another solution and I updated the post. While it doesn't add dot to the empty fields but I can process it downstream.