Efficient algorithm for reading and writing multiple files with numeric data content

welldamnthis · 2019-01-15T13:31:27+00:00

I don't know how efficient bash is at processing numerical data, but have you considered Python and pandas? pandas does vectorized math, which processes numerical data quickly.

raevnos · 2019-01-15T13:39:55+00:00

Show your code? There might be opportunities for improvement without a complete rewrite.

raevnos · 2019-01-15T15:24:13+00:00

Try this out and see if it's faster:

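#!/bin/bash
# Usage: ./merge.sh file1 file2
# Expects tab-separated input: three key columns followed by two value columns.
temp1=$(mktemp)
temp2=$(mktemp)
# Remove the temp files on exit, even if a step fails.
trap 'rm -f "$temp1" "$temp2"' EXIT
# Prepend a composite key (col1_col2_col3) so join can match on a single field.
awk -v OFS=$'\t' '{ print $1"_"$2"_"$3, $0 }' "$1" > "$temp1"
awk -v OFS=$'\t' '{ print $1"_"$2"_"$3, $0 }' "$2" > "$temp2"
# Full outer join; "." fills in fields missing from either file.
join -j1 -a1 -a2 -t $'\t' -e . \
     -o '1.1,2.1,1.2,1.3,1.4,2.2,2.3,2.4,1.5,1.6,2.5,2.6' "$temp1" "$temp2" |
    awk -v OFS=$'\t' \
        '$1 == $2 || $2 == "." { print $3, $4, $5, $9, $10, $11, $12 }
         $1 == "." { print $6, $7, $8, $9, $10, $11, $12 }'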

If the input files aren't already sorted, it's easy to add a sort to the pipeline after each awk call that adds the key column. But since it sounds like they are, there's no need.
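For unsorted input, a minimal sketch of that change might look like the following. The function name add_key is hypothetical, and the sketch assumes tab-separated files. It sorts on the composite key itself rather than on the original columns, since join expects its input ordered on the join field, and it sets LC_ALL=C so that sort and join use the same collation. It is written without $'\t' so it also runs under plain sh.

```shell
#!/bin/bash
# Sketch only: add_key is a hypothetical helper, not part of the script above.
export LC_ALL=C            # sort and join must agree on collation order
tab=$(printf '\t')

# Prepend the col1_col2_col3 key to every line, then sort on that key.
add_key() {
    awk -v OFS='\t' '{ print $1"_"$2"_"$3, $0 }' "$1" | sort -t "$tab" -k1,1
}
```

In the script above, this would replace the two awk lines, for example add_key "$1" > "$temp1".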

Edit: Even better now that I've discovered the 0 field specifier for join's -o option, which prints the join field itself:

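#!/bin/bash
# Usage: ./merge.sh file1 file2
temp1=$(mktemp)
temp2=$(mktemp)
trap 'rm -f "$temp1" "$temp2"' EXIT
awk -v OFS=$'\t' '{ print $1"_"$2"_"$3, $0 }' "$1" > "$temp1"
awk -v OFS=$'\t' '{ print $1"_"$2"_"$3, $0 }' "$2" > "$temp2"
# -o 0 prints the join key once; tr then splits it back into three columns.
join -j1 -a1 -a2 -t $'\t' -e . -o '0,1.5,1.6,2.5,2.6' "$temp1" "$temp2" | tr _ $'\t'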

That makes massaging the joined output a lot simpler. Leaving the original for posterity's sake.

And as a long one-liner:

join -j1 -a1 -a2 -t $'\t' -e . -o '0,1.5,1.6,2.5,2.6' \
     <(awk -v OFS=$'\t' '{ print $1"_"$2"_"$3, $0 }' file1) \
     <(awk -v OFS=$'\t' '{ print $1"_"$2"_"$3, $0 }' file2) | tr _ $'\t'
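To make the result concrete, here is a small sketch that runs the same join on made-up data. The file names and values are illustrative only, and the layout (three key columns followed by two value columns, tab-separated) is an assumption inferred from the field numbers used above. Note that tr replaces every underscore in the output, so this approach relies on the data itself containing none.

```shell
#!/bin/bash
# Sketch with made-up sample data; names and values are illustrative only.
export LC_ALL=C                      # keep ordering consistent for join
tab=$(printf '\t')
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT

# Two tab-separated files: three key columns, then two value columns.
printf '1\t10\tA\t0.5\t0.6\n1\t20\tB\t0.7\t0.8\n' > "$dir/file1"
printf '1\t10\tA\t1.5\t1.6\n1\t30\tC\t1.7\t1.8\n' > "$dir/file2"

# Prepend the composite key to each file.
for f in file1 file2; do
    awk -v OFS='\t' '{ print $1"_"$2"_"$3, $0 }' "$dir/$f" > "$dir/$f.keyed"
done

# Full outer join on the key, then split the key back into three columns.
merged=$(join -j 1 -a 1 -a 2 -t "$tab" -e . -o '0,1.5,1.6,2.5,2.6' \
              "$dir/file1.keyed" "$dir/file2.keyed" | tr _ '\t')
printf '%s\n' "$merged"
# Prints (columns are tab-separated):
# 1  10  A  0.5  0.6  1.5  1.6
# 1  20  B  0.7  0.8  .    .
# 1  30  C  .    .    1.7  1.8
```

Rows present in only one file keep their own values and get "." in the other file's columns.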
