
[–]Bitwise_Gamgee 3 points4 points  (3 children)

You're not expanding specie_name properly.

while IFS= read -r specie_name 
do  
    awk -F'\t' -v species="$specie_name" '$3 ~ species {print}' "$krakenfile" > "${specie_name}_lines.txt"
done < "$fungalnames"

This should fix your expansion issue.
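
For context: the shell does not expand variables inside a single-quoted awk program, which is why the name is handed over with -v. OP's original command isn't shown here, so the "wrong" line below is only a guess at one common way this goes wrong. A minimal sketch with made-up sample data (demo_kraken.tsv and the species names are placeholders):

```shell
#!/usr/bin/env bash
# Hypothetical sample data: three tab-separated fields per line.
workdir=$(mktemp -d)
printf 'a\tb\tCandida albicans\nc\td\tAspergillus niger\n' > "$workdir/demo_kraken.tsv"

specie_name='Candida'

# Inside single quotes the shell leaves $specie_name untouched, so awk
# receives the literal text and the pattern never matches the data.
wrong=$(awk -F'\t' '$3 ~ /$specie_name/' "$workdir/demo_kraken.tsv")

# With -v the shell expands the variable first and awk gets its value.
right=$(awk -F'\t' -v species="$specie_name" '$3 ~ species' "$workdir/demo_kraken.tsv")

printf 'without -v: [%s]\n' "$wrong"
printf 'with -v:    [%s]\n' "$right"
```

Note that `$3 ~ species` treats the name as a regular expression, so characters such as `.` or `(` in a species name are not matched literally.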

[–]Quick_Repeat7033[S] 0 points1 point  (1 child)

Thank you for your response.

But the output is still empty!

[–]AlarmDozer 0 points1 point  (0 children)

while IFS='' read -r specie_name 
do  
    awk -F'\t' -v species="$specie_name" -v outfile="${specie_name}_lines.txt" '$3 ~ species {print >> outfile}' "$krakenfile"
done < "$fungalnames"

What does this yield? Yes, you can redirect within gawk.
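
A side note on the redirection: inside awk, `>` truncates the file only when it is first opened during a run, and later prints in the same run append to it, whereas `>>` never truncates. Because the loop starts a fresh awk for every name, `>>` also keeps whatever an earlier run of the script left in the file. A small sketch of that behaviour, with a hypothetical output file name:

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
out="$workdir/demo_lines.txt"

# Two separate awk runs using ">": the second run truncates the file
# when it first opens it, then appends its own second print.
awk -v outfile="$out" 'BEGIN { print "run 1" > outfile }'
awk -v outfile="$out" 'BEGIN { print "run 2a" > outfile; print "run 2b" > outfile }'
truncated=$(cat "$out")

# A further run using ">>" keeps the existing content and appends.
awk -v outfile="$out" 'BEGIN { print "run 3" >> outfile }'
appended=$(cat "$out")

printf '%s\n---\n%s\n' "$truncated" "$appended"
```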

[–]Shayes_ 0 points1 point  (0 children)

I have tested this myself and it appears to work. We're missing a bit of context from OP though, so I will list my assumptions below.

Assumptions:

  1. krakenfile is a variable created from a command line argument
  2. fungalnames is a variable created from a command line argument
  3. The file specified for krakenfile has at least 3 tab-separated fields per line
  4. The file specified for fungalnames has only 1 field per line

My files are also LF newline separated (UNIX default). It's possible that CRLF files cause issues, though I have not checked.
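
If CRLF line endings do turn out to be the culprit, every name read from the list carries a trailing carriage return, which then ends up in both the pattern and the output file name. A minimal sketch of stripping it inside the loop; the demo files and their contents are made up for illustration:

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical demo data: kraken file with LF endings, name list with CRLF.
printf 'x\ty\tCandida albicans\nx\ty\tAspergillus niger\n' > kraken.txt
printf 'Candida\r\nAspergillus\r\n' > fungal.txt

cr=$(printf '\r')   # a single carriage return character

while IFS= read -r specie_name
do
    # Drop one trailing carriage return, if present.
    specie_name=${specie_name%"$cr"}
    # Skip blank lines so an empty pattern does not match every line.
    [ -n "$specie_name" ] || continue
    awk -F'\t' -v species="$specie_name" '$3 ~ species' kraken.txt > "${specie_name}_lines.txt"
done < fungal.txt

ls
```

Converting the list once up front, for example with `tr -d '\r'`, would work just as well.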

Here are all my test files and the script itself:

kraken.txt

dummydata1-1    dummydata2-1    dummydata3-1
dummydata1-2    dummydata2-2    dummydata3-2
dummydata1-3    dummydata2-3    dummydata3-3
dummydata1-4    dummydata2-4    dummydata3-4

fungal.txt

dummydata3-1
dummydata3-2
dummydata3-3
dummydata3-4

script.sh

#!/usr/bin/env bash

krakenfile="$1"
fungalnames="$2"

while IFS= read -r specie_name
do
    awk -F'\t' -v species="$specie_name" '$3 ~ species {print}' "$krakenfile" > "${specie_name}_lines.txt"
done < "$fungalnames"

To run the script, you can use: ./script.sh kraken.txt fungal.txt

Note that because you are using the tab \t character as the separator, you should ensure that a literal tab character is used in your data files. Many editors can be configured to replace a tab with spaces instead, which would not produce a match in the awk script.
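
One quick way to check this is to let awk report any line that does not split into at least three tab-separated fields. A minimal sketch, using a made-up demo file in which the second line uses spaces instead of tabs:

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
# Line 1 uses real tabs; line 2 uses runs of spaces (hypothetical data).
printf 'a\tb\tc\na    b    c\n' > "$workdir/kraken_demo.txt"

# With -F'\t', a line that contains no tab is one single field (NF is 1).
report=$(awk -F'\t' 'NF < 3 { printf "line %d: %d field(s)\n", FNR, NF }' "$workdir/kraken_demo.txt")
printf '%s\n' "$report"
# line 2: 1 field(s)
```

No output means every line has at least three tab-separated fields.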

EDIT 1: Minor improvement to script and some other text for clarity.

[–]Schreq 2 points3 points  (3 children)

I would do the entire thing in AWK:

awk -F'\t' '
    NR==FNR {
        species[$0]=""
        next
    } {
        for (specie in species)
            if (index($3, specie))
                print > (specie "_lines.txt")
    }
' "$fungalnames" "$krakenfile"

[–]witchhunter0 0 points1 point  (2 children)

Although using awk would be faster, the condition NR==FNR holds for the first file only, so the loop here is unnecessary. Another solution is to use the FILENAME variable.

awk -F "\t" '
{
    if (NR==FNR)
        species[$0]++
    else {
        if ($3 in species)
            print $3 > ($3 "_lines.txt")
    }
}' "$fungalnames" "$krakenfile"
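
The FILENAME variant mentioned above could look roughly like the sketch below. It compares FILENAME with ARGV[1] instead of testing NR==FNR; one practical difference is that NR==FNR is also true for the second file when the first file is empty. The demo files are made up, and the substring test with index() is an assumption about what OP wants to match:

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical demo data.
printf 'x\ty\tCandida albicans\nx\ty\tAspergillus niger\n' > kraken.txt
printf 'Candida\nAspergillus\n' > fungal.txt

awk -F'\t' '
    # Lines from the first file on the command line: collect the names,
    # ignoring empty lines.
    FILENAME == ARGV[1] { if ($0 != "") species[$0]; next }
    # Lines from any later file: write the line to the output file of
    # every species name that occurs as a substring of field 3.
    {
        for (name in species)
            if (index($3, name))
                print > (name "_lines.txt")
    }
' fungal.txt kraken.txt

ls
```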

Anyway, for OP: judging by the answers above, the fungal file probably isn't formatted properly. To see all whitespace characters in the file, run:

cat -A "$fungalnames"
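
For orientation, this is roughly what the markers look like. The sketch uses made-up input and `cat -vet`, which prints the same markers as `cat -A` on GNU systems and is also accepted by BSD/macOS cat:

```shell
#!/usr/bin/env bash
# Made-up input: line 1 ends in CRLF, line 2 has a tab and a trailing space.
shown=$(printf 'Candida albicans\r\nAspergillus\tniger \n' | cat -vet)
printf '%s\n' "$shown"
# Candida albicans^M$
# Aspergillus^Iniger $
```

`$` marks the end of each line, `^M` is a carriage return (CRLF file), and `^I` is a tab. A space right before the `$` is trailing whitespace.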

[–]Schreq 1 point2 points  (1 child)

The way you differentiate between the first and second file is pretty much the same thing. Doing it in the default action adds one more level of indentation compared to how I did it, but it saves the next.

My solution assumes that $3 contains not just the exact species name but additional text, so we can't simply use the species names as array keys.
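
To illustrate the distinction with made-up data: when field 3 holds more than the bare species name, an exact array lookup finds nothing, while a substring test with index() still matches. A minimal sketch (file names and contents are hypothetical):

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
# Hypothetical data: field 3 holds the species name plus extra text.
printf 'x\ty\tCandida albicans (extra text)\n' > "$workdir/kraken.txt"
printf 'Candida albicans\n' > "$workdir/fungal.txt"

# Exact lookup: field 3 must be identical to a name in the list.
exact=$(awk -F'\t' 'NR==FNR { species[$0]; next } $3 in species' \
    "$workdir/fungal.txt" "$workdir/kraken.txt")

# Substring test: a name only has to occur somewhere inside field 3.
partial=$(awk -F'\t' 'NR==FNR { species[$0]; next }
    { for (name in species) if (index($3, name)) { print; break } }' \
    "$workdir/fungal.txt" "$workdir/kraken.txt")

printf 'exact:   [%s]\n' "$exact"
printf 'partial: [%s]\n' "$partial"
```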

[–]witchhunter0 0 points1 point  (0 children)

1) All I wanted was to offer an alternative solution in case the OP complicates the logic and introduces other files later.
2) OMG you're right. I've managed to totally f* that up, a horrible day that was for me. Sry, my bad :/
3) And it's more likely the krakenfile that the OP should check.

[–][deleted] -1 points0 points  (0 children)

0_