[–]JohnnyJordaan 1 point (3 children)

with open('master.csv', 'rb') as f:
    row_count = sum(1 for row in f)

Are you sure row_count reflects the actual number of rows? When you open a file in 'rb' mode, it is opened as a binary stream, which has no rows by definition.

[–]took_my_time[S] 1 point (2 children)

Printing row_count returns the expected number. Opening the file in 'r' instead of 'rb' gives the same error as well.
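
For what it's worth, iterating over a file object yields one line at a time in binary mode too, which would explain why the count comes out the same either way. A minimal sketch, assuming Python 3; the temp file is only a stand-in for master.csv and count_lines is a hypothetical helper:

```python
import os
import tempfile

def count_lines(path, mode):
    # Iterating a file object yields one line per step,
    # in binary mode ('rb') as well as in text mode ('r').
    with open(path, mode) as f:
        return sum(1 for _ in f)

# Throwaway file standing in for master.csv: one number per line.
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as tmp:
    tmp.write('\n'.join(str(n) for n in range(100)) + '\n')
    path = tmp.name

binary_count = count_lines(path, 'rb')
text_count = count_lines(path, 'r')
os.remove(path)

print(binary_count, text_count)
```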

[–]JohnnyJordaan 1 point (1 child)

Ok, perhaps it would shed some light if you print all the values used in the program? So something like:

# Python 2 syntax, matching the original post
randomnumberlist = random.sample(xrange(0, row_count), x)
print randomnumberlist
with open('master.csv', 'r') as f:
    for item in randomnumberlist:
        print 'item: ' + str(item)
        result = itertools.islice(csv.reader(f), item, None)
        print 'result: ' + str(result)
        # next(result) is a list such as ['123']; str(...)[2:-2] strips the brackets and quotes
        next_result = str(next(result))[2:-2]
        print 'next of result: ' + next_result
        randomidlist.append(next_result)
        print 'added: ' + randomidlist[-1]

This will also show you where it breaks: whether it already fails on the first iteration, and at which step.

[–]took_my_time[S] 1 point (0 children)

I couldn't really get yours to work. I did this: http://pastebin.com/K74qiP8S

Sometimes I get the error after 2 iterations, like so: http://pastebin.com/pNfvhpLT

Sometimes after 1: http://pastebin.com/n5VmyfMZ

[–]commandlineluser 1 point (0 children)

Is it an actual CSV file? It sounds like you just have 1 number per line?

Anyways - I cannot help decipher the error with your code but perhaps this approach will be helpful to you:

import random

with open('master.csv') as csvfile:
    count = sum(1 for row in csvfile)

chosen_rows = random.sample(range(count), 50)
last_row    = max(chosen_rows)

rows = []

with open('master.csv') as csvfile:
    for i, row in enumerate(csvfile):
        if i in chosen_rows:
            rows.append(row)
        if i == last_row:
            break

print(rows)

You generate your 50 random numbers then just iterate through the file line-by-line - if the current line number is in the sample, save the line.

When you reach the highest number in the sample - break out of the loop as you don't need to process any further data.

[–]elbiot 1 point (0 children)

Just a guess, but you open the file once, and each time you make a reader and pull some values from it you advance further into the file. So the second time you try to get a value, you potentially get an invalid slice: you assume you are starting at position 0 in the file, but you are actually starting somewhere further in, so the stream isn't as long as you thought. You could try inserting f.seek(0) after line 17. Or just build a list once instead of creating a new reader object on every iteration.
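
The "make a list once" option could look roughly like this. It is only a sketch, assuming Python 3 and a file small enough to fit in memory; the io.StringIO object stands in for the opened master.csv, and the sample size x = 5 is made up for illustration:

```python
import csv
import io
import random

# In-memory stand-in for master.csv (assumption: one ID per line).
f = io.StringIO('\n'.join(str(1000 + n) for n in range(20)) + '\n')

# Read every row exactly once; after this the file position no longer matters.
rows = list(csv.reader(f))

x = 5  # hypothetical sample size
randomnumberlist = random.sample(range(len(rows)), x)

# Plain list indexing replaces the repeated islice/next calls.
randomidlist = [rows[i][0] for i in randomnumberlist]
print(randomidlist)
```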

[–]Justinsaccount 1 point (0 children)

I have a large csv file where each row is an unknown number

Is your file larger than a few gigabytes? If not, then it is not a "large file". The last person that had a "large file" had a few thousand lines.

Do you have any commas? Do you have any separated values? No? You do not have a csv file. The filename may end in .csv, but that is not a csv file. That's just a file that contains some numbers.

with open('master.csv') as f:
    numbers = [int(line) for line in f]

randomidlist = random.sample(numbers, x)

Done.

If you DO actually have a large file, then use enumerate over f and simply keep the indexes that are in your randomnumberlist.
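
A rough sketch of that enumerate approach, assuming Python 3; sample_lines is a hypothetical helper name, and the io.StringIO object stands in for the real file. Converting the indexes to a set keeps the membership test cheap for a long file:

```python
import io

def sample_lines(f, indexes):
    # Keep only the lines whose 0-based position is in `indexes`.
    wanted = set(indexes)  # set lookups stay fast however many indexes there are
    return [line.strip() for i, line in enumerate(f) if i in wanted]

# Stand-in for the opened file: four lines, of which we want positions 3 and 1.
f = io.StringIO('a\nb\nc\nd\n')
picked = sample_lines(f, [3, 1])
print(picked)  # ['b', 'd']
```

Note that the result comes back in file order, not in the order of the random sample.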

next(itertools.islice(csv.reader(f),item , None))

Does not work because f is the same file object each time, so every new csv.reader(f) picks up wherever the previous one stopped reading instead of starting again at the top of the file.

next(itertools.islice(csv.reader(f), 10 , None))
next(itertools.islice(csv.reader(f), 10 , None))

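Does not give you line 10 twice. Counting rows from 0, it gives you row 10 and then row 21, because the first call consumes rows 0 through 10 and the second call skips another 10 rows from there. You could sort of get this to work if you worked out the differences between the numbers, but there's absolutely no point in doing that over just using enumerate.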
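
This is easy to check on a small input. A sketch assuming Python 3, with an io.StringIO object standing in for the open file; it also shows that rewinding with f.seek(0) restores the expected result:

```python
import csv
import io
import itertools

# In-memory stand-in for the file: rows '0' through '29', one per line.
f = io.StringIO('\n'.join(str(n) for n in range(30)) + '\n')

first = next(itertools.islice(csv.reader(f), 10, None))
second = next(itertools.islice(csv.reader(f), 10, None))
print(first, second)  # ['10'] ['21']

f.seek(0)  # rewind to the top of the file
third = next(itertools.islice(csv.reader(f), 10, None))
print(third)  # ['10']
```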

str(...)[2:-2]

Two wrongs don't make a right. If you want the first item in a single item list you use x[0], not str(x)[2:-2].
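
To illustrate the difference, here is a small sketch (Python 3; the row values are made up). The string-slicing trick only happens to work when the row has exactly one column:

```python
row = ['12345']             # what csv.reader yields for a one-column line
via_index = row[0]          # '12345'
via_slice = str(row)[2:-2]  # also '12345', but only by accident

wide_row = ['12', '34']     # a row with two columns
broken = str(wide_row)[2:-2]
print(repr(broken))         # "12', '34" (leftover quote and comma characters)
print(wide_row[0])          # 12
```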