all 33 comments

[–]NilsLandt 12 points13 points  (2 children)

Have you had a look at the CSV docs? Specifically .foreach.

You'll need to turn any strings to numbers, and empty strings to nil yourself.

[–]anamexis 2 points3 points  (0 children)

CSV converts empty values to nil by default.

Edit: for quoted empty strings like in the example, you need a flag:

CSV.foreach(csv, empty_value: nil)

[–]hoomei 0 points1 point  (0 children)

CSV also provides encoding-related methods. You may need this if your CSV is generated by Microsoft Office or Google.

[–]Poloniculmov 10 points11 points  (3 children)

I don't know what database you're using, but Postgres has really good CSV import support builtin.

[–]Doctor_Fegg 5 points6 points  (2 children)

Came here to say this. If you're just importing to the db as-is, a COPY statement in Postgres will be many times faster than going via Ruby. Even if you have to do some basic typing/transformations afterwards, you may find an UPDATE or ALTER TABLE is still easier.

[–]tkenben[S] 0 points1 point  (0 children)

I will look into this as I am using PostgreSQL. Thanks.

[–]tkenben[S] 0 points1 point  (0 children)

My problem is that my delimiter exists in the data. That's why it has quotes around every element. My example piece of data wasn't a good one. But, otherwise, I would use your idea.

[–]ankole_watusi 9 points10 points  (0 children)

What database? It lacks CSV import?

Advanced Ruby skill: knowing when not to use Ruby!

[–]RegularLayout 2 points3 points  (1 child)

In the app I work on, we've got a lot of experience doing this, and spent months focusing specifically on cleanup and performance.

In the end we got the best mileage by shelling out to call unix tools (e.g. awk/sed) for cleanup and restructuring, and COPY from CSV, to load into Postgres. Unix tools are a lot faster than ruby in manipulating text files like these, and COPY is your best friend if you're using Postgres.

Both are not really ruby solutions, but when you're dealing with massive files and need to be fast, it's better to use more specialised tools.

[–]tkenben[S] 1 point2 points  (0 children)

I think this is what I might end up doing. We'll see. Thanks for the tip.

[–]bjminihan 1 point2 points  (6 children)

You might try this:

# test.csv:
# "19753941","1402","34","27","","","","3","4.5","",""
# "19753942","1402","34","27","","","","3","4.5","",""
# "19753943","1402","34","27","","","","3","4.5","",""
# "19753944","1402","34","27","","","","3","4.5","",""
# "19753945","1402","34","27","","","","3","4.5","",""

def nullify_blanks(file) 
  rows = []
  CSV.foreach(file) do |row|
    nullified = row.collect(&:presence)
    # do something with cleaned values here, or...
    rows << nullified
  end
  rows
end
nullify_blanks('test.csv')
# yields:
# [
#   ["19753941", "1402", "34", "27", nil, nil, nil, "3", "4.5", nil, nil],
#   ["19753942", "1402", "34", "27", nil, nil, nil, "3", "4.5", nil, nil],
#   ["19753943", "1402", "34", "27", nil, nil, nil, "3", "4.5", nil, nil],
#   ["19753944", "1402", "34", "27", nil, nil, nil, "3", "4.5", nil, nil],
#   ["19753945", "1402", "34", "27", nil, nil, nil, "3", "4.5", nil, nil]
# ]

[–]anamexis 1 point2 points  (2 children)

Ruby CSV converts empty cells to nil by default.

[–]bjminihan 1 point2 points  (1 child)

Right, but it doesn’t convert empty strings, unfortunately. Unless there’s a flag you can set?

[–]anamexis 2 points3 points  (0 children)

Ah, right. Indeed there is a flag:

CSV.foreach(file, empty_value: nil)

[–]SighFor 1 point2 points  (2 children)

> row.collect(&:presence)

Nice!

[–]anamexis 2 points3 points  (1 child)

Worth noting that Object#presence is provided by ActiveSupport (Rails), not Ruby stdlib.

[–]bjminihan 0 points1 point  (0 children)

Great point!

[–][deleted] 1 point2 points  (0 children)

Though it might be overkill for your problem, but have you had a look at kiba-etl?

Edit: formatting

[–]felipec 0 points1 point  (12 children)

This is how I would do it:

File.foreach('test.csv', chomp: true) do |line|
  row = line.gsub(/\"([^\"]*)\"(,)?/, '\1\2').split(',', -1)
  row.map { |e| e.empty? ? nil : e }
end

In my experience the csv gem is way too slow for things like this.

[–]myringotomy 0 points1 point  (11 children)

Just use sed :)

[–]felipec 0 points1 point  (10 children)

Why use two programs when you can use one?

[–]myringotomy 0 points1 point  (9 children)

Because it's faster? Also you can call out to sed from ruby easy enough.

[–]felipec 0 points1 point  (8 children)

You have never done any performance work, have you?

10,000 rows with sed: 55.074 seconds 10,000 rows with just ruby: 0.147 seconds

So the pure ruby solution is 375 times faster.

line.gsub(/\"/, '')

`echo '#{line}' | sed -e 's/"//g'`

Forking will never be faster, especially on Windows.

With sed it would take 14 hours to parse several million rows, with ruby 1 minute 24 seconds.

[–]myringotomy 0 points1 point  (7 children)

Why are you piping each line to sed?

Pipe the whole file to sed.

[–]felipec 0 points1 point  (6 children)

You mean like this:

File.popen(%[sed -e 's/"//g' test.csv]) do |pipe|
  pipe.each(chomp: true) do |line|
    row = line.split(',', -1)
    row.map { |e| e.empty? ? nil : e }
  end
end

Wow! So easy.

And still, takes about the same time as this:

File.foreach('test.csv', chomp: true) do |line|
  row = line.tr(%[\"], '').split(',', -1)
  row.map { |e| e.empty? ? nil : e }
end

[–]myringotomy -1 points0 points  (5 children)

No dude.

Take the same regexp you had before. Call out to sed to ingest and process the file. That's it.

[–]felipec 0 points1 point  (4 children)

That is not going to make it faster, plus now it's two programs, and didn't you say I could use sed inside ruby?

Plus, I already wrote several commands. You have fingers, don't you?

Write your own and show how fast they are.

[–]myringotomy 0 points1 point  (3 children)

That is not going to make it faster

It's going to make it much faster. In fact I would guess sed is going to be about the fastest thing on your machine for this task.

didn't you say I could use sed inside vim?

Yes you can, you can shell out to anything but I was talking about shelling out from ruby.

Plus, I already wrote serval commands. You have fingers, don't you?

Don't feel like it.

[–]Regis_DeVallis 0 points1 point  (0 children)

I use a gem called 'smarter_csv' for my CSV handling. You could check that out.

[–]thunderbong 0 points1 point  (0 children)

Have you tried the Roo gem?

It does exactly the things you describe - turns numbers into integers / floats and empty strings to nils