Import from CSV into a Ruby array, with the 1st field as hash key, then look up a field's value given the header row
Asked Answered
B

7

27

Maybe somebody can help me.

Starting with a CSV file like so:

Ticker,"Price","Market Cap"
ZUMZ,30.00,933.90
XTEX,16.02,811.57
AAC,9.83,80.02

I managed to read them into an array:

require 'csv'
tickers = CSV.read("stocks.csv", {:headers => true, :return_headers => true, :header_converters => :symbol, :converters => :all} )

To verify data, this works:

puts tickers[1][:ticker]
ZUMZ

However this doesn't:

puts tickers[:ticker => "XTEX"][:price]

How would I go about turning this array into a hash, using the ticker field as the unique key, so that I could easily look up any other field associatively, as defined in line 1 of the input? I'm dealing with many more columns and rows.

Much appreciated!

Benedic answered 12/12, 2011 at 15:35 Comment(1)
A
35

Like this (it works with other CSVs too, not just the one you specified):

require 'csv'

tickers = {}

CSV.foreach("stocks.csv", :headers => true, :header_converters => :symbol, :converters => :all) do |row|
  tickers[row.fields[0]] = Hash[row.headers[1..-1].zip(row.fields[1..-1])]
end

Result:

{"ZUMZ"=>{:price=>30.0, :market_cap=>933.9}, "XTEX"=>{:price=>16.02, :market_cap=>811.57}, "AAC"=>{:price=>9.83, :market_cap=>80.02}}

You can access elements in this data structure like this:

puts tickers["XTEX"][:price] #=> 16.02

Edit (according to comment): For selecting elements, you can do something like

tickers.select { |ticker, vals| vals[:price] > 10.0 }
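
To combine more than one condition, a minimal sketch along the same lines (using only the columns present in the sample file) could be:

# Keep tickers priced above 10.0 with a market cap above 100.0;
# select returns a new hash, so lookups still work the same way.
filtered = tickers.select { |sym, vals| vals[:price] > 10.0 && vals[:market_cap] > 100.0 }

puts filtered.keys.inspect      #=> ["ZUMZ", "XTEX"]
puts filtered["XTEX"][:price]   #=> 16.02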
Algy answered 12/12, 2011 at 15:58 Comment(10)
Wow quick reply thank you!! Still getting the hang of this lang. Next I'm researching how to apply one or more filters (e.g. return this hashed array with all prices over 2.01)Benedic
If this answer helped you, please upvote and/or accept (the little tick mark below the voting arrows) it, that's StackOverflow etiquette. I'll update my answer to address the filtering question :-)Algy
My 2.8MB file with under 7000 rows and ~40 columns takes way too long on this foreach loop, over 5min, reading in only a few cols [1..4] testing in irb. Have to stick to snappy fast awk query to keep my script under 20s till I can figure this out within Ruby :(Benedic
Dang, ain't got the reputation to upvote you...been consulting this great site for yrs but only now created acct :)Benedic
Then just accept the answer, you should always be able to do that. As for the speed, Ruby isn't known for being fast, but 5 minutes does sound like a lot. You could try JRuby, but if you have a 20s limit, JVM startup may offset potential speed gains. But then there's nothing wrong with using awk for parsing CSV, it's actually very well suited for the task :-)Algy
BTW my sample CSV datasource tinyurl.com/AllStocksFinviz if it helps. Thanks for all explaining & examples!Benedic
This discussion is getting too long, but on my laptop this is how long it took me to parse the entire file (all columns): ruby foo.rb Downloads/finviz.csv 36,00s user 1,07s system 92% cpu 40,260 total. If you are on 1.8 you should try the FasterCSV gem.Algy
try without the :converters => :all param. I'm not sure why that's there and it will likely slow it down a bit.Forecourse
Now that my framework is maturing, and needing more fields of various types to read and process flexibly, I'm looking to shift back from my fast awk solution to 100% native Ruby. But it has to be just as fast and available to my scripts, so it seems memcached will be a good fit for me. I.e. I will structure to preload my CSV file into a memory object for Ruby to later use instantly.Benedic
I wrote a gem that yields hashed CSV records, with no extra effort: rubygems.org/gems/hasherize_csvHepato
B
6
CSV.read(file_path, headers: true, header_converters: :symbol, converters: :all).collect do |row|
  Hash[row.collect { |c, r| [c, r] }]
end
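
Result (an array of one hash per row, shown here for the sample stocks.csv from the question):

[{:ticker=>"ZUMZ", :price=>30.0, :market_cap=>933.9}, {:ticker=>"XTEX", :price=>16.02, :market_cap=>811.57}, {:ticker=>"AAC", :price=>9.83, :market_cap=>80.02}]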
Becca answered 30/6, 2013 at 4:57 Comment(0)
D
2
CSV.read(file_path, headers: true, header_converters: :symbol, converters: :all).collect do |row|
  row.to_h
end
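
This is the same idea as the answer above, with row.to_h doing the per-row conversion. If the goal is the ticker-keyed lookup from the original question, a hedged variant along these lines should work (assuming the first column converts to :ticker, as it does for the sample stocks.csv):

require 'csv'

# Key each row hash by its ticker so lookups like tickers["XTEX"][:price] work.
tickers = CSV.read("stocks.csv", headers: true, header_converters: :symbol, converters: :all)
             .map { |row| [row[:ticker], row.to_h] }
             .to_h

puts tickers["XTEX"][:price]  #=> 16.02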
Dispatcher answered 24/10, 2022 at 11:32 Comment(1)
While this code may answer the question, consider adding an explanation of how this code fixes the problem. And because this is such an old question with many other answers, you should explain what your solution adds/addresses that the others do not.Dahna
C
1

To add on to Algy's answer, if you want to access the elements in the following manner

puts tickers[:price]["XTEX"] #=> 16.02

You can try the following code snippet:

CSV.foreach("Workbook1.csv", :headers => true, :header_converters => :symbol, :converters => :all) do |row|
    hash_row =  row.headers[1..-1].zip( (Array.new(row.fields.length-1, row.fields[0]).zip(row.fields[1..-1])) ).to_h
    hash_row.each{|key, value| tickers[key] ? tickers[key].merge!([value].to_h) : tickers[key] = [value].to_h}
end
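
Run against the question's stocks.csv (instead of Workbook1.csv), the loop builds a column-first structure, so the lookup shown at the top works, e.g.:

p tickers
#=> {:price=>{"ZUMZ"=>30.0, "XTEX"=>16.02, "AAC"=>9.83}, :market_cap=>{"ZUMZ"=>933.9, "XTEX"=>811.57, "AAC"=>80.02}}

puts tickers[:price]["XTEX"]  #=> 16.02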
Chamfer answered 16/3, 2015 at 10:8 Comment(0)
B
0

To get the best of both worlds (very fast reading from a huge file AND the benefits of a native Ruby CSV object), my code has since evolved into this method:

$stock="XTEX"
csv_data = CSV.parse IO.read(%`|sed -n "1p; /^#{$stock},/p" stocks.csv`), {:headers => true, :return_headers => false, :header_converters => :symbol, :converters => :all}

# Now the 1-row CSV object is ready for use, e.g.:
$company = csv_data[:company][0]
$volatility_month = csv_data[:volatility_month][0].to_f
$sector = csv_data[:sector][0]
$industry = csv_data[:industry][0]
$rsi14d = csv_data[:relative_strength_index_14][0].to_f

which is closer to my original method, but reads in only one record plus line 1 of the input CSV file containing the headers. The inline sed instructions take care of that, and the whole thing is noticeably instant. This is better than the previous approach because now I can access all the fields from Ruby, associatively, without caring about column numbers anymore, as was the case with awk.
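
For anyone wanting to stay entirely inside Ruby, a rough sketch of the same idea (read only the header line plus the single matching row, then hand those two lines to CSV) might look like the following; it assumes the ticker is always the first field, as in the sample stocks.csv, and is an illustration rather than the method used above:

require 'csv'

stock  = "XTEX"
header = nil
wanted = nil

# Stream the file and stop at the first row whose first field is the ticker,
# mirroring what the sed filter does without shelling out.
File.foreach("stocks.csv") do |line|
  if header.nil?
    header = line
  elsif line.start_with?("#{stock},")
    wanted = line
    break
  end
end

csv_data = CSV.parse(header + wanted.to_s, headers: true,
                     header_converters: :symbol, converters: :all)
puts csv_data[:price][0]  #=> 16.02 (blank if the ticker was not found)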

Benedic answered 24/3, 2012 at 10:52 Comment(4)
This isn't Perl -- you don't need the $, and in fact you shouldn't have it, because in Ruby, $ indicates global variables, and use of globals is generally bad practice.Regenerate
In this case the globals are intentional. But I do realize that if the entire program were more OO, things would be written better.Benedic
Exactly. Excessive use of globals (i.e. really any use except for things like configuration data) usually indicates a design problem that you'll want to fix.Regenerate
Downvoter note: This solution crossing multiple technologies (not just Ruby) has been our chosen one for years, since the original purely-Ruby answer, while it works, is highly inefficient, and simply too slow in production.Benedic
L
0

Not as much of a one-liner, but this was clearer to me.

require 'csv'

csv_headers = CSV.parse(STDIN.gets)  # the first line of input holds the headers
csv = CSV.new(STDIN)

kick_list = []
csv.each do |row|
  row_hash = {}
  row.each_with_index do |field, j|
    row_hash[csv_headers[0][j]] = field
  end
  kick_list << row_hash
end
Lucho answered 2/8, 2013 at 1:45 Comment(0)
B
-1

While this isn't a 100% native Ruby solution to the original question, in case others stumble here and wonder what awk call I wound up using for now, here it is:

$dividend_yield = IO.readlines("|awk -F, '$1==\"#{$stock}\" {print $9}' datafile.csv")[0].to_f

where $stock is the variable I had previously assigned to a company's ticker symbol (the wannabe key field). It conveniently survives problems by returning 0.0 if the ticker, file, or field #9 is not found or is empty, or if the value cannot be typecast to a float, so any trailing '%' in my case gets nicely truncated.

Note that at this point one could easily add more filters within awk to have IO.readlines return a 1-dimensional array of output lines from the smaller resulting CSV, e.g.

 awk -F, '$9 >= 2.01  &&  $2 > 99.99  {print $0}' datafile.csv 

outputs, in bash, which lines have a DivYld (col 9) over 2.01 and a price (col 2) over 99.99. (Unfortunately I'm not using the header row to determine field numbers, which is where I was ultimately hoping for some searchable associative Ruby array; see the sketch below.)
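
As a rough pure-Ruby counterpart to that awk filter, driven by the header row instead of hard-coded column positions (the :dividend_yield and :price header symbols are assumptions about how the real datafile.csv headers would convert, so adjust them to the actual file):

require 'csv'

# Keep rows with a dividend yield of at least 2.01 and a price above 99.99,
# looked up by header name rather than by column number.
rows = CSV.foreach("datafile.csv", headers: true,
                   header_converters: :symbol, converters: :all)
          .select { |row| row[:dividend_yield].to_f >= 2.01 && row[:price].to_f > 99.99 }

rows.each { |row| puts row.to_csv }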

Benedic answered 14/12, 2011 at 18:10 Comment(1)
The problem with -F, is that awk treats every comma as a delimiter even when it occurs inside a quoted field, like a company's name: "Apple, Inc." Only "Apple comes back when I ask for field $15 of that row.Benedic

© 2022 - 2024 — McMap. All rights reserved.