Multithreaded download of yahoo stock history with python yfinance

I'm trying to download the historical data for a list of tickers and export each one to a CSV file. I can make this work as a for loop, but that is very slow when the list of stock tickers is in the thousands. I'm trying to multithread the process, but I keep getting many different errors. Sometimes it downloads just 1 file, other times 2 or 3, and a few times even 6, but never more than that. I'm guessing that has something to do with having a 6-core, 12-thread processor, but I really don't know.

import csv
import os
import yfinance as yf
import pandas as pd
from threading import Thread

ticker_list = []

with open('tickers.csv', 'r') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    name = None
    for row in reader:
        if row[0]:
            ticker_list.append(row[0])

start_date = '2019-03-03'
end_date = '2020-03-04'

data = pd.DataFrame()

def y_hist(i):
    ticker = ticker_list[i]
    data = yf.download(ticker, start=start_date, end=end_date, group_by="ticker")
    data.to_csv('yhist/' + ticker + '.csv', sep=',', encoding='utf-8')

threads = []

for i in range(os.cpu_count()):
    print('registering thread %d' % i)
    threads.append(Thread(target=y_hist,args=(i,)))

for thread in threads:
    thread.start()

for thread in threads:
    thread.join()

print('done')

This is a sample CSV file with just enough tickers to test this: ticker.csv

These are the pages I've read and used code from in an attempt to make this work:

multithreading-to-scrape-yahoo-finance

Engineer Man threads

an-introduction-to-asynchronous-programming-in-python

This is a simplified version with its output; maybe it will help to clarify the issue.

import os
import pandas as pd
import yfinance as yf
from threading import Thread

ticker_list = ['IBM','MSFT','QQQ','SPY','FB','XLV','XLF','XLK','XLE','GTHX','IYR','ONE','ROG','OLED','GLD']

def y_hist():
    for ticker in ticker_list:
        print(ticker)

threads = []

for i in range(os.cpu_count()):
    threads.append(Thread(target=y_hist))

for thread in threads:
    thread.start()

for thread in threads:
    thread.join()

Output:

IBM
MSFT
QQQ
SPY
FB
XLV
XLF
XLK
XLE
GTHX
IYR
ONE
ROG
OLED
GLD
IBM
MSFT
QQQ
SPY
FB
XLV
XLF
XLK
XLE
GTHX
IYR
ONE
ROG
OLED
GLD
IBM
MSFT
QQQ
SPY
FB
XLV
XLF
XLK
XLE
GTHX
IYR
ONE
ROG
IBM
MSFT
QQQ
SPY
FB
XLV
XLF
XLK
XLE
GTHX
IYR
ONE
ROG
OLED
GLD
OLEDIBM
MSFT
QQQ
SPY
FB
XLV
XLF
XLK
XLE
GTHX
IYR
ONE
GLD
IBM
MSFT
QQQ
SPY
FB
XLV
XLF
XLK
XLE
GTHX
IYR
ONE
ROG
OLED
IBM
GLD
MSFT
ROG
OLED
GLD

QQQ
SPY
FB
XLV
XLF
XLK
XLE
GTHX
IYR
ONE
ROG
OLED
GLD
IBM
MSFT
QQQ
SPY
FB
XLV
XLF
XLK
XLE
GTHX
IYR
ONE
ROG
OLED
GLD
IBM
MSFT
QQQ
SPY
IBM
MSFT
FB
XLV
XLF
XLK
XLE
GTHX
IYR
ONE
ROG
OLED
GLD
QQQ
SPY
FB
XLV
XLF
XLK
XLE
GTHX
IYR
ONE
ROG
OLED
GLD
IBM
MSFT
QQQ
SPY
FB
XLV
XLF
XLK
XLE
GTHX
IYR
ONE
ROG
OLED
IBM
MSFT
QQQ
SPY
FB
XLV
XLF
XLK
XLE
GTHX
IYR
ONE
ROG
OLED
GLD
GLD
Xylon answered 5/3, 2020 at 16:59 Comment(0)

While this does not directly fix my broken code, it is a solution that gets the same result. It uses yfinance's built-in ability to multithread. Unfortunately, I still don't know why the original code won't work, and I would still appreciate feedback on that. In the meantime, this will work if anyone is looking for a solution to the same issue.

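import csv
import os
import time

import yfinance as yf

start = time.time()

# Read the ticker symbols from the first column of tickers.csv.
ticker_list = []

with open('tickers.csv', 'r') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for row in reader:
        if row and row[0]:
            ticker_list.append(row[0])

# Download every ticker in one call; threads=True lets yfinance fetch them in parallel.
data = yf.download(
        tickers = ticker_list,
        period = '1y',
        interval = '1d',
        group_by = 'ticker',
        auto_adjust = False,
        prepost = False,
        threads = True,
        proxy = None
    )

# Make sure the output directory exists before writing to it.
os.makedirs('yhist', exist_ok=True)

# With group_by='ticker' and several tickers, the columns are grouped per ticker,
# so data[ticker] is that ticker's price history.
for ticker in ticker_list:
    data[ticker].to_csv('yhist/' + ticker + '.csv', sep=',', encoding='utf-8')

print('It took', time.time()-start, 'seconds.')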

Time to run a list of 400 tickers:

With threading set to True

[*********************100%***********************] 400 of 400 completed

It took 23.420897006988525 seconds.

With threading set to False

[*********************100%***********************] 400 of 400 completed

It took 133.77732181549072 seconds.
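
One thing that stands out when rereading the loop in the question: it starts os.cpu_count() threads and each thread handles only ticker_list[i], so it can never request more tickers than there are logical cores, however long the list is. That bounds the number of files but may not explain the errors. For anyone who prefers to manage the threads themselves, here is a minimal sketch of a bounded thread pool that covers the whole list. The names download_all, fetch_one and max_workers are made up for this illustration and are not part of yfinance; the download step is passed in as a function so the pool logic can be tried without any network access.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(tickers, fetch_one, max_workers=8):
    """Call fetch_one(ticker) for every ticker using a bounded thread pool.

    Returns (results, errors): results maps ticker -> return value,
    errors maps ticker -> the exception raised for that ticker.
    fetch_one is a hypothetical callback supplied by the caller.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # One task per ticker; the pool reuses a fixed number of worker threads.
        futures = {pool.submit(fetch_one, ticker): ticker for ticker in tickers}
        for future in as_completed(futures):
            ticker = futures[future]
            try:
                results[ticker] = future.result()
            except Exception as exc:
                # Record the failure and keep going with the other tickers.
                errors[ticker] = exc
    return results, errors
```

With yfinance, fetch_one could be a small function that calls yf.Ticker(ticker).history(start=start_date, end=end_date) and writes the result with to_csv. I have not timed that variant against the built-in threads=True option, so treat it as a starting point rather than a drop-in replacement.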

Xylon answered 6/3, 2020 at 13:59 Comment(2)
Good solution. However, I keep getting this error while running the above code: [WinError 10048] Only one usage of each socket address (protocol/network address/port) is normally permitted. Any idea why I am seeing this and how I can get rid of it? Southbound
Upvoted for at least including your working solution here. Imparipinnate
