I am building a script to download and parse benefits information for health insurance plans on the Obamacare exchanges. Part of this requires downloading and parsing the plan benefit JSON files from each individual insurance company. To do this, I am using concurrent.futures.ThreadPoolExecutor with 6 workers to download each file (with urllib), parse the JSON, loop through it, and extract the relevant info (which is stored in a nested dictionary within the script).
I am running Python 3.5.1 (v3.5.1:37a07cee5969, Dec 6 2015, 01:38:48) [MSC v.1900 32 bit (Intel)] on win32.
The problem is that when I do this concurrently, the script does not seem to release the memory after it has downloaded, parsed, and looped through a JSON file, and after a while it crashes, with malloc raising a memory error.
When I do it serially, with a simple for-in loop, however, the program neither crashes nor uses an extreme amount of memory.
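For comparison, a minimal sketch of what such a serial loop might look like. The names `extract_tiers` and `run_serially` are hypothetical, and the loader is passed in as an argument (an assumption made so the snippet runs without network access); the JSON field names are the ones used in the script.

```python
def extract_tiers(data, drugid, plansid_dict):
    """Copy the tier fields for one drug from a parsed formulary file."""
    fields = ('drug_tier', 'prior_authorization', 'step_therapy', 'quantity_limit')
    for item in data:
        if item['rxnorm_id'] == drugid:
            for row in item['plans']:
                plan = plansid_dict[row['plan_id']]
                for field in fields:
                    plan[field] = row[field]


def run_serially(urls, loader, drugid, plansid_dict):
    """Load and process one file at a time; return how many files succeeded."""
    downloaded_plans = 0
    for url in urls:
        try:
            # `loader` is a hypothetical callable, e.g. a wrapper around urllib.
            data = loader(url)
            extract_tiers(data, drugid, plansid_dict)
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            downloaded_plans += 1
    return downloaded_plans
```

In this form, `data` is rebound on every iteration, so at most one parsed file is referenced by the loop at a time.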
import concurrent.futures
import json
import urllib.request

def load_json_url(url, timeout):
    req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    resp = urllib.request.urlopen(req, timeout=timeout).read().decode('utf8')
    return json.loads(resp)

with concurrent.futures.ThreadPoolExecutor(max_workers=6) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_json_url, url, 60): url for url in formulary_urls}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            # The below timeout isn't raising the TimeoutError.
            data = future.result(timeout=0.01)
            for item in data:
                if item['rxnorm_id'] == drugid:
                    for row in item['plans']:
                        print(row['drug_tier'])
                        plansid_dict[row['plan_id']]['drug_tier'] = row['drug_tier']
                        plansid_dict[row['plan_id']]['prior_authorization'] = row['prior_authorization']
                        plansid_dict[row['plan_id']]['step_therapy'] = row['step_therapy']
                        plansid_dict[row['plan_id']]['quantity_limit'] = row['quantity_limit']
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            downloaded_plans = downloaded_plans + 1
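
To put numbers on the memory growth described above, something like the following could be wrapped around the processing of each file. This is only a sketch: `measure_peak` and `build` are hypothetical names, and `tracemalloc` reports only memory allocated through Python's own allocators, so the figures are approximate rather than the process's true footprint.

```python
import tracemalloc


def measure_peak(func, *args):
    """Run func(*args); return (result, bytes still allocated, peak bytes)."""
    tracemalloc.start()
    try:
        result = func(*args)
        # `current` still includes the result itself, since it is alive here.
        current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, current, peak


def build(n):
    # Hypothetical stand-in for "download and parse one JSON file".
    return [str(i) * 10 for i in range(n)]


result, current, peak = measure_peak(build, 10000)
print('still allocated: %d bytes, peak: %d bytes' % (current, peak))
```

Comparing the "still allocated" figure after each file in the serial and the concurrent runs would show whether parsed data is being kept alive in one case and not the other.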