I am trying to parse an RSS feed with feedparser and insert it into a MySQL table using SQLAlchemy. This had been running fine, but today the feed had an item with an ellipsis character in the description, and I now get the following error:
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2026' in position 35: ordinal not in range(256)
If I add the convert_unicode=True option to the engine, the insert goes through, but the ellipsis doesn't show up; I just get weird characters in its place. This seems to make sense since, to the best of my knowledge, there is no horizontal ellipsis in latin-1. Even setting the encoding to utf-8 doesn't seem to make a difference. If I do an insert using phpMyAdmin and include the ellipsis, it goes through fine.
I'm thinking I just don't understand character encodings or how to get SQLAlchemy to use one I specify. Does anyone know how to get the text to go in without weird characters?
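To make the symptom concrete, here is a minimal sketch of what I believe is going on at the byte level. It does not touch the database at all; `try_encode` is a helper name I made up for illustration, and the guess that the "weird characters" are UTF-8 bytes being read back as a single-byte Western encoding is an assumption, not something I have confirmed against the server.

```python
# -*- coding: utf-8 -*-
ELLIPSIS = u'\u2026'  # HORIZONTAL ELLIPSIS, the character from the error message


def try_encode(text, codec):
    """Return the encoded bytes, or None if the codec cannot represent text.

    Hypothetical helper, only here to keep the comparison short.
    """
    try:
        return text.encode(codec)
    except UnicodeEncodeError:
        return None


# latin-1 (ISO-8859-1) has no code point for the ellipsis, so this is None.
latin1_bytes = try_encode(ELLIPSIS, 'latin-1')

# UTF-8 represents the ellipsis as three bytes.
utf8_bytes = try_encode(ELLIPSIS, 'utf-8')

# Assumption: if those three bytes are later interpreted with a single-byte
# Western codec, each byte becomes its own character.
mojibake = utf8_bytes.decode('cp1252')
```

If that assumption holds, one character going in and three unrelated characters coming out is what "weird characters" would look like.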
UPDATE
I think I have figured this one out but I'm not really sure why it matters...
Here is the code:
import sys
import feedparser
import sqlalchemy
from sqlalchemy import create_engine, MetaData, Table
COMMON_CHANNEL_PROPERTIES = [
    ('Channel title:', 'title', None),
    ('Channel description:', 'description', 100),
    ('Channel URL:', 'link', None),
]
COMMON_ITEM_PROPERTIES = [
    ('Item title:', 'title', None),
    ('Item description:', 'description', 100),
    ('Item URL:', 'link', None),
]
INDENT = u' '*4
def feedinfo(url, output=sys.stdout):
    feed_data = feedparser.parse(url)
    channel, items = feed_data.feed, feed_data.entries
    # adding charset=utf8 here is what fixed the problem
    db = create_engine('mysql://user:pass@localhost/db?charset=utf8')
    metadata = MetaData(db)
    rssItems = Table('rss_items', metadata, autoload=True)
    i = rssItems.insert()
    for label, prop, trunc in COMMON_CHANNEL_PROPERTIES:
        value = channel[prop]
        if trunc:
            value = value[:trunc] + u'...'
        print >> output, label, value
    print >> output
    print >> output, "Feed items:"
    for item in items:
        i.execute({'title': item['title'], 'description': item['description'][:100]})
        for label, prop, trunc in COMMON_ITEM_PROPERTIES:
            value = item[prop]
            if trunc:
                value = value[:trunc] + u'...'
            print >> output, INDENT, label, value
        print >> output, INDENT, u'---'
    return
if __name__ == "__main__":
    url = sys.argv[1]
    feedinfo(url)
Here's the output/traceback from running the code without the charset option:
Channel title: [H]ardOCP News/Article Feed
Channel description: News/Article Feed for [H]ardOCP...
Channel URL: http://www.hardocp.com
Feed items:
Item title: Windows 8 UI is Dropping the 'Start' Button
Item description: After 15 years of occupying a place of honor on the desktop, the "Start" button will disappear from ...
Item URL: http://www.hardocp.com/news/2012/02/05/windows_8_ui_dropping_lsquostartrsquo_button/
---
Item title: Which Crashes More? Apple Apps or Android Apps
Item description: A new study of smartphone apps between Android and Apple conducted over a two month period came up w...
Item URL: http://www.hardocp.com/news/2012/02/05/which_crashes_more63_apple_apps_or_android/
---
Traceback (most recent call last):
  File "parse.py", line 47, in <module>
    feedinfo(url)
  File "parse.py", line 36, in feedinfo
    i.execute({'title':item['title'], 'description': item['description'][:100]})
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/sql/expression.py", line 2758, in execute
    return e._execute_clauseelement(self, multiparams, params)
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 2304, in _execute_clauseelement
    return connection._execute_clauseelement(elem, multiparams, params)
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1538, in _execute_clauseelement
    compiled_sql, distilled_params
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1639, in _execute_context
    context)
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 330, in do_execute
    cursor.execute(statement, parameters)
  File "build/bdist.linux-i686/egg/MySQLdb/cursors.py", line 159, in execute
  File "build/bdist.linux-i686/egg/MySQLdb/connections.py", line 264, in literal
  File "build/bdist.linux-i686/egg/MySQLdb/connections.py", line 202, in unicode_literal
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2026' in position 35: ordinal not in range(256)
So it looks like adding the charset to the MySQL connection string did it. I suppose the connection defaults to latin-1? I had tried setting the encoding flag on create_engine to utf8, and that did nothing. Does anyone know why it would use latin-1 when the tables and fields are set to utf8 unicode? I also tried encoding item['description'] with .encode('cp1252') before sending it off, and that worked as well, even without the charset option in the connection string. That shouldn't have worked with latin-1, but apparently it did? I've got the solution but would love an answer :)
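On the cp1252 part, here is a small sketch of the one explanation I can come up with; treat it as an assumption rather than something verified in the MySQLdb source. Encoding the text myself produces a byte string, so the driver has nothing left to encode with its 'latin-1' codec, and the character set MySQL calls latin1 is, as far as I know, based on cp1252 rather than strict ISO-8859-1. In cp1252 the ellipsis is the single byte 0x85, which strict ISO-8859-1 treats as a control character.

```python
# -*- coding: utf-8 -*-
ELLIPSIS = u'\u2026'  # HORIZONTAL ELLIPSIS

# cp1252 does have the ellipsis; it maps to one byte.
cp1252_bytes = ELLIPSIS.encode('cp1252')

# Read as strict ISO-8859-1, that byte is the C1 control U+0085 (NEXT LINE),
# which is why Python's 'latin-1' codec has no way to encode the ellipsis.
as_iso_8859_1 = cp1252_bytes.decode('latin-1')

# Read as cp1252, the same byte round-trips to the ellipsis. Assumption: this
# is effectively what a MySQL "latin1" connection does with the byte.
as_cp1252 = cp1252_bytes.decode('cp1252')
```

If that is right, the pre-encoded bytes only happened to line up with the server's idea of latin1, and setting charset=utf8 on the connection is still the cleaner fix since it covers characters that cp1252 lacks.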