Change text_factory in Django/sqlite

6

6

I have a django project that uses a sqlite database that can be written to by an external tool. The text is supposed to be UTF-8, but in some cases there will be errors in the encoding. The text is from an external source, so I cannot control the encoding. Yes, I know that I could write a "wrapping layer" between the external source and the database, but I prefer not having to do this, especially since the database already contains a lot of "bad" data.

The solution in sqlite is to change the text_factory to something like: lambda x: unicode(x, "utf-8", "ignore")

However, I don't know how to tell the Django model driver this.

The exception I get is:

'Could not decode to UTF-8 column 'Text' with text' in /var/lib/python-support/python2.5/django/db/backends/sqlite3/base.py in execute

Somehow I need to tell the sqlite driver not to try to decode the text as UTF-8 (at least not using the standard algorithm, but it needs to use my fail-safe variant).
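For reference, the plain-sqlite side of this can be reproduced outside Django. The following is a minimal sketch, assuming Python 3, where `bytes.decode` takes the place of the Python 2 `unicode(...)` call. The table `notes`, the column `body` and the helper `tolerant_text` are made up for illustration.

```python
import sqlite3


def tolerant_text(raw):
    """Decode bytes from SQLite as UTF-8, silently dropping invalid bytes."""
    return raw.decode("utf-8", "ignore")


def read_notes(conn):
    """Return every value of the hypothetical notes.body column."""
    return [row[0] for row in conn.execute("SELECT body FROM notes ORDER BY rowid")]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (body TEXT)")
# CAST stores the raw bytes as TEXT, mimicking an external tool that wrote
# a value that is not valid UTF-8 (0xE9 is a Latin-1 'e' with acute accent).
conn.execute("INSERT INTO notes VALUES (CAST(? AS TEXT))", (b"caf\xe9 ok",))

# Install the fail-safe decoder before reading; the default factory decodes
# strictly as UTF-8 and would reject this value.
conn.text_factory = tolerant_text
rows = read_notes(conn)
```

Passing `"replace"` instead of `"ignore"` would keep a visible U+FFFD marker where the bad byte was, which can make damaged records easier to find later.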

Shopping answered 30/4, 2010 at 13:0 Comment(0)
B
9

The solution in sqlite is to change the text_factory to something like: lambda x: unicode(x, "utf-8", "ignore")

However, I don't know how to tell the Django model driver this.

Have you tried

from django.db import connection
connection.connection.text_factory = lambda x: unicode(x, "utf-8", "ignore")

before running any queries?

Balling answered 18/6, 2010 at 21:11 Comment(2)
Thanks for the input! The above worked with a few modifications (namely, one has to create a cursor first, otherwise the DatabaseWrapper.connection is None). I've been tearing my hair about this.Shopping
@Shopping can you post the full solution?Delphinus
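The modified code was never posted in this thread, so the following is not the asker's version. It is a hedged sketch of one way to avoid the "connection is None" problem, assuming Python 3 and a Django version that provides the `connection_created` signal: the handler runs every time Django opens a connection, so no cursor has to be created by hand. The names `tolerant_text`, `install_text_factory` and `register` are hypothetical.

```python
def tolerant_text(raw):
    """Decode bytes from SQLite as UTF-8, dropping bytes that are not valid."""
    return raw.decode("utf-8", "ignore")


def install_text_factory(sender, connection, **kwargs):
    """Signal handler: runs each time Django opens a database connection."""
    # `connection` is Django's wrapper; `connection.connection` is the raw
    # sqlite3 connection, which is already open when the signal fires.
    if connection.vendor == "sqlite":
        connection.connection.text_factory = tolerant_text


def register():
    """Call once at startup, for example from an AppConfig.ready() method."""
    # Imported lazily so this module can be loaded without Django present.
    from django.db.backends.signals import connection_created

    connection_created.connect(install_text_factory)
```

Because the handler is attached to the signal rather than to one connection object, it also covers connections that are reopened later, for example in other threads.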
D
2

Inspired by Milla's answer, consider the following monkey-patch, which installs a more tolerant text_factory into the Django sqlite connection. Use it when you cannot control how text is added to the sqlite database and it might not be in UTF-8. Of course, the encoding chosen here may not be the right one, but at least your application won't crash.

import types
from django.db.backends.sqlite3.base import DatabaseWrapper

def to_unicode( s ):
    ''' Try a number of encodings in an attempt to convert the text to unicode. '''
    if isinstance( s, unicode ):
        return s
    if not isinstance( s, str ):
        return unicode(s)

    # Put the encodings you expect here in sequence.
    # Right-to-left charsets are not included in the following list.
    # Note: 'iso-8859-1' accepts every byte sequence, so the encodings
    # listed after it are never actually tried.
    encodings = (
        'utf-8',
        'iso-8859-1', 'iso-8859-2', 'iso-8859-3',
        'iso-8859-4', 'iso-8859-5',
        'iso-8859-7', 'iso-8859-8', 'iso-8859-9',
        'iso-8859-10', 'iso-8859-11',
        'iso-8859-13', 'iso-8859-14', 'iso-8859-15',
        'windows-1250', 'windows-1251', 'windows-1252',
        'windows-1253', 'windows-1254', 'windows-1255',
        'windows-1257', 'windows-1258',
        'utf-8',     # Include utf8 again for the final exception.
    )
    for encoding in encodings:
        try:
            return unicode( s, encoding )
        except UnicodeDecodeError as e:
            pass
    raise e

if not hasattr(DatabaseWrapper, 'get_new_connection_is_patched'):
    _get_new_connection = DatabaseWrapper.get_new_connection
    def _get_new_connection_tolerant(self, conn_params):
        conn = _get_new_connection( self, conn_params )
        conn.text_factory = to_unicode
        return conn

    DatabaseWrapper.get_new_connection = types.MethodType( _get_new_connection_tolerant, None, DatabaseWrapper )
    DatabaseWrapper.get_new_connection_is_patched = True
Delilahdelimit answered 1/3, 2015 at 14:29 Comment(1)
One detail left out: you need to apply this patch before accessing the database. A good place could be "models.py".Delilahdelimit
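The fallback idea in this answer can be written more compactly. The sketch below assumes Python 3 and a much shorter list: strict UTF-8 first, then Windows-1252, then Latin-1 as a catch-all that accepts any byte. The function name `decode_with_fallback` and the choice of encodings are illustrative assumptions, not part of the answer above.

```python
def decode_with_fallback(raw, encodings=("utf-8", "cp1252", "latin-1")):
    """Return `raw` decoded with the first encoding that accepts it."""
    if isinstance(raw, str):
        return raw
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Only reached if the caller removed the catch-all from `encodings`.
    return raw.decode("utf-8", "replace")
```

A function like this can be assigned as the `text_factory` in place of `to_unicode`. Any single-byte encoding that defines all 256 byte values must come last, because nothing listed after it will ever be tried.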
E
0

Pass the data through one of Django's string helper functions:

smart_str(s, encoding='utf-8', strings_only=False, errors='strict')

or

smart_unicode(s, encoding='utf-8', strings_only=False, errors='strict')
Elene answered 30/4, 2010 at 13:16 Comment(5)
I am sorry if I misunderstand you, but the problem is that the database already contains 'bad' data, and I want to do the conversion when I read it. The page you refer to seems to deal with inputting strings into the database. The tool that imports data does not use django, but works with the pysqlite module. It consists of legacy code that I am reluctant to change. Thanks for the response.Shopping
Have you tried passing the 'bad' DB content through the two functions above?Elene
smart_str and smart_unicode can serve the purpose of filtering whether you're loading the data into the database or reading from it. I'd do both for consistency & data integrity.Parenteau
Sorry, but I must admit you got me totally confused now. I don't understand how to use those functions at the database driver level. No matter how I read the docs, I can only see that they operate on strings, but Sqlite throws an exception way before I get hold of the actual string. The question is updated with the exception I get.Shopping
I realize now that my original question wasn't very clearly formulated. The problem is that I get an exception before I can even see the data. Just iterating over the records in the model is enough to trigger the exception.Shopping
N
0

It seems that this problem comes up quite often and is of interest to many people (this question has more than a thousand views and quite a few upvotes).

So here is the solution I found, which seems to me the most convenient one:

I checked the django sqlite3 connector and added the str conversion directly to the get_new_connection(...) function:

def get_new_connection(self, conn_params):
    conn = Database.connect(**conn_params)
    conn.create_function("django_date_extract", 2, _sqlite_date_extract)
    conn.create_function("django_date_trunc", 2, _sqlite_date_trunc)
    conn.create_function("django_datetime_extract", 3, _sqlite_datetime_extract)
    conn.create_function("django_datetime_trunc", 3, _sqlite_datetime_trunc)
    conn.create_function("regexp", 2, _sqlite_regexp)
    conn.create_function("django_format_dtdelta", 5, _sqlite_format_dtdelta)
    conn.text_factory = str
    return conn

It seems to work as it should, and one does not have to deal with the unicode problem in every request individually. Shouldn't adding this to the Django code be considered? I wouldn't suggest that anyone actually modify their Django backend code by hand...
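Note that `text_factory = str` only has this effect under Python 2, where `str` is the byte-string type and the driver hands back the stored bytes untouched. Under Python 3, `str` is already the default, strictly decoding factory, so the closest counterpart is `bytes`. A minimal sketch with plain sqlite3, assuming Python 3 and a made-up table `notes`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.text_factory = bytes  # Python 3 counterpart of the Python 2 `str` setting
conn.execute("CREATE TABLE notes (body TEXT)")
# CAST stores the raw bytes as TEXT even though they are not valid UTF-8.
conn.execute("INSERT INTO notes VALUES (CAST(? AS TEXT))", (b"caf\xe9",))

# The driver returns the stored bytes as-is; decoding is now up to the caller.
raw = conn.execute("SELECT body FROM notes").fetchone()[0]
text = raw.decode("utf-8", "replace")
```

The trade-off is that every TEXT value then arrives as bytes, so code that expects text strings has to decode them itself.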

Nona answered 13/11, 2013 at 22:39 Comment(0)
V
0
from django.db import connection
connection.cursor()
connection.connection.text_factory = lambda x: unicode(x, "utf-8", "ignore")

In my specific case I needed to set connection.connection.text_factory = str

Vickery answered 22/9, 2014 at 14:15 Comment(0)
E
0

This can be caused by an incompatible Django version, so check your Django version first. I was running Django==3.0.8 and got this error. Then I switched to a virtualenv with Django==3.1.2 and the error went away.

Eyestalk answered 8/10, 2020 at 13:24 Comment(1)
This question was asked 10 years ago. Django 3.x did not exist back then :)Shopping
