Is it possible to use SQLite in EFS reliably?
Asked Answered
C

1

8

Is it possible to use SQLite in AWS EFS safely? In my readings trying to determine if this is viable there appears to be some allusions that it should be doable since AWS EFS implemented NFSv4 back in 2017. In practice I am having no luck getting consistent behavior out of it.

Quick Points:

  1. "Just use AWS RDS": Due to issues with other AWS architecture another team has implemented we are trying to work around resource starving cause by the API (DynamoDB isn't an option)
  2. "This goes against SQLite's primary use case (being a locally access DB): Yes, but given the circumstances it seems like the best approach.
  3. I have verified that we are running nfsv4 on our EC2 instance

Current results are very inconsistent with 3 exceptions encountered irrespective of approach I use

  1. "file is encrypted or is not a database"
  2. "disk I/O error (potentially related to EFS open file limits)"
  3. "database disk image is malformed" (The database actually isn't corrupted after this)

database code:

SQLITE_VAR_LIMIT = 999
dgm_db_file_name = ''
db = SqliteExtDatabase(None)
lock_file = f'{os.getenv("efs_path", "tmp")}/db_lock_file.lock'


def lock_db_file():
    with open(lock_file, 'w+') as lock:
        limit = 900
        while limit:
            try:
                fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
                print(f'db locked')
                break
            except Exception as e:
                print(f'Exception: {str(e)}')
                limit -= 1
                time.sleep(1)

    if not limit:
        raise ValueError(f'Timed out after 900 seconds while waiting for database lock.')


def unlock_db_file():
    with open(lock_file, 'w+') as lock:
        fcntl.flock(lock, fcntl.LOCK_UN)
        print(f'db unlocked')


def initialize_db(db_file_path=dgm_db_file_name):
    print(f'Initializing db ')
    global db
    db.init(db_file_path, pragmas={
        'journal_mode': 'wal',
        'cache_size': -1 * 64000,  # 64MB
        'foreign_keys': 1})
    print(f'db initialized')


class Thing(Model):
    name = CharField(primary_key=True)
    etag = CharField()
    last_modified = CharField()

    class Meta:
        database = db

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)   

    @staticmethod
    def insert_many(stuff):
        data = [(k, v['ETag'], v['Last-Modified']) for k, v in stuff.items()]
        fields = [Thing.name, Thing.etag, Thing.last_modified]
        limit = 900
        while True:
            try:
                with db.atomic():
                    for key_batch in chunked(data, SQLITE_VAR_LIMIT // len(fields)):
                        s = Thing.insert_many(key_batch, fields=[Thing.name, Thing.etag, Thing.last_modified]) \
                            .on_conflict_replace().execute()
                break
            except Exception as e:
                print(f'Exception: {str(e)}')
                print(f'Will try for {limit} more seconds.')
                limit -= 1
                time.sleep(1)

        if not limit:
            raise ValueError('Failed to exectue query after 900 seconds.')

Example Call:

        print(f'Critical section start')
        # lock_db_file() # I have tried with a secondary lock file as well
        self.stuff_db = Thing()
        if not Path(self.db_file_path).exists():
            initialize_db(self.db_file_path)
            print('creating tables')
            db.create_tables([Thing], safe=True)
        else:
            initialize_db(self.db_file_path)

        getattr(Thing, insert_many)(self.stuff_db, stuff_db)
        # db.close()
        # unlock_db_file()
        print(f'Critical section end')
        print(f'len after update: {len(stuff)}')

Additional peculiarities:

  • If a lamda gets stuck catching the "malformed image" exception and a new lambda execution is triggered, the error resolves in the other lambda.
Coumarin answered 8/2, 2022 at 23:31 Comment(8)
Can you use pragma synchronous = 1? Does it change anything?Inrush
@Inrush So I just tried this out and it resulted in a new exception: "locking protocol". It seems like there is something going on behind the scenes that is causing this to not work, like maybe when the lambda execution completes.Coumarin
I assume you also followed this guide (or similar) docs.aws.amazon.com/efs/latest/ug/efs-onpremises.html and I cannot see there any mounting option which would make any significant difference (unless you mount as a wrong nfs version or something similar). If you run it in lambda - yes, that can have some consequences I would consider (if you do not do) switch vpc-endpoint/no vpc endpoint docs.aws.amazon.com/efs/latest/ug/efs-vpc-endpoints.html to the other option. Otherwise - maybe it really does not work....Inrush
@Inrush The configuration for EFS was done by another developer but I did confirm that the NFS version is v4 so that isn't an issue. I can confirm that the problem is caused by concurrent lambda access though as I have a working solution that limits lambda execution 1 one instance and there are no issues.Coumarin
Can you run one lambda and one instance from e.g. a temporary ec2 or just different service, would it work? I am curious if it is caused by the lambda environment or concurrency per se.Inrush
@Inrush Well I finally got it to work. I had to use the peewee APSWDatabase class and pass it the keyword vfs='unix-excl'. I'm not sure of the broader implications of this other than I know that effects how the locking works.Coumarin
great work! I mean, it is for sure an edge use case, but it is at least an interesting possibility and your project is a proof that it has practical usage. You solved it yourself, but thanks for sharing the results.Inrush
The problem here is that you are using journal_mode=WAL which is expected to work only on local filesystems with shared memory support (see sqlite.org/wal.html). So the only documented option for network filesystem is to use locking_mode=EXCLUSIVE, see chapter "Use of WAL Without Shared-Memory".Anabas
C
6

After some trial and error I discovered it is a workable solution. It appears that the design will need to use APSWDatabase(..., vfs='unix-excl') to properly enforce locking.

Database code:

from peewee import *
from playhouse.apsw_ext import APSWDatabase

SQLITE_VAR_LIMIT = 999
db = APSWDatabase(None, vfs='unix-excl')


def initialize_db(db_file_path):
    global db
    db.init(db_file_path, pragmas={
        'journal_mode': 'wal',
        'cache_size': -1 * 64000})
    db.create_tables([Thing], safe=True)

    return Thing()


class Thing(Model):
    field_1 = CharField(primary_key=True)
    field_2 = CharField()
    field_3 = CharField()

    class Meta:
        database = db

This allows for the following usage:

        db_model = initialize_db(db_file_path)
        with db:
            # Do database queries here with the db_model
            pass

Note: If you don't use the context managed database connection you will need to explicitly call db.close() otherwise the lock will not be released from the file. Additionally, calling db_init(...) causes a lock to be placed on the databased until it is closed.

Coumarin answered 11/2, 2022 at 19:47 Comment(1)
For anyone else wondering what vfs='unix-excl' does: sqlite.org/vfs.html#standard_unix_vfses "obtains and holds an exclusive lock on database files, preventing other processes from accessing the database. Also keeps the wal-index in heap rather than in shared memory."Compressed

© 2022 - 2024 — McMap. All rights reserved.