Writing data to LMDB with Python very slow
While creating datasets for training with Caffe, I tried both HDF5 and LMDB. However, creating an LMDB is very slow, even slower than HDF5. I am trying to write ~20,000 images.

Am I doing something terribly wrong? Is there something I am not aware of?

This is my code for LMDB creation:

import lmdb
import caffe

DB_KEY_FORMAT = "{:0>10d}"

db = lmdb.open(path, map_size=int(1e12))
curr_idx = 0
commit_size = 1000
for curr_commit_idx in range(0, num_data, commit_size):
    # one write transaction per chunk of commit_size images
    with db.begin(write=True) as in_txn:
        for i in range(curr_commit_idx, min(curr_commit_idx + commit_size, num_data)):
            d, l = data[i], labels[i]
            im_dat = caffe.io.array_to_datum(d.astype(float), label=int(l))
            key = DB_KEY_FORMAT.format(curr_idx)
            in_txn.put(key.encode("ascii"), im_dat.SerializeToString())  # keys must be bytes
            curr_idx += 1
db.close()

As you can see, I create one transaction per 1,000 images rather than one per image, because I thought committing a transaction for each image would add overhead, but it seems this doesn't influence performance much.

Extemporize asked 27/7, 2015 at 9:16 Comment(8)
why aren't you using the convert_imageset tool?Chase
@Shai: Actually I wasn't aware of it, but I also don't have my images as files. Though, why should it be faster? Is the Python implementation so slow?Extemporize
I'm using convert_imageset on ilsvrc12 (imagenet), converting datasets of ~1M images; it takes a while but it works.Chase
where do you get your data from?Chase
I have HDF5 files containing my data. I know Caffe can use HDF5 files as a data source; unfortunately, when doing so, Caffe does not allow data transformations.Extemporize
What transformations do you require?Chase
Actually, I want to use data augmentation like cropping and mirroring.Extemporize
Then you can either save your hdf5 images as jpegs and process them through the conventional pipeline that allows for data augmentation, or manually crop and mirror, creating additional numpy arrays, saving them to HDF5 and feeding the augmented HDF5 to the net.Chase

In my experience, writes to LMDB from Python took 50-100 ms each when writing Caffe data to an ext4 hard disk on Ubuntu. That's why I use tmpfs (RAM disk functionality built into Linux), where the same writes take around 0.07 ms. You can create smaller databases on your RAM disk, copy them to a hard disk, and later train on all of them. I make them around 20-40 GB each, as I have 64 GB of RAM.
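For anyone unfamiliar with tmpfs: on most Linux distributions /dev/shm is already mounted as tmpfs, and a dedicated RAM disk takes one mount command. A minimal sketch of the idea (the mount point and size here are my examples, not part of this answer):

# Run once as root to create a dedicated RAM disk (example size):
#   mount -t tmpfs -o size=40g tmpfs /mnt/ramdisk
# Alternatively, just use /dev/shm, which is tmpfs on most distros.
import lmdb

# Opening the environment on the tmpfs path makes every write hit RAM:
image_db = lmdb.open('/mnt/ramdisk/train_images', map_size=int(1e12))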

Here are some pieces of code to help you dynamically create, fill, and move LMDBs to storage. Feel free to edit them to fit your case. They should save you some time getting your head around how LMDB and file manipulation work in Python.

import os
import random
import shutil
import string

import caffe
import lmdb


def move_db():
    """Close the RAM-disk LMDB, move it to permanent storage, reopen a fresh one."""
    global image_db
    image_db.close()
    # random name for the archived database
    rnd = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(5))
    shutil.move(fold + 'ram/train_images', '/storage/lmdb/' + rnd)
    open_db()


def open_db():
    """Open a fresh LMDB on the RAM disk (fold is the base path, set elsewhere)."""
    global image_db
    image_db = lmdb.open(os.path.join(fold, 'ram/train_images'),
                         map_async=True,
                         max_dbs=0)


def write_to_lmdb(db, key, value):
    """Write (key, value) to db, doubling the map size whenever it fills up."""
    success = False
    while not success:
        txn = db.begin(write=True)
        try:
            txn.put(key, value)
            txn.commit()
            success = True
        except lmdb.MapFullError:
            txn.abort()
            # double the map_size and retry
            curr_limit = db.info()['map_size']
            new_limit = curr_limit * 2
            print('>>> Doubling LMDB map size to %sMB ...' % (new_limit >> 20,))
            db.set_mapsize(new_limit)

...

image_datum = caffe.io.array_to_datum(transformed_image, label)
write_to_lmdb(image_db, str(itr).encode('ascii'), image_datum.SerializeToString())
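A driver loop tying the helpers together might look like the sketch below; training_images, labels and IMAGES_PER_DB are hypothetical placeholders to adapt, not part of my pipeline:

# Hypothetical driver: fill the RAM-disk DB and archive it in chunks.
IMAGES_PER_DB = 100000  # placeholder chunk size; tune to your RAM

open_db()
for itr, transformed_image in enumerate(training_images):  # your data source
    image_datum = caffe.io.array_to_datum(transformed_image, label=int(labels[itr]))
    write_to_lmdb(image_db, str(itr).encode('ascii'), image_datum.SerializeToString())
    if (itr + 1) % IMAGES_PER_DB == 0:
        move_db()  # archive the full RAM-disk DB, reopen a fresh one
move_db()  # archive the remainder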
Isotron answered 10/5, 2016 at 21:21 Comment(5)
Can you give a bit more context on what tempfs is?Zischke
Can you please provide specific code describing your solution/workflow?Chase
This is an excellent suggestion! @SteveHeim See this post for details on creating a RAM disk in Ubuntu. Rather than writing data to a hard disk, which can be very slow when a large number of writes are involved, you can mount a directory to a RAM location. While the interface is the same as any other directory, read and write access to the mounted directory will be orders of magnitude faster. When you're finished using your database you can then move it to another directory on a hard disk for long term storage.Blinding
Steve, as I wrote, tempfs is a RAM disk fs on Linux. You can use a different RAM disk filesystem if you're on another OS; it doesn't matter.Psalter
Actually, sorry for the typo, I meant tmpfs. Shai, my specific code is pretty specific as I'm getting my data through interprocess communication via sockets - I'm distorting stuff on HTML canvas and POSTing it to a socket. But okay, I'll update my answer to include code for manipulating DBs. Read about tmpfs somewhere else, though. That one is well documented.Psalter

Try this:

DB_KEY_FORMAT = "{:0>10d}"

db = lmdb.open(path, map_size=int(1e12))
curr_idx = 0
commit_size = 1000
# one write transaction for the whole dataset
with db.begin(write=True) as in_txn:
    for curr_commit_idx in range(0, num_data, commit_size):
        for i in range(curr_commit_idx, min(curr_commit_idx + commit_size, num_data)):
            d, l = data[i], labels[i]
            im_dat = caffe.io.array_to_datum(d.astype(float), label=int(l))
            key = DB_KEY_FORMAT.format(curr_idx)
            in_txn.put(key.encode("ascii"), im_dat.SerializeToString())
            curr_idx += 1
db.close()

The line

with db.begin(write=True) as in_txn:

takes much time, so open the write transaction once for the whole run instead of once per 1,000 images.
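If you do need to commit in chunks, much of the per-commit cost is the fsync LMDB performs for durability. As a hedged aside, py-lmdb's documented sync, writemap and map_async environment flags can relax this, trading crash safety for bulk-load speed, which is usually acceptable when building a dataset you can regenerate. A minimal sketch:

import lmdb

# Relaxed durability flags for bulk loading (a sketch, not the code above):
#   sync=False     - skip fsync on commit (risk of loss only on a system crash)
#   writemap=True  - write through a memory map, often faster for bulk loads
#   map_async=True - with writemap, flush the map asynchronously
db = lmdb.open(path, map_size=int(1e12),
               sync=False, writemap=True, map_async=True)

with db.begin(write=True) as in_txn:
    ...  # bulk inserts as above

db.sync(True)  # one explicit flush at the end
db.close()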

Firestone answered 15/9, 2015 at 1:56 Comment(0)

LMDB writes are very sensitive to key order: if you can sort the data by key before insertion, write speed will improve significantly.
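A minimal illustration in py-lmdb (my sketch, not the answerer's code; items is a hypothetical list of (key, value) byte pairs). Since keys then arrive in ascending order, you can additionally pass put's documented append=True flag, which skips the B-tree positioning search:

import lmdb

env = lmdb.open('/path/to/db', map_size=int(1e12))

items.sort(key=lambda kv: kv[0])  # sort by key before inserting

with env.begin(write=True) as txn:
    for k, v in items:
        # append=True is only valid because keys are in ascending order
        txn.put(k, v, append=True)
env.close()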

Orthopter answered 22/6, 2017 at 11:15 Comment(0)

I did a small benchmark to illustrate the previous answer's point about sorted keys:

Machine:

Raspberry Pi 4B, overclocked to 1.75 GHz, 4 GB RAM, Raspberry Pi OS, OS on SSD

Code:

import timeit

import lmdb

# generate_hash_from_file, Directory, DIR_ECTORY, DB_NAME and TIMES are
# helpers/constants from my own project, defined elsewhere.

def insert_lmdb(fsobj, transaction):
    # key: the file's path; value: the hex digest of the file's content hash
    transaction.put(key=str(fsobj).encode("utf-8", "ignore"),
                    value=generate_hash_from_file(fsobj).hexdigest().encode("utf-8", "ignore"))

print("\n> Insert results in lmdb <")
list_f = Directory(path=DIR_ECTORY, use_hash=False, hash_from_content=False).lists["files"]
records = len(list_f)

# list_f = sorted(list_f)  # Run only in the 'sorted' case.

st = timeit.default_timer()

env = lmdb.open(path=DB_NAME)

# (The full harness repeats this timed block TIMES times; one pass shown here.)
with env.begin(write=True) as txn:
    for i in list_f:
        insert_lmdb(i, transaction=txn)

average = (timeit.default_timer() - st) * 1000000 / records

print(f"Test repeated {TIMES} times.\nNumber of files: {records}\n"
      f"Average time: {round(average, 3)} us or {round(1000000/average/1000, 3)}k inserts/sec")

Results:

Without sorting:

> Insert results in lmdb <
Test repeated 50000 times.
Number of files: 363
Average time: 84 us or 12k inserts/sec

With sorting:

> Insert results in lmdb <
Test repeated 50000 times.
Number of files: 363
Average time: 18.5 us or 54k inserts/sec

Sorting brought a 4.5 times speed increase in writes, not bad for only one extra line of code :).

Factorial answered 19/7, 2020 at 12:36 Comment(0)
