What are some "mainstream" lightweight alternatives to storing files in .csv format? [duplicate]

Asked 19/4, 2011 at 19:36 Answered 18/1, 2016 at 11:46

Solved csv storage local-storage data-storage

I'm on a project which heavily favors the use of .csv files for data storage. I see many issues with using .csv, especially for storing relational data. Parsing .csv is generally a pain, particularly when using ad-hoc column assigments.

I've advocated the use of XML and minimal databases such as SQLite, but I'm looking for "faster, better, cheaper" alternatives.

What are some other, "mainstream" lightweight alternatives to .csv files?

Also, what about CouchDB. How does it compare to SQLite in terms of lightweight-ness?

EDIT: I missed it. This question has been asked before.

Bick answered 19/4, 2011 at 19:36 Comment(0)

I would argue there is no direct replacement for a CSV file. CSV is a flat file index-oriented format. It doesn't matter if you replace commas with pipes or whatnot. It's the same thing with slightly different rules.

With that being said, I often opt for SQLite when the data is in my control.

Using SQLite consistently lends to using the same tooling, can be used as either an ad-hoc store or a relational model, has a 'step up' plan to a "standalone" RDBMS, provides DQL "for free" (which is a big plus for me), etc. Unless space is a real issue or there isn't support for the data-access, why not? (Modern Firefox also uses SQLite).

(There are a number of object-database out there, such as DB4O as well -- or even simpler key/value hierarchical stores, etc. Not trying to say SQLite is the only way to obtain relationships in a micro/embedded database.)

One down-side over say, XML is that special tooling (sqlite/adapter) is required. XML, while not the most human-friendly format, can be edited just fine in notepad. Additionally, there is no extra overhead (fragmentation or structure) in XML beside the markup/data itself and XML is generally quite amendable to compression. There are also many libraries to map an entire object graph to XML (and thus maintain relationships) so that might be a nice feature.

Other formats like JSON are also out there -- but if the format is opaque then it doesn't really make a difference over XML (it's more of a matter of tooling support).

So... "it depends".

Hydrometeor answered 19/4, 2011 at 19:47 Comment(0)

It looks like YAML is relatively small compared to formats such as XML, but slightly more descriptive than JSON (it's a superset). It's another candidate I'll consider.

Bick answered 19/4, 2011 at 21:22 Comment(0)

It's all about use-case.

My rule of thumb: use SQLite if there are dependencies or relations between two pieces of data; use CSV (or some other "flat" format) if it's just flat data files. The simplest thing that just works is often the most reliable solution as well.

(Note: Ensure the CSV is well formed. Nobody likes having to hack around bad CSV implementations.)

Dilator answered 19/4, 2011 at 19:58 Comment(0)

HDF5 is a good choice for storing large tabular datasets, if you do not require concurrent writes.

In Python, Pandas + PyTables are very easy to use. Example from the Pandas documentation:

In [259]: store = HDFStore('store.h5')

In [260]: print(store)
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
Empty
Objects can be written to the file just like adding key-value pairs to a dict:

In [261]: np.random.seed(1234)

In [262]: index = date_range('1/1/2000', periods=8)

In [263]: s = Series(randn(5), index=['a', 'b', 'c', 'd', 'e'])

In [264]: df = DataFrame(randn(8, 3), index=index,
   .....:                columns=['A', 'B', 'C'])
   .....: 

In [265]: wp = Panel(randn(2, 5, 4), items=['Item1', 'Item2'],
   .....:            major_axis=date_range('1/1/2000', periods=5),
   .....:            minor_axis=['A', 'B', 'C', 'D'])
   .....: 

# store.put('s', s) is an equivalent method
In [266]: store['s'] = s

In [267]: store['df'] = df

In [268]: store['wp'] = wp

# the type of stored data
In [269]: store.root.wp._v_attrs.pandas_type
Out[269]: 'wide'

In [270]: store
Out[270]: 
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
/df            frame        (shape->[8,3])  
/s             series       (shape->[5])    
/wp            wide         (shape->[2,5,4])

Cyanotype answered 18/1, 2016 at 11:46 Comment(0)

XML is designed to be mainstream and relativey "lightweight". JSON is another popular choice but much more suited to object modeling as opposed to data storage.

MySQL is a good option if you need relational querying capabilities.

Lillis answered 19/4, 2011 at 19:50 Comment(2)

I think a lot of people would argue that XML is extremely heavy. JSON was designed because of the weight of XML... – Barns 20/4, 2011 at 5:54

@Jonathan Day: Whether XML or JSON is heavier depends on what you're doing. If your data is more or less "like" CSVs (as it is for OP), then XML like "<x last="Smith" first="Pat" age="27".../> is actually shorter than the corresponding JSON {"last":"Smith","first":"Pat""age":27}; and either one can be loaded with a single line of code. You can shorten JSON by using a plain array instead of names, but you reduce readability and error-checkability (esp. with lots of fields) -- and then your JSON is just CSV with square brackets in place of newlines. – Denn 27/1, 2020 at 17:7

Recommended topics

Hot tags