Store metadata in CSV file
Asked Answered
M

5

8

I understand that this would be a misuse of the CSV format. I do understand that more appropriate solution would be to export xls/xlsx file, or give user an OData end point and let them use power pivot to retrieve data. At this time I want to stick with CSV, because it is easier to implement.

I need to export a flat list of rows (a report) as a CSV file. At the same time I would like to include some metadata about the data. E.g. report title, generation date, and user.

Is there a neat way to accomplish both goals using CSV file format?

The intent usage for this file is to allow the user to open file in Excel and do data crunching.

One solution would be to break CSV into two sections: meta and data. E.g.

Title,Report Generation Date,Generated by User,,
Outstanding Jobs,5/5/2015,[email protected],,
,,,,
Job Started, Estimated Completion, Description, Foo, Bar
9/3/2003,1/1/2006,"do something important""""",5,7
5/4/2007,2/2/2010,"do something else""""",3,12

Or perhaps there is a "standard" extension to CSV format for this purpose?

Are there any practical issues with this approach?

Meshuga answered 9/6, 2015 at 18:24 Comment(4)
From what I've seen it seems having the first line being the meta data and the rest of the lines the data seems to be how it's mostly done.Townsman
However standard CSV parsers won't deal with that out of the box. You'll need to strip off the extra lines before handing the file to a parser, or write your own parser. If you do write your own parser, be aware of how to handle commas, quotes and white space in a "cell".Gerge
NCSA Common log format would work nicely, alas Excel doesn't understand it.Meshuga
Panda's has a CSV parser that deals with skipping lines, check out their header argumentRicard
H
6

When I wanted to add some metadata to a csv file I ended up storing it as a json string as part of the last column header.

timestamp;x;y;z;heading#{"id": "-L4uNnjWXL2cLY_xpU_s"}
89318.329;0.00541;-0.00548;-0.00219;-1.0
89318.339;0.00998;0.0063;-0.00328;-1.0
89318.349;0.0043;0.01318;0.00069;-1.0
89318.359;0.00477;0.0138;0.0007;-1.0

The advantage of this compared to OP's suggestion is that it's a valid csv file in terms of all rows having equal amount of columns and each column containing only one type of data. A parser that knows about the format could split the header at the # and parse the json metadata. A parser that doesn't know of the format will simply display the last column's header with the metadata.

Heterogamete answered 9/2, 2018 at 12:54 Comment(2)
JSON string - clever!Adar
See https://mcmap.net/q/1325501/-saving-header-from-csv-file-using-numpy-genfromtxt for a way to read it back in via python/numpyAdar
G
5

There is no standard extension to CSV that allows for storing of metadata.

You can certainly have a separate section in the file for metadata, but that tends to complicate processing as the CSV parser has to deal with separate headers, etc. Standard CSV parsers won't do that for you.

Consider whether you can store the metadata in a separate file, using a naming convention, e.g.

MyData123.csv
MyData123-Meta.csv

You could bundle both into a Zip archive to keep them together, until they are ready for processing.

Gerge answered 9/6, 2015 at 18:48 Comment(5)
.csvx - my new favorite oxymoron :)Meshuga
@THX-1138: Or .car (Csv ARchive) ;-)Gerge
@EricJ., if one goes all the way to storing the metadata in a separate file, why would she pick a CSV file for that?Ransome
If the OP is using CSV for the main data, presumably it's a format in common use in that organization. There's nothing inherently right or wrong about using a CSV given the limited information available. If the metadata is hierarchical, other formats (XML) may well be more suited.Gerge
Just stumbled across this and thought I'd add, my company uses this approach for data interchange with many other companies because of the simplicity and universality of CSV. Yes, other formats offer a far richer solution (e.g. JSON in today's world). But CSV is simple and easy to process in many environments.Gerge
R
2

There is a W3C recommendation for how to do this.

Here is an example: (Example 21: Tab-separated file containing embedded metadata)

#   publisher   City of Palo Alto
#   updated 12/31/2010
#name   GID on_street   species trim_cycle  inventory_date
#datatype   string  string  string  string  date:M/D/YYYY
    GID On Street   Species Trim Cycle  Inventory Date
    1   ADDISON AV  Celtis australis    Large Tree Routine Prune    10/18/2010
    2   EMERSON ST  Liquidambar styraciflua Large Tree Routine Prune    6/2/2010

There is also a W3C CSV on the web community group.

Reciprocate answered 20/6, 2021 at 16:47 Comment(1)
I've posted a suggestion at the Microsoft site, to add support for this to the default Text File driver for csv files, I'd appreciate an upvote on these issues, who know maybe it convinces Microsoft to look into this issue feedback.azure.com/d365community/search/?q=schema.iniStibine
S
1

Just want to add that you could also use the schema.ini format, so with your .csv file(s) also include a separate schema.ini file which contains something like this:

[yourfilenamehere.txt]
Format=CSVDelimited
ColNameHeader=True
DateTimeFormat=d/m/yyyy
Col1="Job Started" DateTime Width 8
Col2="Estimated Completion" DateTime Width 8
Col3="Description" Text Width 24
Col4="Foo" Integer Width 1
Col5="Bar" Integer Width 2

One schema.ini file can contain the metadata of multiple files, each file metadata is a new section for example [file1.csv] and [file2.csv] etc. And afaik there isn't separate key for a description or title, but you could just add your own custom keyname (so not used by the Text driver) like this:

[yourfilenamehere.txt]
MyReportTitle=Outstanding Jobs (5/5/2015) [email protected]
Format=CSVDelimited
etc..

Or add a line starting with semi-colon which counts as a comment line in .ini files

[yourfilenamehere.txt]
; Outstanding Jobs (5/5/2015) [email protected]
Format=CSVDelimited
etc..
Stibine answered 24/5, 2022 at 12:54 Comment(3)
I like the idea of using an INI file like this for storing metadata, but why not giving it the same name as the CSV file - only a different extension? Then you can even store more CSV files in the same folder - all having their own INI files?Cohlier
A separate .ini file for each data file that is one way to do it, however the Microsoft text driver expects one single name schema.ini file. Btw another idea is to add the meta data as comments in the csv file, see the other answer by @Max-Murphy on this page.Stibine
fyi I've posted feedback to Microsoft about both schema.ini and a SkipLines option for comment lines. I'd appreciate an upvote on these questions (see link below), maybe it helps for Microsoft to notice those requests feedback.azure.com/d365community/search/?q=schema.iniStibine
D
0

I recently ran into this problem, and found a "solution" that worked for me.

The idea is to encode the metadata in a singular cell in the first and/or the last row of the csv, encoded as json.

In your program, you can support this with a stack of decorator layers:

  • A standard component that reads the CSV file as a stream of raw data, that is, a stream where each row is an array of strings (the "cells"). For this layer, the first row, even if it contains the labels, is no different conceptually than all the other rows. Ideally each row would have the same number of cells, but this is not absolutely required.
  • An optional component that can look at the raw file and extract the contents of the last row, without reading the rest of the file.
  • An optional decorator/adapter that consumes the first row, parses the first cell of this row as json, and uses it as metadata.
  • An optional decorator/adapter that discards the last row, knowing that it contains metadata.
  • An optional decorator/adapter layer that consumes the first row as column labels, and then returns each row as an associative array/structure with the column labels as keys. The result of this is a stream of associative arrays, where the values are still strings.
  • An optional decorator/adapter layer that processes the cell contents. E.g. if one column contains json, it would parse those cells and turn them into objects. This layer behaves as a stream of objects, where the property values can be of various types.

I personally did this with PHP, but it should be possible in other languages / environments as well.

Notes

Why would you store some metadata at the end of the file, instead of the beginning?

A typical example would be the total row count, which might only be known after the rest of the file is completely written.

In fact this is the only use case so far where I used this solution. I named these files *.ncsv, indicating that the last row contains the row count. But still this is technically valid csv, if we accept rows with different cell count.

Of course whatever you do here, every piece of code that deals with this file will need to know about the special format you came up with.

Disembody answered 7/6, 2017 at 2:27 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.