Is there a standard format for describing a flat file?
V

7

15

Is there a standard or open format that can be used to describe the formatting of a flat file? My company integrates many different customer file formats. With an XML file it's easy to get or create an XSD to describe the XML file format. I'm looking for something similar to describe a flat file format (fixed width, delimited, etc.). Stylus Studio uses a proprietary .conv format to do this. That .conv format can be used at runtime to transform an arbitrary flat file into an XML file. I was just wondering whether there is a more open or standards-based method for doing the same thing.

I'm looking for one method of describing a variety of flat file formats, whether they are fixed width or delimited, so CSV is not an answer to this question.

Vermeil answered 14/10, 2009 at 18:53 Comment(9)
I don't know who just voted down this question and all its answers. It's a perfectly valid question with helpful answers. Have a +1 on me.Morning
I'm not sure I agree the answers are particularly helpful since most aren't even answering the question I'm asking, but I don't know why the question would be downvoted :(Vermeil
I did not vote anything down, but the fact that every single answerer misunderstood the question is evidence the question is badly-written. If you want to improve SO you should edit the question so it makes sense, not vote it up.Compulsive
With all the downvoting I'm going to refrain from opining in an answer, but you are looking at an age-old problem; you only have to look at the flat file import routines for Access/Excel to see the issues involved. In your question you state that your clients provide the files in a variety of formats, so you need a method of describing the file format to your system, not a file format that contains a description. Otherwise you could just tell your clients to provide the data in a specified format (and then CSV would be the answer).Haulage
The question is perfectly clear as to his intent. In fact, the title states what he wants exactly.Ragg
@Vermeil - I disagree. Your original question opened with 'Is there a standard or open format which can be used to describe the formatting of a flat file.' CSV fits the bill perfectly (hint: this is why 3 people gave this answer simultaneously). Later you edited your question to be more specific (while keeping the first sentence that made me think you wanted "a" standard), but not until after you had probably downvoted all your responses. BTW, I did not downvote your question.Gilly
Agree with Dour, you're unlikely to get a good answer to such a large and complex problem without being very specific. The industry is littered with disasters at dealing with this problem (EDI comes to mind).Swill
Also would be handy to know the number of formats involved, whether conversion between formats takes place, and the type of data involved (as an industry specific solution may exist).Swill
@Jay Riggs I think if people had read past the first sentence it was clear that I was asking for a more generic way of describing various flat file formats (even prior to my clarification), but I did clarify the question to try and get answers to my actual problem.Vermeil
S
7

XFlat: http://www.infoloom.com/gcaconfs/WEB/philadelphia99/lyons.HTM#N29 http://www.unidex.com/overview.htm

For complex cases (e.g. log files) you may consider a lexical parser.

Sazerac answered 14/10, 2009 at 19:22 Comment(1)
Hey! This one actually answers the question. I found XFlat on an earlier search on the issue, but can't find a whole lot of information on who owns it, if it's a real standard etc. Unidex also provides tools for taking the XFlat description and a flat file in order to transform it into XML (unidex.com/xflat.htm)Vermeil
A
3

About selecting existing flat file formats: There is the Comma-separated values (CSV) format. Or, more generally, DSV. But these are not "fixed-width", since there's a delimiter character (such as a comma) that separates individual cells. Note that though CSV is standardized, not everybody adheres to the standard. Also, CSV may be too simple for your purposes, since it doesn't allow a rich document structure.

In that respect, the standardized and only slightly more complex (but thus more useful) formats JSON and YAML are a better choice. Both are supported out of the box by plenty of languages.

Your best bet is to have a look at all languages listed as non-binary in this overview and then determine which works best for you.

About describing flat file formats: This could be very easy or difficult, depending on the format. Though in most cases easier solutions exist, one way that will work in general is to view the file format as a formal grammar, and write a lexer/parser for it. But I admit, that's quite heavy machinery.

If you're lucky, a couple of advanced regular expressions may do the trick. Most formats will not lend themselves to that, however. If you plan on writing a lexer/parser yourself, I can recommend PLY (Python Lex-Yacc). But many other solutions exist, in many different languages, a lot of them more convenient than the old-school Lex & Yacc. For more, see What parser generator do you recommend?
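As a rough illustration of the regular-expression route, here is a minimal sketch for one fixed-width record. The layout (a 10-character name, an 8-digit date, a 6-character amount) and the names `RECORD` and `parse_line` are made up for the example and do not come from the question.

```python
import re

# Hypothetical layout: name (10 chars), date (8 digits), amount (6 chars).
RECORD = re.compile(r"^(?P<name>.{10})(?P<date>\d{8})(?P<amount>.{6})$")


def parse_line(line):
    """Return a dict of stripped field values, or None if the line does not fit."""
    match = RECORD.match(line)
    if match is None:
        return None
    return {key: value.strip() for key, value in match.groupdict().items()}


# Build a sample line with explicit padding so the column widths are obvious.
sample = "Alice".ljust(10) + "20091014" + "12.50".rjust(6)
row = parse_line(sample)
```

This works for a single rigid layout, but every new customer format needs a new hand-written pattern, which is why a declarative description is attractive.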


  : Yes, that may be an understatement.
  : Even properly describing the email address format is not trivial.

Annatto answered 14/10, 2009 at 18:55 Comment(0)
M
2

COBOL (whether you like it or not) has a standard format for describing fixed-width record formats in files.

Other file formats, however, are somewhat simpler to describe. A CSV file, for example, is just a list of strings. Often the first row of a CSV file is the column names -- that's the description.

There are examples of using JSON to formulate metadata for text files. This can be applied to JSON files, CSV files and fixed-format files.

Look at http://www.projectzero.org/sMash/1.1.x/docs/zero.devguide.doc/zero.resource/declaration.html

This is IBM's sMash (Project Zero) using JSON to encode metadata. You can easily apply this to flat files.
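To make the idea concrete, here is a minimal sketch of a JSON descriptor driving a small parser. The descriptor keys (`type`, `delimiter`, `fields`, `start`, `length`) are invented for this example; they are not the sMash format and not any standard.

```python
import csv
import io
import json

# Hypothetical descriptors; the key names are illustrative only.
FIXED_SPEC = json.loads("""
{
  "type": "fixed",
  "fields": [
    {"name": "id",   "start": 0, "length": 4},
    {"name": "city", "start": 4, "length": 8}
  ]
}
""")

DELIMITED_SPEC = json.loads("""
{
  "type": "delimited",
  "delimiter": "|",
  "fields": [{"name": "id"}, {"name": "city"}]
}
""")


def parse(text, spec):
    """Turn flat-file text into a list of dicts according to the descriptor."""
    fields = spec["fields"]
    if spec["type"] == "fixed":
        # Slice each line by the declared offsets and trim the padding.
        return [
            {f["name"]: line[f["start"]:f["start"] + f["length"]].strip()
             for f in fields}
            for line in text.splitlines()
        ]
    if spec["type"] == "delimited":
        names = [f["name"] for f in fields]
        reader = csv.reader(io.StringIO(text), delimiter=spec["delimiter"])
        return [dict(zip(names, row)) for row in reader]
    raise ValueError("unknown descriptor type: " + spec["type"])


fixed_rows = parse("0001Paris   \n0002Oslo    \n", FIXED_SPEC)
delimited_rows = parse("0001|Paris\n0002|Oslo\n", DELIMITED_SPEC)
```

Both inputs yield the same list of records, so downstream code only sees one internal shape regardless of how the customer file was laid out.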

Mannequin answered 14/10, 2009 at 19:27 Comment(0)
S
1

At the end of the day, you will probably have to define your own file standard that caters specifically to your storage needs. What I suggest is using XML, YAML or JSON as your internal container for all of the file types you receive. On top of this, you will have to implement some extra validation logic to maintain metadata such as the column sizes of the fixed-width files (for importing from and exporting to fixed width). Alternatively, you can store or link a set of metadata to each file you convert to the internal format.

There may be a standard out there, but it's too hard to create 'one size fits all' solutions for these problems. There are entity relationship management tools out there (Talend, others) that make creating these mappings easier, but you will still need to spend a lot of time maintaining file format definitions and rules.

As for enforcing column width, XML might be the best solution, as you can describe the formats using XML schemas (with the length restriction). For YAML or JSON, you may have to write your own logic for this, although I'm sure someone else has come up with a solution.

See XML vs comma delimited text files for further reference.

Swill answered 14/10, 2009 at 19:5 Comment(2)
I don't have a choice as to what format to use. Customers are providing flat files in the form of delimited, fixed width or XML. I have to go from those formats to an internal format. Simple to do with XML, just use an XSLT transformation. Fairly simple to do with delimited, just describe the delimiter and then build an XML file which can have an XSLT applied. More difficult to do with fixed width, you have to describe each field length. I'm looking for an open standard which can describe fixed width and delimited flat files so I don't have to create my own persistence for that metadataVermeil
Alternatively, you could use a tool that knows how to manipulate flat files and convert them to other formats. SSIS comes to mind (SQL Server Integration Services).Lustrous
C
1

I don't know if there is any standard or open format to describe a flat file format. But one industry has done this: the banking industry. Financial institutions communicate using standardized messages over a dedicated network called SWIFT. SWIFT messages were originally positional (before SWIFTML, the XMLified version). I don't know if it's a good suggestion, as it's kinda obscure, but maybe you could look at the SWIFT Formatting Guide; it may give you some ideas.

That said, check out Flatworm, a humble flat file parser. I've used it to parse positional and/or CSV files and liked its XML descriptor format. It may be a better suggestion than SWIFT :)

Calutron answered 14/10, 2009 at 20:12 Comment(0)
D
0

CSV

CSV is a delimited data format that has fields/columns separated by the comma character and records/rows separated by newlines. Fields that contain a special character (comma, newline, or double quote), must be enclosed in double quotes. However, if a line contains a single entry which is the empty string, it may be enclosed in double quotes. If a field's value contains a double quote character it is escaped by placing another double quote character next to it. The CSV file format does not require a specific character encoding, byte order, or line terminator format.
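The quoting rules above can be checked with Python's standard `csv` module. This is only a small sketch; the sample values are made up, and the module's default dialect is assumed to be close enough to the rules described here.

```python
import csv
import io

# Write one row whose fields contain a comma, a double quote and a newline.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["plain", "has,comma", 'has "quote"', "two\nlines"])
encoded = buffer.getvalue()

# Read it back; newline="" lets the reader handle the embedded line break.
decoded = next(csv.reader(io.StringIO(encoded, newline="")))
```

Only the fields with special characters end up wrapped in double quotes, and the embedded quote is doubled, so the row survives a round trip unchanged.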


The CSV entry on Wikipedia allowed me to find a comparison of data serialization formats that is pretty much what you asked for.

Distraction answered 14/10, 2009 at 18:55 Comment(1)
Nice link on the comparison of data serialization formats, thanks!Philology
S
0

The only similar thing I know of is Hachoir, which can currently parse 70 file formats:

http://bitbucket.org/haypo/hachoir/wiki/Home

I'm not sure if it really counts as a declarative language, since it's plugin parser based, but it seems to work, and is extensible, which may meet your needs just fine.

As an aside, there are interesting standardised, extensible flat-file FORMATS, such as IFF (Interchange File Format).

Sebi answered 14/10, 2009 at 19:51 Comment(0)
