Iterating over a text file using Fortran like format in C++

Asked 23/7, 2013 at 8:53 Answered 22/8, 2013 at 5:24

I am making an application that deals with txt file data.

The idea is that txt files may come in different formats, and they should be read into C++.

One example might be 3I2, 3X, I3, which should be read as: "first we have 3 integers of width 2, then 3 positions to skip, then 1 integer of width 3."

Is it best to iterate over the file line by line, and then walk each line as a string? What would be an effective way to iterate while skipping the 3 positions that should be ignored?

E.g.

101112---100
102113---101
103114---102

to:

10, 11, 12, 100
10, 21, 13, 101
10, 31, 14, 102

Acidify answered 23/7, 2013 at 8:53 Comment(2)

Is 3I2, 3X, I3 something you get at runtime? – Gemperle 16/8, 2013 at 10:31

@Gemperle It will be user input from a GUI, that will be passed to the C++ application on the command line. – Acidify 16/8, 2013 at 10:47

The link given by Kyle Kanos is a good one; *scanf/*printf format strings map pretty well onto Fortran format strings. It's actually easier to do this using C-style I/O, but using C++-style streams is doable as well:

#include <cstdio>
#include <iostream>
#include <fstream>
#include <string>

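int main() {
    std::ifstream fortranfile("input.txt");

    if (!fortranfile.is_open()) {
        std::cerr << "Could not open input.txt\n";
        return 1;
    }

    std::string line;
    // Testing getline itself also handles a last line without a trailing newline.
    while (std::getline(fortranfile, line)) {
        char dummy[4];  // 3 filler characters plus the terminating '\0'
        int i1, i2, i3, i4;

        // %3s assumes the filler is non-blank (as in "---"): %s skips whitespace.
        int n = std::sscanf(line.c_str(), "%2d%2d%2d%3s%3d", &i1, &i2, &i3, dummy, &i4);
        if (n != 5) {
            std::cerr << "Could not parse line: '" << line << "'\n";
            continue;
        }

        std::cout << "Line: '" << line << "' -> " << i1 << " " << i2 << " "
                  << i3 << " " << i4 << std::endl;
    }

    return 0;
}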

Running gives

$ g++ -o readinput readinput.cc
$ ./readinput
Line: '101112---100' -> 10 11 12 100
Line: '102113---101' -> 10 21 13 101
Line: '103114---102' -> 10 31 14 102

Here the format string we're using is %2d%2d%2d%3s%3d - 3 copies of %2d (decimal integer of width 2) followed by %3s (string of width 3, which we read into a variable we never use) followed by %3d (decimal integer of width 3).
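One caveat with %3s: it skips leading whitespace and stops at whitespace, so it only behaves like 3X when the filler is non-blank, as with the --- in the sample. A minimal sketch, assuming the three skipped positions may contain blanks, uses the assignment-suppressed conversion %*3c instead, which consumes exactly three characters and stores nothing. The helper name parse_fixed_line is made up for illustration.

```cpp
#include <cstdio>

// Hypothetical helper: parses one line laid out as 3I2, 3X, I3.
// "%*3c" consumes exactly three characters (blanks included) and, because of
// the '*', assigns nothing, so no dummy buffer is needed.
// Returns true only if all four integers were read.
bool parse_fixed_line(const char* line, int out[4]) {
    int n = std::sscanf(line, "%2d%2d%2d%*3c%3d",
                        &out[0], &out[1], &out[2], &out[3]);
    return n == 4;  // suppressed conversions are not counted
}
```

Called on "102113   101" (three blanks in the middle), this still fills in 10, 21, 13 and 101, whereas the %3s version would stop after the third integer.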

Sippet answered 15/8, 2013 at 18:58 Comment(3)

Though it is not flexible, right? You need to compile in which format it is up front? E.g. if you switch the order of the 3X and an I2, we are in trouble? – Acidify 21/8, 2013 at 9:2

It's like fortran, in that the string has to be known at runtime but it doesn't necessarily have to be hardcoded. You could translate the fortran format string to a scanf format string, but if you're going to be getting it in Fortran format (as is made clear in the comments made on the question after this answer was posted), it may be easier to follow John Zwinck and M.S.B's advice and simply do the reading in Fortran called from C. – Sippet 21/8, 2013 at 15:46

The format is just because it seems to be an easy way to specify file format. There is no actual Fortran code involved, just that I am copying the style. I hoped perhaps you had a dynamic solution using your type of solution. – Acidify 21/8, 2013 at 15:54

Given that you want to parse Fortran format specifier flags dynamically, note that you have walked straight into the realm of parsers.

In addition to the other methods of parsing such input that others have noted here:

- Using Fortran and C/C++ bindings to do the parsing for you.
- Using pure C++ to parse it for you, by writing a parser with a combination of:
  - sscanf
  - streams

My proposal is that if boost is available to you, you can use it to implement a simple parser for on-the-fly operations, using a combination of Regexes and STL containers.

From what you've described, and what is shown in different places, you can construct a naive implementation of the grammar you wish to support, using regex captures:

(\\d{0,8})([[:alpha:]])(\\d{0,8})

Where the first group is the repeat count of that variable type,
the second is the type of the variable,
and the third is the width (length) of the variable type.
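To make the three capture groups concrete, here is a minimal sketch that applies the same pattern to a single descriptor. It assumes C++11 and swaps in std::regex for Boost purely to stay self-contained; the Descriptor struct and parse_descriptor function are hypothetical names, and an empty group is taken to mean 1.

```cpp
#include <cstdlib>
#include <regex>
#include <string>

// Hypothetical holder for one edit descriptor such as "3I2".
struct Descriptor {
    unsigned long repeat;  // group 1: how many times the field repeats
    char type;             // group 2: the descriptor letter (I, X, A, ...)
    unsigned long width;   // group 3: the width of each field
};

// Returns true and fills `out` if `token` matches the naive grammar.
bool parse_descriptor(const std::string& token, Descriptor& out) {
    static const std::regex re("(\\d{0,8})([[:alpha:]])(\\d{0,8})");
    std::smatch m;
    if (!std::regex_match(token, m, re)) {
        return false;
    }
    // An omitted number means 1: "I3" is one integer, "3X" is three 1-wide skips.
    out.repeat = m[1].length() ? std::strtoul(m[1].str().c_str(), nullptr, 10) : 1;
    out.type = m[2].str()[0];
    out.width = m[3].length() ? std::strtoul(m[3].str().c_str(), nullptr, 10) : 1;
    return true;
}
```

With this, "3I2" comes out as repeat 3, type 'I', width 2, and "3X" as repeat 3, type 'X', width 1.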

Using this reference for the Fortran Format Specifier Flags, you can implement a naive solution as shown below:

#include <iostream>
#include <string>
#include <vector>
#include <fstream>
#include <cstdlib>
#include <boost/regex.hpp>
#include <boost/tokenizer.hpp>
#include <boost/algorithm/string.hpp>
#include <boost/lexical_cast.hpp>

//A POD Data Structure used for storing Fortran Format Tokens into their relative forms
typedef struct FortranFormatSpecifier {
    char type;//the type of the variable
    size_t number;//the number of times the variable is repeated
    size_t length;//the length of the variable type
} FFlag;

//This class implements a rudimentary parser to parse Fortran Format
//Specifier Flags using Boost regexes.
class FormatParser {
public:
    //typedefs for further use with the class and class methods
    typedef boost::tokenizer<boost::char_separator<char> > bst_tokenizer;
    typedef std::vector<std::vector<std::string> > vvstr;
    typedef std::vector<std::string> vstr;
    typedef std::vector<std::vector<int> > vvint;
    typedef std::vector<int> vint;

    FormatParser();
    FormatParser(const std::string& fmt, const std::string& fname);

    void parse();
    void printIntData();
    void printCharData();

private:
    bool validateFmtString();
    size_t determineOccurence(const std::string& numStr);
    FFlag setFortranFmtArgs(const boost::smatch& matches);
    void parseAndStore(const std::string& line);
    void storeData();

    std::string mFmtStr;                //this holds the format string
    std::string mFilename;              //the name of the file

    FFlag mFmt;                         //a temporary FFlag variable
    std::vector<FFlag> mFortranVars;    //this holds all the flags and details of them
    std::vector<std::string> mRawData;  //this holds the raw tokens

    //this is where you will hold all the types of data you wish to support
    vvint mIntData;                     //this holds all the int data
    vvstr mCharData;                    //this holds all the character data (stored as strings for convenience)
};

FormatParser::FormatParser() : mFmtStr(), mFilename(), mFmt(), mFortranVars(), mRawData(), mIntData(), mCharData() {}
FormatParser::FormatParser(const std::string& fmt, const std::string& fname) : mFmtStr(fmt), mFilename(fname), mFmt(), mFortranVars(), mRawData(), mIntData(), mCharData() {}

//this function determines the number of times that a variable occurs
//by parsing a numeric string and returning the associated output
//based on the grammar
size_t FormatParser::determineOccurence(const std::string& numStr) {
    //this case means that no number was supplied next to the type,
    //hence the default is 1
    if (numStr.empty()) {
        return 1;
    }
    //otherwise convert the numeric string to its integer value (all
    //occurrences are whole numbers); an explicit "0" yields 0, meaning
    //that, logically, the field does not occur
    return static_cast<size_t>(atoi(numStr.c_str()));
}

//from the boost::smatches, determine the set flags, store them
//and return it
FFlag FormatParser::setFortranFmtArgs(const boost::smatch& matches) {
    FFlag ffs = {0};

    std::string fmt_number, fmt_type, fmt_length;

    fmt_number = matches[1];
    fmt_type = matches[2];
    fmt_length = matches[3];

    ffs.type = fmt_type.c_str()[0];

    ffs.number = determineOccurence(fmt_number);
    ffs.length = determineOccurence(fmt_length);

    return ffs;
}

//since the format string is CSV, split the string into tokens
//and then, validate the tokens by attempting to match them
//to the grammar (implemented as a simple regex). If the number of
//validations match, everything went well: return true. Otherwise:
//return false.
bool FormatParser::validateFmtString() {    
    boost::char_separator<char> sep(",");
    bst_tokenizer tokens(mFmtStr, sep);
    mFmt = FFlag();

    size_t n_tokens = 0;
    std::string token;

    for(bst_tokenizer::const_iterator it = tokens.begin(); it != tokens.end(); ++it) {
        token = *it;
        boost::trim(token);

        //this "grammar" is based on the Fortran Format Flag Specification
        std::string rgx = "(\\d{0,8})([[:alpha:]])(\\d{0,8})";
        boost::regex re(rgx);
        boost::smatch matches;

        if (boost::regex_match(token, matches, re, boost::match_extra)) {
            mFmt = setFortranFmtArgs(matches);
            mFortranVars.push_back(mFmt);
        }
        ++n_tokens;
    }

    return mFortranVars.size() == n_tokens;
}

//Now, parse each input line from a file and try to parse and store
//those variables into their associated containers.
void FormatParser::parseAndStore(const std::string& line) {
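    size_t offset = 0;                  //current column within the line
    int integer = 0;
    std::string varData;
    std::vector<int> intData;
    std::vector<std::string> charData;

    for (std::vector<FFlag>::const_iterator begin = mFortranVars.begin(); begin != mFortranVars.end(); ++begin) {
        mFmt = *begin;

        for (size_t i = 0; i < mFmt.number; offset += mFmt.length, ++i) {
            //substr throws std::out_of_range when offset is past the end of
            //the line, so a field beyond a short line is read as empty
            varData = offset < line.size() ? line.substr(offset, mFmt.length) : std::string();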

            //now store the data, based on type:
            switch(mFmt.type) {
                case 'X':
                  break;

                case 'A':
                  charData.push_back(varData);
                  break;

                case 'I':
                  integer = atoi(varData.c_str());
                  intData.push_back(integer);
                  break;

                default:
                  std::cerr << "Invalid type!\n";
            }
        }
    }
    mIntData.push_back(intData);
    mCharData.push_back(charData);
}

//Open the input file, and attempt to parse the input file line-by-line.
void FormatParser::storeData() {
    mFmt = FFlag();
    std::ifstream ifile(mFilename.c_str(), std::ios::in);
    std::string line;

    if (ifile.is_open()) {
        while(std::getline(ifile, line)) {
            parseAndStore(line);
        }
    }
    else {
        std::cerr << "Error opening input file!\n";
        exit(3);
    }
}

//If character flags are set, this function will print the character data
//found, line-by-line
void FormatParser::printCharData() {    
    vvstr::const_iterator it = mCharData.begin();
    vstr::const_iterator jt;
    size_t linenum = 1;

    std::cout << "\nCHARACTER DATA:\n";

    for (; it != mCharData.end(); ++it) {
        std::cout << "LINE " << linenum << " : ";
        for (jt = it->begin(); jt != it->end(); ++jt) {
            std::cout << *jt << " ";
        }
        ++linenum;
        std::cout << "\n";
    }
}

//If integer flags are set, this function will print all the integer data
//found, line-by-line
void FormatParser::printIntData() {
    vvint::const_iterator it = mIntData.begin();
    vint::const_iterator jt;
    size_t linenum = 1;

    std::cout << "\nINT DATA:\n";

    for (; it != mIntData.end(); ++it) {
        std::cout << "LINE " << linenum << " : ";
        for (jt = it->begin(); jt != it->end(); ++jt) {
            std::cout << *jt << " ";
        }
        ++linenum;
        std::cout << "\n";
    }
}

//Attempt to parse the input file, by first validating the format string
//and then, storing the data accordingly
void FormatParser::parse() {
    if (!validateFmtString()) {
        std::cerr << "Error parsing the input format string!\n";
        exit(2);
    }
    else {
        storeData();
    }
}

int main(int argc, char **argv) {
    if (argc != 3) {
        std::cerr << "Usage: " << argv[0] << "\t<Fortran Format Specifier(s)>\t<Filename>\n";
        exit(1);
    }
    else {
        //parse and print stuff here
        FormatParser parser(argv[1], argv[2]);
        parser.parse();

        //print the data parsed (if any)
        parser.printIntData();
        parser.printCharData();
    }
    return 0;
}

This is standard C++98 code (plus Boost) and can be compiled as follows:

g++ -Wall -std=c++98 -pedantic fortran_format_parser.cpp -lboost_regex

BONUS

This rudimentary parser also works on characters (Fortran format flag 'A'); the regex accepts repeat counts and widths of up to 8 digits. You can extend it to support whatever flags you like by editing the regex and checking the length of the captured strings in tandem with the type.

POSSIBLE IMPROVEMENTS

If C++11 is available to you, you can use lambdas in some places and substitute auto for the iterators.

If this runs in a limited memory space and you have to parse a large file, vectors can fail to allocate: a vector stores its elements contiguously and reallocates as it grows, so it temporarily needs room for both the old and the new buffer. It will be better to use deques instead. For more on that, see the discussion here:

http://www.gotw.ca/gotw/054.htm

And, if the input file is large and file I/O is a bottleneck, you can improve performance by modifying the size of the ifstream buffer:

How to get IOStream to perform better?
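As a rough illustration of the buffer-size idea, the sketch below hands the file stream a caller-sized buffer through pubsetbuf before opening the file. This is an assumption-laden example rather than a tuned solution: the effect of pubsetbuf on a file stream is implementation-defined, the helper name count_lines_buffered is hypothetical, and whether it helps at all should be measured.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: counts the lines in `path` while reading through a
// caller-sized buffer. The effect of pubsetbuf on a file stream is
// implementation-defined; with libstdc++ it has to be called before open().
std::size_t count_lines_buffered(const std::string& path, std::size_t buf_size) {
    std::vector<char> buffer(buf_size);  // declared first so it outlives the stream
    std::ifstream in;
    in.rdbuf()->pubsetbuf(buffer.data(), static_cast<std::streamsize>(buffer.size()));
    in.open(path.c_str());

    std::size_t lines = 0;
    std::string line;
    while (std::getline(in, line)) {
        ++lines;
    }
    return lines;  // 0 if the file could not be opened
}
```

In the parser above, the same two lines (pubsetbuf, then open) would replace the ifstream constructor call in storeData.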

DISCUSSION

What you will notice is that: the types that you're parsing must be known at runtime, and any associated storage containers must be supported in the class declaration and definition.

As you would imagine, supporting all types in one main class isn't efficient. However, as this is a naive solution, an improved full solution can be specialized to support these cases.

Another suggestion is to use Boost.Spirit. But, as Spirit uses a lot of templates, debugging such an application is not for the faint of heart when errors do occur.

PERFORMANCE

Compared to @Jonathan Dursi's solution, this solution is slow:

For 10,000,000 lines of randomly generated output (a 124MiB file) using this same line format ("3I2, 3X, I3"):

#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <fstream>
using namespace std;

int main(int argc, char **argv) {
    srand(time(NULL));
    if (argc != 2) {
        printf("Invalid usage! Use as follows:\t<Program>\t<Output Filename>\n");
        exit(1);
    }

    ofstream ofile(argv[1], ios::out);
    if (ofile.is_open()) {
        for (int i = 0; i < 10000000; ++i) {
             //three 2-digit numbers, the "---" filler, then one 3-digit number;
             //'\n' instead of endl avoids flushing the stream on every line
             ofile << (rand() % (99-10+1) + 10) << (rand() % (99-10+1) + 10) << (rand() % (99-10+1)+10) << "---" << (rand() % (999-100+1) + 100) << '\n';
        }
    }

    ofile.close();
    return 0;
}

My solution:

0m13.082s
0m13.107s
0m12.793s
0m12.851s
0m12.801s
0m12.968s
0m12.952s
0m12.886s
0m13.138s
0m12.882s

Clocks an average walltime of 12.946s

Jonathan Dursi's solution:

0m4.698s
0m4.650s
0m4.690s
0m4.675s
0m4.682s
0m4.681s
0m4.698s
0m4.675s
0m4.695s
0m4.696s

Blazes with average walltime of 4.684s

His is about 2.76 times as fast as mine (4.684s against 12.946s on average), with both on -O2.

However, since you don't have to modify the source code every time you want to parse an additional format flag, this solution is more flexible.

Note: you can implement a solution that involves sscanf / streams that only requires you to know what type of variable you wish to read (much like mine), but the additional checks such as verifying the type(s) bloats development time. (This is why I offer my solution in Boost, because of the convenience of tokenizers and regexes - which makes the development process easier).

REFERENCES

http://www.boost.org/doc/libs/1_34_1/libs/regex/doc/character_class_names.html

Disposed answered 19/8, 2013 at 3:25 Comment(0)

You could translate 3I2, 3X, I3 into a scanf format.
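A minimal sketch of such a translation, assuming only the I (integer) and X (skip) descriptors and a comma-separated format; the function names fortran_to_scanf and read_count are hypothetical. Skips are emitted as the assignment-suppressed %*Nc so that blanks are consumed too.

```cpp
#include <cctype>
#include <cstdlib>
#include <sstream>
#include <string>

// Reads an optional run of digits at tok[i]; returns 1 when none are present.
static unsigned long read_count(const std::string& tok, std::size_t& i) {
    std::size_t start = i;
    while (i < tok.size() && std::isdigit(static_cast<unsigned char>(tok[i]))) ++i;
    if (i == start) return 1;
    return std::strtoul(tok.substr(start, i - start).c_str(), 0, 10);
}

// Hypothetical translator: "3I2, 3X, I3" -> "%2d%2d%2d%*3c%3d".
// Returns an empty string for anything it does not understand.
std::string fortran_to_scanf(const std::string& fmt) {
    std::ostringstream out;
    std::istringstream in(fmt);
    std::string tok;
    while (std::getline(in, tok, ',')) {
        std::size_t i = 0;
        while (i < tok.size() && std::isspace(static_cast<unsigned char>(tok[i]))) ++i;
        unsigned long repeat = read_count(tok, i);   // count before the letter
        if (i == tok.size()) return "";              // no descriptor letter
        char type = tok[i++];
        unsigned long width = read_count(tok, i);    // width after the letter
        while (i < tok.size() && std::isspace(static_cast<unsigned char>(tok[i]))) ++i;
        if (i != tok.size() || repeat == 0 || width == 0) return "";
        if (type == 'I') {
            for (unsigned long r = 0; r < repeat; ++r) out << '%' << width << 'd';
        } else if (type == 'X') {
            out << "%*" << (repeat * width) << 'c';  // skip without storing
        } else {
            return "";
        }
    }
    return out.str();
}
```

This only solves the format half of the problem: a call to sscanf still needs one pointer argument per %d, so the caller has to know (or count) how many integers the format yields.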

Hallucinatory answered 23/7, 2013 at 8:57 Comment(1)

Examples of scanf/fscanf – Norword 23/7, 2013 at 17:55

Given that Fortran is easily callable from C, you could write a little Fortran function to do this "natively." The Fortran READ statement takes a format string as you describe, after all.

If you want this to work, you'll need to brush up on Fortran just a tiny bit (http://docs.oracle.com/cd/E19957-01/806-3593/2_io.html), plus learn how to link Fortran and C++ using your compiler. Here are a few tips:

- The Fortran symbols may be implicitly suffixed with underscore, so MYFUNC may be called from C as myfunc_().
- Multi-dimensional arrays have the opposite ordering of dimensions.
- Declaring a Fortran (or C) function in a C++ header requires placing it in an extern "C" {} scope.

Glucinum answered 17/8, 2013 at 5:12 Comment(1)

If you decide to obtain Fortran IO by calling Fortran, I recommend using the ISO_C_Binding of Fortran. That standardizes the interface between C (and C++ via extern C) and Fortran -- no concerns about underscores! – Brittaney 17/8, 2013 at 8:10

If your user is actually supposed to enter it in the Fortran format, or if you can very quickly adapt or write Fortran code to do this, I would do as John Zwinck and M.S.B. suggest. Just write a short Fortran routine to read the data into an array, and use "bind(c)" and the ISO_C_BINDING types to set up the interface. And remember that the array indexing is going to change between Fortran and C++.

Otherwise, I would recommend using scanf, as mentioned above:

http://en.cppreference.com/w/cpp/io/c/fscanf

If you don't know the number of items per line you need to read, you might be able to use vscanf instead:

http://en.cppreference.com/w/cpp/io/c/vfscanf

However, although it looks convenient, I've never used this, so YMMV.

Skyler answered 18/8, 2013 at 8:50 Comment(0)

Thought about this some today but had no time to write an example. @jrd1's example and analysis are on track, but I'd try to make the parsing more modular and object-oriented. The format string parser could build a list of item parsers that then work more or less independently, which allows adding new ones (floating point, say) without changing old code. I think a particularly nice interface would be an I/O manipulator initialized with a format string, so that the usage would be something like

cin >> f77format("3I2, 3X, I3") >> a >> b >> c >> d;

On implementation I'd have f77format parse the bits and build the parser by components, so it would create 3 fixed width int parsers, a devNull parser and another fixed width parser that would then consume the input.

Of course if you want to support all of the edit descriptors, it would be a big job! And in general it wouldn't just be passing the rest of the string on to the next parser since there are edit descriptors that require re-reading the line.
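A very reduced sketch of that interface, under heavy assumptions: only I and X are supported, one input line is consumed per use, and a flat list of fields stands in for the independent item parsers described above. The name f77format comes from the answer; Field, expand_format and FormattedLine are hypothetical.

```cpp
#include <cctype>
#include <cstdlib>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

struct Field {
    char type;          // 'I' (integer) or 'X' (skip)
    std::size_t width;  // characters consumed by this field
};

// Expands e.g. "3I2, 3X, I3" into the fields I2 I2 I2 X1 X1 X1 I3.
std::vector<Field> expand_format(const std::string& spec) {
    std::vector<Field> fields;
    std::istringstream in(spec);
    std::string tok;
    while (std::getline(in, tok, ',')) {
        const char* p = tok.c_str();
        while (std::isspace(static_cast<unsigned char>(*p))) ++p;
        char* end = 0;
        unsigned long repeat = std::strtoul(p, &end, 10);
        if (end == p) repeat = 1;          // no leading count given
        if (*end == '\0') continue;        // no descriptor letter: ignore token
        Field f;
        f.type = *end;
        p = end + 1;
        unsigned long width = std::strtoul(p, &end, 10);
        f.width = (end == p) ? 1 : width;  // no width given
        fields.insert(fields.end(), repeat, f);
    }
    return fields;
}

struct f77format {
    explicit f77format(const std::string& s) : spec(s) {}
    std::string spec;
};

// One input line paired with its expanded format.
class FormattedLine {
public:
    FormattedLine(const std::string& line, const std::vector<Field>& fields)
        : line_(line), fields_(fields), next_(0), pos_(0), ok_(true) {}

    // Skips any X fields, then converts the next I field into `value`.
    FormattedLine& operator>>(int& value) {
        while (next_ < fields_.size() && fields_[next_].type == 'X') {
            pos_ += fields_[next_++].width;
        }
        if (next_ == fields_.size() || fields_[next_].type != 'I' || pos_ >= line_.size()) {
            ok_ = false;  // ran out of fields or of input
            return *this;
        }
        value = std::atoi(line_.substr(pos_, fields_[next_].width).c_str());
        pos_ += fields_[next_++].width;
        return *this;
    }

    bool ok() const { return ok_; }

private:
    std::string line_;
    std::vector<Field> fields_;
    std::size_t next_;  // index of the next field to consume
    std::size_t pos_;   // current column within line_
    bool ok_;
};

// Reads one line from `in` and returns it wrapped with the parsed format.
FormattedLine operator>>(std::istream& in, const f77format& fmt) {
    std::string line;
    std::getline(in, line);
    return FormattedLine(line, expand_format(fmt.spec));
}
```

With these definitions the line from the answer, cin >> f77format("3I2, 3X, I3") >> a >> b >> c >> d;, compiles for int variables. Descriptors that require re-reading the line would not fit this one-pass design.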

Sadick answered 22/8, 2013 at 5:24 Comment(1)

I want to point out that the format is not specific to Fortran 77, it is for Fortran in general. Calling the function something like f90format might remind (more ignorant) C users that Fortran has been updated multiple times. – Norword 23/8, 2013 at 18:40
