Given that you wish to dynamically parse Fortran Format specifier flags, you should note that: you've immediately walked into the realm of parsers.
In addition to the other methods of parsing such input that others have noted here:
- By using Fortran and CC/++ bindings to do the parsing for you.
- Using pure C++ to parse it for you by writing a parser using a combination of:
My proposal is that if boost is available to you, you can use it to implement a simple parser for on-the-fly operations, using a combination of Regexes and STL containers.
From what you've described, and what is shown in different places, you can construct a naive implementation of the grammar you wish to support, using regex captures:
(\\d{0,8})([[:alpha:]])(\\d{0,8})
- Where the first group is the number of that variable type.
- The second is the type of the variable.
- and the third is the length of variable type.
Using this reference for the Fortran Format Specifier Flags, you can implement a naive solution as shown below:
#include <iostream>
#include <string>
#include <vector>
#include <fstream>
#include <cstdlib>
#include <boost/regex.hpp>
#include <boost/tokenizer.hpp>
#include <boost/algorithm/string.hpp>
#include <boost/lexical_cast.hpp>
//A POD Data Structure used for storing Fortran Format Tokens into their relative forms
typedef struct FortranFormatSpecifier {
char type;//the type of the variable
size_t number;//the number of times the variable is repeated
size_t length;//the length of the variable type
} FFlag;
//This class implements a rudimentary parser to parse Fortran Format
//Specifier Flags using Boost regexes.
class FormatParser {
public:
//typedefs for further use with the class and class methods
typedef boost::tokenizer<boost::char_separator<char> > bst_tokenizer;
typedef std::vector<std::vector<std::string> > vvstr;
typedef std::vector<std::string> vstr;
typedef std::vector<std::vector<int> > vvint;
typedef std::vector<int> vint;
FormatParser();
FormatParser(const std::string& fmt, const std::string& fname);
void parse();
void printIntData();
void printCharData();
private:
bool validateFmtString();
size_t determineOccurence(const std::string& numStr);
FFlag setFortranFmtArgs(const boost::smatch& matches);
void parseAndStore(const std::string& line);
void storeData();
std::string mFmtStr; //this holds the format string
std::string mFilename; //the name of the file
FFlag mFmt; //a temporary FFlag variable
std::vector<FFlag> mFortranVars; //this holds all the flags and details of them
std::vector<std::string> mRawData; //this holds the raw tokens
//this is where you will hold all the types of data you wish to support
vvint mIntData; //this holds all the int data
vvstr mCharData; //this holds all the character data (stored as strings for convenience)
};
FormatParser::FormatParser() : mFmtStr(), mFilename(), mFmt(), mFortranVars(), mRawData(), mIntData(), mCharData() {}
FormatParser::FormatParser(const std::string& fmt, const std::string& fname) : mFmtStr(fmt), mFilename(fname), mFmt(), mFortranVars(), mRawData(), mIntData(), mCharData() {}
//this function determines the number of times that a variable occurs
//by parsing a numeric string and returning the associated output
//based on the grammar
size_t FormatParser::determineOccurence(const std::string& numStr) {
size_t num = 0;
//this case means that no number was supplied in front of the type
if (numStr.empty()) {
num = 1;//hence, the default is 1
}
else {
//attempt to parse the numeric string and find it's equivalent
//integer value (since all occurences are whole numbers)
size_t n = atoi(numStr.c_str());
//this case covers if the numeric string is expicitly 0
//hence, logically, it doesn't occur, set the value accordingly
if (n == 0) {
num = 0;
}
else {
//set the value to its converted representation
num = n;
}
}
return num;
}
//from the boost::smatches, determine the set flags, store them
//and return it
FFlag FormatParser::setFortranFmtArgs(const boost::smatch& matches) {
FFlag ffs = {0};
std::string fmt_number, fmt_type, fmt_length;
fmt_number = matches[1];
fmt_type = matches[2];
fmt_length = matches[3];
ffs.type = fmt_type.c_str()[0];
ffs.number = determineOccurence(fmt_number);
ffs.length = determineOccurence(fmt_length);
return ffs;
}
//since the format string is CSV, split the string into tokens
//and then, validate the tokens by attempting to match them
//to the grammar (implemented as a simple regex). If the number of
//validations match, everything went well: return true. Otherwise:
//return false.
bool FormatParser::validateFmtString() {
boost::char_separator<char> sep(",");
bst_tokenizer tokens(mFmtStr, sep);
mFmt = FFlag();
size_t n_tokens = 0;
std::string token;
for(bst_tokenizer::const_iterator it = tokens.begin(); it != tokens.end(); ++it) {
token = *it;
boost::trim(token);
//this "grammar" is based on the Fortran Format Flag Specification
std::string rgx = "(\\d{0,8})([[:alpha:]])(\\d{0,8})";
boost::regex re(rgx);
boost::smatch matches;
if (boost::regex_match(token, matches, re, boost::match_extra)) {
mFmt = setFortranFmtArgs(matches);
mFortranVars.push_back(mFmt);
}
++n_tokens;
}
return mFortranVars.size() != n_tokens ? false : true;
}
//Now, parse each input line from a file and try to parse and store
//those variables into their associated containers.
void FormatParser::parseAndStore(const std::string& line) {
int offset = 0;
int integer = 0;
std::string varData;
std::vector<int> intData;
std::vector<std::string> charData;
offset = 0;
for (std::vector<FFlag>::const_iterator begin = mFortranVars.begin(); begin != mFortranVars.end(); ++begin) {
mFmt = *begin;
for (size_t i = 0; i < mFmt.number; offset += mFmt.length, ++i) {
varData = line.substr(offset, mFmt.length);
//now store the data, based on type:
switch(mFmt.type) {
case 'X':
break;
case 'A':
charData.push_back(varData);
break;
case 'I':
integer = atoi(varData.c_str());
intData.push_back(integer);
break;
default:
std::cerr << "Invalid type!\n";
}
}
}
mIntData.push_back(intData);
mCharData.push_back(charData);
}
//Open the input file, and attempt to parse the input file line-by-line.
void FormatParser::storeData() {
mFmt = FFlag();
std::ifstream ifile(mFilename.c_str(), std::ios::in);
std::string line;
if (ifile.is_open()) {
while(std::getline(ifile, line)) {
parseAndStore(line);
}
}
else {
std::cerr << "Error opening input file!\n";
exit(3);
}
}
//If character flags are set, this function will print the character data
//found, line-by-line
void FormatParser::printCharData() {
vvstr::const_iterator it = mCharData.begin();
vstr::const_iterator jt;
size_t linenum = 1;
std::cout << "\nCHARACTER DATA:\n";
for (; it != mCharData.end(); ++it) {
std::cout << "LINE " << linenum << " : ";
for (jt = it->begin(); jt != it->end(); ++jt) {
std::cout << *jt << " ";
}
++linenum;
std::cout << "\n";
}
}
//If integer flags are set, this function will print all the integer data
//found, line-by-line
void FormatParser::printIntData() {
vvint::const_iterator it = mIntData.begin();
vint::const_iterator jt;
size_t linenum = 1;
std::cout << "\nINT DATA:\n";
for (; it != mIntData.end(); ++it) {
std::cout << "LINE " << linenum << " : ";
for (jt = it->begin(); jt != it->end(); ++jt) {
std::cout << *jt << " ";
}
++linenum;
std::cout << "\n";
}
}
//Attempt to parse the input file, by first validating the format string
//and then, storing the data accordingly
void FormatParser::parse() {
if (!validateFmtString()) {
std::cerr << "Error parsing the input format string!\n";
exit(2);
}
else {
storeData();
}
}
int main(int argc, char **argv) {
if (argc < 3 || argc > 3) {
std::cerr << "Usage: " << argv[0] << "\t<Fortran Format Specifier(s)>\t<Filename>\n";
exit(1);
}
else {
//parse and print stuff here
FormatParser parser(argv[1], argv[2]);
parser.parse();
//print the data parsed (if any)
parser.printIntData();
parser.printCharData();
}
return 0;
}
This is standard c++98 code and can be compiled as follows:
g++ -Wall -std=c++98 -pedantic fortran_format_parser.cpp -lboost_regex
BONUS
This rudimentary parser also works on Characters
too (Fortran Format Flag 'A', for up to 8 characters). You can extend this to support whatever flags you may like by editing the regex and performing checks on the length of captured strings in tandem with the type.
POSSIBLE IMPROVEMENTS
If C++11 is available to you, you can use lambdas
in some places and substitute auto
for the iterators.
If this is running in a limited memory space, and you have to parse a large file, vectors will inevitably crash due to the way how vectors
manages memory internally. It will be better to use deques
instead. For more on that see this as discussed from here:
http://www.gotw.ca/gotw/054.htm
And, if the input file is large, and file I/O is a bottleneck, you can improved performance by modifying the size of the ifstream
buffer:
How to get IOStream to perform better?
DISCUSSION
What you will notice is that: the types that you're parsing must be known at runtime, and any associated storage containers must be supported in the class declaration and definition.
As you would imagine, supporting all types in one main class isn't efficient. However, as this is a naive solution, an improved full solution can be specialized to support these cases.
Another suggestion is to use Boost::Spirit. But, as Spirit uses a lot of templates, debugging such an application is not for the faint of heart when errors can and do occur.
PERFORMANCE
Compared to @Jonathan Dursi's solution, this solution is slow:
For 10,000,000 lines of randomly generated output (a 124MiB file) using this same line format ("3I2, 3X, I3"):
#include <fstream>
#include <cstdlib>
#include <ctime>
using namespace std;
int main(int argc, char **argv) {
srand(time(NULL));
if (argc < 2 || argc > 2) {
printf("Invalid usage! Use as follows:\t<Program>\t<Output Filename>\n");
exit(1);
}
ofstream ofile(argv[1], ios::out);
if (ofile.is_open()) {
for (int i = 0; i < 10000000; ++i) {
ofile << (rand() % (99-10+1) + 10) << (rand() % (99-10+1) + 10) << (rand() % (99-10+1)+10) << "---" << (rand() % (999-100+1) + 100) << endl;
}
}
ofile.close();
return 0;
}
My solution:
0m13.082s
0m13.107s
0m12.793s
0m12.851s
0m12.801s
0m12.968s
0m12.952s
0m12.886s
0m13.138s
0m12.882s
Clocks an average walltime of 12.946s
Jonathan Dursi's solution:
0m4.698s
0m4.650s
0m4.690s
0m4.675s
0m4.682s
0m4.681s
0m4.698s
0m4.675s
0m4.695s
0m4.696s
Blazes with average walltime of 4.684s
His is faster than mine by at least 270% with both on O2.
However, since you don't have to actually modify the source code every time you want to parse an additional format flag, then this solution is more optimal.
Note: you can implement a solution that involves sscanf
/ streams
that only requires you to know what type of variable you wish to read (much like mine), but the additional checks such as verifying the type(s) bloats development time. (This is why I offer my solution in Boost, because of the convenience of tokenizers and regexes - which makes the development process easier).
REFERENCES
http://www.boost.org/doc/libs/1_34_1/libs/regex/doc/character_class_names.html
3I2, 3X, I3
something you get at runtime? – Gemperle