changing the delimiter for cin (c++)

I've redirected cin to read from a file stream with cin.rdbuf(inF.rdbuf()). When I use the extraction operator, it reads until it reaches a whitespace character.

Is it possible to use another delimiter? I went through the reference on cplusplus.com, but didn't find anything.

Marionmarionette answered 5/9, 2011 at 0:32 Comment(3)
You don't use operator<< with std::cin, did you mean >>?Coverlet
you could try to include white-space characters in cin buffer.Mattress
@0x69 : That doesn't work. It just means that given the input " A B", extracting the first word gets you " A" instead of "A".Tequilater
M
53

It is possible to change the inter-word delimiter for cin or any other std::istream, using std::ios_base::imbue to add a custom ctype facet.

If you are reading a file in the style of /etc/passwd, the following program will read each :-delimited word separately.

#include <iostream>
#include <locale>
#include <string>


struct colon_is_space : std::ctype<char> {
  colon_is_space() : std::ctype<char>(get_table()) {}
  static mask const* get_table()
  {
    static mask rc[table_size];
    rc[':'] = std::ctype_base::space;
    rc['\n'] = std::ctype_base::space;
    return &rc[0];
  }
};

int main() {
  using std::string;
  using std::cin;
  using std::locale;

  cin.imbue(locale(cin.getloc(), new colon_is_space));

  string word;
  while(cin >> word) {
    std::cout << word << "\n";
  }
}
Mugger answered 5/9, 2011 at 5:33 Comment(4)
Using new in an uncontrolled way is evil; needless to say, you never delete your struct (and there is no way to delete an unnamed pointer). ALWAYS try shared_ptr instead when possible.Drawknife
That is generally excellent advice which does not apply in this specific case. In this case, std::locale::facet is reference-counted, std::locale::locale requires a raw pointer, not a shared pointer, and std::locale::~locale is defined to delete the facet pointer. If you have a problem with the interface to locale, take it up with the standards committee, not me. See the example program at en.cppreference.com/w/cpp/locale/locale/locale Vowel
Even so, I would suggest defining a wrapper function get_locale to wrap this unusual use of new, with comments, so that a code reviewer will realize there is something wrong with the interface, not with the code's author. That is what I mean by a "controlled" way of using new.Drawknife
If not creating new functions, a better way to represent the ownership transfer could be unique_ptr<colon_is_space>(new colon_is_space).release(). Although it is basically the same thing as your code, only more verbose, it indicates that you are transferring pointer ownership.Drawknife
C
25

For strings, you can use the std::getline overloads to read using a different delimiter.
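
As a minimal sketch of the std::getline approach, the helper below splits a stream on a single delimiter character. The names split_stream and split_string are hypothetical, chosen only for illustration, and an in-memory std::istringstream stands in for cin or a file.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: collect every field that `delim` separates in `in`.
std::vector<std::string> split_stream(std::istream& in, char delim) {
    std::vector<std::string> fields;
    std::string field;
    // The three-argument std::getline reads up to `delim` and discards it.
    while (std::getline(in, field, delim)) {
        fields.push_back(field);
    }
    return fields;
}

// Hypothetical convenience wrapper for an in-memory string.
std::vector<std::string> split_string(const std::string& text, char delim) {
    std::istringstream in(text);
    return split_stream(in, delim);
}
```

Note that with this approach spaces and newlines are no longer special: they stay inside the fields unless they happen to be the chosen delimiter.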

For number extraction, the delimiter isn't really "whitespace" to begin with, but any character invalid in a number.
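
A small sketch of that point, assuming the default "C" locale: the hypothetical helper parse_int_prefix extracts an int and reports which character the extraction stopped in front of. ParsedPrefix is likewise a made-up type for illustration.

```cpp
#include <sstream>
#include <string>

// Hypothetical result type: the parsed value plus the character left
// in the stream right after the number ('\0' if nothing is left).
struct ParsedPrefix {
    bool ok;
    int value;
    char next;
};

ParsedPrefix parse_int_prefix(const std::string& text) {
    std::istringstream in(text);
    ParsedPrefix result{false, 0, '\0'};
    if (in >> result.value) {
        result.ok = true;
        // peek() shows the character that ended the numeric extraction.
        const int c = in.peek();
        if (c != std::istringstream::traits_type::eof()) {
            result.next = static_cast<char>(c);
        }
    }
    return result;
}
```

With this helper, "123_456" and "123|456" both yield 123, with '_' and '|' respectively left unread in the stream, just as a space would be.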

Coverlet answered 5/9, 2011 at 0:38 Comment(8)
I'm not sure how you can say the delimiter isn't "whitespace" for numbers, if foo is an int, istringstream("123 456") >> foo; puts 123 in foo, not 123456.Hornwort
@JonathanMee: I didn't say that whitespace aren't delimiters, I said the set of delimiters is not only whitespace. Try istringstream("123_456") >> foo; or Try istringstream("123|456") >> foo;Coverlet
Ahhh, I understand, you're saying that rather than looking for a character defined as ctype_base::space the stream is looking for a character not defined as ctype_base::digit.Hornwort
@JonathanMee: Right, although it's more complex than that, some punctuation characters are allowed during numeric parsing. And obviously whether it is classified as a space may affect the status flags, but whitespace is not the only thing that causes numeric extraction to stop.Coverlet
Does it make sense to expect that std::getline is optimized for performance?Cousin
@Cousin streams in general are one of the least performant things in the standard. But typically you're going to use streams with input/output so slow performance will be negligible relative to the cost of the input/output operation. For performance reasons though arrays should be preferred over streams.Hornwort
@JonathanMee: "slow performance will be negligible relative to the cost of the input/output operation" has NEVER been true in my experience. The fact is that in many applications both file I/O and parsing are negligible compared to the cost of other processing, or waiting for the user to hit the start button, or network requests. But in I/O heavy applications built with iostreams, it's the iostream code, not the I/O operations, that dominates.Coverlet
Hmmm... I guess it's the type of project that I have a history with. Thanks for the clarification. It's good to have a balancing point of view. I suppose a better answer for @Wolf's question would be: "getline is no slower than the stream is as a whole, but if performance is a concern for you, you should look for non-stream options."Hornwort
H
19

This is an improvement on Robᵩ's answer, because that is the right one (and I'm disappointed that it hasn't been accepted.)

What you need to do is change the array that ctype looks at to decide what a delimiter is.

In the simplest case you could create your own:

const ctype<char>::mask foo[ctype<char>::table_size] = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ctype_base::space};

On my machine '\n' is 10. I've set that element of the array to the delimiter value: ctype_base::space. A ctype initialized with foo would only delimit on '\n' not ' ' or '\t'.

This is a problem, though, because the array passed into ctype defines more than just what a delimiter is; it also defines letters, numbers, symbols, and some other junk needed for streaming. (Ben Voigt's answer touches on this.) So what we really want to do is modify a mask, not create one from scratch.

That can be accomplished like this:

const auto temp = ctype<char>::classic_table();
vector<ctype<char>::mask> bar(temp, temp + ctype<char>::table_size);

bar[' '] ^= ctype_base::space;
bar['\t'] &= ~(ctype_base::space | ctype_base::cntrl);
bar[':'] |= ctype_base::space;

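A ctype initialized with bar would delimit on '\n' and ':' but not ' ' or '\t'. (The other classic whitespace characters, '\v', '\f' and '\r', keep their space bit and so still delimit too.)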

You go about setting up cin, or any other istream, to use your custom ctype like this:

cin.imbue(locale(cin.getloc(), new ctype<char>(data(bar))));
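
Putting the pieces above together, here is a self-contained sketch that uses a std::istringstream in place of cin. The function name split_colon_fields is hypothetical. One assumption worth stating loudly: a ctype<char> constructed from a table pointer keeps that pointer rather than copying the table, so the table has to outlive every locale that uses the facet, which is why it is static here.

```cpp
#include <locale>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split `text` on ':' and '\n', keeping spaces
// and tabs as ordinary characters inside each field.
std::vector<std::string> split_colon_fields(const std::string& text) {
    using ctype_char = std::ctype<char>;

    // Static so the table outlives the facet that points at it.
    static const std::vector<ctype_char::mask> table = [] {
        const ctype_char::mask* classic = ctype_char::classic_table();
        std::vector<ctype_char::mask> t(classic, classic + ctype_char::table_size);
        t[' '] &= ~std::ctype_base::space;   // space no longer delimits
        t['\t'] &= ~std::ctype_base::space;  // tab no longer delimits
        t[':'] |= std::ctype_base::space;    // colon now delimits
        return t;
    }();

    std::istringstream in(text);
    // The locale takes ownership of the facet and deletes it when done.
    in.imbue(std::locale(in.getloc(), new ctype_char(table.data())));

    std::vector<std::string> fields;
    for (std::string field; in >> field;) {
        fields.push_back(field);
    }
    return fields;
}
```

For example, the input "root:x:0:System Administrator" would come back as four fields, the last one containing the embedded space.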

You can also switch between ctypes and the behavior will change mid-stream:

cin.imbue(locale(cin.getloc(), new ctype<char>(foo)));

If you need to go back to default behavior, just do this:

cin.imbue(locale(cin.getloc(), new ctype<char>));
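
Here is a sketch of switching mid-stream and back again, under the assumption that the input looks like "name:rest of line". The names Record, colon_table and read_name_then_words are hypothetical. One detail to watch: the extraction stops in front of the delimiter and leaves it in the stream, so after restoring the default ctype the ':' has to be skipped by hand.

```cpp
#include <locale>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record: one colon-terminated name, then ordinary words.
struct Record {
    std::string name;
    std::vector<std::string> words;
};

// Table in which only ':' and '\n' count as whitespace. It is static
// because the facet keeps a pointer to it instead of copying it.
const std::ctype<char>::mask* colon_table() {
    static std::ctype<char>::mask table[std::ctype<char>::table_size] = {};
    table[':'] = std::ctype_base::space;
    table['\n'] = std::ctype_base::space;
    return table;
}

Record read_name_then_words(const std::string& text) {
    std::istringstream in(text);
    Record record;

    // Custom facet: spaces are ordinary characters, ':' ends the word.
    in.imbue(std::locale(in.getloc(), new std::ctype<char>(colon_table())));
    in >> record.name;
    in.ignore();  // drop the ':' that stopped the extraction

    // Back to the default classification: whitespace delimits again.
    in.imbue(std::locale(in.getloc(), new std::ctype<char>));
    for (std::string word; in >> word;) {
        record.words.push_back(word);
    }
    return record;
}
```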

Live example

Hornwort answered 28/1, 2015 at 16:25 Comment(2)
that will set bar['\t'] to zero, probably not intended. To clear a bit, use &~ (bit-wise AND with bit-wise NOT). ! is logical NOT and won't have the desired effect.Coverlet
@BenVoigt Thank you, I wanted to strip out the space and cntrl bits and I accidentally got everything.Hornwort
N
5

This is an improvement on Jon's answer, and the example from cppreference.com. So this follows the same premise as both, but combines them with parameterized delimiters.

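#include <cstddef>
#include <locale>
#include <string>
#include <vector>

struct delimiter_ctype : std::ctype<char> {
    static const mask* make_table(const std::string& delims)
    {
        // Start from a fresh copy of the "C" locale table on every call.
        // The table is static because std::ctype<char> keeps a pointer to it
        // rather than copying it, so all delimiter_ctype objects share this
        // one table and the most recent call decides its contents.
        static std::vector<mask> v;
        v.assign(classic_table(), classic_table() + table_size);
        for (mask& m : v) {  // by reference, so the change is stored
            m &= ~space;
        }
        for (char d : delims) {
            // Cast avoids a negative index where char is signed.
            v[static_cast<unsigned char>(d)] |= space;
        }
        return &v[0];
    }
    delimiter_ctype(const std::string& delims, std::size_t refs = 0)
        : ctype(make_table(delims), false, refs) {}
};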

Cheers!

Nannette answered 20/3, 2019 at 23:25 Comment(0)
