How to safely read a line from an std::istream?
I want to safely read a line from an std::istream. The stream could be anything, e.g., a connection on a Web server or something processing files submitted by unknown sources. Many answers start with the moral equivalent of this code:

void read(std::istream& in) {
    std::string line;
    if (std::getline(in, line)) {
        // process the line
    }
}

Given the possibly dubious source of in, using the above code would lead to a vulnerability: a malicious agent may mount a denial of service attack against this code using a huge line. Thus, I would like to limit the line length to some rather high value, say 4 million chars. While a few large lines may be encountered, it isn't viable to allocate a buffer for each file and use std::istream::getline().

How can the maximum size of the line be limited, ideally without distorting the code too badly and without allocating large chunks of memory up front?

Clio answered 17/12, 2013 at 21:49 Comment(9)
What about a custom allocator that will throw if asked to allocate beyond your threshold? Construct a basic_string object using that allocator and read into it.Disembark
Maybe subclassing std::string and providing a max_size() function that spits out something small?Footcloth
@Praetorian: I guess, using an allocator would be an option. Sadly, it changes the type of std::string.Zhdanov
You could replace the streambuf of in with your own implementation that wraps the original streambuf and sends a '\n' when certain number of characters are read.Kovacev
@DietmarKühl: Maybe you can try to simply check number of chars in buffer before extracting: if (in.rdbuf()->in_avail() > max_size) { /* end */ }...Milamilady
@mb84: in_avail() shows how many characters are known to be available, not how many there will be available in total. Basically, it tells how many characters are currently in the stream buffer's buffer. Once these are read, the stream buffer will provide another buffer of characters, possibly blocking until more characters arrive or it becomes apparent that the source has gone stale.Zhdanov
I wonder whether you should apply security measures at the level of getline, or higher up in your application. E.g. if you are reading from a socket (e.g. through boost::asio::ip::tcp::iostream), limiting the buffer size may guard against a single huge line. But a malicious agent may also mount a DoS attack by sending a huge number of small lines. Does your code also aim to guard against that threat? Or should it somehow be guarded against in your firewall, outside your application?Ritenuto
@TemplateRex: although clearly the total amount of data needs to be controlled, too, there is a way to break out of a loop reading lines after a suitable number of lines. The issue with functions like std::getline() is that there is no obvious way to control when they should stop accepting input. That said, from the approaches suggested the approach limiting the amount of data forwarded by a stream buffer could be used easily to limit the total amount of data (in fact, it is easier and more efficient to limit the total amount of data).Zhdanov
a malicious agent may **mount** a denial of service attack against this code using a huge line -> That sounded soooo awesome but since I'm a C++ noobie, can you please tell how could one do that? :)Prorogue

You could write your own version of std::getline with a parameter for the maximum number of characters to read, called something like getline_n.

#include <string>
#include <iostream>

template<typename CharT, typename Traits, typename Alloc>
auto getline_n(std::basic_istream<CharT, Traits>& in, std::basic_string<CharT, Traits, Alloc>& str, std::streamsize n) -> decltype(in) {
    std::ios_base::iostate state = std::ios_base::goodbit;
    bool extracted = false;
    const typename std::basic_istream<CharT, Traits>::sentry s(in, true);
    if(s) {
        try {
            str.erase();
            typename Traits::int_type ch = in.rdbuf()->sgetc();
            for(; ; ch = in.rdbuf()->snextc()) {
                if(Traits::eq_int_type(ch, Traits::eof())) {
                    // eof spotted, quit
                    state |= std::ios_base::eofbit;
                    break;
                }
                else if(Traits::eq_int_type(ch, Traits::to_int_type(in.widen('\n')))) {
                    // delimiter found: extract and discard it, then quit
                    extracted = true;
                    in.rdbuf()->sbumpc();
                    break;
                }
                else if(str.size() == n) {
                    // maximum number of characters stored: leave the rest in the stream
                    extracted = true;
                    break;
                }
                else if(str.max_size() <= str.size()) {
                    // string too big
                    state |= std::ios_base::failbit;
                    break;
                }
                else {
                    // character valid
                    str += Traits::to_char_type(ch);
                    extracted = true;
                }
            }
        }
        catch(...) {
            in.setstate(std::ios_base::badbit);
        }
    }

    if(!extracted) {
        state |= std::ios_base::failbit;
    }

    in.setstate(state);
    return in;
}

int main() {
    std::string s;
    getline_n(std::cin, s, 10); // maximum of 10 characters
    std::cout << s << '\n';
}

Might be overkill though.

Seesaw answered 17/12, 2013 at 22:13 Comment(4)
Writing a version of getline() could be an option (especially as I have actually done so in the past having implemented all of the IOStreams library). I don't know why it didn't occur to me: maybe I was too focused on two other solutions (only one of which mentioned so far).Zhdanov
+1. The only thing I question is the call to reserve, since the OP is talking on the order of 4Mbytes as a guard, but might only be dealing with much smaller sized strings. It might be better to have the user perform the reserve on their own.Decemvirate
@DaveS Funny enough, the original version that I wrote this didn't have the reserve call but I added it in for good measure. If it was up to me I wouldn't have done it either. I figure I'll just remove it.Seesaw
I'm confused - where does this code actually check for the new line character?Malady

There is already such a getline function as a member function of istream; you just need to wrap it for buffer management.

#include <assert.h>
#include <istream>
#include <stddef.h>         // ptrdiff_t
#include <string>           // std::string, std::char_traits

typedef ptrdiff_t Size;

namespace my {
    using std::istream;
    using std::string;
    using std::char_traits;

    istream& getline(
        istream& stream, string& s, Size const buf_size, char const delimiter = '\n'
        )
    {
        s.resize( buf_size );  assert( s.size() > 1 );
        stream.getline( &s[0], buf_size, delimiter );
        if( !stream.fail() )
        {
            Size const n = char_traits<char>::length( &s[0] );
            s.resize( n );      // Downsizing.
        }
        return stream;
    }
}  // namespace my
Sulphate answered 17/12, 2013 at 22:56 Comment(10)
&s[0] makes me feel uneasyHolmberg
@Inverse: There's nothing to feel uneasy about. You might as well feel uneasy about a division, reasoning that it could turn out to be a division by zero. In this code the relevant constraint on the value (that the string length must be >0) is expressed by an assert, which is generally good practice and makes the code safer than without it. With the assert one must work hard in order to bring forth the nasal daemons of UB: namely, call the function with an invalid argument for buf_size and do this with NDEBUG defined so as to suppress the assert. That's why you should use assert.Sulphate
Oh I meant that, my understanding is the data in a std::string is not guaranteed to be continuous, it's only .c_str() that is. So &v[0] is fine for a std::vector but not strictly for std::stringHolmberg
@Inverse: std::string buffer has been guaranteed contiguous since the Lillehammer meeting in 2005. Of course it wasn't official until C++11, but no compiler vendor would be able to sell a compiler where they introduced the opposite, given both existing code dependencies and known wording in the future C++11 standard. At the time one way to identify pretend-language-lawyers was discussion about whether this could be relied on. ;-)Sulphate
Is copy-on-write std::string an issue? (There certainly were such implementations around in the early 2000's, although C++11 outlaws them)Tunic
@MattMcNabb: No, COW is not an issue here. It has to be compatible with any ordinary single-threaded use of strings, and so it has to perform any deferred copying when operator[] is used (or any reference to the buffer is obtained). At least one of g++'s standard library implementations used COW strings, and I'd be surprised if that has changed even if C++11 really prohibits that. But I'd need to see the wording to be convinced that they're really invalid now. Without knowing anything about the wording, I would think that at most thread safety for reads is required, i.e. an efficiency issue.Sulphate
The SO thread on C++11 COW is hereTunic
@MattMcNabb: Thanks! Well the selected answer doesn't scan, really. Faulty logic, because the COW copying is done the first time an external reference is obtained (for the given buffer allocation), so it doesn't (it can't logically) invalidate any existing references. But a faulty argument doesn't make the position argued for, untrue; it may still be that C++ prohibits COW. Possibly.Sulphate
@MattMcNabb: Luc Danton's answer explains how COW is at least partially prohibited in C++11, by a requirement that the copy constructor makes a buffer copy.Sulphate
@MattMcNabb: Sorry, I'm evidently tired. Luc's answer doesn't show that COW is prohibited in any way. Merely that a call of data() must trigger COW copying if the buffer is currently shared. Hm.Sulphate

Replace std::getline by creating a wrapper around std::istream::getline:

std::istream& my::getline( std::istream& is, std::streamsize n, std::string& str, char delim )
    {
    try
       {
       str.resize(n);
       is.getline(&str[0],n,delim);
       str.resize(is.gcount());
       return is;
       }
    catch(...) { str.resize(0); throw; }
    }

If you want to avoid excessive temporary memory allocations, you could use a loop which grows the allocation as needed (probably doubling in size on each pass). Don't forget that exceptions might or might not be enabled on the istream object.

Here's a version with the more efficient allocation strategy:

std::istream& my::getline( std::istream& is, std::streamsize n, std::string& str, char delim )
    {
    std::streamsize base=0;
    do {
       try
          {
          is.clear();
          std::streamsize chunk=std::min(n-base,std::max(static_cast<std::streamsize>(2),base));
          if ( chunk == 0 ) break;
          str.resize(base+chunk);
          is.getline(&str[base],chunk,delim);
          }
       catch( std::ios_base::failure const& ) { if ( !is.gcount() ) str.resize(0), throw; }
       base += is.gcount();
       } while ( is.fail() && is.gcount() );
    str.resize(base);
    return is;
    }
Sidran answered 17/12, 2013 at 22:52 Comment(3)
As implemented, both of these will leave a terminating '\0' character in the generated string. This is not normal for a C++ string, so an improvement would be to pop the final character before returning. Note that resizing based on scanning the string for '\0' might be considered buggy since '\0' could be a valid character within the string (this isn't 'C'). Also, I don't know exactly how this interacts with Microsoft 'text' mode where text lines are typically terminated by two characters. If I understand the docs, a '\r' would be left in the string because is.getline() is "unformatted".Sidran
... if ( !str.empty() ) str.resize(str.size()-1);Sidran
try { my::getline( is, 4096, s, '\n' ); } catch ( std::ios::failure const & ) {}Sidran

Based on the comments and answers, there seem to be three approaches:

  1. Write a custom version of getline() possibly using the std::istream::getline() member internally to get the actual characters.
  2. Use a filtering stream buffer to limit the amount of data potentially received.
  3. Instead of reading a std::string, use a string instantiation with a custom allocator limiting the amount of memory stored in the string.

Not all of the suggestions came with code. This answer provides code for all approaches and a bit of discussion of all three approaches. Before going into implementation details it is first worth pointing out that there are multiple choices of what should happen if an excessively long input is received:

  1. Reading an overlong line could result in a successful read of a partial line, i.e., the resulting string contains the read content and the stream doesn't have any error flags set. Doing so means, however, that it isn't possible to distinguish between a line hitting exactly the limit or being too long. Since the limit is somewhat arbitrary anyway it probably doesn't really matter, though.
  2. Reading an overlong line could be considered a failure (i.e., setting std::ios_base::failbit and/or std::ios_base::bad_bit) and, since reading failed, yield an empty string. Yielding an empty string, obviously, prevents potentially looking at the string read so far to possibly see what's going on.
  3. Reading an overlong line could provide the partial line read and also set error flags on the stream. This seems reasonable behavior both detecting that there is something up and also providing the input for potential inspection.

Although there are multiple code examples implementing a limited version of getline() already, here is another one! I think it is simpler (albeit possibly slower; performance can be dealt with when necessary) and it also retains std::getline()'s interface: it uses the stream's width() to communicate a limit (maybe taking width() into account is a reasonable extension to std::getline()):

#include <istream>
#include <iterator>
#include <limits>
#include <string>

template <typename cT, typename Traits, typename Alloc>
std::basic_istream<cT, Traits>&
safe_getline(std::basic_istream<cT, Traits>& in,
             std::basic_string<cT, Traits, Alloc>& value,
             cT delim = cT('\n'))
{
    typedef std::basic_string<cT, Traits, Alloc> string_type;
    typedef typename string_type::size_type size_type;

    typename std::basic_istream<cT, Traits>::sentry cerberos(in);
    if (cerberos) {
        value.clear();
        size_type width(in.width(0));
        if (width == 0) {
            width = std::numeric_limits<size_type>::max();
        }
        std::istreambuf_iterator<cT, Traits> it(in), end;
        for (; value.size() != width && it != end; ++it) {
            if (!Traits::eq(delim, *it)) {
                value.push_back(*it);
            }
            else {
                ++it;
                break;
            }
        }
        if (value.size() == width) {
            in.setstate(std::ios_base::failbit);
        }
    }
    return in;
}

This version of getline() is used just like std::getline() but when it seems reasonable to limit the amount of data read, the width() is set, e.g.:

std::string line;
if (safe_getline(in >> std::setw(max_characters), line)) {
    // do something with the input
}

Another approach is to just use a filtering stream buffer to limit the amount of input: the filter would just count the number of characters processed and limit the amount to a suitable number of characters. This approach is actually easier applied to an entire stream than an individual line: when processing just one line, the filter can't just obtain buffers full of characters from the underlying stream because there is no reliable way to put the characters back. Implementing an unbuffered version is still simple but probably not particularly efficient:

template <typename cT, typename Traits = std::char_traits<cT> >
class basic_limitbuf
    : std::basic_streambuf <cT, Traits> {
public:
    typedef Traits                    traits_type;
    typedef typename Traits::int_type int_type;

private:
    std::streamsize                   size;
    std::streamsize                   max;
    std::basic_istream<cT, Traits>*   stream;
    std::basic_streambuf<cT, Traits>* sbuf;

    int_type underflow() {
        if (this->size < this->max) {
            return this->sbuf->sgetc();
        }
        else {
            this->stream->setstate(std::ios_base::failbit);
            return traits_type::eof();
        }
    }
    int_type uflow()     {
        if (this->size < this->max) {
            ++this->size;
            return this->sbuf->sbumpc();
        }
        else {
            this->stream->setstate(std::ios_base::failbit);
            return traits_type::eof();
        }
    }
public:
    basic_limitbuf(std::streamsize max,
                   std::basic_istream<cT, Traits>& stream)
        : size()
        , max(max)
        , stream(&stream)
        , sbuf(this->stream->rdbuf(this)) {
    }
    ~basic_limitbuf() {
        std::ios_base::iostate state = this->stream->rdstate();
        this->stream->rdbuf(this->sbuf);
        this->stream->setstate(state);
    }
};

This stream buffer is already set up to insert itself upon construction and remove itself upon destruction. That is, it can be used simply like this:

std::string line;
basic_limitbuf<char> sbuf(max_characters, in);
if (std::getline(in, line)) {
    // do something with the input
}

It would be easy to add a manipulator setting up the limit, too. One advantage of this approach is that none of the reading code needs to be touched if the total size of the stream could be limited: the filter could be set up right after creating the stream. When there is no need to back out the filter, the filter could also use a buffer which would greatly improve the performance.

The third approach suggested is to use a std::basic_string with a custom allocator. There are two aspects which are a bit awkward about the allocator approach:

  1. The string being read actually has a type which isn't immediately convertible to std::string (although it also isn't hard to do the conversion).
  2. The maximum array size can be easily limited but the string will have some more or less random size smaller than that: when allocation fails, an exception is thrown and there is no attempt to grow the string by a smaller amount.

Here is the necessary code for an allocator limiting the allocated size:

template <typename T>
struct limit_alloc
{
private:
    std::size_t max_;
public:
    typedef T value_type;
    limit_alloc(std::size_t max): max_(max) {}
    template <typename S>
    limit_alloc(limit_alloc<S> const& other): max_(other.max()) {}
    std::size_t max() const { return this->max_; }
    T* allocate(std::size_t size) {
        return size <= max_
            ? static_cast<T*>(operator new(size * sizeof(T)))
            : throw std::bad_alloc();
    }
    void deallocate(T* ptr, std::size_t) {
        operator delete(ptr);
    }
};

template <typename T0, typename T1>
bool operator== (limit_alloc<T0> const& a0, limit_alloc<T1> const& a1) {
    return a0.max() == a1.max();
}
template <typename T0, typename T1>
bool operator!= (limit_alloc<T0> const& a0, limit_alloc<T1> const& a1) {
    return !(a0 == a1);
}

The allocator would be used something like this (the code compiles OK with a recent version of clang but not with gcc):

std::basic_string<char, std::char_traits<char>, limit_alloc<char> >
    tmp(limit_alloc<char>(max_chars));
if (std::getline(in, tmp)) {
    std::string line(tmp.begin(), tmp.end());
    // do something with the input
}

In summary, there are multiple approaches, each with its own small drawback, but each reasonably viable for the stated goal of limiting denial of service attacks based on overlong lines:

  1. Using a custom version of getline() means the reading code needs to be changed.
  2. Using a custom stream buffer is slow unless the entire stream's size can be limited.
  3. Using a custom allocator gives less control and requires some changes to reading code.
Clio answered 17/12, 2013 at 21:50 Comment(0)
