C++ get the size (in bytes) of EOL
2

8

I am reading an ASCII text file whose layout is defined by the size of each field, in bytes. E.g. each row consists of 10 bytes for some string, 8 bytes for a floating-point value, 5 bytes for an integer, and so on.

My problem is reading the newline, whose size varies with the OS (usually 2 bytes on Windows and 1 byte on Linux, I believe).

How can I get the size of the EOL character in C++?

For example, in Python I can do:

len(os.linesep)
Hales answered 5/1, 2016 at 7:35 Comment(2)
If you're opening the file in text mode, newlines should always just be '\n', whatever the native line ending is. Do you really need to know the size of the native EOL string? – Parenteral
Is the file guaranteed to have been saved under the same OS as the one your code that reads it runs on? If yes, simply open the file in text (not binary) mode. – Age
1

The time-honored way to do this is to read a line.

The last character should be \n; strip it. Then look at the character before it: if it is \r, strip that too.

For Windows [ascii] text files, there aren't any other possibilities.

This works even if the file is mixed (e.g. some lines are \r\n and some are just \n).

You can tentatively do this on a few lines, just to be sure you're not dealing with something weird.

After that, you know what to expect for most of the file. Still, stripping is the generally reliable method, because a file on Windows may have been imported from Unix (or vice versa).
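
A minimal sketch of the strip approach in C++, assuming the input ends lines with either \n or \r\n (a \r-only file would not be split by this). The names `read_line` and `split_lines` are made up for illustration. `std::getline` already consumes the \n, so only a trailing \r is left to remove:

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: reads one '\n'-terminated line and drops a trailing
// '\r', so "\n" and "\r\n" endings produce the same content.
// Returns false once no more input can be read.
bool read_line(std::istream& in, std::string& line) {
    if (!std::getline(in, line)) {
        return false;
    }
    if (!line.empty() && line.back() == '\r') {
        line.pop_back();
    }
    return true;
}

// Hypothetical helper: splits an in-memory buffer into lines using read_line.
std::vector<std::string> split_lines(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> lines;
    std::string line;
    while (read_line(in, line)) {
        lines.push_back(line);
    }
    return lines;
}
```

For a file on disk, the same `read_line` can be applied to a `std::ifstream` opened with `std::ios::binary`, so that no platform translation happens before the strip. The fixed-width fields can then be cut out of each line with `substr`.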

Unyoke answered 5/1, 2016 at 7:58 Comment(7)
Half a nitpick, but it's hard to read a line without knowing beforehand what the line terminator is. For example, your recipe fails for \r line terminators, and also for consecutive empty lines saved as \r\n\n\n, which have been sighted in Windows-land. – Age
@Age The method works against \r\n\n\n (e.g. \r\n \n \n)--that's just mixed mode as I mentioned [consecutive is non-issue]. I haven't seen a \r only file in 20+ years [if ever, and I've converted 1000s of files]. Not readable by many programs as they now assume [at least] newline. Try DOS type file on one ;-) I don't think even MS supports them anymore. '\r' is valid [as non-terminator] at the beginning of a line (e.g. captured progress output). I've seen much more of that (e.g. \rpgm is 56% done\rpgm is 57% done) – Unyoke
@CraigEstey - Old school Mac files are \r only. See Wikipedia: en.wikipedia.org/wiki/Newline – Dissymmetry
@Dissymmetry I guessed as much, but this is beyond the scope of OP's question. Such a file would need to be converted upon import to the [NTFS] FS to be usable under WinX--so OP would never see them raw. They can be auto-detected/converted, but it's better to just "know" [via cmd line option]. The fastest way to do line reads is via mmap (See my answer: #33616784), so easy enough to prescan first, but hardly worth the extra effort in 99.44% of cases. – Unyoke
@CraigEstey - There are many ways I can think of to get CR-terminated text files. You could boot a Windows machine using a Linux boot disk and copy files from an old drive, etc. Point is - nowhere does the OP mention Windows, copying a file onto a Windows machine doesn't "import to the FS", heck, Vim can generate CR line ending text files on a Windows machine if you really wanted. It doesn't seem "beyond the scope" of the question - indeed it seems the entire point of the question, a point that you have missed. – Dissymmetry
@Dissymmetry I've missed nothing my friend. vim [under Windows] will generate \r\n [vim calls it "dos mode"] and I covered that mixed mode case in my post. You can turn dos mode on/off on either system. That is different than \r only--which is malformed on WinX/unix and must be converted before any common/sane program can use them. OP does mention Windows--reread question. Time to move on ... – Unyoke
@CraigEstey I think you need to learn how to use Vim, and learn how line endings work at the same time. See vim.wikia.com/wiki/File_format: set the file format to mac and everything works fine. Utter nonsense what you say about it being "malformed". Never mind, people like you don't have the ability to learn. Maybe move on to a textbook - 20 years experience, hah, must have missed MacOS 9 then, eh? – Dissymmetry
0

I'm not sure that the translation occurs where you think it does. Look at the following code:

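#include <sstream>
#include <string>

std::ostringstream buf;
buf << std::endl;            // inserts '\n' and flushes; nothing is translated in memory
std::string s = buf.str();
std::size_t i = s.size();    // number of characters written for the line ending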

After this, running on Windows, i == 1. So the end of line that std::endl writes to the stream is a single character. As others have commented, this is the "\n" character.

Dissymmetry answered 5/1, 2016 at 7:47 Comment(3)
This code is wrong because CRT lib doesn't turn \n into \r\n for in-memory buffers, but it does so for files and console. – Pedagogy
Here you are demonstrating the problem I am up against. C++ will convert "\n" into the OS-specific sequence when writing to a file/console, but not to a buffer. – Hales
@Hales I don't think you explained your problem well enough yet. \n doesn't need to (and in fact couldn't) be encoded whatsoever when written to a buffer. But when you write that buffer to a file opened in text mode, the \n will be translated automatically to whatever the platform mandates. Then if you open the same file in text mode and read it back, the newline sequence will be translated back to \n. So, to me at least, it's not clear why you need to know the encoding of \n in the file on disk. – Age
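
The text-mode translation described in these comments can be observed directly, which also gives a rough C++ counterpart to Python's len(os.linesep). The sketch below is only an illustration: `native_eol_size` and the scratch file path are made-up names. It writes one '\n' to a file in text mode and counts the bytes that land on disk when the file is reopened in binary mode. On Windows this would typically report 2, and on Linux 1. It tells you about the platform the program runs on, not about where a given input file was created.

```cpp
#include <cstdio>
#include <fstream>
#include <string>

// Hypothetical helper: writes a single '\n' in text mode to a scratch file,
// reopens the file in binary mode, and returns the number of bytes on disk.
// Returns -1 if the scratch file cannot be written or read.
long native_eol_size(const std::string& scratch_path) {
    {
        std::ofstream out(scratch_path);  // text mode: '\n' may be translated
        if (!out) {
            return -1;
        }
        out << '\n';
    }  // the stream is flushed and closed at the end of this scope

    std::ifstream in(scratch_path, std::ios::binary | std::ios::ate);
    if (!in) {
        return -1;
    }
    const std::streamoff size = in.tellg();  // opened at end: position == size
    in.close();
    std::remove(scratch_path.c_str());       // clean up the scratch file
    return static_cast<long>(size);
}
```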
