Read a file/URL line-by-line in Swift

I am trying to read a file given in an NSURL and load it into an array, with items separated by a newline character \n.

Here is the way I've done it so far:

var possList: NSString? = NSString.stringWithContentsOfURL(filePath.URL) as? NSString
if var list = possList {
    list = list.componentsSeparatedByString("\n") as NSString[]
    return list
}
else {
    //return empty list
}

I'm not very happy with this for a couple of reasons. First, I'm working with files that range from a few kilobytes to hundreds of MB in size. As you can imagine, working with strings this large is slow and unwieldy. Second, this freezes the UI while it's executing, which is also not good.

I've looked into running this code in a separate thread, but I've been having trouble with that, and besides, it still doesn't solve the problem of dealing with huge strings.

What I'd like to do is something along the lines of the following pseudocode:

var aStreamReader = new StreamReader(from_file_or_url)
while aStreamReader.hasNextLine == true {
    currentline = aStreamReader.nextLine()
    list.addItem(currentline)
}

How would I accomplish this in Swift?

A few notes about the files I'm reading from: All files consist of short (<255 chars) strings separated by either \n or \r\n. The files range in length from ~100 lines to over 50 million lines. They may contain European characters and/or characters with accents.

Corcyra answered 4/7, 2014 at 23:4 Comment(1)
Are you wanting to write the array out to disk as you go or just let the OS handle it with memory? Will the Mac running it have enough ram that you could map the file and work with it that way? Multiple tasks are easy enough to do, and I suppose you could have multiple jobs that start reading the file at different places.Poore
S
166

(The code is for Swift 2.2/Xcode 7.3 now. Older versions can be found in the edit history if somebody needs it. An updated version for Swift 3 is provided at the end.)

The following Swift code is heavily inspired by the various answers to How to read data from NSFileHandle line by line?. It reads from the file in chunks, and converts complete lines to strings.

The default line delimiter (\n), string encoding (UTF-8) and chunk size (4096) can be set with optional parameters.

class StreamReader  {

    let encoding : UInt
    let chunkSize : Int

    var fileHandle : NSFileHandle!
    let buffer : NSMutableData!
    let delimData : NSData!
    var atEof : Bool = false

    init?(path: String, delimiter: String = "\n", encoding : UInt = NSUTF8StringEncoding, chunkSize : Int = 4096) {
        self.chunkSize = chunkSize
        self.encoding = encoding

        if let fileHandle = NSFileHandle(forReadingAtPath: path),
            delimData = delimiter.dataUsingEncoding(encoding),
            buffer = NSMutableData(capacity: chunkSize)
        {
            self.fileHandle = fileHandle
            self.delimData = delimData
            self.buffer = buffer
        } else {
            self.fileHandle = nil
            self.delimData = nil
            self.buffer = nil
            return nil
        }
    }

    deinit {
        self.close()
    }

    /// Return next line, or nil on EOF.
    func nextLine() -> String? {
        precondition(fileHandle != nil, "Attempt to read from closed file")

        if atEof {
            return nil
        }

        // Read data chunks from file until a line delimiter is found:
        var range = buffer.rangeOfData(delimData, options: [], range: NSMakeRange(0, buffer.length))
        while range.location == NSNotFound {
            let tmpData = fileHandle.readDataOfLength(chunkSize)
            if tmpData.length == 0 {
                // EOF or read error.
                atEof = true
                if buffer.length > 0 {
                    // Buffer contains last line in file (not terminated by delimiter).
                    let line = NSString(data: buffer, encoding: encoding)

                    buffer.length = 0
                    return line as String?
                }
                // No more lines.
                return nil
            }
            buffer.appendData(tmpData)
            range = buffer.rangeOfData(delimData, options: [], range: NSMakeRange(0, buffer.length))
        }

        // Convert complete line (excluding the delimiter) to a string:
        let line = NSString(data: buffer.subdataWithRange(NSMakeRange(0, range.location)),
            encoding: encoding)
        // Remove line (and the delimiter) from the buffer:
        buffer.replaceBytesInRange(NSMakeRange(0, range.location + range.length), withBytes: nil, length: 0)

        return line as String?
    }

    /// Start reading from the beginning of file.
    func rewind() -> Void {
        fileHandle.seekToFileOffset(0)
        buffer.length = 0
        atEof = false
    }

    /// Close the underlying file. No reading must be done after calling this method.
    func close() -> Void {
        fileHandle?.closeFile()
        fileHandle = nil
    }
}

Usage:

if let aStreamReader = StreamReader(path: "/path/to/file") {
    defer {
        aStreamReader.close()
    }
    while let line = aStreamReader.nextLine() {
        print(line)
    }
}

You can even use the reader with a for-in loop

for line in aStreamReader {
    print(line)
}

by implementing the SequenceType protocol (compare http://robots.thoughtbot.com/swift-sequences):

extension StreamReader : SequenceType {
    func generate() -> AnyGenerator<String> {
        return AnyGenerator {
            return self.nextLine()
        }
    }
}

Update for Swift 3/Xcode 8 beta 6: Also "modernized" to use guard and the new Data value type:

class StreamReader  {

    let encoding : String.Encoding
    let chunkSize : Int
    var fileHandle : FileHandle!
    let delimData : Data
    var buffer : Data
    var atEof : Bool

    init?(path: String, delimiter: String = "\n", encoding: String.Encoding = .utf8,
          chunkSize: Int = 4096) {

        guard let fileHandle = FileHandle(forReadingAtPath: path),
            let delimData = delimiter.data(using: encoding) else {
                return nil
        }
        self.encoding = encoding
        self.chunkSize = chunkSize
        self.fileHandle = fileHandle
        self.delimData = delimData
        self.buffer = Data(capacity: chunkSize)
        self.atEof = false
    }

    deinit {
        self.close()
    }

    /// Return next line, or nil on EOF.
    func nextLine() -> String? {
        precondition(fileHandle != nil, "Attempt to read from closed file")

        // Read data chunks from file until a line delimiter is found:
        while !atEof {
            if let range = buffer.range(of: delimData) {
                // Convert complete line (excluding the delimiter) to a string:
                let line = String(data: buffer.subdata(in: 0..<range.lowerBound), encoding: encoding)
                // Remove line (and the delimiter) from the buffer:
                buffer.removeSubrange(0..<range.upperBound)
                return line
            }
            let tmpData = fileHandle.readData(ofLength: chunkSize)
            if tmpData.count > 0 {
                buffer.append(tmpData)
            } else {
                // EOF or read error.
                atEof = true
                if buffer.count > 0 {
                    // Buffer contains last line in file (not terminated by delimiter).
                    let line = String(data: buffer, encoding: encoding)
                    buffer.count = 0
                    return line
                }
            }
        }
        return nil
    }

    /// Start reading from the beginning of file.
    func rewind() -> Void {
        fileHandle.seek(toFileOffset: 0)
        buffer.count = 0
        atEof = false
    }

    /// Close the underlying file. No reading must be done after calling this method.
    func close() -> Void {
        fileHandle?.closeFile()
        fileHandle = nil
    }
}

extension StreamReader : Sequence {
    func makeIterator() -> AnyIterator<String> {
        return AnyIterator {
            return self.nextLine()
        }
    }
}
Scintillate answered 9/7, 2014 at 8:38 Comment(39)
Where would I put the extension code block? In the StreamReader class?Corcyra
@Matt: It does not matter. You can put the extension in the same Swift file as the "main class", or in a separate file. - Actually you don't really need an extension. You can add the generate() function to the StreamReader class and declare that as class StreamReader : Sequence { ... }. But it seems to be good Swift style to use extensions for separate pieces of functionality.Scintillate
Does this support Western European characters and accent marks (just the marks themselves)?Corcyra
Or, is it possible to have it handle unrecognized characters?Corcyra
@Matt: Do you know which characters set is used in the text file? You could try NSWindowsCP1252StringEncoding or NSISOLatin1StringEncoding, compare #13929903.Scintillate
This is the file: whitehatenterprises.com/downloads/mangled.txt.zip (Caution, it's pretty big) I'm running in to trouble around line 1704240.Corcyra
Don't worry about it. Using NSISOLatin1StringEncoding seemed to fix what was happening :)Corcyra
For Swift 1.1 and Xcode 6.1 I created an updated version of your StreamReader: gist.github.com/klaas/4691612802f540b6a9c5Militate
Failable initializers come in pretty handy ;-)Militate
line seems to be always nil for me. Any idea what I could be doing wrong? I am getting the path from my NSURL through url.path!Shatzer
@zanzoken: What kind of URL are you using? The above code works only for file URLs. It cannot be used to read from a general server URL. Compare #26674682 and my comments under the question.Scintillate
I am using fileURLs which I fetched from a share extension I wrote. So basically they are picture files.Shatzer
@zanzoken: My code is meant for text files, and expects the file to use a specified encoding (UTF-8 by default). If you have a file with arbitrary binary bytes (such as an image file) then the data->string conversion will fail.Scintillate
Is there a way to do it for images?Shatzer
@zanzoken: Reading scan lines from an image is a completely different topic and has nothing to do with this code, sorry. I am sure that it can be done for example with CoreGraphics methods, but I do not have an immediate reference for you.Scintillate
I was just about to write this up myself but I had time constraints. Thanks for this!Fraud
was thinking about writing some C code but this is good stuff.Steed
Great post, but I found a couple of errors using Swift 2. In the buffer.rangeOfData call, options cannot be nil or it says "cannot invoke 'rangeOfData' with an argument list of type [the arguments]." Changing nil to NSDataSearchOptions() or some actual instance of that object fixes the problem. Also, this is minor, but since tmpData is only assigned a value once, it can be made a constant by changing the var before it to let.Zalucki
@5happy1: options: nil has to be replaced by options: [] as I mentioned in the "Update for Swift 2" at the end of the answer. You are right about the constant, I have fixed that. Thanks for the feedback!Scintillate
@MartinR You're welcome! I'm sorry I corrected the options thing, but I just didn't completely understand it on my first read-through. Thanks for clarifying it though, I understand it now. :)Zalucki
Attention all Int(aStreamReader.nextLine()) people!!! If that line (or something like it with some optional/forced unwraps (? or !)) is giving you trouble because it becomes nil, then simply remove the last character from what nextLine() returns, because it returns something like "17\r". To remove it, follow the instructions here: #24122788 except count the characters by the method seen here (in Swift 2): https://mcmap.net/q/53673/-get-the-length-of-a-stringZalucki
For some reason, using a relative path such as ./file.txt doesn't work.Coronet
@applemavs: This works with relative paths as well (I just tested it). Did you verify that the file is located in the current working directory of the process?Scintillate
This is awesome. Is it possible to use the code with a string instead of a file? I have been trying but cannot figure it out. Thanks!Cuspidate
@IgorTupitsyn: It should be possible to extend the code to work with strings, but why would you want that? The purpose of this routine was to read huge files line by line, so that you don't have to load the entire file into memory. If you already have a string then you can just use componentsSeparatedByString to split it into lines.Scintillate
Martin. I am using your great solution to extract parts of texts from a big file (these parts are separated from each other by empty lines). Each part consists of multiple lines of text (sometimes as long as 30-40 lines). So, I then need to break the extracted part into separate lines. In C++ I used getline to do both tasks, which was very quick. And I was thinking of a similar solution here. Thanks a lot!Cuspidate
in the line delimData = delimiter.dataUsingEncoding(encoding) I would suggest to replace encoding by NSUTF8StringEncoding as the delimiter is coming from the source file. If you encode a UTF16-file it will not work otherwiseNinety
@Christian: Are you sure? If, for example, delim="\n" and encoding= NSUTF16LittleEndianStringEncoding, then delimData is set to <0A 00> and matches the UTF-16 newline character in the file.Scintillate
I just parsed an .strings-file which is UTF16 with my source being UTF8 and I had to do it like this. I thought that was the reasonNinety
@Rodrigo: Thank you for the edit. options: .Anchored is not the correct solution however because it ties the search for the delimiter to the start of the data. Also I had already added an "addendum" for Swift 2. – But I have taken the opportunity to clean-up the answer and remove all pre-2.0 stuff, hopefully that avoids future confusion.Scintillate
What happens if there's an I/O error? As far as I can tell, FileHandle still throws Objective-C exceptions that are uncatchable in Swift.Keneth
I had a spike in memory graph while iterate over lines. Wrap the code in an autoreleasepool solved.Minimal
@Eporediese I'm facing the exact same issue! Could you provide a code snippet how did you solve it using autoreleasepool? Thanks in advanceRutheruthenia
@DCDC while !aStreamReader.atEof { try autoreleasepool { guard let line = aStreamReader.nextLine() else { return } ...code... } } Minimal
You may want to wrap buffer.removeSubrange(0..<range.upperBound) in autoreleasepool to reduce memory usage when reading large files.Knight
Works in Swift 4. Test at this GitHub repo.Remittent
Is there any reason why you don't just directly conform StreamReader to the IteratorProtocol and rename nextLine to next? This would obviate the need for the extension to StreamReader that declares the makeIterator method.Boring
@PeterSchorn: No particular reason. I just wrote the nextLine method first and added the sequence conformance later. Also “nextLine” might describe the purpose of the function better than “next“ but that is of course a matter of personal taste.Scintillate
if let aStreamReader = StreamReader(path: filePath, delimiter: "\n") { defer { aStreamReader.close() } while aStreamReader.atEof == false { while let line = aStreamReader.nextLine() { print(line) } } }Cuckoo
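Several comments above deal with files that are not valid UTF-8 (see the ISO Latin-1 suggestion). As a minimal sketch, assuming Swift 5 and Foundation, a per-line decoder could try UTF-8 first and fall back to ISO Latin-1. The name decodeLine is hypothetical and is not part of the StreamReader class from the answer:

```swift
import Foundation

/// Hypothetical helper (not from the answer): decodes one line of raw bytes.
/// Tries UTF-8 first; if the bytes are not valid UTF-8, falls back to
/// ISO Latin-1, which maps every byte value to a character.
func decodeLine(_ bytes: Data) -> String? {
    return String(data: bytes, encoding: .utf8)
        ?? String(data: bytes, encoding: .isoLatin1)
}
```

The fallback only guesses the encoding. If the file is known to use a specific encoding, passing that encoding to the reader directly, as the answer's encoding parameter allows, is likely the more reliable choice.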
T
38

Efficient and convenient class for reading a text file line by line (Swift 4, Swift 5)

Note: This code is platform independent (macOS, iOS, Ubuntu)

import Foundation

/// Read text file line by line in efficient way
public class LineReader {
   public let path: String

   fileprivate let file: UnsafeMutablePointer<FILE>!

   init?(path: String) {
      self.path = path
      file = fopen(path, "r")
      guard file != nil else { return nil }
   }

   public var nextLine: String? {
      var line: UnsafeMutablePointer<CChar>?
      var linecap: Int = 0
      defer { free(line) }
      return getline(&line, &linecap, file) > 0 ? String(cString: line!) : nil
   }

   deinit {
      fclose(file)
   }
}

extension LineReader: Sequence {
   public func makeIterator() -> AnyIterator<String> {
      return AnyIterator<String> {
         return self.nextLine
      }
   }
}

Usage:

guard let reader = LineReader(path: "/Path/to/file.txt") else {
    return; // cannot open file
}

for line in reader {
    print(">" + line.trimmingCharacters(in: .whitespacesAndNewlines))      
}

Repository on github

Talyah answered 28/11, 2016 at 23:12 Comment(0)
A
8

Swift 4.2 Safe syntax

class LineReader {

    let path: String

    init?(path: String) {
        self.path = path
        guard let file = fopen(path, "r") else {
            return nil
        }
        self.file = file
    }
    deinit {
        fclose(file)
    }

    var nextLine: String? {
        var line: UnsafeMutablePointer<CChar>?
        var linecap = 0
        defer {
            free(line)
        }
        let status = getline(&line, &linecap, file)
        guard status > 0, let unwrappedLine = line else {
            return nil
        }
        return String(cString: unwrappedLine)
    }

    private let file: UnsafeMutablePointer<FILE>
}

extension LineReader: Sequence {
    func makeIterator() -> AnyIterator<String> {
        return AnyIterator<String> {
            return self.nextLine
        }
    }
}

Usage:

guard let reader = LineReader(path: "/Path/to/file.txt") else {
    return
}
reader.forEach { line in
    print(line.trimmingCharacters(in: .whitespacesAndNewlines))      
}
Alcantara answered 3/1, 2019 at 23:17 Comment(0)
E
4

I'm late to the game, but here's a small class I wrote for that purpose. After a few different attempts (such as trying to subclass NSInputStream) I found this to be a reasonable and simple approach.

Remember to #import <stdio.h> in your bridging header.

// Use it like this:
let readLine = ReadLine(somePath)
while let line = readLine.readLine() {
    // do something...
}

class ReadLine {

    private var buf = UnsafeMutablePointer<Int8>.alloc(1024)
    private var n: Int = 1024

    let path: String
    let mode: String = "r"

    private lazy var filepointer: UnsafeMutablePointer<FILE> = {
        let csmode = self.mode.withCString { cs in return cs }
        let cspath = self.path.withCString { cs in return cs }

        return fopen(cspath, csmode)
    }()

    init(path: String) {
        self.path = path
    }

    func readline() -> String? {
        // unsafe for unknown input
        if getline(&buf, &n, filepointer) > 0 {
            return String.fromCString(UnsafePointer<CChar>(buf))
        }

        return nil
    }

    deinit {
        buf.dealloc(n)
        fclose(filepointer)
    }
}
Epimenides answered 6/6, 2015 at 14:12 Comment(1)
I like this, but it can still be improved. Creating pointers using withCString is not necessary (and actually really unsafe), you can simply call return fopen(self.path, self.mode). One might add a check if the file really could be opened, currently readline() will just crash. The UnsafePointer<CChar> cast is not needed. Finally, your usage example does not compile.Scintillate
P
4

This function takes a file URL and returns a sequence which will return every line of the file, reading them lazily. It works with Swift 5. It relies on the underlying getline:

typealias LineState = (
  // pointer to a C string representing a line
  linePtr:UnsafeMutablePointer<CChar>?,
  linecap:Int,
  filePtr:UnsafeMutablePointer<FILE>?
)

/// Returns a sequence which iterates through all lines of the file at the URL.
///
/// - Parameter url: file URL of a file to read
/// - Returns: a Sequence which lazily iterates through lines of the file
///
/// - warning: the caller of this function **must** iterate through all lines of the file, since aborting iteration midway will leak memory and a file pointer
/// - precondition: the file must be UTF-8 encoded (which includes ASCII-encoded files)
func lines(ofFile url:URL) -> UnfoldSequence<String,LineState>
{
  let initialState:LineState = (linePtr:nil, linecap:0, filePtr:fopen(url.path,"r"))
  return sequence(state: initialState, next: { (state) -> String? in
    if getline(&state.linePtr, &state.linecap, state.filePtr) > 0,
      let theLine = state.linePtr  {
      return String.init(cString:theLine)
    }
    else {
      if let actualLine = state.linePtr  { free(actualLine) }
      fclose(state.filePtr)
      return nil
    }
  })
}

So for instance, here's how you would use it to print every line of a file named "foo" in your app bundle:

let url = Bundle.main.url(forResource: "foo", withExtension: nil)!
for line in lines(ofFile:url) {
  // suppress print's automatically inserted line ending, since
  // lines(ofFile:) returns each line with its own newline character.
  print(line, separator: "", terminator: "")
}

I developed this answer by modifying Alex Brown's answer to remove a memory leak mentioned by Martin R's comment, and by updating it for Swift 5.

Porthole answered 1/5, 2016 at 3:25 Comment(0)
C
3

Swift 5.5: use url.lines

ADC Docs are here

Example usage:

guard let url = URL(string: "https://www.example.com") else {
    return
}

// Manipulating an `Array` in memory seems to be a requirement.
// This will balloon in size as lines of data get added.
var myHugeArray = [String]()

do {
    // This should keep the inbound data memory usage low
    for try await line in url.lines {
        myHugeArray.append(line)
    }
} catch {
     debugPrint(error)
}

You can use this in a SwiftUI .task { } modifier or wrap it in a Task to get its work off the main thread.

Cothurnus answered 15/6, 2022 at 4:40 Comment(0)
L
2

Try this answer, or read the Mac OS Stream Programming Guide.

You may find that performance will actually be better using the stringWithContentsOfURL, though, as it will be quicker to work with memory-based (or memory-mapped) data than disc-based data.
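As a rough sketch of the memory-mapped approach mentioned above, assuming Swift 5, Foundation, and a UTF-8 file: Data can be asked to map the file rather than copy it, and the bytes can then be split at LF. The function name mappedLines is hypothetical, and the resulting array still holds every line in memory, so this only avoids the extra full-size string, not the array itself:

```swift
import Foundation

/// Hypothetical helper (not from the answer): memory-maps the file when the
/// system considers it safe, then splits the bytes at LF (byte value 10).
func mappedLines(at url: URL) throws -> [String] {
    let data = try Data(contentsOf: url, options: .mappedIfSafe)
    // split(separator:) drops empty subsequences by default,
    // so blank lines are skipped in this sketch.
    return data.split(separator: 10).map { (slice: Data) -> String in
        // Drop the CR of a CRLF terminator before decoding.
        let bytes = slice.last == 13 ? slice.dropLast() : slice
        // Invalid UTF-8 is replaced with U+FFFD rather than failing.
        return String(decoding: bytes, as: UTF8.self)
    }
}
```

Whether this is actually faster than chunked reading depends on file size and available memory, so it is worth measuring on representative files.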

Executing it on another thread is well documented, also, for example here.

Update

If you don't want to read it all at once, and you don't want to use NSStreams, then you'll probably have to use C-level file I/O. There are many reasons not to do this - blocking, character encoding, handling I/O errors, speed to name but a few - this is what the Foundation libraries are for. I've sketched a simple answer below that just deals with ASCII data:

class StreamReader {

    var eofReached = false
    let fileHandle: UnsafePointer<FILE>

    init (path: String) {
        self.fileHandle = fopen(path.bridgeToObjectiveC().UTF8String, "rb".bridgeToObjectiveC().UTF8String)
    }

    deinit {
        fclose(self.fileHandle)
    }

    func nextLine() -> String {
        var nextChar: UInt8 = 0
        var stringSoFar = ""
        var eolReached = false
        while (self.eofReached == false) && (eolReached == false) {
            if fread(&nextChar, 1, 1, self.fileHandle) == 1 {
                switch nextChar & 0xFF {
                case 13, 10 : // CR, LF
                    eolReached = true
                case 0...127 : // Keep it in ASCII
                    stringSoFar += NSString(bytes:&nextChar, length:1, encoding: NSASCIIStringEncoding)
                default :
                    stringSoFar += "<\(nextChar)>"
                }
            } else { // EOF or error
                self.eofReached = true
            }
        }
        return stringSoFar
    }
}

// OP's original request follows:
var aStreamReader = StreamReader(path: "~/Desktop/Test.text".stringByStandardizingPath)

while aStreamReader.eofReached == false { // Changed property name for more accurate meaning
    let currentline = aStreamReader.nextLine()
    //list.addItem(currentline)
    println(currentline)
}
Leena answered 6/7, 2014 at 9:44 Comment(5)
I appreciate the suggestion(s), but I am specifically looking for the code in Swift. Additionally, I want to work with one line at a time, rather than all the lines at once.Corcyra
So are you looking to work with one line then release it and read the next one in? I would need to think that it is going to be faster to work with it in memory. Do they need to be processed in order? If not you can use a enumeration block to dramatically speed up the processing of the array.Poore
I'd like to grab a number of lines at once, but I won't necessarily need to load all of the lines. As for being in order, it's not critical, but it would be helpful.Corcyra
What happens if you extend the case 0...127 to non-ASCII characters?Corcyra
Well that really depends on what character encoding you have in your files. If they are one of the many formats of Unicode, you'll need to code for that, if they are one of the many pre-Unicode PC "code-page" systems, you'll need to decode that. The Foundation libraries do all of this for you, it's a lot of work on your own.Leena
F
2

Or you could simply use a Generator:

let stdinByLine = GeneratorOf({ () -> String? in
    var input = UnsafeMutablePointer<Int8>(), lim = 0
    return getline(&input, &lim, stdin) > 0 ? String.fromCString(input) : nil
})

Let's try it out

for line in stdinByLine {
    println(">>> \(line)")
}

It's simple, lazy, and easy to chain with other swift things like enumerators and functors such as map, reduce, filter; using the lazy() wrapper.


It generalises to all FILE as:

let byLine = { (file:UnsafeMutablePointer<FILE>) in
    GeneratorOf({ () -> String? in
        var input = UnsafeMutablePointer<Int8>(), lim = 0
        return getline(&input, &lim, file) > 0 ? String.fromCString(input) : nil
    })
}

called like

for line in byLine(stdin) { ... }
Fortitude answered 26/2, 2015 at 0:59 Comment(3)
Much thanks to a now departed answer which gave me the getline code!Fortitude
Obviously I'm completely ignoring encoding. Left as an exercise for the reader.Fortitude
Note that your code leaks memory as getline() allocates a buffer for the data.Scintillate
S
2

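Following up on @dankogai's answer, I made a few modifications for Swift 4+:

    let bufsize = 4096
    // fopen returns nil if the file cannot be opened
    guard let fp = fopen(jsonURL.path, "r") else { return }
    defer { fclose(fp) }
    let buf = UnsafeMutablePointer<Int8>.allocate(capacity: bufsize)
    defer { buf.deallocate() }

    // each string returned by fgets still ends with its newline character
    while fgets(buf, Int32(bufsize-1), fp) != nil {
        print( String(cString: buf) )
    }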

This worked for me.

Thanks

Sisneros answered 28/11, 2019 at 1:40 Comment(0)
H
1

It turns out good old-fashioned C API is pretty comfortable in Swift once you grok UnsafePointer. Here is a simple cat that reads from stdin and prints to stdout line-by-line. You don't even need Foundation. Darwin suffices:

import Darwin
let bufsize = 4096
// let stdin = fdopen(STDIN_FILENO, "r") it is now predefined in Darwin
var buf = UnsafePointer<Int8>.alloc(bufsize)
while fgets(buf, Int32(bufsize-1), stdin) {
    print(String.fromCString(CString(buf)))
}
buf.destroy()
Headrest answered 10/7, 2014 at 6:37 Comment(3)
Fails to handle "by line" at all. It blits input data to output, and does not recognise the difference between normal characters and line end characters. Obviously, the output consists of the same lines as the input, but that's because newline is also blitted.Fortitude
@AlexBrown: That is not true. fgets() reads characters up to (and including) a newline character (or EOF). Or am I misunderstanding your comment?Scintillate
@Martin R , please how would this look in Swift 4/5? I need something this simple to read a file line by line –Sisneros
I
1

(Note: I'm using Swift 3.0.1 on Xcode 8.2.1 with macOS Sierra 10.12.3)

All of the answers I've seen here missed that the asker could be looking for LF or CRLF. If everything goes well, they could just match on LF and check the returned string for an extra CR at the end. But the general query involves multiple search strings. In other words, the delimiter needs to be a Set<String>, where the set is neither empty nor contains the empty string, instead of a single string.
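The simple case, matching on LF and then checking for a leftover CR, can be sketched as follows. This assumes Swift 5, and the function name droppingTrailingCR is hypothetical:

```swift
/// Hypothetical helper: returns the line without one trailing carriage
/// return, so a reader that splits on "\n" also copes with "\r\n" files.
func droppingTrailingCR(_ line: String) -> String {
    // A lone "\r" at the end is its own Character, so dropLast() removes it.
    return line.hasSuffix("\r") ? String(line.dropLast()) : line
}
```

This also addresses the case where converting a line such as "17\r" to Int yields nil: the conversion succeeds once the CR has been removed.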

On my first try at this last year, I tried to do the "right thing" and search for a general set of strings. It was too hard; you need a full blown parser and state machines and such. I gave up on it and the project it was part of.

Now I'm doing the project again, and facing the same challenge again. Now I'm going to hard-code searching on CR and LF. I don't think anyone would need to search on two semi-independent and semi-dependent characters like this outside of CR/LF parsing.

I'm using the search methods provided by Data, so I'm not doing string encodings and stuff here. Just raw binary processing. Just assume I got an ASCII superset, like ISO Latin-1 or UTF-8, here. You can handle string encoding at the next-higher layer, and you punt on whether a CR/LF with secondary code-points attached still counts as a CR or LF.

The algorithm: just keep searching for the next CR and the next LF from your current byte offset.

  • If neither is found, then consider the next data string to be from the current offset to the end-of-data. Note that the terminator length is 0. Mark this as the end of your reading loop.
  • If a LF is found first, or only a LF is found, consider the next data string to be from the current offset to the LF. Note that the terminator length is 1. Move the offset to after the LF.
  • If only a CR is found, do like the LF case (just with a different byte value).
  • Otherwise, we got a CR followed by a LF.
    • If the two are adjacent, then handle like the LF case, except the terminator length will be 2.
    • If there is one byte between them, and said byte is also CR, then we got the "Windows developer wrote a binary \r\n while in text mode, giving a \r\r\n" problem. Also handle it like the LF case, except the terminator length will be 3.
    • Otherwise the CR and LF aren't connected, and handle like the just-CR case.

Here's some code for that:

struct DataInternetLineIterator: IteratorProtocol {

    /// Descriptor of the location of a line
    typealias LineLocation = (offset: Int, length: Int, terminatorLength: Int)

    /// Carriage return.
    static let cr: UInt8 = 13
    /// Carriage return as data.
    static let crData = Data(repeating: cr, count: 1)
    /// Line feed.
    static let lf: UInt8 = 10
    /// Line feed as data.
    static let lfData = Data(repeating: lf, count: 1)

    /// The data to traverse.
    let data: Data
    /// The byte offset to search from for the next line.
    private var lineStartOffset: Int = 0

    /// Initialize with the data to read over.
    init(data: Data) {
        self.data = data
    }

    mutating func next() -> LineLocation? {
        guard self.data.count - self.lineStartOffset > 0 else { return nil }

        let nextCR = self.data.range(of: DataInternetLineIterator.crData, options: [], in: lineStartOffset..<self.data.count)?.lowerBound
        let nextLF = self.data.range(of: DataInternetLineIterator.lfData, options: [], in: lineStartOffset..<self.data.count)?.lowerBound
        var location: LineLocation = (self.lineStartOffset, -self.lineStartOffset, 0)
        let lineEndOffset: Int
        switch (nextCR, nextLF) {
        case (nil, nil):
            lineEndOffset = self.data.count
        case (nil, let offsetLf):
            lineEndOffset = offsetLf!
            location.terminatorLength = 1
        case (let offsetCr, nil):
            lineEndOffset = offsetCr!
            location.terminatorLength = 1
        default:
            lineEndOffset = min(nextLF!, nextCR!)
            if nextLF! < nextCR! {
                location.terminatorLength = 1
            } else {
                switch nextLF! - nextCR! {
                case 2 where self.data[nextCR! + 1] == DataInternetLineIterator.cr:
                    location.terminatorLength += 1  // CR-CRLF
                    fallthrough
                case 1:
                    location.terminatorLength += 1  // CRLF
                    fallthrough
                default:
                    location.terminatorLength += 1  // CR-only
                }
            }
        }
        self.lineStartOffset = lineEndOffset + location.terminatorLength
        location.length += self.lineStartOffset
        return location
    }

}
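
To turn a LineLocation into text, the terminator has to be stripped from the reported span, because next() as written counts the terminator bytes in length. A minimal sketch, assuming UTF-8 text by default; the helper name lineText(in:at:encoding:) is hypothetical, and the typealias is repeated only so the snippet compiles on its own:

```swift
import Foundation

// Same shape as the iterator's LineLocation; repeated here for self-containment.
typealias LineLocation = (offset: Int, length: Int, terminatorLength: Int)

/// Hypothetical helper: decodes one line, excluding its terminator bytes.
/// Returns nil if the bytes are not valid in the given encoding.
func lineText(in data: Data,
              at location: LineLocation,
              encoding: String.Encoding = .utf8) -> String? {
    // `length` includes the terminator, so subtract it to get the line body.
    let start = data.startIndex + location.offset
    let end = start + location.length - location.terminatorLength
    return String(data: data.subdata(in: start..<end), encoding: encoding)
}
```

Each tuple returned by the iterator can be passed straight to this helper together with the same Data value.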

Of course, if your Data block is large (a significant fraction of a gigabyte or more), you'll take a hit whenever one of the two bytes (CR or LF) no longer occurs after the current offset: every call to next() then searches fruitlessly all the way to the end of the data for it. Reading the data in chunks would help:

struct DataBlockIterator: IteratorProtocol {

    /// The data to traverse.
    let data: Data
    /// The offset into the data to read the next block from.
    private(set) var blockOffset = 0
    /// The number of bytes remaining.  Kept so the last block is the right size if it's short.
    private(set) var bytesRemaining: Int
    /// The size of each block (except possibly the last).
    let blockSize: Int

    /// Initialize with the data to read over and the chunk size.
    init(data: Data, blockSize: Int) {
        precondition(blockSize > 0)

        self.data = data
        self.bytesRemaining = data.count
        self.blockSize = blockSize
    }

    mutating func next() -> Data? {
        guard bytesRemaining > 0 else { return nil }
        defer { blockOffset += blockSize ; bytesRemaining -= blockSize }

        return data.subdata(in: blockOffset..<(blockOffset + min(bytesRemaining, blockSize)))
    }

}

You have to mix these ideas together yourself, since I haven't done it yet. Consider:

  • Of course, you have to handle lines that are completely contained in one chunk.
  • But you also have to handle a line whose two ends are in adjacent chunks.
  • Or a line whose endpoints have at least one whole chunk between them.
  • The big complication is when the line ends with a multi-byte sequence, but said sequence straddles two chunks! (A line ending in just CR that's also the last byte in the chunk is an equivalent case, since you need to read the next chunk to see if your just-CR is actually a CRLF or CR-CRLF. There are similar shenanigans when the chunk ends with CR-CR.)
  • And you need to handle when there are no more terminators from your current offset, but the end-of-data is in a later chunk.
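
One way to sidestep most of these cases is to carry state across chunks instead of searching each chunk in isolation. The following is a minimal sketch, not a finished combination of the two iterators above: ChunkedLineSplitter and its feed/finish methods are hypothetical names, it treats CR, LF, and CRLF as terminators (CR-CRLF is not special-cased and would yield an extra empty line), and it walks the bytes one at a time, which is simple but not fast.

```swift
import Foundation

/// Hypothetical helper that splits a stream of chunks into lines.
struct ChunkedLineSplitter {
    /// Bytes of the current, still unterminated line.
    private var partial = Data()
    /// True when the previous byte was a CR, so a following LF belongs to it.
    private var sawCR = false

    /// Feeds one chunk; returns the lines it completed, without terminators.
    mutating func feed(_ chunk: Data) -> [Data] {
        var lines: [Data] = []
        for byte in chunk {
            if sawCR {
                sawCR = false
                // LF right after CR (possibly in the next chunk): the line was
                // already emitted at the CR, so just swallow the LF.
                if byte == 10 { continue }
            }
            switch byte {
            case 13:                       // CR ends the line; remember it
                lines.append(partial)
                partial = Data()
                sawCR = true
            case 10:                       // lone LF ends the line
                lines.append(partial)
                partial = Data()
            default:
                partial.append(byte)
            }
        }
        return lines
    }

    /// Call once after the last chunk; returns a final unterminated line, if any.
    mutating func finish() -> Data? {
        defer { partial = Data(); sawCR = false }
        return partial.isEmpty ? nil : partial
    }
}
```

Lines come back as Data so that a multi-byte character split across two chunks is only decoded once the whole line is available. Emitting the line at the CR and swallowing a later LF means a CRLF that straddles two chunks needs no look-ahead.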

Good luck!

Iridescent answered 25/2, 2017 at 0:40 Comment(0)
Q
0

I wanted a version that did not continually modify the buffer or duplicate code, as both are inefficient, and that would allow for any buffer size (including 1 byte) and any delimiter. It has one public method: readLine(). Calling this method returns the String value of the next line, or nil at EOF.

import Foundation

// LineStream(): path: String, [buffSize: Int], [delim: String] -> nil | String
// ============= --------------------------------------------------------------
// path:     the path to a text file to be parsed
// buffSize: an optional buffer size, (1...); default is 4096
// delim:    an optional delimiter String; default is "\n"
// ***************************************************************************
class LineStream {
    let path: String
    let handle: NSFileHandle!

    let delim: NSData!
    let encoding: NSStringEncoding

    var buffer = NSData()
    var buffSize: Int

    var buffIndex = 0
    var buffEndIndex = 0

    init?(path: String,
      buffSize: Int = 4096,
      delim: String = "\n",
      encoding: NSStringEncoding = NSUTF8StringEncoding)
    {
      self.handle = NSFileHandle(forReadingAtPath: path)
      self.path = path
      self.buffSize = buffSize < 1 ? 1 : buffSize
      self.encoding = encoding
      self.delim = delim.dataUsingEncoding(encoding)
      if handle == nil || self.delim == nil {
        print("ERROR initializing LineStream") /* TODO use STDERR */
        return nil
      }
    }

  // PRIVATE
  // fillBuffer(): _ -> Int [0...buffSize]
  // ============= -------- ..............
  // Fill the buffer with new data; return with the buffer size, or zero
  // upon reaching end-of-file
  // *********************************************************************
  private func fillBuffer() -> Int {
    buffer = handle.readDataOfLength(buffSize)
    buffIndex = 0
    buffEndIndex = buffer.length

    return buffEndIndex
  }

  // PRIVATE
  // delimLocation(): _ -> Int? nil | [0..<buffSize]
  // ================ --------- ....................
  // Search the remaining buffer for a delimiter; return with the location
  // of a delimiter in the buffer, or nil if one is not found.
  // ***********************************************************************
  private func delimLocation() -> Int? {
    let searchRange = NSMakeRange(buffIndex, buffEndIndex - buffIndex)
    let rangeToDelim = buffer.rangeOfData(delim,
                                          options: [], range: searchRange)
    return rangeToDelim.location == NSNotFound
        ? nil
        : rangeToDelim.location
  }

  // PRIVATE
  // dataStrValue(): NSData -> String ("" | String)
  // =============== ---------------- .............
  // Attempt to convert data into a String value using the supplied encoding; 
  // return the String value or empty string if the conversion fails.
  // ***********************************************************************
  private func dataStrValue(data: NSData) -> String? {
    if let strVal = NSString(data: data, encoding: encoding) as? String {
      return strVal
    } else {
      return ""
    }
  }

  // PUBLIC
  // readLine(): _ -> String? nil | String
  // =========== ____________ ............
  // Read the next line of the file, i.e., up to the next delimiter or end-of-
  // file, whichever occurs first; return the String value of the data found, 
  // or nil upon reaching end-of-file.
  // *************************************************************************
  func readLine() -> String? {
    guard let line = NSMutableData(capacity: buffSize) else {
        print("ERROR setting line")
        exit(EXIT_FAILURE)
    }

    // Loop until a delimiter is found, or end-of-file is reached
    var delimFound = false
    while !delimFound {
        // buffIndex will equal buffEndIndex in three situations, resulting
        // in a (re)filling of the buffer:
        //   1. Upon the initial call;
        //   2. If a search for a delimiter has failed
        //   3. If a delimiter is found at the end of the buffer
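        if buffIndex == buffEndIndex {
            if fillBuffer() == 0 {
                // End-of-file: return any unterminated final line collected
                // so far; return nil only once nothing is left
                return line.length > 0 ? dataStrValue(line) : nil
            }
        }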

        var lengthToDelim: Int
        let startIndex = buffIndex

        // Find a length of data to place into the line buffer to be
        // returned; reset buffIndex
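        if let delimIndex = delimLocation() {
            // SOME VALUE when a delimiter is found; append the data before
            // it onto the line buffer, and then return the line buffer
            delimFound = true
            lengthToDelim = delimIndex - buffIndex
            // Skip the whole delimiter, which may be longer than one byte.
            // Note: a multi-byte delimiter split across two buffer reads
            // is not detected by this search.
            buffIndex = delimIndex + delim.length
                                    // will trigger a refill if at the end
                                    // of the buffer on the next call, but
                                    // first the line will be returned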
        } else {
            // NIL if no delimiter left in the buffer; append the rest of
            // the buffer onto the line buffer, refill the buffer, and
            // continue looking
            lengthToDelim = buffEndIndex - buffIndex
            buffIndex = buffEndIndex    // will trigger a refill of buffer
                                        // on the next loop
        }

        line.appendData(buffer.subdataWithRange(
            NSMakeRange(startIndex, lengthToDelim)))
    }

    return dataStrValue(line)
  }
}

It is called as follows:

guard let myStream = LineStream(path: "/path/to/file.txt")
else { exit(EXIT_FAILURE) }

while let s = myStream.readLine() {
  print(s)
}
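
For newer toolchains, the same idea can be sketched in current Swift syntax. This is not the code above ported line for line: the class name BufferedLineReader and its parameter labels are made up for illustration, and unlike LineStream it keeps unread bytes in a mutable Data buffer (which this answer set out to avoid), in exchange for finding a multi-byte delimiter such as "\r\n" even when it is split across two reads.

```swift
import Foundation

/// Hypothetical modern-Swift variant of a buffered line reader.
final class BufferedLineReader {
    private let handle: FileHandle
    private let delimiter: Data
    private let encoding: String.Encoding
    private let chunkSize: Int
    private var buffer = Data()   // bytes read but not yet returned
    private var atEOF = false

    init?(path: String, chunkSize: Int = 4096,
          delimiter: String = "\n", encoding: String.Encoding = .utf8) {
        guard let handle = FileHandle(forReadingAtPath: path),
              let delimiterData = delimiter.data(using: encoding),
              !delimiterData.isEmpty else { return nil }
        self.handle = handle
        self.delimiter = delimiterData
        self.encoding = encoding
        self.chunkSize = max(1, chunkSize)
    }

    deinit { handle.closeFile() }

    /// Returns the next line without its delimiter, or nil at end-of-file.
    /// Undecodable bytes yield an empty string, as in the original.
    func readLine() -> String? {
        while true {
            if let range = buffer.range(of: delimiter) {
                let body = buffer.subdata(in: buffer.startIndex..<range.lowerBound)
                buffer.removeSubrange(buffer.startIndex..<range.upperBound)
                return String(data: body, encoding: encoding) ?? ""
            }
            if atEOF {
                // No delimiter left: hand back a final unterminated line once.
                guard !buffer.isEmpty else { return nil }
                let line = String(data: buffer, encoding: encoding) ?? ""
                buffer.removeAll()
                return line
            }
            let chunk = handle.readData(ofLength: chunkSize)
            if chunk.isEmpty { atEOF = true } else { buffer.append(chunk) }
        }
    }
}
```

It is used the same way as LineStream: create it with a path, then loop with while let line = reader.readLine(). Because the buffer is rescanned from its start after every read, very long lines cost more here than with the index-based approach above.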
Quartana answered 27/7, 2016 at 20:34 Comment(0)
