How can I create a String from UTF8 in Swift?
We know we can print each character of a String as UTF-8 code units. Given the code units of those characters, how can we create a String from them?

Atelectasis answered 28/6, 2014 at 9:34 Comment(0)
N
20

With Swift 5, you can choose one of the following ways in order to convert a collection of UTF-8 code units into a string.


#1. Using String's init(_:) initializer

If you have a String.UTF8View instance (i.e. a collection of UTF-8 code units) and want to convert it to a string, you can use the init(_:) initializer. init(_:) has the following declaration:

init(_ utf8: String.UTF8View)

Creates a string corresponding to the given sequence of UTF-8 code units.

The Playground sample code below shows how to use init(_:):

let string = "Café 🇫🇷"
let utf8View: String.UTF8View = string.utf8

let newString = String(utf8View)
print(newString) // prints: Café 🇫🇷

#2. Using String's init(decoding:as:) initializer

init(decoding:as:) creates a string from the given Unicode code units collection in the specified encoding:

let string = "Café 🇫🇷"
let codeUnits: [Unicode.UTF8.CodeUnit] = Array(string.utf8)

let newString = String(decoding: codeUnits, as: UTF8.self)
print(newString) // prints: Café 🇫🇷

Note that init(decoding:as:) also accepts a String.UTF8View argument:

let string = "Café 🇫🇷"
let utf8View: String.UTF8View = string.utf8

let newString = String(decoding: utf8View, as: UTF8.self)
print(newString) // prints: Café 🇫🇷
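
One property of init(decoding:as:) worth knowing is that it never fails: ill-formed input is repaired by substituting U+FFFD (the replacement character) rather than reported as an error. A minimal sketch, where the byte values are made up for illustration:

```swift
// Hypothetical input: "C", an invalid byte (0xFF never appears in UTF-8), then "A".
let damaged: [UInt8] = [0x43, 0xFF, 0x41]

// init(decoding:as:) repairs the invalid byte instead of failing.
let repaired = String(decoding: damaged, as: UTF8.self)
print(repaired == "C\u{FFFD}A") // prints: true
```

If invalid input should be rejected rather than repaired, a failable initializer such as Foundation's init(bytes:encoding:) may be a better fit.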

#3. Using transcode(_:from:to:stoppingOnError:into:) function

The following example transcodes the UTF-8 representation of an initial string into Unicode scalar values (UTF-32 code units) that can be used to build a new string:

let string = "Café 🇫🇷"
let bytes = Array(string.utf8)

var newString = ""
_ = transcode(bytes.makeIterator(), from: UTF8.self, to: UTF32.self, stoppingOnError: true, into: {
    newString.append(String(Unicode.Scalar($0)!))
})
print(newString) // prints: Café 🇫🇷
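
transcode(_:from:to:stoppingOnError:into:) also returns a Bool that reports whether an encoding error was detected, which the sample above discards. A small sketch of checking it, assuming a made-up input that contains one invalid byte:

```swift
// Hypothetical input: "Ca", then an invalid byte, then "f".
let input: [UInt8] = [0x43, 0x61, 0xFF, 0x66]

var scalars = String.UnicodeScalarView()
let hadError = transcode(input.makeIterator(), from: UTF8.self, to: UTF32.self, stoppingOnError: true) { codeUnit in
    // Each UTF-32 code unit is a scalar value; the initializer is failable, so unwrap it.
    if let scalar = Unicode.Scalar(codeUnit) {
        scalars.append(scalar)
    }
}

let decodedPrefix = String(scalars)
print(hadError)      // prints: true
print(decodedPrefix) // prints: Ca
```

With stoppingOnError set to true, decoding stops at the first ill-formed sequence, so only the valid prefix is collected.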

#4. Using Array's withUnsafeBufferPointer(_:) method and String's init(cString:) initializer

init(cString:) has the following declaration:

init(cString: UnsafePointer<CChar>)

Creates a new string by copying the null-terminated UTF-8 data referenced by the given pointer.

The following example shows how to use init(cString:) with a pointer to the content of a CChar array (here, a null-terminated, well-formed UTF-8 code unit sequence) in order to create a string from it:

let bytes: [CChar] = [67, 97, 102, -61, -87, 32, -16, -97, -121, -85, -16, -97, -121, -73, 0]

let newString = bytes.withUnsafeBufferPointer({ (bufferPointer: UnsafeBufferPointer<CChar>) in
    return String(cString: bufferPointer.baseAddress!)
})
print(newString) // prints: Café 🇫🇷
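
Because init(cString:) reads until it finds a 0 byte, the array must end with one. One way to get such an array without typing the bytes by hand is String's utf8CString property. A minimal round-trip sketch:

```swift
let original = "Café 🇫🇷"

// utf8CString is a ContiguousArray<CChar> that already ends with a 0 byte.
let cString = original.utf8CString
print(cString.last == 0) // prints: true

let roundTripped = cString.withUnsafeBufferPointer { buffer in
    // baseAddress is non-nil because the array holds at least the terminator.
    String(cString: buffer.baseAddress!)
}
print(roundTripped == original) // prints: true
```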

#5. Using Unicode.UTF8's decode(_:) method

To decode a code unit sequence, call decode(_:) repeatedly until it returns UnicodeDecodingResult.emptyInput:

let string = "Café 🇫🇷"
let codeUnits = Array(string.utf8)

var codeUnitIterator = codeUnits.makeIterator()
var utf8Decoder = Unicode.UTF8()
var newString = ""

Decode: while true {
    switch utf8Decoder.decode(&codeUnitIterator) {
    case .scalarValue(let value):
        newString.append(Character(Unicode.Scalar(value)))
    case .emptyInput:
        break Decode
    case .error:
        print("Decoding error")
        break Decode
    }
}

print(newString) // prints: Café 🇫🇷
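
The same loop can be wrapped in a function that reports failure instead of printing. This is only a sketch; decodeUTF8 is a hypothetical helper name, not a standard library function:

```swift
/// Hypothetical helper: returns nil as soon as an ill-formed sequence is found.
func decodeUTF8(_ codeUnits: [UInt8]) -> String? {
    var iterator = codeUnits.makeIterator()
    var decoder = Unicode.UTF8()
    var scalars = String.UnicodeScalarView()

    while true {
        switch decoder.decode(&iterator) {
        case .scalarValue(let scalar):
            scalars.append(scalar)
        case .emptyInput:
            return String(scalars)
        case .error:
            return nil
        }
    }
}

print(decodeUTF8(Array("Café 🇫🇷".utf8)) == "Café 🇫🇷") // prints: true
print(decodeUTF8([0x43, 0xFF]) == nil)                  // prints: true
```

Appending to the unicodeScalars view lets the string work out grapheme cluster boundaries itself, so the two scalars of the flag end up in one Character.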

#6. Using String's init(bytes:encoding:) initializer

Foundation gives String an init(bytes:encoding:) initializer that you can use as indicated in the Playground sample code below:

import Foundation

let string = "Café 🇫🇷"
let bytes: [Unicode.UTF8.CodeUnit] = Array(string.utf8)

let newString = String(bytes: bytes, encoding: String.Encoding.utf8)
print(String(describing: newString)) // prints: Optional("Café 🇫🇷")
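
Since this initializer returns an optional, the result usually needs unwrapping. The sketch below also uses the related init(data:encoding:) initializer for the case where the bytes arrive as a Data value; the payload is made up for illustration:

```swift
import Foundation

// Hypothetical payload, e.g. bytes received from a file or socket.
let payload = Data([0x43, 0x61, 0x66, 0xC3, 0xA9])

if let text = String(data: payload, encoding: .utf8) {
    print(text) // prints: Café
} else {
    print("Not valid UTF-8")
}

// Ill-formed input yields nil instead of a repaired string.
let rejected = String(bytes: [0x43, 0xFF] as [UInt8], encoding: .utf8)
print(rejected == nil) // prints: true
```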
Nourish answered 21/2, 2019 at 14:25 Comment(1)
#2 above is the general, simple, safe and efficient answer. - Tortricid
R
15

It's possible to convert UTF-8 code units to a Swift String idiomatically using Swift's UTF8 codec type, although it's much easier to convert from String to UTF-8!

import Foundation

public class UTF8Encoding {
  // Builds a String from UTF-8 bytes.
  public static func encode(bytes: Array<UInt8>) -> String {
    var encodedString = ""
    var decoder = UTF8()
    var generator = bytes.makeIterator()
    var finished: Bool = false
    repeat {
      let decodingResult = decoder.decode(&generator)
      switch decodingResult {
      case .scalarValue(let char):
        encodedString.unicodeScalars.append(char)
      case .emptyInput:
        finished = true
      /* ignore errors: stop at the first invalid sequence */
      case .error:
        finished = true
      }
    } while (!finished)
    return encodedString
  }

  // Returns the UTF-8 bytes of a String.
  public static func decode(str: String) -> Array<UInt8> {
    var decodedBytes = Array<UInt8>()
    for b in str.utf8 {
      decodedBytes.append(b)
    }
    return decodedBytes
  }
}

// Meant to live in an XCTestCase subclass (XCTAssert requires import XCTest).
func testUTF8Encoding() {
  let testString = "A UTF8 String With Special Characters: 😀🍎"
  let decodedArray = UTF8Encoding.decode(str: testString)
  let encodedString = UTF8Encoding.encode(bytes: decodedArray)
  XCTAssert(encodedString == testString, "UTF8Encoding is lossless: \(encodedString) != \(testString)")
}

Of the other alternatives suggested:

  • Using NSString invokes the Objective-C bridge;

  • Using UnicodeScalar is error-prone because it converts UnicodeScalars directly to Characters, ignoring complex grapheme clusters; and

  • Using String.fromCString is potentially unsafe as it uses pointers.
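
To make the grapheme cluster point concrete: a single Character can consist of several Unicode scalars, each of which is encoded as several UTF-8 code units. A small sketch using a flag emoji:

```swift
let flag = "🇫🇷"

print(flag.count)                // prints: 1 (one Character, i.e. one grapheme cluster)
print(flag.unicodeScalars.count) // prints: 2 (two regional indicator scalars)
print(flag.utf8.count)           // prints: 8 (four UTF-8 code units per scalar)
```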

Rivalry answered 3/6, 2015 at 19:40 Comment(3)
Thank you for decoding UTF8 encoding! You can remove import Foundation from the top, that's the whole reason I want to use this. - Ful
Thanks! Very helpful. Here is a link to the Sandbox with this working with a couple of updates and decode made a bit easier. swiftlang.ng.bluemix.net/#/repl/… - Sunder
Your use of the words "encode" and "decode" is the opposite of how I think about the conversions between strings and UTF-8 data. - Ship
S
5

This improves on Martin R's answer:

import Foundation

let utf8: [CChar] = [65, 66, 67, 0]
if let str = NSString(bytes: utf8, length: utf8.count, encoding: String.Encoding.utf8.rawValue) {
    print(str) // Output: ABC
}

import Foundation

let utf8: [UInt8] = [0xE2, 0x82, 0xAC, 0]
if let str = NSString(bytes: utf8, length: utf8.count, encoding: String.Encoding.utf8.rawValue) {
    print(str) // Output: €
}

This works because an Array is automatically converted to a pointer (CConstVoidPointer in the early Swift betas, UnsafeRawPointer in current Swift), which NSString(bytes:length:encoding:) accepts as its first argument.

Sacristy answered 28/6, 2014 at 12:43 Comment(1)
Note that your code converts the 0 byte as well, to a NUL character in the created NSString. - Rodeo
T
4

Swift 3

import Foundation
let s = String(bytes: arr, encoding: .utf8) // arr: [UInt8]; s is nil if arr is not valid UTF-8

Teshatesla answered 20/3, 2017 at 6:38 Comment(0)
B
2

I've been looking for a comprehensive answer regarding string manipulation in Swift myself. Relying on casts to and from NSString and other unsafe pointer magic just wasn't doing it for me. Here's a safe alternative (for ASCII input only, since each code unit is mapped to exactly one character):

First, we'll want to extend UInt8. This is the primitive type behind UTF8.CodeUnit.

extension UInt8 {
    var character: Character {
        return Character(UnicodeScalar(self))
    }
}

This will allow us to do something like this:

let codeUnits: [UInt8] = [
    72, 69, 76, 76, 79
]

let characters = codeUnits.map { $0.character }
let string     = String(characters)

// string prints "HELLO"

Equipped with this extension, we can now begin modifying strings.

let string = "ABCDEFGHIJKLMONP"

var modifiedCharacters = [Character]()
for (index, utf8unit) in string.utf8.enumerated() {

    // Insert a "-" every 4 characters
    if index > 0 && index % 4 == 0 {
        let separator: UInt8 = 45 // "-" in ASCII
        modifiedCharacters.append(separator.character)
    }
    modifiedCharacters.append(utf8unit.character)
}

let modifiedString = String(modifiedCharacters)

// modified string == "ABCD-EFGH-IJKL-MONP"
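
This approach maps each byte to the scalar with the same numeric value, so it is only correct for ASCII. A small sketch of what happens with a non-ASCII character (the escape \u{E9} is "é"):

```swift
// "é" (U+00E9) is encoded in UTF-8 as the two bytes 0xC3 0xA9.
let bytes = Array("\u{E9}".utf8)

// Mapping each byte to the scalar of the same value (as the extension does)
// yields two Latin-1 characters instead of one "é".
let perByte = String(bytes.map { Character(Unicode.Scalar($0)) })
print(perByte) // prints: Ã©

// Decoding the bytes as UTF-8 restores the original character.
let decoded = String(decoding: bytes, as: UTF8.self)
print(decoded) // prints: é
```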
Bellyful answered 7/9, 2016 at 13:7 Comment(2)
Am I correct in assuming that this will only work with ASCII character strings? I.e., it will mess things up if there are Danish letters Æ Ø Å æ ø å in the string? Or accented letters? Not to mention other alphabets like Russian Cyrillic and the Greek alphabet and Chinese and ... - Ship
Yes, that assumption is correct. This solution only works for single-byte (ASCII) characters and will quickly break on anything like emoji or international characters. - Bellyful
C
2
// Swift 4
var units = [UTF8.CodeUnit]()
//
// append the UTF-8 code units to units here
//
let str = String(decoding: units, as: UTF8.self)
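
Collecting the code units first and decoding once matters when the bytes arrive in pieces, because a chunk boundary can fall in the middle of a multi-byte sequence. A sketch under that assumption; the chunking is hypothetical:

```swift
// Hypothetical scenario: the three bytes of "€" (0xE2 0x82 0xAC) arrive in two chunks.
let chunks: [[UInt8]] = [[0xE2, 0x82], [0xAC]]

var units = [UTF8.CodeUnit]()
for chunk in chunks {
    units.append(contentsOf: chunk) // accumulate code units, do not decode yet
}

// Decoding once, after all units are collected, keeps multi-byte sequences intact.
let str = String(decoding: units, as: UTF8.self)
print(str) // prints: €

// Decoding a chunk on its own would repair the truncated sequence with U+FFFD.
let partial = String(decoding: chunks[0], as: UTF8.self)
print(partial == "€") // prints: false
```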
Cryolite answered 17/6, 2018 at 7:12 Comment(1)
While this code snippet may be the solution, including an explanation really helps to improve the quality of your post. Remember that you are answering the question for readers in the future, and those people might not know the reasons for your code suggestion. - Violent
R
1

This is a possible solution (now updated for Swift 2):

let utf8 : [CChar] = [65, 66, 67, 0]
if let str = utf8.withUnsafeBufferPointer( { String.fromCString($0.baseAddress) }) {
    print(str) // Output: ABC
} else {
    print("Not a valid UTF-8 string") 
}

Within the closure, $0 is an UnsafeBufferPointer<CChar> pointing to the array's contiguous storage. A Swift String can be created from its base address.

Alternatively, if you prefer the input as unsigned bytes:

let utf8 : [UInt8] = [0xE2, 0x82, 0xAC, 0]
if let str = utf8.withUnsafeBufferPointer( { String.fromCString(UnsafePointer($0.baseAddress)) }) {
    print(str) // Output: €
} else {
    print("Not a valid UTF-8 string")
}
Rodeo answered 28/6, 2014 at 10:13 Comment(3)
C code written in Swift syntax... and more ugly than C (which may be a good thing so people want to avoid them) - Sacristy
@BryanChen: I have just tried to present a Swift-only solution that does not use Foundation and Objective-C classes... - Rodeo
I think the true Swift way must use Character and UTF8 somewhere - Sacristy
F
1

I would do something like this. It may not be as elegant as working with pointers, but it does the job well: it defines a handful of new += operators for String, like:

@infix func += (inout lhs: String, rhs: (unit1: UInt8)) {
    lhs += Character(UnicodeScalar(UInt32(rhs.unit1)))
}

@infix func += (inout lhs: String, rhs: (unit1: UInt8, unit2: UInt8)) {
    lhs += Character(UnicodeScalar(UInt32(rhs.unit1) << 8 | UInt32(rhs.unit2)))
}

@infix func += (inout lhs: String, rhs: (unit1: UInt8, unit2: UInt8, unit3: UInt8, unit4: UInt8)) {
    lhs += Character(UnicodeScalar(UInt32(rhs.unit1) << 24 | UInt32(rhs.unit2) << 16 | UInt32(rhs.unit3) << 8 | UInt32(rhs.unit4)))
}

NOTE: you can extend the list of supported operators by overloading the + operator as well, giving String a full set of commutative operators.


Now you can append a Unicode character to a String from the bytes of its code point value, e.g.:

var string: String = "signs of the Zodiac: "
string += (0x0, 0x0, 0x26, 0x4b)
string += (38)
string += (0x26, 76)
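
Note that the byte tuples above spell out a code point value, which is not the same as that code point's UTF-8 encoding. A short sketch of the difference for U+264B, written against current Swift rather than the operator syntax above:

```swift
// U+264B (the Cancer sign) given as a code point value.
let codePoint: UInt32 = 0x264B
let fromCodePoint = Unicode.Scalar(codePoint).map { String($0) } ?? ""

// The same character given as its three UTF-8 code units.
let utf8Bytes: [UInt8] = [0xE2, 0x99, 0x8B]
let fromUTF8 = String(decoding: utf8Bytes, as: UTF8.self)

print(fromCodePoint == fromUTF8)              // prints: true
print(Array(fromCodePoint.utf8) == utf8Bytes) // prints: true
```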
Ferraro answered 28/6, 2014 at 11:7 Comment(3)
Just a remark: Your code creates a String from UTF-32 input (if I understand it correctly) and mine from UTF-8 input. Reading the question again I am not 100% sure what is requested here. OP mentions both "UTF-8" and "Code point" ... - Rodeo
@MartinR, you are right, to be fair, I'm not sure about the real question either, the reason is just the same as you just said... - Ferraro
Note that the UTF-8 sequence for a Unicode code point has 1, 2, 3, or 4 bytes. - Rodeo
R
1

If you're starting with a raw buffer, such as from the Data object returned from a file handle (in this case, taken from a Pipe object):

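let data = pipe.fileHandleForReading.readDataToEndOfFile()
// Reserve one extra byte for the NUL terminator that String(cString:) expects.
let unsafePointer = UnsafeMutablePointer<UInt8>.allocate(capacity: data.count + 1)

data.copyBytes(to: unsafePointer, count: data.count)
unsafePointer[data.count] = 0

let output = String(cString: unsafePointer)
unsafePointer.deallocate() // the string holds its own copy, so the buffer can be freed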
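
If the pointer is not needed for anything else, the manual copy can probably be skipped: Data is a collection of UInt8, so it can be passed to init(decoding:as:) directly. A sketch with a made-up Data value standing in for the pipe output:

```swift
import Foundation

// Hypothetical stand-in for the Data returned by readDataToEndOfFile().
let data = Data([0x68, 0x69, 0x0A]) // "hi\n"

// Data is a collection of UInt8, so it can be decoded without a manual copy.
let output = String(decoding: data, as: UTF8.self)
print(output == "hi\n") // prints: true
```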
Rascon answered 18/8, 2017 at 18:34 Comment(0)
H
0

Here is a Swift 3.0 version of Martin R's answer:

public class UTF8Encoding {
  public static func encode(bytes: Array<UInt8>) -> String {
    var encodedString = ""
    var decoder = UTF8()
    var generator = bytes.makeIterator()
    var finished: Bool = false
    repeat {
      let decodingResult = decoder.decode(&generator)
      switch decodingResult {
      case .scalarValue(let char):
        encodedString += "\(char)"
      case .emptyInput:
        finished = true
      case .error:
        finished = true
      }
    } while (!finished)
    return encodedString
  }
  public static func decode(str: String) -> Array<UInt8> {
    var decodedBytes = Array<UInt8>()
    for b in str.utf8 {
      decodedBytes.append(b)
    }
    return decodedBytes
  }
}

If you want to show emoji from a string of Unicode code point notations, just use the convertEmojiCodesToString method below. It works for strings like "U+1F52B" (emoji) or "U+1F1E6 U+1F1F1" (country flag emoji):

import Foundation

class EmojiConverter {
  static func convertEmojiCodesToString(_ emojiCodesString: String) -> String {
    let emojies = emojiCodesString.components(separatedBy: " ")
    var resultString = ""
    for emoji in emojies {
      // Drop the leading "U+" and parse the rest as a hexadecimal code point.
      let formattedCode = emoji.dropFirst(2).lowercased()
      if let charCode = UInt32(formattedCode, radix: 16),
        let unicode = UnicodeScalar(charCode) {
        let str = String(unicode)
        resultString += "\(str)"
      }
    }
    return resultString
  }
}
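
For comparison, the same idea can be sketched without Foundation by splitting on spaces and appending to a unicode scalar view. The function name string(fromCodePoints:) is hypothetical:

```swift
/// Hypothetical helper: turns "U+XXXX U+YYYY" notation into a String.
func string(fromCodePoints notation: String) -> String {
    var scalars = String.UnicodeScalarView()
    for token in notation.split(separator: " ") where token.hasPrefix("U+") {
        // Parse the hexadecimal digits after "U+"; skip tokens that are not valid scalars.
        if let value = UInt32(token.dropFirst(2), radix: 16),
           let scalar = Unicode.Scalar(value) {
            scalars.append(scalar)
        }
    }
    return String(scalars)
}

let flag = string(fromCodePoints: "U+1F1E6 U+1F1F1")
print(flag.count)                // prints: 1
print(flag.unicodeScalars.count) // prints: 2
```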
Hoarse answered 18/3, 2017 at 17:30 Comment(0)
