Creating Strings from UTF-8 in [UInt8]

Creating Strings from UTF-8 in [UInt8]

chris.ridd
Hi,

I have a sequence of bytes in an array that might be a valid UTF-8 sequence. I can of course dip down into Foundation to convert these into a String; however, I'm looking for a way to do this in "pure" Swift.

Looking at the standard library docs just makes my head spin! I can easily get UTF8 out of a String, but going in the other direction seems extraordinarily complicated.

Has anyone figured out the magic syntax for doing something like this?

Thanks,

Chris


--
You received this message because you are subscribed to the Google Groups "Swift Language" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To post to this group, send email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/swift-language/fe4c5b23-b28b-4c3a-b5e5-6b662f1bc94a%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: Creating Strings from UTF-8 in [UInt8]

Jens Alfke

On Jun 16, 2015, at 11:54 AM, [hidden email] wrote:

> I have a sequence of bytes in an array that might be a valid UTF-8 sequence. I can of course dip down into Foundation to convert these into a String; however, I'm looking for a way to do this in "pure" Swift.

Honestly I don’t know if there is one. They may not have added this yet since it’s easily done using the bridging to NSString. If so, this is something they’ll need to fix for the open-source release since Foundation won’t be available.

—Jens


Re: Creating Strings from UTF-8 in [UInt8]

Brent Royal-Gordon-2
There's a struct called UTF8 that encodes and decodes UTF-8 sequences into UnicodeScalars. Using it is a little cumbersome, but very doable:

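func decode<UnicodeCodec: UnicodeCodecType>(codeUnits: [UnicodeCodec.CodeUnit], var decoder: UnicodeCodec) -> String? {
    var string = ""

    // The decoder pulls code units from the generator one scalar at a time.
    var codeUnitGenerator = codeUnits.generate()
    while true {
        switch decoder.decode(&codeUnitGenerator) {
        case .Result(let scalar):
            // A complete, valid scalar was decoded.
            string.append(scalar)
        case .EmptyInput:
            // All input consumed without errors.
            return string
        case .Error:
            // Ill-formed input: give up.
            return nil
        }
    }
}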
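
For readers on a newer toolchain: the function above uses the Swift 2 spellings that were current when this thread was written. The following is a minimal sketch of the same loop, assuming the renamed standard library API of later Swift releases (`UnicodeCodec`, `makeIterator()`, `.scalarValue`/`.emptyInput`/`.error`). The name `decodeScalars` and its parameter labels are made up for illustration and are not from the thread.

```swift
// Strict decoder: returns nil as soon as the codec reports ill-formed input.
// `decodeScalars` is a hypothetical name; the loop mirrors the post above.
func decodeScalars<Codec: UnicodeCodec>(_ codeUnits: [Codec.CodeUnit],
                                        using codec: Codec.Type) -> String? {
    var decoder = Codec()                      // UnicodeCodec requires init()
    var iterator = codeUnits.makeIterator()
    var string = ""
    while true {
        switch decoder.decode(&iterator) {
        case .scalarValue(let scalar):
            string.unicodeScalars.append(scalar)
        case .emptyInput:
            return string                      // all input consumed cleanly
        case .error:
            return nil                         // ill-formed sequence found
        }
    }
}

let cafe: [UInt8] = [0x63, 0x61, 0x66, 0xC3, 0xA9]   // "café" encoded as UTF-8
let decoded = decodeScalars(cafe, using: UTF8.self)
let rejected = decodeScalars([0xFF] as [UInt8], using: UTF8.self)
```

Here `decoded` should hold the four-character string and `rejected` should be nil, since 0xFF can never appear in well-formed UTF-8.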




Re: Creating Strings from UTF-8 in [UInt8]

chris.ridd

Brent, that's fantastically helpful - thank you very much!

Chris


Re: Creating Strings from UTF-8 in [UInt8]

Jens Alfke
In reply to this post by Brent Royal-Gordon-2

On Jun 16, 2015, at 8:58 PM, Brent Royal-Gordon <[hidden email]> wrote:

> There's a struct called UTF8 that encodes and decodes UTF-8 sequences into UnicodeScalars. Using it is a little cumbersome, but very doable:

That’s way too cumbersome to be the way we’re expected to convert UTF-8 to a String. I assume that the current preferred way is to use the NSString-bridged method, and that they’ll add a pure-Swift method in time for the open source release.

—Jens


Re: Creating Strings from UTF-8 in [UInt8]

Jens Alfke
In reply to this post by chris.ridd
I just ran into this issue: I've got raw UTF-8, not nul-terminated, and I need to create a String from it. This is in some code that needs to run as fast as possible, and may eventually be cross-platform, so I want to avoid bridging through NSString. I'm skeptical of Brent's decode() approach because it looks slow (appending characters one at a time) and complex.

It looks like if I could create a String.UTF8View from a [UInt8], then I could use that to initialize a String. UTF8View doesn't show any initialization methods of its own, but it inherits from protocols like Collection. My Swift-fu is still pretty weak, however, so I'm not sure how to use that to create an instance from an array.

—Jens


Re: Creating Strings from UTF-8 in [UInt8]

Kevin Ballard
CollectionType doesn't have any initializers. There's no (public) way to construct a String.UTF8View, short of getting one from an existing String, and there's no way to mutate a String.UTF8View once you have it. In addition, a String.UTF8View is really just a transformation on top of the String's native buffer, which turns out to be UTF-16 (even for native Strings, which I think is awful but probably exists to get the expected O(1) behavior when bridged to NSString), so even if you could construct a String.UTF8View it would just end up doing the same decoding routine.
 
The only way to do better than starting with an empty string and appending to it is to either predict how many Unicode scalars the string will actually have, or actually decode the UTF-8 once to count it and then again to create the string. If you want to try to predict how many Unicode scalars there are in a UTF-8 sequence of length N, it's anywhere from N/4 to N (as the maximal encoding size for one scalar is 4 UTF-8 code units). If you're ok with over-estimating and expect your input to likely be ASCII, you can pick N. If you're not sure what your input is, N/2 is a reasonable approximation. If you really don't want to overestimate at all (e.g. memory usage is a concern), go with N/4, although once you append any scalars past this point, the string will end up overestimating anyway.
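
To make the N/4 to N range concrete, here is a small sketch of the "decode once to count" idea, written against a newer Swift than the one in this thread. It relies on a property of UTF-8 itself: continuation bytes have the bit pattern 10xxxxxx, so in well-formed input every other byte starts exactly one scalar. The function name `scalarCount(ofValidUTF8:)` is invented for illustration, and the result is only meaningful if the input really is valid UTF-8.

```swift
// Counts scalars in UTF-8 input by counting bytes that are NOT
// continuation bytes (0b10xxxxxx). Assumes the input is well-formed;
// for ill-formed input the number is only a rough estimate.
func scalarCount(ofValidUTF8 bytes: [UInt8]) -> Int {
    var count = 0
    for byte in bytes where (byte & 0xC0) != 0x80 {
        count += 1
    }
    return count
}

let ascii: [UInt8] = [0x68, 0x69]                            // "hi": N = 2
let mixed: [UInt8] = [0x68, 0xC3, 0xA9, 0x6C, 0x6C, 0x6F]    // "héllo": N = 6
let emoji: [UInt8] = [0xF0, 0x9F, 0x98, 0x80]                // U+1F600: N = 4

let asciiCount = scalarCount(ofValidUTF8: ascii)   // upper bound case: N
let mixedCount = scalarCount(ofValidUTF8: mixed)   // somewhere in between
let emojiCount = scalarCount(ofValidUTF8: emoji)   // lower bound case: N/4
```

The ASCII input hits the upper bound (2 scalars for 2 bytes), the emoji hits the lower bound (1 scalar for 4 bytes), and the mixed input lands in between (5 scalars for 6 bytes).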
 
As for actually decoding, Brent's decode() function is as good as you're going to get. If it's inlined (and if the UTF8 struct is implemented efficiently), then it should be plenty fast.
 
Also note that this returns nil on error. Another common way to handle decoding errors is to replace the bad sequence with U+FFFD instead, so you could alter the function to call `string.append("\u{FFFD}")` upon .Error if you like that idea.
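
A sketch of that U+FFFD variant, assuming the renamed decoding API of later Swift releases rather than the Swift 2 names used in this thread. `decodeLossy` is a hypothetical name. How many replacement characters a longer ill-formed run produces depends on how the codec resynchronizes, so this only illustrates the shape of the change.

```swift
// Lossy decoder: instead of returning nil on ill-formed input, append
// U+FFFD (REPLACEMENT CHARACTER) and keep decoding from the next byte
// the codec has not yet consumed.
func decodeLossy(_ bytes: [UInt8]) -> String {
    var decoder = UTF8()
    var iterator = bytes.makeIterator()
    var string = ""
    while true {
        switch decoder.decode(&iterator) {
        case .scalarValue(let scalar):
            string.unicodeScalars.append(scalar)
        case .emptyInput:
            return string
        case .error:
            string.append("\u{FFFD}")   // mark the bad sequence and continue
        }
    }
}

// 0xFF is never valid in UTF-8, so it is replaced; 'a' and 'b' survive.
let repaired = decodeLossy([0x61, 0xFF, 0x62])
```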
 
Incidentally, there's a global function transcode() that can convert between any two UnicodeCodecs, so you could say something like
 
var s = ""
transcode(UTF8.self, UTF32.self, inputSeq, { c in s.append(UnicodeScalar(c)) }, stopOnError: yesOrNo)
 
although that has to convert from a UInt32 into a UnicodeScalar, which probably does a bounds check, and so it might actually be slower than the explicit decode() function.
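
For comparison, a hedged sketch of the same idea with the labeled `transcode(_:from:to:stoppingOnError:into:)` form found in later Swift releases. That signature is an assumption about the reader's toolchain, not something from this thread. The input is an iterator rather than a sequence, and the return value reports whether an encoding error was seen. `decodeViaTranscode` is a made-up name.

```swift
// Decodes UTF-8 by transcoding to UTF-32 and turning each UInt32 code
// unit into a Unicode.Scalar. Returns nil if transcode reports an error.
func decodeViaTranscode(_ bytes: [UInt8]) -> String? {
    var string = ""
    let hadError = transcode(bytes.makeIterator(),
                             from: UTF8.self,
                             to: UTF32.self,
                             stoppingOnError: true) { codeUnit in
        // Unicode.Scalar(UInt32) is failable; this is the extra check
        // mentioned in the post above.
        if let scalar = Unicode.Scalar(codeUnit) {
            string.unicodeScalars.append(scalar)
        }
    }
    return hadError ? nil : string
}

let viaTranscode = decodeViaTranscode([0x68, 0x69])   // "hi" in UTF-8
let viaTranscodeBad = decodeViaTranscode([0xFF])      // ill-formed input
```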
 
-Kevin Ballard
 

Re: Creating Strings from UTF-8 in [UInt8]

Jens Alfke
Thanks for the info, Kevin. It still seems quite wrong to me that there’s no simple way to do this, given how fundamental it is to read strings from external data. For instance, the Obj-C project I work on, which does a lot of database access and networking, has 39 calls to NSString’s -initWithBytes:length:encoding: and -initWithData:encoding: methods in its source code (all of which specify UTF-8.)

As I said before, I’m guessing that’s because people can use the NSString methods as a crutch. When the cross-platform release of Swift comes out, there had better be a clean way to do this in pure Swift.

(There does exist String.fromCString, but it requires that the UTF-8 data be nul-terminated, which mine isn’t.)
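
For anyone reading this later: Swift releases after this thread added `String(decoding:as:)`, which takes any collection of code units and so does not need a terminator. Below is a minimal sketch of a pointer-plus-length helper in the spirit of -initWithBytes:length:encoding:. The helper name `makeString(bytes:length:)` is invented here. Note that this initializer repairs ill-formed input with U+FFFD rather than failing.

```swift
// Builds a String from `length` bytes of UTF-8 starting at `bytes`.
// No nul terminator is required. Ill-formed sequences are replaced
// with U+FFFD by String(decoding:as:) instead of causing a failure.
func makeString(bytes: UnsafePointer<UInt8>, length: Int) -> String {
    let buffer = UnsafeBufferPointer(start: bytes, count: length)
    return String(decoding: buffer, as: UTF8.self)
}

// Six bytes of storage with no terminator; only the first five are decoded.
let raw: [UInt8] = [0x53, 0x77, 0x69, 0x66, 0x74, 0x21]   // "Swift!"
let prefix = raw.withUnsafeBufferPointer { buffer in
    makeString(bytes: buffer.baseAddress!, length: 5)
}
```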

—Jens


Re: Creating Strings from UTF-8 in [UInt8]

Kevin Ballard
There are a _lot_ of areas where the Swift stdlib doesn't really provide
what you need and assumes you'll have Foundation to back you up. For
example, there are tons of APIs on String itself that are only present
once you import Foundation, including `init?<S: SequenceType where
S.Generator.Element == UInt8>(bytes: S, encoding: NSStringEncoding)`,
which would be the simplest way to do the decoding you want except that
it's not cross-platform.
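
A short sketch of what calling that Foundation initializer looks like. It is written with the later `String.Encoding` spelling (`.utf8`) rather than the `NSStringEncoding` constant from the signature above, which is an assumption about the reader's toolchain. It needs `import Foundation`, which is exactly the dependency being discussed.

```swift
import Foundation

// Foundation's failable initializer: returns nil when the bytes are not
// valid in the requested encoding.
let valid: [UInt8] = [0x68, 0x69]     // "hi" in UTF-8
let invalid: [UInt8] = [0xFF]         // never valid in UTF-8

let fromValid = String(bytes: valid, encoding: .utf8)
let fromInvalid = String(bytes: invalid, encoding: .utf8)
```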

-Kevin


Re: Creating Strings from UTF-8 in [UInt8]

chris.ridd

> On 5 Nov 2015, at 23:20, Kevin Ballard <[hidden email]> wrote:
>
> There are a _lot_ of areas where the Swift stdlib doesn't really provide
> what you need and assumes you'll have Foundation to back you up. For
> example, there are tons of APIs on String itself that are only present
> once you import Foundation, including `init?<S: SequenceType where
> S.Generator.Element == UInt8>(bytes: S, encoding: NSStringEncoding)`,
> which would be the simplest way to do the decoding you want except that
> it's not cross-platform.

Is it worth raising radars for each dependency on Foundation for “basic” functionality that should be in the stdlib?

Or should we just wait and submit bugs/patches for the open source stdlib?

Chris
