Fast and memory efficient ASCII string class for .NET

2

11

This might have been asked before, but I can't find any such posts. Is there a class to work with ASCII Strings? The benefits are numerous:

  1. Comparison should be faster since it's just byte-for-byte (instead of UTF-16 with its variable-length encoding)
  2. Memory efficient: one byte per character instead of two, so large strings should use about half the memory
  3. Faster versions of ToUpper()/ToLower() which use a language-invariant lookup table

Jon Skeet wrote a basic AsciiString implementation and proved #2, but I'm wondering if anyone took this further and completed such a class. I'm sure there would be uses, although no one would typically take such a route since all the existing String functions would have to be re-implemented by hand. And conversions between String <> AsciiString would be scattered everywhere complicating an otherwise simple program.

Is there such a class? Where?

Ptolemaist answered 1/6, 2013 at 7:8 Comment(6)
Just a few comments: .NET strings use UTF-16 internally, and you can speed up comparison a lot by using a String.Compare overload that takes a StringComparison parameter and setting it to Ordinal. - Roadwork
Ordinal comparison uses the integer values of the UTF-16 code units directly. It doesn't take into account the current culture or whether the same character can be represented by more than one sequence of Unicode code points. A culture-sensitive comparison normally treats such equivalent strings as equal, whereas an ordinal comparison does not. - Roadwork
Yes, some other string functions take a StringComparison parameter too, including String.IndexOf. - Roadwork
String comparison is already heavily optimized in .NET; the actual code lives inside the CLR and was written in C++. That was important: it avoids tempting programmers to look for a more efficient string implementation that is fundamentally broken because it can support only a few of the languages in use throughout the world. Anybody who maintains old C or C++ code knows what a giant mistake that was. - Atwell
Have you considered compressing your strings? Equality comparison would be easy, and memory use would be greatly improved (for most strings in practice), but ToUpper/ToLower/string conversions would be more taxing. Depending on what you're really trying to do, that might be all you need. - Raul
Possibly relevant: https://mcmap.net/q/1159246/-how-to-implement-string-with-1-byte-char-and-save-memory/56778 - Obnubilate
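
For reference, the ordinal comparison suggested in the comments above could look roughly like the sketch below. The class name `OrdinalComparisonSketch` and its helper methods are hypothetical names used only for illustration; the `StringComparison.Ordinal` overloads of `string.Equals`, `string.Compare` and `string.IndexOf` are the standard .NET ones.

```csharp
using System;

// Hypothetical helper class, for illustration only.
public static class OrdinalComparisonSketch
{
    // Compares the raw UTF-16 code units; no culture rules are applied.
    public static bool OrdinalEquals(string a, string b) =>
        string.Equals(a, b, StringComparison.Ordinal);

    // Negative, zero or positive, based on code unit values only.
    public static int OrdinalCompare(string a, string b) =>
        string.Compare(a, b, StringComparison.Ordinal);

    // Substring search without culture-sensitive matching.
    public static int OrdinalIndexOf(string text, string value) =>
        text.IndexOf(value, StringComparison.Ordinal);
}
```

With this, `OrdinalEquals("\u00E9", "e\u0301")` is false because the precomposed "é" and the "e" plus combining accent are different code unit sequences, even though a culture-sensitive comparison would typically treat them as equal.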
6

I thought I would post the outcome of my efforts to implement a system as described with as much string support and compatibility as I could. It's possibly not perfect but it should give you a decent base to improve on if needed.

The ASCIIChar struct and the ASCIIString class implicitly convert to their native counterparts for ease of use.

The OP's suggested replacements for ToUpper()/ToLower() and similar methods have been implemented in a way that is much quicker than a lookup table, and all the operations are as fast and memory-friendly as I could make them.
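
The linked source is not reproduced here, so the following is only a guess at what a lookup-free case conversion might look like, not the code from this answer. It relies on the fact that the ASCII letters 'a'..'z' and 'A'..'Z' are exactly 0x20 apart; the class name `AsciiCaseSketch` is hypothetical.

```csharp
// Hypothetical sketch; not the implementation referred to in the answer.
public static class AsciiCaseSketch
{
    // 'a'..'z' is 0x61..0x7A and 'A'..'Z' is 0x41..0x5A, so the cases differ by 0x20.
    public static byte ToUpper(byte c) =>
        (c >= (byte)'a' && c <= (byte)'z') ? (byte)(c - 0x20) : c;

    public static byte ToLower(byte c) =>
        (c >= (byte)'A' && c <= (byte)'Z') ? (byte)(c + 0x20) : c;

    // Returns a new array so the source bytes stay untouched.
    public static byte[] ToUpper(byte[] source)
    {
        var result = new byte[source.Length];
        for (int i = 0; i < source.Length; i++)
        {
            result[i] = ToUpper(source[i]);
        }
        return result;
    }
}
```

Bytes outside the two letter ranges (digits, punctuation, control characters) pass through unchanged, which is what makes the conversion language invariant.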

Sorry couldn't post source, it was too long. See links below.

  • ASCIIChar - Replaces char, stores the value in a single byte instead of a 16-bit char, and provides support methods and compatibility for the string class. Implements virtually all methods and properties available for char.

  • ASCIIChars - Provides static properties for each of the valid ASCII characters for ease of use.

  • ASCIIString - Replaces string, stores characters in a byte array and implements virtually all methods and properties available for string.
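
To give a rough idea of the shape of such a type, here is a deliberately minimal sketch of a byte-backed, immutable ASCII string. It is an illustration under my own assumptions, not the linked ASCIIString: the name `AsciiStringSketch` and all of its members are hypothetical, and only construction, indexing, equality and conversion are covered.

```csharp
using System;
using System.Text;

// Hypothetical minimal sketch; the real class implements far more of the string API.
public sealed class AsciiStringSketch : IEquatable<AsciiStringSketch>
{
    private readonly byte[] _bytes; // one byte per character

    public AsciiStringSketch(string value)
    {
        if (value == null) throw new ArgumentNullException(nameof(value));
        _bytes = new byte[value.Length];
        for (int i = 0; i < value.Length; i++)
        {
            char c = value[i];
            // Reject anything outside the 7-bit ASCII range.
            if (c > 0x7F) throw new ArgumentException("Non-ASCII character at index " + i, nameof(value));
            _bytes[i] = (byte)c;
        }
    }

    public int Length => _bytes.Length;

    public char this[int index] => (char)_bytes[index];

    // Plain byte-for-byte comparison.
    public bool Equals(AsciiStringSketch other)
    {
        if (other is null || other._bytes.Length != _bytes.Length) return false;
        for (int i = 0; i < _bytes.Length; i++)
        {
            if (_bytes[i] != other._bytes[i]) return false;
        }
        return true;
    }

    public override bool Equals(object obj) => Equals(obj as AsciiStringSketch);

    public override int GetHashCode()
    {
        unchecked
        {
            int hash = 17;
            foreach (byte b in _bytes) hash = hash * 31 + b;
            return hash;
        }
    }

    public override string ToString() => Encoding.ASCII.GetString(_bytes);

    // Converting to string cannot fail, so it is implicit.
    public static implicit operator string(AsciiStringSketch s) => s?.ToString();

    // Converting from string can throw on non-ASCII input, so it is explicit here.
    public static explicit operator AsciiStringSketch(string s) =>
        s == null ? null : new AsciiStringSketch(s);
}
```

Usage would be along the lines of `var a = (AsciiStringSketch)"hello"; string s = a;`. Making the conversion from string explicit is a design choice of this sketch, since that direction can throw.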

Crossbar answered 2/6, 2013 at 13:58 Comment(4)
Simply superb! A complete ASCIIString class with accelerated implementations of methods identical to the String class API! Fantastic work. - Ptolemaist
@PeterLaCombJr. Yes, both the char structure and the string class are immutable (the only field is readonly). - Crossbar
I corrected a last-minute error in the string class where the Parse method wasn't static. - Crossbar
@Crossbar - Have you posted this code on GitHub or created a NuGet package for it? I would like to use it, and I could copy it from the pastebin, but I was wondering if it had been put somewhere the community could contribute, edit and update. - Rotberg
-2

.NET has no direct ASCII string support. Strings are UTF-16 because the Windows API works only with ASCII (one char, one byte) or UTF-16. UTF-8 would be the best solution, but .NET strings do not use it because Windows doesn't.

The Windows API can convert between character sets, but it only works with 1-byte or 2-byte characters, so if you used UTF-8 strings in .NET you would have to convert them every time, which hurts performance. .NET can read and write UTF-8 and other encodings via BinaryWriter/BinaryReader or a simple StreamWriter/StreamReader.

Aboral answered 1/6, 2013 at 20:54 Comment(1)
This does not answer the question that was asked. It should be a comment. You have not got enough rep to comment, but that's just tough. Get some rep and then you can comment. And you can get rep by answering questions with real answers. - Gasholder
