M

5

86

For layout tests we have the famous "Lorem ipsum" text to check how things look.

What I am looking for is a set of files containing text in several different encodings that I can use in my JUnit tests for methods that deal with character encoding when reading text files.

Example:

Say I have an ISO 8859-1 encoded test file and a Windows-1252 encoded test file. The Windows-1252 file has to exercise the differences in the region 80₁₆ – 9F₁₆. In other words, it must contain at least one character from this region to distinguish it from ISO 8859-1.

Maybe the best set of test files is one where the file for each encoding contains every one of its characters once. But maybe I am overlooking something; we all like this encoding stuff, right? :-)
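The difference can be pinned down in a few lines. This is a minimal sketch, assuming the JDK in use ships the `windows-1252` charset (it is not one of the charsets every Java platform is required to support); the class name `C1RangeDemo` and its methods are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: decodes the same bytes with both charsets so a test can
// compare the results.
final class C1RangeDemo {

    // Assumption: the running JDK provides windows-1252; Charset.forName throws otherwise.
    static final Charset WINDOWS_1252 = Charset.forName("windows-1252");

    static String decodeAsLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    static String decodeAsWindows1252(byte[] bytes) {
        return new String(bytes, WINDOWS_1252);
    }

    public static void main(String[] args) {
        // 0x80 lies inside the 80-9F region, 0xE9 lies outside it.
        byte[] sample = {(byte) 0x80, (byte) 0xE9};
        // Print the code point each charset assigns to the first byte.
        System.out.printf("ISO 8859-1:   U+%04X%n", decodeAsLatin1(sample).codePointAt(0));
        System.out.printf("Windows-1252: U+%04X%n", decodeAsWindows1252(sample).codePointAt(0));
    }
}
```

A test file that contains only bytes outside 80-9F, such as 0xE9 here, decodes identically under both charsets and therefore cannot tell them apart.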

Is there such a set of test-files for character-encoding issues out there?

Merocrine answered 8/2, 2012 at 9:8 Comment(3)

+1: I've just spent quite a bit of time implementing a UTF-8 decoder. Handling all the corner cases requires more unit tests than you might think. – Need 10/2, 2012 at 13:24

"Text encoded with several different encodings": for good coverage you also want sample byte sequences that contain invalid bytes. According to the UTF-8 Wikipedia page, mishandling those cases has introduced security vulnerabilities in some high profile products. – Need 10/2, 2012 at 13:27

@Need Of course, that's a good point. I was not aware of this. In my opinion just one more reason for a mature test-suite for encoding issues. It does not have to be a set of files. It can also be a library providing test data that can be used in JUnit tests. For example it can provide critical/invalid byte sequences for common charsets and reference Strings for comparison after decoding sample byte sequences. Just some thoughts and I wonder how this encoding stuff got tested in all the libs around ... – Merocrine 10/2, 2012 at 14:0
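The invalid-byte cases raised in these comments can be exercised with the JDK alone. The sketch below asks the standard UTF-8 decoder to report malformed input instead of silently replacing it; the class name `StrictUtf8` and the three sample constants are hypothetical and far from an exhaustive set.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for tests that need to know whether bytes are well-formed UTF-8.
final class StrictUtf8 {

    // Overlong two-byte encoding of '/' (U+002F), which a conforming decoder must reject.
    static final byte[] OVERLONG_SLASH = {(byte) 0xC0, (byte) 0xAF};
    // A continuation byte with no lead byte in front of it.
    static final byte[] LONE_CONTINUATION = {(byte) 0x80};
    // The first two bytes of the three-byte euro sign sequence E2 82 AC.
    static final byte[] TRUNCATED_EURO = {(byte) 0xE2, (byte) 0x82};

    // Returns true only if the whole array decodes without any reported error.
    static boolean isWellFormed(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

The explicit `REPORT` actions matter here: `new String(bytes, charset)` would substitute U+FFFD for the bad bytes and never signal a problem.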

M

28

How about trying to use the ICU test suite files? I don't know if they are what you need for your test, but they seem to have pretty complete from/to UTF mapping files at least: Link to the repo for ICU test files

Marinamarinade answered 16/2, 2012 at 12:41 Comment(2)

+1 my favorite so far. I read the documentation for an hour and it seems to provide everything I need, at least for Unicode-related stuff. – Merocrine 16/2, 2012 at 15:26

I think this is really the best answer so far. I accepted it and hope you'll get some reputation for it. If answered a week earlier I am sure it would have scored much better in comparison to other answers here. Anyway thanks! – Merocrine 17/2, 2012 at 0:6

R

42

The Wikipedia article on diacritics is pretty comprehensive; unfortunately, you have to extract the characters manually. There may also be mnemonic sentences for each language. For instance, in Polish we use:

Zażółć gęślą jaźń

which contains all 9 Polish diacritics in one correct sentence. Another useful search term is "pangram": a sentence that uses every letter of the alphabet at least once:

in Spanish, "El veloz murciélago hindú comía feliz cardillo y kiwi. La cigüeña tocaba el saxofón detrás del palenque de paja." (all 27 letters and diacritics).

in Russian, "Съешь же ещё этих мягких французских булок, да выпей чаю" (all 33 Russian Cyrillic alphabet letters).

The "List of pangrams" page contains an extensive collection. Anyone care to wrap this in a simple:

public interface NationalCharacters {
  String spanish();
  String russian();
  //...
}

library?
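As a rough sketch of what such a library could look like, here are the Polish and Spanish sentences from above as constants, plus a round-trip check for use in tests. The class `Pangrams` and the method `roundTrips` are hypothetical names; the strings are written with `\u` escapes only so that the example does not depend on the encoding of the source file itself, and the Russian sentence could be added the same way.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical holder for per-language test sentences.
final class Pangrams {

    // "Zażółć gęślą jaźń": all 9 Polish diacritics.
    static final String POLISH =
            "Za\u017C\u00F3\u0142\u0107 g\u0119\u015Bl\u0105 ja\u017A\u0144";

    // The Spanish pangram quoted above (all 27 letters and diacritics).
    static final String SPANISH =
            "El veloz murci\u00E9lago hind\u00FA com\u00EDa feliz cardillo y kiwi. "
            + "La cig\u00FCe\u00F1a tocaba el saxof\u00F3n detr\u00E1s del palenque de paja.";

    // True if encoding the text and decoding it again yields the same string.
    // String.getBytes substitutes a replacement byte for unmappable characters,
    // so a lossy charset makes this return false.
    static boolean roundTrips(String text, Charset charset) {
        byte[] encoded = text.getBytes(charset);
        return text.equals(new String(encoded, charset));
    }

    public static void main(String[] args) {
        System.out.println(roundTrips(SPANISH, StandardCharsets.ISO_8859_1));
        System.out.println(roundTrips(POLISH, StandardCharsets.ISO_8859_1));
        System.out.println(roundTrips(POLISH, StandardCharsets.UTF_8));
    }
}
```

The Spanish sentence fits into ISO 8859-1 while the Polish one does not, which makes the pair a cheap way to check that a method really honours the charset it is given.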

Ralphralston answered 8/2, 2012 at 9:23 Comment(1)

For sure this is a +1 answer. I'll wait a bit in the hope that there really is a well-thought-out set of test files out there, because some encodings are built on top of others, and so on. I think it would be very good to have test files for each encoding that trigger the differences. But maybe I am wrong and there are good reasons why they do not exist. – Merocrine 8/2, 2012 at 9:40

M

8

I don't know of any complete text documents, but if you can start with a simple overview of all character sets, there are some files available on the ftp.unicode.org server.

Here's Windows-1250, for example (CP1252.TXT for Windows-1252 sits in the same directory). The first column is the byte value in hexadecimal and the second is the corresponding Unicode code point.

ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1250.TXT
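A mapping file like this can be turned into a lookup table for tests. The sketch below assumes the layout described above (a hexadecimal byte value, a hexadecimal Unicode value, then a `#` comment, separated by whitespace) and assumes that undefined bytes simply have no second column; `MappingTableParser` is a hypothetical name, and the lines are passed in as strings so nothing is downloaded.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical parser for "0x80<TAB>0x20AC<TAB>#EURO SIGN" style mapping lines.
final class MappingTableParser {

    // Returns byte value -> Unicode code point, in file order.
    static Map<Integer, Integer> parse(List<String> lines) {
        Map<Integer, Integer> table = new LinkedHashMap<>();
        for (String line : lines) {
            // Drop everything from '#' on: full-line comments and character names.
            int hash = line.indexOf('#');
            String data = (hash >= 0 ? line.substring(0, hash) : line).trim();
            if (data.isEmpty()) {
                continue;
            }
            String[] columns = data.split("\\s+");
            if (columns.length < 2) {
                continue; // byte value without a mapping: treated as undefined
            }
            // Integer.decode understands the 0x prefix.
            table.put(Integer.decode(columns[0]), Integer.decode(columns[1]));
        }
        return table;
    }
}
```

Given such a table, a test can build every mapped byte for a charset and compare the decoder's output against the second column, which comes close to the "every character once" file the question asks for.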

Mimeograph answered 10/2, 2012 at 22:40 Comment(1)

+1 Thanks for your effort. Very interesting resource of files. – Merocrine 17/2, 2012 at 0:8

I

3

There are a few comprehensive, ready-to-use Unicode test files that can be downloaded directly.

From w3c

Here, there's a nice test file by w3.org including: maths, linguistics, Greek, Georgian, Russian, Thai, Runes, Braille among many others in a single file:

https://www.w3.org/2001/06/utf-8-test/UTF-8-demo.html

Coming from w3.org, it should be safe to use, shouldn't it?

Cutting out the HTML part

If you want to get the "original txt file" without the risk of your editor messing it up: 1) download it, 2) strip the HTML wrapper with tail and head, 3) check the result with a diff:

wget https://www.w3.org/2001/06/utf-8-test/UTF-8-demo.html
tail -n +8 UTF-8-demo.html | head -n -3 > UTF-8-demo.txt
diff UTF-8-demo.html UTF-8-demo.txt

This generates a UTF-8-demo.txt without human intervention and without the risk of losing data.

More from w3c

There are many more files one level up in the directory structure, still inside the dir utf-8-test:

https://www.w3.org/2001/06/utf-8-test/

From github

There's a very interesting file here too, with ALL printable characters (including Chinese, Braille, Arabic, etc.):

https://raw.githubusercontent.com/bits/UTF-8-Unicode-Test-Documents/master/UTF-8_sequence_separated/utf8_sequence_0-0x10ffff_assigned_printable.txt

Want non-printable characters too?

There are also many more test files in the same repo:

https://github.com/bits/UTF-8-Unicode-Test-Documents

and also a generator, if you don't trust the committed file and want to generate it yourself.
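For a sense of what generating such a file involves, here is a small sketch in Java. It is not the generator from that repository; the class `PrintableCodePoints` is hypothetical, and "printable" is taken here to mean "assigned and not a control, format, surrogate, or private-use character", which is only one possible definition. The result also depends on the Unicode version the running JDK supports.

```java
// Hypothetical generator for a string of "printable" code points in a range.
final class PrintableCodePoints {

    // Assumption: printable = assigned, and not control/format/surrogate/private use.
    static boolean isPrintable(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.UNASSIGNED:
            case Character.CONTROL:
            case Character.FORMAT:
            case Character.SURROGATE:
            case Character.PRIVATE_USE:
                return false;
            default:
                return true;
        }
    }

    // Concatenates every printable code point in [from, to], both inclusive.
    static String generate(int from, int to) {
        StringBuilder out = new StringBuilder();
        for (int cp = from; cp <= to; cp++) {
            if (isPrintable(cp)) {
                out.appendCodePoint(cp); // emits a surrogate pair above U+FFFF
            }
        }
        return out.toString();
    }
}
```

Calling `generate(0, Character.MAX_CODE_POINT)` and writing `getBytes(StandardCharsets.UTF_8)` to a file would give a local counterpart to the committed test document.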

My personal choice

I have decided that for my projects I'll start with 2 files: The specific one I pointed out from w3c and the specific one I pointed out from the github repo by bits.

Idiographic answered 17/8, 2021 at 22:25 Comment(0)

G

1

Well, I used an online tool to create my test character sets from Lorem Ipsum. I believe it can help you. I don't have one that has all the different charsets on a single page.

http://generator.lorem-ipsum.info/

Gesticulative answered 8/2, 2012 at 11:21 Comment(1)

Lorem ipsum consists of only Latin characters, as it is in Latin. This is not what is being asked here. BTW: repo1.maven.org/maven2/org/codeswarm/lipsum/1.0 – Ralphralston 8/2, 2012 at 11:36