Generating a pseudo-natural phrase from a big integer in a reversible way
Asked Answered
E

4

6

I have a large and "unique" integer (actually a SHA1 hash).

Note: While I'm talking here about SHA1 hashes, this is not a cryptography / security question! I'm not trying to break SHA1. Imagine a random 160-bit integer instead of SHA1 if that will help.

I want (for no other reason than to have fun) to find an algorithm to map that SHA1 hash to a computer-generated (pseudo-)English phrase. The mapping should be bidirectional (i.e., knowing the algorithm, one must be able to calculate the original SHA1 hash from that phrase.)

The phrase need not make sense. I would even settle for a whole paragraph of nonsense. (Though quality — englishness — of a paragraph should probably be better than for a mere phrase.)

A better algorithm would produce shorter, more natural-looking, more unique phrases.

A variation: it is OK if I will be able to work only with a part of hash. Say, first six hex digits is fine.

The possible usage of the generated phrase: the human readable version of Git commit ID, to use as a motto for a given program version, which is built from that commit. (As I said, this is "for fun". I don't claim that this is very practical — or be much more readable than the SHA1 itself.)

Possible approach: In the past I've attempted to build a probability table (of words), and generate phrases as Markov chains, seeding the generator (picking branches from probability tree), according to the bits I read from the SHA. This was not very successful, the resulting phrases were too long and ugly. I'm not sure if this was a bug, or the general flaw in the algorithm, since I had to abandon it early enough.

Now I'm thinking about attempting to solve the problem once again. Any advice on how to approach this? Do you think Markov chain approach can work here? Something else?

Enrage answered 13/1, 2011 at 18:10 Comment(3)
I don't really know anything about cryptography. So i just want to make sure i understand the question. You basically want to encode a large integer into a unique sentence, so that it sounds as natural as possible ?Hixon
@yurib: yes, that is basically it.Enrage
@yurib: except that I also want to be able to convert this sentence back to that integer later.Enrage
I
3

A very simple approach would be: Take list of say 1024 nouns, 1024 verbs and 1024 adjectives each. Your phrase could then be sentence of the form

noun[bits_01-10] verb[bits11-20] adjective[bits21-30] verb[bits31-40],
noun[bits_41-50] verb[bits51-60] adjective[bits61-70] verb[bits71-80],
noun[bits_81-90] verb[bits91-100] adjective[bits101-110] verb[bits111-120] and 
noun[bits_121-130] verb[bits131-140] adjective[bits141-150] verb[bits151-160].

With a bit more linguistic thought you can probably construct slightly more complicated ad thus not so repetitive looking sentences (say, a bit for singular / plural, a bit of two for different tenses,...). Longer word lists use up a few more bits but my guess is that you reach rather exotic words quite fast.

Insolent answered 13/1, 2011 at 20:23 Comment(4)
Clever! Well, one more lesson in KISS for me. :-)Enrage
Also: I think that "rather exotic words" could be a half of the fun. (Think "Maverick Meerkat" for example.)Enrage
Does someone know where to get good word corpus, split by nouns verbs and adjectives?Enrage
Looks like that Part of Speech Database here would do: wordlist.sourceforge.netEnrage
E
1

We'll, lets see... The english language has about 1,000,000 words. That's about 20 bits per word. SHA1 is 160 bits, so you'll need 8 words. Theoretically, All you'll have to do is to take the n'th word of the oxford english dictionary, where n is a group of 20 bits at a time.

Now, to make it more natural, you can try to add "in/at/on/and/the..." between words, according to their type (nouns,verbs...) using some simple algorithm. (You should remove all these words from your base dictionary, of course).

The algorithm is reversible: Just remove all the words you've added, and convert each word to it's 20-bit index.

Also, try google "insult generator". Some of those generators are pretty nice. I'm not sure about the number of combinations, though.

You can buy the Oxford English Dictionary on CD-ROM with more than 500,000 words (19-bit). I'm not sure if it would be easy to extract the words and their types, however. I'm not sure if it is legal, but I think you can't claim a patent on dictionary entries...

Elysium answered 13/1, 2011 at 18:49 Comment(6)
-1: what is this supposed to mean? it is HASH algorithm, it depends an all data, and you cannot predict collisions, is this super-naive or what?! EDIT: -1 removed, question is ambiguous, conversion of hash into words might be understood un-cryptograhic wayUnnumbered
@peenut: please read my comment to your answer. I'm not trying to break SHA.Enrage
@peenut: It's just 160 bits. I'm just suggesting a 1 to 1 mapping between any 160 bit stream and something readable in english.Elysium
@LK: Will it be reversible? (Looks like it would.) OK, nice idea. Any other ideas, how to make the phrase even more natural?Enrage
@LK: Also: do you know where to get a corpus of 1M English words?Enrage
@LK: I think that insult generators use Markovian models.Enrage
D
1

This is an old question but entropoetry is a JavaScript (Node/frontend) library that also solves this problem. It combines Markov poetry with Huffman coding, so given the same dictionary (i.e., the same version of the library), converting poetry↔︎numbers will be bidirectional.

Example, from the Node command line:

> var Poet = require('entropoetry'); var p = new Poet();
> p.stringify(Buffer.from('deadbeef', 'hex'))
'old trick of loving you\nif you but'
> console.log(p.parse(`old trick of loving you
... if you but`))
<Buffer de ad be ef>

And as technology marches on, what seemed like a “fun only” idea in 2011 has some real uses in 2017: memorizing cryptocurrency private keys (brain wallet), Dat/IPFS links, etc.

Discount answered 31/12, 2017 at 1:47 Comment(0)
U
0

Hash function means it is not possible (within reasonable limits) to get a data from hash, unless it is broken (insecure).

Question should be about breaking SHA-1 hash algorithm - look at Google, it's not that broken. So no, you cannot create English phrase from SHA-1 hash code, if you can, please make a huge paper about that, lot of them are useless, this would be breakthrough :-)

Edit: if only part of hash is enough, I suggest just brute force (+ simple map of hash<->phrase, possibly in a file or db), breaking hash algorithm is very "strong soup" (difficult problem).

Edit2: guys be more specific when asking question, not my fault... I will not delete this so that it scares off any other crypto guys around :-)

Unnumbered answered 13/1, 2011 at 18:44 Comment(1)
Sorry, I do not ask about extracting information from SHA-1. I ask about generating information, using SHA-1 (a big integer) as a seed. This is not a security question.Enrage

© 2022 - 2024 — McMap. All rights reserved.