How to create paragraphs from markov chain output?
Asked Answered
H

2

6

I would like to modify the script below so that it creates paragraphs out of a random number of the sentences generated by the script. In other words, concatenate a random number (like 1-5) of sentences before adding a newline.

The script works fine as-is, but the output is short sentences separated by a newline. I would like to gather up some sentences into paragraphs.

Any ideas on best practices? Thanks.

"""
    from:  http://code.activestate.com/recipes/194364-the-markov-chain-algorithm/?in=lang-python
"""

import random;
import sys;

stopword = "\n" # Since we split on whitespace, this can never be a word
stopsentence = (".", "!", "?",) # Cause a "new sentence" if found at the end of a word
sentencesep  = "\n" #String used to seperate sentences


# GENERATE TABLE
w1 = stopword
w2 = stopword
table = {}

for line in sys.stdin:
    for word in line.split():
        if word[-1] in stopsentence:
            table.setdefault( (w1, w2), [] ).append(word[0:-1])
            w1, w2 = w2, word[0:-1]
            word = word[-1]
        table.setdefault( (w1, w2), [] ).append(word)
        w1, w2 = w2, word
# Mark the end of the file
table.setdefault( (w1, w2), [] ).append(stopword)

# GENERATE SENTENCE OUTPUT
maxsentences  = 20

w1 = stopword
w2 = stopword
sentencecount = 0
sentence = []

while sentencecount < maxsentences:
    newword = random.choice(table[(w1, w2)])
    if newword == stopword: sys.exit()
    if newword in stopsentence:
        print ("%s%s%s" % (" ".join(sentence), newword, sentencesep))
        sentence = []
        sentencecount += 1
    else:
        sentence.append(newword)
    w1, w2 = w2, newword

EDIT 01:

Okay, I have cobbled together a simple "paragraph wrapper," which works well to gather the sentences into paragraphs, but it messed with the output of the sentence generator - I'm getting excessive repetitiveness of the first words, for example, among other issues.

But the premise is sound; I just need to figure out why the functionality of the sentence loop was affected by the addition of the paragraph loop. Please advise if you can see the problem:

###
#    usage: $ python markov_sentences.py < input.txt > output.txt
#    from:  http://code.activestate.com/recipes/194364-the-markov-chain-algorithm/?in=lang-python
###

import random;
import sys;

stopword = "\n" # Since we split on whitespace, this can never be a word
stopsentence = (".", "!", "?",) # Cause a "new sentence" if found at the end of a word
paragraphsep  = "\n\n" #String used to seperate sentences


# GENERATE TABLE
w1 = stopword
w2 = stopword
table = {}

for line in sys.stdin:
    for word in line.split():
        if word[-1] in stopsentence:
            table.setdefault( (w1, w2), [] ).append(word[0:-1])
            w1, w2 = w2, word[0:-1]
            word = word[-1]
        table.setdefault( (w1, w2), [] ).append(word)
        w1, w2 = w2, word
# Mark the end of the file
table.setdefault( (w1, w2), [] ).append(stopword)

# GENERATE PARAGRAPH OUTPUT
maxparagraphs = 10
paragraphs = 0 # reset the outer 'while' loop counter to zero

while paragraphs < maxparagraphs: # start outer loop, until maxparagraphs is reached
    w1 = stopword
    w2 = stopword
    stopsentence = (".", "!", "?",)
    sentence = []
    sentencecount = 0 # reset the inner 'while' loop counter to zero
    maxsentences = random.randrange(1,5) # random sentences per paragraph

    while sentencecount < maxsentences: # start inner loop, until maxsentences is reached
        newword = random.choice(table[(w1, w2)]) # random word from word table
        if newword == stopword: sys.exit()
        elif newword in stopsentence:
            print ("%s%s" % (" ".join(sentence), newword), end=" ")
            sentencecount += 1 # increment the sentence counter
        else:
            sentence.append(newword)
        w1, w2 = w2, newword
    print (paragraphsep) # newline space
    paragraphs = paragraphs + 1 # increment the paragraph counter


# EOF

EDIT 02:

Added sentence = [] as per answer below into elif statement. To wit;

        elif newword in stopsentence:
            print ("%s%s" % (" ".join(sentence), newword), end=" ")
            sentence = [] # I have to be here to make the new sentence start as an empty list!!!
            sentencecount += 1 # increment the sentence counter

EDIT 03:

This is the final iteration of this script. Thanks to grieve for the help in sorting this out. I hope others can have some fun with this, I know I will. ;)

FYI: There is one small artifact - there is an extra end-of-paragraph space that you might want to clean up if you use this script. But, other than that, a perfect implementation of markov chain text generation.

###
#    usage: python markov_sentences.py < input.txt > output.txt
#    from:  http://code.activestate.com/recipes/194364-the-markov-chain-algorithm/?in=lang-python
###

import random;
import sys;

stopword = "\n" # Since we split on whitespace, this can never be a word
stopsentence = (".", "!", "?",) # Cause a "new sentence" if found at the end of a word
sentencesep  = "\n" #String used to seperate sentences


# GENERATE TABLE
w1 = stopword
w2 = stopword
table = {}

for line in sys.stdin:
    for word in line.split():
        if word[-1] in stopsentence:
            table.setdefault( (w1, w2), [] ).append(word[0:-1])
            w1, w2 = w2, word[0:-1]
            word = word[-1]
        table.setdefault( (w1, w2), [] ).append(word)
        w1, w2 = w2, word
# Mark the end of the file
table.setdefault( (w1, w2), [] ).append(stopword)

# GENERATE SENTENCE OUTPUT
maxsentences  = 20

w1 = stopword
w2 = stopword
sentencecount = 0
sentence = []
paragraphsep = "\n"
count = random.randrange(1,5)

while sentencecount < maxsentences:
    newword = random.choice(table[(w1, w2)]) # random word from word table
    if newword == stopword: sys.exit()
    if newword in stopsentence:
        print ("%s%s" % (" ".join(sentence), newword), end=" ")
        sentence = []
        sentencecount += 1 # increment the sentence counter
        count -= 1
        if count == 0:
            count = random.randrange(1,5)
            print (paragraphsep) # newline space
    else:
        sentence.append(newword)
    w1, w2 = w2, newword


# EOF
Hypercorrection answered 20/10, 2012 at 21:13 Comment(0)
C
3

You need to copy

sentence = [] 

Back into the

elif newword in stopsentence:

clause.

So

while paragraphs < maxparagraphs: # start outer loop, until maxparagraphs is reached
    w1 = stopword
    w2 = stopword
    stopsentence = (".", "!", "?",)
    sentence = []
    sentencecount = 0 # reset the inner 'while' loop counter to zero
    maxsentences = random.randrange(1,5) # random sentences per paragraph

    while sentencecount < maxsentences: # start inner loop, until maxsentences is reached
        newword = random.choice(table[(w1, w2)]) # random word from word table
        if newword == stopword: sys.exit()
        elif newword in stopsentence:
            print ("%s%s" % (" ".join(sentence), newword), end=" ")
            sentence = [] # I have to be here to make the new sentence start as an empty list!!!
            sentencecount += 1 # increment the sentence counter
        else:
            sentence.append(newword)
        w1, w2 = w2, newword
    print (paragraphsep) # newline space
    paragraphs = paragraphs + 1 # increment the paragraph counter

Edit

Here is a solution without using the outer loop.

"""
    from:  http://code.activestate.com/recipes/194364-the-markov-chain-algorithm/?in=lang-python
"""

import random;
import sys;

stopword = "\n" # Since we split on whitespace, this can never be a word
stopsentence = (".", "!", "?",) # Cause a "new sentence" if found at the end of a word
sentencesep  = "\n" #String used to seperate sentences


# GENERATE TABLE
w1 = stopword
w2 = stopword
table = {}

for line in sys.stdin:
    for word in line.split():
        if word[-1] in stopsentence:
            table.setdefault( (w1, w2), [] ).append(word[0:-1])
            w1, w2 = w2, word[0:-1]
            word = word[-1]
        table.setdefault( (w1, w2), [] ).append(word)
        w1, w2 = w2, word
# Mark the end of the file
table.setdefault( (w1, w2), [] ).append(stopword)

# GENERATE SENTENCE OUTPUT
maxsentences  = 20

w1 = stopword
w2 = stopword
sentencecount = 0
sentence = []
paragraphsep == "\n\n"
count = random.randrange(1,5)

while sentencecount < maxsentences:
    newword = random.choice(table[(w1, w2)])
    if newword == stopword: sys.exit()
    if newword in stopsentence:
        print ("%s%s" % (" ".join(sentence), newword), end=" ")
        sentence = []
        sentencecount += 1
        count -= 1
        if count == 0:
            count = random.randrange(1,5)
            print (paragraphsep)
    else:
        sentence.append(newword)
    w1, w2 = w2, newword
Caution answered 23/10, 2012 at 20:36 Comment(4)
Oops! Yeah, I must have yanked that out at some point and forgotten to put it back in. Thanks for the insight! That did the trick - almost. It seems as if the sentence loop re-uses the same starting words for each sentence. Any ideas on how to mix up the first words it chooses for the sentence generation?Hypercorrection
I added a separate solution that doesn't need the outer loop.Caution
I don't have python 3 installed at the moment, so you may need to tweak the second solution for syntax.Caution
Sweet. Thanks grieve! That works perfectly. There was a small edit needed, but nothing major. Please refer to the original post for the final code. I can't thank you enough - I was pulling my hair out. Very nice job.Hypercorrection
S
1

Do you understand this code? I bet you can find the bit that's printing the sentence, and change it to print several sentences together, without returns. You could add another while loop around the sentences bit to get multiple paragraphs.

Syntax hint:

print 'hello'
print 'there'
hello
there

print 'hello',
print 'there'
hello there

print 'hello',
print 
print 'there'

The point is that a comma at the end of a print statement prevents the return at the end of the line, and a blank print statement prints a return.

Soap answered 21/10, 2012 at 3:2 Comment(1)
Yep, I follow. The trouble is, everything I tried with the print statement didn't help to gather the sentences into paragraphs (unless you count taking out all line breaks, making one giant paragraph). The while loop is what I was thinking of, but I wasn't quite sure how to wrap the sentence part. Everything I tried led to various errors, so I figured I would ask the experts. What's the best way to tell it "generate x (1-5 for example) amount of sentences, then insert a line break, and then repeat until maxsentences is reached"?Hypercorrection

© 2022 - 2024 — McMap. All rights reserved.