What is a regular language?
Asked Answered
C

2

92

I'm trying to understand the concept of languages levels (regular, context free, context sensitive, etc.).

I can look this up easily, but all explanations I find are a load of symbols and talk about sets. I have two questions:

  1. Can you describe in words what a regular language is, and how the languages differ?

  2. Where do people learn to understand this stuff? As I understand it, it is formal mathematics? I had a couple of courses at uni which used it and barely anyone understood it as the tutors just assumed we knew it. Where can I learn it and why are people "expected" to know it in so many sources? It's like there's a gap in education.

Here's an example:

Any language belonging to this set is a regular language over the alphabet.

How can a language be "over" anything?

Clipclop answered 16/7, 2011 at 15:4 Comment(3)
Actually, Wikipedia does not too bad of a job explaining all this. You might want to start at en.wikipedia.org/wiki/Formal_language and then work your way through the topics. This would come up in a "theoretical computer science" course for example.Whitefish
I'd say questions like this are on-topic for SO. See Where to discuss computer science? on meta.Sophiasophie
Read: What is basically a regular language? And Why a*b* is regular? But language { a^nb^n | n > 0 } is not a regular languageFeverish
T
172

In the context of computer science, a word is the concatenation of symbols. The used symbols are called the alphabet. For example, some words formed out of the alphabet {0,1,2,3,4,5,6,7,8,9} would be 1, 2, 12, 543, 1000, and 002.

A language is then a subset of all possible words. For example, we might want to define a language that captures all elite MI6 agents. Those all start with double-0, so words in the language would be 007, 001, 005, and 0012, but not 07 or 15. For simplicity's sake, we say a language is "over an alphabet" instead of "a subset of words formed by concatenation of symbols in an alphabet".

In computer science, we now want to classify languages. We call a language regular if it can be decided if a word is in the language with an algorithm/a machine with constant (finite) memory by examining all symbols in the word one after another. The language consisting just of the word 42 is regular, as you can decide whether a word is in it without requiring arbitrary amounts of memory; you just check whether the first symbol is 4, whether the second is 2, and whether any more numbers follow.

All languages with a finite number of words are regular, because we can (in theory) just build a control flow tree of constant size (you can visualize it as a bunch of nested if-statements that examine one digit after the other). For example, we can test whether a word is in the "prime numbers between 10 and 99" language with the following construct, requiring no memory except the one to encode at which code line we're currently at:

if word[0] == 1:
  if word[1] == 1: # 11
      return true # "accept" word, i.e. it's in the language
  if word[1] == 3: # 13
      return true
...
return false

Note that all finite languages are regular, but not all regular languages are finite; our double-0 language contains an infinite number of words (007, 008, but also 004242 and 0012345), but can be tested with constant memory: To test whether a word belongs in it, check whether the first symbol is 0, and whether the second symbol is 0. If that's the case, accept it. If the word is shorter than three or does not start with 00, it's not an MI6 code name.

Formally, the construct of a finite-state machine or a regular grammar is used to prove that a language is regular. These are similar to the if-statements above, but allow for arbitrarily long words. If there's a finite-state machine, there is also a regular grammar, and vice versa, so it's sufficient to show either. For example, the finite state machine for our double-0 language is:

start state:  if input = 0 then goto state 2
start state:  if input = 1 then fail
start state:  if input = 2 then fail
...
state 2: if input = 0 then accept
state 2: if input != 0 then fail
accept: for any input, accept

The equivalent regular grammar is:

start → 0 B
B → 0 accept
accept → 0 accept
accept → 1 accept
...

The equivalent regular expression is:

00[0-9]*

Some languages are not regular. For example, the language of any number of 1, followed by the same number of 2 (often written as 1n2n, for an arbitrary n) is not regular - you need more than a constant amount of memory (= a constant number of states) to store the number of 1s to decide whether a word is in the language.

This should usually be explained in the theoretical computer science course. Luckily, Wikipedia explains both formal and regular languages quite nicely.

Thalassography answered 16/7, 2011 at 15:18 Comment(16)
Thank you, I understood all but the last bit on non-regular languages. Youy need memory to store the number of 1s yes, but in the MI6 example don't you need memory to remember that the first two characters are 0?Clipclop
@user666254 A constant amount of memory is always allowed, because it can be encoded into states. The problem is you need more than any constant amount of memory to test for 1^n2^n when n approaches infinity. Formally, you usually use the pumping lemma for regular langauges to show a language isn't regular.Thalassography
" our double-0 language is finite, two" - do you mean that it's infinite, too? (I see you used 'two' instead of 'too' elsewhere as well...)I
To be more specific,you need 3n+1 states to decide whether a string is in the 1n2n language.Know
Thanks a lot man I thought this was a great post and found it to be very helpful. It's good to get these definitions in writing.Dagnah
I personally don't think Wikipedia only does a good job of explaining regular languages if one has not studied the mathematical notation. Even when they explain some of the symbols and variables, they chuck a few extras in and do not explain them. The article might more readable to those who do not need it, but to the rest of us it needs clarification.Meill
well, I think maybe, the Wikipedia article must be flagged as too much technical. because many refer to it while what people like me need is the wonderful explanation as yours!Selfinduced
@phihag:That's a good intuitive answer.Could you explain a little bit more about 'constant amount of memory'.Is there any other language that provides infinite memory?I'm thinking that's there isn't probably any language that provides infinite memory.Nonillion
I'm always confused by why $1^n2^n$ is not regular. Sure it can be expressed with finite memory, if n is a finite number (i.e. not infinity). Is it that for an expression to be regular, it has to be able to be represented with no memory?Interdenominational
@Interdenominational 12, 1122, 111222 etc. are all regular. When we write 1^n2^n, we mean the language consisting of all of these. n is only determined when matching a specific string, not before.Thalassography
Is 00 a valid MI6 agent number? If not, you should probably change the regular expression 00[0-9]* to 00[0-9]+, 00[0-9][0-9]* or 00(0|1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*, depending on how formal you want to be.Bakemeier
@Bakemeier I like to imagine that as a bureaucracy, MI6 just associates everything starting with 00 with the 00 program, so 00 is a valid MI6 agent number.Thalassography
@Thalassography If we use the pumping lemma to prove that a language is not regular, then does that mean that the pumping lemma defines a regular language?Gaul
@P.Soutzikevich No, in the same way a scale can be used to decide whether a vehicle counts as car or truck (>12 tonnes), but the scale does not define the car. There are a number of other ways to prove a language is not regular. You are correct in that the pumping lemma for regular languages is very closely related to the definition of a regular language. It just doesn't define a regular language.Thalassography
Given this definition, why is Python not considered a regular language?Warbeck
@ap In an expression, the number of closing parens must match the opening ones. ((( 1))) is valid Python code, but (1)) isn't. Therefore (and for a number of other reasons), the grammar of the Python programming language does not form a regular language.Thalassography
S
5

Here are some of the equivalent definitions from Wikipedia:

[...] a regular language is a formal language (i.e., a possibly infinite set of finite sequences of symbols from a finite alphabet) that satisfies the following equivalent properties:

  • it can be accepted by a deterministic finite state machine.

  • it can be accepted by a nondeterministic finite state machine

  • it can be described by a formal regular expression.

    Note that the "regular expression" features provided with many programming languages are augmented with features that make them capable of recognizing languages which are not regular, and are therefore not strictly equivalent to formal regular expressions.

The first thing to note is that a regular language is a formal language, with some restrictions. A formal language is essentially a (possibly infinite) collection of strings. For example, the formal language Java is the collection of all possible Java files, which is a subset of the collection of all possible text files.

One of the most important characteristics is that unlike the context-free languages, a regular language does not support arbitrary nesting/recursion, but you do have arbitrary repetition.

A language always has an underlying alphabet which is the set of allowed symbols. For example, the alphabet of a programming language would usually either be ASCII or Unicode, but in formal language theory it's also fine to talk about languages over other alphabets, for example the binary alphabet where the only allowed characters are 0 and 1.

In my university, we were taught some formal language theory in the Compilers class, but this is probably different between different schools.

Sophiasophie answered 16/7, 2011 at 15:17 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.