Strings, Alphabets, and Languages

Today, we embark on a journey through the foundational concepts of strings, alphabets, and languages, essential pillars in the domains of computer science and formal language theory as applied to programming.

Strings
The Alphabet
The Language
Language Classes

Strings

A string is essentially a finite sequence of symbols picked from a finite set, known as an alphabet. The alphabet is essentially the pool of symbols that strings can draw from.

The length of a string x is measured by the count of symbols it encompasses, denoted by |x|.

In their simplest form, strings are finite sequences of symbols from a predefined set or alphabet.

This seemingly straightforward definition opens up a world of applications, theoretical explorations, and technological advancements.

Example

Take, for instance, the word x="string", which consists of the symbols "s", "t", "r", "i", "n", "g", derived from the 26-letter English alphabet. This string is composed of 6 symbols.

$$ |x| = |"string"|= 6 $$

Similarly, the arithmetic expression "1+2-3" is a string formed by the symbols "1", "2", "3", "+", and "-". This string comprises 5 symbols.

$$ |x| = |"1+2-3"|= 5 $$

A string of zero length $ x = "" $ is known as an empty string and is often denoted by the symbol ε.

$$ x = "" = \epsilon $$

The empty string can be considered the additive identity because it does not change the length of a string when concatenated.

For example, if the string is x="dog", concatenating it with the empty string does not alter its content.

$$ x + \epsilon = "dog" + "" = "dog" $$

String concatenation is the process of adding one string to the end of another. For instance, if one string is x="abc" and another string is y="def", their concatenation is "abcdef".

$$ x + y = "abc" + "def" = "abcdef" $$

Sometimes, concatenation is represented using multiplicative notation because it omits the symbol and is more concise.

$$ xy = "abcdef" $$

A string x is a prefix of another string w if something can be added to x (i.e., concatenating a string y) to obtain w. For example, if w="abcdef", x="abc" and y="def", then x="abc" is a prefix of w="abcdef" because xy="abcdef" = w.

$$ w = xy = "abcdef" $$

A string x is called a proper prefix of w if x is a prefix of w but x itself is not equal to w. Therefore, "abc" is a proper prefix of "abcdef", but "abcdef" is not a proper prefix of "abcdef".

The reverse of a string x is denoted with the superscript R, $ x^R $.

$$ x = "hello" $$

$$ x^R = "olleh" $$

A substring is a sequence of consecutive symbols or characters (string) contained within another string. For example, in the string "programming"

$$ x="programming" $$

some of its substrings are "pro", "gram" and "ming".

$$ w_1 = "pro" $$

$$ w_2 = "gram" $$

$$ w_3 = "ming" $$

In general, any sequence of consecutive characters found within a longer string can be considered a substring of that string.

The power of a string $ x^n $ represents the concatenation of the string $ x $ with itself $ n $ times. For example, if the string $ x $ is "abc" and $ n $ is 3, then $ x^n $ will be "abcabcabc".

$$ x= "abc" $$

$$ x^3 = "abcabcabc" $$

In formal terms:

$ x^0 $ is the empty string.
$ x^1 $ is the string $ x $ itself.
$ x^n $ for $ n > 1 $ is the concatenation of $ x $ with $ x^{n-1} $.

Here's a practical example. If $ x = "hello" $ and $ n = 3 $, then $ x^3 = "hellohellohello" $.

$$ x^0 = \epsilon $$

$$ x^1 = "hello" $$

$$ x^2 = xx = "hellohello" $$

$$ x^3 = xxx = "hellohellohello" $$

The Alphabet

An alphabet is a finite collection Σ of symbols that serve as the building blocks for strings.

To qualify as an alphabet, a finite set Σ must adhere to a crucial principle: two finite sequences of elements in Σ are identical if, and only if, the elements in both sequences match exactly in their order.

This underscores the importance of both the sequence and identity of elements in the crafting of strings.

For example, the sets Σ={0, 1} and Σ={00, 01} are considered valid alphabets because each symbol sequence can be distinctly identified based on the order and identity of its symbols. Conversely, the set Σ={1, 11} does not meet the criteria for an alphabet since the sequence "11" could be ambiguously interpreted as either a single symbol "11" or as two "1" symbols repeated.

The Language

A language is a set of strings chosen based on specific criteria or rules.

From the myriad of possible strings that can be created using an alphabet Σ, we define a language as a particular subset of these strings.

Formal language theory delves into these collections of strings and the rules that dictate their membership in a specific language.

A straightforward example of a formal language might use the alphabet $\Sigma = \{a, b, c\}$ to define a language consisting of all strings that start with "a" and end with "c", allowing any combination of alphabet letters in between, including none. This language could be represented as:

$$ L = \{a(bc)^*c\} $$

Where:

$ a $ and $ c $ are fixed symbols, marking the start and end of each string in the language, respectively.
$ (bc)^* $ signifies the "bc" sequence may repeat any number of times, including not at all.

Examples of strings within the language:

- "ac" (where the "bc" sequence is absent, which is permissible because the "*" denotes zero or more repetitions)
- "abcc" (with the "bc" sequence occurring once)
- "abcbcc" (where "bc" is repeated twice)
- "abcbcbcc" (with "bc" appearing three times)

These strings comply with the guideline of beginning with "a" and concluding with "c", with the option for the "bc" sequence to occur any number of times in between.

Examples of strings not included in the language:

These strings fall outside the defined language because they do not adhere to the rule of starting with "a" and ending with "c", with the "bc" sequence optionally repeating in between.

- "a"
- "abc"
- "bac"
- "cbc"
- "accb"

This illustrates how formal languages with precise structural rules for strings can be defined based on specific alphabets.

In the set of strings that characterize a language, the same lexicographic order used in dictionaries is commonly applied.

For instance, "apple" precedes "banana" because 'a' comes before 'b' in the alphabet.

In some cases, the shortlex order is used, which is a variation of the lexicographic order where shorter strings come before longer ones.

For example, in the modified shortlex order for the alphabet {0,1}, the sequence of strings would be: the empty string (ε), "0", "1", "00", "01", "10", "11", "000", and so on. Thus, the strings are ordered first by length and then lexicographically.

A language is termed prefix-free if no string in the set is a proper prefix of another string in the set.

This means that no two strings exist where one is the beginning (prefix) of the other.

For example, in the set {"100", "101", "11"}, no string is a proper prefix of another, making it prefix-free.

Conversely, the set {"10", "101", "11"} is not prefix-free because "10" is a proper prefix of "101".

Language Classes

A language class denotes a group of languages that share common characteristics or can be generated or recognized by a certain computational mechanism.

Grouping languages into classes aids in understanding the strengths and limitations of various computational models and the grammars that produce them.

Example. Building on the previous example, we can conceive a language class that encompasses variations of the language using the alphabet $\Sigma = \{a, b, c\}$. This class could be defined as a collection of languages $L_n$ where each language is comprised of strings that start with "a", end with "c", and include a specific sequence repeated $n$ times, with $n$ being any natural number, including zero.

Examples of languages within this class:

Here are a few conceivable languages:

$L_0$ could be the language consisting solely of the string "ac", without any repetition between "a" and "c". Thus, L₀ would only contain the string "ac".
$L_1$ might include strings that begin with "a", are followed by a "b" or "c", and conclude with "c". For example, L1 could encompass strings such as "abc", "acc".
$L_2$ could contain strings starting with "a", followed by two possible combinations of "b" and "c" in between, and ending with "c". For instance, L₂ could include strings like "abbc", "abcc", "acbc", "accc".

This class of languages illustrates how a variety of formal languages can be constructed from a shared foundational principle, with variations in a specific aspect.

In summary, the concept of a string, while simple in its essence, plays an integral role in computer science and the theory of information.

This fundamental concept gives rise to more intricate notions such as formal languages and language classes, facilitating an exploration of the structure, generation, and recognition of symbolic information.