Phonotactic probability

Phonotactic probability is the likelihood of observing a given sequence of sounds in a language. It is typically calculated either as the co-occurrence probability of a series of phones or diphones, or as the cumulative transitional probability of moving from one portion of the sequence to the next.
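As a rough sketch of the difference between the two approaches, consider a sequence of units $x_1, \dots, x_n$ (phones or diphones). The notation here is introduced only for illustration and is not taken from the package: $c(x)$ stands for the number of times $x$ occurs in the corpus and $N$ for the total number of units observed. One common way to write the co-occurrence probability is

\[P_{\text{co-occurrence}} = \prod_{i=1}^{n} \frac{c(x_i)}{N} \,,\]

whereas a transitional formulation multiplies conditional probabilities of each unit given the one before it:

\[P_{\text{transitional}} = \prod_{i=1}^{n-1} P(x_{i+1} \mid x_i) \,.\]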

This package currently provides the co-occurrence method of calculating the phonotactic probability. The calculation can either take the position of each phone or diphone into account or use the plain co-occurrence probability regardless of position.

Examples

using LexicalCharacteristics
sample_corpus = [
["K", "AE1", "T"], # cat
["K", "AA1", "B"], # cob
["B", "AE1", "T"], # bat
["T", "AE1", "T", "S"], # tats
["M", "AA1", "R", "K"], # mark
["K", "AE1", "B"], # cab
]
freq = [1, 1, 1, 1, 1, 1] # every word has a frequency of 1
p = prod([4, 4, 4] / 20) # value worked out by hand, for comparison: (4/20)^3
phnprb(sample_corpus, freq, [["K", "AE1", "T"]])

1 rows × 2 columns

	Query	Probability
	Array…	Any
1	["K", "AE1", "T"]	0.008

In this example, each phone in the query occurs 4 times in the corpus, which contains 20 phones in total, so the likelihood of observing each of those phones is 4/20. Because the query has 3 phones, the phonotactic probability of this sequence is $\left(\frac{4}{20}\right)^3$, which is 0.008. Because the calculation uses floating-point arithmetic, the returned value may differ from the exact result by a small rounding error.
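The counting behind this example can be sketched in plain Julia. The helper `unigram_probability` below is hypothetical and is not part of LexicalCharacteristics; it assumes that each word contributes its phones weighted by its frequency, which may not match how `phnprb` treats frequencies internally.

```julia
# Hypothetical helper for illustration; not part of LexicalCharacteristics.
# Assumption: each word contributes its phones weighted by its frequency.
function unigram_probability(corpus, freq, query)
    counts = Dict{String,Float64}()
    total = 0.0
    for (word, f) in zip(corpus, freq)
        for phone in word
            counts[phone] = get(counts, phone, 0.0) + f
            total += f
        end
    end
    # Product of relative frequencies; a phone never seen in the corpus gives 0.
    return prod(get(counts, phone, 0.0) / total for phone in query)
end

sample_corpus = [
    ["K", "AE1", "T"],       # cat
    ["K", "AA1", "B"],       # cob
    ["B", "AE1", "T"],       # bat
    ["T", "AE1", "T", "S"],  # tats
    ["M", "AA1", "R", "K"],  # mark
    ["K", "AE1", "B"],       # cab
]
freq = [1, 1, 1, 1, 1, 1]

p_unigram = unigram_probability(sample_corpus, freq, ["K", "AE1", "T"])
println(p_unigram)                   # close to 0.008, up to floating-point rounding
println(isapprox(p_unigram, 0.008))  # compare with a tolerance rather than ==
```

Comparing with `isapprox` rather than `==` is the usual way to check such values, since the product of floating-point numbers is not guaranteed to equal the decimal literal exactly.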

using LexicalCharacteristics
sample_corpus = [
["K", "AE1", "T"], # cat
["K", "AA1", "B"], # cob
["B", "AE1", "T"], # bat
["T", "AE1", "T", "S"], # tats
["M", "AA1", "R", "K"], # mark
["K", "AE1", "B"], # cab
]
freq = [1, 1, 1, 1, 1, 1] # every word has a frequency of 1
p = prod([3, 2, 3, 2] / 26) # value worked out by hand, for comparison
phnprb(sample_corpus, freq, [["K", "AE1", "T"]]; nchar=2)

1 rows × 2 columns

	Query	Probability
	Array…	Any
1	["K", "AE1", "T"]	7.87788e-5

In this example, the input is padded so that the beginning and end of the word are taken into account when calculating the phonotactic probability. With padding, the corpus contains 3 occurrences of [. K] (where [.] is the word boundary symbol), 2 of [K AE1], 3 of [AE1 T], and 2 of [T .]. The padded corpus contains 26 diphones in total, so the phonotactic probability is calculated as

\[\frac{3}{26} \times \frac{2}{26} \times \frac{3}{26} \times \frac{2}{26} \,.\]
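The diphone counts above can be reproduced with a short sketch in plain Julia. The helpers `pad_word`, `diphones`, and `diphone_probability` are hypothetical names introduced here, not functions from LexicalCharacteristics, and the use of "." as the boundary symbol simply follows the notation in the text; the package's internal padding symbol is not documented here.

```julia
# Hypothetical helpers for illustration; not part of LexicalCharacteristics.
# Assumption: "." marks the word boundary, as in the text above.
pad_word(word; boundary=".") = vcat([boundary], word, [boundary])

# All adjacent pairs in a sequence, as tuples so they can serve as Dict keys.
diphones(seq) = [(seq[i], seq[i + 1]) for i in 1:length(seq) - 1]

function diphone_probability(corpus, freq, query)
    counts = Dict{Tuple{String,String},Float64}()
    total = 0.0
    for (word, f) in zip(corpus, freq)
        for pair in diphones(pad_word(word))
            counts[pair] = get(counts, pair, 0.0) + f
            total += f
        end
    end
    # Product of the relative frequencies of the query's padded diphones.
    return prod(get(counts, pair, 0.0) / total for pair in diphones(pad_word(query)))
end

sample_corpus = [
    ["K", "AE1", "T"],       # cat
    ["K", "AA1", "B"],       # cob
    ["B", "AE1", "T"],       # bat
    ["T", "AE1", "T", "S"],  # tats
    ["M", "AA1", "R", "K"],  # mark
    ["K", "AE1", "B"],       # cab
]
freq = [1, 1, 1, 1, 1, 1]

p_diphone = diphone_probability(sample_corpus, freq, ["K", "AE1", "T"])
println(p_diphone)  # roughly 7.88e-5, i.e. (3 * 2 * 3 * 2) / 26^4
```

A word of n phones yields n + 1 padded diphones, which is why the six words here give 4 + 4 + 4 + 5 + 5 + 4 = 26 diphones in total.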

Function documentation

LexicalCharacteristics.phnprb — Method

phnprb(corpus::Array, frequencies::Array, queries::Array; positional=false,
    nchar=1, pad=true)

Calculates the phonotactic probability for each item in a list of queries, based on a corpus.

Arguments

corpus The corpus on which to base the probability calculations
frequencies The frequencies associated with each element in corpus
queries The items for which the probability should be calculated

Keyword arguments

positional Whether to consider where in the query a given phone appears (e.g., should "K" as the first sound be considered a different category than "K" as the second sound?)

nchar The number of characters for each n-gram that will be examined (e.g., 2 for diphones)
pad Whether to add padding to each query or not

Returns

A DataFrame with the queries in the first column and the probability values in the second
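To make the idea behind the positional keyword concrete, here is a minimal sketch of one possible reading of it in plain Julia. The helper `positional_probability` is hypothetical, and the choices it makes are assumptions: it treats each (position, phone) pair as its own category and divides by the total number of phones in the corpus. The package may count or normalize differently, so this sketch should not be taken as a description of what `phnprb` returns with `positional=true`.

```julia
# Hypothetical helper for illustration; not part of LexicalCharacteristics.
# Assumption: "K" in position 1 and "K" in position 4 are separate categories,
# and every category is normalized by the total number of phones in the corpus.
function positional_probability(corpus, freq, query)
    counts = Dict{Tuple{Int,String},Float64}()
    total = 0.0
    for (word, f) in zip(corpus, freq)
        for (pos, phone) in enumerate(word)
            key = (pos, phone)
            counts[key] = get(counts, key, 0.0) + f
            total += f
        end
    end
    return prod(get(counts, (pos, phone), 0.0) / total
                for (pos, phone) in enumerate(query))
end

sample_corpus = [
    ["K", "AE1", "T"],       # cat
    ["K", "AA1", "B"],       # cob
    ["B", "AE1", "T"],       # bat
    ["T", "AE1", "T", "S"],  # tats
    ["M", "AA1", "R", "K"],  # mark
    ["K", "AE1", "B"],       # cab
]
freq = [1, 1, 1, 1, 1, 1]

# "K" occurs 3 times in position 1, "AE1" 4 times in position 2,
# and "T" 3 times in position 3, out of 20 phones in total.
p_positional = positional_probability(sample_corpus, freq, ["K", "AE1", "T"])
println(p_positional)  # close to (3 * 4 * 3) / 20^3 under the assumptions above
```

Under this reading, the "K" at the end of "mark" no longer counts toward the word-initial "K" of the query, which is why the result is lower than the non-positional value of 0.008.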
