Phonotactic probability
Phonotactic probability is the likelihood of observing a sequence of sounds in a given language. It is typically calculated either as the co-occurrence probability of a series of phones or diphones, or as the cumulative transitional probability of moving from one portion of the sequence to the next.
This package currently provides the co-occurrence method. The calculation can either take the position of each phone or diphone within the word into account, or ignore position and use the overall co-occurrence probability alone.
Examples
using LexicalCharacteristics
sample_corpus = [
["K", "AE1", "T"], # cat
["K", "AA1", "B"], # cob
["B", "AE1", "T"], # bat
["T", "AE1", "T", "S"], # tats
["M", "AA1", "R", "K"], # mark
["K", "AE1", "B"], # cab
]
freq = [1,1,1,1,1,1]
p = prod([4, 4, 4] / 20) # expected value by hand: 3 phones, each 4 of the 20 phone tokens
phnprb(sample_corpus, freq, [["K", "AE1", "T"]])
1 rows × 2 columns
| | Query | Probability |
|---|---|---|
| | Array… | Any |
| 1 | ["K", "AE1", "T"] | 0.008 |
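In this example, each of the three phones in the query occurs 4 times in the corpus, and the corpus contains 20 phones in total, so the likelihood of observing each phone is 4/20. The phonotactic probability of the sequence is therefore $\frac{4}{20} \times \frac{4}{20} \times \frac{4}{20} = \left(\frac{4}{20}\right)^3$, which is 0.008. Because the calculation uses floating-point arithmetic, the computed value can differ from the exact result by a small rounding error; this is a normal consequence of floating-point arithmetic and not a bug.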
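The arithmetic in this example can be checked by hand without the package. The following is a minimal sketch, not the package's implementation: `phone_counts` is a hypothetical helper, and it assumes that each word's frequency is used directly as a count weight (with every frequency equal to 1 this reduces to plain counting).

```julia
# Hand-rolled check of the non-positional, single-phone calculation.
sample_corpus = [
    ["K", "AE1", "T"],       # cat
    ["K", "AA1", "B"],       # cob
    ["B", "AE1", "T"],       # bat
    ["T", "AE1", "T", "S"],  # tats
    ["M", "AA1", "R", "K"],  # mark
    ["K", "AE1", "B"],       # cab
]
freq = [1, 1, 1, 1, 1, 1]

# Hypothetical helper (not part of LexicalCharacteristics):
# count every phone token, weighted by the frequency of its word (an assumption).
function phone_counts(corpus, frequencies)
    counts = Dict{String,Int}()
    for (word, f) in zip(corpus, frequencies)
        for phone in word
            counts[phone] = get(counts, phone, 0) + f
        end
    end
    return counts
end

counts = phone_counts(sample_corpus, freq)
total = sum(values(counts))  # total number of phone tokens in the corpus

query = ["K", "AE1", "T"]
# Co-occurrence probability: product of each phone's relative frequency.
p = prod(counts[phone] / total for phone in query)
println(p)
```

Comparing `p` with the package output is best done with `isapprox(p, 0.008)` rather than `==`, since the floating-point product may differ from 0.008 in its last digits.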
using LexicalCharacteristics
sample_corpus = [
["K", "AE1", "T"], # cat
["K", "AA1", "B"], # cob
["B", "AE1", "T"], # bat
["T", "AE1", "T", "S"], # tats
["M", "AA1", "R", "K"], # mark
["K", "AE1", "B"], # cab
]
freq = [1,1,1,1,1,1]
p = prod([3, 2, 3, 2] / 26) # expected value by hand: 4 diphone counts out of 26 padded diphones
phnprb(sample_corpus, freq, [["K", "AE1", "T"]]; nchar=2)
1 rows × 2 columns
| | Query | Probability |
|---|---|---|
| | Array… | Any |
| 1 | ["K", "AE1", "T"] | 7.87788e-5 |
In this example, the input is padded so that the beginning and end of the word are taken into account when calculating the phonotactic probability. There are 3 counts of [. K] (where [.] is the word boundary symbol), 2 counts of [K AE1], 3 counts of [AE1 T], and 2 counts of [T .]. There are 26 total diphones observed in the padded corpus, so the phonotactic probability is calculated as $\frac{3}{26} \times \frac{2}{26} \times \frac{3}{26} \times \frac{2}{26} = \frac{36}{26^4} \approx 7.87788 \times 10^{-5}$.
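The padded diphone counts described here can be reproduced with a short standalone sketch. It is an illustration under stated assumptions, not the package's code: `padded_ngrams` is a hypothetical helper, the boundary symbol is taken to be `"."` as in the text, and padding with `n - 1` boundary symbols on each side is an assumption that matches the diphone case (`n = 2`) shown above.

```julia
# Hand-rolled check of the padded diphone calculation.
sample_corpus = [
    ["K", "AE1", "T"],       # cat
    ["K", "AA1", "B"],       # cob
    ["B", "AE1", "T"],       # bat
    ["T", "AE1", "T", "S"],  # tats
    ["M", "AA1", "R", "K"],  # mark
    ["K", "AE1", "B"],       # cab
]
freq = [1, 1, 1, 1, 1, 1]

# Hypothetical helper: pad a word with boundary symbols and list its n-grams.
function padded_ngrams(word, n; boundary=".")
    padded = vcat(fill(boundary, n - 1), word, fill(boundary, n - 1))
    return [padded[i:i+n-1] for i in 1:length(padded)-n+1]
end

# Count every padded n-gram, weighted by word frequency (an assumption).
function ngram_counts(corpus, frequencies, n)
    counts = Dict{Vector{String},Int}()
    for (word, f) in zip(corpus, frequencies)
        for gram in padded_ngrams(word, n)
            counts[gram] = get(counts, gram, 0) + f
        end
    end
    return counts
end

counts = ngram_counts(sample_corpus, freq, 2)
total = sum(values(counts))  # total number of padded diphones in the corpus

query = ["K", "AE1", "T"]
# Product of the relative frequencies of [. K], [K AE1], [AE1 T], and [T .].
p = prod(counts[gram] / total for gram in padded_ngrams(query, 2))
println(p)
```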
Function documentation
LexicalCharacteristics.phnprb - Method
phnprb(corpus::Array, frequencies::Array, queries::Array; positional=false, nchar=1, pad=true)
Calculates the phonotactic probability for each item in a list of queries based on a corpus.
Arguments
- corpus: The corpus on which to base the probability calculations
- frequencies: The frequencies associated with each element in corpus
- queries: The items for which the probability should be calculated
Keyword arguments
- positional: Whether to consider where in the query a given phone appears (e.g., should "K" as the first sound be considered a different category than "K" as the second sound?)
- nchar: The number of characters (phones, in the examples above) in each n-gram that will be examined (e.g., 2 for diphones)
- pad: Whether to add word-boundary padding to each query or not
Returns
A DataFrame with the queries in the first column and the probability values in the second
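The keyword arguments can be combined in a single call. The sketch below only uses the signature documented above; it assumes the LexicalCharacteristics package is installed, the variable names are illustrative, and no output is shown because the resulting values depend on how the package handles each option.

```julia
using LexicalCharacteristics

sample_corpus = [
    ["K", "AE1", "T"],       # cat
    ["K", "AA1", "B"],       # cob
    ["B", "AE1", "T"],       # bat
    ["T", "AE1", "T", "S"],  # tats
    ["M", "AA1", "R", "K"],  # mark
    ["K", "AE1", "B"],       # cab
]
freq = [1, 1, 1, 1, 1, 1]
queries = [["K", "AE1", "T"], ["B", "AE1", "T"]]

# Single phones, treating the same phone in different positions as different categories.
positional_result = phnprb(sample_corpus, freq, queries; positional=true)

# Diphones, without adding word-boundary padding.
unpadded_result = phnprb(sample_corpus, freq, queries; nchar=2, pad=false)
```

Each call returns a DataFrame with one row per query, so results for several queries can be compared side by side.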