My Conlang #13, tentatively called {säb zjed'a}, is primarily interesting for the lexicon development methodology I'm using for it. I started by generating a set of phonologically redundant words using Perl scripts I had written for that purpose, assigned meanings to a few dozen of them to get a small starter lexicon, and began writing sentences in the language. At first I had only a vague notion of its grammar (VSO, prepositional, ergative, mostly isolating, but with semantic category and part of speech marking). As sources of ideas for the seed lexicon I used frequency analyses of my Toki Pona corpus and my gjâ-zym-byn corpus, the list of "semantic primes" from the Wikipedia article on "Natural semantic metalanguage", and Rick Harrison's Universal Language Dictionary. Any concept I could readily figure out how to represent with a phrase rather than a root word, I represented that way (there are no compounds in this language, at least in phase 1). I make more radical use of opposite-derivation than Esperanto, for instance, except that the "mal-" morpheme is a preposition rather than a prefix, forming phrases like "opposite-of hot" = "cold".
So I went on writing sample sentences, coining new words as needed when I couldn't figure out a way to express ideas with existing words. Many of these sentences came from the Conlang Test Sentences.
My plan is to go on translating the Conlang Test Sentences, and write at least one example sentence for most or all of the words in the lexicon; then translate a couple of longer texts (the Tower of Babel story, for instance) — still coining new words only when I can't figure out how to make do with existing ones.
When I've got a corpus of a few thousand words (or perhaps sooner), I'll do a frequency analysis on it — not just of the relative frequency of words, but of two- and three-word sequences. I may also use my experience with the language so far to rule out some consonant clusters that I thought feasible at first but have proven too difficult to pronounce consistently; I'd do that by modifying the phonology format files and regenerating the list of root words.
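The planned analysis — word frequencies plus two- and three-word sequences — can be sketched roughly as follows. This is a hypothetical Python re-sketch, not the author's actual tooling (which is in Perl); the tokenization and sample corpus are invented for illustration.

```python
# Sketch of a frequency analysis over words and two-/three-word sequences.
# Tokenization by whitespace and the toy corpus are assumptions for the example.
from collections import Counter

def ngram_frequencies(tokens, max_n=3):
    """Return a dict mapping n -> Counter of n-gram frequencies."""
    freqs = {}
    for n in range(1, max_n + 1):
        grams = zip(*(tokens[i:] for i in range(n)))
        freqs[n] = Counter(" ".join(g) for g in grams)
    return freqs

tokens = "mi pona mi pona tan ni mi pona".split()
freqs = ngram_frequencies(tokens)
print(freqs[1].most_common(2))  # most frequent single words
print(freqs[2].most_common(1))  # most frequent two-word sequence
```

The resulting counts would feed directly into decisions about which frequent phrases deserve their own root words in the next relex.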
Then I'll relex the language, applying the following rules:
Then I'll convert the corpus to the new lexicon, spend some time familiarizing myself with the relex, and write and translate more text. When the corpus gets larger and more representative, I'll do another frequency analysis and relex again.
Obviously I'm not going to make a serious attempt to become fluent in this language until it's gone through several such relexes (each probably less drastic than the last).
Based on advice from David J. Peterson, I'm counting the lengths of words with a weighting algorithm that assigns a weight to each kind of phoneme in the word. Vowels count the most toward a word's length, then nasals, then liquids, then fricatives, and least of all plosives; but the exact weights assigned are still uncertain. The draft script I used to sort the generated words by length, before assigning meanings to them in the seed lexicon, used these weights:
my $plosive_weight   = 0.1;
my $fricative_weight = 0.25;
my $liquid_weight    = 0.375;
my $nasal_weight     = 0.5;
my $vowel_weight     = 1;
But the weighting will probably change, for a less drastic difference between nasals and liquids, for instance.
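To make the weighting concrete, here is a minimal Python sketch of the same idea (the author's actual script is Perl). The phoneme-to-class table below is an assumption for the example, not the real inventory of säb zjed'a.

```python
# Weighted word length: vowels dominate, plosives count least.
# The weights match the Perl draft above; the phoneme classes are illustrative.
WEIGHTS = {"plosive": 0.1, "fricative": 0.25, "liquid": 0.375,
           "nasal": 0.5, "vowel": 1.0}

PHONEME_CLASS = {  # assumed mini-inventory for the example
    "p": "plosive", "t": "plosive", "b": "plosive", "d": "plosive", "k": "plosive",
    "s": "fricative", "z": "fricative", "f": "fricative",
    "l": "liquid", "r": "liquid",
    "m": "nasal", "n": "nasal",
    "a": "vowel", "e": "vowel", "i": "vowel", "o": "vowel", "u": "vowel",
}

def weighted_length(word):
    """Sum the class weight of each phoneme in the word."""
    return sum(WEIGHTS[PHONEME_CLASS[p]] for p in word)

# "pat" and "ban" have the same letter count but different weighted lengths:
# pat = 0.1 + 1 + 0.1 = 1.2;  ban = 0.1 + 1 + 0.5 = 1.6
words = ["sila", "ban", "pat"]
words.sort(key=weighted_length)  # shortest (lightest) words first
```

Sorting generated words by this measure lets the lightest words be assigned to the meanings expected to be most frequent.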
Ensuring that the corpus is representative is going to be tricky. The first few hundred words of it are made up of sample sentences that are probably not very representative of continuous text in any language. I could try to duplicate the Brown Corpus in miniature, i.e. have texts of the same genres in roughly similar proportions; but that seems like a daunting task. Maybe I should simply exclude the grammar example sentences from the corpus analysis, once I have enough connected text (narratives, articles, etc.)?
I'm not sure what the criterion should be for a phrase occurring "often enough" in the corpus to deserve its own root word. My rule of thumb is that if a phrase accounts for more than 0.2% of the corpus it probably deserves its own word, but after doing an actual frequency analysis I may set the threshold lower than that.
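The 0.2% rule of thumb reduces to a simple count cutoff, sketched here in Python; the function name and the example counts are invented for illustration.

```python
# The 0.2% rule of thumb: a phrase whose share of the corpus exceeds
# the threshold is a candidate for its own root word.
THRESHOLD = 0.002  # 0.2%, per the rule of thumb; may be lowered later

def deserves_root_word(phrase_count, corpus_size, threshold=THRESHOLD):
    """True if the phrase occurs often enough to warrant a root word."""
    return phrase_count / corpus_size > threshold

# In a 5000-word corpus the cutoff works out to 10 occurrences:
deserves_root_word(11, 5000)  # True  (0.22% of the corpus)
deserves_root_word(9, 5000)   # False (0.18% of the corpus)
```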
As the lexicon is remade for each phase of the language's development, the grammar may change as well. So I'll document the grammar and phonology in separate documents for each phase.
Last updated July 2009