Radio Inspire

The Lost Language Recovery Trick – counting an undeciphered script


A lone mysterious text teases your curiosity.
You’re in a dark room, let’s say in a museum basement. No, even better, somewhere
more adventurous: a cave, yeah. Your excitement fades to baffled wonder as your brain starts
to hum with the tedium of cracking this text, a tedium that calls not for the brazen adventuring
of an Indiana Jones but the careful testing of a Turing or Champollion. You don’t know
the language of this text. You don’t even know what words it encodes. You can’t even
start to pronounce its symbols. You’re not even sure this is writing. At this moment of confusion, if your ear is
free, fellow language traveler and budding decipherer, let me hold out my hand and offer
a morsel of hope. What if I told you that you can know what these symbols are without
ever needing to know how to read them? And what if I also told you that all you needed
to do this was to know how to count?

First, pluck out the basic writing types you
met in Thoth’s Pill. What, you haven’t seen that? No way! Come on, let’s go watch
it first and then come back here and pluck out the basic types of writing you met
in Thoth’s Pill: consonant and vowel symbols, syllable symbols and word symbols.

Now let’s start counting. Take a string in English (just a bit of written
text) and count the number of distinct graphemes you find in it. Ok, but longer is better here. More chances
for all the signs in the writing system to show up, including those less common ones:
your q’s, your x’s, your z’s. So let’s count the signs in all of Shakespeare’s
Sonnets. I’m counting everything but X. So keep taking bigger and bigger
samples in English, and you’ll end up counting the 26 letters of our alphabet. Do the same for Russian, and I hear you’ll
count 33. Try it with Hawai’ian, and you’ll only find 13. Now reach for Japanese texts written entirely
in syllables. You’ll count 46 hiragana. Cherokee has 85 syllables. Wow, now here’s
a jump up to Hittite with over 500 symbols, Maya with more than 800 and Chinese with thousands
upon thousands. So we end up with alphabets on one end, syllabaries
in the middle and logographic or logophonetic scripts on the far side.

But now you’re facing a new script. Totally
unknown to you. And you count the symbols. It has hundreds. What kind of script do you
think it is? What if it has 20? What if it has 50 or 60?

We’re using this simple algorithm to make
an educated guess about an unknown script. This thing’s referenced conceptually in papers
about decipherment, but I don’t know what it’s called. Robinson points us back to
Archibald Sayce, a man who loved himself some Assyrian. So do we call it the Sayce test?
The Archibald heuristic? Pulling a Sayce? Yeah, pulling a Sayce.

Pulling a Sayce does
come with challenges and drawbacks, too. First, there’s the problem that you still
can’t read the text. True, but before you take all of the wind out of my sails, please
celebrate what this handy tool can do for you! Second, scripts can emphasize sounds or meanings, but
they’re not really purely phonetic or purely logographic. I mean, think about the mix of
letters, numbers, punctuation and even ideographs you find all around you, and that’s just in
English.

Perhaps the biggest problem is how to identify
distinct symbols. If you were trying to decipher the Latin alphabet, could you tell that all
of these are the “same” symbol? How? What about capital letters and lowercase letters?
Are accent marks distinct letters? How do we count ligatures, linked, scripted together,
mashed-together characters? Or the component consonant-vowel pieces of a single Indic syllable?
Or the sound plus meaning pieces inside of a single Chinese character? We have explanations for these, and maybe
they’re good enough for the scripts we already know, but what about trying to read an undeciphered
script like Rongorongo? Are these strange people-looking things two flipped variants
of a single sign or are they two different signs?

True, this little algorithm doesn’t pull
us out of the dark. But it gives us a starting point. We can identify script types, we can
list symbols, we can then use those lists to identify the symbols when they show up
in other texts. With a smile on our face. Because, even though this script is still
unknown to us, we’ve devised a clever way to reach a strange familiarity with it, a
big step on our way to decipherment.

I took a little time to play around with and
put together some simple code that runs this test on any piece of text you give it. It
counts the number of different symbols that are in that text, then uses those cutoff
numbers from earlier to guess what kind of writing system this is.

So, let’s try it for English. The English
alphabet. Here are the 26 English letters. We’ll pass it down through here and run this thing…
it tells us that English is an alphabet! Who’d have thought? Same thing but with some random sentences.
Still an alphabet. Let’s go for Hiragana down here. Run it.
And… it’s syllables! What about Hawaiian? If you too have ever
shared the misfortune of leaving Hawai’i, you may have heard this. And it’s in an
alphabet. Then there’s Kanji. Oh, yes. A whole list
of them. And they’re word symbols.

Well, that’s super nerdy, but thanks for
letting me play with language a bit here. It’s kind of nice doing a more off the cuff
video like this. That’s a thing, right? Off the cuff? It sounded wrong for some reason.
Well, leave comments and subscribe, and encourage me to come back again! Ok, bye.
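
The counting test described in this transcript can be sketched in a few lines of Python. This is a minimal sketch, not the code shown in the video: the function names `distinct_symbols` and `guess_script_type` are made up for illustration, the cutoff of 40 follows the 10 to 40 range for alphabets that the replies attribute to the video, and the cutoff of 100 is an assumption.

```python
# Hypothetical cutoffs (assumptions, not values confirmed by the transcript).
ALPHABET_MAX = 40     # up to this many distinct signs: guess "alphabet"
SYLLABARY_MAX = 100   # up to this many: guess "syllabary"; more: "logographic"


def distinct_symbols(text):
    """Return the set of distinct non-whitespace characters in text."""
    return {ch for ch in text if not ch.isspace()}


def guess_script_type(text):
    """Guess the writing-system type from the number of distinct symbols."""
    count = len(distinct_symbols(text))
    if count <= ALPHABET_MAX:
        return "alphabet"
    if count <= SYLLABARY_MAX:
        return "syllabary"
    return "logographic"


english = "abcdefghijklmnopqrstuvwxyz"
# The 46 basic hiragana mentioned in the transcript.
hiragana = "あいうえおかきくけこさしすせそたちつてとなにぬねのはひふへほまみむめもやゆよらりるれろわをん"
# 200 consecutive code points from the CJK Unified Ideographs block.
kanji = "".join(chr(cp) for cp in range(0x4E00, 0x4E00 + 200))

for name, sample in [("English", english), ("Hiragana", hiragana), ("Kanji", kanji)]:
    print(name, len(distinct_symbols(sample)), guess_script_type(sample))
```

With these cutoffs the three samples come out as alphabet, syllabary and logographic, matching the results reported in the video. The guess is only as good as the sample: a short text may simply not contain every sign yet, so the count is a lower bound.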

78 Replies to “The Lost Language Recovery Trick – counting an undeciphered script”

  • Thai is an anomaly in terms of alphabets. It has 76 letters, including 4 tone markers, which is more than the Japanese syllabary.

  • When I 1st heard of Rongo-Rongo, I was sad to find out it has yet to be unlocked. As far as I know, linguists think it is more than likely writing instead of proto-writing, but it could have some mnemonic features. I'm not even sure if they have compared it with the Rapa Nui language or, for that matter, any other Polynesian language. I hope one day it is unlocked so that maybe the Rapa Nui can gain more of their native culture back. Same with the various Andean peoples and the Quipu knot system, which could be a form of storing more than just numbers but maybe even some small words, as I've read some archaeologists think it does.

  • Finally! Thank you for restarting your regular videos, I'm glad that comp-chomp thing is over. The first few videos of comp-chomp were interesting, but after a while it got pretty repetitive and boring

  • Hey! Just sent you an email under the subject line ‘Optica Entertainment & Creative Nation Inquiry’ if that helps you find it. Hoping to hear back from you! 🙂

  • I really like your videos, very fun to watch! Could i suggest making a short video on comparing the brahmic scripts, there are so many of them and they all look the same, for me it's near impossible to recognize them when I encounter one

  • haha the second i saw the Rongorongo i was all like i know why no one had deciphered it yet……. because every time someone does their life ends up becoming a call of cthulhu movie and they are never heard from again hehe.
    What do u render with if i may ask? and loved the vid btw

  • Loved the uncle/ankle bit a few episodes back so when i saw this in this vid i was like lawl
    https://drive.google.com/open?id=0ByE_3TIlr1zfckVIbXlBcVF1QTQ
    translation: bird uncle from above lol jk XD Your series inspired me to make some Enter cuneiform gifs
    https://drive.google.com/open?id=0ByE_3TIlr1zfMWRlTVdkMkZ2bDA
    The Rongorongo tablet i was looking at on google images seems to be the other side of the black and gray one in the vid.
    The one u showed in the vid was not that creepy hehe

  • This is incredible! I've always wondered how lost scripts are deciphered!

    Love your channel and these videos! You're currently my main source of procrastination at work!

  • Very interesting! A few of your Thoth's pill and language videos have actually inspired me to create a system of my own. It's simply a cross between an alphasyllabary and an alphabet. You write the consonant (21) and the vowel mark (6) above or below but if the consonant or the vowel is repeated, you write the base form of the consonant and vowel! Your videos have inspired me to do this much and I can't wait to see even more!

    Verelle sen! (Thank you!)

  • Good series but really disappointing. It took 3 videos to arrive at almost nothing. You could have explained this much better, faster and included way more stuff to explain it in great detail. Also include more examples.

  • English has 26 letters, but has more than 26 symbols. The capital and lower-case letters would make 52, adding the digits 0-9 would be 62 symbols. Punctuation symbols would take that number over 70, depending on what is counted: something like "?" or "!" could reasonably be confused as a letter, while "." and "," are too insignificant to be letters in their own right (probably.) Also, a math text would have more symbols than a newspaper, including such things as "+" "%" and the like.

  • As I was watching this it occurred to me that if a script falls into the category of an alphabet, some of the characters in the script may be not entirely obvious word dividers. For example, in English and most other modern languages we currently use blank spaces to divide words, but in Runic scripts dots were used as word separators. Is there a way (perhaps linked to the Zipf distribution of the script and of similar scripts) to estimate what the average character length of words is? If we can do this we may be able to figure out what characters are candidates for word dividers.

    (I know the terms I use aren't quite right but I'm a programmer so I think of characters not graphemes)

  • Please do more on this! I keep seeing all these videos that say "come watch this next video!" And I can't find it. Please complete your series!

  • Interesting .. informative
    At 2:28 it is said that alphabets have between 10 and 40 letters. Please correct this: some go beyond 50. For instance, Sindhi, which is spoken in Sindh province, Pakistan, and Kach in India, has 52 letters.

  • I would guess that each symbol (3:55) is a word, it has too much detail to write long texts with each one being a single letter or a syllable

  • I suppose the next step is to try and identify groups of glyphs that are recurring and then find hints on grammatical structure, word separation etc. If educated guesses can be made about the language and sufficiently many sentence examples (in that language) and also inscriptions (in the undeciphered script) then there are brute force methods of looking at glyph histograms and trying to match them up in ways consistent with predicted glyph frequencies… etc… etc…

  • When I was a kid I created my own alphabet or whatever it was no not a code i.e. not letter substitution. I wonder what it would say about it but I forget where I put it 🙁 I was getting pretty good at writing it fluently.

  • In cryptography, the idea of counting the symbols is known as "frequency analysis". I don't know what linguists call it but it would make sense to use the same term.

  • For linguistics stuff I recommend coding in lua; it deals very well with unicode strings and has a lot of useful functions that deal with strings.

  • You know, the Latin and Cyrillic scripts are pretty unique for having so many phonetically identical versions of the same graphemes.

    I mean, yes, in Arabic, you can have up to four versions of each letter, but that's based on the technique of writing it and that's a technique not being used on this disk.

    I'm just noting that we may have a better idea of how many symbols there are in this text than we think we do.

    Oh, and YAY! The program works!

  • here is super minimalistic version of your script

    input file location, and it does your conclusion

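    # read the file as text and count its distinct characters
    with open(input("file location: "), encoding="utf-8") as file:
        c = len(set(file.read()))
    # classify by the number of distinct characters
    if c < 50: print("alphabetic!")
    elif c < 100: print("syllabary!")
    else: print("logographical!")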

  • Can you do another Decipherment Club video me and some friends are working on creating a language and I'm leading out so I'm trying to decide which form of writing will be most convenient for us

  • UTF-8 symbols have variable byte length. I doubt the script will work correctly for alphabets like Cyrillic or Inuktitut.

  • Wait, so… you said yourself that a complete stranger to English wouldn't have any way of knowing that uppercase and lowercase letters aren't different symbols. So, if you fed English as it's written into the algorithm, it would detect at least 52 symbols, and that's not even counting punctuation marks. Who's to say "?" isn't a letter?

  • will there be more of these!! I'm so intrigued!! I want to know what the creepy beast looking glyphs mean

  • Ok. So, how would this work with real Japanese, which uses 4 systems of writing? Think of it. Kanji can be used in multiple ways, that some kana do double, triple, even quadruple duty, or serve as auxiliaries to kanji, romaji which may or may not spell European words or japanized words from god knows what language.

  • I couldn't help but get intrigued… Run your script on some grade 2 braille. For that matter, what exactly DO you call grade 2 UEB?? Pseudo-syllabic?

  • i'm a linguist who also programs, it would be great to see more exploration of language through programming

  • I am full of questions. So they count and identify the characters, then how do they know what it says? If it is an educated guess, then do they just guess about the story the scripts are telling? Who decides the decipherment is accurate? I am left with more questions! 🙂

  • It is certain that the English tradition says Tolkien invented The Hobbit and Lord of the Rings as leisure reading.

    How likely would the alternative be, him finding a book in Adunaic and in tengwar and deciphering all that?

  • 1:05 Leaving out tengwar here, what do you think of recurring 32 symbols, which Genevieve von Petzinger is investigating?

    I think I mentioned them before, and before viewing this video I thought "32 symbols? could be alphabetic"

    If you think any "text" she found is too short, how about alphabetic used as mnemotechnics?

    Adam, Seth, Enos, Cainan, Malaleel, Jared, Henoch, Mathusala, Lamech, Noah,
    abbreviated as
    Aleph, Shin, Aleph, Kaph, Mem, Iod, He (one of them!), Mem, Lam, Nun

    Or more likely sth like

    (Noah), Japheth, Gomer, Ascenez
    abbreviated as
    (Nun), Iod, Gimel, Aleph

    or

    (Noah), Japheth, Javan, Tharsis
    abbreviated as
    (Nun), Iod, Iod, Thet (or Tau?)

    etc.

    It could be from when Noah was predividing Earth between the peoples, and he could have said division should become in force or law at the birth of Phaleg / Peleg.

    Which, being flouted, was instead followed by a Babel project leading up to another division. But the Babel project leads us already into Neolithic, since now known as Göbekli Tepe.

  • You might also have to look at line endings to determine which direction the writing is in, or even to determine whether or not the direction is consistently left or right.

  • How about hieroglyphics and similar writing systems? It is basically an abjad but with a vast amount of determinatives. Is there any possibility of separating the letters from the determinatives based on the frequency they appear or places they appear in a text? Or is the near impossibility of this exactly the reason why it took the Rosetta stone (and her little brother from Philae) to decipher hieroglyphics?

  • Don't you have to check on something which does not fall into any category to assure it works if nothing is found??

  • not really fair to say alphabets have 10 -40 characters, because you completely skipped numbers and possibly punctuation. Its a bit too simple to just mention only "letters"…

  • How would your script know if it's an alphabet if it has symbols around it like the diacritics (nikudot) Hebrew has and sometimes uses. If you have a collection of texts & didn't know.

  • You were wrong on the hiragana count. This is because you only counted the 46 part of the 五十音 (Gojūon). In addition to this there are the following: 20 Dakuon 濁音 (voiced), 5 Handakuon 半濁音 (the p sounds), 36 Yōon 拗音 (a combination of the consonant of i-column syllables and ya, yu or yo using smaller versions of the y_ characters, makes sounds like kyu, jū etc.), 1 Sokuon 促音 (っ, makes double consonants), and 6 additional letters (another hiragana followed by one of the 6 vowels, used for some sounds in foreign words–these are normally written in katakana though). This adds up to a total of 114. Otherwise, loved the overall video.

  • why that list in the video didn't have Arabic? 🙁 it is actually very correctable with everything you're talking here.
    aleph = alpha = Al in the first of every english word, it is the beginning jst like alpha and the omega, alpha is father, omega is mother, as for in arabic Om = MO-ther and h-OM-e. u see?

    B in arabic is Ba', ب – باء – means the base of things, basics, basement, bar, become, boat, ball, back, bone, brain. in arabic also words that states (basics) starts with B.
    Bet means Home in arabic

    Noun in arabic , the letter N, spelled noun, also means noun in english, for the meaning of letter ن – نون in arabic resembles to depth of meaning behind something, jst as for Jonas story with the whale in the (deep) blue sea, the nun = the whale, jonas knew with nun.

    Meme in english is also meem in arabic, the letter M = م – ميم . also means the water of things, the meaning or interpretation of something, it is adjustable and flexible just like water.
    mimic, medic, mother, mom, matter, mind… u get it.

    Paradise / faradise / Fardaws in arabic – فردوس = heaven
    what's beautiful is that I still find correlation between nordic and norse religion names correlating with Arabic. and Islamic religion naming and words that hold parables or deep philosophies.

    and a lot more.

    thanks anyway and I hope you proceed this series.
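
Several replies above ask how raw text should be counted: whether capitals, digits and punctuation inflate the total, and whether multi-byte UTF-8 characters break the count. A minimal sketch of one way to handle this in Python follows. The function name `count_symbols` and its two flags are hypothetical, and folding case presupposes knowing which signs are case pairs, which a decipherer facing an unknown script would not know.

```python
import unicodedata


def count_symbols(text, fold_case=True, letters_only=True):
    """Count distinct symbols in text, iterating over code points, not bytes."""
    if fold_case:
        text = text.casefold()  # treat "A" and "a" as one symbol
    symbols = set()
    for ch in text:
        if ch.isspace():
            continue  # never count spaces or line breaks
        # Unicode general category: letters start with "L" (Lu, Ll, Lo, ...)
        if letters_only and not unicodedata.category(ch).startswith("L"):
            continue
        symbols.add(ch)
    return len(symbols)


sample = "Aa Bb Cc, 123!?"
print(count_symbols(sample, fold_case=False, letters_only=False))  # 12
print(count_symbols(sample))  # 3

# A Python 3 string is a sequence of code points, so Cyrillic letters
# count once each even though each takes two bytes in UTF-8.
word = "привет"
print(len(word), len(word.encode("utf-8")))  # 6 12
print(count_symbols(word))  # 6
```

The same fifteen-character sample yields 12 symbols or 3 depending on these choices, which is the point the replies make about English having "more than 26 symbols". Note that with `letters_only` set, combining marks such as Hebrew vowel points (category `Mn`) are skipped as well, which may or may not be what you want.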
