MyoSign: enabling end-to-end sign language recognition with wearables


– Sign language is the main communication form among deaf and hearing-impaired people. A sign gesture is mainly composed of several components: hand shape, movement, palm orientation, and location. Unfortunately, very few people with normal hearing understand sign languages, and existing communication approaches, such as sign language interpreters or writing on paper, have key limitations in cost, availability, and convenience. Therefore, an automatic sign language recognition system is highly desirable. Such a system captures sign gestures with sensors, such as cameras or wearable sensors, and then translates the gestures into text or speech.

Many efforts have been made during the past decade. Vision-based methods achieve high accuracy, but they may not work in the dark and cannot be used anywhere and anytime. Methods using data gloves are intrusive and inconvenient. Methods based on armbands provide a more natural and convenient manner; however, most of them focus on user-dependent scenarios, and performance generally degrades when the user is new to the system. Moreover, these methods perform only single-word recognition rather than sentence-level translation.
In this paper, we present MyoSign, a deep learning based system that enables user-independent, end-to-end American Sign Language recognition at both the word and sentence levels. We use a lightweight, off-the-shelf wearable device, the Myo armband. This figure shows a typical application of our system: when a deaf person wearing the Myo armband communicates with a hearing person, our system translates the performed gestures into spoken English via a smartwatch. Meanwhile, a speech recognition system translates the spoken English into text on the smartwatch.

The Myo armband contains three inertial sensors and eight EMG sensors. The inertial sensors are good at capturing movement information, and the EMG sensors do well at capturing muscle activity patterns, which can be used to distinguish fine finger movements.
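To make the four modalities concrete, here is a minimal sketch of what a single recorded gesture might look like. The sampling rates (roughly 50 Hz for the inertial sensors and 200 Hz for EMG) and the window length are assumptions for illustration, not values stated in the talk.

```python
# Minimal sketch of the four Myo modalities for one gesture window.
# Sampling rates and window length are assumed, not taken from the talk.
import numpy as np

WINDOW_SECONDS = 2.0   # hypothetical duration of one sign gesture
IMU_RATE_HZ = 50       # assumed inertial sampling rate
EMG_RATE_HZ = 200      # assumed EMG sampling rate

gesture = {
    # three inertial modalities, three axes each
    "acceleration": np.zeros((int(IMU_RATE_HZ * WINDOW_SECONDS), 3)),
    "gyroscope":    np.zeros((int(IMU_RATE_HZ * WINDOW_SECONDS), 3)),
    "orientation":  np.zeros((int(IMU_RATE_HZ * WINDOW_SECONDS), 3)),
    # eight EMG channels, one per electrode around the forearm
    "emg":          np.zeros((int(EMG_RATE_HZ * WINDOW_SECONDS), 8)),
}

for name, signal in gesture.items():
    print(f"{name}: {signal.shape}")   # e.g. acceleration: (100, 3)
```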
There are some challenges we need to solve. The first is user diversity: different people have different habits when performing gestures, and their muscle tissue also differs. The second is sensor data fusion: our system collects sensory data from four modalities, three kinds of inertial data plus EMG, and the different modalities perceive sign language from different perspectives and have distinctive signal and noise models. The third is how to recognize sign language in an end-to-end manner, that is, without pre-segmenting the gestures; pre-segmentation is very inconvenient because users will not pause between two words.
To resolve all these challenges, we developed a unified deep learning network. It is comprised of three modules: a multimodal convolutional neural network (CNN), a bidirectional LSTM, and CTC. Let me introduce these modules. First, we split the continuous data into data clips. For each clip, we design a multimodal CNN to learn intra-modality and cross-modality relations: we first apply a three-layer CNN to each individual sensor's data and then use a three-dimensional CNN to learn the cross-modality relations.
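As a rough illustration of that idea (not the authors' exact architecture), the sketch below applies a small per-modality CNN to one data clip of each sensor stream and then fuses the stacked per-modality features with a 3-D convolution; the channel sizes, kernel sizes, and pooled lengths are assumptions.

```python
# Hedged sketch of the multimodal CNN idea: per-modality feature extraction
# followed by 3-D convolutional fusion across modalities. All layer sizes
# are illustrative assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn

class ModalityCNN(nn.Module):
    """Three 1-D conv layers applied to a single modality's clip."""
    def __init__(self, in_channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(16),          # normalize clip length across rates
        )

    def forward(self, x):                      # x: (batch, channels, time)
        return self.net(x)                     # (batch, 64, 16)

class MultimodalCNN(nn.Module):
    """Per-modality CNNs, then a 3-D conv over the stacked modality features."""
    def __init__(self):
        super().__init__()
        self.acc, self.gyro = ModalityCNN(3), ModalityCNN(3)
        self.ori, self.emg = ModalityCNN(3), ModalityCNN(8)
        # kernel depth 4 spans all four modalities at once
        self.fusion = nn.Conv3d(1, 8, kernel_size=(4, 3, 3), padding=(0, 1, 1))

    def forward(self, acc, gyro, ori, emg):
        feats = [self.acc(acc), self.gyro(gyro), self.ori(ori), self.emg(emg)]
        stacked = torch.stack(feats, dim=1).unsqueeze(1)   # (batch, 1, 4, 64, 16)
        return self.fusion(stacked).flatten(1)             # clip feature vector

# One clip per modality (lengths differ because sampling rates differ).
clip_feature = MultimodalCNN()(torch.randn(1, 3, 100), torch.randn(1, 3, 100),
                               torch.randn(1, 3, 100), torch.randn(1, 8, 400))
print(clip_feature.shape)                                  # torch.Size([1, 8192])
```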
We model the temporal dependencies of the input sign language sequences using an RNN. Many ASL signs show very similar characteristics at the beginning, for example 'want' and 'what' in this figure. This similarity tends to confuse a traditional RNN, so we employ a bidirectional model which performs inference at each point of the trajectory based on both past and future trajectory information. A word sign gesture generally contains three temporally related phases: preparation, nucleus, and retraction, of which the nucleus is the most discriminative. We should design methods which can automatically rely on the nucleus for classification.
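A minimal sketch of such a bidirectional recurrent layer over the sequence of clip-level features is shown below; the hidden size, number of layers, and vocabulary size are assumptions, and the paper's exact mechanism for emphasizing the nucleus phase is not reproduced here.

```python
# Hedged sketch of a bidirectional LSTM over clip features: each time step
# sees both past and future context, which helps when two signs (e.g. "want"
# vs. "what") only diverge later in the gesture. Sizes are assumptions.
import torch
import torch.nn as nn

class SequenceEncoder(nn.Module):
    def __init__(self, feat_dim=8192, hidden=128, num_words=70):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        # +1 output class for the CTC blank symbol used later
        self.classifier = nn.Linear(2 * hidden, num_words + 1)

    def forward(self, clip_feats):          # (batch, n_clips, feat_dim)
        out, _ = self.blstm(clip_feats)     # (batch, n_clips, 2 * hidden)
        return self.classifier(out)         # per-clip logits over the vocabulary

logits = SequenceEncoder()(torch.randn(2, 40, 8192))
print(logits.shape)                         # torch.Size([2, 40, 71])
```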
As for sentences, it is harder to pre-segment a whole sentence into individual words, as users will not pause between adjacent words. MyoSign adopts a probabilistic framework based on Connectionist Temporal Classification (CTC) for end-to-end translation. The CTC-based framework lets MyoSign get around explicit temporal alignment between the input and the output at both the word level and the sentence level.
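The sketch below shows, under assumed sizes, how a CTC loss and a simple greedy decode remove the need for pre-segmentation: the network just emits per-clip logits (including a blank symbol), and CTC handles the alignment with the unsegmented target word sequence.

```python
# Hedged sketch of CTC training and greedy decoding over per-clip logits.
# Vocabulary size, sequence lengths, and the decoding scheme are assumptions.
import torch
import torch.nn as nn

VOCAB = 70                                   # word indices 1..70, 0 is the blank
ctc_loss = nn.CTCLoss(blank=0)

# Per-clip log-probabilities from the BLSTM: (n_clips, batch, VOCAB + 1).
log_probs = torch.randn(40, 2, VOCAB + 1).log_softmax(dim=-1)
targets = torch.tensor([[5, 17, 3], [8, 8, 22]])    # word indices per sentence
input_lengths = torch.tensor([40, 40])
target_lengths = torch.tensor([3, 3])

loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)

# Greedy decoding: take the best class per clip, collapse repeats, drop blanks.
best = log_probs.argmax(dim=-1).transpose(0, 1)     # (batch, n_clips)
decoded = []
for seq in best.tolist():
    words, prev = [], None
    for idx in seq:
        if idx != prev and idx != 0:
            words.append(idx)
        prev = idx
    decoded.append(words)
print(loss.item(), decoded)
```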
We collected a word dataset from 15 volunteers and obtained about 10,000 samples, and we performed a user-independent test. User independence poses great challenges due to individual differences, but MyoSign still achieves an outstanding result.

We then compare MyoSign with other algorithms. Two of them are traditional algorithms, Random Forest and SVM. We also designed five other deep learning models, which are variants of the MyoSign model obtained by changing one or two design components in the architecture. The comparative results reinforce the necessity of MyoSign's design components.

We also constructed a sentence dataset and performed user-independent and unseen-sentence tests. In the user-independent test, the average accuracy is 93.1% across 15 volunteers. We also evaluated the recognition performance of MyoSign for unseen sentences and compared it with other methods; there the average accuracy is 92.4%. These impressive results further confirm the effectiveness of MyoSign in end-to-end sign language recognition and eliminate the burden of collecting all possible American Sign Language sentences.
As a conclusion, we propose a non-intrusive wearable system that enables end-to-end American Sign Language recognition at both word and sentence levels, and we design a deep learning network supporting sign language recognition with robust user-independent feature extraction, effective modality fusion, and no requirement for temporal segmentation. MyoSign achieves an average accuracy of 93.7% at the word level and 93.1% at the sentence level in user-independent settings. That's all of my talk. Thanks for listening. (audience clapping) I'm glad to take some questions.
– I have a lot of questions, but let me start with a very basic one: in your transcription here on the slides, you always show the output as English words, but are you transcribing English words or transcribing ASL words?

– No, we haven't done that part of the job. We are just translating sign language to English text.

– Okay, so the output is actually English, not ASL? Is that what you're saying?

– Pardon?

– What I'm trying to say is, let's use an analogy with speech recognition. Suppose I want to speak English and get an output in German text. I could have an end-to-end system trained on English speech that produces German text, which would be speech recognition and machine translation together, but practically people find it very useful to have one system which transcribes the English speech into English word text and then machine translation that translates the English to German. So my question is, if you take this analogy, is your system transcribing ASL signs into written ASL words, or is it doing the transcription and translation to English in one step?

– In one step.

– One step, okay. So my next question is: why?

– Because if we translated into words first and then combined those words into sentences, we would have to do some pre-segmentation: segment the words and then combine them into a sentence.

– Okay. I'm not sure I fully understand that. Let's move on to somebody else.

– Are there any other questions before we wrap up? We have a couple of minutes still. Yeah? Okay.

– Yeah, thank you for the talk. In your slides you said that you had 70 different words that you were recognizing, so is it correct that your vocabulary was only 70 words?

– 70?

– Okay, I think I saw somewhere, a few slides back, that you have around 70 words that you were recognizing.

– Sorry, could you be louder?

– Well, could you answer what kind of vocabulary you were using, and how big it was?

– How big it was?

– Yeah, if you go to the previous slide, back, back, so here. Yeah, 70 words. I was just asking why you were using this kind of small set of words.

– Because we chose the 70 most frequent words, and we think this can be extended to more words later. So, are you asking me why we chose these 70 words?

– Uh, yes.

– Because we chose the 70 most frequent words.

– Were the sentences also using those 70 words?

– Yes, the sentences also use these 70 words, and some sentences are not included in the training set, so we introduce them as unseen sentences.

– Are they seventy English words or seventy ASL words?

– The seventy sign language words correspond to seventy English words; for example, a single sign corresponds to the English 'want to'.

– Yeah, I think we'll end about now so we can move on.
