How to Make a Language Translator – Intro to Deep Learning #11
September 18, 2019
Hello World! It’s Siraj, and let’s make our own language translator using TensorFlow.
Today there are about 6,800 different languages spoken across the world, and in an increasingly globalised world nearly every culture interacts with every other culture in some way. That means there are an incalculable number of translation requirements every second of every day across the world. Translating is no easy task. A language isn’t just a collection of words and rules of grammar and syntax; it’s also a vast interconnecting system of cultural references and connotations. This reflects a centuries-old problem: two cultures want to communicate but are blocked by a language barrier. Our translation systems are fast improving though, so whether it be an idea, a story or a quest, each new advancement means one less message will be lost in translation.
During the Second World War, the British government was hard at work trying to decrypt the Morse-coded radio communications that Nazi Germany sent securely using a cipher known as Enigma. They decided to hire a man named Alan Turing to help in their effort, and when the American government learnt of their translation effort they were inspired to try it themselves post-war, specifically because they needed a way to keep up with Russian scientific publications. The first public demo of a machine translation system translated 250 words between Russian and English in 1954. It was dictionary based, so it would attempt to match the source language to the target language word for word. The results were poor since it didn’t capture syntactic structure. The second generation of systems used an interlingua: they changed the source language to a special intermediary language with specific rules encoded into it, then from that generated the target language.
This proved to be more efficient, but the approach was soon overshadowed by the rise of statistical translation in the early 90s, primarily from engineers at IBM.
“Innovation at IBM.” “Watch this.”
A popular approach was to break the source text down into segments, then compare them to an aligned bilingual corpus, using statistical evidence and probabilities to choose the most likely translation. Nowadays the most used statistical translation system in the world is Google Translate, and with good reason. Google uses deep learning to translate from a given language to another with state-of-the-art results. So how do they do this? Let’s recreate their results in TensorFlow to find out.
The dataset we’ll be using to train our language translation model is a corpus of transcribed TED talks. It’s got both the English version and the French version, and our goal will be to create a model that can translate from one to the other after training. We’ll be using TensorFlow’s data_utils helper to pre-process our dataset, and we’ll start by defining our vocab size, which is the number of words we want to train on from our dataset. We’ll set it to 40k for each language, which is a small portion of the data. Then we’ll use data_utils to read the data from the data directory, giving it our desired vocab size, and it will return the formatted and tokenised words in both languages. We’ll then initialise TensorFlow placeholders for our encoder and decoder inputs. Both will be integer tensors that represent discrete values; they will be embedded into a dense representation later. We’ll feed our vocabulary words to the encoder, and the encoded representation that’s learnt to the decoder.
Now we can build our model. Google recently published a paper discussing a system they integrated into their translation service called Neural Machine Translation. It’s an encoder-decoder model inspired by similar work from other papers on topics like text summarisation.
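The preprocessing step described above, capping the vocabulary and turning words into integer ids, might look roughly like this in plain Python. This is not the data_utils code itself: the function names, the reserved tokens and the toy sentences are all assumptions made up for illustration.

```python
from collections import Counter

# Reserved tokens. Their names and ordering are assumptions for this sketch.
SPECIAL_TOKENS = ["_PAD", "_GO", "_EOS", "_UNK"]


def build_vocab(sentences, vocab_size):
    """Map the most frequent words to integer ids, capped at vocab_size entries."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    n_words = max(vocab_size - len(SPECIAL_TOKENS), 0)
    most_common = [word for word, _ in counts.most_common(n_words)]
    return {token: i for i, token in enumerate(SPECIAL_TOKENS + most_common)}


def sentence_to_ids(sentence, vocab):
    """Turn a sentence into a list of ids; words outside the vocab become _UNK."""
    unk_id = vocab["_UNK"]
    return [vocab.get(word, unk_id) for word in sentence.lower().split()]


# Toy corpus standing in for the English side of the dataset.
english = ["the cat sat", "the cat ran", "the dog"]
vocab = build_vocab(english, vocab_size=6)  # 4 reserved tokens + 2 real words
ids = sentence_to_ids("the dog ran", vocab)
print(vocab)
print(ids)
```

With a cap of six entries, only the two most frequent words get their own id, and everything else collapses to the unknown token. That is the same trade-off as picking 40k: a smaller vocab trains faster but loses rare words.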
So whereas before Google Translate would translate from language A to English to language B, with this new NMT architecture it can translate directly from one language to the other. It doesn’t memorise phrase-to-phrase translations; instead it encodes the semantics of the sentence. This encoding is generalised, so it can even translate between a language pair like Japanese and Korean that it hasn’t explicitly seen before.
So I guess we can use an LSTM recurrent network to encode a sentence in language A. The RNN spits out a hidden state ‘s’ which represents the vectorised contents of the sentence. We can then feed ‘s’ to the decoder, which will generate the translated sentence in language B, word by word. Sounds easy enough, right? WRONG!
There is a drawback to this architecture: it has limited memory. That hidden state ‘s’ of the LSTM is where we’re trying to cram the whole sentence we want to translate, and ‘s’ is usually only a few hundred floating point numbers long. The more we try to force our sentence into this fixed-dimensionality vector, the more lossy our neural net is forced to be. We could increase the hidden size of the LSTM, since after all they’re supposed to remember long-term dependencies, but what happens is that as we increase the hidden size ‘h’ of the LSTM, the training time climbs steeply.
So to solve this we’re going to bring ‘Attention’ into the mix. If I was translating a long sentence, I’d probably glance back at the source sentence a couple of times to make sure I was capturing all the details.
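As a rough back-of-the-envelope check on the hidden-size trade-off, here is a minimal sketch that counts the weights in one standard LSTM layer (four gates, each with input weights, recurrent weights and a bias). The function name and the example sizes are assumptions for illustration, and weight count is only a proxy for training time, not a measurement of it.

```python
def lstm_param_count(input_size, hidden_size):
    """Weights in one standard LSTM layer: 4 gates x (input + recurrent + bias)."""
    per_gate = hidden_size * (input_size + hidden_size) + hidden_size
    return 4 * per_gate


# Keep the input (embedding) size fixed and keep doubling the hidden size 'h'.
for hidden_size in (256, 512, 1024):
    print(hidden_size, lstm_param_count(input_size=256, hidden_size=hidden_size))
```

In this example each doubling of ‘h’ multiplies the weight count by three or more, and the ratio approaches four once the hidden size dominates the input size. A bigger ‘s’ gets expensive quickly, which is part of why attention is the more attractive fix.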
I’d iteratively pay attention to the relevant parts of the source sentence. We can let neural nets do the same by letting them store and refer to previous outputs of the LSTM. This increases the storage of our model without changing the functionality of the LSTM. So the idea is, once we have the LSTM outputs from the encoder stored, we can query each output, asking how relevant it is to the computation happening in the decoder. Each encoder output gets a relevancy score, which we can convert to a probability score by applying a softmax activation to it. Then we extract a context vector, which is a weighted summation of the encoder outputs depending on how relevant they are.
Memory ain’t enough, pay attention! (Repeated in Hindi, German and Spanish.)
We build our model using TensorFlow’s built-in embedding_attention_seq2seq function, giving it our encoder and decoder inputs as well as a few hyperparameters we define, like the number of layers. It builds a model that is just like the one we discussed. TensorFlow has several built-in models like this that we can drop into our code easily.
So normally this alone would be fine, and we could run this and the results would be decent, but they added another improvement to their model that requires MORE CODE, 100 GPUs and a WEEK OF TRAINING. Seriously, that’s what it took. We won’t implement it all programmatically, but let’s dive into it conceptually. If the outputs don’t have sufficient context then they won’t be able to give a good answer. We need to include info about future words, so that the encoder output is determined by the words on the left and the right. We humans would definitely use this kind of full context to determine the meaning of a word we see in a sentence. The way they did this is to use a bi-directional encoder, so it’s two RNNs.
One goes forward over the sentence and the other goes backwards. So for each word it concatenates the vector outputs, which produces a vector with context from both sides. And they added a lot of layers to their model. The encoder has one bi-directional RNN layer and seven uni-directional RNN layers. The decoder has eight uni-directional RNN layers. The more layers, the longer the training time, so that’s why only a single bi-directional layer is used. If all the layers were bi-directional, each whole layer would have to finish before the layers that depend on it could start computing. But by using uni-directional layers, computation is going to be more parallel.
We’ll initialise our TensorFlow session, then our model inside of it. Let’s see some results after training. First I’ll give it this phrase. Looks good! And now another phrase. DOPE! While it’s not perfect and we still have a way to go, we’re definitely getting closer to having a universal translation model.
Breaking it down: Encoder-decoder architectures offer state-of-the-art performance in machine translation. By storing the previous outputs of the LSTM cells, we can judge the relevancy of each to decide which to use via an attention mechanism. And by using a bi-directional RNN, the context of both past and future words is used to create an accurate encoder output vector.
The coding challenge winner from last week is Ryan Lee. This was very impressive: he created a recipe summariser by scraping 125,000 recipes from the web and documented it all beautifully, with installation steps so you can reproduce the results yourself. WIZARD OF THE WEEK! And the runner-up is Sarah Collins. Her code converts scientific papers to text and prioritises them by topic.
This week’s coding challenge is to create a simple translation system using an encoder-decoder model. All the details are in the README. Post your GitHub link in the comments and I’ll announce the winner next week.
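If you want a starting point for the challenge, the two ideas from the breakdown (a bi-directional encoder and an attention step) can be sketched in a few lines of NumPy. This is a toy under stated assumptions, not the model from the video: it uses a plain tanh RNN instead of an LSTM, random untrained weights, and a simple dot product as the relevancy score, where a trained system would learn its scoring function. All function names and sizes here are made up for illustration.

```python
import numpy as np


def rnn_pass(inputs, w_in, w_rec):
    """Run a plain tanh RNN over inputs (length, embed); return every hidden state."""
    hidden = np.zeros(w_rec.shape[0])
    outputs = []
    for x in inputs:
        hidden = np.tanh(w_in @ x + w_rec @ hidden)
        outputs.append(hidden)
    return np.stack(outputs)


def bidirectional_encode(inputs, fwd_weights, bwd_weights):
    """Concatenate a left-to-right pass and a right-to-left pass for each word."""
    forward = rnn_pass(inputs, *fwd_weights)
    # Read the sentence backwards, then flip so row t lines up with word t again.
    backward = rnn_pass(inputs[::-1], *bwd_weights)[::-1]
    return np.concatenate([forward, backward], axis=1)


def softmax(scores):
    exp = np.exp(scores - np.max(scores))  # subtract the max for numerical stability
    return exp / exp.sum()


def attend(encoder_outputs, decoder_state):
    """Score each encoder output against the decoder state, then mix them."""
    scores = encoder_outputs @ decoder_state  # one relevancy score per source word
    weights = softmax(scores)                 # scores -> probabilities
    context = weights @ encoder_outputs       # weighted sum of encoder outputs
    return weights, context


def random_weights(rng, hidden_size, embed_size):
    return (rng.normal(size=(hidden_size, embed_size)) * 0.5,
            rng.normal(size=(hidden_size, hidden_size)) * 0.5)


rng = np.random.default_rng(0)
source_len, embed_size, hidden_size = 5, 3, 4
sentence = rng.normal(size=(source_len, embed_size))  # stand-in for embedded words
fwd_weights = random_weights(rng, hidden_size, embed_size)
bwd_weights = random_weights(rng, hidden_size, embed_size)

encoder_outputs = bidirectional_encode(sentence, fwd_weights, bwd_weights)
decoder_state = rng.normal(size=2 * hidden_size)  # must match the concatenated size
weights, context = attend(encoder_outputs, decoder_state)
print(encoder_outputs.shape, weights.round(3), context.shape)
```

Each row of encoder_outputs has twice the hidden size because it carries context from both sides of the word, and the attention weights always sum to one, so the context vector is a blend of the encoder outputs rather than a pick of just one.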
Please subscribe for more programming videos. Check out this related video. And for now, I’ve got to get a better GPU, so thanks for watching!