The Death of a Language

Kyle interviews Zane and Leena about the Endangered Languages Project. My takeaways:
- The project is taking in 3.5 hours of audio content from an endangered language called “Ladin”.
- It creates phonetic transcriptions from audio samples of human languages.
- The model has so far produced decent levels of vowel identification.
- The team is currently working on phoneme segmentation and the larger consonant categories (a rough sketch of frame-level phoneme classification follows below).
From the project blurb: “In this project, we are trying to speed up the process of language documentation by building a model that produces phonetic transcriptions from audio samples of human languages.”
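To make the task concrete, here is a minimal, hypothetical sketch of frame-level phoneme classification, the kind of building block a phonetic-transcription model needs. This is my own illustration, not the project's actual model; the features and labels are synthetic stand-ins for per-frame acoustic features (e.g. MFCCs) and phoneme labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_frames, n_features = 500, 13                     # e.g. 13 MFCC coefficients per audio frame
phonemes = np.array(["a", "e", "i", "o", "u", "d", "l", "n"])

X = rng.normal(size=(n_frames, n_features))        # stand-in for acoustic features
y = rng.choice(phonemes, size=n_frames)            # stand-in for per-frame phoneme labels

# Train a frame-wise classifier; vowel identification is this task restricted to vowel classes.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict(X[:10]))                         # predicted phoneme for the first 10 frames
```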

Sequence to Sequence Models

Kyle Polich discusses sequence-to-sequence models. The following points are from the podcast:
- Many ML approaches suffer from a fixed-input, fixed-output assumption; natural language has neither a fixed input length nor a fixed output length.
- Tasks such as summarizing a paper or translating between languages do not have fixed-length input-output pairs.
- What a word means depends on its context.
- There is an internal state representation that the algorithm learns.
- The encoder/decoder architecture has obvious promise for machine translation, and has been successfully applied this way (a minimal sketch follows below).
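To make the encoder/decoder idea concrete, here is a minimal sketch in PyTorch (my own illustration, not code from the episode). It shows a variable-length source sequence being compressed into an internal state that then conditions the decoder.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb_dim=32, hidden_dim=64):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encoder: read the whole source and keep only its final hidden state,
        # the internal state representation the model learns.
        _, state = self.encoder(self.src_emb(src_ids))
        # Decoder: generate target-side hidden states conditioned on that state.
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), state)
        return self.out(dec_out)            # per-step scores over the target vocabulary

# Toy usage: a batch of 2 source sequences (length 5) and target prefixes (length 4).
model = Seq2Seq(src_vocab=100, tgt_vocab=120)
src = torch.randint(0, 100, (2, 5))
tgt = torch.randint(0, 120, (2, 4))
print(model(src, tgt).shape)                # torch.Size([2, 4, 120])
```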

Simultaneous Translation

Kyle Polich talks with Liang Huang about his work at Baidu on simultaneous translation. The following points are covered in the podcast:
- Most advertised cross-language translation products, such as Skype, do not do simultaneous translation: they wait for the speaker to finish and then translate. That is consecutive translation, not simultaneous translation.
- Simultaneous translation trades off accuracy against latency; you cannot wait too long before producing the translation.
- The approach is a prefix-to-prefix method of translating, which starts emitting target words after seeing only a prefix of the source sentence (a sketch follows below).
- Open question from my notes: what dataset was used?
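Here is a rough sketch of the prefix-to-prefix idea in the style of a wait-k policy: wait for the first k source words, then emit one target word per new source word. translate_step is a hypothetical placeholder for a real model that predicts the next target word from the visible source prefix and the target words produced so far; this is my illustration, not Baidu's implementation.

```python
def translate_step(source_prefix, target_so_far):
    # Placeholder: a real system would run a neural translation model here.
    return f"t{len(target_so_far) + 1}"

def wait_k_translate(source_words, k=2):
    target = []
    for i in range(len(source_words)):
        if i + 1 < k:                        # still waiting for the first k source words
            continue
        prefix = source_words[: i + 1]       # only a prefix of the source is visible
        target.append(translate_step(prefix, target))
    # Once the source ends, finish the tail; a real system would keep going
    # until the model emits an end-of-sentence token.
    for _ in range(k - 1):
        target.append(translate_step(source_words, target))
    return target

print(wait_k_translate(["ich", "bin", "heute", "müde"], k=2))   # ['t1', 't2', 't3', 't4']
```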

Human vs Machine Transcription

Kyle Polich talks with Andreas Stolcke about a paper that compares human and machine transcription. The following are the highlights of the paper:
- The dataset used was Switchboard, which contains voice recordings of individuals talking about carefully chosen topics; these recordings were then transcribed into sentences, and this served as the labeled dataset for the machine learning algorithms.
- The researchers found that the human word error rate was about 5%, and the neural network achieved a comparable error rate (a word-error-rate sketch follows below).
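For reference, here is a minimal word error rate (WER) sketch, the metric typically used in such transcription comparisons: the word-level edit distance (substitutions, insertions, deletions) between a hypothesis and a reference transcript, divided by the reference length. This is a generic illustration, not code from the paper.

```python
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("so how are you doing today", "so how you doing to day"))   # 0.5
```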

Data Skeptic - Word Embeddings Lower Bound

The following are my takeaways from a Data Skeptic podcast interview with Kevin Patel:
- A word embedding dimension of 300 is mostly chosen based on intuition rather than principled analysis.
- When a telecom company wanted to analyze the sentiment of SMS messages, they were challenged by the large 300-dimensional representation of words. They wanted a lower-dimensional representation, more like a 10-dimensional space. This was a problem, as most available embeddings use at least a 100- to 300-dimensional space.
- To date there has not been any scientific investigation into this hyperparameter choice.
- Kevin Patel and his team investigated this hyperparameter on the Brown corpus and found that a dimension of 19 was enough to efficiently represent the word vectors in that corpus.
- The team borrowed concepts from algebraic topology (a simple dimensionality probe is sketched below).
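As a hedged illustration of the underlying question, how few dimensions are enough, here is a sketch that projects word vectors down with PCA and checks how much nearest-neighbor structure survives. This is not Patel et al.'s algebraic-topology method; the vectors here are synthetic stand-ins for real 300-dimensional embeddings (real embeddings have lower intrinsic dimension, so their neighbor structure survives reduction far better than random data does).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 300))      # stand-in for 300-dim word embeddings

def neighbor_overlap(full, reduced, k=10):
    """Fraction of each word's k nearest neighbors preserved after reduction."""
    nn_full = NearestNeighbors(n_neighbors=k + 1).fit(full)
    nn_red = NearestNeighbors(n_neighbors=k + 1).fit(reduced)
    a = nn_full.kneighbors(full, return_distance=False)[:, 1:]    # drop the self-neighbor
    b = nn_red.kneighbors(reduced, return_distance=False)[:, 1:]
    return float(np.mean([len(set(x) & set(y)) / k for x, y in zip(a, b)]))

for dim in (10, 19, 50):
    reduced = PCA(n_components=dim).fit_transform(vectors)
    print(dim, round(neighbor_overlap(vectors, reduced), 3))
```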