Transfer Learning

The following are the learnings from the podcast:

- Transfer learning entails reusing existing models: take a model that comes from training on a different task and apply it to yours.
- Delivering value no longer requires heavy custom feature engineering.
- Most of the recent successes are in the field of computer vision.
- If you do not have a lot of training data, you can start from a model that is already trained on a large image dataset such as ImageNet (see the sketch below).
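The ImageNet point is the classic recipe: keep a pretrained backbone, freeze it, and swap in a new output layer for your own task. A minimal sketch, assuming PyTorch and torchvision are available; the 5-class task and the hyperparameters are made up for illustration.

```python
# A minimal transfer-learning sketch: reuse an ImageNet-pretrained ResNet
# and retrain only the new classification head.
import torch
import torch.nn as nn
from torchvision import models

# Load a ResNet-18 with ImageNet-pretrained weights.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pretrained backbone so only the new head is trained.
for param in model.parameters():
    param.requires_grad = False

# Replace the final layer for a hypothetical 5-class task.
num_classes = 5
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Only the new head's parameters are handed to the optimizer.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```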

Mapping Dialects with Twitter Data

The following are the learnings from the podcast:

- Bruno Gonçalves, now at JPMorgan Chase, holds a PhD from Emory University.
- He has done interesting work mining Twitter data for geography-based language patterns: can one draw a map based on language patterns alone?
- The dataset was roughly 10 TB of Twitter data.
- The approach builds a huge matrix pairing words with latitude and longitude, i.e. a word-by-geolocation matrix.
- PCA plus k-means clustering is applied to the patterns in this high-dimensional matrix that combines word usage and geolocation (see the sketch below).
- Mobile phones have made marrying the two datasets, text and location, possible.
- The evolution of language across time can also be studied this way.
- A ton of people are working on emoji usage in Twitter feeds.
- A ton of work can also be done with Reuters News data and NLP.
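The PCA + k-means step is straightforward to sketch. Assuming scikit-learn and a hypothetical word-by-location count matrix (rows are lat/lon grid cells, columns are words); the data here is random and purely illustrative.

```python
# Sketch of dimensionality reduction + clustering on a word-by-location matrix.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Hypothetical data: 1,000 lat/lon grid cells x 5,000 word counts.
rng = np.random.default_rng(0)
counts = rng.poisson(lam=1.0, size=(1000, 5000)).astype(float)

# Reduce the high-dimensional word counts to a few principal components.
components = PCA(n_components=10).fit_transform(counts)

# Cluster the grid cells; each cluster is a candidate dialect region.
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(components)
print(labels[:20])
```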

The Transformer

The following are the learnings from the podcast:

- The word "bank" has different meanings in different contexts: it could be a river bank or a financial institution.
- The Transformer is an encoder-decoder architecture that makes word embeddings sensitive to context (see the attention sketch below).
- It is a modern NLP technique.
- "Attention Is All You Need" is the paper that revolutionized this space.
- As that paper's abstract notes, the dominant sequence transduction models were based on complex recurrent or convolutional neural networks in an encoder-decoder configuration; the Transformer replaces recurrence and convolutions with attention.
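The core operation that lets a token like "bank" be interpreted in context is scaled dot-product attention. A minimal NumPy sketch of that single operation, not the full encoder-decoder stack; the sequence length and embedding size are illustrative.

```python
# Scaled dot-product attention: each query attends to all keys, and the
# output is a context-weighted sum of the values.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: arrays of shape (sequence_length, model_dim)."""
    d_k = K.shape[-1]
    # Similarity of every query with every key, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over keys turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output row is a mixture of the value vectors.
    return weights @ V

# Illustrative example: 4 tokens with 8-dimensional embeddings.
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(4, 8))   # self-attention uses the same sequence
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```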

Named Entity Recognition

Kyle Polich discusses NER in this podcast. My learnings are:

- What counts as an entity in an unstructured dataset? It depends on the context and the task the ML algorithm is trying to accomplish.
- spaCy is a Python package that can do NER (see the sketch below).
- NER is used in chatbot and semantic search applications.
- A lot of NER packages are good but not great.
- Market research is one application: parse out the brands that were mentioned.
- Wikipedia has a lot of markup, which makes it an easy source to do NER on.
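A minimal spaCy sketch, assuming the small English model has been installed with `python -m spacy download en_core_web_sm`; the sentence is just an example.

```python
# Run spaCy's pretrained pipeline and print the recognized entities.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

# Each entity carries its text span and a label such as ORG, GPE, or MONEY.
for ent in doc.ents:
    print(ent.text, ent.label_)
```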

The Death of a Language

Kyle interviews Zane and Leena about the Endangered Languages Project. My learnings are:

- The project takes in 3.5 hours of audio content from an endangered language called "Ladin".
- It creates phonetic transcriptions from audio samples of human languages (see the feature-extraction sketch below).
- The model has so far produced decent levels of vowel identification.
- The team is currently working on phoneme segmentation and larger consonant categories.

From the project blurb: "In this project, we are trying to speed up the process of language documentation by building a model that produces phonetic transcriptions from audio samples of human languages."
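To make the transcription task concrete, here is a generic sketch, not the project's actual pipeline, of the common first step: turning raw audio into per-frame features that a vowel or phoneme classifier could be trained on. It assumes librosa is installed, and the file name is hypothetical.

```python
# Extract MFCC frames from an audio file; frames like these are the usual
# input for vowel identification and phoneme segmentation models.
import librosa

# Load an audio sample (path is hypothetical), resampled to 16 kHz.
audio, sr = librosa.load("ladin_sample.wav", sr=16000)

# MFCCs summarize the spectral shape of each short frame of speech.
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)
print(mfcc.shape)  # (13, number_of_frames)
```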