Python Testing 101 and Testing 201 with pytest

What did I learn from going through two hours of videos on pytest?

Python Testing 101

  • pytest is the most popular Python package for testing
  • It is also the basis for a rich ecosystem of testing plugins
  • unittest comes with Python and is used to test the internals of core Python. It is a good, solid tool, but there are a lot of API calls that one might have to learn
  • A test should be divided into three stages
    • Arrange: set up the test, i.e. variables and data structures
    • Act: execute the code under test on that setup
    • Assert: check that the code produced the expected result
  • If you are building a user interface, you build a CLI tool and then keep adding tests in pytest scripts
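
The Arrange / Act / Assert stages can be sketched as a single pytest-style test. This is a minimal illustration, not an example from the videos; `apply_discount` is a hypothetical function invented here so that there is something to test.

```python
def apply_discount(prices, percent):
    """Hypothetical function under test: reduce every price by `percent`."""
    return [round(p * (1 - percent / 100), 2) for p in prices]


def test_apply_discount():
    # Arrange: set up the inputs
    prices = [100.0, 50.0]
    percent = 10

    # Act: run the code under test
    discounted = apply_discount(prices, percent)

    # Assert: check the result against what we expect
    assert discounted == [90.0, 45.0]
```

Saved in a file named like `test_discount.py`, this is collected and run by the `pytest` command, since pytest picks up functions whose names start with `test_`.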

Python Testing 201

  • In the test output, each . represents one passing test in a test file
  • Tests can expect an exception, and pytest can check whether the right exceptions are raised by the code
  • What if you want to test more examples?
  • The first assertion that fails aborts the rest of the assertions in that test
  • One can test a bunch of examples with the parametrize feature
  • A method is a function attached to a class
  • One can group functionality into a class
  • pytest incorporates fixtures that help you handle the setup that goes along with testing a function
  • Fixtures shared across test files should live in conftest.py
  • pytest has built-in fixtures
  • pytest has a rich plugin ecosystem that provides a variety of additional fixtures
  • Fixtures are for setup and reuse
  • One can organize test code into classes
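
Three of the features above (expected exceptions, parametrized examples and fixtures) fit in one short sketch. The function `parse_age` and the fixture `sample_ages` are hypothetical names made up for this illustration; only `pytest.raises`, `pytest.mark.parametrize` and `pytest.fixture` come from pytest itself.

```python
import pytest


def parse_age(text):
    """Hypothetical function under test: parse a non-negative integer age."""
    age = int(text)
    if age < 0:
        raise ValueError("age must be non-negative")
    return age


@pytest.fixture
def sample_ages():
    # Setup that several tests can reuse; pytest injects it by argument name
    return ["0", "7", "42"]


# Each tuple is run and reported as a separate test, so one failing
# example does not hide the others
@pytest.mark.parametrize("text, expected", [("0", 0), ("7", 7), ("42", 42)])
def test_parse_age(text, expected):
    assert parse_age(text) == expected


def test_negative_age_raises():
    # The test passes only if ValueError is raised inside the block
    with pytest.raises(ValueError):
        parse_age("-1")


def test_all_samples_parse(sample_ages):
    assert [parse_age(t) for t in sample_ages] == [0, 7, 42]
```

If `sample_ages` were needed by more than one test file, it could be moved into conftest.py unchanged.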

Takeaway

After working through the examples, I am now much more comfortable going through the book on pytest. There is no doubt that I will be using all of these things in the UOB project implementation.

Conversations with Hugging Face CTO

The following are my learnings from the Hugging Face interview in Oct 2019

  • GPT-2 from OpenAI is impressive - packaged into a demo application
  • Conversational AI + open source package (Transformers)
  • Half a million monthly active users
  • It is hard to do good deep conversational AI
  • Self starter - was working on ML in 2008 and then moved on to do some software jobs
  • I was curious to see what the number of downloads for various pre-trained models were. So I wrote a small Python program to get the download counts
import pandas as pd
import requests
from bs4 import BeautifulSoup

url       = "https://huggingface.co/models"
response  = requests.get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
# The class names below are the ones the page markup used when this was written
models    = html_soup.find(class_='models-list')

model_names = []
download_counts = []
for item in models.find_all('li'):
    # The link href starts with "/", so drop the first character
    model_names.append(item.a['href'][1:])
    # Keep the text that precedes the word "downloads" in the tooltip
    count = item.find(class_='tooltip').text.strip().split("downloads")[0].strip()
    download_counts.append(count)

model_stats = pd.DataFrame({'model': model_names, 'downloads_l30days': download_counts})
model_stats.head(20)

Here are the top 25 models as of [2020-07-06 Mon]

Super Mario Effect for Learning

Watched a fantastic TED Talk that highlighted the importance of gamifying learning

  • What if you looked at your learning as similar to playing Super Mario?
  • You focus on the princess, and all the rest of the steps are your learnings on the way
  • Life as a straight path is never a story worth telling
  • Turn any learning process into a game and things become super interesting
  • Research showed that no penalty means increased attempts and a better score
  • Nobody gets disappointed when the Italian plumber falls into a ditch - they just learn that they need to be careful at that level the next time they play
  • A 3-year effort - a dart board that moves based on how one throws the dart
  • Redesign boring tasks as games

Attention is all you need

The following are my learnings from the paper titled Attention Is All You Need:

  • Using RNNs for language modeling has been particularly painful, as they take a long time to train and have problems with learning representational encodings all at once
  • In the Transformer architecture, the number of operations required to relate signals from two arbitrary input or output positions is constant
  • Self-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence
  • The Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution
  • Learnt about the relationship between Induction, Deduction and Transduction
    • Induction derives the function from the given data, i.e. creates an approximating function
    • Deduction derives the values of the given function for points of interest
    • Transduction derives the values of an unknown function for points of interest from the data
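
To make the self-attention bullet concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention. The sizes and the random projection matrices `w_q`, `w_k`, `w_v` are assumptions chosen for illustration; a real Transformer learns these matrices and adds multiple heads, masking and positional encodings.

```python
import numpy as np


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over one sequence.

    x has shape (seq_len, d_model); each projection has shape (d_model, d_k).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Every position is compared with every other position in one matrix product
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Row-wise softmax (shifted by the row max for numerical stability)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted average of the value vectors
    return weights @ v, weights


rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 8, 4  # illustrative sizes
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape)      # (5, 4)
print(weights.shape)  # (5, 5)
```

The `(seq_len, seq_len)` weight matrix is why any two positions are related in a single step, whatever their distance in the sequence.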

Data Leakage

The following is an excellent summary of Data Leakage in time series testing.

BERT

Kyle Polich discusses BERT. The following are my takeaways.

Heuristics

In this brief post, I would like to pen down my thoughts on two aspects: Heuristics and Non-Interpretability of models.

Let’s look at the word embedding matrix. If you take a bunch of words and want to build a learning algorithm, the first task is to convert the text into a bunch of numbers. Two popular algorithms that have revolutionized the field of NLP are the skip-gram method and the CBOW method. Both involve learning a lower-dimensional representation of each word. The dimensions are not interpretable, as they are not unique. That did not stop people from developing fantastic applications. Suppose you are lost in a foreign country and want to ask someone the way to your destination: you flip open your phone, speak your native language, your phone translates the sentence into the foreign language (text/audio), and you use it to converse with strangers. The job gets done. Do you really care how the word embedding algo is working? Not really. So we don’t need to be hung up on interpretability for all applications. In trading, for example, if the strategy makes money, you might not care too much about the interpretability of the strategy.
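
As a rough illustration of how skip-gram and CBOW differ, the sketch below only builds the training pairs each method would see for a tiny sentence; it does not train any embeddings. The helper names and the window size are assumptions made for this example.

```python
def context_window(tokens, i, window):
    """Tokens within `window` positions of index i, excluding position i."""
    lo = max(0, i - window)
    hi = min(len(tokens), i + window + 1)
    return [tokens[j] for j in range(lo, hi) if j != i]


def skipgram_pairs(tokens, window=1):
    # Skip-gram: predict each context word from the center word
    return [(tokens[i], c)
            for i in range(len(tokens))
            for c in context_window(tokens, i, window)]


def cbow_pairs(tokens, window=1):
    # CBOW: predict the center word from all of its context words
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]


tokens = ["the", "cat", "sat"]
print(skipgram_pairs(tokens))
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
print(cbow_pairs(tokens))
# [(['cat'], 'the'), (['the', 'sat'], 'cat'), (['cat'], 'sat')]
```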

LSTM output

This post has two pop quizzes relating to the output of LSTM.

Stacked LSTM

This post creates a Stacked LSTM and learns a simple pattern in the sequence.

Sentiment Analysis via LSTM

I was puzzled by the way LSTMs were used to do sentiment analysis. Finally, the book by Antonio Gulli helped me understand the mechanics of the LSTM.

Train a Simple RNN to track a shift sampled from a normal distribution

In this article, I will explain the way you can code a simple RNN that tracks a simple shift in the pattern, i.e. a value drawn from a normal distribution.

As compared to previous implementations, where we had used OutputProjectionWrapper, this code does away with that component and does the projection more efficiently.

Create Training and Validation Data

import numpy as np
from sklearn.model_selection import train_test_split
import tensorflow as tf  # TensorFlow 1.x API (tf.placeholder, tf.contrib)

input_seed = 1234
time_steps = 24
n_samples = 100000
np.random.seed(input_seed)  # seed before generating any random data

X = np.random.randint(1,30,n_samples*time_steps).reshape(n_samples, time_steps)
# Target: each sequence shifted by one draw from N(3, 1)
Y = np.apply_along_axis(lambda x : x + np.random.normal(3, 1, 1),1,X)
# Add a feature axis: (n_samples, time_steps, 1)
X = X.reshape(X.shape[0],X.shape[1],1)
Y = Y.reshape(Y.shape[0],Y.shape[1],1)

idx     = np.arange(len(X))
np.random.shuffle(idx)
X, Y    = X[idx,:,:], Y[idx,:,:]
X_train, X_valid, Y_train, Y_valid = train_test_split(X,Y,test_size=0.25, random_state = input_seed)
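
One detail worth spelling out: np.apply_along_axis with axis 1 calls the lambda once per row, so every time step of a sequence gets the same random shift. A vectorized sketch of the same idea, on a smaller sample and with the newer `default_rng` generator (both are my assumptions, not part of the original code), looks like this:

```python
import numpy as np

rng = np.random.default_rng(1234)
n_samples, time_steps = 1000, 24  # smaller than the post's 100000, for a quick check

X = rng.integers(1, 30, size=(n_samples, time_steps))
# One shift per sequence, drawn from N(3, 1); broadcasting applies it to every time step
shift = rng.normal(loc=3.0, scale=1.0, size=(n_samples, 1))
Y = X + shift

# Within a row, the difference between target and input is constant
diff = Y - X
print(np.allclose(diff, diff[:, :1]))  # True

# Add the trailing feature axis expected by the RNN: (n_samples, time_steps, 1)
X = X[..., np.newaxis]
Y = Y[..., np.newaxis]
print(X.shape, Y.shape)  # (1000, 24, 1) (1000, 24, 1)
```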

Set up the RNN Model in TensorFlow

tf.reset_default_graph()
hidden_units  = 32
tf_X          = tf.placeholder(tf.float32, shape=[None, time_steps, 1])
tf_Y          = tf.placeholder(tf.float32, shape=[None, time_steps, 1])
rnn_cell      = tf.contrib.rnn.BasicRNNCell(num_units= hidden_units, activation=tf.nn.relu)
outputs, states  =tf.nn.dynamic_rnn(rnn_cell, inputs = tf_X,
                                   dtype=tf.float32)

stacked_outputs = tf.reshape(outputs,[-1,hidden_units])
stacked_outputs = tf.contrib.layers.fully_connected(stacked_outputs, 1,activation_fn=None)
outputs = tf.reshape(stacked_outputs,[-1,time_steps,1])
loss           = tf.square(outputs - tf_Y)
total_loss     = tf.reduce_mean(loss)
learning_rate  = 0.001
optimizer      = tf.train.AdamOptimizer(learning_rate= learning_rate).minimize(loss=total_loss)
batch_size =1000
n_batches = int(X_train.shape[0]/batch_size)
epochs = 20
i      = 0
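
The reshape, fully connected layer, reshape sequence in the model above replaces OutputProjectionWrapper by applying one shared linear layer to all time steps at once. The NumPy sketch below checks that this is equivalent to projecting each time step separately; the random arrays stand in for the RNN outputs and the layer weights and are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, time_steps, hidden_units = 4, 24, 32

rnn_outputs = rng.normal(size=(batch, time_steps, hidden_units))  # stand-in for dynamic_rnn outputs
W = rng.normal(size=(hidden_units, 1))                            # stand-in for the dense layer weights
b = rng.normal(size=(1,))

# Stack all time steps of all sequences into one tall matrix, project once, unstack
stacked = rnn_outputs.reshape(-1, hidden_units)           # (batch * time_steps, hidden_units)
projected = (stacked @ W + b).reshape(-1, time_steps, 1)  # (batch, time_steps, 1)

# Reference: apply the same layer to each time step separately
per_step = np.stack([rnn_outputs[:, t, :] @ W + b for t in range(time_steps)], axis=1)

print(projected.shape)                   # (4, 24, 1)
print(np.allclose(projected, per_step))  # True
```

A single matrix product over the stacked outputs avoids wrapping the cell, which is the efficiency gain the text refers to.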

Train the Model

with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())
  for e in range(epochs):
      idx     = np.arange(len(X_train))
      np.random.shuffle(idx)
      X_train, Y_train    = X_train[idx,:,:], Y_train[idx]
      for i in range(n_batches):
          x  = X_train[(i*batch_size):((i+1)*batch_size),:,:]
          y  = Y_train[(i*batch_size):((i+1)*batch_size),:,:]
          _, curr_loss = sess.run([optimizer, total_loss],
                                 feed_dict={tf_X:x, tf_Y:y})
      loss_val,output_val = sess.run([total_loss,outputs], feed_dict={tf_X:X_valid, tf_Y:Y_valid})
      print("Epoch:",str(e), " Loss:", loss_val)

The output from testing validation data is

Train a Simple RNN to track a simple shift

In this article, I will explain the way you can code a simple RNN that tracks a simple shift in the pattern

Create Training and Validation Data

import numpy as np
from sklearn.model_selection import train_test_split
import tensorflow as tf  # TensorFlow 1.x API (tf.placeholder, tf.contrib)

input_seed = 1234
time_steps = 24
n_samples = 100000
np.random.seed(input_seed)  # seed before generating any random data

X = np.random.randint(1,30,n_samples*time_steps).reshape(n_samples, time_steps)
# Target: the input shifted by a constant 10 at every time step
Y = X + 10
# Add a feature axis: (n_samples, time_steps, 1)
X = X.reshape(X.shape[0],X.shape[1],1)
Y = Y.reshape(Y.shape[0],Y.shape[1],1)

idx     = np.arange(len(X))
np.random.shuffle(idx)
X, Y    = X[idx,:,:], Y[idx,:,:]
X_train, X_valid, Y_train, Y_valid = train_test_split(X,Y,test_size=0.25, random_state = input_seed)

Set up the RNN Model in TensorFlow

tf.reset_default_graph()
hidden_units  = 32
tf_X          = tf.placeholder(tf.float32, shape=[None, time_steps, 1])
tf_Y          = tf.placeholder(tf.float32, shape=[None, time_steps, 1])
rnn_cell      = tf.contrib.rnn.OutputProjectionWrapper(tf.contrib.rnn.BasicRNNCell(
                                                        num_units= hidden_units,
                                                        activation=tf.nn.relu),
                                                    output_size = 1)
outputs, states  =tf.nn.dynamic_rnn(rnn_cell, inputs = tf_X,
                                   dtype=tf.float32)

loss           = tf.square(outputs - tf_Y)
total_loss     = tf.reduce_mean(loss)
learning_rate  = 0.001
optimizer      = tf.train.AdamOptimizer(learning_rate= learning_rate).minimize(loss=total_loss)
batch_size =1000
n_batches = int(X_train.shape[0]/batch_size)
epochs = 20
i      = 0

Train the Model

with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())
  for e in range(epochs):
      idx     = np.arange(len(X_train))
      np.random.shuffle(idx)
      X_train, Y_train    = X_train[idx,:,:], Y_train[idx]
      for i in range(n_batches):
          x  = X_train[(i*batch_size):((i+1)*batch_size),:,:]
          y  = Y_train[(i*batch_size):((i+1)*batch_size),:,:]
          _, curr_loss = sess.run([optimizer, total_loss],
                                 feed_dict={tf_X:x, tf_Y:y})
          #print("Epoch:",str(e), "Batch:",str(i), "loss:",str(curr_loss))
      loss_val,output_val = sess.run([total_loss,outputs], feed_dict={tf_X:X_valid, tf_Y:Y_valid})
      print("Epoch:",str(e), " Loss:", loss_val)

The output from testing validation data is

Train a Simple RNN to track cumulative sums

In this article, I will explain the way you can code a simple RNN that tracks cumulative sums of a sequence

Create Training and Validation Data

import numpy as np
from sklearn.model_selection import train_test_split
import tensorflow as tf  # TensorFlow 1.x API (tf.placeholder, tf.contrib)

input_seed = 1234
time_steps = 24
n_samples = 100000
np.random.seed(input_seed)  # seed before generating any random data

X = np.random.randint(1,30,n_samples*time_steps).reshape(n_samples, time_steps)
# Target: running (cumulative) sum along each sequence
Y = np.cumsum(X, axis=1)
# Add a feature axis: (n_samples, time_steps, 1)
X = X.reshape(X.shape[0],X.shape[1],1)
Y = Y.reshape(Y.shape[0],Y.shape[1],1)

idx     = np.arange(len(X))
np.random.shuffle(idx)
X, Y    = X[idx,:,:], Y[idx,:,:]
X_train, X_valid, Y_train, Y_valid = train_test_split(X,Y,test_size=0.25, random_state = input_seed)

Set up the RNN Model in TensorFlow

tf.reset_default_graph()
hidden_units  = 32
tf_X          = tf.placeholder(tf.float32, shape=[None, time_steps, 1])
tf_Y          = tf.placeholder(tf.float32, shape=[None, time_steps, 1])
rnn_cell      = tf.contrib.rnn.OutputProjectionWrapper(tf.contrib.rnn.BasicRNNCell(
                                                        num_units= hidden_units,
                                                        activation=tf.nn.relu),
                                                    output_size = 1)
outputs, states  =tf.nn.dynamic_rnn(rnn_cell, inputs = tf_X,
                                   dtype=tf.float32)

loss           = tf.square(outputs - tf_Y)
total_loss     = tf.reduce_mean(loss)
learning_rate  = 0.001
optimizer      = tf.train.AdamOptimizer(learning_rate= learning_rate).minimize(loss=total_loss)
batch_size =1000
n_batches = int(X_train.shape[0]/batch_size)
epochs = 20
i      = 0

Train the Model

with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())
  for e in range(epochs):
      idx     = np.arange(len(X_train))
      np.random.shuffle(idx)
      X_train, Y_train    = X_train[idx,:,:], Y_train[idx]
      for i in range(n_batches):
          x  = X_train[(i*batch_size):((i+1)*batch_size),:,:]
          y  = Y_train[(i*batch_size):((i+1)*batch_size),:,:]
          _, curr_loss = sess.run([optimizer, total_loss],
                                 feed_dict={tf_X:x, tf_Y:y})
          #print("Epoch:",str(e), "Batch:",str(i), "loss:",str(curr_loss))
      loss_val = sess.run(total_loss, feed_dict={tf_X:X_valid, tf_Y:Y_valid})
      print("Epoch:",str(e), " Loss:", loss_val)

The output from testing validation data is

Train an XOR using Simple RNN

I have been struggling to implement the parity of a sequence for many weeks. Finally, after going through Geron’s book, I am now able to successfully implement an algo that learns the parity of a sequence. This does not use the classification tweak that many apply to solve the parity problem.

Create Training and Validation Data

import numpy as np
from sklearn.model_selection import train_test_split
import tensorflow as tf  # TensorFlow 1.x API (tf.placeholder, tf.contrib)

input_seed  = 1234
time_steps  = 50
N           = 20
sample_size = 100000
np.random.seed(input_seed)

# Every integer below 2**N, written as a zero-padded string of 50 binary digits
X = ['{0:050b}'.format(i) for i in range(2**N)]
X = np.asarray([[int(j) for j in i] for i in X])
# Target: running parity of the bits seen so far
Y = np.cumsum(X, axis=1) % 2
# Add a feature axis: (n_sequences, time_steps, 1)
X = X.reshape(X.shape[0],X.shape[1],1)
Y = Y.reshape(Y.shape[0],Y.shape[1],1)

# Shuffle, then keep a random subset of sample_size sequences
idx     = np.arange(len(X))
np.random.shuffle(idx)
X, Y    = X[idx,:,:], Y[idx,:,:]
X, Y    = X[:sample_size,:,:], Y[:sample_size,:,:]
X_train, X_valid, Y_train, Y_valid = train_test_split(X,Y,test_size=0.25, random_state = input_seed)
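
The target in this post is the running parity of each bit sequence, computed as a cumulative sum modulo 2. A small sketch (with a made-up five-bit example) shows that this is the same thing as a running XOR, and what the zero-padded binary format string produces:

```python
import numpy as np

bits = np.array([1, 0, 1, 1, 0])

# Running parity: 1 when an odd number of ones has been seen so far
parity_cumsum = np.cumsum(bits) % 2
# The same sequence computed as a cumulative XOR
parity_xor = np.bitwise_xor.accumulate(bits)

print(parity_cumsum)  # [1 1 0 1 1]
print(parity_xor)     # [1 1 0 1 1]

# Zero-padded binary formatting, here with width 8 instead of 50
print('{0:08b}'.format(5))  # 00000101
```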

Set up the RNN Model in TensorFlow

tf.reset_default_graph()

hidden_units  = 32


tf_X          = tf.placeholder(tf.float32, shape=[None, time_steps, 1])
tf_Y          = tf.placeholder(tf.float32, shape=[None, time_steps, 1])
rnn_cell      = tf.contrib.rnn.BasicRNNCell(num_units= hidden_units, activation=tf.nn.relu)
outputs, states  =tf.nn.dynamic_rnn(rnn_cell, inputs = tf_X,
                                   dtype=tf.float32)

stacked_outputs = tf.reshape(outputs,[-1,hidden_units])
stacked_outputs = tf.contrib.layers.fully_connected(stacked_outputs, 1,activation_fn=None)
outputs = tf.reshape(stacked_outputs,[-1,time_steps,1])

loss           = tf.square(outputs - tf_Y)
total_loss     = tf.reduce_mean(loss)
learning_rate  = 0.001
accuracy        = tf.reduce_mean(tf.cast(tf.equal(tf.round(outputs), tf_Y),tf.float32))  # float cast so the mean is a fraction
optimizer      = tf.train.AdamOptimizer(learning_rate= learning_rate).minimize(loss=total_loss)
batch_size =1000
n_batches = int(X_train.shape[0]/batch_size)
epochs = 20
i      = 0

Train the Model

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for e in range(epochs):
        idx     = np.arange(len(X_train))
        np.random.shuffle(idx)
        X_train, Y_train    = X_train[idx,:,:], Y_train[idx]
        for i in range(n_batches):
            x  = X_train[(i*batch_size):((i+1)*batch_size),:,:]
            y  = Y_train[(i*batch_size):((i+1)*batch_size),:,:]
            _, curr_loss = sess.run([optimizer, total_loss],
                                   feed_dict={tf_X:x, tf_Y:y})
        loss_val,output_val,accuracy_val = sess.run([total_loss,outputs,accuracy], feed_dict={tf_X:X_valid, tf_Y:Y_valid})
        print("Epoch:",str(e), " Loss:", loss_val," Accuracy",accuracy_val)

The output from testing validation data is

Example of why word embeddings matter

The meaning of the word “Bond” depends on the context in which you come across it. In a community newsletter, it could be used in phrases such as “Walk to Bond”, “Run to Bond”, “Meet to Bond”. Here the word “Bond” means bonding.

In the context of financial news, “Bond” would in all probability mean a financial instrument.

If you have a word2vec model, “Bond” is collapsed into a single vector regardless of where it appears, and hence loses all the context.
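
A toy sketch of the point: a static lookup table returns the same vector for “bond” in both sentences, while any representation that mixes in the neighbouring words does not. The 3-dimensional vectors are made up, and the averaging function is only a crude stand-in for a contextual model, not how BERT-style models actually work.

```python
import numpy as np

# Hypothetical static embedding table (one vector per word, as in word2vec)
static_vectors = {
    "walk":  np.array([0.9, 0.1, 0.0]),
    "to":    np.array([0.1, 0.1, 0.1]),
    "bond":  np.array([0.5, 0.5, 0.2]),
    "yield": np.array([0.0, 0.9, 0.3]),
    "rises": np.array([0.1, 0.8, 0.1]),
}


def static_vector(sentence, word):
    # The sentence is ignored: the lookup depends on the word alone
    return static_vectors[word]


def toy_contextual_vector(sentence, word):
    # Crude stand-in for context: blend the word's vector with the mean of its neighbours
    others = [static_vectors[w] for w in sentence if w != word]
    return 0.5 * static_vectors[word] + 0.5 * np.mean(others, axis=0)


newsletter = ["walk", "to", "bond"]
finance = ["bond", "yield", "rises"]

same_static = np.array_equal(static_vector(newsletter, "bond"),
                             static_vector(finance, "bond"))
same_contextual = np.allclose(toy_contextual_vector(newsletter, "bond"),
                              toy_contextual_vector(finance, "bond"))
print(same_static)      # True
print(same_contextual)  # False
```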

Transfer Learning

The following are the learnings from the podcast:

  • Transfer learning entails reusing existing models: use a model that comes from training on a different task
  • Value delivery through custom feature engineering is not required. Most of the recent successes are in the field of computer vision
    • If you do not have a lot of training data, then you can use a model that is already trained on a large image dataset (ImageNet)
    • Once the pre-trained model is in place, additional layers can be added on top of it
  • Source and target dataset in language modeling are the same - next word, neighboring words etc.
  • How do you determine whether the target is reasonable? Domain adaptation - the task remains the same, but the source and target domains are different
    • Transfer from different sentiment categories
    • Create a similarity metric and then check whether the tasks are similar
    • If the tasks are similar, then one can apply transfer learning
  • At a practitioner level, leverage the information from a different domain
  • Whether to update the pre-trained weights or keep them frozen - to adapt a model to a lot of different tasks, freeze the model and then add several layers on top of it
  • Is there an ImageNet moment around the corner?
  • Is it apparent that we have reached the ImageNet moment?
    • Directly fine-tune the model or use the features of the pre-trained model
  • Plethora of pre-trained models
  • XLNet
  • Domain expertise - Word from pre-trained models
  • Leverage the labels of the existing data
  • Leverage the data
  • Image recognition people vs NLP people
    • Language is more challenging
  • Deal with different languages
    • Learn to a lot more information
    • Societal context - Needs to work with data
    • Particular parts of the image
  • How different images relate to each other ?
  • Unlabeled data - We have the ability to get pre-trained information
    • Hopefully rely on fewer labels
  • Training cross-lingual models
  • Universal embedding space
    • One of the conceptually simpler approaches
    • Map all the words into a constant embedding space
    • Train the model on joint features
    • Mapping is easier if there is a common language
  • Scaling to distant languages is important
  • Powerful source dataset is needed
  • Difficulty of the target task matters - for reasonably good binary / multi-class classification, 50 to 200 examples are good enough
  • Tasks that are more complex require more training examples
  • OpenAI - used tldr
  • Transfer learning is useful for many types of tasks
  • Applying to the tasks and then use for your own tasks is pretty easy
  • Larger tasks - Fine tune in a couple of hours
  • More methodological developments are needed
  • Can generate datasets
  • Improving models and Improving techniques
  • Long term dependencies are still difficult to put in place.
  • BERT tries to solve this problem by giving a large window for capturing the context
  • Short term contextual information
  • Exploring other architectures + Exploring challenging datasets
  • Near term - scaling up large training models for more performance. At least a couple of larger canonical models
  • Making the models smaller. Want to enlarge models to get most of the benefits. Don’t want to deal.
  • A lot of NLP datasets are available
  • Developing new datasets is very useful for understanding the shortcomings of the model
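
The "freeze the model and add layers on top" idea from the notes above can be sketched without any deep learning library. In this toy version a fixed random projection plays the role of the pre-trained model and a logistic-regression head is the only part that is trained; the data, sizes and learning rate are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pre-trained feature extractor: a fixed projection followed by ReLU.
# In real transfer learning these weights would come from training on a source task.
W_base = rng.normal(size=(10, 16))
W_base_before = W_base.copy()


def base_features(x):
    return np.maximum(x @ W_base, 0.0)


# Small labelled target dataset (synthetic)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# Trainable head: logistic regression on top of the frozen features
w_head = np.zeros(16)
b_head = 0.0


def predict_proba(x):
    return 1.0 / (1.0 + np.exp(-(base_features(x) @ w_head + b_head)))


def log_loss(p, target):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -np.mean(target * np.log(p) + (1 - target) * np.log(1 - p))


feats = base_features(X)  # computed once, since the base never changes
initial_loss = log_loss(predict_proba(X), y)

learning_rate = 0.01
for _ in range(500):
    p = predict_proba(X)
    # Gradients flow only into the head; W_base is never updated
    w_head -= learning_rate * feats.T @ (p - y) / len(y)
    b_head -= learning_rate * np.mean(p - y)

final_loss = log_loss(predict_proba(X), y)
print(final_loss < initial_loss)                # True
print(np.array_equal(W_base, W_base_before))    # True
```

Fine-tuning would differ only in also updating `W_base`, usually with a smaller learning rate.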

I need to work on the basics of NNs and then move on to transfer learning.

Mapping Dialects with Twitter Data

The following are the learnings from the podcast:

  • Bruno Gonçalves, who is now working at JPMorgan Chase, has a PhD from Emory University
  • He has done some interesting work looking at all Twitter data for geography-based patterns
    • Can one draw a map based on language patterns?
  • 10 TB of Twitter data
  • Create a huge matrix of latitude and longitude
  • Words and geolocation matrix pattern matching
  • PCA + K-means clustering based on the patterns in the high-dimensional matrix that combines word embeddings and geolocation
  • Mobile phones have made marrying the two datasets possible
  • The evolution of language across time can also be studied
  • A ton of people are working on emojis in Twitter feeds
  • A ton of stuff can be done based on Reuters News and NLP based work
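
The PCA + K-means step can be sketched on synthetic data. Here each row is a geographic cell and each column a word, with two made-up "dialect regions" that prefer different words; the matrix sizes and rates are assumptions, and the actual study worked with a far larger matrix built from tweets.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
cells_per_region, n_words = 10, 6

# Synthetic word counts per geographic cell: region A favours the first
# three words, region B the last three
rates_a = np.array([20, 20, 20, 1, 1, 1])
rates_b = rates_a[::-1]
counts = np.vstack([
    rng.poisson(rates_a, size=(cells_per_region, n_words)),
    rng.poisson(rates_b, size=(cells_per_region, n_words)),
])

# Normalise to relative word frequencies so busy cells do not dominate
freqs = counts / counts.sum(axis=1, keepdims=True)

# Reduce dimensionality, then cluster the cells
components = PCA(n_components=2).fit_transform(freqs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(components)

print(labels)  # the first ten cells share one label, the last ten the other
```

Plotting each cell's label at its latitude and longitude would then give the kind of dialect map described in the podcast.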

The Transformer

The following are the learnings from the podcast

  • The word “bank” has different meanings in different contexts. It could be a river bank or a financial institution
  • The Transformer is an encoder-decoder architecture that makes word embeddings more robust to the context
  • It is a modern NLP technique
  • Attention Is All You Need - A paper that has revolutionized this space

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.

Named Entity Recognition

Kyle Polich discusses NER in this podcast. My learnings are

  • What is an entity in an unstructured dataset? It depends on the context and the task that the ML algo is trying to accomplish
  • spaCy is a Python package that can do NER
  • NER is used in chatbot applications and semantic search applications
  • A lot of NER packages are good but not great
  • Market research - parse the brands that were mentioned
  • Wikipedia has a lot of markup, which makes NER easy
  • NER is also called entity identification, entity chunking or entity extraction
  • spaCy features
    • Topic tagging
    • Tokenization
    • POS tagging
    • Text classification
    • NER
  • You can build your own NER on top of spaCy

The Death of a Language

Kyle interviews Zane and Leena about the Endangered Languages Project. My learnings are

  • The project is taking in 3.5 hours of audio content from an endangered language called “Ladin”
  • It creates phonetic transcriptions from audio samples of human languages
  • The model has so far produced decent results in vowel identification
  • The team is currently working on phoneme segmentation and larger consonant categories
  • From the project blurb:

In this project, we are trying to speed up the process of language documentation by building a model that produces phonetic transcriptions from audio samples of human languages. The ultimate goal of our project is to develop a model that could be applied to any human language with minimal changes. We will be using around 3-4 hours of partially labeled audio data in an endangered language called Ladin, which we are using as our main training/test data. As of now we have produced some decent results in vowel identifications and are currently working on phoneme segmentation and identification of larger consonant categories.