Sequence to Sequence Models

Kyle Polich discusses sequence to sequence models. The following are the points from the podcast:

  • Many ML approaches are restricted to fixed-size inputs and fixed-size outputs
  • Natural language does not come with fixed-length input and output. Summarizing a paper or translating across languages involves variable-length input and output
  • What a word means depends on the context.
  • There is an internal state representation that the algo is learning
  • The encoder/decoder architecture has obvious promise for machine translation, and has been successfully applied this way. Encoding an input to a small number of hidden nodes which can effectively be decoded to a matching string requires machine learning to learn an efficient representation of the essence of the strings.
  • The model figures out the best way to encode the words
  • Measurement of seq2seq is based on BLEU.
  • Compare the translation to various human translations for the same sentence
  • Input - Embedding Layer - LSTM layer - Dense layer - maps to the word
  • Image captioning can also be done via sequence to sequence models
  • Relevant link on seq2seq
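
To make the BLEU bullet a little more concrete, here is a minimal sketch of the clipped (modified) n-gram precision that BLEU is built on, comparing one candidate translation against several human references. The sentences are made up for illustration, and full BLEU also combines several n-gram orders and applies a brevity penalty, which this sketch leaves out.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate against several references."""
    cand_counts = Counter(ngrams(candidate.split(), n))
    if not cand_counts:
        return 0.0
    # For each n-gram, the most times it appears in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref.split(), n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    # A candidate n-gram only gets credit up to that reference maximum.
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


# Made-up example sentences (not from the podcast).
references = ["the cat is on the mat", "there is a cat on the mat"]
candidate = "the cat sat on the mat"
p1 = modified_precision(candidate, references, 1)  # unigram precision
p2 = modified_precision(candidate, references, 2)  # bigram precision
```

Having more than one reference matters because an n-gram counts as correct if any of the human translations contains it.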

Simultaneous Translation

Kyle Polich talks with Liang Huang about his work at Baidu on simultaneous translation. The following are the points covered in the podcast:

  • Most of the advertised cross-language translation products, such as Skype, do not do simultaneous translation. They wait for the speaker to finish and then translate, which is consecutive translation
  • Simultaneous translation trades off between accuracy and latency
  • You cannot wait too long before producing the translation
  • Prefix-to-Prefix method of translating
  • What’s the dataset used ?
  • What’s the accuracy measure - BLEU
  • Different from general translation - There is a time pressure
  • Each person can sustain only 10 minutes of simultaneous translation
  • The main challenge is word order. German and Japanese are Subject-Object-Verb. Languages have a weird mix of SOV, SVO etc. and that’s a challenge
  • Input and target side are generated incrementally
  • Variation of seq2seq model - Very easy to code prefix-to-prefix model
  • 2 million sentence pairs - Chinese pairs
  • We use the BLEU score for translation quality - String level similarity between human translation and machine translation
  • The higher the BLEU score, the better
  • For one given Chinese sentence, there can be tons of valid English translations
  • This is unlike a classification task, where there is a unique correct answer
  • Ideal situation - Using simultaneous translation to date a foreigner
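
The podcast notes above only name the prefix-to-prefix idea, so the following is a rough sketch under my own assumptions: the translator stays a fixed number of words `k` behind the speaker, reading source words and writing target words incrementally. The function name and the fixed-lag policy are hypothetical choices for illustration, not a description of the Baidu system.

```python
def prefix_to_prefix_schedule(source_len, target_len, k):
    """Interleave READ (source) and WRITE (target) actions with a lag of k words.

    Assumption: before writing target word j, the first min(j + k, source_len)
    source words must have been read.
    """
    actions = []
    read = 0
    for j in range(target_len):
        needed = min(j + k, source_len)
        while read < needed:
            actions.append(("READ", read))
            read += 1
        # Target word j is produced from the source prefix read so far.
        actions.append(("WRITE", j))
    return actions


schedule = prefix_to_prefix_schedule(source_len=4, target_len=4, k=2)
```

A small `k` lowers latency but forces the model to guess more (for example, a verb that arrives late in an SOV sentence), which is the accuracy versus latency trade-off mentioned above.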

Human vs Machine Transcription

Kyle Polich talks with Andreas Stolcke about a paper that compares human and machine transcription. The following are the highlights of the paper:

  • The dataset used was Switchboard, which contains voice recordings of individuals talking on carefully chosen topics. These recordings were then transcribed into sentences, which served as a labeled dataset for machine learning algorithms
  • The researchers found that the human error rate was 5% and that the neural network achieved a comparable error rate.
  • The errors made by computers and humans were similar for soft words such as and, him, her etc
  • For a certain speaker, there was a correlation between error rate by humans and machines
  • Computers had a tough time transcribing filler words such as ahem, aa etc
  • If there is a conference where people are talking simultaneously, transcription is still a difficult problem
  • Real time transcription is still an active area of research
  • Skype does not do real time but near real time transcription. A talks, Skype transcribes, then B reads. B talks, Skype transcribes and A reads.
  • Chinese voice to English sentence translation is another active research area
  • The accents that humans found difficult to transcribe were the same accents that computers found difficult
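
The notes above only say "error rate". Assuming this refers to the usual word error rate (word-level edit distance divided by the length of the reference transcript), a minimal sketch looks like this. The example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between transcripts, divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # reference word deleted
                            curr[j - 1] + 1,      # extra word inserted
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: "her" is dropped and "and" is misheard as "an".
wer = word_error_rate("i gave her the book and left", "i gave the book an left")
```

Here two of the seven reference words are wrong, both of them short function words of the kind the paper found troublesome for humans and machines alike.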

Data Skeptic - Word Embeddings Lower Bound

The following are the learnings from a Data Skeptic podcast interview with Kevin Patel:

  • Word embedding dimension of 300 is mostly chosen based on intuition
  • When a telecom company wanted to analyze the sentiment of SMS messages, they were challenged by the huge 300 dim representation of words. They wanted a lower dimensional representation - more like a 10 dim space. This was a problem as most of the available embeddings were at least 100 to 300 dim
  • To date there has not been any scientific investigation into this hyperparameter choice
  • Kevin Patel and his team investigated this hyperparameter on the Brown corpus and found that a dimension of 19 was enough to efficiently represent the word vectors in the Brown corpus
  • The team borrowed concepts from algebraic topology. If there are four points that are pairwise equidistant, then one needs a three dimensional representation of the points in the form of a tetrahedron. If you project into a 2D space, there will be a loss of information
  • Bank can have two meanings - Money in the Bank, Boat landed on a river bank. In the traditional word2vec representations, bank will be collapsed into one vector based on the training corpus.
  • How many dimensions should be chosen for the representation ? The team investigated the lower bounds of the hyperparameter
  • The team is now working on the upper bounds
  • Intrinsic set of evaluations
    • Word pair similarity task(labeled by humans)
    • Word analogy
    • Word categorization into some buckets and then verify the clustering
  • Extrinsic set of evaluations - Look at the downstream application of embedding layer and then measure the performance
  • Regular Tetrahedron requires 3 dim. If you move to 2 dim, one has to break the equidistance points
  • Pair Wise equi similar points - What is the minimum dimension of vector space?
  • New bounds for equiangular lines - This paper has been used as a guide for evaluating word embeddings
  • Train Brown corpus for embedding dimensions of 1, 2, 3, …, 19, …
  • Dimension for Brown corpus is about 19
  • The team also plans to work on
    • Interpretable word embeddings - Is there any interpretation to these individual dimensions?
    • How can one empirically validate their lower bound result on bigger datasets?
    • How do you use persistent homology to compute the number of neurons needed in the NN?
    • Word embeddings from a machine translation
  • Most of the work involves NP-complete tasks. Hence the team is constrained to give results for toy datasets
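
The tetrahedron argument in the list above can be checked with a few lines of arithmetic. This is only a sketch with coordinates I picked myself: four points that are pairwise equidistant in 3 dimensions lose that property when one coordinate is dropped. It shows a single projection, though it is a standard geometric fact that no arrangement of four points in a plane is pairwise equidistant.

```python
from itertools import combinations


def squared_distances(points):
    """Squared Euclidean distance for every pair of points."""
    return [sum((a - b) ** 2 for a, b in zip(p, q))
            for p, q in combinations(points, 2)]


# Vertices of a regular tetrahedron (coordinates chosen for convenience).
tetrahedron = [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1)]
d3 = squared_distances(tetrahedron)

# Project to 2 dimensions by dropping the third coordinate.
projected = [(x, y) for x, y, _ in tetrahedron]
d2 = squared_distances(projected)
```

In 3 dimensions all six squared distances are equal, while after the projection two of the pairs end up farther apart than the other four. That is the "loss of information" that motivates a lower bound on the embedding dimension.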

I found the podcast very interesting as I have just started to understand the basic ideas of word embeddings and how they are used in NLP.

Index Funds and ETFs

The first three chapters of the book are targeted towards those who want to get a basic understanding of index funds. The first chapter talks about the massive growth of indexing and hence index funds & ETFs. The second chapter walks the reader through the history of various fund structures that came before index funds and ETFs. The third chapter gives a laundry list of entities that have benefited from the rise of index funds.

Machine Learning With Boosting - Summary

The Gradient Boosting algorithm is one of the most powerful algos out there for solving classification and regression problems. This book gives a gentle introduction to the various aspects of the algo without overwhelming the reader with the detailed math. The fact that there are many visuals in the book makes the learning very sticky. Well worth a read for anyone who wants to understand the intuition behind the algo.

A Brief Introduction to Cloud Computing

The following are some of the points mentioned in the book :

  • Cloud is just a building full of computers. There are racks of computers specially built to fit in datacenters
  • A cloud provider rents computers as a service. It is akin to a car rental agency
  • Cloud computing is the new electricity and everyone’s fighting to be the new utility provider of choice
  • Cloud storage came first, with S3 (Simple Storage Service) in 2006
  • Cloud computing began with EC2 (Elastic Compute Cloud) in 2007
  • Software developers have made S3 and EC2 successful
  • Cloud costing - Shift from Capex to Opex
  • Cloud Native - Cloud-centric approach to developing software
  • Netflix - An example of Cloud Native service: Early adopter of EC2 and one of the biggest success stories of Cloud
  • Computers failing in the cloud is the norm. It is expected. It is ok. This changes the way the software is developed
  • Concept of multiple availability zones in a region - Makes the cloud resilient to failure
  • IaaS - Rent infra from a cloud provider
  • PaaS - Rent a platform - Google App Engine, databases, Elasticsearch, Machine Learning Services and a ton of other platforms that are available
  • SaaS - Software as a service
  • Both IaaS and PaaS are services used by developers to build products. No end user will ever use IaaS or PaaS directly. They are exclusively for developers. SaaS is meant for end users to use directly.
  • A VM allows many programs to run on one computer at the same time. VMs are how big, powerful computers are shared.
  • Analogy of VMs and a house owner who puts his room on Airbnb and somehow manages to magically accommodate all the demand - Virtualization explained well
  • The real genius of the cloud: selling previously unused computer capacity
  • Containers evolved after VMs and are even better at sharing computers because they use fewer resources. Since containers use fewer resources than VMs, it’s possible to pack even more applications on a bare-metal computer.
  • Containers are so efficient at letting programs share a computer that two to three times more applications can share a computer using containers compared to VMs.
  • Cloud computing is described as renting computers in the cloud. But in reality what people are renting are slices of computers using VMs. Via containers, you could rent even smaller slices of a computer.
  • Serverless takes renting smaller slices of a computer to the next level. With serverless, instead of running a complete application in a VM or a container, you upload a function into a serverless service and it takes care of everything else.
  • Why is serverless computing the future :
    • You don’t rent computers anymore. You don’t have to worry about VMs or containers. You no longer have to deal with capacity planning, deploying software, installing software, patching software, or elastically scaling in response to demand. You’re free from all that extra overhead. All you do is create and upload functions. You don’t even have to worry about failures like you do with EC2.
  • Coke handles all their vending machine transactions using AWS Lambda
  • Microsoft sells a private cloud product as do RedHat, VMware, and IBM.
  • Netflix story of leveraging AWS to become what it is today
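
To give a feel for "all you do is create and upload functions", here is a minimal sketch of what such a function might look like. The `handler(event, context)` shape follows the convention AWS Lambda uses for Python functions. The event fields (`item`, `price_cents`) are hypothetical and only loosely inspired by the vending machine example above.

```python
import json


def handler(event, context):
    """Hypothetical function that records one vending machine purchase."""
    # The field names below are assumptions made for this example.
    item = event.get("item", "unknown")
    price_cents = int(event.get("price_cents", 0))
    return {
        "statusCode": 200,
        "body": json.dumps({"item": item, "charged_cents": price_cents}),
    }


# Locally the function can be called like any other Python function.
response = handler({"item": "cola", "price_cents": 150}, context=None)
```

There is no server, VM or container in the code itself. Provisioning, scaling and retries are left to the serverless service.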

Make Your Own Neural Network

This book is a great quick read that highlights various aspects of Neural Network components. There are three parts to the book. The first part of the book takes the reader through the various components of a Neural Network. The second part of the book equips the reader to write a simple piece of Python code from scratch (not using any off-the-shelf library) that sets up a neural network, trains the neural network and tests the structure for a certain set of test samples. The third part of the book adds some fun exercises that test a specific neural network.

Neural Networks - A Visual Introduction For Beginners

The author says that there are five things about Neural Networks that any ML enthusiast should know:

  1. Neural Networks are specific : They are always built to solve a specific problem

  2. Neural Networks have three basic parts, i.e. Input Layer, Hidden Layer and Output Layer

  3. Neural Networks are built in two ways

    • Feed Forward : In this type of network, signals travel only one way, from input to output. These types of networks are straightforward and used extensively in pattern recognition

Machine Learning with Random Forests and Decision Trees

The entire book is organized as 20 small bite-sized chapters. Each chapter focuses on one specific thing and explains everything via visuals (as is obvious from the title).

The author starts off by explaining the basic idea of Random Forests, i.e. a collection of decision trees that have been generated via randomization. The randomness comes from the fact that a random subset of the training dataset is used to build each tree and a random set of attributes is used for splitting the data. Before understanding the

Decision Trees and Random Forests : A Visual Introduction For Beginners

This book provides a non-mathy entry point into the world of decision trees and random forests.

Chapter 1 introduces the three kinds of learning algorithms:

  • Supervised Learning Algorithm: The algo feeds on labeled data and the fitted model is then used on unlabeled data

  • Unsupervised Learning Algorithm: The algo feeds on unlabeled data to figure out structure and patterns on its own

  • Semi-Supervised Learning: The algo feeds on labeled and unlabeled data. The algo needs to figure out the structure and patterns of the data. However it gets some help from the labeled data.

The Idiot Brain: Summary

Context

I stumbled on this book on my way to Yangon and devoured the book within a few hours. It took me more time to write the summary than to read the book. The book is 300 pages long and full credit goes to the author for making the book so interesting. In this post, I will attempt to summarize the contents of the book.

This book is a fascinating adventure into all the aspects of the brain that are messy and chaotic. These aspects ultimately result in an illogical display of human tendencies - in the way we think, the way we feel and the way we act. The preface of the book is quite interesting too. The author profoundly apologizes to the reader for a possible scenario, i.e. one in which the reader spends a few hours on the book and realizes that it was a waste of time. Also, the author is pretty certain that some of the claims made in the book will become obsolete very quickly, given the rapid developments happening in the field of brain sciences. The main purpose of the book is to show that the human brain is fallible and imperfect despite the innumerable accomplishments that it has bestowed upon us. The book talks about some of the common everyday experiences we have and ties them back to the structure of our brain and the way it functions.

The Power of Habit : Summary

Most of the decisions that we take or activities that we do, on a daily basis are not a result of deliberate thought. These are the result of habits that we have built over time. We realize some of these decisions/activities as habits but often carry out many activities in auto-pilot mode. This is good as it frees our mind to do other things. The flip side is that we do not seem to be in control of the actions and hence feel powerless.

Incorporating Implied volatility in Portfolio Risk Estimation

The paper titled, “Making Covariance-Based Portfolio Risk Models Sensitive to the Rate at which Markets Reflect New Information”, makes a case against using any GARCH type of volatility estimate for portfolio risk estimation. Instead it suggests a simple way to incorporate changes in the level of implied volatility of the options into the computation of risk of a portfolio comprising underlying securities.

The following is a brief summary of the paper :

Graph Databases : Book Summary

Neo4j’s founder Emil Eifrem shares a bit of history about the way Neo4j was started. Way back in 1999, his team realized that the database that was being used internally had a lot of connections between discrete data elements. Like many successful companies that grow out of a founder’s frustration with the status quo, Neo4j began its life from the founding team’s frustration with a fundamental problem in the design of relational databases. The team started experimenting with various data models centered around graphs. Much to their dismay, they found no readily available graph database to store their connected data. Thus began the team’s journey into building a graph database from scratch. Project Neo was born. What’s behind the name Neo4j? The 4 in Neo4j does not stand for a version number. All the version numbers are appended after the word Neo4j. I found one folksy explanation on Stack Overflow that goes like this,

Learning Cypher : Summary

Cypher is a query language for the Neo4j graph database. The basic model in Neo4j can be described as

  • Each node can have a number of relationships with other nodes

  • Each relationship goes from one node either to another node or to the same node

  • Both nodes and relationships can have properties, and each property has a name and a value
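
The three points above can be mirrored in a few lines of plain Python. This is only an illustrative sketch of the data model, not Neo4j's API or Cypher syntax. The class names, the `rel_type` field and the sample data are my own assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Each property has a name (the dict key) and a value.
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    start: Node
    end: Node  # may be the same node as start
    rel_type: str
    properties: dict = field(default_factory=dict)


alice = Node({"name": "Alice"})
bob = Node({"name": "Bob"})

relationships = [
    Relationship(alice, bob, "KNOWS", {"since": 2010}),
    Relationship(alice, alice, "LIKES"),  # a relationship from a node to itself
]


def neighbours(node, rels):
    """Nodes reachable from `node` by following one outgoing relationship."""
    return [r.end for r in rels if r.start is node]


names = [n.properties["name"] for n in neighbours(alice, relationships)]
```

A query language such as Cypher exists so that traversals like `neighbours` can be written declaratively instead of by hand.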

Cypher was first introduced in Nov 2013 and since then the popularity of graph databases as a category has taken off. The following visual shows the pivotal moment:

Learn Git in a Month of Lunches : Review

Git and GitHub have revolutionized the way one creates, maintains and shares software code. Git is said to be Linus Torvalds's second gift to the world, the first obviously being the Linux operating system. Nowadays it is common for job seekers to showcase their work in the form of several GitHub repositories so that various employers can evaluate the job seeker in a much better way. Open source projects are thriving because of easy to use git based social coding platforms. The popularity of these platforms has grown to such an extent that many non-programmers are using git and GitHub for maintaining version control of their work. Personally I know two nonfiction writers/journalists who use git to maintain their various documents.

The Cartoon Guide to Genetics : Book Review

Books such as these give visual images that are necessary to make learning stick. It is fair to say that I do not remember much about cell biology or anything related to DNA. It was way back in high school that I had crammed something and held it in my working memory for a few years in order to write exams. Some bits would have percolated to my long term memory, but since I have never retrieved them, they lie somewhere in some inaccessible part of my brain.

Quote for the day

A quote often attributed to Gloria Steinem says: “We’ve begun to raise daughters more like sons… but few have the courage to raise our sons more like our daughters.” Maker culture, with its goal to get everyone access to the traditionally male domain of making, has focused on the first. But its success means that it further devalues the traditionally female domain of caregiving, by continuing to enforce the idea that only making things is valuable. Rather, I want to see us recognize the work of the educators, those that analyze and characterize and critique, everyone who fixes things, all the other people who do valuable work with and for others—above all, the caregivers—whose work isn’t about something you can put in a box and sell.

Fail, Fail Again, Fail Better

Ani Pema Chodron, the author of the book, gave a commencement address to the 2014 graduating class of Naropa University in Boulder, Colorado. She did so to keep her promise to her granddaughter, who was among the graduating class. The speech went viral on the net and this book is an offshoot of it. It contains the full text of the speech and a Q&A session. The title of the speech, and hence the book, is inspired by a quote from Samuel Beckett.

The Master Algorithm : Book Summary

This book gives a macro picture of machine learning. In this post, I will briefly summarize the main points of the book. One can think of this post as a meta summary as the book itself is a summary of all the main areas of machine learning.

Prologue

Machine learning is all around us, embedded in technologies and devices that we use in our daily lives. These technologies are so integrated with our lives that we often do not even pause to appreciate their power. Well, whether we appreciate it or not, there are companies harnessing the power of ML and profiting from it. So, the question arises whether we need to care about ML at all. When a technology becomes so pervasive, you need to understand it a bit. You can’t control what you don’t understand. Hence at least from that perspective, having a general overview of the technologies involved matters.

Why Information Grows : Book Review

I stumbled on to this book a few weeks ago and immediately picked it up after a quick browse through the sections of the book. I had promptly placed it in my books-to-read list. I love anything related to information theory mainly because of its inter-disciplinary applications. The principles of information theory are applicable in a wide range of fields. In fact it will be hard to pinpoint a specific area where concepts from information theory have not been applied. In this post, I will summarize the main points of the book.

Quote for the day

Once quiet, we will hear an inner voice proclaim, “You are not enough.” 

So why not just avoid the compost? Why not skirt the stinking heap?

For centuries poets and philosophers, theologians and therapists have taught us that turning over these compost piles is a risk we must undertake, for within these same mounds lies the fertile matter out of which new life arises and is nourished, a cyclic alchemy always at play, just as it is in any garden. This is the necessary work, the means of discovering spirit and self, a call that must be heeded. We do this by going deep. That is our charge.

Silence : Book Review

There is no denying the importance of Silence and Solitude in one’s life. For me, they have always provided an appropriate environment to learn and understand a few things deeply. Drawing from that experience, I strongly feel one should actively seek some amount of “silent time” in one’s life. Is it difficult for a person leading a married life to carve out spaces of silence? Not necessarily. I remember reading a book by Anne D. LeClaire, in which the author writes about her experience of remaining completely silent on the first and third Mondays of every month. Anne explains that this simple practice brought a tremendous amount of calmness to her family life. The family members unknowingly start giving importance to “pauses”, the “pauses” that actually make the sentence meaningful, the “pauses” that make the music enjoyable, the “pauses” that make our lives meaningful. Indeed many have written about the transformative experience of silence. But how many of us consciously seek silence and, more importantly, incorporate it in our daily lives? In the hubbub of our lives and in our over-enthusiasm for acquiring/reaching/grabbing something that is primarily externally gratifying, we often turn our back on “silence” and consequently deny or at least partly deny those experiences that are internally gratifying.

The Golden Ticket : Book Review

The book titled “The Golden Ticket” gives a non-technical introduction to the most important unsolved problem in computer science, “Whether P = NP?”. If you are curious about this problem and do not want to slog through the dense math behind it, then this book might be a perfect introduction to the problem. The author makes sure that the concepts are introduced using plain English and nothing more. By the time you are done with the book, I bet you will start appreciating a few if not all the NP-complete problems that are yet to be cracked. In this post, I will try to summarize briefly the main points of the book.

Has the Chicago Professor cracked P=NP

Via Sarada:

Professor Babai of the University of Chicago recently made an alleged breakthrough in theoretical computer science which concerns the problem of graph isomorphism. In lecture, we have often been told that there are massive networks in biology and social media, to only name two areas, that can not be effectively visualized because of the sheer mass of links and nodes; these networks can grow to be so convoluted that computers occasionally have a difficult time telling if two networks are, in fact, identical or not in an acceptably short time. This is an issue of computational complexity.
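
To see why telling whether two networks are identical gets expensive, here is a naive sketch that simply tries every relabeling of the nodes. It is a brute-force illustration of the graph isomorphism problem and has nothing to do with Babai's method. The function name and the tiny example graphs are made up.

```python
from itertools import permutations


def are_isomorphic(n, edges_a, edges_b):
    """Brute-force check for undirected graphs on nodes 0..n-1."""
    target = {frozenset(e) for e in edges_b}
    if len({frozenset(e) for e in edges_a}) != len(target):
        return False
    # Try every possible relabeling of the nodes: n! candidates.
    for perm in permutations(range(n)):
        mapped = {frozenset((perm[u], perm[v])) for u, v in edges_a}
        if mapped == target:
            return True
    return False


path_a = [(0, 1), (1, 2), (2, 3)]   # path 0-1-2-3
path_b = [(2, 0), (0, 3), (3, 1)]   # the same path with shuffled labels: 2-0-3-1
star = [(0, 1), (0, 2), (0, 3)]     # one hub connected to three leaves

same = are_isomorphic(4, path_a, path_b)
different = are_isomorphic(4, path_a, star)
```

With 4 nodes there are only 24 relabelings to try, but the count grows factorially, which is why this approach is hopeless for networks with many nodes and links.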

Pure vs. Applied Mathematician

Via Wired :

A pure mathematician, when stuck on the problem under study, often decides to narrow the problem further and so avoid the obstruction. An applied mathematician interprets being stuck as an indication that it is time to learn more mathematics and find better tools.

What happens when a $1B valued start-up goes south

Via NYT, “When a Unicorn Start-Up Stumbles, Its Employees Get Hurt”:

This article shows the painful reality of employees who worked at a $1B valued start-up that went south.

Instead of getting paid to work, the employees ended up paying to work for the start-up.

Quote for the day

“You say I am repeating
Something I have said before. I shall say it again.
Shall I say it again? In order to arrive there,
To arrive where you are, to get from where you are not,
You must go by a way wherein there is no ecstasy.
In order to arrive at what you do not know
You must go by a way which is the way of ignorance.
In order to possess what you do not possess
You must go by the way of dispossession.
In order to arrive at what you are not
You must go through the way in which you are not.
And what you do not know is the only thing you know
And what you own is what you do not own
And where you are is where you are not.”

                                   - T.S.Eliot

Statistical Modeling–The two cultures

The paper titled, “Statistical Modeling – The two cultures”, by Leo Breiman (the man behind CART) was published 14 years ago, a time when machine learning techniques had started to become popular. In the last decade or so, the field has exploded. There are many graduate and PhD programs that are offering full-fledged courses on the subject. There are also dedicated journals that publish ML research findings of researchers from a wide range of fields. Larry Wasserman attributes the quick rise of the field to “conference culture”, an environment where an ML researcher is valued based on his/her conference publications. The thought process and the activities that precede a “conference publication” are different from those of the usual “journal” publication and this enables an ML researcher to work on far more interesting problems, usually from a diverse set of areas, as compared to a statistician.

The Pattern on the Stone : Book Review

If one rips apart a computer and looks at its innards, one sees a coalescence of beautiful ideas. The modern computer is built on electronics but the ideas behind its design have nothing to do with electronics. A basic design can be built from valves/water pipes etc. The principles are the essence of what makes computers compute.

This book introduces some of the main ideas of computer science such as Boolean logic, finite-state machines, programming languages, compilers and interpreters, Turing universality, information theory, algorithms and algorithmic complexity, heuristics, uncomputable functions, parallel computing, quantum computing, neural networks, machine learning, and self-organizing systems.

How to Find Fulfilling Work

The author tries to answer two questions via this book :

  1. What are the core elements of fulfilling work?

  2. How should we make a change to our current path so that our work is in line with our being?

Give meaning to work

The author talks about five aspects relevant to work :

  1. earning money

  2. achieving status

  3. making a difference

  4. following our passions

  5. using our talents

The author categorically states that earning money and achieving status might not give meaning to our work. More often than not, one gets on to a “hedonistic” treadmill that is difficult to get off. Instead, pursuing the “making a difference” path is a better option. It might lead an individual to try out a variety of jobs, learn a variety of skills, and become a portfolio worker. It is one of the ways to explore the “many selves” that lie within us. The chapter also explores the various challenges that one might have to face in following any of the above paths. Personally I find the “using your talents” path to be quite appealing. I have taken a path of “developing talents and then doing your bit in making your talents useful to others”. If you forget the monetary and status aspects of any work and immerse yourself in developing talents, I guess you might find that learning a specific skill and putting it to use could be a wonderful experience. There are obvious challenges that one must encounter on a day to day basis. However I guess the “love” element towards your work will give all the strength to face the challenges. The author’s message from this chapter is that there are three core elements of fulfilling work, i.e. meaning, flow and freedom. Each of these elements is concisely discussed via examples and anecdotes. Light reading but there is no fluff here.

Superforecasting: Book Review

In this post, I will attempt to briefly summarize the main points of the book.

An optimistic skeptic

The chapter starts off by saying that there are indeed people in the world who become extremely popular, make tons of money, get all the press coverage, by providing a perfect explanation, after the fact. The author gives one such example of a public figure who rose to fame explaining events post-fact, Tom Friedman. At the same time, there are many people who are lesser known in the public space but have an extraordinary skill at forecasting. The purpose of the book is to explain how these ordinary people come up with reliable forecasts and beat the experts hands down.

Quote for the day

“Do not be too timid and squeamish about your actions. All life is an experiment. The more experiments you make the better. What if they are a little coarse and you may get your coat soiled or torn? What if you do fail, and get fairly rolled in the dirt once or twice? Up again, you shall never be so afraid of a tumble.”

-- Ralph Waldo Emerson

Reproducible Research with R and RStudio : Summary

The book starts by explaining an example project that one can download from the author’s GitHub account. The project files serve as an introduction to reproducible research. I guess it might make sense to download this project, try to follow the instructions and create the relevant files. By compiling the example project, one gets a sense of what one can accomplish by reading through the book.

Introducing Reproducible Research

The highlight of an RR document is that data, analysis and results are all in one document. There is no separation between announcing the results and doing number crunching. The author gives a list of benefits that accrue to any researcher generating RR documents. They are

Compiling RR book

The author of the book titled Reproducible Research with R and RStudio - Second edition, Christopher Gandrud, has made the relevant code available on GitHub. This code can be run to obtain a book that is identical to the one that is available in the stores (priced at $45). I have spent more than an hour trying to run the code as there are many prerequisite packages that need to be installed. Even though the directions are pretty straightforward, I had to make a few tweaks in order to recreate the content of the book.

Dearth of literature

Compared to Western classical music, there is a dearth of literature in Hindustani classical music.

Stephen Slawek, a disciple of Pt. Ravi Shankar, writes in his book, “Sitar Technique in Nibaddh Forms”:

Musicians talk a lot about music but rarely write anything about it, and never attempt to transcribe a performance for the purpose of analysis. The reason is that music tradition is one of improvisation and retains its vitality by means of transient, ephemeral nature of the music that comprises it. 

Minimum Resting Time – A terrible idea

There have been many articles in the media saying that SEBI might introduce a minimum resting time for any order before it could be cancelled. A report from the Santa Fe Institute, written by J. Doyne Farmer and Spyros Skouras, is a detailed write-up on the various aspects one must think about while meddling with the market microstructure. The opinion expressed by the authors is in the context of European markets.

Event History Analysis With R

This book can be used as a companion to a more pedagogical text on survival analysis. For someone looking for the appropriate R command to fit a certain kind of survival model, this book is apt. It gives neither the intuition nor the math behind the various models. It reads like an elaborate help manual for all the R packages related to event history analysis.

I guess one of the reasons for the author writing this book is to highlight his package eha on CRAN. The author’s package is basically a layer on the survival package, with some advanced techniques that I guess only a serious researcher in this field can appreciate. The book takes the reader through the entire gamut of models in a pretty dry format, i.e. it gives the basic form of a model, the R commands to fit the model, and some commentary on how to interpret the output. The difficulty level does not rise steadily from start to end. I found some very intricate material interspersed among some very elementary estimators. An abrupt discussion of Poisson regression breaks the flow in understanding the Cox model and its extensions. The chapter on Cox regression contains detailed and unnecessary discussion of some elementary aspects of any regression framework. Keeping these complaints aside, the book is useful as a quick reference to functions from the survival, coxme, cmprsk and eha packages.

How Students Learn Statistics

This is a nice article that talks about students’ impediments in learning statistics. The article gives a laundry list of recommendations to improve the state of statistical pedagogy in college-level courses.

The basic goals for any instructor should be to impart the following ideas:

  1. The idea of variability of data and summary statistics.

  2. Normal distributions are useful models though they are seldom perfect fits.

  3. The usefulness of sample characteristics (and inference made using these measures) depends critically on how sampling is conducted.
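The first and third ideas can be illustrated with a small simulation. This is only a sketch and is not taken from the article: the population, the sample sizes and the helper name `sample_means` are all assumptions made for illustration.

```python
import random
import statistics


def sample_means(population, sample_size, n_samples, seed=0):
    """Draw repeated simple random samples and return the mean of each one."""
    rng = random.Random(seed)
    return [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(n_samples)
    ]


# Hypothetical population: 10,000 values, roughly normal (mean 50, sd 10).
pop_rng = random.Random(42)
population = [pop_rng.gauss(50, 10) for _ in range(10_000)]

# Idea 1: a summary statistic varies from sample to sample,
# and it varies less when each sample is larger.
spread_small = statistics.stdev(sample_means(population, 10, 500))
spread_large = statistics.stdev(sample_means(population, 100, 500))

# Idea 3: how the sample is drawn matters. A "sample" made of the
# 100 largest values badly overstates the population mean.
population_mean = statistics.mean(population)
biased_mean = statistics.mean(sorted(population)[-100:])

print(f"sd of sample means, n=10:  {spread_small:.2f}")
print(f"sd of sample means, n=100: {spread_large:.2f}")
print(f"population mean: {population_mean:.2f}, biased sample mean: {biased_mean:.2f}")
```

The means of the larger samples cluster more tightly around the population mean, while the non-random sample is far off no matter how many values it contains.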

Mindless Statistics

The paper titled, Mindless Statistics, by Gerd Gigerenzer, makes a case for banishing the mindless “null ritual” from statistics. In this blog post, I will summarize the main points of the paper.

The author starts off by emphasizing the importance of developing a statistical toolbox. Indeed, statistics is a rich subject that can be enjoyed by thinking through a given problem and applying the right kind of tools to get a deeper understanding of the problem. One should approach statistics with a bike mechanic mindset. A bike mechanic is not addicted to one tool. He constantly reshuffles his toolbox by adding new tools, cleaning up old ones and throwing away useless ones. Far from this mindset, the statistics education system imparts formula-oriented thinking to many students. Instead of developing statistical or probabilistic thinking in a student, most courses focus on a few formulae and teach null hypothesis testing.

A Multi-Language Computing Environment for Literate programming and Reproducible research

The paper titled, A Multi-Language Computing Environment for Literate programming and Reproducible research, gives an introduction to org-mode. In order to communicate research work to others, it is often important to mix prose and code in the same document. There are many tools out there that do the job; org-mode is one that is useful for literate programming as well as reproducible research. Be it a research environment or a pedagogical environment, the need for mixing code and prose is always present. This paper talks about org-mode, which is probably one of the most powerful tools for mixing prose with code from many languages.

Active Documents with Org-mode

The paper titled, “Active Documents with Org-mode”, gives a concise introduction to the way org-mode can be used for reproducible research. In one single document, Linux shell commands, Python code and R code are all used for analyzing baseball statistics. Org-mode is used to produce one comprehensive document that details all the various steps in the analysis.

The following visual from the paper gives an overview of the org-mode document:

Should vs. Must

Link: What to Do at the Crossroads of Should and Must?

There are two paths in life: Should and Must. We arrive at this crossroads over and over again. And each time, we get to choose. Should is how others want us to show up in the world — how we’re supposed to think, what we ought to say, what we should or shouldn’t do. When we choose Should the journey is smooth, the risk is small

Martingales in Survival Analysis

The paper titled, History of Application of Martingales in Survival Analysis, provides a nice narrative of the various scientists, mathematicians, events and concepts behind the widespread usage of martingales in survival analysis.

There are two major takeaways from this paper. One is of course the timeline of all the developments in the field of survival analysis. The second is a good intuitive understanding of martingales and martingale stochastic integrals, and of their practical application in deriving the asymptotic properties of many estimators. Survival analysis is one field where theoretical developments and practical applications have gone hand-in-hand. The paper, mirroring the developments in the field, gives a balanced view of the way martingales played a crucial role in all the developments.

Survival Analysis – A Self-Learning Text

As the title suggests, this book is truly a self-learning text. There is minimal math in the book, even though the subject is essentially about estimating functions (survival, hazard, cumulative hazard). I think the highlight of the book is its unique layout. Each page is divided into two parts: the left-hand side of the page runs like a pitch, whereas the right-hand side runs like a commentary on the pitch. Every aspect of estimation and inference is explained in plain, simple English. Obviously one cannot expect to learn the math behind the subject. In any case, I guess the target audience for this book comprises those who would like to understand survival analysis, run the models using some software package and interpret the output. So, in that sense, the book is spot on. The book is 700 pages long, so all said and done, this is not a book that can be read in one or two sittings. Even though the content is easily understood, I think it takes a while to get used to the various terms and assumptions for the whole gamut of models one comes across in survival analysis. Needless to say, this is a beginner’s book. If one has to understand the actual math behind the estimation and inference of various functions, then this book will equip a curious reader with a 10,000 ft. view of the subject, which in turn can be very helpful in motivating oneself to slog through the math.
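To make "estimating a survival function" concrete, here is a minimal sketch of the Kaplan-Meier product-limit estimator in plain Python. It is not taken from the book: the function name `kaplan_meier` and the toy data are assumptions for illustration, and real analyses would normally rely on a statistical package instead.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time where an event was observed.

    times  -- observed durations (time to event or to censoring)
    events -- 1 if the event was observed at that time, 0 if censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e)
    leaving = Counter(times)  # everyone leaves the risk set at their own time
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d:
            # Product-limit step: multiply by the fraction surviving past t.
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        # Subjects with an event or censoring at t are no longer at risk.
        at_risk -= leaving[t]
    return curve


# Toy data (hypothetical): five subjects, two of them censored.
times = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"t={t}: S(t)={s:.2f}")
```

Censored subjects never trigger a drop in the curve, but they still count in the risk set up to the time they were last seen, which is how the estimator uses partial information.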