Via TP - Data Scientist Interviews

  • Parham Aarabi, Founder of ModiFace

  • ModiFace technology simulates skin-care and cosmetics products on user photos. So, a skin-care product that reduces dark spots, or a shiny lipstick, or a glittery eyeshadow … we specialize in making custom simulation effects for all facial products. That is our core focus.

  • Pick problems that in your view truly matter. Too often, we find ourselves pursuing goals that deep down we don’t believe in, and this will only lead to failure or unappreciated success.

  • The impact that mobile apps, especially data processing and intelligent apps, will have on our society.

  • Pete Warden, Founder of Jetpac

  • We use machine learning, neural networks, and a lot of other fancy approaches to analyze the images, but Excel formulas are key too. A lot of people underestimate the usefulness of old-school data tools like spreadsheets.

  • We help you discover fun places to go, both locally and when you’re traveling. We aim to offer the kind of insights you’d get from a knowledgeable local friend about the best bars, hotels and restaurants. The information we get from the mass of pictures, and the pictures we present in the guide, combine to give you a much better idea of what a place is like than any review-based service.

  • Trey Causey, Founder of The Spread

  • I’d say to pick a dataset you know really well and explore it like crazy. It’s really helpful to be able to apply a new method to a dataset and have the ability to assess the face validity of your findings. It’s fun to get counter-intuitive findings, but you should really stop and check your work if somehow you find that Ryan Leaf is actually a better quarterback than Peyton Manning. Examples that use uninteresting data (iris, anyone?) are a lot less likely to result in you going the extra mile to learn more and explore after the lesson is over. I’d also say not to get too discouraged. This stuff is hard and it takes a lot of practice and a lot of willingness to make mistakes and be wrong before you get it right. And, if I had one single piece of advice – take more matrix algebra.

  • Ravi Parikh, Co-Founder of Heap

  • For me that “aha” moment was when I learned about Anscombe’s quartet. It’s a group of four datasets, each of which consists of several (x,y) pairs. Each of these datasets has the same mean of x, mean of y, variance of x, variance of y, x/y correlation, and the same linear regression line. Basically, many of the “standard” summary statistics we might use to characterize these datasets are identical for all four. However, when visualized, each of the four datasets yields a significantly different picture.

  • Ryan Adams - HIPS (Harvard Intelligent Probabilistic Systems)

  • I think machine learning will continue to merge with statistics, as ML researchers come to appreciate the statistical approach to these problems, and statisticians realize that they need to have a greater focus on algorithms and computation.

  • I also think that the area of Bayesian optimization is going to get bigger, as we figure out how to tackle harder and harder problems. People are also beginning to better understand the behavior of approximate inference algorithms, which I expect will become a bigger deal.

  • Go deep. Learn all the math you can. Ignore artificial boundaries between institutions and disciplines. Work with the best people you can. Be wary of training in “data science” where you just learn to use other people’s tools. To innovate, you have to learn how to build this stuff yourself.

  • Kang Zhao – Applying ML to dating

  • I guess ML will develop along two directions. The first would be on the algorithm side – better and more efficient algorithms for big data, as well as machine learning that mimics human intelligence at a deeper level. The second would be on the application side – how to make ML understandable and available to the general public? How to make ML algorithms as easy to use as MS Word and Excel?

  • One must learn how to answer the question: “Now we have the data, what can we do with it?” This is very valuable in the era of big data.

  • Dave Sullivan, Founder and CEO of Blackcloud

  • You’ve got to think through all the ramifications of it, and the more I do, the more I become convinced “data” and what we do with it is going to be as transformative to our society during the next 20 years as the Internet has been in the past 20.

  • Any industry where the accuracy of predictions can make a significant financial impact on the business. For a company like Netflix, increasing the accuracy of movie recommendations by 10% over what they were doing before might not be a huge deal. But for a company involved in any kind of algorithmic trading (be it options, commodities, or comic books), an extra 10% increase in the quality of certain decisions in their pipeline can make a really big difference to their bottom line.

  • Don’t be intimidated about getting into it! The basics aren’t that complicated - with enough banging your head against the wall, anyone can get it. This is a field that is wide open - there is no “theory of relativity” for AI yet, but there probably will be, and I think it’s actually pretty likely that we’ll see that in our lifetimes. It’s a really unique time in history right now, and this is a revolution that pretty much anyone in the world with an Internet connection can take part in. While many aspects of the worldwide economy are really messed up and will continue to be, I don’t think there’s ever been a time where economic mobility has been more decentralized. No matter where you are or who you are, you can take part if you’re smart enough. So yeah, my advice: jump in, before it gets crowded!

  • Laura Hamilton, Founder and CEO of Additive Analytics

  • I like to use Octave, Python, and Vowpal-Wabbit. Sometimes I find it’s helpful to do some initial summary and graphing with Excel. The Additive Analytics web application is built with Ruby on Rails. It sits on top of a Postgres database. For data visualization, I like D3 and DataTables. If I need a quick chart for the Additive Analytics blog, sometimes I will use Infogr.am

  • Words of wisdom

  • Take Dr. Andrew Ng’s Machine Learning course on Coursera.

  • Take Dr. Abu-Mostafa’s Learning from Data course on edX

  • Get as many features as you can. Think about where you can get additional data, and add as many new data sources as you can.

  • Data visualization is as important as the model. You can have the most sophisticated model in the world, but if you don’t present it in a way that’s intuitive to the user it will be useless. All analyses should be actionable.

  • Beware overfitting!

  • Harlan Harris, Co-Founder and current President of Data Community DC (DC2).

  • I almost entirely use R and Julia. I was an initial developer of some of the statistical data representation and manipulation functions for Julia.

  • Get involved in your professional community, whether it’s attending Meetups (and meeting people at the bar afterwards), or answering questions on StackOverflow or CrossValidated, or trying your hand at a Kaggle competition or a hackathon. Learn about the many different points of view of people doing work related to your interests.

  • Abe Gong, Data Scientist at Jawbone

  • A lot of data scientists miss the importance of the infrastructure layer, and that ends up seriously constraining the speed, scope, and quality of their work.

  • I’m a python guy. I love ipython, pandas, scikit-learn, and matplotlib. Probably two-thirds of my workflow revolves around those tools. I used R a lot in grad school, but gave it up as I started working more closely with production systems – it’s just so much easier to debug, ship, and scale python code. For backend systems, I’m agnostic. I tend to use the AWS stack for my own projects, but the right combination of streaming/logging/messaging/query/scheduling/map-reduce/etc. systems really depends on the problem you’re trying to solve. In my opinion, a full-stack data scientist should be comfortable learning the bindings to whatever data systems he/she has to work with. You don’t want to be the carpenter who only knows hammers.

  • First, storytelling: after watching D.J. Patil’s talk about how storytelling is an important skill for data scientists, I put a lot of my spare cycles into reading about, thinking about, and practicing storytelling. I learned to look for story elements in data: plot, characters, scenes, conflict, mood, etc. Often, our first instinct is to reduce data to numbers and hypothesis tests. Looking for the stories in data is another good way to make data meaningful, especially when you want users to get personally involved with the meaning-making.

  • I’ve really enjoyed exploring the craft of storytelling. It’s a tradition at least as old as the scientific method, and sometimes much more powerful: you may be able to persuade individual humans without telling stories, but it is almost impossible to persuade a whole group without good storytelling - stories are the API to human culture change. I’m not sure that this is unique to data science, but it’s definitely worth knowing. If others want to read up on the subject, I highly recommend Story, by Robert McKee, Save the Cat, by Blake Snyder, and Campbell’s classic The Hero with a Thousand Faces - in that order.

  • Kari Hensien, Sr. Director of Product Development at Optimum Energy, and Cameron Turner, Data Scientist at The Data Guild

  • I believe that statistics and creative data science can create answers to some of the world’s toughest problems. Sometimes solutions can be finessed by correlation and analysis, rather than brute force approaches that attempt to answer a question directly.

  • We use correlation/covariance analysis along with regressions to do basic modeling and build out our view of the landscape. We use both supervised and unsupervised learning to build clusters and identify hidden structure in plant performance. We use recursive partitioning to identify custom rules for local set points based on global algorithm development. In terms of favorites: R/RStudio, Python, Java, SQL, Tableau, Hadoop, AWS.
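As a sketch of the unsupervised side of that toolkit, here is a minimal 1-D k-means (Lloyd’s algorithm) in pure Python. The “plant load” readings are synthetic stand-ins, not Optimum Energy data:

```python
import random

random.seed(1)

# Synthetic 1-D "plant load" readings drawn from two operating regimes.
# (Illustrative stand-in data, not real plant telemetry.)
readings = ([random.gauss(20, 2) for _ in range(50)] +
            [random.gauss(80, 2) for _ in range(50)])

def kmeans_1d(data, k=2, iters=20):
    """Plain Lloyd's algorithm on scalars: assign each point to its
    nearest center, then recompute each center as its cluster mean."""
    centers = random.sample(data, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda c: abs(x - centers[c]))
            clusters[nearest].append(x)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

centers = kmeans_1d(readings)
```

On this well-separated data the two recovered centers land near the regime means of 20 and 80 — the same “find the operating modes, then reason about each one” pattern the quote describes.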

  • In fact, energy and related fields will be the first to embrace and extend the concepts of machine learning and the true big data opportunity.

  • Andrej Karpathy, Machine Learning PhD student at Stanford

  • “aha” moment – I think it was a gradual process. After I learned the mathematical formulation of regression and some of the related approaches, I started to realize that many real-world problems reduced to it. Armed with logistic regression (but also, more generally, a Machine Learning mindset) as a hammer, everything started to look like a nail.

  • I’m convinced that the future of Machine Learning looks very bright and that we will see it become ubiquitous. The skill to manipulate/analyze and visualize data is a superpower today, but it will be a necessity tomorrow.

  • I expect we should see a very successful and widely used Machine Learning library for Javascript within a few years.

  • You learn the most by reinventing the wheel. Don’t just read about Machine Learning algorithms and fall into the trap of thinking you understand the concepts because everything you read sounds reasonable. Read it once and then re-implement it from scratch, yourself.
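In that spirit, here is a from-scratch sketch of logistic regression fit by batch gradient descent on a synthetic one-feature problem (the data and hyperparameters are illustrative, not from the interview):

```python
import math
import random

random.seed(0)

# Synthetic 1-D binary classification: the label is 1 exactly when x > 0.
X = [random.uniform(-3, 3) for _ in range(200)]
y = [1 if x > 0 else 0 for x in X]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# One-feature logistic regression fit by batch gradient descent on log-loss.
w, b, lr = 0.0, 0.0, 0.1
for _ in range(2000):
    grad_w = grad_b = 0.0
    for xi, yi in zip(X, y):
        err = sigmoid(w * xi + b) - yi   # dLoss/dLogit for log-loss
        grad_w += err * xi
        grad_b += err
    w -= lr * grad_w / len(X)
    b -= lr * grad_b / len(X)

accuracy = sum((sigmoid(w * xi + b) > 0.5) == (yi == 1)
               for xi, yi in zip(X, y)) / len(X)
```

Writing the gradient out by hand like this — rather than calling `sklearn.linear_model.LogisticRegression` — is exactly the kind of wheel-reinvention the advice is about: you see why the update is `(prediction - label) * feature` instead of trusting a black box.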

  • George Mohler, Chief Scientist at PredPol

  • Some of the models we use at PredPol are self-exciting point processes that were originally developed for modeling earthquake aftershock distributions [Marsan and Lengliné, 2008]. The fact that these point process models fit earthquake and crime event data quite well is, by itself, a cool result.
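A self-exciting (Hawkes) point process can be simulated with Ogata’s thinning algorithm. The sketch below uses an exponential kernel and made-up parameters — it illustrates the model class, not PredPol’s actual implementation:

```python
import math
import random

random.seed(42)

def simulate_hawkes(mu, alpha, beta, t_max):
    """Ogata thinning for a Hawkes process with exponential kernel:
    lambda(t) = mu + sum_i alpha * beta * exp(-beta * (t - t_i)).
    Each event raises the intensity, so events cluster in time --
    the 'aftershock' behavior the quote refers to."""
    events, t = [], 0.0
    while t < t_max:
        # Intensity just after t dominates the (decaying) intensity until
        # the next event, so it is a valid thinning upper bound.
        lam_bar = mu + sum(alpha * beta * math.exp(-beta * (t - ti)) for ti in events)
        t += random.expovariate(lam_bar)
        if t >= t_max:
            break
        lam_t = mu + sum(alpha * beta * math.exp(-beta * (t - ti)) for ti in events)
        if random.random() <= lam_t / lam_bar:   # accept with prob lambda(t)/lam_bar
            events.append(t)
    return events

# Branching ratio alpha = 0.5: on average each event triggers 0.5 offspring,
# so expected count is roughly mu * t_max / (1 - alpha) = 200.
events = simulate_hawkes(mu=0.5, alpha=0.5, beta=1.0, t_max=200.0)
```

Fitting the same intensity to event data (by maximizing the point-process likelihood) is what lets earthquake-style models transfer to crime: a burglary raises the short-term rate of nearby burglaries the way a mainshock raises the aftershock rate.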

  • There are many ways to learn data science on your own. I think Kaggle is a great way to start out, and there are some entry-level competitions that walk you through some of the basics of data science. Coursera has free courses in data science and machine learning.

  • Carl Anderson, Director of Data Science at Warby Parker

  • Getting analysts across the company to knuckle down and learn statistics. With a Ph.D. from a statistics department, I am very biased, but a sound basis in statistics seems to be an essential tool of any analyst. Like many skills, statistics may not feel useful and relevant until a specific project comes along and the analyst needs it — for instance, when we need to A/B test a change in the website or optimize email subject lines and send times.
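For the A/B-testing case, the workhorse is a two-proportion z-test. A minimal sketch with hypothetical conversion numbers (not Warby Parker data):

```python
import math

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates, using the
    pooled standard error under the null of equal rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF via math.erf.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical experiment: 200/5000 conversions on control, 260/5000 on variant.
z, p = two_proportion_ztest(200, 5000, 260, 5000)
```

With these illustrative numbers the lift (4.0% → 5.2%) gives z ≈ 2.9 and p well under 0.05 — the kind of back-of-the-envelope check an analyst needs statistics for before declaring a winning variant.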

  • Just do it. There is no substitute for getting your feet wet and working on things. These days, there is no shortage of data (even the government has finally caught on: https://www.data.gov), free online courses, books, meetups of like-minded individuals and open source projects. Reading a book is one thing, but getting real data and having to deal with missing data, outliers, encoding issues, reformatting — i.e. general data munging, which really can constitute 80% of the time of a data science project — are the kind of dues that you must pay to build the suite of skills to be a data scientist. Kaggle has some great introductory 101 competitions and http://scikit-learn.org/ has some great documentation that can help get you started.