Who Is a Data Scientist

I wrote a short note about data science for my undergraduate alma mater. I decided to save it for posterity since I have not found a great way to discover this page organically. (Perhaps someone on the HSE side should look into this?)

On Difference Between Theory and Practice

John Rauser, an ex-Amazon leader, recently shared a story from his Amazon years. It's a pretty short read, and its critical piece is the attached 1953 paper by a member of the U.S. Atomic Energy Commission, which can be summarized in a single sentence: In theory, there is no difference between theory and practice. But, in practice, there is.

Installing Python on Ubuntu

For reasons similar to those mentioned in my previous post, I had to install Python on bare Ubuntu boxes, be that Amazon EC2 instances or Azure Virtual Machines. While most Ubuntu systems come with some version of Python pre-installed, those versions are frequently old and do not support some of the useful packages from the Python data science ecosystem, such as numpy, pandas and scikit-learn. Having done this a few times, I decided to write down the steps to avoid going through the pain in the future. Before we start, I have to point out that in the last year or so a very nice solution to the problem has emerged in the form of conda. If you can leverage it, you absolutely should, since it will save you a ton of time. The rest of this post assumes that you, for whatever reason, want to install Python from scratch.
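Before installing anything, it can help to see what the pre-installed interpreter already offers. The following is a minimal sketch, not part of my install scripts: it reports the interpreter version and checks whether the three packages named above can be found. The function name missing_packages is made up for illustration.

```python
import importlib.util
import sys

# Packages named in the post; scikit-learn is imported under the name "sklearn".
PACKAGES = ("numpy", "pandas", "sklearn")


def missing_packages(names):
    """Return the names that the import system cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Python version:", ".".join(map(str, sys.version_info[:3])))
    missing = missing_packages(PACKAGES)
    print("Missing packages:", ", ".join(missing) if missing else "none")
```

Note that this only tells you whether a package can be located, not whether its version is recent enough for your needs.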

Installing software on Linux is usually either super-simple or involves massive amounts of pain. This has a lot to do with the Linux software paradigm, which I like to explain via a house construction analogy. A Windows program is like a manufactured home: all the necessary building blocks are packaged together into an application, which gets delivered to the user as a bundle. While this limits flexibility and increases bundle size, the program will typically work on any Windows machine since it is self-contained. In contrast, a Linux program is more like a brick house. Linux is composed of many tiny programs, such as grep, each of which generally does only one thing, but does it well. These small programs, like bricks, provide building blocks for more complex programs. It is not always possible to list all the bricks on which the final house stands, and, as such, portability of Linux programs can be a problem.

The simple path almost always involves invoking a command such as sudo apt-get install package-name. In Ubuntu, apt is a program whose job is to manage dependencies for other programs, i.e. to find what other bricks you need when you try to install a new piece of software. The pain starts when apt-get either cannot locate the necessary brick, or, worse, is unaware of a certain dependency, usually when it is nested a few layers down. At this point one has to perform an indeterminate number of Google searches to uncover the name of the missing brick and install it via apt-get. (No, it is not always clear what the name of the missing program is, much as I wish it were. For instance, when scipy refused to install due to lack of freetype, I had to install libfreetype6-dev to proceed.)
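To make the brick analogy concrete, here is a toy sketch of the job a dependency manager does: given which piece needs which, work out an order in which everything can be installed, and flag bricks that nobody knows how to provide. This is an illustration only and not how apt is actually implemented; the graph and the function names are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency graph: each key needs every item in its set.
DEPENDS_ON = {
    "house": {"walls", "roof"},
    "walls": {"bricks", "mortar"},
    "roof": {"bricks"},
    "bricks": set(),
    "mortar": set(),
}


def install_order(graph):
    """List items so that every dependency comes before whatever needs it."""
    return list(TopologicalSorter(graph).static_order())


def missing_bricks(graph):
    """Return dependencies that are referenced but not described in the graph."""
    referenced = {dep for deps in graph.values() for dep in deps}
    return sorted(referenced - set(graph))


if __name__ == "__main__":
    print("Install order:", install_order(DEPENDS_ON))
    print("Unknown bricks:", missing_bricks({"house": {"walls"}}))
```

The painful case described above corresponds to missing_bricks returning something: the manager knows a brick is needed, or does not even know that, and in either case cannot supply it on its own.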

The exact steps that I had to take to install Python were numerous, so, rather than listing them all one-by-one in this blog post, I decided to commit them to my public GitHub repository for anyone to use, including future me. I provide two files. The first script was used to install all Python components on a VirtualBox VM that I made for our book on computing. The second script was used more recently to install Python on an Azure Ubuntu VM. The highlight of it is that I wanted to play around with TensorFlow, and for whatever reason I ended up having to build it from source. It is likely possible to obtain a pre-built version instead, but that is not the route I took.

Installing and Configuring Stata

Admittedly, it has been a while since I last used Stata. Nonetheless, at some point I used it extensively, and I definitely satisfy the 10,000 hours criterion for mastery. In the process, I had to install Stata on a number of Linux-based systems, and a number of small issues re-emerged again and again. So I decided to write down a guide for my future self, as well as for anyone else who might be interested.

In what follows, I assume that you need to install Stata 13 or newer on a Linux system to which you only have command-line access, most likely over ssh. I also assume that you have a CD from which you could install Stata on a Windows or OS X machine, provided that machine has a CD drive, which is not that common these days. You will need sudo privileges on the Linux host if you want to install Stata for all users, but they are not necessary if you only want to install it for yourself.

Start by copying the entire CD contents into a folder on your local machine. Then rsync this folder to the Linux host, e.g. to /home/username/stataInstall. Make a folder into which you want to install Stata. The default is /usr/local/stataXX, where XX represents the version number, and you will almost certainly need sudo rights to create it. Navigate to this directory using cd and start the installation via [sudo] /home/username/stataInstall/install (sudo is optional, depending on the installation location).

You will have to answer a few questions, which are mostly self-explanatory. I assume that by now nobody will want to have a 32-bit installation, since a cap of 4GB for a data set can be restrictive these days. Once the installation concludes, you will need to run ./stinit to activate Stata with your key. Once that process concludes, you should be able to invoke Stata via /usr/local/stataXX/stata-mp.

Even though the installation process is finished, there are a few more steps you should take. First, you should get the updates via the update all command from within Stata. Note that if you installed Stata into /usr/local/stataXX, you will have to invoke the executable with sudo stata-mp if you want the update to succeed.

Second, you will want to add the Stata folder to PATH, so that you can invoke the stata-mp command from anywhere. In the bash shell this is done via the export PATH=$PATH:/usr/local/stataXX command, which you may want to add to your .bashrc file. (If you have sudo rights and want Stata to be in PATH for all users, then you need to edit the /etc/profile file instead.) If, like me, you prefer zsh to bash, put the line in your .zshrc file instead; I write it as PATH=$PATH:/usr/local/stataXX; export PATH.
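A quick way to confirm that the PATH change took effect is to ask where the executable resolves from in a fresh shell. The sketch below does this from Python, assuming the executable is called stata-mp as above; find_executable is a made-up helper name.

```python
import shutil


def find_executable(name="stata-mp", path=None):
    """Return the full path of `name` if it is reachable via PATH, else None."""
    # With path=None, shutil.which consults the PATH environment variable.
    return shutil.which(name, path=path)


if __name__ == "__main__":
    location = find_executable()
    if location is None:
        print("stata-mp is not on PATH; check your shell startup file")
    else:
        print("stata-mp resolves to", location)
```

Running which stata-mp in the shell gives the same answer; the Python version is just convenient if you are already scripting the setup.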

Third, and this is where most people trip up, you need to make sure that Stata has ample disk space for its temp files. Since it is limited to having a single dataset in memory at any point, many Stata commands need to make temporary disk dumps to work. On many Linux hosts the default path for storing temp files is /tmp, which is almost always mounted under the root of the filesystem, i.e. on the primary physical/logical hard drive, where space is usually limited. (To see an overview of the file system, use the df -h command.) There is even a Stata FAQ on how to point the temp files elsewhere, but, in a nutshell, all you need is to define another environment variable, TMPDIR, in your .bashrc file and make sure it points to a directory on a disk with lots of space. The FAQ also tells you how to test whether the mapping was successful.
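To check how much room the temp location actually has, something like the following sketch can be run after editing .bashrc and opening a new shell. It reports what Python sees for TMPDIR, which is a proxy only; the Stata FAQ remains the authority on what Stata itself uses. The helper name temp_dir_free_gib is hypothetical.

```python
import os
import shutil
import tempfile


def temp_dir_free_gib(directory=None):
    """Return (directory, free space in GiB) for the temp directory."""
    # tempfile.gettempdir() honours TMPDIR when it names a usable directory.
    directory = directory or tempfile.gettempdir()
    free_bytes = shutil.disk_usage(directory).free
    return directory, free_bytes / 2**30


if __name__ == "__main__":
    print("TMPDIR is", os.environ.get("TMPDIR", "(not set)"))
    location, free = temp_dir_free_gib()
    print(f"{location}: {free:.1f} GiB free")
```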

This pretty much sums up the process. As a bonus, it is possible to archive the entire contents of /usr/local/stataXX once the installation has concluded and the updates have been applied. This archive can then be copied to another machine to avoid the interactive installation. You will still have to set up the environment variables as described above, but this can be done via a shell script. And, of course, make sure that you do not exceed the number of concurrent users allowed by your Stata license.
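The archiving step can be sketched as follows. This is one possible way to do it, using Python's standard tarfile module rather than any Stata tooling; archive_directory is a made-up helper, and the paths in the usage note are placeholders.

```python
import os
import tarfile


def archive_directory(source_dir, archive_path):
    """Pack source_dir into a gzip-compressed tarball and return its path."""
    # arcname keeps only the final folder name (e.g. "stataXX") inside the archive.
    name = os.path.basename(os.path.normpath(source_dir))
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(source_dir, arcname=name)
    return archive_path
```

For example, archive_directory("/usr/local/stataXX", "/home/username/stataXX.tar.gz") would produce a tarball that unpacks into a stataXX folder on the target machine. Write the archive somewhere outside the folder being archived, so that it does not try to include itself.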

On Building Data Science Teams and Explaining Some Quirks in R

A number of people passed around a link to a fantastic piece on building well-running data science teams. Based on this paradigm, we did a lot of things wrong at Amazon, and we are doing a lot of things right at Microsoft now, at least in Azure. I particularly enjoyed how Jeff drew parallels between data scientists and advanced report creators, as well as the thinker/doer dichotomy.

I also recently found an improvised FAQ of random quirks in the R language, and greatly enjoyed reading through it. I feel like I now finally understand the difference between the "<-" and "=" assignment operators, once and for all.

Today's Reading

A short and informative note from Kaiser Fung concerning hiring data scientists. The only thing I would add to it is that, for me, a data scientist who does not know any SQL raises a yellow flag. Complete unfamiliarity with SQL is not a deal-breaker per se, but it is frequently symptomatic of the kinds of problems the candidate never had to face.

Today's Reading

An insightful article concerning ways to test whether your model will be useful. There is a substantial amount of text, so links break it into four equally sized pieces, but it is well worth your time.

Today's Reading

A short but insightful interview with Hadley Wickham, a major contributor to the R language ecosystem. I particularly enjoyed his take on how big the proverbial "Big Data" really is and what that implies for analysis.

A superb open-access paper on best practices for scientific computing, and a set of slides to accompany it. I am not yet convinced that test-driven development is as useful for scientific applications as it is for other software, but I agree with over ninety percent of the suggestions.

Visualizing Decision Trees

A terrific illustration of how decision trees, one of the core machine learning algorithms, work. It may not work equally well in every browser; I had good luck with Chrome. Thanks go to Andreas Müller for pointing out the link.

What Is Code and Users vs. Programmers

Today I'd like to share two pieces that should help non-IT readers understand some peculiarities of this business. First, a long article about the contemporary software industry: What Is Code? (Yes, I've shared this on Facebook before, but it is definitely a worthwhile read.) Second, a note by Philip Guo concerning user culture versus programmer culture. Both of these resonated with me quite strongly.

For those who either went to graduate school or are contemplating it, Philip Guo is also the author of an awesome grad school memoir: The Ph.D. Grind. Though Philip studied computer science, this piece is the single best summary of what it takes to make it through grad school that I have ever encountered.
