Blogs

Today's Reading

A short but insightful interview with Hadley Wickham, a major contributor to the R language ecosystem. I particularly enjoyed his take on how big the proverbial "Big Data" really are and what that implies for analysis.

A superb open-access paper on best practices for scientific computing, and a set of slides to accompany it. I am not yet convinced that test-driven development is as useful for scientific applications as it is elsewhere, but I agree with well over ninety percent of the suggestions.

Visualizing Decision Trees

A terrific illustration of how decision trees, one of the core machine learning algorithms, work. It may not render equally well in every browser; I had good luck with Chrome. Thanks go to Andreas Müller for pointing out the link.

What Is Code and Users vs. Programmers

Today I'd like to share two pieces that should help non-IT readers understand some peculiarities of this business. First, a long article about the contemporary software industry: What Is Code? (Yes, I've shared this on Facebook before, but it is definitely worth reading.) Second, a note by Philip Guo on user culture versus programmer culture. Both resonated with me quite strongly.

For those who either went to graduate school or are contemplating it, Philip Guo is also the author of the awesome grad school memoir The Ph.D. Grind. Though Philip studied computer science, it is the single best summary of what it takes to make it through grad school that I have ever encountered.

Shrinkage Estimator

Surprisingly few people I've met have encountered the shrinkage idea, even among the statistically literate. If you routinely estimate means for several groups of objects, this result will likely surprise you. Here is a paper that describes how and why the estimator works, and when it may not be the best available option.
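
To give a flavor of the result (my sketch of the simplest textbook case, not the paper's notation): suppose you observe p group means y_i, each approximately normal with true mean theta_i and a known common variance sigma^2. The James-Stein estimator pulls every raw mean toward the grand mean:

    \hat{\theta}_i = \bar{y} + \left(1 - \frac{(p - 3)\,\sigma^2}{\sum_{j=1}^{p} (y_j - \bar{y})^2}\right)(y_i - \bar{y}), \qquad p \ge 4,

and it achieves lower total squared error than the raw means themselves, no matter what the true values are. The noisier the individual means (for instance, the fewer observations per group), the harder they get pulled together.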

These days, with all the so-called "Big Data" hype around us, shrinkage is an especially powerful idea. In my experience, "Big Data" is rarely big in the sense most people imagine. True, the raw data can easily run into terabytes, but often they describe a large number of individual entities, e.g., customers, and results are sought at the granularity of the individual entity. For some of those entities this is feasible (there are customers who shop prolifically), but for most we will have a handful of observations at best. This is exactly the kind of situation where the shrinkage estimator shines.

Setting Up Git on Windows, Part 2

If you followed the directions in Part 1 of this guide, you should now have a working copy of Git on your Windows machine. Those with a *nix machine (including Macs) should already have Git installed. Those who have absolutely no idea how to use Git can either familiarize themselves with Chapter 2 of this book, or wait until my book gets published :)

In this part of the guide I will talk about setting up an SSH identity. Strictly speaking, you don't really need this to use Git with either GitHub or Bitbucket - you could push and pull over HTTPS instead of SSH by typing the password for the corresponding account every time. However, if your password is sufficiently long and secure (and it should be), this gets tedious fairly quickly. That is where an SSH identity comes in handy: it teaches GitHub and Bitbucket to recognize your computer when you access a Git repository from it. The downsides are twofold: first, anyone with access to your computer will be able to interact with your GitHub and Bitbucket repositories; second, if you routinely work on a number of computers, you will need to go through this process on each of them.
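
For reference, the heart of what the Bitbucket guide has you do boils down to a handful of Git Bash commands along these lines (a rough sketch; the exact prompts and paths may differ on your machine):

    # generate a new key pair, accepting the default file location when prompted
    ssh-keygen -t rsa

    # start the SSH agent and register your new private key with it
    eval $(ssh-agent -s)
    ssh-add ~/.ssh/id_rsa

    # after adding the public key to your Bitbucket account, test the connection
    ssh -T git@bitbucket.org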

Most of the steps are laid out straightforwardly in this guide from Bitbucket. You can skip Step 1, although it is worth reading if you have no clue what SSH is. I encourage you to follow Steps 2 through 6 verbatim. (I would skip entering a passphrase in Step 3 unless you frequently share your computer with someone you don't trust.) Step 7 assumes that you already have a local Git repository on your computer configured to use HTTPS. If you are starting from scratch - i.e., you have a set of files that you want to put under version control - do the following:

  1. First, use the Git Bash shell to navigate to the directory with those
    files: cd /c/path/to/folderWithFiles
  2. Next, initialize a Git repository in this folder: git init
  3. Finally, set up the remote: git remote add origin ssh://git@bitbucket.org/yourBitbucketUsername/yourBitbucketRepositoryName.git
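
Putting it all together, a typical first session might look like the following (the commit message and the master branch name are just placeholders, so adjust to taste):

    cd /c/path/to/folderWithFiles
    git init
    git remote add origin ssh://git@bitbucket.org/yourBitbucketUsername/yourBitbucketRepositoryName.git
    git add .
    git commit -m "Initial commit"
    git push -u origin master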

Once this is done, you should be able to interact with Bitbucket from Git Bash without any issues. To do the same with GitHub, you can reuse the same SSH identity. Simply open the file id_rsa.pub that you created earlier in Notepad, copy its contents, and follow the guidance in Step 4 of this guide to add the key to your account. This should be sufficient.
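
If you would rather skip Notepad, a one-liner in Git Bash will copy the public key straight to the Windows clipboard (assuming the key lives in the default location):

    cat ~/.ssh/id_rsa.pub | clip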

Congratulations, now you can freely use Git on Windows and interact with two major Git hosting services without any hassle! There is nothing really difficult about this process, but putting it all together can be less than straightforward for a beginner, and hopefully this two-part guide will make it easier.

Setting up Git on Windows, Part 1

These days, Git is the default way to perform version control when developing code. I could go on a prolonged rant about how economists (and most social scientists, in my opinion) should invest a week of their time to learn how to stop worrying and love the tool. However, such rants are easy to find on Google, so instead I decided to draft a quick guide on how to get Git up and running on a Windows machine. Git usually comes pre-installed on Mac and Linux machines, but the second part of the guide should be helpful on those platforms as well.

You will first need to obtain Git; I suggest getting the installer from git-scm.com. It is a pretty standard Windows-style installer, and I strongly encourage you to accept the default options: first, use Git only inside Git Bash, and second, checkout Windows-style, commit Unix-style. The other choices are less important.

Once the installation concludes, you can start Git Bash, which will open a terminal window with a standard Unix-style bash shell that is Git-aware. By default, the only editor available will be vi, which is powerful but has a steep learning curve. For new users I would instead suggest nano, which is basically a Linux version of Notepad. Here is a post that details how to install nano on Windows and use it with Git. I would add that, to edit environment variables on Windows, you can follow this guide from Microsoft, which is a tad more detailed. Test your success by starting the Git Bash shell again and typing nano inside - if it works, press Ctrl+X to exit back to the shell.
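
Once nano is on your PATH, it is also worth telling Git to use it for commit messages and the like; a single configuration command (run inside Git Bash, with the --global flag so it applies to all of your repositories) takes care of that:

    git config --global core.editor nano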

You are now all set to use Git locally, provided that you know how to do it. To harness Git’s full power, however, you will want to be able to interact with a remote repository. The two best web services for hosting a remote Git repository are GitHub and Bitbucket. GitHub offers free repositories with an unlimited number of users, provided that you make all the code publicly accessible via their web interface; you need to pay GitHub to make your code private. Bitbucket instead offers private repositories, but limits the number of users on them to five and asks you to pay if you need more. The choice is yours to make; I use both for different purposes.

In the next section I will discuss how you can push code to your GitHub/Bitbucket repositories and pull it back from them. Stay tuned.

Quirks in R and Formatting Tables

A short slide deck that illustrates a few gotchas in R. I had to discover most of them the hard way, and there is really no single source of information that lists most of these issues. Well, Harry Paarsch and I are hoping to fix some of that with our forthcoming book, but it still has to get printed first.

The second item is more of a pet peeve of mine: formatting data tables for readability. I found this fantastic illustration of the "less is more" principle; as usual, a picture is worth a thousand words.

Today's Readings

I read voraciously, and, at some point, I decided it would be nice to share some of the stuff I read with the rest of the world. There were two worthy items today:

Peter Norvig, Director of Research at Google, discusses mistakes people make when they perform A/B tests. Plenty of useful insights; I have made (or seen people make) at least half of these mistakes. The best observation, I think, was about people confusing uniformity with randomness.

Jake VanderPlas talks about estimating models with more parameters than available data points. Besides the main point, there is a nice illustration of what L1/L2 regularizations mean from a Bayesian perspective.
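
In a nutshell (my own summary, not a quote from the post): penalized least squares can be read as maximum a posteriori estimation, with the penalty playing the role of a prior on the coefficients. For instance, the ridge objective

    \hat{\beta} = \arg\min_{\beta} \; \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2

corresponds to a Gaussian prior on \beta, while swapping in the L1 penalty \lambda \|\beta\|_1 corresponds to a Laplace (double-exponential) prior, which is what drives some coefficients exactly to zero in the lasso.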

Software: Stata

In the previous post, I discussed SAS in detail. Now I will turn to my favorite package, Stata. Five years ago nobody would have seriously considered it as an option for dealing with large datasets. Today, it is probably one of the most effective tools I am aware of.

Let’s get the bad news out of the way first: Stata is, by design, memory-bound. This means that if your machine has 16 GB of RAM, you realistically will not be able to work with datasets over 15 GB. (Stata itself needs very little RAM, but your operating system needs some as well.) Second, Stata does not support distributed computation. It was conceived and developed in the 80s and 90s, when the “big data” challenges of the day could be addressed by getting a more powerful host. Third, Stata can only operate on a single dataset at any given moment. Think of it as a single Excel sheet whose size is limited only by the available RAM. Finally, much like SAS, Stata has its own unique programming language, which is quite quirky and can be tricky to master.

Given these formidable drawbacks, why would anyone even bother considering Stata? Because it is otherwise made of pure awesomeness, that’s why. (Ok, I may be exaggerating a bit here, but bear with me.) Most skilled empiricists know the sad truth of working with real data: about eighty percent of the time on every project is spent manipulating data and wrestling it into a format suitable for analysis. I have yet to find a more efficient tool than Stata for these tasks. Assuming the relevant data fit into RAM, once they are loaded, Stata is blazingly fast at cutting and slicing them. Whenever I open a new dataset, it typically takes me less than five minutes to identify any potential problems that crept in during the data construction phase. This makes Stata ideal for prototyping new solutions: at the early stages of any project it is important to fail quickly. Whatever takes hours to do in SAS usually takes minutes in Stata.

I already mentioned that Stata’s internal language is quirky. If you stick with it long enough to get over these quirks, however, you will come to appreciate its flexibility and pithiness. In particular, it provides an uncanny level of versatility when it comes to creating loops. There is effectively no difference in looping over variable values, variable names, strings, or numbers – a feature that enables writing exceedingly compact code. Whenever I use R, SAS, Matlab, or virtually anything else other than Stata, I resent not having this kind of versatility at my disposal.

In addition, Stata was written by economists and, primarily, for economists. A number of estimation methods and routines that are specific to economics have been implemented only in Stata, mostly methods related to panel data analysis. People have conflicting opinions on whether it is best to “code up” all the methods you use from scratch. My take on this is simple: speed matters, so whatever enables rapid exploration is good. Once you find something that seems to work, re-implementing the method from scratch is a perfectly reasonable way to add extra robustness.

I will conclude by noting that a perpetual license for a very powerful implementation of Stata costs under ten thousand dollars, and one can get a less powerful version for about a thousand. Stata is my default tool for exploratory data analysis, and I personally find it indispensable.

Software: SAS

In the previous posts I talked primarily about what to do and what not to do when working with data. Now I want to switch gears a bit and discuss the tools of the trade, i.e., the packages and programming environments one could use to this end. I have had a chance to work with the vast majority of existing analytic solutions, and as such, have formed an opinion on each. Before I start, however, I would like to share a nice summary of what your statistical language of choice reveals about you. As usual with these kinds of lists, about eighty percent of it is pretty accurate.

I will start with SAS. Many companies out there depend heavily on SAS for their data analytics, and it is understandable: for a long time, SAS was the only statistical package that could deal with large datasets. Unlike most other high-level packages such as R or Stata, SAS does not keep the data in RAM, so disk space is effectively the only limit on the amount of data it can process. This probably explains why most data analysts with over ten years of experience are solidly proficient in SAS: when they started their careers, other tools were not an option.

Personally, I am not a huge fan of SAS, for three major reasons. First, it is slow relative to other packages that keep data in memory. These days getting a host with 60+ GB of RAM is trivial, and hosts with over 200 GB of RAM are not unheard of. While raw data sizes have grown accordingly, most datasets shrink to tens of gigabytes once they are processed and ready for analysis, so a host with 64 GB of RAM works fine nine times out of ten.

Second, SAS has its own peculiar programming language that is unlike anything else out there. This implies a steep learning curve for anyone new to the environment. This is not to say that the problem is unique to SAS (Stata's syntax is perhaps even less intuitive), but it is still a downside: it takes time to get new people up to speed on the team's current work, which is never a good thing.

Finally, SAS is outrageously expensive. A relatively modest annual license can easily cost in the low six digits for the first year and in the mid-five digits every year thereafter. While SAS offers an amazingly rich array of modules that are designed to take care of the grunt work such as loading data from databases directly into SAS, these price tags are still hard to justify.

For a while, SAS was able to benefit from the fact that CPU speed grew roughly at the same rate as the size of the average dataset. As long as this was true, one could solve the “big data” problem by allocating a larger host for the analysis. In the early 2000s, however, datasets started growing much faster, and reading from disk became the hardware bottleneck. This gave birth to distributed computing frameworks (read: MapReduce), and the people at SAS Institute were for some reason remarkably oblivious to this structural change. As a result, SAS is now hopelessly behind when it comes to reading data from a distributed storage system such as HDFS. It will be interesting to see what they can make of the situation at hand.
