
On Building Data Science Teams and Explaining Some Quirks in R

A number of people passed around a link to a fantastic piece on building well-running data science teams. Based on this paradigm, we did a lot of things wrong at Amazon, and we are doing a lot of things right at Microsoft now, at least in Azure. I particularly enjoyed how Jeff drew parallels between data scientists and advanced report creators, as well as the thinker/doer dichotomy.

I also recently found an improvised FAQ of random quirks in the R language, and greatly enjoyed reading through it. I feel like I finally understand the difference between the "<-" and "=" assignment operators.

Today's Reading

Short and informative note from Kaiser Fung concerning hiring data scientists. The only thing I would add to it is that for me a data scientist who does not know any SQL raises a yellow flag. Complete unfamiliarity with SQL is not a deal-breaker per se, but it is frequently symptomatic of the kinds of problems the candidate never had to face.

Today's Reading

An insightful article concerning ways to test whether your model will be useful. It is a substantial amount of text, split by links into four equally sized pieces, but it is well worth your time.

Today's Reading

A short but insightful interview with Hadley Wickham, a major contributor to the R language ecosystem. I particularly enjoyed his take on how big the proverbial "Big Data" really is and what that implies for analysis.

A superb open-access paper on best practices for scientific computing, and a set of slides to accompany it. I am not yet convinced that test-driven development is as useful for scientific applications, but I generally agree with over ninety percent of the suggestions.

Visualizing Decision Trees

A terrific illustration of how decision trees, one of the core machine learning algorithms, work. It may not work equally well in every browser; I had good luck with Chrome. Thanks go to Andreas Müller for pointing out the link.

What Is Code and Users vs. Programmers

Today I'd like to share two pieces that would be helpful for non-IT readers to understand some peculiarities of this business. First, a long article about the contemporary software industry: What Is Code? (Yes, I've shared this previously on Facebook, but this is definitely a worthwhile read.) Second, a note by Philip Guo concerning user culture versus programmer culture. Both of these resonated with me quite strongly.

For those who either went to graduate school or are contemplating it, Philip Guo is the author of the awesome grad school memoir: The Ph.D. Grind. Though Philip studied computer science, this piece is the single best summary of what it takes to make it through grad school that I have ever encountered.

Shrinkage Estimator

Surprisingly few people I've met have encountered the shrinkage idea, even among the statistically literate ones. If you routinely estimate means of several groups of objects, this will likely come as a surprising result for you. Here is a paper that describes how and why this estimate works, and when it may not be the best available option to use.

These days, however, with all the so-called "Big Data" hype all around us, shrinkage is a very powerful idea. In my experience, "Big Data" is rarely big in the sense most people have in mind. True, the raw data amounts that are available can easily be in terabytes, but often these are data concerning a large number of individual entities, e.g., customers, and results are sought at the granularity of an individual entity. For some of them, it is possible (there are customers who shop prolifically), but for most, we'll have a handful of observations at best. This is exactly the kind of situation where a shrinkage estimator rocks.
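To make the idea concrete, here is a minimal sketch of one textbook form of shrinkage, the normal-normal (empirical Bayes) estimator. It is not necessarily the exact estimator from the paper mentioned above, and the notation is my own assumption: group $i$ has $n_i$ observations with sample mean $\bar{x}_i$, $\bar{x}$ is the grand mean across all groups, $\sigma^2$ is the within-group variance, and $\tau^2 > 0$ is the variance of the true group means.

```latex
\hat{\theta}_i = \bar{x} + (1 - B_i)\,(\bar{x}_i - \bar{x}),
\qquad
B_i = \frac{\sigma^2 / n_i}{\sigma^2 / n_i + \tau^2}.
```

When $n_i$ is small, $B_i$ is close to 1 and the estimate sits near the grand mean; when $n_i$ is large, $B_i$ is close to 0 and the estimate is essentially the group's own mean. That is why the customers with only a handful of observations are the ones that get pulled toward the overall average the most.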

Setting Up Git on Windows, Part 2

If you followed through the directives in Part 1 of this guide, you should now have a working version of Git on your Windows machine. Those with a *nix machine (including Macs) should already have Git installed. Those who have absolutely no idea how to use Git can either familiarize themselves with Chapter 2 of this book, or wait until my book gets published :)

In this part of the guide I will talk about setting up an SSH identity. Strictly speaking, you don't really need this to use Git with either GitHub or Bitbucket - you could push and pull over HTTPS instead of SSH by typing the password from your corresponding account every time. However, if your password is sufficiently long and secure (and it should be), this can get pretty tedious fairly quickly. That is where an SSH identity will come in handy: it will teach GitHub and Bitbucket to recognize your computer when you try accessing a Git repository from it. The downsides are twofold: first, anyone with access to your computer will be able to interface with your GitHub and Bitbucket repositories; second, if you routinely work on a number of computers, you would need to go through this process on each of them.

Most of the steps are laid out straightforwardly in this guide from Bitbucket. You can skip Step 1, though it is informative if you don't have a clue what SSH is. I would pretty much encourage you to follow Steps 2 through 6 verbatim. (I would skip entering a passphrase in Step 3 unless you frequently share your computer with someone you don't trust.) Step 7 assumes that you already have a local Git repository on your computer configured to use HTTPS. In case you are starting from scratch - i.e., when you have a set of files that you seek to put under version control - do the following:

  1. First, use the Git Bash shell to navigate to the directory with those files: cd /c/path/to/folderWithFiles
  2. Next, turn on Git in this folder: git init
  3. Finally, add the remote repository: git remote add origin ssh://git@bitbucket.org/yourBitbucketUsername/yourBitbucketRepositoryName.git
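The three steps above can be put together in one Git Bash session. This is only a minimal sketch: the scratch directory stands in for your own folder, and the username and repository name in the URL are placeholders, not real values.

```shell
# Placeholder: a scratch directory standing in for /c/path/to/folderWithFiles.
project_dir="$(mktemp -d)"
# Placeholder URL: substitute your own Bitbucket username and repository name.
remote_url="ssh://git@bitbucket.org/yourBitbucketUsername/yourBitbucketRepositoryName.git"

cd "$project_dir"                      # step 1: go to the folder with the files
git init                               # step 2: turn the folder into a repository
git remote add origin "$remote_url"    # step 3: register the remote as "origin"

# Confirm what Git recorded for the remote (no network access needed).
git config --get remote.origin.url
```

Nothing is sent to Bitbucket at this point; that only happens on the first push, once you have committed something.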

Once this is done, you should be able to interact with Bitbucket from Git Bash without any issues. To do the same with GitHub, you can use the same SSH identity. Simply open the id_rsa.pub file that you created previously in Notepad, copy its contents, and use the guidance in Step 4 of this guide to add the key to your account. This should be sufficient.
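For reference, the key-generation part of the process boils down to something like the sketch below. It is not a copy of the Bitbucket guide: the scratch directory and the key label are my own illustrative choices, made so that trying it out cannot overwrite a key you already have.

```shell
# Scratch location so that an existing key in ~/.ssh is never overwritten.
# For a real setup, drop -f and accept the default file (~/.ssh/id_rsa).
key_dir="$(mktemp -d)"
key_file="$key_dir/id_rsa"

# Generate an RSA key pair. -N "" means an empty passphrase,
# -C attaches a free-form label (hypothetical here), -q keeps the output quiet.
ssh-keygen -q -t rsa -b 4096 -N "" -C "my-windows-laptop" -f "$key_file"

# The .pub file is the part you paste into Bitbucket or GitHub.
cat "$key_file.pub"

# Once the key is registered, the connection can be checked with, e.g.:
#   ssh -T git@github.com
```

The private half (the file without the .pub extension) never leaves your machine; only the public half is uploaded.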

Congratulations, now you can freely use Git on Windows and interact with two major Git hosting services without any hassle! There is nothing really difficult about this process, but putting it all together can be less than straightforward for a beginner, and hopefully this two-part guide will make it easier.

Setting Up Git on Windows, Part 1

These days, Git is the default way to perform version control when developing code. I could go on a prolonged rant about how economists (and most social scientists, in my opinion) should invest a week of their time to learn how to stop worrying and love the tool. However, such rants can be easily found on Google, and, instead, I decided to draft up a quick guide on how to get Git up and running on a Windows machine. Git usually comes pre-installed on a Mac or Linux machine, but the second part of the guide will be helpful for those platforms as well.

You will first need to obtain Git; I suggest getting the installer from git-scm.com. It is a pretty standard Windows-style installer, and I strongly encourage you to accept the default options: first, use Git only inside Git Bash, and second, checkout Windows-style, commit Unix-style. Other choices are less relevant.
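The "checkout Windows-style, commit Unix-style" choice corresponds to Git's core.autocrlf setting being true. Here is a minimal sketch of setting and checking it by hand, in case you ever need to reproduce that choice outside the installer:

```shell
# Convert LF to CRLF on checkout and CRLF back to LF on commit.
git config --global core.autocrlf true

# Print the value currently in effect.
git config --get core.autocrlf
```

If you accepted the installer defaults, the second command alone should already report the setting.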

Once the installation concludes, you can start Git Bash, and it will open up a terminal window with a standard Unix-style bash shell, which will be Git-aware. By default, the only editor it will support will be vi, which is powerful but has a steep learning curve. For new users I would instead suggest using nano, which is basically a Linux version of Notepad. Here is a post that details how to install nano on Windows and use it with Git. I would add to it that, to edit environment variables on Windows, you can follow this guide from Microsoft, which is a tad more detailed. Test your success by starting the Git Bash shell again and typing nano inside - if successful, press Ctrl+X to exit back to shell.
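Once nano is on your PATH, Git still has to be told to use it. One common way, which may or may not match what the linked post does, is the core.editor setting:

```shell
# Tell Git to open nano for commit messages and other interactive edits.
git config --global core.editor nano

# Check the stored value.
git config --global --get core.editor
```

After this, a plain git commit without the -m option should open nano instead of vi.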

You are now all set to use Git locally, provided that you know how to do it. To harness Git’s full power, however, you will want to be able to interact with a remote repository. The two best web services to host a remote Git repository are GitHub and Bitbucket. GitHub offers free repositories with an unlimited number of users, provided that you make all the code publicly accessible to everyone via their web interface; you would need to pay GitHub to make your code private. Bitbucket instead offers private repositories, but limits the number of users on them to five, and asks you to pay them if you need more users. The choice is yours to make; I use both for different applications.

In the next section I will discuss how you can push code to your GitHub/Bitbucket repositories and pull it back from them. Stay tuned.

Quirks in R and Formatting Tables

A short slide deck that illustrates a few gotchas in R. I had to discover most of them the hard way, and there is really no single source of information that would list most of these issues. Well, Harry Paarsch and I are hoping to fix some of it with our forthcoming book, but it still has to get printed first.

The second point is more of a pet peeve of mine: formatting tables with data for readability. I found this fantastic illustration of the "less is more" principle; as usual, a picture is worth a thousand words.
