9/29/2014
Using the Gender Package in R: Part 2 - Parallel Processing
I posted about the gender package in R and gave a few code examples to get started in part 1 here. If you tried it, you probably realized very quickly that processing is slow. It's a handy package, but processing thousands or even tens of thousands of names takes a very long time. So this post outlines parallel processing (in Windows). Spreading the work across all of the processor cores cuts the time down, especially when using 8 cores at once, which is what I will demonstrate.
The code is similar to the example from yesterday, but in this case I create a function and then use lapply to apply the function over a list of names. Let's examine this code first. We'll recreate the process from Part 1 using lapply.
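As a rough sketch of the lapply pattern described here: the function `lookup_gender`, its tiny lookup table, and the sample names are all hypothetical stand-ins, used so the example runs without the gender package or its data. In a real run, the body of the function would call the gender package instead.

```r
# Hypothetical stand-in for the real lookup so this sketch runs without the
# gender package installed; in practice the body would call the package.
lookup_gender <- function(first_name) {
  known <- c(anya = "female", jeff = "male", maria = "female")
  key <- tolower(first_name)
  if (key %in% names(known)) known[[key]] else NA_character_
}

# Made-up sample names; a real run would read them from a file or data frame.
first_names <- c("Anya", "Jeff", "Maria", "Xyzzy")

# lapply calls the function once per name and returns a list of results.
results <- lapply(first_names, lookup_gender)

# Flatten the list into a data frame alongside the original names.
gender_df <- data.frame(
  name = first_names,
  gender = unlist(results),
  stringsAsFactors = FALSE
)
print(gender_df)
```

Wrapping the lookup in a function that takes a single name is what makes the later switch to parLapply a small change.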
Since we are using lapply, we can swap it for parLapply, which does the same thing but spreads the work across several worker processes. There are a few additional steps to detect cores, make clusters and register the clusters. This requires a few additional packages and a few more lines of code.
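A minimal sketch of the parLapply version, assuming only the `parallel` package that ships with R. The `lookup_gender` function and the sample names are hypothetical stand-ins for the real lookup. The cap of 8 workers is also an assumption, chosen to mirror the 8 cores mentioned in this post. Registering the cluster (for example through a package such as doParallel) matters for foreach-style loops; parLapply itself only needs the cluster object.

```r
library(parallel)  # provides detectCores, makeCluster, parLapply, stopCluster

# Hypothetical stand-in lookup. It is self-contained on purpose: each worker
# is a fresh R process and only sees what is sent to it.
lookup_gender <- function(first_name) {
  known <- c(anya = "female", jeff = "male", maria = "female")
  key <- tolower(first_name)
  if (key %in% names(known)) known[[key]] else NA_character_
}

# Made-up sample names.
first_names <- c("Anya", "Jeff", "Maria", "Xyzzy")

# Detect the cores; detectCores() can return NA on some systems.
n_cores <- detectCores()
if (is.na(n_cores)) n_cores <- 1L
n_workers <- max(1L, min(n_cores, 8L))  # assumption: cap at 8 workers

# Make the cluster (socket-based worker processes, which is what Windows needs).
cl <- makeCluster(n_workers)

# Same call shape as lapply, with the cluster as the first argument.
# The cluster is always shut down, even if a lookup fails.
results <- tryCatch(
  parLapply(cl, first_names, lookup_gender),
  finally = stopCluster(cl)
)

print(unlist(results))
```

If the function relied on a package or on objects from the main session, the workers would need them too, for instance via `clusterEvalQ(cl, library(gender))` or `clusterExport()` before the parLapply call.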
Parallel processing makes a huge difference, especially on multi-core machines. Here's an example processing 10,000+ names of people who are registered for the upcoming Tableau conference. Tableau Zen Master Anya A'Hearn wanted to do some analysis on the gender of the attendees for a Women + Data Tableau User Group on Wednesday.
When processing with the normal lapply the processors look like this.
"Scotty, We need more power!" When parallel processing with the parLapply, the processors look like this, pushing my laptop to the limit.
Benchmark times for 1,000 records appending gender on first name with 8 cores and using the code above:
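If you want to run a similar comparison on your own machine, a timing sketch along these lines may help. Everything in it is an assumption for illustration: `slow_lookup` uses `Sys.sleep` to simulate a slow per-name call, the names are made up, and it uses 200 records rather than 1,000 so it finishes quickly. The timings depend on the machine, and for very fast functions the cost of starting workers can outweigh the gain.

```r
library(parallel)

# Hypothetical slow lookup: Sys.sleep simulates the per-name cost of a real call.
slow_lookup <- function(first_name) {
  Sys.sleep(0.005)
  toupper(first_name)
}

# 200 made-up records.
first_names <- rep(c("Anya", "Jeff", "Maria", "Sam"), times = 50)

# Time the serial version.
serial_time <- system.time(
  serial_res <- lapply(first_names, slow_lookup)
)[["elapsed"]]

# Set up the cluster outside the timed region, then time the parallel version.
n_cores <- detectCores()
if (is.na(n_cores)) n_cores <- 1L
cl <- makeCluster(max(1L, min(n_cores, 8L)))  # assumption: cap at 8 workers
parallel_time <- system.time(
  parallel_res <- parLapply(cl, first_names, slow_lookup)
)[["elapsed"]]
stopCluster(cl)

cat("lapply:   ", serial_time, "seconds\n")
cat("parLapply:", parallel_time, "seconds\n")
```

Both versions return the same list, so `identical(serial_res, parallel_res)` is a quick sanity check before trusting the timings.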
In part 3, I will show you how to put both of these in Tableau and link from a name field. This probably won't be useful for more than a few hundred names, but since we have the code we might as well adapt it to Tableau, especially if we can utilize parallel processing.
I hope you find this information useful. If you have any questions feel free to email me at Jeff@DataPlusScience.com