Data + Science

9/29/2014
Using the Gender Package in R: Part 2 - Parallel Processing

I posted about the gender package in R and gave a few code examples to get started in part 1 here. If you tried it, you probably noticed very quickly that processing is slow. It's a handy package, but running thousands or even tens of thousands of names through it takes a very long time. This post outlines parallel processing (on Windows), which spreads the work across all of the machine's processor cores to cut the time down. I will demonstrate with 8 cores.

The code is similar to the example from yesterday, but in this case I create a function and then use lapply to apply it over a list of names. Let's examine this version first, recreating the process from Part 1 with lapply.


# install the gender package if you need to
install.packages('gender')
# NOTE - if asked to install GenderData click 1 for yes.

# load package
library(gender)

# Import CSV with list of first names 
firstnames <- read.csv("D:\\Dropbox\\Data Mining\\R\\Top 150 names.csv", stringsAsFactors=FALSE)

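# if you have a short list or don't have a CSV to try, you can build your own list instead - simply uncomment
# (built as a one-column data frame so that firstnames[,1] below still works)
# firstnames <- data.frame(name = c("Elizabeth", "Mary", "Jeff", "John", "Morgan", "Helen", "Tim", "Diane", "Patricia"), stringsAsFactors = FALSE)
values <- as.vector(firstnames[,1])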

# Create gender search function
# returns the name alongside its predicted gender (the last expression is the return value)
workerFunc <- function(n){
  cbind(n, gender(n, method = "ssa", years = c(1900, 1990))$gender)
}

# Start process and track processing time
Sys.time()
res <- lapply(values, workerFunc)
Sys.time()

# Put final results together in data frame
indx <- sapply(res, length)
results <- as.data.frame(do.call(rbind,lapply(res, `length<-`, max(indx))))
colnames(results) <- c("name", "gender")

# Optional write results to CSV
write.csv(results,"D:/Dropbox/Data Mining/R/Top 150 names with gender.csv")
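
The step that puts the results together deserves a closer look. A name with no match produces a shorter result than a matched name, so the results cannot be stacked directly. The sketch below shows what the `length<-` padding does, using made-up stand-in data: the `res` list here is hypothetical and built by hand rather than returned by gender().

```r
# Hypothetical stand-in for the list returned by lapply(values, workerFunc):
# a matched name yields two elements, an unmatched name yields only one.
res <- list(c("Mary", "female"), c("Xyzzy"), c("John", "male"))

# Length of each result (2 when a gender was found, 1 when it was not)
indx <- sapply(res, length)

# `length<-` pads each shorter vector with NA up to the longest length,
# so rbind can stack every result into a two-column matrix
padded <- lapply(res, `length<-`, max(indx))
results <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
colnames(results) <- c("name", "gender")

print(results)
```

Unmatched names end up with NA in the gender column, which keeps one row per input name.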

Since we are using lapply, we can swap it for parLapply, which does the same thing but spreads the work across a cluster of worker processes. There are a few additional steps to detect the cores, make the cluster, and register it. This requires a few additional packages and a few more lines of code.


# install the packages if you need to
install.packages('gender')
# NOTE - if asked to install GenderData click 1 for yes.
# the parallel package ships with R, so it does not need to be installed
install.packages('doParallel')

# load packages
library(gender)
library(parallel)
library(doParallel)

# Detect Cores and Register
cl <- makeCluster(detectCores())
setDefaultCluster(cl)
registerDoParallel(cl)  # only needed for foreach; parLapply uses cl directly
# load the gender package on every worker so workerFunc can call gender()
clusterEvalQ(cl, library(gender))

# Import CSV with list of first names 
firstnames <- read.csv("D:\\Dropbox\\Data Mining\\R\\Top 150 names.csv", stringsAsFactors=FALSE)

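# if you have a short list or don't have a CSV to try, you can build your own list instead - simply uncomment
# (built as a one-column data frame so that firstnames[,1] below still works)
# firstnames <- data.frame(name = c("Elizabeth", "Mary", "Jeff", "John", "Morgan", "Helen", "Tim", "Diane", "Patricia"), stringsAsFactors = FALSE)
values <- as.vector(firstnames[,1])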

# Create gender search function
# returns the name alongside its predicted gender (the last expression is the return value)
workerFunc <- function(n){
  cbind(n, gender(n, method = "ssa", years = c(1900, 1990))$gender)
}

# Start process and track processing time
Sys.time()
res <- parLapply(cl, values, workerFunc)
Sys.time()

# Stop the cluster to release the worker processes
stopCluster(cl)

# Put final results together in data frame
indx <- sapply(res, length)
results <- as.data.frame(do.call(rbind,lapply(res, `length<-`, max(indx))))
colnames(results) <- c("name", "gender")

# Optional write results to CSV
write.csv(results,"D:/Dropbox/Data Mining/R/Top 150 names with gender.csv")
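
To see the lapply to parLapply swap in isolation, here is a minimal sketch that does not need the gender package or a CSV. The `slowFunc` function is a hypothetical stand-in for a slow per-name lookup, and the two-worker cluster is an assumption chosen so the sketch runs on most machines. It also uses system.time(), which reports elapsed seconds directly instead of printing two timestamps.

```r
library(parallel)  # ships with R; no installation needed

# Hypothetical stand-in for a slow per-name lookup: sleep briefly, then
# return the input in upper case
slowFunc <- function(n) {
  Sys.sleep(0.05)
  toupper(n)
}

values <- rep(c("mary", "john", "morgan", "helen"), times = 5)  # 20 items

# Serial run: system.time() reports elapsed seconds for the whole call
serial_time <- system.time(res_serial <- lapply(values, slowFunc))

# Parallel run on two workers (kept small so the sketch runs anywhere)
cl <- makeCluster(2)
parallel_time <- system.time(res_parallel <- parLapply(cl, values, slowFunc))
stopCluster(cl)

# Both approaches return the same list, in the same order
identical(res_serial, res_parallel)

serial_time["elapsed"]
parallel_time["elapsed"]
```

The actual timings depend on the machine, and for very short jobs the cost of starting the workers can outweigh the gain.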

Parallel processing makes a huge difference, especially on multi-core machines. Here's an example that processes more than 10,000 names of people who are registered for the upcoming Tableau conference. Tableau Zen Master Anya A'Hearn wanted to do some analysis on the gender of the attendees for an upcoming Women + Data Tableau User Group on Wednesday.

When processing with the normal lapply the processors look like this.

"Scotty, We need more power!" When parallel processing with the parLapply, the processors look like this, pushing my laptop to the limit.

Benchmark times for appending gender to 1,000 first names on 8 cores, using the code above:

Normal Processing: 6 minutes 25 seconds
Parallel Processing: 1 minute 44 seconds
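
As a quick worked calculation, the two benchmark times can be turned into a speedup factor. The only inputs are the two times and the core count quoted in this post; the variable names are just for illustration.

```r
# Benchmark times from the post, converted to seconds
normal_secs   <- 6 * 60 + 25   # 6 minutes 25 seconds
parallel_secs <- 1 * 60 + 44   # 1 minute 44 seconds
cores <- 8

# Speedup: how many times faster the parallel run finished
speedup <- normal_secs / parallel_secs

# Efficiency: the fraction of the ideal 8x speedup actually achieved
efficiency <- speedup / cores

round(speedup, 1)
round(efficiency, 2)
```

The parallel run is about 3.7 times faster, which is a bit under half of the ideal 8x. That gap is common with parallel processing, since starting the workers and moving data between them takes time of its own.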

Download this sample R code here.

In part 3, I will show you how to put both of these in Tableau and link from a name field. This probably won't be useful for more than a few hundred names, but since we have the code we might as well adapt it to Tableau, especially if we can utilize parallel processing.

I hope you find this information useful. If you have any questions feel free to email me at Jeff@DataPlusScience.com

Jeffrey A. Shaffer

Follow on Twitter @HighVizAbility