Kathryn T. Stolee

ESEM 2011 Artifacts

In our ESEM 2011 paper, we analyze a sample of 32,887 artifacts from the Yahoo! Pipes repository. We make the data used in our analysis available here. If you are interested in the JSON representations of the artifacts, please contact me directly.

Artifact Summaries

This data was used for the majority of the tables and figures in the paper. Specifically, the data for Tables II, III, IV, V, VI, VII, VIII, IX, XII, XIII and Figure 3 comes from the artifact summaries. The data set contains the following columns; each column name below is followed by a description of its value:

id:	The pipe's identifier, as defined by Yahoo!. Some of the pipes may have changed since we ran our scraper, but the id can still be used to view the pipe's structure and content on Yahoo!'s servers.
author:	An identifier for the author of the pipe
prol:	A boolean value, 1 or 0, indicating if the author was flagged as a prolific author for the analysis
authpipes:	The number of pipes within our sample created by the author
createdate:	The date on which the pipe was created
days:	The number of days between the creation of the author's earliest pipe in the sample and the creation of the current pipe
config:	The number of user-setter modules in the pipe
modules:	The number of modules in the pipe
clones:	The number of times the pipe had been cloned at the time it was scraped from the repository
l0:	Considering the community, this is the size of the cluster at level 0, where each pipe is in its own cluster (See Table I in the paper for cluster level descriptions)
l1:	Considering the community, this is the size of the cluster containing this pipe at level 1
l2:	Considering the community, this is the size of the cluster containing this pipe at level 2
l3:	Considering the community, this is the size of the cluster containing this pipe at level 3
l4:	Considering the community, this is the size of the cluster containing this pipe at level 4
l5:	Considering the community, this is the size of the cluster containing this pipe at level 5
l6:	Considering the community, this is the size of the cluster containing this pipe at level 6
l7:	Considering the community, this is the size of the cluster containing this pipe at level 7
clustered:	The minimum level at which this pipe joined at least one other pipe in a cluster
l0self:	Within-author clustering, this is the size of the cluster at level 0, where each pipe is in its own cluster. Note that all the within-author clusterings were only performed for the most prolific authors
l1self:	Within-author clustering, this is the size of the cluster containing this pipe at level 1 (only computed for the most prolific authors)
l2self:	Within-author clustering, this is the size of the cluster containing this pipe at level 2 (only computed for the most prolific authors)
l3self:	Within-author clustering, this is the size of the cluster containing this pipe at level 3 (only computed for the most prolific authors)
l4self:	Within-author clustering, this is the size of the cluster containing this pipe at level 4 (only computed for the most prolific authors)
l5self:	Within-author clustering, this is the size of the cluster containing this pipe at level 5 (only computed for the most prolific authors)
l6self:	Within-author clustering, this is the size of the cluster containing this pipe at level 6 (only computed for the most prolific authors)
l7self:	Within-author clustering, this is the size of the cluster containing this pipe at level 7 (only computed for the most prolific authors)
clusteredself:	Within-author clustering, this is the minimum level at which this pipe joined at least one other pipe in a cluster (only computed for the most prolific authors)

Download the data set here: (csv)
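As a rough illustration of how the cluster-size columns relate to the clustered column, the sketch below reads summary rows and finds the lowest level at which a pipe's cluster size exceeds 1. It assumes the csv file has a header row using the column names listed above; the sample rows are made up for illustration and are not taken from the data set, and the helper names (read_summaries, min_cluster_level) are hypothetical. How the real data encodes a pipe that never joins a cluster is not stated here, so the sketch returns None in that case.

```python
import csv
import io

# Column names l0 .. l7, as listed in the description above.
LEVEL_COLUMNS = [f"l{k}" for k in range(8)]


def min_cluster_level(row, columns=LEVEL_COLUMNS):
    """Return the lowest level whose cluster size exceeds 1, or None."""
    for level, name in enumerate(columns):
        if int(row[name]) > 1:
            return level
    return None


def read_summaries(text):
    """Parse CSV text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


# Made-up rows for illustration only; not real values from the data set.
SAMPLE = """id,author,modules,l0,l1,l2,l3,l4,l5,l6,l7
pipeA,auth1,5,1,1,1,3,3,7,7,12
pipeB,auth2,2,1,1,1,1,1,1,1,1
"""

rows = read_summaries(SAMPLE)
levels = {row["id"]: min_cluster_level(row) for row in rows}
print(levels)
```

The same function could be applied to the l0self through l7self columns by passing a different list of column names, for rows where the within-author clustering was computed.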

Rolling Cluster Analysis

We perform a rolling diversity analysis for each of the most prolific authors, as described in Section VII: Analysis of the Most Prolific Authors. The data for Figure 2 and Tables X and XI comes from the rolling cluster analysis. For each author, we sort the pipes they created by date, and then determine the level at which each pipe was clustered, considering all the pipes created before it. Each row in the data set contains the following information:

level:	The level at which pipe2 was clustered
author:	The author who created the pipes
pipe1:	The ID of the pipe that was added just before pipe2 was added
pipe2:	The ID of the pipe being clustered
date1:	The creation date of pipe1
date2:	The creation date of pipe2
between:	The number of days between the creation dates of pipe1 and pipe2

Download the data set here: (csv)
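The row structure above can be sketched as follows: one author's pipes are sorted by creation date and each pipe is paired with the one created just before it. This is only a sketch under stated assumptions. The function name consecutive_pairs and the sample pipes are hypothetical, dates are assumed to be available as Python date objects, and the level column is omitted because computing it requires the clustering procedure described in the paper.

```python
from datetime import date


def consecutive_pairs(pipes):
    """Pair each pipe with the one created just before it.

    pipes: list of (pipe_id, creation_date) tuples for a single author.
    Returns one dict per pair, using the column names described above
    (the level column is not computed here).
    """
    # Sort by creation date; pipes created on the same day keep input order.
    ordered = sorted(pipes, key=lambda p: p[1])
    pairs = []
    for (id1, d1), (id2, d2) in zip(ordered, ordered[1:]):
        pairs.append({
            "pipe1": id1,
            "pipe2": id2,
            "date1": d1,
            "date2": d2,
            "between": (d2 - d1).days,
        })
    return pairs


# Made-up pipes for illustration only.
sample = [
    ("p3", date(2010, 3, 1)),
    ("p1", date(2010, 1, 10)),
    ("p2", date(2010, 1, 12)),
]
pairs = consecutive_pairs(sample)
for pair in pairs:
    print(pair["pipe1"], pair["pipe2"], pair["between"])
```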