In my previous post I provided an outline of how the KM4Dev SNA will be conducted. Phase One, the analysis of the discussion groups, has commenced. A fortnight or so ago I was provided with post data in an XML format for the main group. This was quite exciting because I have 10 complete years of data, and two years of incomplete data, to analyse. It’s also a daunting task, because before any analysis could be done, 10,576 rows and 7 columns had to be cleaned and manipulated! My tool of choice for a dataset of this size, at least for the initial cleaning and manipulation stage, is Microsoft Excel. Excel has some very good capabilities, including the CLEAN function, which removes the hidden non-printable characters that cause problems in analysis tools.
The dataset contained 10,354 posts, of which 7,238 were replies. Of these, “Anonymous” posted 1,999 replies to 1,374 posts. This represents about 18% of all posts. However, it was necessary to remove “Anonymous” from the dataset, because “Anonymous” is almost certainly not a single person, and leaving them in would distort the results. Similarly, identified pseudonyms, aliases, and duplicate names were removed, along with “self-replies” and unanswered posts. Ultimately this process left 703 identified individuals in the network. These people comprise the node-set for the public bounded (or contained) network, to which activity and various network measures can be applied.
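For readers who would rather script this filtering step than do it in Excel, here is a minimal Python sketch of the idea. It assumes each reply has already been reduced to a (replier, original poster) pair of names; that layout, the function names, and the `EXCLUDED_AUTHORS` set are my own illustrations, not the structure of the KM4Dev export.

```python
# Hypothetical exclusion list: extend it with any pseudonyms or aliases
# that have been identified in the data.
EXCLUDED_AUTHORS = frozenset({"Anonymous"})


def build_reply_edges(replies, excluded=EXCLUDED_AUTHORS):
    """Keep only (replier, original_poster) pairs usable as network ties.

    Drops any pair involving an excluded author, and any self-reply.
    """
    edges = []
    for replier, original_poster in replies:
        if replier in excluded or original_poster in excluded:
            continue  # "Anonymous" is not a single identifiable person
        if replier == original_poster:
            continue  # a self-reply is not a tie between two people
        edges.append((replier, original_poster))
    return edges


def node_set(edges):
    """Return the set of distinct people appearing in the remaining ties."""
    return {name for edge in edges for name in edge}
```

Note that this sketch does not merge duplicate names for the same person; that step needs human judgement and was done by hand.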
One of the first measures applied was Gloor’s Contribution Index, defined as (messages sent − messages received) / (messages sent + messages received). It is interpreted as follows:
Coupling the index with the frequency of posting allows an individual’s “role type” to be determined as shown below. There are other indices that could be used, including those developed by Derek Hansen, Ben Shneiderman and Marc Smith, but I find Peter Gloor’s Contribution Index sufficient for this stage of the analysis.
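The Contribution Index formula given above is simple enough to sketch in a few lines of Python. The function name and the choice to raise an error for someone with no messages at all are my own assumptions; the formula itself is the one quoted in this post. By construction the result always lies between -1 (only receives messages) and +1 (only sends messages), with 0 meaning sending and receiving are balanced.

```python
def contribution_index(sent: int, received: int) -> float:
    """Gloor's Contribution Index: (sent - received) / (sent + received)."""
    if sent < 0 or received < 0:
        raise ValueError("message counts cannot be negative")
    total = sent + received
    if total == 0:
        # The index is undefined for a person with no activity at all.
        raise ValueError("index is undefined when sent + received is zero")
    return (sent - received) / total
```

For example, someone who sent 30 messages and received 10 has an index of (30 - 10) / (30 + 10) = 0.5.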
The next diagram shows the results for the KM4Dev main discussion group. The active or key participant group comprises 113 individuals, and deeper analysis shows they are active over almost all the years in the dataset. I still need to do further analysis, but this approach provides a way of partitioning the dataset later on.
A common heuristic that can be used to determine the size of the network and predict the number of “lurkers” is the 90-9-1 rule. A 2010 study by Dr Michael Wu, using ten years of data from more than 200 online communities, found that:
The pie chart below presents data for 2010 and 2011 for the main discussion group.
Using this heuristic, the predicted size of the KM4Dev main discussion group is 2,420 people. Note that there is a very close correlation between Gloor’s Expediters and Wu’s Hyper-Contributors. I don’t know what the membership of the group was in 2011 and 2012, but based on the assignment brief I received, the prediction looks pretty close. What do you think?
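The arithmetic behind a 90-9-1 estimate can be sketched as follows. This uses the commonly quoted form of the rule (90% lurkers, 9% occasional contributors, 1% heavy contributors), so anyone who posts at all is assumed to be in the visible 10%. The function names and the 10% default are assumptions for illustration, not the exact calculation used for the figure above.

```python
def estimate_community_size(active_contributors: int,
                            contributing_share: float = 0.10) -> int:
    """Scale the observed posters up to a whole-community estimate.

    contributing_share is the assumed fraction of members who post at all
    (9% + 1% under the 90-9-1 rule).
    """
    if not 0 < contributing_share <= 1:
        raise ValueError("contributing_share must be in (0, 1]")
    return round(active_contributors / contributing_share)


def split_90_9_1(community_size: int) -> dict:
    """Break an estimated community size into the three 90-9-1 bands."""
    return {
        "lurkers": round(community_size * 90 / 100),
        "occasional": round(community_size * 9 / 100),
        "heavy": round(community_size * 1 / 100),
    }
```

As a purely hypothetical input, `estimate_community_size(242)` returns 2420; I picked 242 only because it reproduces the 2,420 figure under the 10% assumption, and the actual count of posters in 2010 and 2011 is not stated here.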
In my next post I will provide some time analysis, and after that we will get into the social network analysis proper. I look forward to your discussion, questions, and insights.