This will be the first in what I hope to be a series of posts describing how I began dealing with my dissertation data. I am writing these posts in order to help others embark on data-driven projects, as well as to document my process.
It is important to disclose that I am new to working with data in MySQL, and that my dataset of over 3000 course sites would probably be intimidated to even a seasoned expert. But I strongly believe that we should constantly engage in work that challenges our abilities if we want to continue to grow as scholars. My committee members – Matt Gold, Sondra Perl, and David Greetham – are excellent examples of academics who never stop exploring and experimenting, and the results speak for themselves.
To begin, I received my data from Joe Ugoretz, the Associate Dean of Teaching, Learning and Technology at Macaulay Honors College. Joe happens to be an incredible visionary and an amazing mentor to me and the other Instructional Technology Fellows he leads. Joe and I worked with Boone Gorges and John Boy to figure out a way to remove any private information from the SQL dump of data generated from the backend of the Macaulay ePortfolio site. This was an important step, especially in terms of the Internal Review Board, to ensure that any information posted to the course sites that was marked as private remained so in my study. Similarly, all of the participants are identified by node number (based on when they joined the system), not by name.
If you have never seen a MySQL dump from the backend of a WordPress site, trust me when I say that is it extremely interesting, but also very messy. In the table that is generated from just one site you may find over a dozen columns of metadata, and that content may also contain difficult to export html markup. Multiply that by 3000 sites, and I have a lot of material to organize. In order to deal with this onslaught of data, I am using Sequel Pro (at the suggestion of Joe), which is essentially a dynamic content management system through which I can store, view, and manipulate my data. Note that to use this program I set up my own server using MAMP, which is a good first step for anyone embarking on digital work.
Once in Sequel Pro, the data can be exported in a variety of formats using programs such as Excel or TextWrangler. Micki Kaufman – one of my fellow Provost’s Digital Innovation Grant recipients, who I consider both a friend and a scholar of incredible capability – has been helping me sort through the next steps of this process. First, we met to look through my data and play with small sets in Gephi. In fact, Micki led an informative workshop on Gephi for her position as a Digital Fellow at the GC and used my dataset as an example. In our most recent meeting we began to experiment with answering some basic questions using the data. Our first test was to take one course site that contains posts from over a dozen users to see how frequently each user posted. We exported one set of posts as both a CSV and XML table. It turned out that XML worked better because the content of HTML broke the lines in CSV. In XML table we cleaned up the data by eliminating the html tags and spaces. To do this we used grep codes and tr. Even after cleaning the data we had some ugly rows of corrupted text, however, because of the nature of the content and the low number of the node I was able to identify that these rows were composed by the Instructional Technology Fellow assigned to the course, not one the students, and therefore could be deleted without affecting the results of my inquiry.
Before plugging the numbers into a graph on Excel, we went back to old technology: paper. We sketched out a few potential graphs on paper in order to visualize what we wanted the graph to show us. Micki calls this the “napkin sketch” and reminded me of the importance of taking a step back to avoid huge up front delays trying to clean all of the data when you should be focusing on the data you need to accomplish your goal.
The next steps went as follows:
- Converted the table to a tab delimited range of content
- Exported the table to Excel
- In Excel, turned the html field names into the column titles
- Deleted all the unnecessary columns (ie information not relevant to the inquiry
- Added a new sheet and created pivot tables of node ID, number of posts, and characters per posts
- Then we added a column in sheet 1 to calculate words per post (see formula here: http://office.microsoft.com/en-us/excel-help/count-the-number-of-words-in-a-cell-or-range-HA001034625.aspx)
- Added the words per post column to the second sheet
- Created a scatter plot chart with the node ID as points, the x – axis as the number of words, and y – axis as number of posts
This chart allowed us to visually represent the arc of activity on the course site. As one would expect, there are students who are significantly more prolific than others both in terms of number of posts and words per post, and then there are some that fall short of the median.
One small annoyance is that each node has to be entered separately in the schema, which takes a lot of time. But you can see from this shot of the work-in-progress what this kind of graph looks like using the basic Excel presets. It is also important to realize that we were playing with the data here. This example of graphing is a proof of concept to help me shape my ideas and figure out what is possible – this process won’t scale, so we must find another solution for the larger data set.
Again a huge thank you to Micki for her continued support as I work through the next stages of my research.