I spent two weeks at DHSI this year. In week 2 I took Liz Losh’s and Jacque Wernimont’s Feminist DH, which was incredible and which I highly recommend to everyone. Check out the #femdh stream on Twitter for details.
During week 3, I took Neal Audenaert’s Topic Modeling, in which we were introduced to R and then to using Mallet in R (following Matt Jockers’ book). I decided to try to topic model the English Revised Standard Version of the Bible because 1) I know the material and 2) it was easy to scrape.
I used 1000-character chunks (except for the teeny tiny books like Philemon and some other epistles). I chose 20 topics (which was too few, but hey, this was my first time out) and used Jockers’ stop list (which Neal gave us and I’m guessing is online somewhere). The first thing I noticed (besides needing more topics) was that the stop list needs to be expanded. Topic #13 below is basically junk because of “thee”, “thy”, etc. Thanks to Neal for the help this week!
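For anyone curious what the chunking and stop-list steps amount to, here is a minimal sketch in Python. It is not the R and Mallet code from the workshop, only an illustration under stated assumptions: the function names `chunk_text` and `remove_stopwords` are hypothetical, `BASE_STOPWORDS` is a tiny stand-in for Jockers’ list, and the archaic words beyond “thee” and “thy” are my guesses at what an expanded list might include.

```python
import re

# Stand-in for a real stop list (assumption: NOT Jockers' actual list).
BASE_STOPWORDS = {"the", "and", "of", "to", "a", "in"}

# "thee" and "thy" come from the post; the others are hypothetical additions.
ARCHAIC_STOPWORDS = {"thee", "thy", "thou", "thine", "ye"}


def chunk_text(text, size=1000):
    """Split text into consecutive chunks of at most `size` characters.

    A text shorter than `size` comes back as a single chunk, which is
    what happens with very short books. Chunks may split a word in two.
    """
    return [text[i:i + size] for i in range(0, len(text), size)]


def remove_stopwords(chunk, stopwords):
    """Lowercase a chunk, pull out word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z']+", chunk.lower())
    return [t for t in tokens if t not in stopwords]


# Expanded stop list: the base list plus the archaic pronouns.
stopwords = BASE_STOPWORDS | ARCHAIC_STOPWORDS

# Made-up sample text, repeated so that it spans several chunks.
sample = "And thou shalt love thy neighbour as thyself. " * 50
chunks = chunk_text(sample, size=1000)
docs = [remove_stopwords(c, stopwords) for c in chunks]
```

Each entry in `docs` is the token list for one chunk, which is roughly the unit a topic model treats as a document. Cutting on a fixed character count is simple but can break words at chunk boundaries; splitting on whitespace near the boundary would avoid that.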
Here are wordclouds of the top 100 words in each topic. Some make a lot of sense.