Text mining with the “southern problem” and other stand-ins

Today I was introduced to Google N-Gram, Bookworm, Voyant, Overview, and MALLET in the context of data mining and topic modeling. We were asked to write a post about how this “distant reading” might inform future projects. My next research project is a history of Louisiana State Penitentiary (known as Angola Prison), and frankly text mining does not seem like it would be a particularly useful tool as I begin my history of this penal system. Much of the work we did yesterday on mapping, particularly geocoding and georectifying, strikes me as a far more useful set of tools for this new project.

Angola Prison, nicknamed “the Farm,” is an 18,000 acre prison located in Angola, Louisiana (I have also seen it listed as Tunica, Louisiana). In 1880 Confederate Major Samuel James (who held a monopoly on convict leasing) purchased an 8,000 acre plantation in West Feliciana Parish and named it Angola after the place many of the slaves in that area had come from. James housed convicts in the old slave quarters on this plantation. In 1894 the Major died and his son assumed control of his extremely profitable convict labor system. However, Progressive reformers drew attention to the horrors of the convict lease system, and in the face of extreme public pressure the state abolished leasing and took control of the penal system in 1901. At this point the Board of Control ran the Louisiana penal system (at least until 1916, when the legislature began appointing individuals to head the penitentiary system) and immediately purchased this 8,000 acre plot from the James family. Later, in 1922, the prison purchased an additional 10,000 acres of land, bringing the total size to 18,000 acres.

I am looking forward to creating a series of maps, working out how to georectify the topography, and charting how the ownership and size of the prison shifted over time. There were also large-scale floods, including in 1903, 1912, and 1922 (and in the aftermath of Hurricane Katrina), that I imagine destroyed or changed the landscape of the prison in significant ways as well.

The data mining tools are far more relevant to my book The Problem South: Region, Empire, and the New Liberal State, 1880-1930. I self-identify as a cultural/intellectual historian, and The Problem South explores ideas (discourse) about the “southern problem” in the late nineteenth and early twentieth centuries and the way in which the identification of southern backwardness and regional deficiencies contributed to the development of liberalism and the consolidation of the regulatory state. This discourse peaked in the first decade of the twentieth century, although there was evidence of interest in the “southern problem,” “Negro problem,” or “race problem” preceding and following this decade. Oftentimes these phrases were substituted for one another.

My sense is that most of the text mining tools I learned about today were not available in 1999 when I began research on this book. Knowledge of these tools in the mid to late “aughts” might have amplified my conclusions, although much of the research was already completed by hand in an extremely laborious fashion. A substantial portion of my book involved using published sources, many of which are digitized now. However, at the time I read cover to cover more than 30 popular and academic magazines/journals (such as Nation, Century Magazine, Independent, Literary Digest, Outlook, American Journal of Sociology, etc.) published between 1880 and 1930. My library retrieved large sets of bound volumes that I went through by hand (and could not take home with me). This involved hours and hours and hours of flipping through thousands of pages. I would xerox all relevant articles and then file these hard copies in labeled folders. I also used WCat to identify any and all books published on the US South between the end of Reconstruction and the 1930s (despite the title of my book suggesting an endpoint of 1930). In addition, I augmented my research with an array of manuscript collections of individuals and institutions involved in rehabilitating the Problem South (poring over correspondence in particular). With more time, experience with scripting languages, and an understanding of how to fashion sophisticated algorithms for data mining and scraping (I love that term!), I wonder what else I would have discovered.

On the most basic level I can confirm that interest in the “southern problem” and “Negro problem” peaked in this period, especially in that first decade of the twentieth century. I could also have N-grammed “race problem,” “race question,” “southern question,” and “Negro question.” But I started with four. Here is what I discovered, which tells me I was on the right track analytically. Remember that although the phrase “southern problem” seems incidental in this N-gram, there is an uptick beginning in 1885, and many times the phrases “Negro problem,” “Negro question,” “race problem,” and “race question” could stand in for the “southern problem.”

Ngram of southern problem
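
For anyone who wants to try the same kind of comparison on their own digitized sources rather than on the Google Books corpus, here is a minimal Python sketch of the idea behind an N-gram chart: count how often each phrase appears per year and scale by the number of words from that year. It is only an illustration under stated assumptions. The `(year, text)` input format, the function name `phrase_frequencies`, and the phrase list are hypothetical choices, and this is not how Google N-Gram itself is implemented.

```python
import re
from collections import defaultdict

# Illustrative phrase list taken from the post; not necessarily the four charted above.
PHRASES = ["southern problem", "negro problem", "race problem", "race question"]


def phrase_frequencies(documents, phrases=PHRASES):
    """Return {phrase: {year: occurrences per million words}}.

    `documents` is assumed to be an iterable of (year, text) pairs,
    for example one pair per digitized magazine article.
    """
    word_totals = defaultdict(int)
    hits = {phrase: defaultdict(int) for phrase in phrases}

    for year, text in documents:
        # Lowercase and keep only letter runs, so punctuation and
        # line breaks do not hide a phrase.
        normalized = " ".join(re.findall(r"[a-z']+", text.lower()))
        word_totals[year] += len(normalized.split())
        for phrase in phrases:
            pattern = r"\b" + re.escape(phrase) + r"\b"
            hits[phrase][year] += len(re.findall(pattern, normalized))

    # Scale raw counts by the size of each year's text, so years with
    # more surviving material do not automatically look like peaks.
    return {
        phrase: {
            year: hits[phrase][year] * 1_000_000 / word_totals[year]
            for year in sorted(word_totals)
            if word_totals[year] > 0
        }
        for phrase in phrases
    }


if __name__ == "__main__":
    # Tiny made-up sample, only to show the calling convention.
    sample = [
        (1900, "The southern problem is the Negro problem."),
        (1910, "Race problem, race problem!"),
    ]
    for phrase, by_year in phrase_frequencies(sample).items():
        print(phrase, by_year)
```

Because the stand-in phrases were so often substituted for one another, it may also be worth summing their per-year values into a single combined series alongside the individual ones.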


