Thursday, March 17, 2011

Machine Reading

Bao Nguyen – 03/17/2011
ENGL146DR – Professor Rita Raley

Imperfect Reading

There's a school of thought which holds that deep, focused reading is quickly going extinct in an increasingly mechanized world, one where traditional academics merge with computerized learning to accommodate the next generation's educational evolution. There's a certain clash between this romanticized notion of reading and analysis and the move towards a machine-dominated future, but the picture is a lot less black and white. Recent developments in web and computer technology have seemingly expanded the capabilities of literary analysis, transforming a traditionally singular, subjective experience into a cold, calculated science that supposedly unravels that same experience. There's certainly something to be said for the marriage of technology and analysis, but my own experimentation has revealed that machine reading, for all of its touted efficiency and its ability to depict things that print text cannot, simply hasn't evolved to the point where it makes a considerable difference in the experience of literary reading. What I did notice, however, is the beginning of a technological trend that could make a considerable splash in this area of the humanities. That is to say, machine reading has the capability to visualize and quantify our readings, producing numbers that supposedly provide some insight into the thematic material in one form or another.

I analyzed Henry James' In the Cage using a few textual analysis tools for this machine reading project, namely the Linguistic Inquiry and Word Count (LIWC), the CLAWS part-of-speech tagger, and the Gender Genie. The first tool, LIWC, calculates the rate at which an author uses self-references, social words, positive and negative emotion words, cognitive words, articles, and 'big' words of more than six letters. I input the entirety of James' novella into the tool and received a tidy chart of the counts in each category, with wildly varying numbers between categories. While categories such as 'big' words were clearly and accurately displayed, the tool didn't seem very accurate with vaguer categories: the counts for “positive” and “negative” emotion words yielded a certain result, but it didn't seem quite right. It makes sense that the more quantifiable categories, such as 'big' words, which have a clear definition, or articles, which are a distinctive part of speech, were calculated more accurately. On the tool's informational page, one description notes that “Overall, 29 samples were based on experiments [where] people were randomly assigned to write either about deeply emotional topics (emotional writing) or about relatively trivial topics such as plans for the day (control writing)” in their accumulation of data. The researchers also note that the “LIWC2007 version captures, on average, 86 percent of the words people used in writing or speech,” intimating that a considerable margin is left unaccounted for. It seems that in terms of capturing emotion, or handling words that can be construed in more than one way, this tool provides little help in comprehending the text. Even if it could accurately spit out a number for positive versus negative words, the user only understands it on a surface level: they know the text carries some degree of emotional inflection, but that in no way provides insight into what kind of experience James wishes to communicate. More generally, the reader can learn that the work contains a certain number of words in each category, but the tool is reductive; it can only say one way or the other whether the text is positive or negative, with no middle ground. Beyond grasping the numbers behind a text, this sort of machine reading does little to let the reader experience what In the Cage, indeed any story, is about.
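It's worth being concrete about what a tool like LIWC appears to be doing under the hood: essentially a dictionary lookup, where every word is checked against fixed category lists and tallied. Below is a minimal sketch in Python of that kind of counter. The category lists here are invented stand-ins, since LIWC's real dictionaries are proprietary and run to thousands of entries per category, but the mechanism is the same in spirit.

```python
import re
from collections import Counter

# Hypothetical, tiny category lists for illustration only -- LIWC's real
# dictionaries are proprietary and far larger.
CATEGORIES = {
    "self_references": {"i", "me", "my", "mine", "myself"},
    "positive_emotion": {"love", "sweet", "happy", "nice", "good"},
    "negative_emotion": {"hate", "sad", "hurt", "bad", "ugly"},
    "articles": {"a", "an", "the"},
}

def liwc_style_counts(text):
    """Tally category hits plus 'big' words (more than six letters),
    returning raw counts and percentages of the total word count."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words) or 1  # avoid division by zero on empty input
    counts = Counter()
    for w in words:
        if len(w) > 6:
            counts["big_words"] += 1
        for category, vocab in CATEGORIES.items():
            if w in vocab:
                counts[category] += 1
    return {cat: (n, 100.0 * n / total) for cat, n in counts.items()}

sample = "I love the story, but my second reading was a sad, complicated experience."
for category, (n, pct) in sorted(liwc_style_counts(sample).items()):
    print(f"{category}: {n} ({pct:.1f}%)")
```

Seen this way, the vagueness I noticed is unsurprising: membership in a category is binary, so negation ("not happy"), irony, and ambiguous words all register exactly like sincere, literal usage.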

My experience with CLAWS and the Gender Genie also highlighted the inherent flaws in machine reading that LIWC brought up: the at-times mystifying language of computers, and a large margin of error. Inputting the entirety of In the Cage crashed my browser, so I was only able to load the first ten chapters, which resulted in a nearly endless scroll of text in which nearly every word was tagged, with a little designation of its part of speech displayed alongside. The interesting thing about my experience with this tool was that, at first, I couldn't understand the seemingly infinite abbreviations it used, so I was completely unaware of what CLAWS was displaying. Even from the outset, the tool asks for a “tagset,” which requires a bit of digging to understand. This isn't much of a problem if one is willing to search for answers, but in the realm of literary reading it is a considerable obstacle. Processing the output requires a somewhat different set of skills: literary reading has its own self-contained vernacular for describing aspects of a text, but this computerized textual analysis combines that with the language of the Internet and computers, two distinct cultural entities with their own elements. CLAWS is a single example of having to search out technical definitions to make sense of a reading, an entirely separate skill set. It can certainly be taught, but the touted efficiency of machine reading is hampered somewhat by its problematic vocabulary. In more difficult cases, this can lead to errors, which was certainly the case with the Gender Genie. This tool guesses the gender of an author based on the usage of 'masculine' and 'feminine' words, and though Henry James wrote the story (albeit from the perspective of a female clerk), the tool determined In the Cage to have been written by a woman. This is a disconnect at the most fundamental level, and it illustrates the biggest error these tools fall prey to: the simple human element. By adhering to analysis based only on factual and quantifiable aspects of the text, and by assigning arbitrary masculinity or femininity to certain words, the machine misses or completely misunderstands the text at the most basic level.
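The Gender Genie's method, as I understand it, derives from Koppel and Argamon's work on authorial gender: certain keywords carry 'masculine' or 'feminine' weights, the weights of every keyword in the text are summed, and whichever total is larger wins. Here is a minimal Python sketch of that scheme, with invented placeholder weights rather than the Genie's actual keyword list.

```python
import re

# Invented placeholder weights for illustration -- the real Gender Genie
# used a specific keyword list derived from Koppel and Argamon's research.
FEMININE = {"with": 52, "if": 47, "not": 27, "her": 9, "she": 6}
MASCULINE = {"around": 42, "what": 35, "is": 8, "the": 7, "said": 5}

def genie_style_guess(text):
    """Sum the weights of 'feminine' and 'masculine' keywords found in
    the text and guess whichever total is larger -- no grammar, no
    context, no sense of who is narrating."""
    words = re.findall(r"[a-z']+", text.lower())
    f_score = sum(FEMININE.get(w, 0) for w in words)
    m_score = sum(MASCULINE.get(w, 0) for w in words)
    return f_score, m_score, ("female" if f_score > m_score else "male")

f, m, guess = genie_style_guess(
    "She sat with the sounder, not sure if she had heard what he said."
)
print(f"feminine={f} masculine={m} -> guessed author: {guess}")
```

The sketch makes the arbitrariness visible: the verdict turns entirely on which function words and pronouns happen to appear, so a male author narrating through a female telegraphist can tip the scale without the tool ever registering the difference.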

Reading takes on an entirely new identity when put through the filter of machine analysis. There's a definite shift from the abstract and the subjective to a sort of calculated precision that the humanities isn't necessarily used to. Despite the errors and vagueness in the tools I used for this project, there's still a sense that this new mode of reading can maintain a harmonious relationship with traditional modes of reading, in the sense that, if the two were kept distinctly separate, the experience would be enriched. Literary reading produces its own pleasure of discovery and understanding, and it's laid on an unshakeable foundation that doesn't need to be modified. But after that, machine reading can provide insight into technical aspects of a piece that might otherwise be overlooked. The frequency of a word may hint at something that went unnoticed on a first reading, or the recurrence of the word “book” may provide some insight into Victorian tastes, in the case of In the Cage. Beyond the aforementioned tools, utilities such as ManyEyes, word cloud generators, and graph-based programs provide a distinctive look into aspects of a text that aren't necessarily noticed or emphasized in the traditional models. Word clouds, for example, can visualize the most-used words in a text, allowing readers to draw their own conclusions about the thematic material from whatever shows up, at least in theory (a sketch of the frequency counting behind such clouds follows at the end of this post). In execution, much like the other tools, the visualization the user gets arrives mostly without context: merely a spreadsheet, or a bunch of words grouped closely together.

The fact remains, however, that stories will be read first for their content, and that content is delivered within a certain context. The governing assumption of a book remains that a reader will pick it up and try to enjoy it, not consider ways to visualize its adjectives, for example. Machine reading must gain considerable cultural relevance, evolving to the point where its tools allow the same level of deep, focused analysis that traditional reading does while marrying it to their own exclusive abilities that amplify the experience in new ways. Integration into classrooms would further this union of tradition and experimentation, but that, again, would require a powerful tool that changes the way texts are read and analyzed. The argument can be made that machine reading, as of now, is a fringe aspect of textual analysis, a tool on the side used only for very specific purposes that don't yet apply to the general population of English students. Literary reading has maintained its staying power because it's universal across all texts; even discourse on alternative forms of media, such as television and video games, contains some literary language. Machine reading is geared towards something very particular, and until it reaches a point where it can be applied in a practical and relevant manner, it will only languish in the shadow of its own potential.
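As promised above, here is a sketch of the frequency counting that underlies a word cloud. Stripped of its typography, a cloud generator is essentially a word counter with a stopword filter; the stopword list below is a small invented sample, and the input is a stand-in snippet rather than the full text of the novella.

```python
import re
from collections import Counter

# A small invented stopword list -- real generators such as Wordle or
# ManyEyes filter a much longer list of common function words.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "that", "it",
             "was", "her", "she", "he", "his", "had", "with", "for", "at"}

def top_words(text, n=5):
    """Return the n most frequent non-stopwords: the raw material a
    word cloud sizes and arranges, with all narrative context gone."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return counts.most_common(n)

# Stand-in text; the full plain text of In the Cage would go here.
snippet = (
    "The telegraph girl in the cage counted the words of the telegram, "
    "and the words of the next telegram, and wondered at the words "
    "the lady had chosen."
)
for word, count in top_words(snippet):
    print(f"{word:>10} {count}")
```

Even here the output is exactly what I described above: a ranked list with every trace of story removed, which is why the cloud invites extrapolation but cannot supply context.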