Developing a NLP-driven research method

I am at the early stages of developing a new research strategy for analyzing historical materials. Given the enormous volume of digitized but otherwise unstructured textual data stored in online databases such as Factiva it is easy to feel overwhelmed when beginning to research a new topic. I envision a method for digesting these texts. Using named entity recognition I want to parse the text into a structured dataset of entities and their concordances. Likely, the system would have two passes. First, using named entity recognition, the system would extract relevant entities from the text and store them in a list. Second, the system would use the list to produce a dataset with each name and its concordance in the text. At the outset, it seems like the minimal aspects of the data that should be extracted for the purpose of historical research are: persons, organizations, locations, times, sources. Ideally, this data could be used to visualize networks of association or geographical transformations over time. Clearly the difficulty of the project is how to formalize relevant notions of ‘association’ and how to handle homonyms or heteronyms. Because of the nature of news articles, it is unclear what the relevance of co-occurrence in one document is. For example, being interviewed by a journalist for the same story means that two interviewees have some sort of relation to one another. Nevertheless, it is impossible to infer from this fact alone what the content of that relation is. A clearer notion of association may be accessible through the collection of multiple instances of occurrence. If two entities are often mentioned in news articles, perhaps there is an important relationship between them. I doubt that an algorithmic procedure can elucidate what the relationship is without human interpretation. However, I do believe that an algorithm can help direct a researcher’s attention to important relationships. Of course, the utility of such a procedure is threatened by the fact that many different entities share the same name, and sometimes one entity can change its name over time. These simple facts of history tend to escape the purview of computational approaches to the text. The goal though is not to use methods of digital humanities to derive research conclusions, but rather as a tool to gain entry into a new and nebulous topic

Automated Futures

Leave a Reply Cancel reply