Week 7: The Dangerous Art of Text Mining: A Methodology for Digital History

Text Mining for the Researcher: The Main Point

The Dangerous Art of Text Mining: A Methodology for Digital History, published in September 2023 by Jo Guldi, explores the use and understanding of text mining in historical research and analysis. Guldi sets out to show how text mining, which at its core means counting words, helps us understand the frequency of language, and what dangers remain in analysis carried out with digital tools. The main theme of her work is a map for how researchers can take a digital, quantitative approach to history that does not merely produce novel interpretations, but instead adds a robustly accurate, original, and profound dimension to this complex discipline.

The Distinctiveness of Certain Eras

One of the chapters that I found most interesting was Chapter 8 in Part II, The Distinctiveness of Certain Eras. In this chapter, Guldi wants to understand whether the development of language can be traced by the computing machines used for text mining: can a computer also discern and describe the differences of individual blocks of time (Guldi, 229)?

Figure 8.3: Temporally adjusted tf-idf, or tf-ipf. Found on page 238.

Guldi evaluates this question by analyzing the language of the British Parliament, looking for short-term change with long-term implications. She introduces a measure called "term frequency-inverse document frequency" (tf-idf), which ranks term-document combinations, and a temporally adjusted version of it, tf-ipf, which ranks terms against periods of time to find regularity in periods of verbiage (Guldi, 236-237). When the period is set to "day," for example, the tf-ipf algorithm assigns its highest rankings to words that are frequent on a given day but uncommon on other days. Applied to the 19th-century British Parliament, this lets the reader summarize and understand the views taken by progressive and conservative politicians. Specifically, she notes how the discussion of nations like Ireland and India in the manuscripts helps historians understand British global affairs and the intensity of the debates surrounding the two minor nations (Guldi, 253-254).
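To make the idea concrete, here is a minimal sketch of a tf-ipf style score in Python. It assumes the textbook tf-idf formula with each period standing in for a "document"; the exact weighting in Guldi's Figure 8.3 may differ. The function names and the three toy "decades" of text are invented for illustration and are not drawn from her corpus.

```python
import math
from collections import Counter


def tf_ipf(periods):
    """Score every (period, term) pair.

    periods: dict mapping a period label to a list of word tokens.
    Assumes tf = share of the period's tokens, and
    ipf = log(number of periods / number of periods containing the term).
    """
    n_periods = len(periods)
    counts = {label: Counter(tokens) for label, tokens in periods.items()}

    # In how many periods does each term appear at least once?
    period_freq = Counter()
    for term_counts in counts.values():
        period_freq.update(term_counts.keys())

    scores = {}
    for label, term_counts in counts.items():
        total = sum(term_counts.values())
        for term, n in term_counts.items():
            tf = n / total
            ipf = math.log(n_periods / period_freq[term])
            scores[(label, term)] = tf * ipf
    return scores


def top_terms(scores, label, k=3):
    """Return the k highest-scoring (term, score) pairs for one period."""
    in_period = [(term, s) for (lab, term), s in scores.items() if lab == label]
    return sorted(in_period, key=lambda pair: pair[1], reverse=True)[:k]


# Toy text, invented for illustration only.
periods = {
    "1870s": "the house debated the land bill".split(),
    "1880s": "the house debated boycotting the land league condemned boycotting".split(),
    "1890s": "the house debated the budget".split(),
}

scores = tf_ipf(periods)
print(top_terms(scores, "1880s"))
```

In this toy data, "the" and "boycotting" are equally frequent in the 1880s, but "the" appears in every period and so scores zero, while "boycotting" appears only in the 1880s and ranks first. That is the sense in which the score rewards distinctiveness rather than raw frequency.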

By engaging with the mathematics of distinction, Guldi pinpoints the material most useful for understanding when forces of reason were at work in the past, and through it the social dynamics unfolding in discourse. The unique thing about the tf-ipf model, however, is that it models distinction as well, rather than just frequency.
Found on page 243.


Guldi asks the question, "If tf-ipf truly measures 'significance' of a term within a period, how can a highly ranked word turn out to have been quite scarce in its period?" (Guldi, 242). The answer is found by going back to the raw data. For example, the word "boycotting" is not used frequently in 19th-century manuscripts, but it is distinctive to its moment, encapsulating a particular tendency and understanding, which gives it the highest ipf score (Guldi, 243).
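The puzzle of a scarce word ranking highly is easier to see when the ipf component is looked at on its own. The sketch below uses made-up counts per decade (they are not figures from Guldi's corpus) and assumes the standard logarithmic form of inverse period frequency. It compares a ranking by raw count with a ranking by ipf.

```python
import math

# Hypothetical counts per decade, invented purely for illustration.
counts = {
    "government": {"1870s": 900, "1880s": 950, "1890s": 920},
    "ireland": {"1870s": 300, "1880s": 600},
    "boycotting": {"1880s": 40},
}
n_periods = 3


def ipf(term):
    # Inverse period frequency: log(total periods / periods where the term occurs).
    return math.log(n_periods / len(counts[term]))


period = "1880s"
by_raw = sorted(counts, key=lambda t: counts[t].get(period, 0), reverse=True)
by_ipf = sorted(counts, key=ipf, reverse=True)

print("ranked by raw count in", period, ":", by_raw)
print("ranked by ipf:", by_ipf)
```

With these invented numbers the two rankings are exact opposites: "government" is the most frequent word in the 1880s but appears in every decade, so its ipf is zero, while "boycotting" is the rarest but appears in only one decade, so its ipf is the highest. How far that carries into a final ranking depends on how term frequency is weighted against ipf, which this sketch leaves out.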

The tf-ipf model is unique to me in that it doesn't just look at the frequency of words; it also registers the demographics and intensity surrounding a discussion within a larger historical context. A tool like this would be extremely beneficial in evaluating history from the top, given the lack of sources from "below," that is, from the lower classes.

References

Guldi, Jo. The Dangerous Art of Text Mining: A Methodology for Digital History. Cambridge University Press, 2023.
