Curating Your Corpus for Better Analysis

From Paper Machines Wiki

  • Good results depend on curating your texts in Zotero beforehand, so that the folder structure matches the question you want to ask. For instance, Paper Machines can compare a series of folders. What should those folders be? If you want to compare French and English novels, make one folder of French novels and another of English novels. If you want to compare seven units at the World Bank, give each unit its own folder. If you are looking for change over time, it might make sense to divide your texts by decade or century, or into periods such as pre-civil-war and post-civil-war, so that the sets can be easily compared.
  • Also, because the topic modeler treats each text as a bag of words, better results may come from splitting up big PDFs. Break up full-text novels or whole World Bank reports into several documents, each a single book chapter or section.
  • Think about what the topic modeler is doing -- it looks for words that frequently appear together. By default, the topic modeler notices words that appear within the same document, but sometimes you want to see words that appear together within the same paragraph. You can change how the topic modeler works by looking in Paper Machines' drop-down menu under Topic Modeling > by Time and clicking the button marked "Advanced Settings."
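
If your texts are available as plain text, the splitting described above can be scripted before you import anything into Zotero. The following is a minimal sketch, not part of Paper Machines: the function name split_into_chapters is hypothetical, and it assumes chapter headings sit on their own line in the form "Chapter 1" or "CHAPTER IV". Adjust the pattern to match your own sources.

```python
import re

# Assumption: each chapter starts with a line such as "Chapter 1" or "CHAPTER IV".
HEADING = re.compile(r"^chapter\s+[0-9ivxlc]+\b", re.IGNORECASE | re.MULTILINE)

def split_into_chapters(text):
    """Return a list of chunks, one per chapter, plus any front matter."""
    starts = [m.start() for m in HEADING.finditer(text)]
    if not starts:
        # No headings found: keep the text as a single document.
        return [text.strip()]
    # Chunk boundaries: start of text, each heading, end of text.
    bounds = [0] + starts + [len(text)]
    pieces = [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    # Drop empty pieces (for example when the text opens with a heading).
    return [p for p in pieces if p]

if __name__ == "__main__":
    sample = "Preface\n\nChapter 1\nIt was a dark night.\n\nChapter 2\nThe sun rose."
    for number, chunk in enumerate(split_into_chapters(sample)):
        print(number, repr(chunk))
```

Each chunk can then be saved as its own file and added to Zotero as a separate item, so that the topic modeler sees chapters rather than whole books.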

Using Advanced Settings to Get Better Results

Q: Is there any way to adjust the preferences to exclude certain words?

A: You absolutely can. Right-click on the folder in question and, at the very bottom of the menu, choose "Paper Machines Preferences." The last tab is called "stop words." Here you can enter the words you wish to exclude from the analysis. Please feel free to write about your experience with stop words in the response paragraph and to bring up your adventures in class.
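
To see what a stop-word list does to a text, here is a minimal sketch of the general idea. It is an illustration only and not Paper Machines' own code: the function name remove_stop_words and the simple tokenizing rule (runs of letters and apostrophes) are assumptions made for this example.

```python
import re

def remove_stop_words(text, stop_words):
    """Lowercase the text, split it into words, and drop the stop words."""
    stop = {word.lower() for word in stop_words}
    # Assumption: a "word" is any run of letters or apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return [token for token in tokens if token not in stop]

if __name__ == "__main__":
    kept = remove_stop_words("The Bank and the report of the Bank",
                             ["the", "and", "of"])
    print(kept)
```

Words that are common everywhere in your corpus (for example "bank" in a folder of World Bank reports) are good candidates for the stop-word list, because they tend to crowd out more informative terms.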


Q: How can I improve my results?

A: Many OCR'd PDFs have terrible text recognition. Check the quality of your inputs, for instance by copying and pasting the contents of a PDF into a Word document. Can you think of a way to improve the accuracy of this OCR? Perhaps you can edit it by hand. Alternatively, perhaps you can hack your data to auto-correct common misrecognitions. The world of the Digital Humanities is full of tips and tricks -- feel free to share links here!
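
One way to auto-correct common misrecognitions is a small lookup table applied to the extracted text. The sketch below is only an illustration: the entries in CORRECTIONS are hypothetical examples rather than a vetted list, and the function name clean_ocr is made up for this page. Build your own table from the errors you actually find in your PDFs.

```python
import re

# Hypothetical examples of OCR misreadings; replace with errors from your own corpus.
CORRECTIONS = {"tbe": "the", "aud": "and", "wbich": "which"}

def clean_ocr(text, corrections=CORRECTIONS):
    """Rejoin words split across line breaks, then fix known misreadings."""
    # Rejoin "develop-\nment" into "development". Caution: this also merges
    # genuine hyphenated compounds that happen to break at a line end.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)

    def fix(match):
        word = match.group(0)
        fixed = corrections.get(word.lower())
        if fixed is None:
            return word  # not a known misreading, leave untouched
        # Preserve an initial capital letter.
        return fixed.capitalize() if word[0].isupper() else fixed

    return re.sub(r"[A-Za-z]+", fix, text)

if __name__ == "__main__":
    print(clean_ocr("Tbe bank aud tbe develop-\nment plan"))
```

Only whole words are replaced, so a word that merely contains "aud" (such as "audit") is left alone.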


Q: What else should I play with to get better results?

A: Consider reading a little about some of the other options that appear on the "Advanced" menus of Paper Machines: tf*idf filtering, stemming, Latent Dirichlet Allocation, and n-grams.
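
Two of these options are easy to try by hand. The sketch below shows one common textbook form of tf*idf (term frequency multiplied by the logarithm of the inverse document frequency) and a simple n-gram builder. There are several tf*idf variants, and this example does not claim to match the formula Paper Machines uses; the function names are made up for illustration.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tf_idf(docs):
    """docs is a list of token lists. Returns one {term: score} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

if __name__ == "__main__":
    docs = [["bank", "loan", "bank"], ["bank", "novel"]]
    for scores in tf_idf(docs):
        print(scores)
    print(ngrams(["world", "bank", "report"], 2))
```

A term that appears in every document scores zero under this formula, which is why tf*idf filtering helps push corpus-wide words out of the results.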

Navigating Inside the Topic Model Visualization

Q: I'm trying to better understand how to read the topic models generated by Paper Machines. Why are topics defined by groups of three words?

A: The three words listed up top are just the first three words of the larger "topic." Click on that topic to see more of it displayed, including the other stemmed words.