Working with JSTOR Data for Research

From Paper Machines Wiki

[JSTOR Data for Research] provides basic bibliographic information and word counts for a large sample of the texts from their collection. This data can be fed into machine learning algorithms such as the topic models included in [MALLET]. At present, only the topic modeling functionality within Paper Machines supports DfR input.
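As a rough illustration of what "word counts" means as model input, the sketch below reads a two-column CSV of words and counts into a bag-of-words structure. The column layout, the header names, and the `parse_word_counts` helper are assumptions made for this example; check them against the files in your own DfR download. This is not how Paper Machines or MALLET reads the data internally.

```python
import csv
import io
from collections import Counter


def parse_word_counts(csv_text, has_header=True):
    """Turn a two-column word,count CSV into a Counter (a bag of words)."""
    reader = csv.reader(io.StringIO(csv_text))
    if has_header:
        next(reader, None)  # skip the header row, if there is one
    counts = Counter()
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed lines
        word, count = row[0].strip(), row[1].strip()
        if word and count.isdigit():
            counts[word] += int(count)  # repeated words are summed
    return counts


# Hypothetical sample data; a real DfR file may use different column names.
sample = "WORDCOUNTS,WEIGHT\nhistory,12\narchive,7\nhistory,3\n"
bag = parse_word_counts(sample)
```

A topic model only needs these per-document counts, not the word order, which is why DfR data can stand in for full text here.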

To obtain and analyze a DfR dataset:

  1. You must first [register] for a DfR account. You can then search for articles based on keywords, years of publication, specific journals, and so on. Note that if your query returns more than 1,000 results, you will receive a random sample of 1,000 documents.
  2. Once the query has been refined to your liking, go to the Dataset Requests menu at the upper right and click "Submit New Request."
  3. Check the "Citations" and "Word Counts" boxes, select CSV output format, and enter a short job title that describes your query. (Screenshot: Jstor dfr options.png)
  4. Once you click "Submit Job", you will be taken to a history of your submitted requests. You will be e-mailed once the dataset is complete.
  5. Click "Download (#### docs)" in the Full Dataset column, and a zip file timestamped with the request time will be downloaded. This file (or several files from related queries, e.g., several searches divided up by decade) may then be incorporated into a model.
  6. Paper Machines typically operates only on Zotero collections with full-text documents. To use DfR datasets, the easiest method is to create a new folder in Zotero, add one empty text note, and then run "Extract Text" from the Paper Machines context menu. This in effect "tricks" the software into thinking you have a suitable full-text collection for analysis.
  7. Once the collection is extracted, you can create a topic model by opening the context menu and selecting Topic Modeling -> "By Time (With JSTOR DFR)." If you select multiple zip files, they will be merged and duplicates discarded before analysis begins.
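The merge-and-deduplicate behavior described in the last step can be pictured with a minimal sketch like the one below. It is not Paper Machines' own code. It assumes that each zip stores one file per document under a `wordcounts/` folder and that a document appearing in two downloads has the same file name in both; the `collect_documents` and `_make_zip` helpers and the sample file names are hypothetical.

```python
import io
import zipfile


def collect_documents(zip_sources, prefix="wordcounts/"):
    """Gather per-document files from several zips, keeping the first copy of each name."""
    documents = {}
    for source in zip_sources:
        with zipfile.ZipFile(source) as archive:
            for name in archive.namelist():
                if not name.startswith(prefix) or name.endswith("/"):
                    continue  # skip directories and files outside the assumed folder
                if name not in documents:
                    # A name already seen in an earlier zip is treated as a duplicate.
                    documents[name] = archive.read(name)
    return documents


def _make_zip(members):
    """Build a small in-memory zip so the example runs without real downloads."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, text in members.items():
            archive.writestr(name, text)
    buffer.seek(0)
    return buffer


# Two hypothetical downloads that overlap on doc_b.
first = _make_zip({"wordcounts/doc_a.csv": "w,1\n", "wordcounts/doc_b.csv": "x,2\n"})
second = _make_zip({"wordcounts/doc_b.csv": "x,2\n", "wordcounts/doc_c.csv": "y,3\n"})
merged = collect_documents([first, second])
```

With real downloads, pass the zip file paths instead of the in-memory buffers. Keying on the file name keeps the sketch simple; a more careful version might compare document identifiers from the citations file instead.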

Be warned: the analysis may run for a considerable amount of time (roughly 15-30 minutes) before it begins to show progress.