Getting Started with Textual Data in Python 3-Part Series
Feb. 14, 2022, noon - Feb. 18, 2022, 2 p.m.
DataLab: Data Science and Informatics
This three-part workshop series covers the basics of text mining with Python. We will focus primarily on unstructured text data, discussing how to format and clean text to enable the discovery of significant patterns in collections of documents. Sessions will introduce participants to core terminology in text mining/natural language processing and will walk through different methods of ranking terms and documents. We will conclude by using these methods to classify texts and to build models of "topics." Basic familiarity with Python is required. We welcome students, postdocs, faculty, and staff from a variety of research domains, ranging from health informatics to the humanities. This workshop occurs during UC Love Data Week, and all members of the University of California system are welcome to register.
Workshop dates are February 14, February 16, and February 18, 2022, 12:00 PM–2:00 PM.
By the end of this series, you will be able to:
- Prepare textual data for analysis using a variety of cleaning processes
- Recognize and explain how these cleaning processes impact research findings
- Explain key terminology in text mining, including "tokenization," "n-grams," "dependency parsing," and "stylometry"
- Use special data structures such as document-term matrices to efficiently analyze multiple texts
- Use statistical measures (pointwise mutual information, tf-idf) to identify significant patterns in text
- Classify texts on the basis of such measures
- Produce statistical models of "topics" from a collection of texts
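As a rough preview of several terms in the list above, the sketch below tokenizes a tiny corpus, builds bigrams, assembles a document-term matrix, and computes one common form of tf-idf. It is only an illustration, not the workshop's own material: the three-sentence corpus and the helper names `tokenize` and `ngrams` are hypothetical, and the weighting assumed here (raw count times log(N / document frequency)) is one of several variants; libraries often apply smoothed versions.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus, invented for illustration only.
docs = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "Cats and dogs are pets.",
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def ngrams(tokens, n=2):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokenized = [tokenize(doc) for doc in docs]
bigrams = [ngrams(tokens, 2) for tokens in tokenized]

# Vocabulary: every distinct token in the corpus, in sorted order.
vocab = sorted({token for tokens in tokenized for token in tokens})

# Document-term matrix: one row per document, one column per vocabulary term.
counts = [Counter(tokens) for tokens in tokenized]
dtm = [[count[term] for term in vocab] for count in counts]

# Assumed tf-idf variant: tf is the raw count, idf is log(N / document frequency).
n_docs = len(docs)
doc_freq = {term: sum(1 for count in counts if term in count) for term in vocab}
tfidf = [
    [count[term] * math.log(n_docs / doc_freq[term]) for term in vocab]
    for count in counts
]

# Print the three highest-weighted terms for each document.
for i, row in enumerate(tfidf):
    top = sorted(zip(vocab, row), key=lambda pair: pair[1], reverse=True)[:3]
    print(i, [(term, round(score, 2)) for term, score in top])
```

Note that "cat" and "cats" land in separate columns here because no stemming or lemmatization is applied, which is one small instance of how cleaning choices shape the resulting counts.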
Instructors will distribute a zipped directory of notebooks and files the week prior to the workshop. Participants are required to load this data into their Google Drive account before our first session. We also ask that participants read the first two sections of the workshop reader in advance to prepare for the series.
In addition to this prep work, a basic knowledge of Python is required. Specifically, participants should be able to:
- Load text data into Python
- Load Python libraries
- Work with different Python data structures (strings, lists, dictionaries)
- Implement control flow with for loops
- Use Pandas dataframes (primarily: indexing and subsetting)
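Participants unsure whether they meet these prerequisites might use a short script like the one below as an informal self-check. It is a sketch rather than an official screening exercise: the sample sentence, the temporary file `sample.txt`, and all variable names are hypothetical, and it assumes pandas is installed (as it is by default in Google Colab).

```python
import tempfile
from pathlib import Path

import pandas as pd  # loading a third-party library

# Write a hypothetical sample file so the example is self-contained.
tmp_dir = Path(tempfile.mkdtemp())
path = tmp_dir / "sample.txt"
path.write_text(
    "It was the best of times, it was the worst of times.", encoding="utf-8"
)

# Load text data into Python.
text = path.read_text(encoding="utf-8")

# Strings and lists: normalize case, strip punctuation, split into words.
words = text.lower().replace(",", "").replace(".", "").split()

# Dictionaries and for loops: count how often each word occurs.
word_counts = {}
for word in words:
    word_counts[word] = word_counts.get(word, 0) + 1

# Pandas dataframes: build a table, then index and subset it.
df = pd.DataFrame(
    {"word": list(word_counts), "count": list(word_counts.values())}
)
frequent = df[df["count"] > 1]   # subsetting rows with a boolean mask
first_word = df.loc[0, "word"]   # label-based indexing of a single cell

print(frequent)
print(first_word)
```

If each step here reads as familiar, the level of Python assumed by the series should be comfortable.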
Instructors: Tyler Shoemaker, Carl Stahmer
Tyler Shoemaker is a Postdoctoral Scholar at the DataLab, where he develops and implements methods for text analysis and natural language processing across a variety of research projects, ranging from the digital humanities to environmental and health sciences.
Carl Stahmer is a digital humanist. He is the Executive Director of UC Davis DataLab and Professor of English. He leverages his expertise as a computer programmer and system architect to tackle complex problems in the humanities and beyond.
Python; Google Colab (instructors will provide notebooks and data)