Getting Started with Textual Data in Python: 3-Part Series

Feb. 14, 2022, noon - Feb. 18, 2022, 2 p.m.

Organizer -

DataLab: Data Science and Informatics

Contact -

datalab-training@ucdavis.edu

Location -

Zoom

Description

This three-part workshop series covers the basics of text mining with Python. We will focus primarily on unstructured text data, discussing how to format and clean text to enable the discovery of significant patterns in collections of documents. Sessions will introduce participants to core terminology in text mining/natural language processing and will walk through different methods of ranking terms and documents. We will conclude by using these methods to classify texts and to build models of "topics." Basic familiarity with Python is required. We welcome students, postdocs, faculty, and staff from a variety of research domains, ranging from health informatics to the humanities. This workshop occurs during UC Love Data Week, and all members of the University of California system are welcome to register.

Workshop dates are February 14, February 16, and February 18, 2022, 12:00 PM–2:00 PM.

Learning Objectives

By the end of this series, you will be able to:

Prepare textual data for analysis using a variety of cleaning processes
Recognize and explain how these cleaning processes impact research findings
Explain key terminology in text mining, including "tokenization," "n-grams," "dependency parsing," and "stylometry"
Use special data structures such as document-term matrices to efficiently analyze multiple texts
Use statistical measures (pointwise mutual information, tf-idf) to identify significant patterns in text
Classify texts on the basis of such measures
Produce statistical models of "topics" from/about a collection of texts.
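
For a rough sense of what several of these objectives look like in practice, here is a minimal sketch in plain Python. It is not taken from the workshop notebooks: the toy corpus, the function names (tokenize, ngrams, tf_idf), and the particular tf-idf formula (relative term frequency times the natural log of inverse document frequency) are all assumptions made for illustration, and other tf-idf variants exist.

```python
import math
import re
from collections import Counter

# Toy corpus invented for illustration; not part of the workshop materials.
docs = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "Cats and dogs are pets.",
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def ngrams(tokens, n=2):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokenized = [tokenize(doc) for doc in docs]

# Vocabulary: every distinct token in the corpus, in alphabetical order.
vocab = sorted({token for tokens in tokenized for token in tokens})

# Document-term matrix: one row per document, one column per vocabulary term,
# each cell holding the raw count of that term in that document.
counts = [Counter(tokens) for tokens in tokenized]
dtm = [[count[term] for term in vocab] for count in counts]


def tf_idf(matrix):
    """Weight raw counts by one simple tf-idf variant (an assumption here)."""
    n_docs = len(matrix)
    n_terms = len(matrix[0])
    # Document frequency: in how many documents does each term appear?
    doc_freq = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    weighted = []
    for row in matrix:
        total = sum(row)  # number of tokens in this document
        weighted.append([
            (cell / total) * math.log(n_docs / doc_freq[j])
            for j, cell in enumerate(row)
        ])
    return weighted


weights = tf_idf(dtm)
```

In this toy corpus, "the" occurs twice in the first document but also appears in the second, so its weight in the first document ends up lower than that of "cat", which occurs only once but in no other document. That is the sense in which measures like tf-idf surface terms that are distinctive rather than merely frequent.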

Prerequisites

Instructors will distribute a zipped directory of notebooks and files the week prior to the workshop. Participants are required to load this data into their Google Drive account before our first session. We also ask that participants read the first two sections of the workshop reader in advance to prepare for the series.

In addition to this prep work, a basic knowledge of Python is required. Specifically, participants should be able to:

Load text data into Python
Load Python libraries
Work with different Python data structures (strings, lists, dictionaries)
Implement control flow with for loops
Use Pandas dataframes (primarily: indexing and subsetting)
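
As an informal self-check, the sketch below touches most of the skills listed above using only the Python standard library. It is an illustration rather than an official readiness test: the sample text is made up, an in-memory buffer stands in for a real file, and the Pandas item is left out to keep the example free of third-party dependencies.

```python
import io
import string  # loading a standard-library module

# Stand-in for a text file on disk. With a real file you would instead call
# open("some_file.txt", encoding="utf-8"), where the file name is hypothetical.
fake_file = io.StringIO("It was the best of times,\nit was the worst of times.\n")

# Load the text and work with strings and lists.
text = fake_file.read().lower()
text = text.translate(str.maketrans("", "", string.punctuation))
words = text.split()

# Use a for loop and a dictionary to count how often each word occurs.
freq = {}
for word in words:
    freq[word] = freq.get(word, 0) + 1

# Subset the dictionary: keep only the words that occur more than once.
repeated = {word: n for word, n in freq.items() if n > 1}
```

If each step here reads comfortably, the Python prerequisites other than Pandas are likely covered.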

Instructors: Tyler Shoemaker, Carl Stahmer

Instructors’ Biographies

Tyler Shoemaker is a Postdoctoral Scholar at the DataLab, where he develops and implements methods for text analysis and natural language processing across a variety of research projects, ranging from the digital humanities to environmental and health sciences.

Carl Stahmer is a digital humanist. He is the Executive Director of UC Davis DataLab and Professor of English. He leverages his expertise as a computer programmer and system architect to tackle complex problems in the humanities and beyond.

Software

Python; Google Colab (instructors will provide notebooks and data)