Python for Linguists
The os module is designed to circumvent a lot of these issues.
From the documentation: "This module provides a portable way of using operating system dependent functionality."
Most importantly, os lets us systematically navigate all files in a directory.
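As a minimal sketch of what this can look like in practice: the snippet below creates a throwaway directory with two empty files (the file names are invented for illustration), lists its contents with os.listdir(), and builds full paths with os.path.join(), which inserts the path separator of the current operating system.

```python
import os
import tempfile

# Create a temporary directory with two made-up files (illustration only).
demo_dir = tempfile.mkdtemp()
for name in ["text_a.txt", "text_b.txt"]:
    with open(os.path.join(demo_dir, name), "w") as f:
        f.write("")

# os.listdir() returns bare names in arbitrary order, so we sort them.
names = sorted(os.listdir(demo_dir))

# os.path.join() combines directory and file name portably.
full_paths = [os.path.join(demo_dir, name) for name in names]

for fp in full_paths:
    print(fp)
```

Note that os.listdir() returns only the names, not the full paths, which is why the os.path.join() step is needed before a file can be opened from another working directory.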
On ILIAS, there is a zipped folder containing the tagged Brown corpus, split into 15 files, one per text category. See the manual (also on ILIAS) for information about what the text categories are.
Unzip the files and save them in one directory, which should contain no other files or sub-directories. Using the os module, write a script that opens each corpus file, counts how often the word "love" occurs as a noun and as a verb, and prints the results.
Scroll down for solution.
Assuming our text files are in the folder "Brown"
import os

Brown = "Brown"  # path to the folder that holds the corpus files
corpus_files = os.listdir(Brown)
for fn in corpus_files:
    fullpath = os.path.join(Brown, fn)
    with open(fullpath, "r") as f:
        text = f.read().lower()  # lowercase to catch sentence-initial 'love'
    # str.count() counts substrings, so any tag beginning with n (or v) is included
    N = text.count("love_n")
    V = text.count("love_v")
    print(fn + "\tLOVE as NOUN: " + str(N) + "\tLOVE as VERB: " + str(V))
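One caveat of str.count() is that it counts substrings, so "love_n" would also be found inside a token such as "glove_n". The sketch below shows one possible way around this, not the course's official solution: a regular expression with a word boundary. It assumes, as the solution above does, that tokens have the form word_tag; the function name count_tagged, the sample sentence and its tags are made up for illustration.

```python
import re

def count_tagged(text, word, tag_prefix):
    """Count tokens of the form word_TAG whose tag starts with tag_prefix.

    Assumes tokens look like 'love_nn' (word, underscore, tag); adjust the
    pattern if your corpus files use a different separator.
    """
    # \b requires a word boundary before the word, so 'glove_nn' is skipped.
    pattern = r"\b" + re.escape(word) + "_" + re.escape(tag_prefix)
    return len(re.findall(pattern, text.lower()))

# Invented sample line, not taken from the corpus.
sample = "Love_nn is_bez a_at glove_nn ._. They_ppss love_vb it_ppo ._."
noun_hits = count_tagged(sample, "love", "n")
verb_hits = count_tagged(sample, "love", "v")
print(noun_hits, verb_hits)
```

Inside the loop over corpus files, count_tagged(text, "love", "n") could then take the place of text.count("love_n").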
path = '/home/User/Desktop/file.txt'
os.path.isfile(path)  # True if path points to an existing file
corpus_dir = '/home/User/Corpora/Brown'
os.path.isdir(corpus_dir)  # True if corpus_dir points to an existing directory
This is useful, e.g., for checking whether the results file we plan to write already exists, so that we avoid overwriting previous output.
In more complex tasks, there may be various intermediate steps that each produce their own output. Checking which of these are already done saves us from having to start everything from scratch.
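A minimal sketch of such a check, assuming a hypothetical results file called results.txt in a temporary directory: the file is written only if it does not exist yet, so a second run leaves the earlier output untouched. The helper name write_if_missing is made up for this example.

```python
import os
import tempfile

def write_if_missing(path, content):
    """Write content to path unless a file already exists there.

    Returns True if the file was written, False if it was skipped.
    """
    if os.path.isfile(path):
        return False  # keep the existing results
    with open(path, "w") as f:
        f.write(content)
    return True

results_dir = tempfile.mkdtemp()  # stand-in for a real output folder
results_path = os.path.join(results_dir, "results.txt")

first = write_if_missing(results_path, "first run\n")
second = write_if_missing(results_path, "second run\n")
print(first, second)
```

The same pattern works for intermediate results: test each expected output file with os.path.isfile() and recompute only the steps whose files are missing.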
On ILIAS, you will find the zipped folder ARCHER_3-2_TXT, which contains all corpus files of A Representative Corpus of Historical English Registers (ARCHER). ARCHER has its files nested in sub-directories in such a way that a simple call to os.listdir() is not enough to process all corpus texts.
Use the os.walk() function to get a list of all ARCHER filepaths.
As a challenge, create two separate lists from the one above, one for all the British and one for all the American text samples. Use list comprehensions for this task.
### Assuming the variable ARCHER contains the string that specifies
### the top-level directory:
import os

ARCHER_files = []
for root, dirs, files in os.walk(ARCHER):
    for name in files:
        ARCHER_files.append(os.path.join(root, name))

### os.sep is the path separator of the current system ("/" on Linux and macOS)
ARCHER_US = [fp for fp in ARCHER_files if (os.sep + "am_") in fp]
ARCHER_BR = [fp for fp in ARCHER_files if (os.sep + "br_") in fp]
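To see what os.walk() yields at each step, the following sketch builds a small throwaway directory tree and walks it. The folder and file names are invented and do not reflect ARCHER's actual layout; the helper all_filepaths is likewise hypothetical.

```python
import os
import tempfile

def all_filepaths(top):
    """Return a sorted list of paths to every file below top."""
    paths = []
    # os.walk() visits top and each sub-directory in turn, yielding
    # (current directory, its sub-directory names, its file names).
    for root, dirs, files in os.walk(top):
        for name in files:
            paths.append(os.path.join(root, name))
    return sorted(paths)

# Build a tiny nested tree: top/sub_a/one.txt and top/sub_b/deeper/two.txt
top = tempfile.mkdtemp()
os.makedirs(os.path.join(top, "sub_a"))
os.makedirs(os.path.join(top, "sub_b", "deeper"))
for parts in [("sub_a", "one.txt"), ("sub_b", "deeper", "two.txt")]:
    with open(os.path.join(top, *parts), "w") as f:
        f.write("")

found = all_filepaths(top)
# Paths relative to top are easier to read than the full temporary path.
relative = [os.path.relpath(fp, top) for fp in found]
print(relative)
```

Both files are found even though they sit at different depths, which is exactly what a single os.listdir() call on the top-level directory cannot do.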
Bohmann: Python for Linguists, 2023