The os module

Exercise

On ILIAS, there is a zipped folder containing the tagged Brown corpus, split into 15 files, one per text category. See the manual (also on ILIAS) for information about what the text categories are.

Unzip the files and save them in one directory, which should contain no other files or sub-directories. Using the os module, write a script that opens each corpus file and counts how often the word "love" occurs as a noun and how often as a verb. Print the results for each file.

Scroll down for solution.

Assuming our text files are in the folder "Brown":

import os

Brown = "Brown"  # path to the folder with the unzipped corpus files
corpus_files = os.listdir(Brown)
for fn in corpus_files:
    fullpath = os.path.join(Brown, fn)
    with open(fullpath, "r") as f:
        # Lowercase to catch sentence-initial 'Love'.
        text = f.read().lower()
    N = text.count("love_n")  # 'love' followed by a noun tag
    V = text.count("love_v")  # 'love' followed by a verb tag
    print(fn + "\tLOVE as NOUN: " + str(N) + "\tLOVE as VERB: " + str(V))
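One caveat of the solution above: str.count() matches substrings, so a token such as "glove_NN" is also counted as "love_n". The following is a minimal sketch of a stricter count, assuming (as the solution implies) that tokens have the form word_TAG, are separated by whitespace, and that noun tags begin with "n" and verb tags with "v" once lowercased. The function name count_love and the sample sentence are made up for illustration.

```python
import re

# "love_n" / "love_v" only when the token starts at the beginning of the
# text or right after whitespace (so "glove_nn" is not matched).
NOUN = re.compile(r"(?<!\S)love_n")
VERB = re.compile(r"(?<!\S)love_v")

def count_love(text):
    """Return (noun_count, verb_count) for the word 'love' in a tagged text."""
    text = text.lower()  # catch sentence-initial 'Love'
    return len(NOUN.findall(text)), len(VERB.findall(text))

# Invented sample in the assumed word_TAG format:
sample = "I_PPSS love_VB my_PP$ glove_NN ._. Love_NN is_BEZ love_NN ._."
nouns, verbs = count_love(sample)
print("LOVE as NOUN: " + str(nouns) + "\tLOVE as VERB: " + str(verbs))
```

In the loop of the solution, count_love(f.read()) could replace the two text.count() calls. Whether the difference matters depends on how many words ending in "love" the corpus contains.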

Dealing with unwanted files

Sometimes we do not want to process every file in a directory, e.g. when the folder also contains the corpus manual.

Solution: filter the file list with a list comprehension:

import os

MyCorpus = "MyCorpus"  # path to the corpus folder (placeholder)
filenames = os.listdir(MyCorpus)
# Only process ".txt" files:
corpus_files = [x for x in filenames if x.endswith(".txt")]
# Or: only files that are named in a certain pattern,
# e.g. that start with the corpus name:
corpus_files = [x for x in filenames if x.startswith("BROWN")]
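The two filters can also be combined, and os.path.isfile() can be used to skip sub-directories that happen to match the pattern. A minimal sketch; the function name select_corpus_files and its parameters are hypothetical, not part of the os module.

```python
import os

def select_corpus_files(directory, prefix="", suffix=".txt"):
    """Return full paths of regular files in directory whose names
    start with prefix and end with suffix, in alphabetical order."""
    selected = []
    for name in sorted(os.listdir(directory)):
        fullpath = os.path.join(directory, name)
        if (os.path.isfile(fullpath)          # skip sub-directories
                and name.startswith(prefix)   # "" matches every name
                and name.endswith(suffix)):
            selected.append(fullpath)
    return selected
```

For example, select_corpus_files(MyCorpus, prefix="BROWN") would return only the ".txt" files whose names start with "BROWN", already joined with the folder name so that they can be passed to open() directly.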

Checking whether a file or directory exists

import os

path = '/home/User/Desktop/file.txt'
os.path.isfile(path)       # True if path is an existing file
corpus_dir = '/home/User/Corpora/Brown'
os.path.isdir(corpus_dir)  # True if corpus_dir is an existing directory

This is useful, e.g. for checking whether the results file we plan to write already exists, so that we do not overwrite earlier results.

In more complex tasks, there may be various intermediate steps that each produce their own output. Checking which of these outputs already exist saves us from having to start everything from scratch.
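One way to avoid overwriting earlier results is to pick a file name that is not yet in use. A minimal sketch, assuming a numbered suffix (results_1.txt, results_2.txt, ...) is an acceptable naming scheme; the function name unused_path is hypothetical.

```python
import os

def unused_path(path):
    """Return path itself if nothing exists there yet; otherwise insert
    _1, _2, ... before the file extension until the name is free."""
    if not os.path.exists(path):
        return path
    stem, ext = os.path.splitext(path)  # "results.txt" -> ("results", ".txt")
    n = 1
    while os.path.exists(stem + "_" + str(n) + ext):
        n += 1
    return stem + "_" + str(n) + ext
```

A results file could then be opened with open(unused_path("results.txt"), "w"). Alternatively, open(path, "x") raises a FileExistsError instead of overwriting when the file is already there.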

Walking through sub-directories

Sometimes, our data have a more complicated structure.
E.g.: a folder ICE with one sub-folder for each national corpus of the ICE family, each of which has one sub-folder for each of the ICE text categories.
Ideally, we want to start from the top-level directory and get at all the .txt files at the lower levels.

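import os

ICE = '/home/User/Corpora/ICE'
ICE_files = []
# os.walk visits ICE and every directory below it; for each one it
# yields the directory path, its sub-directories and its files.
for root, dirs, files in os.walk(ICE):
    for name in files:
        ICE_files.append(os.path.join(root, name))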
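The loop above collects every file it finds. If the folders also hold other material, the walk can be restricted to .txt files, and the files can be grouped by the first-level sub-folder they belong to (here: the national corpus). A minimal sketch; the function name txt_files_by_subcorpus is hypothetical, and it assumes that the first level below the top directory is the grouping we care about.

```python
import os

def txt_files_by_subcorpus(top, suffix=".txt"):
    """Map each first-level sub-folder of top to the list of matching
    files found anywhere below it. Files directly in top are stored
    under the key "."."""
    groups = {}
    for root, dirs, files in os.walk(top):
        # Path of the current directory relative to top, e.g. "GB/S1A"
        # ("." for top itself); its first component names the sub-corpus.
        rel = os.path.relpath(root, top)
        subcorpus = rel.split(os.sep)[0]
        for name in sorted(files):
            if name.endswith(suffix):
                groups.setdefault(subcorpus, []).append(os.path.join(root, name))
    return groups
```

Calling txt_files_by_subcorpus(ICE) would return a dictionary whose keys are the names of the national-corpus folders, which makes it easy to process or count each sub-corpus separately.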
