The os module

Python for Linguists

Axel Bohmann

Navigating directories is hard

  • If we have more than one file, it is tedious to repeat lengthy directory paths
  • Different conventions between operating systems make sharing code error-prone
  • Those annoying backslashes on Windows…

import os!

The os module is designed to circumvent a lot of these issues.

From the documentation: "This module provides a portable way of using operating system dependent functionality."

Most importantly, os lets us systematically navigate all files in a directory.
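
A minimal sketch of what "portable" means in practice: the same three calls run unchanged on Linux, macOS and Windows, and only the values differ. The variable names are chosen for illustration only.

```python
import os

# The separator Python uses for paths on the current operating system:
# "/" on Linux and macOS, "\\" on Windows.
separator = os.sep

# The directory the script is currently running in.
here = os.getcwd()

# Names of all files and sub-directories in that directory
# (the order of the list is arbitrary).
entries = os.listdir(here)

print("Path separator:", separator)
print("Working directory:", here)
print("Number of entries:", len(entries))
```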

Processing an entire corpus

  • Get a list of all file names
import os
SOU = "/home/axel/Corpora/SOU"
filenames = os.listdir(SOU)
  • Read in and do something with each file:
for fn in filenames:
    with open(SOU + "/" + fn, "r") as f:
        text = f.read()
        freedomcount = text.count("freedom")
        print(fn + " includes " + str(freedomcount) + " mentions of 'freedom'.")

Joining files and parent directories

  • In the previous example, we used SOU + "/" + fn to create the full path to our corpus files. Print the results.
  • This is suboptimal: It is tedious to write and creates a string that will not work for all operating systems.
  • The solution: os.path.join()
for fn in filenames:
    fullpath = os.path.join(SOU, fn)
    with open(fullpath, "r") as f:
        text = f.read()
        freedomcount = text.count("freedom")
        print(fn + " includes " + str(freedomcount) + " mentions of 'freedom'.")
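
To see why os.path.join() is the portable choice, it can help to look at both flavours side by side. On Linux and macOS, os.path is the posixpath module; on Windows, it is ntpath. Both can be imported directly on any system. This is only a sketch: the file name and the Windows directory are invented for illustration.

```python
import ntpath      # the Windows flavour of os.path
import posixpath   # the Linux/macOS flavour of os.path

# Hypothetical file name, used only for illustration.
fn = "speech_01.txt"

# The same call, once with Linux conventions and once with Windows conventions.
linux_path = posixpath.join("/home/axel/Corpora/SOU", fn)
windows_path = ntpath.join("C:\\Corpora\\SOU", fn)

print(linux_path)    # /home/axel/Corpora/SOU/speech_01.txt
print(windows_path)  # C:\Corpora\SOU\speech_01.txt
```

In a real script we simply write os.path.join() and let Python pick the right flavour.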

Let’s practice

Exercise

On ILIAS, there is a zipped folder containing the tagged Brown corpus, separated into 15 files, one each per corpus text category. See the manual (also on ILIAS) for information about what the text categories are.

Unzip the files and save them in one directory, which should contain no other files or sub-directories. Using the os module, write a script that opens each corpus file and counts how often the word "love" occurs as a noun and a verb. Print results.

Scroll down for solution.


Assuming our text files are in the folder "Brown":

import os
Brown = "Brown"  # path to the folder with the unzipped corpus files
corpus_files = os.listdir(Brown)
for fn in corpus_files:
    fullpath = os.path.join(Brown, fn)
    with open(fullpath, "r") as f:
        text = f.read().lower() # lowercase to catch sentence-initial 'love'.
        N = text.count("love_n")
        V = text.count("love_v")
        print(fn + "\tLOVE as NOUN: " + str(N) + "\tLOVE as VERB: " + str(V))
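
One caveat with str.count(): it also finds "love_n" inside longer words such as "glove_nn". A possible refinement, sketched here with a regular expression, is to require a word boundary before "love". The sample sentence and its tags are made up and only assume the word_tag format used in the solution above; check the manual for the actual tag set.

```python
import re

# Made-up sample in the word_tag format that the solution above assumes.
sample = "Love_nn is_bez a_at glove_nn ._. They_ppss love_vb gloves_nns ._."
text = sample.lower()

# str.count() also counts the "love_n" hidden inside "glove_nn".
naive_nouns = text.count("love_n")

# \b marks a word boundary, so "love" must be the start of a word
# and "glove_nn" is skipped.
strict_nouns = len(re.findall(r"\blove_n", text))
strict_verbs = len(re.findall(r"\blove_v", text))

print(naive_nouns, strict_nouns, strict_verbs)  # 2 1 1
```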


Dealing with unwanted files

  • Sometimes we do not want to process every file in a directory (e.g. corpus manual, etc.)
  • Solution: use list comprehensions:
## MyCorpus is a variable holding the path to the corpus directory:
filenames = os.listdir(MyCorpus)
## Only process ".txt" files:
corpus_files = [x for x in filenames if x.endswith(".txt")]
## Only files that are named in a certain pattern,
## e.g. that start with the corpus name:
corpus_files = [x for x in filenames if x.startswith("BROWN")]
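
Both conditions can also be combined in a single list comprehension. The sketch below builds a small throw-away directory first so that it runs anywhere. All file names are invented for illustration, and lowercasing the name before the check is one possible way to also catch ".TXT".

```python
import os
import tempfile

# Build a small throw-away directory so the example runs anywhere.
# All file names here are invented for illustration.
corpus_dir = tempfile.mkdtemp()
for name in ["BROWN_A.txt", "BROWN_B.TXT", "manual.pdf", "notes.txt"]:
    with open(os.path.join(corpus_dir, name), "w") as f:
        f.write("placeholder")

# os.listdir() returns names in arbitrary order, so we sort them.
filenames = sorted(os.listdir(corpus_dir))

# Keep a file only if it starts with the corpus name AND is a text file;
# lower() makes the extension check case-insensitive.
corpus_files = [
    x for x in filenames
    if x.startswith("BROWN") and x.lower().endswith(".txt")
]

print(corpus_files)  # ['BROWN_A.txt', 'BROWN_B.TXT']
```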

Checking whether a file or directory exists


path = '/home/User/Desktop/file.txt'
os.path.isfile(path)
corpus_dir = '/home/User/Corpora/Brown'
os.path.isdir(corpus_dir)

This is useful, e.g. for checking whether the results file we plan to open already exists, and to avoid overwriting previous files.

In more complex tasks, there may be various intermediate results that yield their own output. Checking which of these are already done saves us from having to start everything from scratch.
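
A minimal sketch of this idea, assuming a hypothetical helper write_results() and a made-up results file in a throw-away directory: the file is only written if os.path.isfile() reports that nothing is there yet.

```python
import os
import tempfile

# Hypothetical results file inside a throw-away directory.
out_dir = tempfile.mkdtemp()
results_path = os.path.join(out_dir, "results.txt")

def write_results(path, content):
    """Write content to path unless a file is already there."""
    if os.path.isfile(path):
        print("Skipping, file already exists:", path)
        return False
    with open(path, "w") as f:
        f.write(content)
    return True

# The counts below are placeholder values, not real results.
first = write_results(results_path, "freedom\t42\n")   # file is created
second = write_results(results_path, "freedom\t0\n")   # file is left untouched
```

The second call returns False and leaves the earlier results in place.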

Walking through sub-directories

  • Sometimes, our data have a more complicated structure
  • E.g.: A folder ICE with sub-folders for each national corpus of the ICE family, each of which has a sub-folder for each of the ICE text categories.
  • Ideally, we want to be able to start from the top-level directory and get at all the .txt files at the lower levels.
ICE = '/home/User/Corpora/ICE'
ICE_files = []
for root, dirs, files in os.walk(ICE):
    for name in files:
        ICE_files.append(os.path.join(root, name))
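
The loop above collects every file it finds. If only the .txt files are wanted, one option is to add an endswith() check inside the loop. The sketch below runs on a throw-away directory tree that stands in for the ICE folder; the sub-folder and file names are invented for illustration.

```python
import os
import tempfile

# Throw-away directory tree standing in for the ICE folder.
# Sub-folder and file names are invented for illustration.
top = tempfile.mkdtemp()
layout = [
    ("ICE-GB/S1A", "a.txt"),
    ("ICE-GB/W2B", "b.txt"),
    ("ICE-IND/S1A", "readme.pdf"),
]
for sub, name in layout:
    folder = os.path.join(top, *sub.split("/"))
    os.makedirs(folder, exist_ok=True)
    with open(os.path.join(folder, name), "w") as f:
        f.write("placeholder")

# Walk the whole tree and keep only the .txt files.
txt_files = []
for root, dirs, files in os.walk(top):
    for name in files:
        if name.endswith(".txt"):
            txt_files.append(os.path.join(root, name))

txt_files.sort()
print([os.path.basename(p) for p in txt_files])  # ['a.txt', 'b.txt']
```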

Let’s practice

Exercise

On ILIAS, you will find the zip folder ARCHER_3-2_TXT, which contains all corpus files of A Representative Corpus of Historical English Registers (ARCHER). ARCHER has files nested in sub-directories in such a way that a simple call to os.listdir() is not enough to process all corpus texts.

Use the os.walk() function to get a list of all ARCHER filepaths.

As a challenge, create two separate lists from the one above, one for all the British and one for all the American text samples. Use list comprehensions for this task.


### Assuming the variable ARCHER contains the string that specifies
### the top-level directory:
ARCHER_files = []
for root, dirs, files in os.walk(ARCHER):
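    for name in files:
        ARCHER_files.append(os.path.join(root, name))
### os.sep is the path separator of the current operating system
### ("/" on Linux and macOS, "\" on Windows):
ARCHER_US = [fp for fp in ARCHER_files if (os.sep + "am_") in fp]
ARCHER_BR = [fp for fp in ARCHER_files if (os.sep + "br_") in fp]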


Stay classy!

Image Credits

  • Title slide and slide background: Coiled snake png sticker illustration, image in the public domain (CC0)
  • Slides 6 and 11: Yellow rock python snake photo, image in the public domain (CC0)
  • Slide 13: Laocoön group, a small sculpture modelled after an antique statue, 1838 - 1894, original public domain image from Finnish National Gallery (CC0)

Thank you


For the presentation slides, visit

https://pylx-10-os.netlify.app/

or scan QR code.


Bohmann: Python for Linguists, 2023
