pycnet.file package
Submodules
pycnet.file.image module
Defines functions for processing image files (spectrograms).
Functions:
- checkImageFile
Verify that an image file can be loaded.
- checkImages
Use a set of worker processes to check all the image files in a directory tree.
- findImages
Locate PNG files in a directory tree and optionally check if they can be loaded.
Classes:
- ImageChecker
Worker process that checks whether image files can be loaded.
- class pycnet.file.image.ImageChecker(in_queue, bad_queue)
Bases:
Process
Worker that checks for bad image files.
Fetches image paths from self.in_queue and checks them using checkImageFile(). Paths of image files that could not be loaded are placed in self.bad_queue. The process runs until its in_queue is empty.
- in_queue
Queue containing paths to image files to be checked.
- Type:
multiprocessing.JoinableQueue
- bad_queue
Queue to hold paths to image files that could not be loaded.
- Type:
multiprocessing.Queue
- run()
Method to be run in sub-process; can be overridden in sub-class
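The fetch, check, and report loop described for ImageChecker can be sketched as follows. This is a minimal illustration, not pycnet's implementation: drain_queue and the check callable are hypothetical names, and a standard-library queue.Queue stands in for the multiprocessing queues so the example runs in a single process.

```python
import queue


def drain_queue(in_queue, bad_queue, check):
    """Check every path in in_queue and put failing paths in bad_queue.

    Hypothetical helper: `check` is any callable that returns True when
    a file loads (the role checkImageFile plays in the text above).
    """
    while True:
        try:
            path = in_queue.get_nowait()
        except queue.Empty:
            break  # The input queue is empty, so the worker stops.
        try:
            if not check(path):
                bad_queue.put(path)
        finally:
            # Mark the item as handled so in_queue.join() can return.
            in_queue.task_done()


# Demo with a stand-in check that rejects any path containing "broken".
demo_in = queue.Queue()
for name in ["a.png", "broken.png", "b.png"]:
    demo_in.put(name)
demo_bad = queue.Queue()
drain_queue(demo_in, demo_bad, check=lambda p: "broken" not in p)
```

Calling task_done() in a finally clause keeps the queue's unfinished-task count accurate even if the check raises, which matters when another process is waiting on join().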
- pycnet.file.image.checkImageFile(image_path)
Verify that TensorFlow can load an image file.
- Parameters:
image_path (str) – Path to the image file that will be loaded.
- Returns:
True if the image loaded successfully, otherwise False.
- Return type:
bool
- pycnet.file.image.checkImages(top_dir, n_workers=0, verbose=True)
Check all the .png images in a folder.
- Parameters:
top_dir (str) – Path to the root of the directory tree containing image files to be checked.
n_workers (int) – Number of worker processes to use. Defaults to the number of logical CPU cores.
verbose (bool) – Print the results of the image checking.
- Returns:
A sorted list of paths to image files that could not be loaded for any reason.
- Return type:
list
- pycnet.file.image.findImages(top_dir, check_images=False)
Locate PNG image files in a directory tree.
- Parameters:
top_dir (str) – Path to the root of the directory tree containing image files.
check_images (bool) – Verify that image files can be loaded.
- Returns:
A dict containing paths to image files found in the directory. If check_images==True, “good_images” is a list of paths of image files that were loaded successfully, and “bad_images” is a list of paths of image files that could not be loaded. “images_checked” is a bool indicating the value of check_images.
- Return type:
dict
pycnet.file.wav module
Defines functions and classes for processing audio files.
Functions:
- getWavLength
Measure the duration of a .wav file.
- getFlacLength
Measure the duration of a .flac file.
- getAudioFileInfo
Get general audio file information from SoX.
- makeSoxCmds
Build a set of commands to be executed by SoX to generate spectrograms from a .wav file.
- makeSpectroDirList
Map a set of audio files to a set of temporary output directories where spectrograms will be generated.
Classes:
- WaveWorker
Worker process that generates spectrograms from audio files using SoX.
- class pycnet.file.wav.WaveWorker(in_queue, done_queue, output_dir)
Bases:
Process
Worker process to generate spectrograms from wave files.
When running, the worker will fetch the next available item from in_queue, consisting of a .wav file and an output directory. It will create a set of sox commands to generate a set of spectrograms from the .wav file in the output directory, then execute those commands sequentially using os.system. When finished, the path to the .wav file will be placed in done_queue.
- in_queue
Queue containing tuples in the format (wav_path, output_dir).
- Type:
multiprocessing.JoinableQueue
- done_queue
Queue to hold paths to .wav files that have already been processed.
- Type:
multiprocessing.Queue
- output_dir
Path to the directory where spectrograms should be generated.
- Type:
str
- run()
Method to be run in sub-process; can be overridden in sub-class
- pycnet.file.wav.getAudioFileInfo(file_path)
Return a string containing information about an audio file.
This function just runs sox --i [file_path], captures the output from stdout, and converts it from a bytestring to UTF-8 text for parsing by other functions.
- Parameters:
file_path (str) – Path to the audio file.
- Returns:
String containing SoX output.
- Return type:
str
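A minimal sketch of the capture-and-decode step, assuming the sox executable is on the PATH. The names run_and_decode and audio_file_info are hypothetical and not part of pycnet; the demo runs the Python interpreter instead of SoX so that it works without SoX installed.

```python
import subprocess
import sys


def run_and_decode(cmd):
    """Run a command and return its stdout decoded as UTF-8 text."""
    result = subprocess.run(cmd, stdout=subprocess.PIPE, check=False)
    return result.stdout.decode("utf-8", errors="replace")


def audio_file_info(file_path):
    """Hypothetical wrapper around `sox --i`; requires SoX to be installed."""
    return run_and_decode(["sox", "--i", file_path])


# Demo using the Python interpreter itself as the external command.
demo_text = run_and_decode([sys.executable, "-c", "print('ok')"])
```

Passing the command as a list avoids shell quoting problems with paths that contain spaces.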
- pycnet.file.wav.getFlacLength(flac_path, mode='h')
Return the duration of a .flac file in hours or seconds.
The first 38 bytes of any FLAC file contain a FLAC signature (b'fLaC') and a “stream info” metadata block listing number of channels, sample rate, etc. Unfortunately some of these values are encoded in strings of bits that do not divide nicely into bytes, which necessitates some clever bitwise calculations to convert them into numeric values. The calculations seen here are copied from the flac submodule of the mutagen package: https://github.com/quodlibet/mutagen/blob/main/mutagen/flac.py
- Parameters:
flac_path (str) – Path to the .flac file.
mode (str) – Units for the return value. Default is 'h' (hours). Set mode='s' to return duration in seconds.
- Returns:
Duration of the .flac file in hours or seconds, or 0 if the file’s duration could not be determined.
- Return type:
float
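The bitwise unpacking mentioned above can be sketched as follows. This is an illustration based on the published FLAC stream layout, not the code copied from mutagen: flac_duration is a hypothetical name, and the function takes the leading bytes of the file rather than a path. After the 4-byte signature and a 4-byte metadata block header, the stream info block stores block and frame sizes in its first 10 bytes, followed by 64 bits that pack the sample rate (20 bits), channel count minus one (3 bits), bits per sample minus one (5 bits), and total sample count (36 bits).

```python
def flac_duration(header, mode="h"):
    """Return the duration encoded in the leading bytes of a FLAC file.

    `header` must hold at least the first 26 bytes of the file.
    Returns 0 if the bytes do not look like a FLAC stream.
    """
    if len(header) < 26 or header[:4] != b"fLaC":
        return 0
    # Low 7 bits of the block header's first byte give the block type;
    # type 0 is the stream info block.
    if header[4] & 0x7F != 0:
        return 0
    # Bytes 18-25 hold the 64 packed bits described above.
    packed = int.from_bytes(header[18:26], "big")
    sample_rate = packed >> 44                # top 20 bits
    total_samples = packed & ((1 << 36) - 1)  # bottom 36 bits
    if sample_rate == 0:
        return 0
    seconds = total_samples / sample_rate
    return seconds if mode == "s" else seconds / 3600


# Synthetic header: 44.1 kHz, 2 channels, 16 bits, 441000 samples.
demo_fields = (44100 << 44) | (1 << 41) | (15 << 36) | 441000
demo_header = (
    b"fLaC" + bytes([0x80, 0, 0, 34]) + bytes(10) + demo_fields.to_bytes(8, "big")
)
demo_seconds = flac_duration(demo_header, mode="s")
```

Reading the packed fields as one big-endian integer and then shifting and masking avoids handling the fields byte by byte.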
- pycnet.file.wav.getWavLength(wav_path, mode='h')
Return the duration of a .wav file in hours or seconds.
- Parameters:
wav_path (str) – Path to the .wav file.
mode (str) – Units for the return value. Default is 'h' (hours). Set mode='s' to return duration in seconds.
- Returns:
Duration of the .wav file in hours or seconds, or 0 if the file’s duration could not be determined.
- Return type:
float
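One way to measure a .wav file's duration is with Python's standard wave module, dividing the frame count by the frame rate. The sketch below assumes an uncompressed PCM file; wav_duration is a hypothetical name, and this is not necessarily how getWavLength is implemented.

```python
import wave


def wav_duration(wav_path, mode="h"):
    """Return the duration of a PCM .wav file in hours or seconds.

    Returns 0 if the file cannot be opened or parsed.
    """
    try:
        with wave.open(wav_path, "rb") as wav_file:
            seconds = wav_file.getnframes() / wav_file.getframerate()
    except (wave.Error, OSError, EOFError, ZeroDivisionError):
        return 0
    return seconds if mode == "s" else seconds / 3600
```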
- pycnet.file.wav.makeSoxCmds(file_path, output_dir, clip_length=12)
Generate SoX commands to create spectrograms from an audio file.
Generates a list of SoX commands to create spectrograms representing segments of audio from the file provided.
- Parameters:
file_path (str) – Path to an audio (.wav or .flac) file.
output_dir (str) – Path to the folder where spectrogram files will be generated.
- Returns:
A list of commands to be executed by SoX.
- Return type:
list[str]
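As a rough sketch of what such a command list might look like, the example below builds one SoX command per clip using the trim and spectrogram effects. Everything specific here is an assumption: make_clip_cmds, the duration_s parameter, the output filename pattern, and the choice of effects are hypothetical, and the commands pycnet actually generates may use different options.

```python
import math
import os


def make_clip_cmds(file_path, output_dir, duration_s, clip_length=12):
    """Build one SoX command per clip_length-second segment of a file."""
    stem = os.path.splitext(os.path.basename(file_path))[0]
    n_clips = math.ceil(duration_s / clip_length)
    cmds = []
    for i in range(n_clips):
        start = i * clip_length
        # Hypothetical naming scheme: <stem>_part_001.png, _part_002.png, ...
        png_path = os.path.join(output_dir, f"{stem}_part_{i + 1:03d}.png")
        # -n discards audio output; trim selects the segment and
        # spectrogram -o writes the image.
        cmds.append(
            f'sox "{file_path}" -n trim {start} {clip_length} '
            f'spectrogram -o "{png_path}"'
        )
    return cmds
```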
- pycnet.file.wav.makeSpectroDirList(wav_list, image_dir, n_chunks)
Map input .wav files to multiple output directories.
Divides the full list of .wav files to be processed into several chunks and designates a folder to hold spectrograms from each chunk to facilitate parallel processing.
- Parameters:
wav_list (list) – List of paths to .wav files from which spectrograms will be generated.
image_dir (str) – Path to the directory where spectrograms will be generated (in subfolders as needed).
n_chunks (int) – The number of subfolders to create.
- Returns:
A list of tuples (wav_path, output_dir), each containing the path to a .wav file and the folder where spectrograms from that file will be generated.
- Return type:
list
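The chunking idea can be illustrated with a short sketch. The function name map_to_chunk_dirs and the subfolder naming (part_01, part_02, ...) are assumptions made for this example; it also assumes n_chunks is at least 1 and assigns files to chunks in contiguous runs, which may differ from pycnet's behavior.

```python
import os


def map_to_chunk_dirs(wav_list, image_dir, n_chunks):
    """Pair each .wav path with the subfolder for its chunk."""
    # Ceiling division: files per chunk so that n_chunks chunks cover the list.
    chunk_size = -(-len(wav_list) // n_chunks)
    pairs = []
    for i, wav_path in enumerate(wav_list):
        chunk_dir = os.path.join(image_dir, f"part_{i // chunk_size + 1:02d}")
        pairs.append((wav_path, chunk_dir))
    return pairs
```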
Module contents
Defines functions for various file handling tasks.
Functions:
- buildFilename
Construct a filename using a prefix and a timestamp.
- buildFilePrefix
Construct a prefix for a file using one or more wildcards based on its location in a directory.
- findFiles
List all paths with a given extension in a directory tree.
- getFileSize
Return the size of a file in human-readable units.
- getFolder
Return the location of a file relative to a higher-level folder.
- inventoryFolder
Inventory .wav files in a folder and write the info to a file.
- makeFileInventory
Build a table of basic attributes for a list of files.
- massRenameFiles
Rename files with a given extension in a directory tree (or undo this operation if previously performed).
- readInventoryFile
Read a .wav file inventory from a CSV file.
- removeSpectroDir
Recursively remove temporary files and folders.
- renameFiles
Rename files based on values in a DataFrame.
- summarizeInventory
Summarize a table of info on .wav files in human-readable form.
- pycnet.file.buildAudioFileDF(top_dir, file_types=['wav', 'flac'], n_workers=None)
Get information from a set of audio files in a directory tree.
Officially the only supported file types are WAV and FLAC. This function uses sox --i to get information on each file, so it should work with any audio file format that includes a self-describing header.
This is not currently used for anything specific in pycnet; it returns slightly more information than makeFileInventory but is much slower. We may try to make it more efficient / useful in future, but don’t hold your breath.
- Parameters:
top_dir (str) – Root of the directory tree to be searched.
file_types (list) – List of file extensions to search for.
n_workers (int) – Number of worker processes to use.
- Returns:
DataFrame containing basic info on each file.
- Return type:
pandas.DataFrame
- pycnet.file.buildFilePrefix(file_path, prefix_string)
Create a prefix for a file which may be based on its location.
Valid wildcards include:
- %p, the name of the file’s parent folder;
- %g, the name of the file’s “grandparent” folder (the parent folder of the parent folder);
- %c, the partial parent folder, i.e. the last component of the parent folder’s name when split on underscores; and
- %h, the partial grandparent folder, i.e. the last component of the grandparent folder’s name when split on underscores.
- Parameters:
file_path (str) – Full path to the file for which a prefix will be generated.
prefix_string (str) – Prefix to use, which may include one or more wildcards that will be replaced with components of the file’s path.
- Returns:
The prefix created by substituting the appropriate path components for their corresponding wildcards in the prefix string provided.
- Return type:
str
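The wildcard substitution can be sketched as below. The function expand_prefix is a hypothetical stand-in, not pycnet's code, and it assumes POSIX-style paths (forward slashes) so that the example behaves the same on every platform.

```python
from pathlib import PurePosixPath


def expand_prefix(file_path, prefix_string):
    """Replace %p, %g, %c and %h with components of the file's path."""
    parent = PurePosixPath(file_path).parent
    parent_name = parent.name
    grandparent_name = parent.parent.name
    substitutions = {
        "%p": parent_name,
        "%g": grandparent_name,
        # Partial names: the last underscore-separated component.
        "%c": parent_name.split("_")[-1],
        "%h": grandparent_name.split("_")[-1],
    }
    for wildcard, value in substitutions.items():
        prefix_string = prefix_string.replace(wildcard, value)
    return prefix_string
```

For a file at /data/Site_A/Stn_01/rec.wav, the prefix string "%g-%p" expands to "Site_A-Stn_01" and "%h_%c" expands to "A_01".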
- pycnet.file.buildFilename(file_path, prefix='')
Construct a filename using a prefix and a timestamp.
If no prefix is provided, a prefix will be constructed based on the two lowest-level directories containing the file (i.e., the file’s grandparent and parent directories).
If the filename does not already have a timestamp in the right format, it will be generated based on the file’s modification time.
- Parameters:
file_path (str) – Full path to the file to be renamed.
prefix (str) – Prefix component of the filename to be generated. Can include wildcards, which will be interpreted by buildFilePrefix().
- Returns:
Path to the file following renaming.
- Return type:
str
- pycnet.file.findFiles(top_dir, file_ext, ignore_case=True)
List all files with a given extension in a directory tree.
- Parameters:
top_dir (str) – Path to the root of the directory tree to be searched.
file_ext (str) – File extension of files to look for. A leading dot (.) is not necessary but will not hurt anything.
ignore_case (bool) – Treat upper- and lowercase file extensions the same. (File paths are case-sensitive on Unix-based systems but not on Windows.)
- Returns:
A sorted list of strings representing paths of files with extension file_ext in the directory tree rooted at top_dir.
- Return type:
list
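A search of this kind can be written with os.walk, as in the sketch below. The name find_files is hypothetical and the code is only an illustration of the behavior described above (optional leading dot, optional case-insensitive matching, sorted result).

```python
import os


def find_files(top_dir, file_ext, ignore_case=True):
    """Return a sorted list of paths under top_dir ending in file_ext."""
    # Accept the extension with or without a leading dot.
    suffix = "." + file_ext.lstrip(".")
    if ignore_case:
        suffix = suffix.lower()
    matches = []
    for folder, _subdirs, filenames in os.walk(top_dir):
        for name in filenames:
            candidate = name.lower() if ignore_case else name
            if candidate.endswith(suffix):
                matches.append(os.path.join(folder, name))
    return sorted(matches)
```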
- pycnet.file.getFileSize(file_path, units='gb')
Return the size of a file in human-readable units.
By default the file size will be returned in GB (gibibytes); other options include MB, KB, and plain bytes.
- Parameters:
file_path (str) – Path to the target file.
units (str) – Units to use when reporting file size (‘gb’ for gibibytes, ‘mb’ for mebibytes, ‘kb’ for kibibytes, and ‘b’ for bytes).
- Returns:
The size of the target file in the units specified.
- Return type:
float
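Because the units are binary (gibibytes, mebibytes, kibibytes), each step is a factor of 1024 rather than 1000. A minimal sketch of the conversion, using the hypothetical names file_size and UNIT_DIVISORS:

```python
import os

# Binary units: each step up is a factor of 1024.
UNIT_DIVISORS = {"b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3}


def file_size(file_path, units="gb"):
    """Return the size of a file in the requested binary units."""
    return os.path.getsize(file_path) / UNIT_DIVISORS[units.lower()]
```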
- pycnet.file.getFolder(file_path, top_dir)
Return the location of a file relative to a higher-level folder.
- Parameters:
file_path (str) – Path to the target file.
top_dir (str) – Path to the folder relative to which the file’s location will be reported.
- Returns:
Path to the folder containing file_path relative to top_dir.
- Return type:
str
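A relative folder path like this can be computed with os.path.relpath. The sketch below uses the hypothetical name folder_relative_to; note that with this approach a file located directly in top_dir yields ".", which may or may not match what getFolder returns.

```python
import os


def folder_relative_to(file_path, top_dir):
    """Return the folder containing file_path, relative to top_dir."""
    return os.path.relpath(os.path.dirname(file_path), top_dir)
```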
- pycnet.file.inventoryFolder(target_dir, write_file=True, print_summary=True, flac_mode=False)
Inventory audio files in a folder and write the info to a file.
- Parameters:
target_dir (str) – Path of the top-level directory containing .wav files.
write_file (bool) – Whether to write the inventory table to a CSV file.
print_summary (bool) – Whether to use summarizeInventory to print a human-readable summary of the .wav files that were found.
flac_mode (bool) – Inventory .flac files rather than .wav files.
- Returns:
DataFrame listing each audio file in the directory tree, its path relative to target_dir, the size of the file, and its duration in seconds.
- Return type:
pandas.DataFrame
- pycnet.file.makeFileInventory(path_list, top_dir, use_abs_path=False)
Build a table of basic attributes for a list of files.
- Parameters:
path_list (list) – List of paths of files to be examined.
top_dir (str) – Path to the folder that will be used to create relative paths.
use_abs_path (bool) – Whether to list the full path of the folder containing each file in the Folder column of the resulting DataFrame.
- Returns:
DataFrame with one row for each .wav file listing its folder (absolute or relative to top_dir), filename, size in bytes, and duration in seconds.
- Return type:
pandas.DataFrame
- pycnet.file.massRenameFiles(top_dir, extension, prefix=None)
Rename all files with a given extension in a directory tree.
Runs in ‘undo mode’ if a file called [Folder name]_rename_log.csv already exists in the directory provided.
- Parameters:
top_dir (str) – Path to the root of the directory tree containing files to be renamed.
extension (str) – File extension of files to be found and renamed.
prefix (str) – A prefix to use when constructing filenames.
- Returns:
Nothing.
- pycnet.file.readInventoryFile(inventory_path)
Read a .wav file inventory dataframe from a CSV file.
- Parameters:
inventory_path (str) – Path to a CSV file containing information on audio files within a folder.
- Returns:
DataFrame with one row for each .wav file listing its folder (absolute or relative to top_dir), filename, size in bytes, and duration in seconds.
- Return type:
pandas.DataFrame
- pycnet.file.removeSpectroDir(target_dir, spectro_dir=None)
Recursively remove temporary files and folders.
- Parameters:
target_dir (str) – Path to the folder containing audio data from which spectrograms were generated.
spectro_dir (str) – Path to the folder where the temporary spectrogram folder was created (defaults to target_dir).
- Returns:
Nothing.
- pycnet.file.renameFiles(rename_log_df, revert=False)
Rename files based on values in a DataFrame.
If the values in the New_Name column are not unique, the renaming operation will be aborted so as not to produce duplicate filenames (including duplicate filenames in different folders). This is indicated by a return value of -1.
- Parameters:
rename_log_df (pandas.DataFrame) – DataFrame listing the directory, current filenames, and future filenames for a set of files.
revert (bool) – Whether to run in “undo mode” to reverse a previous renaming operation.
- Returns:
The number of files that were successfully renamed, or -1 if the renaming operation was aborted due to duplicate filenames.
- Return type:
int
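The abort-on-duplicates rule can be illustrated without pandas. In the sketch below, rename_all is a hypothetical function and the input is a plain list of (folder, old_name, new_name) tuples standing in for the rows of the DataFrame; the real column layout and error handling in renameFiles may differ.

```python
import os


def rename_all(rename_rows):
    """Rename files, or return -1 without renaming if new names collide.

    rename_rows is a list of (folder, old_name, new_name) tuples.
    """
    new_names = [new for _folder, _old, new in rename_rows]
    # Names must be unique across all folders, not just within each one.
    if len(set(new_names)) != len(new_names):
        return -1
    n_renamed = 0
    for folder, old, new in rename_rows:
        try:
            os.rename(os.path.join(folder, old), os.path.join(folder, new))
            n_renamed += 1
        except OSError:
            continue  # Skip files that could not be renamed.
    return n_renamed
```

Checking uniqueness before touching any file means an aborted run leaves the directory tree unchanged.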
- pycnet.file.summarizeInventory(inventory_df, ext='.wav')
Summarize a table of info on audio files in human-readable form.
- Parameters:
inventory_df (pandas.DataFrame) – DataFrame containing information on audio files in a directory tree, as produced by makeFileInventory.
ext (str) – Extension of the audio files to be summarized.
- Returns:
A human-readable summary of the audio dataset.
- Return type:
str