FileTree tutorial
=================
.. toctree::
   :maxdepth: 3
   :caption: Contents:

Introduction
------------
The goal of file-tree is to make it easier to define the input and output filenames of a data processing pipeline.
This tutorial shows the many features of file-tree through simple pipeline examples.
For an overview of the main API see the :class:`FileTree docstring `.

It does this by defining the directory structure for your input/output files separately from the actual pipeline code.
The directory structure is defined in a tree file (typically using the ".tree" extension) using a format like:

::

    paper_name = my-first-project, a-better-project, best-work-ever

    papers
        references.bib
        {paper_name}
            manuscript.md
            {paper_name}.pdf (output)

Note that the text between curly brackets (e.g., "{paper_name}") is a placeholder for which the possible values are given at the top.
Given such a filetree definition, we could write a pipeline like:

.. code-block:: python

    from file_tree import FileTree
    from subprocess import run

    tree = FileTree.read("")
    # Each iteration yields a FileTree with a single value for "paper_name"
    for paper_tree in tree.iter("manuscript"):
        run([
            "pandoc", paper_tree.get("manuscript"),
            "-f", "markdown",
            "--bibliography", paper_tree.get("references"),
            "-t", "pdf",
            "-o", paper_tree.get("output"),
        ])

The pipeline above iterates over three papers (named "my-first-project", "a-better-project", and "best-work-ever") and runs pandoc on each to convert the markdown manuscript to a pdf output.
Here we assume all papers use a shared bibliography available as "papers/references.bib".
So before running the pipeline, the directory structure might look like this:

::

    papers
    ├── a-better-project
    │   └── manuscript.md
    ├── best-work-ever
    │   └── manuscript.md
    ├── my-first-project
    │   └── manuscript.md
    └── references.bib

Afterwards it will look like:

::

    papers
    ├── a-better-project
    │   ├── a-better-project.pdf
    │   └── manuscript.md
    ├── best-work-ever
    │   ├── best-work-ever.pdf
    │   └── manuscript.md
    ├── my-first-project
    │   ├── my-first-project.pdf
    │   └── manuscript.md
    └── references.bib

One of the advantages of defining the input/output directories separately from the pipeline is that it becomes much easier to change the input/output filenames.
For example, if every manuscript has its own bibtex file rather than a single shared one, we could simply rewrite the filetree definition to:

::

    paper_name = my-first-project, a-better-project, best-work-ever

    papers
        {paper_name}
            manuscript.md
            references.bib
            {paper_name}.pdf (output)

The same pipeline code will work on this.
The same code will even work if we have multiple versions for each paper:

::

    paper_name = my-first-project, a-better-project, best-work-ever
    version = first, final, real-final, after-feedback, final-final, last

    papers
        {paper_name}
            version-{version}
                manuscript.md
                references.bib
                {paper_name}.pdf (output)

With this definition the pipeline will iterate through all possible combinations of "paper_name" and "version" (i.e., 3 paper names x 6 versions = 18 runs).
Note that here the pipeline code will fail if we don't have the exact same versions for each paper.
We will see ways to deal with that below.

In this tutorial we will go through the individual steps in defining a pipeline like this.

.. note::

    To make it easier to copy/paste the code examples we will define the filetrees as strings in this tutorial, which can be read using :meth:`FileTree.from_string `.
    In practice, I recommend storing the FileTree definition in a separate file, which can be loaded using :meth:`FileTree.read `.
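
The count of 18 runs quoted above is simply the Cartesian product of the two placeholder value lists.
The following is a minimal sketch in plain Python that enumerates those combinations and fills in the output path for each.
It does not use file-tree itself, and the ``output_template`` string is a hand-written stand-in for the "output" template of the versioned tree, introduced here only for illustration:

.. code-block:: python

    from itertools import product

    paper_names = ["my-first-project", "a-better-project", "best-work-ever"]
    versions = ["first", "final", "real-final", "after-feedback", "final-final", "last"]

    # Hand-written equivalent of the "output" template in the versioned tree (an assumption).
    output_template = "papers/{paper_name}/version-{version}/{paper_name}.pdf"

    # One output path per (paper_name, version) combination.
    output_paths = [
        output_template.format(paper_name=paper_name, version=version)
        for paper_name, version in product(paper_names, versions)
    ]

    print(len(output_paths))  # 18
    print(output_paths[0])  # papers/my-first-project/version-first/my-first-project.pdf
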

FileTree interactivity
----------------------
There are several tools available to explore the output after running the pipeline or when analysing someone else's output.
Given a text file describing the directory structure one can run from the command line:

.. code-block:: bash

    file-tree -d

.. note::

    This feature requires `textual ` to be installed. You can get this using `pip/conda install textual`.

This will open an interactive terminal-based app illustrating the FileTree and which of the files defined in the FileTree actually exist on disk.

.. image:: images/app.png
    :alt: Illustration of the terminal-based app

Within python this same app can be accessed by running :meth:`FileTree.run_app `:

.. code-block:: python

    from file_tree import FileTree

    tree = FileTree.read().run_app()

Because this app runs fully within the terminal, it will still run when connected to a remote machine.

For neuroimaging applications the FileTree can also be used to visualise any images in the pipeline using `FSLeyes `_.

FileTree indentation
--------------------
FileTrees are defined in a simple-to-type format, where indentation is used to indicate subdirectories, for example:

::

    # Any text following a #-character can be used for comments
    parent
        file1.txt
        child
            file2
        file3.txt
    file4.txt

In the top-level directory this represents one file ("file4.txt") and one directory ("parent").
The directory contains two files ("file1.txt" and "file3.txt") and one directory ("child") which contains a single file ("file2").
Hence, each line in this file corresponds to a different path.
We refer to each such path as a template.

Individual aspects of this format are defined in more detail below.

Template keys
-------------
Each template (i.e., directory and file path) in the FileTree is assigned a key for convenient access.
For example, for the FileTree above, we can access the individual path templates using:

.. code-block:: python

    from file_tree import FileTree

    tree = FileTree.from_string("""
    parent
        file1.txt
        child
            file2
        file3.txt
    file4.txt
    """)
    print(tree.get('file2'))
    # 'parent/child/file2'
    print(tree.get('child'))
    # 'parent/child'

These filenames will be returned whether the underlying file exists or not (see :func:`FileTree.get `).

By default the key will be the name of the file or directory without extension (i.e., everything before the first dot).
The key can be explicitly set by including it in round brackets behind the filename,
so ``left_hippocampus_segment_from_first.nii.gz (Lhipp)`` will have the key "Lhipp"
rather than "left_hippocampus_segment_from_first".
Matching the keys between the filetree definitions and the pipeline code is crucial to prevent bugs.

.. note::

    Having the same key refer to multiple templates will lead to an error when accessing that template key.
    Usually, this error arises because you have multiple directories/files with identical names (e.g., "data") in multiple locations.
    You can fix this by giving these directories/files a custom key using the round brackets.

Placeholders
------------
FileTrees can have placeholders for variables such as the paper name or version in the example above.
Any part of the directory or file names contained within curly brackets will have to be filled when getting the path:

.. code-block:: python

    from file_tree import FileTree

    tree = FileTree.from_string("""
    {paper_name}
        references.bib
        manuscript_V{version}.md (manuscript)
    """)
    tree.get('references')
    # ValueError raised, because paper_name is undefined
    paper_tree = tree.update(paper_name="my-paper")
    print(paper_tree.get('references'))
    # 'my-paper/references.bib'
    print(paper_tree.get('manuscript'))
    # ValueError raised, because version is undefined
    print(paper_tree.update(version="0.1").get('manuscript'))
    # 'my-paper/manuscript_V0.1.md'

Placeholders can be undefined, have a single value or have a sequence of possible values.
The latter can be used for iteration over those values.

Placeholder types
^^^^^^^^^^^^^^^^^
Filling in placeholder values uses python formatting under the hood.
This means that the full `Python format string syntax `_ can be used.
For example, the following expects the version to be an integer:

.. code-block:: python

    from file_tree import FileTree

    tree = FileTree.from_string("""
    {paper_name}
        references.bib
        manuscript_V{version:02d}.md (manuscript)
    """, paper_name="my-paper")
    print(tree.update(version=2).get("manuscript"))
    # 'my-paper/manuscript_V02.md'
    print(tree.update(version='3').get("manuscript"))
    # 'my-paper/manuscript_V03.md'
    print(tree.update(version='alpha').get("manuscript"))
    # raises an error, because the version cannot be converted to an integer

Note that the placeholder formatting is slightly more forgiving than the python string formatting.
In this case, "version" does not need to be an integer; it only needs to be convertible into an integer.

.. _iteration:

Placeholder iteration
^^^^^^^^^^^^^^^^^^^^^
In pipelines you will typically want to iterate over multiple parameter values.
We used this in the initial code example to iterate over all manuscripts (and optionally over their multiple versions):

.. code-block:: python

    for paper_tree in tree.iter("manuscript"):
        run([
            "pandoc", paper_tree.get("manuscript"),
            "-f", "markdown",
            "--bibliography", paper_tree.get("references"),
            "-t", "pdf",
            "-o", paper_tree.get("output"),
        ])

There are two methods for this in FileTree, namely :meth:`FileTree.iter ` and :meth:`FileTree.iter_vars `.
The former expects a template key and iterates over all placeholders in that template that have multiple possible values.
For the latter you need to explicitly provide the placeholder names you want to iterate over.
In either case, the iteration returns a series of FileTree objects with the same templates, but different singular values for the placeholders you are iterating over.
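
As a rough illustration of what iterating over a template means, the sketch below mimics the behaviour described above using only the standard library.
It is not file-tree's implementation: ``iter_template``, ``values``, and the template string are hypothetical names introduced here.
The function finds the placeholders used in one template, keeps those with a list of values, and yields one dictionary of singular values per combination:

.. code-block:: python

    from itertools import product
    from string import Formatter


    def iter_template(template, values):
        """Yield one dict per combination of the multi-valued placeholders in `template`."""
        names = []
        for _, field_name, _, _ in Formatter().parse(template):
            is_new = field_name and field_name not in names
            if is_new and isinstance(values[field_name], list):
                names.append(field_name)
        for combination in product(*(values[name] for name in names)):
            # Replace each list by one singular value; leave everything else untouched.
            yield {**values, **dict(zip(names, combination))}


    values = {
        "paper_name": ["my-first-project", "a-better-project"],
        "version": "final",  # single value, so not iterated over
        "unused": ["x", "y"],  # multiple values, but not part of the template
    }
    template = "papers/{paper_name}/version-{version}/manuscript.md"

    paths = [template.format(**single) for single in iter_template(template, values)]
    print(paths)

Only "paper_name" is iterated over here, because it is the only placeholder in the template with more than one value.
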
There are a few more examples of this iteration in the section below.

If you want to see all possible values for a template without iterating over it, you can use :meth:`FileTree.get_mult `.

Setting placeholder values
^^^^^^^^^^^^^^^^^^^^^^^^^^
There are five ways to define placeholder values:

- Within the filetree definition. Multiple values are separated by commas.
  The following example shows how to set a single value (for paper_name) or multiple values (for version) within the filetree definition:

  ::

      paper_name = my-paper
      version = alpha, beta

- When loading the FileTree definition you can set any placeholder values directly in the constructor (overriding any in the filetree definition):

  .. code-block:: python

      from file_tree import FileTree

      tree = FileTree.from_string("""
      paper_name = my-paper
      version = alpha, beta
      """, paper_name='other_paper')
      print(tree.placeholders['paper_name'])
      # 'other_paper'

- The example above also shows you can access the `placeholders` attribute directly, which can be used to update its values (e.g., `tree.placeholders['paper_name'] = 'new_value'`).
- :meth:`FileTree.update ` can be used to update the placeholder values in place or return a new filetree with the updated values.
- :meth:`FileTree.update_glob ` can be used to identify all possible placeholder values based on which input files already exist on disk.
  For example, the following iterates over all "papers/\*/manuscript.md" files on disk and produces the corresponding output pdf:

  .. code-block:: python

      from file_tree import FileTree
      from subprocess import run

      tree = FileTree.from_string("""
      papers
          {paper_name}
              manuscript.md
              {paper_name}.pdf (output)
      """)
      for paper_tree in tree.update_glob("manuscript").iter("manuscript"):
          run([
              "pandoc", paper_tree.get("manuscript"),
              "-f", "markdown",
              "-t", "pdf",
              "-o", paper_tree.get("output"),
          ])

`update_glob` also works for multiple placeholders.
In the following we iterate over all combinations of paper_name and version found on disk:

.. code-block:: python

    from file_tree import FileTree
    from subprocess import run

    tree = FileTree.from_string("""
    papers
        {paper_name}
            V-{version:d}
                manuscript.md
                {paper_name}.pdf (output)
    """)
    for paper_tree in tree.update_glob("manuscript", link=[("paper_name", "version")]).iter("manuscript"):
        run([
            "pandoc", paper_tree.get("manuscript"),
            "-f", "markdown",
            "-t", "pdf",
            "-o", paper_tree.get("output"),
        ])

The type formatting for "version" will ensure that it will only match integers,
so "papers/my-paper/V-alpha/manuscript.md" is not a match, but "papers/other-work/V-80/manuscript.md" is.

Using the `link` keyword argument in `update_glob` we indicate that the placeholders covary.
The default behaviour of `update_glob` is to identify all possible values for "paper_name" and "version" separately.
When iterating over them later, we might get invalid combinations of "paper_name" and "version".
Setting the `link` keyword changes this behaviour to instead link the values of "paper_name" and "version" together (see the section "Linked placeholder values" below).
This ensures the later iteration (`...iter("manuscript")`) will only return valid combinations of "paper_name" and "version" for which "manuscript" exists.

Linked placeholder values
^^^^^^^^^^^^^^^^^^^^^^^^^
Occasionally we might not want to iterate over all possible combinations of values for some placeholders.
For example, let us consider a case where we have two different bibliography files (one for marine biology and the other for history) and each paper only uses one of these.
We could enforce this using:

.. code-block:: python

    from file_tree import FileTree

    tree = FileTree.from_string("""
    {reference}.bib (reference)
    {paper_name}
        manuscript.md
        {paper_name}.pdf (output)
    """)
    # The n-th paper name is paired with the n-th reference
    tree.placeholders[("paper_name", "reference")] = [
        ("first-biology", "first-history", "another-biology"),
        ("marine-biology", "history", "marine-biology"),
    ]
    for paper_tree in tree.iter("manuscript"):
        print(f"processing {paper_tree.get('manuscript')} with {paper_tree.get('reference')} into {paper_tree.get('output')}")
    # processing first-biology/manuscript.md with marine-biology.bib into first-biology/first-biology.pdf
    # processing first-history/manuscript.md with history.bib into first-history/first-history.pdf
    # processing another-biology/manuscript.md with marine-biology.bib into another-biology/another-biology.pdf

Rather than running 3x2=6 combinations for all three papers and two reference files,
here we just iterate over three steps with the linked "paper_name" and "reference" values.

If you have already set the placeholder values you can use :meth:`FileTree.placeholders.link ` or :meth:`FileTree.placeholders.unlink ` to link or unlink them.

Optional Placeholders
---------------------
Normally having undefined placeholders will lead to an error being raised.
This can be avoided by putting these placeholders in square brackets, indicating that they can simply be skipped if undefined.
For example:

.. code-block:: python

    from file_tree import FileTree

    tree = FileTree.from_string("""
    {paper_name}
        [V-{version}]
            manuscript.tex
    """, paper_name='my-paper')
    tree.get('manuscript')
    # 'my-paper/manuscript.tex'
    tree.update(version='final').get('manuscript')
    # 'my-paper/V-final/manuscript.tex'

Note that if any placeholder within the square brackets is undefined, all text within those square brackets is omitted.

An example with extensive use of optional placeholders can be found in the `FileTree of the BIDS raw data format `_.
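
The square-bracket rule can be sketched in a few lines of plain Python.
This is only an illustration under simplifying assumptions (no nested brackets and placeholder values without curly brackets), not the parser file-tree actually uses; ``fill_optional`` is a hypothetical helper introduced here:

.. code-block:: python

    import re
    from string import Formatter


    def fill_optional(template, **values):
        """Fill a path template, dropping any [...] section with an undefined placeholder."""

        def keep_or_drop(match):
            section = match.group(1)
            needed = [name for _, name, _, _ in Formatter().parse(section) if name]
            # Keep the bracket contents only if every placeholder inside is defined.
            return section if all(name in values for name in needed) else ""

        without_brackets = re.sub(r"\[([^\[\]]*)\]", keep_or_drop, template)
        filled = without_brackets.format(**values)
        # Collapse the empty path component left behind by a dropped directory.
        return re.sub(r"/{2,}", "/", filled)


    template = "{paper_name}/[V-{version}]/manuscript.tex"
    print(fill_optional(template, paper_name="my-paper"))
    # my-paper/manuscript.tex
    print(fill_optional(template, paper_name="my-paper", version="final"))
    # my-paper/V-final/manuscript.tex
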

Wildcards
---------
So far, we have assumed a well organised directory structure, where we can determine the full path name after filling in the placeholder values.
However, sometimes paths contain some additional text (e.g., a random identifier) that is not part of our well-organised structure.
To support this we can add unix-like wildcards (``*`` or ``?``) into our template paths.

For example, suppose that some papers have co-authors and we choose to include their name in the manuscript filename.
So, on disk we might have a directory structure looking like:

::

    .
    ├── my-paper
    │   └── manuscript.tex
    └── shared-paper
        └── manuscript-with-Pete.tex

Presuming that for the later analysis we do not care about the co-authors, we can describe this in a file-tree like:

::

    {paper_name}
        manuscript*.tex

If we run :meth:`FileTree.get ` using this file-tree, we will get the following:

.. code-block:: python

    tree.update(paper_name="my-paper").get('manuscript')
    # 'my-paper/manuscript.tex'
    tree.update(paper_name="shared-paper").get('manuscript')
    # 'shared-paper/manuscript-with-Pete.tex'

When wildcards (``*`` or ``?``) are included, by default file-tree will assume that there is only a single matching file on disk, which will be returned.
If there are no or multiple matches a ``FileNotFoundError`` will be raised.

This default behaviour can be changed by setting the ``glob`` parameter.
This parameter can be set as a keyword argument when creating the ``FileTree``,
it can be updated by setting ``tree.glob = value``,
or it can be set as a keyword to a specific call to :meth:`FileTree.get ` or :meth:`FileTree.get_mult `.
Possible values for this keyword are:

- ``False``: do not do any pattern matching. This was the default behaviour before wildcards were introduced in v1.6. Use this to get the raw string including any ``*`` or ``?`` characters.
- ``True``/"default": return the filename if there is a single match. Raise an error otherwise. This is the default behaviour.
- "first"/"last": return the first or last match (based on alphabetical ordering). An error is raised if there are no matches.
- function/callable: return the match returned by the function. The input to the function is a list of all the matching filenames (possibly of zero length).

Wildcards can be explicitly encoded in the file-tree definition or be included in a placeholder value (e.g., ``tree.update(session="01*")``).

Note that using wildcards only makes sense for input filenames that already exist on disk.
Output filenames should be fully defined after filling in the placeholder values.
If you want the part of the path covered by the wildcard in the input filename to be part of the output filename,
you will have to use a fully-fledged placeholder rather than using wildcards.

Sub-trees
---------
FileTrees can include other FileTrees within their directory structure.
This allows for the efficient reuse of existing trees.
For example, suppose we have already defined the following tree in a file called "child.tree":

::

    manuscript.tex
    references.bib
    {paper_name}.pdf (output_pdf)

We can then include these files as part of a larger directory structure like the following:

.. code-block:: python

    from file_tree import FileTree

    tree = FileTree.from_string("""
    versioned_papers
        {paper_name}
            V-{version}
                ->child (versioned)
    unversioned_papers
        {paper_name}
            ->child (unversioned)
    """, paper_name='my-paper', version='alpha')
    tree.get("versioned/manuscript")
    # 'versioned_papers/my-paper/V-alpha/manuscript.tex'
    tree.get("unversioned/manuscript")
    # 'unversioned_papers/my-paper/manuscript.tex'

Note that we mark where the sub-tree should be inserted using "->".
The general format of this is:
``-> [=, ...] ()``

In this example we use the precursor to distinguish between two different uses of the same sub-tree.
We need to use this precursor to access the template keys (i.e., template key in the parent tree is "/