{ "cells": [ { "cell_type": "markdown", "source": [ "Command line tutorial\n", "=====================\n", "\n", "This tutorial will walk you through the usage of the ``lib5c`` command-line\n", "tools." ], "metadata": { "id": "czq_NDhQQ1WI", "colab_type": "text" } }, { "cell_type": "markdown", "source": [ "Follow along in Google colab\n", "-----------------\n", "\n", "You can run and modify the cells in this notebook tutorial live using Google colaboratory by clicking the link below:\n", "\n", "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/thomasgilgenast/lib5c-tutorials/blob/master/commandline_tutorial.ipynb)\n", "\nTo simply have all the cells run automatically, click `Runtime > Run all` in the colab toolbar." ], "metadata": { "id": "qkuDJ5ewg-cQ", "colab_type": "text" } }, { "cell_type": "markdown", "source": [ "Make sure `lib5c` is installed\n", "------------------------------\n", "\nInside a fresh virtual environment, run" ], "metadata": { "id": "I5IDf7zRQact", "colab_type": "text" } }, { "cell_type": "code", "source": [ "!pip install lib5c" ], "outputs": [], "execution_count": null, "metadata": { "id": "FRBRPr68QQG5", "colab_type": "code", "colab": { "base_uri": "https://localhost:8080/", "height": 1088 }, "outputId": "ed8f15cb-805a-45af-9179-0a7275143f8c" } }, { "cell_type": "markdown", "source": [ "Make a directory and get data\n", "-----------------------------\n", "\n", "If you haven't completed the [pipeline tutorial](pipeline_tutorial.ipynb) yet,\n", "make a directory for the tutorial:\n", "\n", "```\n", "$ mkdir lib5c-tutorial\n", "$ cd lib5c-tutorial\n", "```\n", "\n", "and prepare the example data in `lib5c-tutorial/input` as shown in the\n", "[pipeline tutorial](pipeline_tutorial.ipynb)." ], "metadata": { "id": "w8UPgHWrQ6PA", "colab_type": "text" } }, { "cell_type": "code", "source": [ "!mkdir input\n", "!wget -qO- 'http://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE68582&format=file&file=GSE68582%5FBED%5FES%2DNPC%2DiPS%2DLOCI%5Fmm9%2Ebed%2Egz' | gunzip -c > input/BED_ES-NPC-iPS-LOCI_mm9.bed\n", "!wget -qO- 'http://www.ncbi.nlm.nih.gov/geo/download/?acc=GSM1974095&format=file&file=GSM1974095%5Fv65%5FRep1%2Ecounts%2Etxt%2Egz' | gunzip -c > input/v65_Rep1.counts\n", "!wget -qO- 'http://www.ncbi.nlm.nih.gov/geo/download/?acc=GSM1974096&format=file&file=GSM1974096%5Fv65%5FRep2%2Ecounts%2Etxt%2Egz' | gunzip -c > input/v65_Rep2.counts\n", "!wget -qO- 'http://www.ncbi.nlm.nih.gov/geo/download/?acc=GSM1974099&format=file&file=GSM1974099%5FpNPC%5FRep1%2Ecounts%2Etxt%2Egz' | gunzip -c > input/pNPC_Rep1.counts\n", "!wget -qO- 'http://www.ncbi.nlm.nih.gov/geo/download/?acc=GSM1974100&format=file&file=GSM1974100%5FpNPC%5FRep2%2Ecounts%2Etxt%2Egz' | gunzip -c > input/pNPC_Rep2.counts\n", "!sed -i '/Sox2\\|Klf4/!d' input/BED_ES-NPC-iPS-LOCI_mm9.bed" ], "outputs": [], "execution_count": null, "metadata": { "id": "NybQQVz4RpvO", "colab_type": "code", "colab": {} } }, { "cell_type": "markdown", "source": [ "Note for Docker image users\n", "---------------------------\n", "\n", "If you are using `lib5c` from the Docker image, run:\n", "\n", " $ docker run -it -v :/lib5c-tutorial creminslab/lib5c:latest\n", " root@:/# cd /lib5c-tutorial\n", "\nand continue running all tutorial commands in this shell." ], "metadata": { "id": "Y4lRc5wdRxi3", "colab_type": "text" } }, { "cell_type": "markdown", "source": [ "Trimming low-quality primers\n", "----------------------------\n", "\nWe will first trim away low-quality primers by running" ], "metadata": { "id": "XVJZoWkCV233", "colab_type": "text" } }, { "cell_type": "code", "source": [ "!lib5c trim -p input/BED_ES-NPC-iPS-LOCI_mm9.bed -t trimmed/%s.counts trimmed/primers_trimmed.bed 'input/*.counts'" ], "outputs": [], "execution_count": null, "metadata": { "id": "byiHjyRjV7Bx", "colab_type": "code", "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "outputId": "f556ff5c-39e1-4063-a3f4-86a02baa9e3a" } }, { "cell_type": "markdown", "source": [ "Normalization\n", "-------------\n", "\n", "For this tutorial, we will first use Knight-Ruiz matrix balancing to correct our\n", "data for primer effects by running" ], "metadata": { "id": "T7vs-MDMSJ9r", "colab_type": "text" } }, { "cell_type": "code", "source": [ "!lib5c kr trimmed/pNPC_Rep2.counts kr/pNPC_Rep2.counts" ], "outputs": [], "execution_count": null, "metadata": { "id": "jAU8pCOQSNDI", "colab_type": "code", "colab": { "base_uri": "https://localhost:8080/", "height": 68 }, "outputId": "14480757-69c8-4b8e-a41b-bd78a2b7479e" } }, { "cell_type": "markdown", "source": [ "`lib5c kr` specifies that the specific subcommand we want is `kr`, the\n", "subcommand for Knight-Ruiz balancing. To see all the available subcommands, you\n", "can run" ], "metadata": { "id": "zAVSbQ2ZSVir", "colab_type": "text" } }, { "cell_type": "code", "source": [ "!lib5c -h" ], "outputs": [], "execution_count": null, "metadata": { "id": "C0WcnZ3cSdRK", "colab_type": "code", "colab": { "base_uri": "https://localhost:8080/", "height": 493 }, "outputId": "e8c17915-4887-4137-dc69-1ab8d27b642b" } }, { "cell_type": "markdown", "source": [ "To see detailed help for the `kr` subcommand, you can run" ], "metadata": { "id": "povT05eYSecL", "colab_type": "text" } }, { "cell_type": "code", "source": [ "!lib5c kr -h" ], "outputs": [], "execution_count": null, "metadata": { "id": "hx6lzA-jSgmG", "colab_type": "code", "colab": { "base_uri": "https://localhost:8080/", "height": 459 }, "outputId": "ff369b19-6808-44c0-c951-b14f9a684ebe" } }, { "cell_type": "markdown", "source": [ "The two required positional command line arguments for `lib5c kr` are\n", "`infile` (the countsfile we want to balance) and `outfile` (where to write\n", "the balanced countsfile)." ], "metadata": { "id": "bK7JzU5eSicu", "colab_type": "text" } }, { "cell_type": "markdown", "source": [ "### Processing multiple input files\n", "\nIf we want to process a lot of countsfiles, we can use a pattern like this" ], "metadata": { "id": "khmQcrn7Stt4", "colab_type": "text" } }, { "cell_type": "code", "source": [ "!lib5c kr 'trimmed/*.counts' kr/%s.counts" ], "outputs": [], "execution_count": null, "metadata": { "id": "IXkerlkWSxdJ", "colab_type": "code", "colab": { "base_uri": "https://localhost:8080/", "height": 289 }, "outputId": "402f4877-a010-4e1f-ecc2-a25d01fd6b13" } }, { "cell_type": "markdown", "source": [ "Here instead of a single `infile`, we have put in a glob-expandable pattern\n", "that will match all the countsfiles in the `input` directory. We have to quote\n", "this pattern to prevent the shell from expanding it. If we are processing\n", "multiple input countsfiles, then we expect multiple countsfiles as the output.\n", "To specify this easily, we include `%s` in the `outfile`, which will be\n", "replaced with the replicate name (derived automatically from the input\n", "filenames).\n", "\n", "If you're running this in a cluster environment with the LSF (aka `bsub`) job\n", "scheduler available, you can install the `bsub` Python package via\n", "\n", " $ pip install bsub\n", "\n", "and then all the input files will be processed in parallel via `bsub`.\n", "\n", "If you aren't using the LSF job scheduler, you can still specify multiple input\n", "files, but they will be processed in series." ], "metadata": { "id": "OmN2syRySwyJ", "colab_type": "text" } }, { "cell_type": "markdown", "source": [ "Plotting heatmaps\n", "-----------------\n", "\n", "Let's take a look at the countsfiles we just generated.\n", "\n", "One way to visualize countsfiles is to plot heatmaps of them. To do this, we can\n", "run" ], "metadata": { "id": "MCLCDbsaS-KY", "colab_type": "text" } }, { "cell_type": "code", "source": [ "!lib5c plot heatmap -p trimmed/primers_trimmed.bed --region Sox2 kr/pNPC_Rep2.counts kr/pNPC_Rep2_Sox2.png" ], "outputs": [], "execution_count": null, "metadata": { "id": "5gdULSoMTAXD", "colab_type": "code", "colab": { "base_uri": "https://localhost:8080/", "height": 68 }, "outputId": "b0232af4-2b8e-4ba7-e53e-d929e574117b" } }, { "cell_type": "markdown", "source": [ "The resulting image should look something like this" ], "metadata": { "id": "plniOdxPTK2B", "colab_type": "text" } }, { "cell_type": "code", "source": [ "from IPython.display import Image\n", "Image(filename='kr/pNPC_Rep2_Sox2.png', width=400)" ], "outputs": [], "execution_count": null, "metadata": { "id": "Yl6zQ75nTTSc", "colab_type": "code", "colab": { "base_uri": "https://localhost:8080/", "height": 417 }, "outputId": "21c89c4e-256c-4b7a-8c8b-60ee942d4216" } }, { "cell_type": "markdown", "source": [ "The `lib5c plot heatmap` subcommand is a lot like the `lib5c kr` subcommand.\n", "In fact, most `lib5c` subcommands work in almost exactly the same way, so\n", "learning to use one can help you understand the others." ], "metadata": { "id": "CtNGMpqkTPej", "colab_type": "text" } }, { "cell_type": "markdown", "source": [ "### Specifying a primerfile/binfile with `-p/--primerfile`\n", "\n\n", "Almost every `lib5c` subcommand that takes countsfiles as its input also\n", "requires a primerfile or a bedfile that it uses to make sure the countsfiles get\n", "parsed correctly. This is accomplished with the `-p` flag. We didn't have to\n", "use the `-p` flag in the `lib5c kr` example above because the primerfile we\n", "wanted to use was already in the same directory as the countsfiles, so `lib5c`\n", "was able to find it automatically. In general though, we might be trying to\n", "process countsfiles that are separated (directory-wise) from the primerfile that\n", "describes their genomic loci, in which case we would have to use the `-p` flag\n", "as shown above." ], "metadata": { "id": "rzL4iBewTc-f", "colab_type": "text" } }, { "cell_type": "markdown", "source": [ "### Multiple output files (one per region)\n", "\n", "Some `lib5c` subcommands generate multiple output files (typically one per region). To create\n", "an output file for only one region, use the `--region` (shorter form `-r`)\n", "as shown above. Alternatively, if we want to generate output for all regions, we\n", "can skip the `-r/--region` flag and include a `%r` in the `outfile`\n", "instead, which will be replaced with the region name. Combining this with the\n", "\"multiple input files\" idea from above, we can write something like" ], "metadata": { "id": "hO0OcQZaTsFc", "colab_type": "text" } }, { "cell_type": "code", "source": [ "!lib5c plot heatmap -p trimmed/primers_trimmed.bed 'kr/*.counts' kr/%s_%r.png" ], "outputs": [], "execution_count": null, "metadata": { "id": "TRSsEpekT_px", "colab_type": "code", "colab": { "base_uri": "https://localhost:8080/", "height": 289 }, "outputId": "69a4a6fd-df6e-457c-8d9c-30e6bd7e3a08" } }, { "cell_type": "markdown", "source": [ "Binning\n", "-------\n", "\n", "Let's bin our Knight-Ruiz balanced data!\n", "\n", "Before we do this, we need to generate some bins to tile our 5C regions. To do\n", "this, we run" ], "metadata": { "id": "L59oXM5QUDKe", "colab_type": "text" } }, { "cell_type": "code", "source": [ "!lib5c determine-bins -w 4000 trimmed/primers_trimmed.bed bedfiles/4kb_bins.bed" ], "outputs": [], "execution_count": null, "metadata": { "id": "FYdDtyQlUE8Z", "colab_type": "code", "colab": {} } }, { "cell_type": "markdown", "source": [ "The `-w/--bin_width` flag specifies the width of each bin in base pairs, and\n", "`bedfiles/4kb_bins.bed` specifies the name of the newly-created bin bedfile.\n", "\n", "Now that we have a bin bedfile describing the bins, we can figure out what the\n", "counts values for each bin should be by running" ], "metadata": { "id": "LZUjIiN3UFwd", "colab_type": "text" } }, { "cell_type": "code", "source": [ "!lib5c bin -w 24000 -p trimmed/primers_trimmed.bed -b bedfiles/4kb_bins.bed kr/pNPC_Rep2.counts binned/pNPC_Rep2.counts" ], "outputs": [], "execution_count": null, "metadata": { "id": "CmNUq0N3UKd9", "colab_type": "code", "colab": {} } }, { "cell_type": "markdown", "source": [ "Notice that the `lib5c bin` subcommand needs both a primerfile (specified with\n", "`-p`) and a bin bedfile (specified with `-b`). It needs the primerfile to\n", "parse the input files, but it also needs a bin bedfile to know where the bins\n", "should be and to write the output files. The last two arguments are just the\n", "input file and the output file, just like in the previous subcommands.\n", "\nWe can plot a heatmap of our binned data by running" ], "metadata": { "id": "8wgrLji_UPG6", "colab_type": "text" } }, { "cell_type": "code", "source": [ "!lib5c plot heatmap -p bedfiles/4kb_bins.bed -r Sox2 -R -g mm9 binned/pNPC_Rep2.counts binned/pNPC_Rep2_Sox2.png" ], "outputs": [], "execution_count": null, "metadata": { "id": "lnHsu0_PUVsM", "colab_type": "code", "colab": { "base_uri": "https://localhost:8080/", "height": 68 }, "outputId": "b536ce60-a14b-451d-d1e8-0fa1244753aa" } }, { "cell_type": "markdown", "source": [ "The result should look something like this" ], "metadata": { "id": "SrstJiOBUbUt", "colab_type": "text" } }, { "cell_type": "code", "source": [ "Image(filename='binned/pNPC_Rep2_Sox2.png', width=500)" ], "outputs": [], "execution_count": null, "metadata": { "id": "nii0pvHSUb-g", "colab_type": "code", "colab": { "base_uri": "https://localhost:8080/", "height": 534 }, "outputId": "e4f5385c-0c53-4f0b-ff1e-83a91bf33755" } }, { "cell_type": "markdown", "source": [ "Expected modeling\n", "-----------------\n", "\n", "In order to identify \"enriched\" interactions in our freshly-binned data, we\n", "first need to answer the question: \"Enriched with respect to what?\" The answer\n", "to this question is an expected model, which represents our understanding of\n", "what we would expect the interaction frequency between any two bins to be. This\n", "expected value depends both on the linear genomic separation between the bins,\n", "as well as the effects of local contact domain structure.\n", "\nWe can generate an expected model by running" ], "metadata": { "id": "R9KNzt6xXAis", "colab_type": "text" } }, { "cell_type": "code", "source": [ "!lib5c expected -p bedfiles/4kb_bins.bed -RED binned/pNPC_Rep2.counts expected/pNPC_Rep2.counts" ], "outputs": [], "execution_count": null, "metadata": { "id": "YV41hWppVjiH", "colab_type": "code", "colab": { "base_uri": "https://localhost:8080/", "height": 119 }, "outputId": "a6da9bec-cfe4-45b3-a61b-e3f6ac902e9c" } }, { "cell_type": "markdown", "source": [ "Note that for consistency, the flag used to specify the bin bedfile\n", "`bedfiles/4kb_bins.bed` is still `-p`. Most `lib5c` subcommands work on\n", "both fragment-level and bin-level data, with essentially no changes to the\n", "parameters.\n", "\n", "We are using the `-R/--regression`, `-E/--exclude_near_diagonal` and\n", "`-D/--donut_correction` flags to indicate that we want to start with a\n", "regional expected model fitted with a simple log-log regression which ignores\n", "points too close to the diagonal, and then apply donut correction to account for\n", "local variations in contact domain structure." ], "metadata": { "id": "d3DW8T3LXIVX", "colab_type": "text" } }, { "cell_type": "markdown", "source": [ "Variance modeling\n", "-----------------\n", "\n", "Armed with an expected model, we can now readily identify which pairs of bins\n", "interact more frequently than expected. However, what we really want to know is\n", "which pairs of bins interact *significantly* more frequently than expected. In\n", "order to understand the extent of insignificant, noise-driven variations at each\n", "point, we need to construct a variance model - a statistical variance estimate\n", "for each pair of bins.\n", "\nWe can generate a variance model by running" ], "metadata": { "id": "0oEYiulIYSCz", "colab_type": "text" } }, { "cell_type": "code", "source": [ "!lib5c variance -p bedfiles/4kb_bins.bed binned/pNPC_Rep2.counts expected/pNPC_Rep2.counts variance/pNPC_Rep2.counts" ], "outputs": [], "execution_count": null, "metadata": { "id": "YbO4nRWGXXp-", "colab_type": "code", "colab": { "base_uri": "https://localhost:8080/", "height": 68 }, "outputId": "dd89f8ca-7927-4503-b96a-28ea290bd112" } }, { "cell_type": "markdown", "source": [ "By default, this will fit a log-normal deviation-based distance-variance relationship to\n", "the observed data." ], "metadata": { "id": "XYPhqUNvYY7S", "colab_type": "text" } }, { "cell_type": "markdown", "source": [ "P-value calling\n", "---------------\n", "\nFinally, we will call p-values for the interactions in our dataset" ], "metadata": { "id": "90k1nRnAYePR", "colab_type": "text" } }, { "cell_type": "code", "source": [ "!lib5c pvalues -p bedfiles/4kb_bins.bed binned/pNPC_Rep2.counts expected/pNPC_Rep2.counts variance/pNPC_Rep2.counts -L norm pvalues/pNPC_Rep2.counts" ], "outputs": [], "execution_count": null, "metadata": { "id": "7NhcVcwYYXtl", "colab_type": "code", "colab": { "base_uri": "https://localhost:8080/", "height": 85 }, "outputId": "046a4843-e1d6-433a-c82e-d6a993e88d58" } }, { "cell_type": "markdown", "source": [ "Calling p-values simply parametrizes a distribution of a specified family (here\n", "we chose `-L norm` for the lognormal distribution) with mean value taken from\n", "the expected model and variance taken from the variance model, and then calls a\n", "right-tail p-value for the observed value in the binned data." ], "metadata": { "id": "VrClSwmQYkFs", "colab_type": "text" } }, { "cell_type": "markdown", "source": [ "### P-value heatmaps\n", "\n\n", "To visualize p-values, we can use the `lib5c plot heatmap` subcommand as\n", "above, but we need to invoke it like" ], "metadata": { "id": "1SaBce62YnzU", "colab_type": "text" } }, { "cell_type": "code", "source": [ "!lib5c plot heatmap -p bedfiles/4kb_bins.bed -r Sox2 -PR -g mm9 pvalues/pNPC_Rep2.counts pvalues/pNPC_Rep2_Sox2.png" ], "outputs": [], "execution_count": null, "metadata": { "id": "j5gZTrpKYin4", "colab_type": "code", "colab": { "base_uri": "https://localhost:8080/", "height": 68 }, "outputId": "ffa2f2c8-7256-4ed0-90bc-6c2658007144" } }, { "cell_type": "markdown", "source": [ "Here we are using the `-P/--pvalue` flag to indicate that we are trying to\n", "visualize p-values, the `-R/--rulers` flag to indicate that we want to add\n", "genomic coordinate rulers to the heatmap, and the `-g/--genes` flag to\n", "indicate that we want to add gene tracks from the mm9 reference genome.\n", "\nThe results should look something like this:" ], "metadata": { "id": "OPSDV-eGYwKs", "colab_type": "text" } }, { "cell_type": "code", "source": [ "Image(filename='pvalues/pNPC_Rep2_Sox2.png', width=500)" ], "outputs": [], "execution_count": null, "metadata": { "id": "FCtHsKmWYutI", "colab_type": "code", "colab": { "base_uri": "https://localhost:8080/", "height": 534 }, "outputId": "6ade3aef-c044-4975-d030-151f7865333b" } }, { "cell_type": "markdown", "source": [ "Next steps\n", "----------\n", "\n", "Go ahead and explore the other `lib5c` subcommands! Remember that you can list\n", "all the subcommands and get detailed help for a particular subcommand with the\n", "`-h/--help` flag." ], "metadata": { "id": "n7Z_8KroY6F0", "colab_type": "text" } } ], "metadata": { "colab": { "name": "commandline_tutorial.ipynb", "version": "0.3.2", "provenance": [], "collapsed_sections": [] }, "kernelspec": { "name": "python2", "language": "python", "display_name": "Python 2" }, "kernel_info": { "name": "python2" }, "language_info": { "mimetype": "text/x-python", "nbconvert_exporter": "python", "name": "python", "pygments_lexer": "ipython2", "version": "2.7.12", "file_extension": ".py", "codemirror_mode": { "version": 2, "name": "ipython" } }, "nteract": { "version": "0.11.9" } }, "nbformat": 4, "nbformat_minor": 0 }