Loading a dataset
-----------------

Once you have found a dataset of interest on https://cells.ucsc.edu, it is
very easy to load it into your favorite analysis environment. (Let us know if 
the commands below do not work in your environment.)

First, download the expression matrix and the meta data, usually in a Unix terminal::

    mkdir adultPancreas
    cd adultPancreas
    wget https://cells.ucsc.edu/adultPancreas/exprMatrix.tsv.gz
    wget https://cells.ucsc.edu/adultPancreas/meta.tsv

Replace "quakePancreas" above with the dataset name of interest, it is shown in
the URL when you open a dataset after "ds=" or in the download instructions or on the dataset
page as the "CellBrowser dataset identifier".

Then open your favorite tool (e.g. RStudio or Jupyter) and follow the instructions below.

Seurat
^^^^^^

Run these commands if you have downloaded the file as above::

    require(Seurat)
    require(data.table)
    setwd("adultPancreas")
    mat <- fread("exprMatrix.tsv.gz")
    meta <- read.table("meta.tsv", header=T, sep="\t", as.is=T, row.names=1)
    genes = mat[,1][[1]]
    genes = gsub(".+[|]", "", genes)
    mat = data.frame(mat[,-1], row.names=genes, check.names=FALSE)
    so <- CreateSeuratObject(counts = mat, project = "adultPancreas", meta.data=meta)

Or you can download directly into R, without wget, by replacing the fread and read.table commands above in line 4 and 5 with these::

    mat <- fread("https://cells.ucsc.edu/adultPancreas/exprMatrix.tsv.gz")
    meta <- data.frame(fread("https://cells.ucsc.edu/adultPancreas/meta.tsv"), row.names=1)

If your version of data.tables does not support .gz yet, the fread commands can be changed to this::
 
    # from current directory
    mat <- fread("zcat < exprMatrix.tsv.gz")
    # or direct download:
    mat <- fread("curl https://cells.ucsc.edu/adultPancreas/exprMatrix.tsv.gz | zcat")

If the matrix name is not ``exprMatrix.tsv.gz`` but ``matrix.mtx``, you have to
use Seurat's MTX loader.  In addition to ``matrix.mtx``, make sure to also
download the files ``barcodes.tsv`` and ``genes.tsv`` sometimes
called ``features.tsv``.  If you downloaded these three files and ``meta.tsv`` into a directory ``downloadDir``, 
load them like this::

    require(Seurat)
    setwd("downloadDir")
    mat = Read10X(".")
    meta = read.table("meta.tsv", header=T, sep="\t", as.is=T, row.names=1)
    so <- CreateSeuratObject(counts = mat, project = "myProjectName", meta.data=meta)
    
Scanpy
^^^^^^

To create an anndata object in Scanpy if the expression matrix is a .tsv.gz file::

    import scanpy as sc
    import pandas as pd
    ad = sc.read_text("exprMatrix.tsv.gz")
    ad = ad.T
    meta = pd.read_csv("meta.tsv", sep="\t", index_col=0)
    ad.obs = meta

If the expression matrix is an MTX file::

    import scanpy as sc
    import pandas as pd
    ad = sc.read_mtx("matrix.mtx.gz")
    meta = pd.read_csv("meta.tsv", sep="\t")
    ad.var = meta

Some datasets use the format identifier|symbol for the ad.obs gene names (e.g. "ENSG0123123.3|HOX3"). To keep only the symbol:

    ad.obs.index = [x.split("|")[1] for x in ad.obs.index.tolist()]