This package imports data from Agilent automated electrophoresis systems (Bioanalyzer, TapeStation, Fragment Analyzer, ZAG DNA Analyzer, Femto Pulse) and includes functions to graph and analyze the data. To download or contribute to the package, please see its page on GitHub.
ggplot2
) to draw professional-quality electropherograms with any combination of variables, faceting, and aesthetics, including elegant sparklines.ggplot2
, allows you to customize the presentation in any way you can imagine.In the 2100 Expert software, open your data file (.xad
) in the “Data” context. Select “File->Export…” from the top menu. Check the “Export to XML” box and no others. Click “Export” and then save the file wherever you like.
Exporting TapeStation data requires version 4.1 or higher of the TapeStation Analysis software. Open your data file (.D1000
, .HSD1000
, .RNA
., .gDNA
, etc.). You need to export both the metadata (in XML format) and electropherogram data (CSV), so you will generate two similarly named output files from each input file.
Select “File->Export Data->Export to XML”. You do not need to export the gel image or individual EPG images. Select your destination and then click “Export” to save the file.
Select “File->Export Data->Export to CSV”. Unselect all files except “Electropherogram Data”, and unselect “Aligned”. The other settings do not matter but the filename must match the XML file. The software will append _Electropherogram.csv
to the name you enter.
The data must be exported with a specific configuration. Unfortunately this precludes the use of Batch Processing mode.
... Electropherogram.csv
)... Peak Table.csv
)... Size Calibration.csv
)... Smear Analysis Result.csv
)... Quality Table.csv
)To be imported, the required CSV files must be together in the same directory or ZIP archive with the original filenames generated by the ProSize software. Only the CSV files are required.
This package can read XML files compressed with gzip
. Although the Agilent software does not automatically compress its exported XML files, it may be helpful to compress them yourself for long-term storage, particularly the large Bioanalyzer XML files that contain all the raw data.
Uncompressed XML files are assumed to have the extension .xml
and compressed ones .xml.gz
. TapeStation gel image files are assumed to have the extension .png
.
The filename you import can be a local path or a URL, so in principle you can directly open data stored on a remote server. However, opening from URLs works with some servers and fails with others because of problems with file handling in the XML
package, which are outside the scope of this package to fix.
This package includes the read.bioanalyzer
, read.tapestation
, and read.prosize
functions to import data from the different platforms, but it is easier to use the wrapper function read.electrophoresis
, which automatically determines the type of data and can read multiple files into a single object.
For a demonstration, we can use some pre-exported example data from the Agilent software. This package’s extdata
subdirectory contains the pre-exported data from every supported demo file included with the Bioanalyzer 2100 Expert and TapeStation Analysis software (with some typos in the sample names corrected so they are easy to parse):
find.package("bioanalyzeR")
#> [1] "/tmp/Rtmp3ed50Y/temp_libpath402f76dfb8b2/bioanalyzeR"
list.files(paste0(find.package("bioanalyzeR"), "/extdata"), recursive = TRUE)
#> [1] "bioanalyzer/Demo DNA 1000 Series II.xml.gz"
#> [2] "bioanalyzer/Demo DNA 12000 Series II.xml.gz"
#> [3] "bioanalyzer/Demo DNA 7500 Series II.xml.gz"
#> [4] "bioanalyzer/Demo Eukaryote Total RNA Nano Series II.xml.gz"
#> [5] "bioanalyzer/Demo Eukaryote Total RNA Pico Series II.xml.gz"
#> [6] "bioanalyzer/Demo High Sensitivity DNA.xml.gz"
#> [7] "bioanalyzer/Demo mRNA Nano Series II.xml.gz"
#> [8] "bioanalyzer/Demo mRNA Pico Series II.xml.gz"
#> [9] "bioanalyzer/Demo Plant RNA Nano.xml.gz"
#> [10] "bioanalyzer/Demo Plant RNA Pico.xml.gz"
#> [11] "bioanalyzer/Demo Prokaryote Total RNA Nano Series II.xml.gz"
#> [12] "bioanalyzer/Demo Prokaryote Total RNA Pico Series II.xml.gz"
#> [13] "femtopulse/FP-1002 gDNA 165 kb/Yeast Extractions 165Kb EXT F12102 12-41-55.zip"
#> [14] "femtopulse/FP-1002 gDNA 165 kb/Yeast Extractions 165Kb Fast F12102 17-27-23.zip"
#> [15] "femtopulse/FP-1003 55 Kb BAC/17-54-18.zip"
#> [16] "femtopulse/FP-1101 US NGS/300 bp dilution series 21-30-31.zip"
#> [17] "femtopulse/FP-1101 US NGS/Smear dilution 15-49-01.zip"
#> [18] "femtopulse/FP-1201 US RNA/FP-1201 Mouse mRNA Dilution Series.zip"
#> [19] "femtopulse/FP-1201 US RNA/FP-1201 Univeral Mouse Ref Total RNA Dilution Series.zip"
#> [20] "fragmentanalyzer/DNF-464 HS Large Fragment/Long read library final QC 18-12-47.zip"
#> [21] "fragmentanalyzer/DNF-467 Genomic 50 kb/16-30-38.zip"
#> [22] "fragmentanalyzer/DNF-467 Genomic 50 kb/21-54-28.zip"
#> [23] "fragmentanalyzer/DNF-468 HS Genomic 50 kb/Human gDNA 38 15-0.2 ng series DNF-468 kit 12Cap 33cm Array 12-26-43.zip"
#> [24] "fragmentanalyzer/DNF-470 Small RNA/17-44-37.zip"
#> [25] "fragmentanalyzer/DNF-471 RNA kit 15nt/DNF-471-33 E. coli Total RNA Dilution Series 12-22-09.zip"
#> [26] "fragmentanalyzer/DNF-472 HS RNA kit 15nt/DNF-472M Mouse & Rat mRNA dilutions.zip"
#> [27] "fragmentanalyzer/DNF-472 HS RNA kit 15nt/RNA Degradation Study 12-03-13.zip"
#> [28] "fragmentanalyzer/DNF-472 HS RNA kit 15nt/sample dilution series 20-07-57.zip"
#> [29] "fragmentanalyzer/DNF-473 NGS Fragment kit (1-6000)/NGS Libraries SS NGS kit 19-10-40.zip"
#> [30] "fragmentanalyzer/DNF-474 HS NGS (1-6000)/DNA smear dilution series 10-39-02.zip"
#> [31] "fragmentanalyzer/DNF-474 HS NGS (1-6000)/Final NGS library QC 11-32-01.zip"
#> [32] "fragmentanalyzer/DNF-474 HS NGS (1-6000)/SureSelect library with primer dimer 16-38-33.zip"
#> [33] "fragmentanalyzer/DNF-476 Small Fragment (1-1500)/DNF-476 SS Small Fragment Kit sRNA Lib Demo Data 17-23-42.zip"
#> [34] "fragmentanalyzer/DNF-477 HS Small Fragment (1-1500)/cfDNA 15-22-09.zip"
#> [35] "fragmentanalyzer/DNF-477 HS Small Fragment (1-1500)/Small DNA smear 19-18-22.zip"
#> [36] "fragmentanalyzer/DNF-910CP CRISPR Gel Discovery/2017 02 08 12H 18M.zip"
#> [37] "fragmentanalyzer/DNF-915 dsDNA (35-5000)/Arabidopsis genotyping 16-33-12.zip"
#> [38] "fragmentanalyzer/DNF-920 dsDNA (75-15000)/10-34-17.zip"
#> [39] "fragmentanalyzer/DNF-930 dsDNA (35-5000)/14-58-36.zip"
#> [40] "tapestation/cfDNA-Plate-96-cfDNA_Electropherogram.csv.gz"
#> [41] "tapestation/cfDNA-Plate-96-cfDNA.xml.gz"
#> [42] "tapestation/cfDNA-Tubes-16-cfDNA_Electropherogram.csv.gz"
#> [43] "tapestation/cfDNA-Tubes-16-cfDNA.xml.gz"
#> [44] "tapestation/D1000-Plate-96-D1000_Electropherogram.csv.gz"
#> [45] "tapestation/D1000-Plate-96-D1000.xml.gz"
#> [46] "tapestation/D1000-Plate-96.annotations.csv"
#> [47] "tapestation/D1000-Tubes-16-D1000_Electropherogram.csv.gz"
#> [48] "tapestation/D1000-Tubes-16-D1000.xml.gz"
#> [49] "tapestation/D5000-Plate-96-D5000_Electropherogram.csv.gz"
#> [50] "tapestation/D5000-Plate-96-D5000.xml.gz"
#> [51] "tapestation/D5000-Tubes-16-D5000_Electropherogram.csv.gz"
#> [52] "tapestation/D5000-Tubes-16-D5000.xml.gz"
#> [53] "tapestation/Eukaryotic RNA-Plate-64-RNA_Electropherogram.csv.gz"
#> [54] "tapestation/Eukaryotic RNA-Plate-64-RNA.xml.gz"
#> [55] "tapestation/Eukaryotic RNA-Tubes-16-RNA_Electropherogram.csv.gz"
#> [56] "tapestation/Eukaryotic RNA-Tubes-16-RNA.xml.gz"
#> [57] "tapestation/gDNA-Plate-96-gDNA_Electropherogram.csv.gz"
#> [58] "tapestation/gDNA-Plate-96-gDNA.xml.gz"
#> [59] "tapestation/gDNA-Tubes-16-gDNA_Electropherogram.csv.gz"
#> [60] "tapestation/gDNA-Tubes-16-gDNA.xml.gz"
#> [61] "tapestation/High Sensitivity D1000-Tubes-16-HSD1000_Electropherogram.csv.gz"
#> [62] "tapestation/High Sensitivity D1000-Tubes-16-HSD1000.xml.gz"
#> [63] "tapestation/High Sensitivity D5000-Tubes-16-HSD5000_Electropherogram.csv.gz"
#> [64] "tapestation/High Sensitivity D5000-Tubes-16-HSD5000.xml.gz"
#> [65] "tapestation/High Sensitivity Eukaryotic RNA-Plate-96-HSRNA_Electropherogram.csv.gz"
#> [66] "tapestation/High Sensitivity Eukaryotic RNA-Plate-96-HSRNA.xml.gz"
#> [67] "tapestation/High Sensitivity Eukaryotic RNA-Tubes-16-HSRNA_Electropherogram.csv.gz"
#> [68] "tapestation/High Sensitivity Eukaryotic RNA-Tubes-16-HSRNA.xml.gz"
#> [69] "tapestation/HSD1000-HaloplexHS-4-HSD1000_Electropherogram.csv.gz"
#> [70] "tapestation/HSD1000-HaloplexHS-4-HSD1000.xml.gz"
#> [71] "tapestation/HSD5000-SureSelectQXT-3-HSD5000_Electropherogram.csv.gz"
#> [72] "tapestation/HSD5000-SureSelectQXT-3-HSD5000.xml.gz"
#> [73] "tapestation/Prokaryotic RNA-Plate-32-RNA_Electropherogram.csv.gz"
#> [74] "tapestation/Prokaryotic RNA-Plate-32-RNA.xml.gz"
#> [75] "zag/ZAG-105 dsDNA (1-500)/10-23-09.zip"
#> [76] "zag/ZAG-105 dsDNA (1-500)/11-34-22.zip"
#> [77] "zag/ZAG-110 dsDNA (35-5000)/10-25-24.zip"
#> [78] "zag/ZAG-130 dsDNA (75-20000)/ZAG-130 GP 20-53-59.zip"
#> [79] "zag/ZAG-135 dsDNA (1-1500)/09-45-24.zip"
So let’s import a file and see what we have:
read.electrophoresis(system.file(
dna1000 <-"extdata",
"bioanalyzer",
"Demo DNA 1000 Series II.xml.gz",
package = "bioanalyzeR"
))$assay.info
dna1000#> $`Demo DNA 1000 Series II`
#> $`Demo DNA 1000 Series II`$file.name
#> [1] "C:\\Program Files (x86)\\Agilent\\2100 bioanalyzer\\2100 expert\\data\\samples\\demo\\electrophoresis\\Demo DNA 1000 Series II.xad"
#>
#> $`Demo DNA 1000 Series II`$creation.date
#> [1] "2005-12-14T08:03:40.000"
#>
#> $`Demo DNA 1000 Series II`$assay.name
#> [1] "DNA 1000 Series II"
#>
#> $`Demo DNA 1000 Series II`$assay.type
#> [1] "DNA"
#>
#> $`Demo DNA 1000 Series II`$length.unit
#> [1] "bp"
#>
#> $`Demo DNA 1000 Series II`$concentration.unit
#> [1] "ng/µl"
#>
#> $`Demo DNA 1000 Series II`$molarity.unit
#> [1] "nM"
#>
#> $`Demo DNA 1000 Series II`$method
#> [1] "hyman"
$samples dna1000
batch | well.number | sample.name | sample.observations | sample.comment | ladder.well |
---|---|---|---|---|---|
Demo DNA 1000 Series II | 1 | Ladder 1 | Ladder 1 | 13 | |
Demo DNA 1000 Series II | 2 | Ladder 2 | Ladder 2 | 13 | |
Demo DNA 1000 Series II | 3 | Ladder 3 | Ladder 3 | 13 | |
Demo DNA 1000 Series II | 4 | DNA 1000 ladder | Ladder DNA 1000 | 13 | |
Demo DNA 1000 Series II | 5 | Ladder 5 | Ladder 5 | 13 | |
Demo DNA 1000 Series II | 6 | Ladder 6 | Ladder 6 | 13 | |
Demo DNA 1000 Series II | 7 | Ladder 1 | Ladder 1 | 13 | |
Demo DNA 1000 Series II | 8 | Ladder 2 | Ladder 2 | 13 | |
Demo DNA 1000 Series II | 9 | Ladder 3 | Ladder 3 | 13 | |
Demo DNA 1000 Series II | 10 | DNA 1000 ladder | Ladder DNA 1000 | 13 | |
Demo DNA 1000 Series II | 11 | Ladder 5 | Ladder 5 | 13 | |
Demo DNA 1000 Series II | 12 | Ladder 6 | Ladder 6 | 13 | |
Demo DNA 1000 Series II | 13 | Ladder | Ladder DNA 1000 | 13 |
Or we can read several files into one object:
read.electrophoresis(
several.files <-system.file(
"extdata",
"bioanalyzer",
"Demo DNA 1000 Series II.xml.gz",
package = "bioanalyzeR"
),system.file(
"extdata",
"bioanalyzer",
"Demo DNA 7500 Series II.xml.gz",
package = "bioanalyzeR"
),system.file(
"extdata",
"bioanalyzer",
"Demo DNA 12000 Series II.xml.gz",
package = "bioanalyzeR"
)
)$samples several.files
batch | well.number | sample.name | sample.observations | sample.comment | ladder.well |
---|---|---|---|---|---|
Demo DNA 1000 Series II | 1 | Ladder 1 | Ladder 1 | 13 | |
Demo DNA 1000 Series II | 2 | Ladder 2 | Ladder 2 | 13 | |
Demo DNA 1000 Series II | 3 | Ladder 3 | Ladder 3 | 13 | |
Demo DNA 1000 Series II | 4 | DNA 1000 ladder | Ladder DNA 1000 | 13 | |
Demo DNA 1000 Series II | 5 | Ladder 5 | Ladder 5 | 13 | |
Demo DNA 1000 Series II | 6 | Ladder 6 | Ladder 6 | 13 | |
Demo DNA 1000 Series II | 7 | Ladder 1 | Ladder 1 | 13 | |
Demo DNA 1000 Series II | 8 | Ladder 2 | Ladder 2 | 13 | |
Demo DNA 1000 Series II | 9 | Ladder 3 | Ladder 3 | 13 | |
Demo DNA 1000 Series II | 10 | DNA 1000 ladder | Ladder DNA 1000 | 13 | |
Demo DNA 1000 Series II | 11 | Ladder 5 | Ladder 5 | 13 | |
Demo DNA 1000 Series II | 12 | Ladder 6 | Ladder 6 | 13 | |
Demo DNA 1000 Series II | 13 | Ladder | Ladder DNA 1000 | 13 | |
Demo DNA 7500 Series II | 1 | ladder 1 | Ladder 1 | 13 | |
Demo DNA 7500 Series II | 2 | ladder 2 | Ladder 2 | 13 | |
Demo DNA 7500 Series II | 3 | ladder 3 | Ladder 3 | 13 | |
Demo DNA 7500 Series II | 4 | ladder DNA 7500 | DNA 7500 ladder | 13 | |
Demo DNA 7500 Series II | 5 | ladder 5 | Ladder 5 | 13 | |
Demo DNA 7500 Series II | 6 | ladder 6 | Ladder 6 | 13 | |
Demo DNA 7500 Series II | 7 | ladder 1 | Ladder 1 | 13 | |
Demo DNA 7500 Series II | 8 | ladder 2 | Ladder 2 | 13 | |
Demo DNA 7500 Series II | 9 | ladder 3 | Ladder 3 | 13 | |
Demo DNA 7500 Series II | 10 | ladder DNA 7500 | DNA 7500 ladder | 13 | |
Demo DNA 7500 Series II | 11 | ladder 5 | Ladder 5 | 13 | |
Demo DNA 7500 Series II | 12 | ladder 6 | Ladder 6 | 13 | |
Demo DNA 7500 Series II | 13 | Ladder | DNA 7500 ladder | 13 | |
Demo DNA 12000 Series II | 1 | Ladder 3 | Ladder 3 | 13 | |
Demo DNA 12000 Series II | 2 | Ladder x | 13 | ||
Demo DNA 12000 Series II | 3 | Ladder 6 | Ladder 6 | 13 | |
Demo DNA 12000 Series II | 4 | Ladder 2 | Ladder 2 | 13 | |
Demo DNA 12000 Series II | 5 | Ladder 1 | 13 | ||
Demo DNA 12000 Series II | 6 | Ladder 4 | Ladder 4 | 13 | |
Demo DNA 12000 Series II | 7 | Ladder 3 | Ladder 3 | 13 | |
Demo DNA 12000 Series II | 8 | Ladder x | 13 | ||
Demo DNA 12000 Series II | 9 | Ladder 6 | Ladder 6 | 13 | |
Demo DNA 12000 Series II | 10 | Ladder 2 | Ladder 2 | 13 | |
Demo DNA 12000 Series II | 11 | Ladder 1 | Ladder 1 | 13 | |
Demo DNA 12000 Series II | 12 | Ladder 4 | Ladder 4 | 13 | |
Demo DNA 12000 Series II | 13 | Ladder | DNA 12000_Ladder | 13 |
electrophoresis
classLet’s have a closer look at what we get when we open a data file:
read.electrophoresis(system.file(
dna1000 <-"extdata",
"bioanalyzer",
"Demo DNA 1000 Series II.xml.gz",
package = "bioanalyzeR"
))class(dna1000)
#> [1] "electrophoresis"
names(dna1000)
#> [1] "data" "assay.info" "samples" "peaks" "regions"
#> [6] "calibration"
head(dna1000$data)
sample.index | time | fluorescence | aligned.time | length | concentration | molarity |
---|---|---|---|---|---|---|
1 | 30.00 | -0.0004425 | 33.42105 | NA | NA | NA |
1 | 30.05 | -0.2573700 | 33.47368 | NA | -0.0008645 | NA |
1 | 30.10 | -0.0841217 | 33.52632 | NA | -0.0011433 | NA |
1 | 30.15 | -0.1443634 | 33.57895 | NA | -0.0007638 | NA |
1 | 30.20 | -0.1733398 | 33.63158 | NA | -0.0010604 | NA |
1 | 30.25 | -0.2204742 | 33.68421 | NA | -0.0013123 | NA |
head(dna1000$samples)
batch | well.number | sample.name | sample.observations | sample.comment | ladder.well |
---|---|---|---|---|---|
Demo DNA 1000 Series II | 1 | Ladder 1 | Ladder 1 | 13 | |
Demo DNA 1000 Series II | 2 | Ladder 2 | Ladder 2 | 13 | |
Demo DNA 1000 Series II | 3 | Ladder 3 | Ladder 3 | 13 | |
Demo DNA 1000 Series II | 4 | DNA 1000 ladder | Ladder DNA 1000 | 13 | |
Demo DNA 1000 Series II | 5 | Ladder 5 | Ladder 5 | 13 | |
Demo DNA 1000 Series II | 6 | Ladder 6 | Ladder 6 | 13 |
head(dna1000$peaks)
sample.index | peak.observations | length | time | aligned.time | lower.time | upper.time | lower.aligned.time | upper.aligned.time | area | concentration | molarity | lower.length | upper.length |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Lower Marker | 15.0000 | 39.10 | 43.00000 | 37.55 | 40.50 | 41.36842 | 44.47368 | 63.487170 | 4.2000000 | 424.242000 | 17.52598 | 18.71392 |
1 | Possible Co-Migration of 4 Peaks | 48.7340 | 43.15 | 47.26316 | 42.55 | 44.20 | 46.63158 | 48.36842 | 7.176994 | 0.5163261 | 16.052680 | 40.86092 | 60.78989 |
1 | Possible Co-Migration of 2 Peaks | 103.0442 | 49.95 | 54.42105 | 48.80 | 50.95 | 53.21052 | 55.47368 | 11.265350 | 0.6640159 | 9.763611 | 94.99825 | 109.75669 |
1 | Questionable Peak | 144.7399 | 55.10 | 59.84211 | 54.30 | 55.35 | 59.00000 | 60.10526 | 2.983823 | 0.1636650 | 1.713260 | 137.19186 | 146.44150 |
1 | 149.5976 | 55.70 | 60.47368 | 55.35 | 56.50 | 60.10526 | 61.31579 | 5.005250 | 0.2723388 | 2.758294 | 146.44150 | 156.71105 | |
1 | Possible Co-Migration of 2 Peaks | 200.6069 | 61.35 | 66.42105 | 60.40 | 62.80 | 65.42105 | 67.94736 | 27.163960 | 1.2741050 | 9.623101 | 191.98248 | 213.74946 |
head(dna1000$regions)
#> NULL
Each electrophoresis
object contains the main data in its $data
member and has several other members with metadata like the sample names and software-reported peaks and regions of interest (though this particular example had no regions in it).
You’ll notice that many of the values at the beginning of dna1000$data
are NA
. This is because the mobility model (the standard curve relating migration speed to molecule length) does not extrapolate to observations below the lower marker or above the upper marker, and the estimates of concentration and molarity depend on the length. We can see some better examples if we look farther down:
head(subset(dna1000$data, ! is.na(length)))
sample.index | time | fluorescence | aligned.time | length | concentration | molarity | |
---|---|---|---|---|---|---|---|
183 | 1 | 39.10 | 95.47542 | 43.00000 | 15.00000 | 0.4897247 | 52.83526 |
184 | 1 | 39.15 | 94.49809 | 43.05263 | 15.00377 | 0.4953069 | 53.42430 |
185 | 1 | 39.20 | 89.56355 | 43.10526 | 15.01523 | 0.4793073 | 51.65979 |
186 | 1 | 39.25 | 81.28615 | 43.15789 | 15.03459 | 0.4443601 | 47.83254 |
187 | 1 | 39.30 | 70.84047 | 43.21053 | 15.06207 | 0.3951816 | 42.46250 |
188 | 1 | 39.35 | 59.44926 | 43.26316 | 15.09788 | 0.3380439 | 36.23833 |
electrophoresis
objectBecause the electrophoresis
class is complex, it has its own special subset
method (subset.electrophoresis
) to simplify subsetting. The principle is that you request a subset of the samples, and all members are automatically updated.
read.electrophoresis(system.file(
dna1000 <-"extdata",
"bioanalyzer",
"Demo DNA 1000 Series II.xml.gz",
package = "bioanalyzeR"
))$samples dna1000
batch | well.number | sample.name | sample.observations | sample.comment | ladder.well |
---|---|---|---|---|---|
Demo DNA 1000 Series II | 1 | Ladder 1 | Ladder 1 | 13 | |
Demo DNA 1000 Series II | 2 | Ladder 2 | Ladder 2 | 13 | |
Demo DNA 1000 Series II | 3 | Ladder 3 | Ladder 3 | 13 | |
Demo DNA 1000 Series II | 4 | DNA 1000 ladder | Ladder DNA 1000 | 13 | |
Demo DNA 1000 Series II | 5 | Ladder 5 | Ladder 5 | 13 | |
Demo DNA 1000 Series II | 6 | Ladder 6 | Ladder 6 | 13 | |
Demo DNA 1000 Series II | 7 | Ladder 1 | Ladder 1 | 13 | |
Demo DNA 1000 Series II | 8 | Ladder 2 | Ladder 2 | 13 | |
Demo DNA 1000 Series II | 9 | Ladder 3 | Ladder 3 | 13 | |
Demo DNA 1000 Series II | 10 | DNA 1000 ladder | Ladder DNA 1000 | 13 | |
Demo DNA 1000 Series II | 11 | Ladder 5 | Ladder 5 | 13 | |
Demo DNA 1000 Series II | 12 | Ladder 6 | Ladder 6 | 13 | |
Demo DNA 1000 Series II | 13 | Ladder | Ladder DNA 1000 | 13 |
subset(dna1000, sample.name == "Ladder 1")
dna1000.ladder1 <-$samples dna1000.ladder1
batch | well.number | sample.name | sample.observations | sample.comment | ladder.well |
---|---|---|---|---|---|
Demo DNA 1000 Series II | 1 | Ladder 1 | Ladder 1 | 13 | |
Demo DNA 1000 Series II | 7 | Ladder 1 | Ladder 1 | 13 |
$peaks dna1000.ladder1
sample.index | peak.observations | length | time | aligned.time | lower.time | upper.time | lower.aligned.time | upper.aligned.time | area | concentration | molarity | lower.length | upper.length |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Lower Marker | 15.00000 | 39.10 | 43.00000 | 37.55 | 40.50 | 41.36842 | 44.47368 | 63.487170 | 4.2000000 | 424.242000 | 17.52598 | 18.71392 |
1 | Possible Co-Migration of 4 Peaks | 48.73400 | 43.15 | 47.26316 | 42.55 | 44.20 | 46.63158 | 48.36842 | 7.176994 | 0.5163261 | 16.052680 | 40.86092 | 60.78989 |
1 | Possible Co-Migration of 2 Peaks | 103.04420 | 49.95 | 54.42105 | 48.80 | 50.95 | 53.21052 | 55.47368 | 11.265350 | 0.6640159 | 9.763611 | 94.99825 | 109.75669 |
1 | Questionable Peak | 144.73990 | 55.10 | 59.84211 | 54.30 | 55.35 | 59.00000 | 60.10526 | 2.983823 | 0.1636650 | 1.713260 | 137.19186 | 146.44150 |
1 | 149.59760 | 55.70 | 60.47368 | 55.35 | 56.50 | 60.10526 | 61.31579 | 5.005250 | 0.2723388 | 2.758294 | 146.44150 | 156.71105 | |
1 | Possible Co-Migration of 2 Peaks | 200.60690 | 61.35 | 66.42105 | 60.40 | 62.80 | 65.42105 | 67.94736 | 27.163960 | 1.2741050 | 9.623101 | 191.98248 | 213.74946 |
1 | 247.80470 | 66.55 | 71.89474 | 66.00 | 67.60 | 71.31579 | 73.00000 | 15.190050 | 0.6459550 | 3.949556 | 242.70977 | 257.19972 | |
1 | 298.17930 | 72.10 | 77.73684 | 71.40 | 73.25 | 77.00000 | 78.94736 | 22.842370 | 0.8833409 | 4.488554 | 291.76053 | 308.72767 | |
1 | 399.03180 | 81.45 | 87.57895 | 80.65 | 82.65 | 86.73684 | 88.84211 | 37.251460 | 1.1971060 | 4.545490 | 387.93289 | 416.15243 | |
1 | 500.72600 | 87.70 | 94.15790 | 86.75 | 89.10 | 93.15789 | 95.63158 | 60.457780 | 1.7825220 | 5.393745 | 483.77716 | 525.90295 | |
1 | 599.27960 | 92.35 | 99.05263 | 91.25 | 93.95 | 97.89474 | 100.73680 | 48.356040 | 1.3856860 | 3.503410 | 566.61142 | 623.00421 | |
1 | 691.47530 | 96.70 | 103.63160 | 95.70 | 97.25 | 102.57890 | 104.21050 | 51.946870 | 1.4504850 | 3.178280 | 663.98870 | 703.98308 | |
1 | 733.62730 | 97.80 | 104.78950 | 97.25 | 99.55 | 104.21050 | 106.63160 | 85.765930 | 2.3990110 | 4.954644 | 703.98308 | 799.04957 | |
1 | 886.34890 | 100.55 | 107.68420 | 99.55 | 101.35 | 106.63160 | 108.52630 | 78.290610 | 2.1990900 | 3.759187 | 799.04957 | 970.04909 | |
1 | 1062.70300 | 102.10 | 109.31580 | 101.35 | 104.50 | 108.52630 | 111.84210 | 111.507300 | 3.0126590 | 4.295300 | 970.04909 | 1371.69818 | |
1 | Upper Marker | 1500.00000 | 105.60 | 113.00000 | 104.60 | 107.00 | 111.94740 | 114.47370 | 81.828600 | 2.1000000 | 2.121210 | 1384.01736 | 1630.29054 |
2 | Lower Marker | 15.00000 | 38.10 | 43.00000 | 36.50 | 39.80 | 41.22925 | 44.88142 | 60.471110 | 4.2000000 | 424.242000 | 17.86830 | 21.40719 |
2 | Possible Co-Migration of 4 Peaks | 49.37152 | 42.00 | 47.31620 | 41.45 | 43.00 | 46.70751 | 48.42292 | 7.704427 | 0.4866122 | 14.933520 | 41.82392 | 61.32174 |
2 | Possible Co-Migration of 2 Peaks | 104.15310 | 48.55 | 54.56522 | 47.45 | 49.40 | 53.34782 | 55.50593 | 12.328830 | 0.6406598 | 9.319897 | 95.82338 | 109.98622 |
2 | Questionable Peak | 144.58620 | 53.30 | 59.82213 | 52.50 | 53.55 | 58.93676 | 60.09881 | 3.557921 | 0.1724338 | 1.806971 | 136.66853 | 146.38707 |
2 | 149.69360 | 53.90 | 60.48617 | 53.55 | 54.70 | 60.09881 | 61.37154 | 5.903490 | 0.2836979 | 2.871497 | 146.38707 | 157.18610 | |
2 | Possible Co-Migration of 2 Peaks | 200.96380 | 59.30 | 66.46245 | 58.45 | 60.45 | 65.52174 | 67.73518 | 29.403260 | 1.2173120 | 9.177821 | 192.85111 | 211.92376 |
2 | 247.24690 | 64.15 | 71.83004 | 63.60 | 64.90 | 71.22134 | 72.66008 | 17.541650 | 0.6596678 | 4.042501 | 241.89770 | 254.27345 | |
2 | 297.82420 | 69.45 | 77.69566 | 68.40 | 70.45 | 76.53360 | 78.80238 | 25.624100 | 0.8758825 | 4.455963 | 287.71502 | 307.45631 | |
2 | 398.68740 | 78.35 | 87.54546 | 77.40 | 79.55 | 86.49407 | 88.87353 | 41.578320 | 1.1809730 | 4.488104 | 384.92614 | 416.60226 | |
2 | Possible Co-Migration of 2 Peaks | 501.28730 | 84.35 | 94.18578 | 83.60 | 85.75 | 93.35574 | 95.73518 | 93.347700 | 2.4307890 | 7.347105 | 487.08608 | 527.70940 |
2 | 595.99160 | 88.60 | 98.88933 | 87.90 | 90.25 | 98.11463 | 100.71540 | 60.722940 | 1.5385310 | 3.911305 | 570.73404 | 622.55101 | |
2 | 689.58220 | 92.80 | 103.53760 | 91.80 | 93.40 | 102.43080 | 104.20160 | 66.136880 | 1.6321430 | 3.586145 | 660.54011 | 703.75347 | |
2 | 734.58020 | 93.95 | 104.81030 | 93.40 | 95.50 | 104.20160 | 106.52570 | 98.885480 | 2.4436170 | 5.040221 | 703.75347 | 792.58413 | |
2 | 875.31400 | 96.45 | 107.57710 | 95.55 | 97.25 | 106.58100 | 108.46250 | 96.875570 | 2.4111390 | 4.173631 | 795.92295 | 962.87061 | |
2 | 1046.80200 | 97.90 | 109.18180 | 97.25 | 100.50 | 108.46250 | 112.05930 | 128.535800 | 3.0735300 | 4.448652 | 962.87061 | 1396.98969 | |
2 | Upper Marker | 1500.00000 | 101.35 | 113.00000 | 100.55 | 102.75 | 112.11460 | 114.54940 | 92.634490 | 2.1000000 | 2.121210 | 1403.35261 | 1635.65406 |
You can see that in addition to dna1000.ladder1$samples
, $peaks
and other members have also been reduced to data from the remaining samples even though they don’t contain the sample.observations
variable themselves. Instead, the sample.index
column has been updated to point to the new row numbers of those samples in the $samples
table.
electrophoresis
objectsThe electrophoresis
class also has a special method for combining multiple instances into one, which is rbind
(rbind.electrophoresis
) since most members are data frames and multiple instances should have the same columns. However, this special rbind
method automatically updates the sample.index
columns and concatenates the other members that are not data frames.
read.electrophoresis(system.file(
dna1000 <-"extdata",
"bioanalyzer",
"Demo DNA 1000 Series II.xml.gz",
package = "bioanalyzeR"
)) read.electrophoresis(system.file(
dna7500 <-"extdata",
"bioanalyzer",
"Demo DNA 7500 Series II.xml.gz",
package = "bioanalyzeR"
))unique(dna1000$data$sample.index)
#> [1] 1 2 3 4 5 6 7 8 9 10 11 12 13
unique(dna7500$data$sample.index)
#> [1] 1 2 3 4 5 6 7 8 9 10 11 12 13
rbind(dna1000, dna7500)
combined.batches <-unique(combined.batches$data$sample.index)
#> [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
#> [26] 26
$assay.info
combined.batches#> $`Demo DNA 1000 Series II`
#> $`Demo DNA 1000 Series II`$file.name
#> [1] "C:\\Program Files (x86)\\Agilent\\2100 bioanalyzer\\2100 expert\\data\\samples\\demo\\electrophoresis\\Demo DNA 1000 Series II.xad"
#>
#> $`Demo DNA 1000 Series II`$creation.date
#> [1] "2005-12-14T08:03:40.000"
#>
#> $`Demo DNA 1000 Series II`$assay.name
#> [1] "DNA 1000 Series II"
#>
#> $`Demo DNA 1000 Series II`$assay.type
#> [1] "DNA"
#>
#> $`Demo DNA 1000 Series II`$length.unit
#> [1] "bp"
#>
#> $`Demo DNA 1000 Series II`$concentration.unit
#> [1] "ng/µl"
#>
#> $`Demo DNA 1000 Series II`$molarity.unit
#> [1] "nM"
#>
#> $`Demo DNA 1000 Series II`$method
#> [1] "hyman"
#>
#>
#> $`Demo DNA 7500 Series II`
#> $`Demo DNA 7500 Series II`$file.name
#> [1] "C:\\Program Files (x86)\\Agilent\\2100 bioanalyzer\\2100 expert\\data\\samples\\demo\\electrophoresis\\Demo DNA 7500 Series II.xad"
#>
#> $`Demo DNA 7500 Series II`$creation.date
#> [1] "2005-12-14T09:56:52.000"
#>
#> $`Demo DNA 7500 Series II`$assay.name
#> [1] "DNA 7500 Series II"
#>
#> $`Demo DNA 7500 Series II`$assay.type
#> [1] "DNA"
#>
#> $`Demo DNA 7500 Series II`$length.unit
#> [1] "bp"
#>
#> $`Demo DNA 7500 Series II`$concentration.unit
#> [1] "ng/µl"
#>
#> $`Demo DNA 7500 Series II`$molarity.unit
#> [1] "nM"
#>
#> $`Demo DNA 7500 Series II`$method
#> [1] "hyman"
The rbind
method is automatically used to combine multiple batches in read.electrophoresis
, so if you use that function to import the batches at the same time, you probably will not need to rbind
them later.
qplot.electrophoresis
The members of an electrophoresis
object that contain graphable data are data frames compatible with ggplot2
. However, the metadata by sample are in the $samples
member while the actual electrophoresis data are kept separately in the $data
member. To simplify graphing, this package includes the qplot.electrophoresis
function, which is analogous to ggplot2::qplot
but has slightly different syntax: in particular, the first argument is the electrophoresis
object, and the x- and y-variables have defaults.
read.electrophoresis(system.file(
d1000 <-"extdata",
"tapestation",
"D1000-Tubes-16-D1000.xml.gz",
package = "bioanalyzeR"
))qplot.electrophoresis(d1000, title = "TapeStation D1000")
This produces plots analogous to the electropherograms in the Agilent software. However, with the default settings there are several differences; these settings are described in the next sections.
qplot.electrophoresis
displays the software-reported peaks as filled area under the curve. You can stop displaying peaks by setting show.peaks = "none"
:
qplot.electrophoresis(
d1000,show.peaks = "none",
title = "TapeStation D1000 with no peaks"
)
Or you can include the markers with include.markers = TRUE
and fill only the marker peaks with show.peaks = "markers"
, though it’s difficult to see them without changing the axis variables (explained later):
qplot.electrophoresis(
d1000,x = "relative.distance",
y = "fluorescence",
include.markers = TRUE,
show.peaks = "markers",
title = "TapeStation D1000 with marker peaks"
)
Reported regions of interest are shown as semitransparent gray rectangles; you can modify the transparency of these or set it to NA to stop displaying regions:
qplot.electrophoresis(
d1000,region.alpha = 0.5,
title = "TapeStation D1000 with darker regions"
)
qplot.electrophoresis(
d1000,region.alpha = NA,
title = "TapeStation D1000 with no regions"
)
Finally, the readings from the samples themselves are plotted by default as a continuous curve (ggplot2::geom_line
). The other supported geom is geom_area
, which (as in qplot
) you get by setting geom = "area"
:
read.electrophoresis(system.file(
eukrna <-"extdata",
"tapestation",
"Eukaryotic RNA-Tubes-16-RNA.xml.gz",
package = "bioanalyzeR"
))qplot.electrophoresis(
eukrna,y = "concentration",
xlim = c(100, NA),
title = "TapeStation Eukaryotic RNA with geom_line"
)
qplot.electrophoresis(
eukrna,y = "concentration",
xlim = c(100, NA),
geom = "area",
title = "TapeStation Eukaryotic RNA with geom_area"
)
You can easily zoom in on an interesting feature by setting the xlim
or ylim
arguments (note that you can leave a limit as NA
to let the software choose):
qplot.electrophoresis(
eukrna,y = "concentration",
xlim = c(100, 2000),
ylim = c(NA, 0.08),
title = "TapeStation Eukaryotic RNA zoomed in"
)#> Warning: Removed 730 rows containing missing values (position_stack).
#> Warning: Removed 1428 row(s) containing missing values (geom_path).
Or you can use the scales
argument, which is passed to ggplot2::facet_wrap
or ggplot2::facet_grid
, to allow different facets to automatically get different axis scales:
qplot.electrophoresis(
eukrna,y = "concentration",
xlim = c(100, 2000),
scales = "free_y",
title = "TapeStation Eukaryotic RNA with free y-scales"
)
Another difference from the electropherograms in the software is that, by default, the x-axis (molecule length) is in a linear scale. In the Agilent software, data points are simply graphed as they appeared to the instrument and the x-axis labels are roughly (but not exactly; see stdcrv.mobility
below) logarithmic. You can log-scale the axes with the log
argument that behaves the same as in ggplot2::qplot
:
qplot.electrophoresis(
eukrna,y = "concentration",
xlim = c(100, 2000),
log = "x",
title = "TapeStation Eukaryotic RNA with log x-scale"
)#> Scale for 'x' is already present. Adding another scale for 'x', which will
#> replace the existing scale.
qplot.electrophoresis(
eukrna,y = "concentration",
xlim = c(100, 2000),
log = "xy",
title = "TapeStation Eukaryotic RNA with log x- and y-scales"
)#> Scale for 'x' is already present. Adding another scale for 'x', which will
#> replace the existing scale.
Of course, log-scaling the y-axis gives weird results when the values are fractional.
By now you have probably noticed that the y-axis displays molarity or concentration rather than fluorescence. In fact it displays molarity per length (the molarity estimate for each point is scaled with the differential.scale
function) so that the area under a curve, between any two x-values, is directly proportional to the molarity of molecules in that range.
If you change the y-variable to y = "fluorescence"
you can see something closer to the electropherograms from the Agilent software:
subset(d1000, sample.name == "ladder")
ladder1 <-qplot.electrophoresis(
ladder1,y = "fluorescence",
title = "TapeStation D1000 ladder fluorescence"
)
This is still not quite the same as the electropherograms from the Agilent software because the x-axis is linear. You could log-scale it as above, but to get a truly analogous graph you can simply change the x-variable to relative.distance
, which for TapeStation data is the migration distance normalized to the markers (the capillary instruments have aligned.time
instead):
qplot.electrophoresis(
ladder1,x = "relative.distance",
y = "fluorescence",
title = "TapeStation D1000 ladder fluorescence vs. distance"
)
Note that the x-axis is automatically reversed when the x-variable is distance, to keep the plots in the same orientation.
Compare fluorescence with concentration:
qplot.electrophoresis(
ladder1,y = "concentration",
title = "TapeStation D1000 ladder concentration"
)
Even though differential-scaled concentration is more directly informative than raw fluorescence, for many experiments the real variable of interest is molarity. Rather than the total mass of molecules, we want to know the number of molecules. The problem is that the mass concentration, and therefore the fluorescence, depends on the length of the molecule: a 2 kb fragment has approximately twice the mass and twice the fluorescence of a 1 kb fragment, even if the copy number is the same. So that is the default y-value:
qplot.electrophoresis(
ladder1,title = "TapeStation D1000 ladder molarity"
)
If our variable of interest was molarity, the concentration graph would have been very misleading! And the original fluorescence vs. distance graph even moreso.
Or sometimes the absolute variable is irrelevant and what we really want to know is the size distribution within each sample. For that you can set normalize = "total"
, which uses the normalize.proportion
function to scale the variable of interest to a proportion of the sample’s total, i.e. the total area under every curve (between the markers) is 1. Notice how the different dilutions of the same sample become roughly equal after normalization, until very low dilutions:
.96 <- read.electrophoresis(system.file(
d1000"extdata",
"tapestation",
"D1000-Plate-96-D1000.xml.gz",
package = "bioanalyzeR"
))qplot.electrophoresis(
subset(d1000.96, startsWith(as.character(sample.name), "frag. lambda DNA")),
facets = NULL,
xlim = c(100, 800),
title = "TapeStation D1000"
)
qplot.electrophoresis(
subset(d1000.96, startsWith(as.character(sample.name), "frag. lambda DNA")),
facets = NULL,
xlim = c(100, 800),
normalize = "total",
title = "TapeStation D1000 normalized"
)
Normalized values are proportions of the total, so it may be intuitive to label them as percentages, using existing commands from the ggplot2
and scales
packages:
qplot.electrophoresis(
subset(d1000.96, startsWith(as.character(sample.name), "frag. lambda DNA")),
facets = NULL,
xlim = c(100, 800),
normalize = "total",
title = "TapeStation D1000 normalized as percentages"
+ scale_y_continuous(label = percent) )
What if you care about only the distribution of molecules within a certain range? For example, you might have sequencing libraries with some shorter adapter dimers, but what you really care about is the distribution of sequenceable fragments. In that case you can set normalize = "window"
, which calls normalize.proportion
on only the data in the window between your plot’s x-limits.
subset(read.electrophoresis(system.file(
libqc <-"extdata",
"fragmentanalyzer",
"DNF-473 NGS Fragment kit (1-6000)",
"NGS Libraries SS NGS kit 19-10-40.zip",
package = "bioanalyzeR"
startsWith(as.character(sample.name), "NGS"))
)), qplot.electrophoresis(
libqc,facets = NULL,
xlim = c(100, 1000),
title = "Original molarity"
)
qplot.electrophoresis(
libqc,facets = NULL,
xlim = c(100, 1000),
normalize = "total",
title = "Molarity normalized by total"
)
qplot.electrophoresis(
libqc,facets = NULL,
xlim = c(350, 1000),
normalize = "total",
title = "Molarity normalized by total, window of interest"
)
qplot.electrophoresis(
libqc,facets = NULL,
xlim = c(350, 1000),
normalize = "window",
title = "Molarity normalized within window of interest"
)
In other words, when normalize = "total"
, the area under each sample’s entire curve (between markers) is equal, no matter how much of the curve is actually shown in the plot; when normalize = "window"
, the area under each curve in the plot is equal and data outside the limits of the plot are ignored.
Sometimes it is more interesting to overlay samples in the same plot rather than side-by-side in small multiples. The facets
argument, as in ggplot2::qplot
, automatically infers the appropriate faceting function and facets = NULL
skips faceting and plots all data in one area. In that case, the samples are automatically color-coded by name:
read.electrophoresis(system.file(
d5000 <-"extdata",
"tapestation",
"D5000-Tubes-16-D5000.xml.gz",
package = "bioanalyzeR"
))qplot.electrophoresis(
d5000,xlim = c(NA, 2000),
facets = NULL,
title = "TapeStation D5000 with no faceting"
)
Or if you are using geom = "area"
, they are color-coded and also semitransparent (controlled with the area.alpha
argument):
qplot.electrophoresis(
d5000,xlim = c(NA, 2000),
facets = NULL,
geom = "area",
title = "TapeStation D5000 with geom_area and no faceting"
)
There may be some pattern among your samples that you want to explore. You can add new graphable variables to an electrophoresis
object by adding columns to the $samples
data frame. Let’s consider a larger batch:
levels(d1000.96$samples$sample.name)
#> [1] "Ladder" "frag. lambda DNA" "frag. lambda DNA 1:1"
#> [4] "frag. lambda DNA 1:2" "frag. lambda DNA 1:4" "frag. lambda DNA 1:8"
#> [7] "frag. lambda DNA 1:16" "frag. lambda DNA 1:32" "DNA ladder 2"
#> [10] "100bp fragment" "100bp fragment 1:1" "100bp fragment 1:2"
#> [13] "100bp fragment 1:4" "100bp fragment 1:8" "100bp fragment 1:16"
#> [16] "100bp fragment 1:32" "TE buffer" "750bp fragment"
#> [19] "750bp fragment 1:1" "750bp fragment 1:2" "750bp fragment 1:4"
#> [22] "750bp fragment 1:8" "750bp fragment 1:16" "750bp fragment 1:32"
#> [25] "DNA library" "DNA library 1:1" "DNA library 1:2"
#> [28] "DNA library 1:4" "DNA library 1:8" "DNA library 1:16"
#> [31] "DNA library 1:32"
This batch has a few different samples that are diluted to different degrees. We can turn those into variables. First we’ll take the subset with :
in the sample name to find the dilutions, then we’ll parse the sample name and the dilution into variables.
subset(d1000.96, grepl(":", sample.name))
dilutions <-$samples$sample.type <- factor(sub(' 1:.*', "", dilutions$samples$sample.name))
dilutions$samples$dilution <- factor(as.integer(sub('.*:', "", dilutions$samples$sample.name)))
dilutionsunique(dilutions$samples[,c("sample.name", "sample.type", "dilution")])
sample.name | sample.type | dilution |
---|---|---|
frag. lambda DNA 1:1 | frag. lambda DNA | 1 |
frag. lambda DNA 1:2 | frag. lambda DNA | 2 |
frag. lambda DNA 1:4 | frag. lambda DNA | 4 |
frag. lambda DNA 1:8 | frag. lambda DNA | 8 |
frag. lambda DNA 1:16 | frag. lambda DNA | 16 |
frag. lambda DNA 1:32 | frag. lambda DNA | 32 |
100bp fragment 1:1 | 100bp fragment | 1 |
100bp fragment 1:2 | 100bp fragment | 2 |
100bp fragment 1:4 | 100bp fragment | 4 |
100bp fragment 1:8 | 100bp fragment | 8 |
100bp fragment 1:16 | 100bp fragment | 16 |
100bp fragment 1:32 | 100bp fragment | 32 |
750bp fragment 1:1 | 750bp fragment | 1 |
750bp fragment 1:2 | 750bp fragment | 2 |
750bp fragment 1:4 | 750bp fragment | 4 |
750bp fragment 1:8 | 750bp fragment | 8 |
750bp fragment 1:16 | 750bp fragment | 16 |
750bp fragment 1:32 | 750bp fragment | 32 |
DNA library 1:1 | DNA library | 1 |
DNA library 1:2 | DNA library | 2 |
DNA library 1:4 | DNA library | 4 |
DNA library 1:8 | DNA library | 8 |
DNA library 1:16 | DNA library | 16 |
DNA library 1:32 | DNA library | 32 |
Or, if we’ve already prepared the new variables in some other table (like a spreadsheet file), we can use the annotate.electrophoresis
function to add them easily:
system.file(
annotation.file <-"extdata",
"tapestation",
"D1000-Plate-96.annotations.csv",
package = "bioanalyzeR"
) annotate.electrophoresis(dilutions, annotation.file)
dilutions <-unique(dilutions$samples[,c("sample.name", "sample.type2", "dilution2")])
sample.name | sample.type2 | dilution2 |
---|---|---|
frag. lambda DNA 1:1 | frag. lambda DNA | 1 |
frag. lambda DNA 1:2 | frag. lambda DNA | 2 |
frag. lambda DNA 1:4 | frag. lambda DNA | 4 |
frag. lambda DNA 1:8 | frag. lambda DNA | 8 |
frag. lambda DNA 1:16 | frag. lambda DNA | 16 |
frag. lambda DNA 1:32 | frag. lambda DNA | 32 |
100bp fragment 1:1 | 100bp fragment | 1 |
100bp fragment 1:2 | 100bp fragment | 2 |
100bp fragment 1:4 | 100bp fragment | 4 |
100bp fragment 1:8 | 100bp fragment | 8 |
100bp fragment 1:16 | 100bp fragment | 16 |
100bp fragment 1:32 | 100bp fragment | 32 |
750bp fragment 1:1 | 750bp fragment | 1 |
750bp fragment 1:2 | 750bp fragment | 2 |
750bp fragment 1:4 | 750bp fragment | 4 |
750bp fragment 1:8 | 750bp fragment | 8 |
750bp fragment 1:16 | 750bp fragment | 16 |
750bp fragment 1:32 | 750bp fragment | 32 |
DNA library 1:1 | DNA library | 1 |
DNA library 1:2 | DNA library | 2 |
DNA library 1:4 | DNA library | 4 |
DNA library 1:8 | DNA library | 8 |
DNA library 1:16 | DNA library | 16 |
DNA library 1:32 | DNA library | 32 |
Now that we’ve added new variables, we can facet on them:
qplot.electrophoresis(
dilutions,facets = sample.type ~ dilution,
show.peaks = "none",
region.alpha = NA,
scales = "free_y",
title = "Faceting on sample vs. dilution"
)
Or we could facet on one variable and add the other as a custom aesthetic such as the curve color: the ...
collects additional arguments and passes them to the aes
argument of ggplot
.
qplot.electrophoresis(
dilutions,facets = ~ sample.type,
color = dilution,
show.peaks = "none",
region.alpha = NA,
scales = "free_y",
title = "Faceting on sample with dilutions color-coded"
)
Notice that replicates of the same combination of conditions are still plotted separately (the function automatically adds the aesthetic group = sample.index
).
The Agilent software lacks the ability to change the graphing variables to estimated length, concentration, molarity, etc., so if we want to compare our electropherograms between bioanalyzeR and the Agilent software, we can use the rawplot.electrophoresis
function to graph the raw data in the same form, fluorescence vs. aligned migration with the ladder lengths marked:
read.electrophoresis(system.file(
dna1000 <-"extdata",
"bioanalyzer",
"Demo DNA 1000 Series II.xml.gz",
package = "bioanalyzeR"
))rawplot.electrophoresis(dna1000, scales = "free_y") + theme(legend.position = "none")
These graphs are unintuitive because the y-variable is fluorescence rather than a variable of interest such as concentration or molarity, which are not directly proportional, and the x-variable shows migration rather than molecule length, which is neither directly proportional nor exactly log-proportional. The ladder peak positions are shown accurately in the (aligned) raw migration dimension, but they are unevenly spaced and visually interpolating positions between them is not straightforward. However, they can be a useful safety check to make sure the imported data match the data in the Agilent software.
Because the output of qplot.electrophoresis
is a ggplot
object, you can add additonal layers or themes to customize it. One additional configuration is already provided: sparklines, a minimalist line graph. The sparkline.electrophoresis
function simply hardcodes some appropriate settings and additions to the defaults of qplot.electrophoresis
and passes the rest of the arguments to it, so usage is the same.
The default faceting formula draws all sparklines in one vertical column as is customary (facets = sample.index ~ .
). Regions are still drawn and can still be disabled with region.alpha = NA
.
sparkline.electrophoresis(d1000, title = "TapeStation D1000 sparklines")
However, we still have the freedom to use other faceting formulas and aesthetics:
sparkline.electrophoresis(
dilutions,facets = sample.type ~ .,
color = dilution,
region.alpha = NA,
title = "TapeStation D1000 sparklines faceted by sample with dilutions color-coded"
)
Previously we saw how qplot.electrophoresis
makes a detailed electropherogram for individual samples in facets or several samples overlaid. Alternatively, we can use violin.electrophoresis
to draw a violin plot showing several samples’ length distributions side-by-side:
read.electrophoresis(system.file(
rnadeg <-"extdata",
"fragmentanalyzer",
"DNF-472 HS RNA kit 15nt",
"RNA Degradation Study 12-03-13.zip",
package = "bioanalyzeR"
))$samples rnadeg
batch | well.number | well.row | well.col | sample.name | RQN | 28S/18S | ladder.well |
---|---|---|---|---|---|---|---|
RNA Degradation Study 12-03-13 | A1 | A | 1 | 0 min at 70C Universal Mouse Reference Total RNA | 9.5 | 1.9 | A12 |
RNA Degradation Study 12-03-13 | A2 | A | 2 | 2 min at 70C Universal Mouse Reference Total RNA | 9.0 | 1.5 | A12 |
RNA Degradation Study 12-03-13 | A3 | A | 3 | 4 min at 70C Universal Mouse Reference Total RNA | 8.0 | 1.2 | A12 |
RNA Degradation Study 12-03-13 | A4 | A | 4 | 6 min at 70C Universal Mouse Reference Total RNA | 7.3 | 0.8 | A12 |
RNA Degradation Study 12-03-13 | A5 | A | 5 | 8 min at 70C Universal Mouse Reference Total RNA | 6.1 | 0.4 | A12 |
RNA Degradation Study 12-03-13 | A6 | A | 6 | 10 min at 70C Universal Mouse Reference Total RNA | 5.7 | 0.2 | A12 |
RNA Degradation Study 12-03-13 | A7 | A | 7 | 12 min at 70C Universal Mouse Reference Total RNA | 5.5 | 0.1 | A12 |
RNA Degradation Study 12-03-13 | A8 | A | 8 | 14 min at 70C Universal Mouse Reference Total RNA | 4.2 | 0.0 | A12 |
RNA Degradation Study 12-03-13 | A9 | A | 9 | 16 min at 70C Universal Mouse Reference Total RNA | 3.4 | 0.0 | A12 |
RNA Degradation Study 12-03-13 | A10 | A | 10 | 20 min at 70C Universal Mouse Reference Total RNA | 2.8 | 0.0 | A12 |
RNA Degradation Study 12-03-13 | A11 | A | 11 | 18 min at 70C Universal Mouse Reference Total RNA | 3.5 | 0.0 | A12 |
RNA Degradation Study 12-03-13 | A12 | A | 12 | DNF-386 HS RNA Ladder | 6.0 | 0.8 | A12 |
$samples$degradation <- sub(" Universal.*", "", as.character(rnadeg$samples$sample.name)) # shorten the names for graphing
rnadeg$samples$degradation <- factor(rnadeg$samples$degradation, levels = paste(0:10 * 2, "min at 70C")) # manually set order of levels since the samples are loaded out of order
rnadegviolin.electrophoresis(rnadeg, x = "degradation", title = "Violin plot of RNA degradation") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) # rotate axis labels since they're still a little long
This shows us nicely how the RNA degrades with heat treatment; in particular the 28S ribosome (4.7 kb) degrades faster than the 18S (1.9 kb), presumably since it is larger and has more places to break.
Maybe it would be more intuitive to plot it horizontally?
violin.electrophoresis(rnadeg, x = "degradation", flip = T, title = "Horizontal violin plot of RNA degradation")
To accentuate the relationship of the samples we can add a fill color, though a legend would be redundant with the existing sample labels:
violin.electrophoresis(rnadeg, x = "degradation", flip = T, fill = degradation, title = "Colored horizontal violin plot of RNA degradation") + theme(legend.position = "none")
But in this case our samples compare a quantitative variable, not a categorical one, so we can convert our factor to integers:
violin.electrophoresis(rnadeg, x = "degradation", flip = T, fill = as.integer(degradation), title = "Quantitatively colored horizontal violin plot of RNA degradation") + theme(legend.position = "none")
What if we have replicates of the same condition? Then violin.electrophoresis
groups them together:
read.electrophoresis(system.file(
libqc <-"extdata",
"fragmentanalyzer",
"DNF-473 NGS Fragment kit (1-6000)",
"NGS Libraries SS NGS kit 19-10-40.zip",
package = "bioanalyzeR"
))$samples libqc
batch | well.number | well.row | well.col | sample.name | ladder.well |
---|---|---|---|---|---|
NGS Libraries SS NGS kit 19-10-40 | B1 | B | 1 | NGS Lib #1 | B12 |
NGS Libraries SS NGS kit 19-10-40 | B2 | B | 2 | NGS Lib #1 | B12 |
NGS Libraries SS NGS kit 19-10-40 | B3 | B | 3 | NGS Lib #2 | B12 |
NGS Libraries SS NGS kit 19-10-40 | B4 | B | 4 | NGS Lib #2 | B12 |
NGS Libraries SS NGS kit 19-10-40 | B5 | B | 5 | NGS Lib #3 | B12 |
NGS Libraries SS NGS kit 19-10-40 | B6 | B | 6 | NGS Lib #3 | B12 |
NGS Libraries SS NGS kit 19-10-40 | B7 | B | 7 | NGS Lib #4 | B12 |
NGS Libraries SS NGS kit 19-10-40 | B8 | B | 8 | NGS Lib #4 | B12 |
NGS Libraries SS NGS kit 19-10-40 | B9 | B | 9 | DM | B12 |
NGS Libraries SS NGS kit 19-10-40 | B10 | B | 10 | DM | B12 |
NGS Libraries SS NGS kit 19-10-40 | B11 | B | 11 | Ladder | B12 |
NGS Libraries SS NGS kit 19-10-40 | B12 | B | 12 | Ladder | B12 |
subset(libqc, startsWith(as.character(sample.name), "NGS")) # remove extraneous samples
libqc <-violin.electrophoresis(libqc, title = "Violin plot of sequencing library QC")
Let’s add colors to accentuate the pairs of replicates, and zoom in:
violin.electrophoresis(libqc, fill = sample.name, ylim = c(100, 1200), title = "Colored violin plot of sequencing library QC") + theme(legend.position = "none")
We can see that libraries 1 and 2 have an extra lump of short molecules, perhaps unwanted byproduct that was not removed by size selection, while libraries 3 and 4 have smoother distributions of longer molecules.
Perhaps we’d like to know the median molecule size of each library? We can use the summarize.custom
function (described later) to get the medians and plot those as a separate geom_point
.
data.frame(
median.frame <-sample.index = seq(nrow(libqc$samples)),
sample.name = libqc$samples$sample.name,
median = summarize.custom(libqc, 100, 1200)[,1]
)violin.electrophoresis(libqc, fill = sample.name, ylim = c(100, 1200), title = "Annotated violin plot of sequencing library QC") +
theme(legend.position = "none") +
geom_point(aes(sample.name, median, group = sample.index), median.frame, position = position_dodge(0.9)) # include group and position_dodge to match the geom_violin
Since we have estimates of concentration and molarity for every observed data point (within the interpolation range of the mobility standard curve), we can compute the total concentration or molarity of any range of molecule lengths, even if those ranges were not reported as peaks or defined as regions of interest in the Agilent software. The function that does this is integrate.custom
:
read.electrophoresis(system.file(
eukrna <-"extdata",
"tapestation",
"Eukaryotic RNA-Tubes-16-RNA.xml.gz",
package = "bioanalyzeR"
))200.5000 <- integrate.custom(
eukrna.concentration.
eukrna,lower.bound = 200,
upper.bound = 5000,
sum.variable = "concentration"
)200.5000 <- integrate.custom(
eukrna.molarity.
eukrna,lower.bound = 200,
upper.bound = 5000,
sum.variable = "molarity"
)data.frame(
name = as.character(eukrna$samples$sample.name),
200.5000,
eukrna.concentration.200.5000
eukrna.molarity. )
name | eukrna.concentration.200.5000 | eukrna.molarity.200.5000 |
---|---|---|
Ladder | 106.7460 | 375.7501 |
rat kidney 0min 94°C | 288.0934 | 474.2043 |
rat kidney 5min 94°C | 298.0230 | 607.7413 |
rat kidney 5min 94°C | 280.4774 | 613.0041 |
rat kidney 10min 94°C | 296.9142 | 805.0676 |
rat kidney 10min 94°C | 298.2865 | 835.4822 |
rat kidney 15min 94°C | 278.7438 | 1044.0679 |
Ladder | 93.9668 | 318.1203 |
rat kidney 0min 94°C | 325.6443 | 518.5113 |
rat kidney 0min 94°C | 334.8533 | 541.8510 |
rat kidney 5min 94°C | 325.6445 | 674.9550 |
rat kidney 5min 94°C | 333.1555 | 692.2904 |
rat kidney 10min 94°C | 317.8392 | 857.1748 |
rat kidney 10min 94°C | 305.6067 | 834.0799 |
rat kidney 15min 94°C | 287.7153 | 1059.4936 |
rat kidney 15min 94°C | 308.0625 | 1109.3588 |
A more useful thing to calculate, which is not provided by the Agilent software, is the ratio of two regions’ sums. For that we have the region.ratio
function:
region.ratio(
eukrna,c(100, 200),
c(200, 5000)
)#> molarity ratio in length 200-5000/100-200
#> [1,] 5.611418
#> [2,] 2.723488
#> [3,] 3.455445
#> [4,] 3.757644
#> [5,] 4.036406
#> [6,] 3.983118
#> [7,] 3.856595
#> [8,] 4.227789
#> [9,] 2.994024
#> [10,] 3.199538
#> [11,] 3.829074
#> [12,] 3.895848
#> [13,] 3.960055
#> [14,] 3.934033
#> [15,] 3.885983
#> [16,] 3.830161
region.ratio(
eukrna,c(100, 200),
c(200, 5000),
sum.variable = "molarity"
)#> molarity ratio in length 200-5000/100-200
#> [1,] 5.611418
#> [2,] 2.723488
#> [3,] 3.455445
#> [4,] 3.757644
#> [5,] 4.036406
#> [6,] 3.983118
#> [7,] 3.856595
#> [8,] 4.227789
#> [9,] 2.994024
#> [10,] 3.199538
#> [11,] 3.829074
#> [12,] 3.895848
#> [13,] 3.960055
#> [14,] 3.934033
#> [15,] 3.885983
#> [16,] 3.830161
The result is a matrix rather than an ordinary vector. That is because the function can compute the ratio of multiple regions; the result is the ratio of each additional region to the first region:
region.ratio(eukrna,
c(100, 200),
c(200, 5000),
c(200, 1000)
)#> molarity ratio in length 200-5000/100-200
#> [1,] 5.611418
#> [2,] 2.723488
#> [3,] 3.455445
#> [4,] 3.757644
#> [5,] 4.036406
#> [6,] 3.983118
#> [7,] 3.856595
#> [8,] 4.227789
#> [9,] 2.994024
#> [10,] 3.199538
#> [11,] 3.829074
#> [12,] 3.895848
#> [13,] 3.960055
#> [14,] 3.934033
#> [15,] 3.885983
#> [16,] 3.830161
#> molarity ratio in length 200-1000/100-200
#> [1,] 4.1730250
#> [2,] 0.8733159
#> [3,] 1.3805657
#> [4,] 1.6133147
#> [5,] 2.1744683
#> [6,] 2.2070307
#> [7,] 2.7484506
#> [8,] 3.3075942
#> [9,] 0.8845996
#> [10,] 0.9568843
#> [11,] 1.5415831
#> [12,] 1.5632393
#> [13,] 2.1323165
#> [14,] 2.1372645
#> [15,] 2.7422197
#> [16,] 2.6672410
One useful application of region ratios is calculating the DV200, the proportion of an RNA sample (by concentration) that is over 200 nt long. This is not completely trivial with region.ratio
because we don’t want to include the marker as part of the sample, so there is a special function to calculate DV200:
dv200(eukrna)
#> [1] 0.8908198 0.8959080 0.8969891 0.8847388 0.8847805 0.8831431 0.8655917
#> [8] 0.8814676 0.9015812 0.9005202 0.9000766 0.9008210 0.8868515 0.8850592
#> [15] 0.8702081 0.8725449
Another useful application of region ratios, for Illumina sequencing libraries, is calculating the molar ratio of library molecules with sufficiently long inserts to molecules containing only the sequencing adapters (adapter dimers). The illumina.library.ratio
function streamlines this by setting only three parameters: the lowest and highest lengths that are sequenceable, and the length threshold that distinguishes good library molecules.
illumina.library.ratio#> function (electrophoresis, min.sequenceable = 100, min.good.insert = 200,
#> max.sequenceable = 700)
#> as.vector(region.ratio(electrophoresis, c(min.sequenceable, min.good.insert),
#> c(min.good.insert, max.sequenceable), sum.variable = "molarity"))
#> <bytecode: 0x56005cd366e0>
#> <environment: namespace:bioanalyzeR>
read.electrophoresis(system.file(
haloplex <-"extdata",
"tapestation",
"HSD1000-HaloplexHS-4-HSD1000.xml.gz",
package = "bioanalyzeR"
))data.frame(
name = haloplex$samples$sample.name,
ratio = illumina.library.ratio(haloplex)
)
name | ratio |
---|---|
Ladder | 1.78768 |
Haloplex HS amplified library 1 1/50 dilution | 15.20856 |
Haloplex HS amplified library 2 1/30 dilution | 12.21276 |
Haloplex HS amplified library 3 1/10 dilution | 22.79188 |
Haloplex HS amplified library 4 1/40 dilution | 13.14959 |
qplot.electrophoresis(
haloplex,y = "molarity",
geom = "area",
region.alpha = NA,
xlim = c(50, 750),
title = "Molarity of Haloplex libraries"
+ geom_vline(xintercept = c(100, 200, 700), linetype = 2) )
Since we have molarity estimates at the fine resolution of the input data, we can do detailed analysis on the molar distribution of molecule lengths. This may tell us what to expect from downstream processing of the samples, e.g. sequencing. The summarize.peak
function computes several summary statistics of peaks reported by the Agilent software:
subset(d1000$peaks, sample.index == 1)
sample.index | peak.observations | peak.comment | length | distance | lower.distance | upper.distance | concentration | molarity | relative.distance | lower.relative.distance | upper.relative.distance | lower.length | upper.length |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Lower Marker | 25 | 0.845 | 0.811 | 0.879 | 7.51 | 462.00 | 1.0000000 | 0.9553219 | 1.0446781 | 16.61281 | 31.20483 | |
1 | 50 | 0.684 | 0.658 | 0.709 | 1.82 | 55.90 | 0.7884363 | 0.7542707 | 0.8212878 | 45.54899 | 55.76753 | ||
1 | 100 | 0.549 | 0.524 | 0.570 | 1.83 | 28.10 | 0.6110381 | 0.5781866 | 0.6386334 | 88.30620 | 116.43826 | ||
1 | 200 | 0.440 | 0.416 | 0.459 | 1.98 | 15.30 | 0.4678055 | 0.4362681 | 0.4927727 | 176.46977 | 233.97405 | ||
1 | 300 | 0.374 | 0.351 | 0.391 | 2.00 | 10.30 | 0.3810775 | 0.3508541 | 0.4034166 | 272.75981 | 337.19228 | ||
1 | 400 | 0.315 | 0.294 | 0.333 | 1.88 | 7.22 | 0.3035480 | 0.2759527 | 0.3272011 | 367.51563 | 441.56256 | ||
1 | 500 | 0.267 | 0.245 | 0.285 | 2.07 | 6.37 | 0.2404731 | 0.2115637 | 0.2641261 | 460.48862 | 551.06073 | ||
1 | 700 | 0.190 | 0.170 | 0.207 | 1.91 | 4.19 | 0.1392904 | 0.1130092 | 0.1616294 | 649.82514 | 767.47937 | ||
1 | 1000 | 0.128 | 0.109 | 0.143 | 2.03 | 3.12 | 0.0578187 | 0.0328515 | 0.0775296 | 896.60895 | 1175.76261 | ||
1 | Upper Marker | 1500 | 0.084 | 0.063 | 0.100 | 6.50 | 6.67 | 0.0000000 | -0.0275953 | 0.0210250 | 1279.31637 | 1869.59284 |
summarize.peak(d1000, 2:9)
peak.summaries <- peak.summaries
Median | Mean | SD | Skewness | Kurtosis |
---|---|---|---|---|
50 | 50.05852 | 1.857899 | 0.4197168 | 3.096576 |
100 | 100.51003 | 4.972354 | 0.5339345 | 3.233967 |
201 | 201.68763 | 10.189002 | 0.5917793 | 3.270801 |
300 | 301.90816 | 11.367185 | 0.4020895 | 3.112369 |
401 | 401.31339 | 13.444372 | 0.3383686 | 3.016750 |
503 | 502.53252 | 16.293470 | 0.2850813 | 3.255636 |
700 | 701.99884 | 21.101349 | 0.4476350 | 3.372945 |
1010 | 1016.34472 | 57.303458 | 0.6853935 | 3.213018 |
qplot.electrophoresis(subset(d1000, well.number == "A1"), include.ladder = T) + geom_vline(xintercept = peak.summaries$Median)
We can see how the ladder peaks’ median molecule lengths stay close to the expected values while the spread increases from short to long molecules. (Note that the graph above shows concentration rather than molarity because the molarities of the ladder peaks are too different to graph together.)
Likewise summarize.region
computes those statistics for regions reported by the Agilent software:
subset(d1000, startsWith(as.character(sample.name), "270 bp smear"))
smear.dilutions <-qplot.electrophoresis(smear.dilutions, facets = NULL, normalize = "total", xlim = c(50, 1000))
$regions smear.dilutions
sample.index | region.comment | lower.length | upper.length | average.length | concentration | molarity | proportion.of.total | lower.relative.distance | upper.relative.distance | lower.distance | upper.distance |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 50 | 1000 | 300 | 88 | 565.0 | 0.9930 | 0.0578187 | 0.7884363 | 0.1593469 | 0.7197306 | |
2 | 50 | 1000 | 299 | 87 | 563.0 | 0.9934 | 0.0578187 | 0.7884363 | 0.1586938 | 0.7234612 | |
3 | 50 | 1000 | 296 | 74 | 482.0 | 0.9915 | 0.0578187 | 0.7884363 | 0.1573469 | 0.7177306 | |
4 | 50 | 1000 | 296 | 75 | 492.0 | 0.9910 | 0.0578187 | 0.7884363 | 0.1561156 | 0.7135769 | |
5 | 50 | 1000 | 294 | 39 | 258.0 | 0.9848 | 0.0578187 | 0.7884363 | 0.1501156 | 0.7075769 | |
6 | 50 | 1000 | 293 | 39 | 260.0 | 0.9852 | 0.0578187 | 0.7884363 | 0.1479422 | 0.7032116 | |
7 | 50 | 1000 | 281 | 5 | 36.8 | 0.8883 | 0.0578187 | 0.7884363 | 0.1278844 | 0.6824231 | |
8 | 50 | 1000 | 274 | 5 | 37.6 | 0.8943 | 0.0578187 | 0.7884363 | 0.1219014 | 0.6640197 |
summarize.region(smear.dilutions)
Median | Mean | SD | Skewness | Kurtosis |
---|---|---|---|---|
212 | 234.1521 | 120.2496 | 1.036231 | 4.250292 |
213 | 234.0738 | 119.8657 | 1.039103 | 4.269314 |
209 | 231.4844 | 119.1347 | 1.038140 | 4.248700 |
208 | 229.9826 | 119.3031 | 1.048020 | 4.283149 |
205 | 228.7098 | 118.3249 | 1.044291 | 4.227130 |
205 | 228.4880 | 118.4257 | 1.053260 | 4.263815 |
189 | 214.1534 | 116.1788 | 1.131430 | 4.531426 |
187 | 210.0892 | 112.3100 | 1.121908 | 4.591676 |
Here we see that when the same sample is diluted several times, the summary statistics of its length distribution change only slightly. The mean fragment lengths reported by bioanalyzeR are different from the average.length
reported by the Agilent software because bioanalyzeR computes the mean by molarity rather than by concentration.
Alternatively, the summarize.custom
function allows us to set a custom region and compare it across all samples:
subset(d1000, startsWith(as.character(sample.name), "300bp fragment"))
fragment.dilutions <-qplot.electrophoresis(fragment.dilutions, facets = NULL, normalize = "window", xlim = c(100, 500))
$samples$sample.name
fragment.dilutions#> [1] 300bp fragment 0.1ng/µL 300bp fragment 0.1ng/µL 300bp fragment 5ng/µL
#> [4] 300bp fragment 25ng/µL 300bp fragment 40ng/µL 300bp fragment 50ng/µL
#> 11 Levels: Ladder 300bp fragment 0.1ng/µL ... 270 bp smear 5ng/µL
integrate.custom(fragment.dilutions, 100, 500, sum.variable = "concentration")
#> [1] 0.1422277 0.1484535 6.1904260 29.0551663 52.4261060 58.3918271
summarize.custom(fragment.dilutions, 100, 500)
Median in 100-500 | Mean in 100-500 | SD in 100-500 | Skewness in 100-500 | Kurtosis in 100-500 |
---|---|---|---|---|
312 | 299.8014 | 49.29212 | -2.583465 | 11.548610 |
310 | 295.4044 | 53.77793 | -2.418188 | 9.089111 |
303 | 308.3248 | 25.94625 | 1.979335 | 14.329785 |
299 | 303.5752 | 25.95738 | 1.952895 | 14.989782 |
296 | 300.6872 | 27.16830 | 1.924749 | 15.592380 |
293 | 297.2472 | 27.52683 | 1.877765 | 15.832759 |
This package faithfully reports the fluorescence signal and peak/region concentrations and molarity given by the Agilent software, but it must recalculate the model of migration speed vs. molecule length and the estimates of concentration and molarity by position. Some streamlined functions are provided to compare this package’s results with those of the Agilent software.
stdcrv.mobility
This function displays the standard curve of molecule length vs. the observed measurement that corresponds to migration speed (marker-aligned read time for Bioanalyzer and ProSize instruments, marker-relative migration distance for TapeStation), using the data from the ladder(s). In addition to the reported main position of each ladder peak, the fluorescence signal within the peak boundaries is also shown.
read.electrophoresis(system.file(
dna1000 <-"extdata",
"bioanalyzer",
"Demo DNA 1000 Series II.xml.gz",
package = "bioanalyzeR"
))stdcrv.mobility(dna1000) + ggtitle("Bioanalyzer DNA 1000 standard curve")
You can see how the spline function bends to accommodate the ladder peaks, which don’t fit perfectly with log-linear regression:
read.electrophoresis(system.file(
dna1000.regression <-"extdata",
"bioanalyzer",
"Demo DNA 1000 Series II.xml.gz",
package = "bioanalyzeR"
method = "loglinear")
), stdcrv.mobility(dna1000.regression) + ggtitle("Bioanalyzer DNA 1000 standard curve from log-linear regression")
qc.electrophoresis
This function compares the calculations from this package with the reported values from the Agilent software, using the reported peaks because they have all the relevant estimates. The calculations you can compare are "length"
, "concentration"
, and "molarity"
.
qc.electrophoresis(dna1000, "length") + ggtitle("Bioanalyzer DNA 1000 length estimation")
qc.electrophoresis(dna1000, "concentration") + ggtitle("Bioanalyzer DNA 1000 concentration estimation")
qc.electrophoresis(dna1000, "molarity") + ggtitle("Bioanalyzer DNA 1000 molarity estimation")
Because some of the demo batches include de novo runs of the ladder, we have example data with peaks of known true length, concentration, and molarity (the correct values are hardcoded in the peak data from the proper ladder sample). We can use these to check the accuracy of the estimates from the Agilent software and from this package. To avoid repetition we’ll wrap a lot of operations into a function, then apply it to several datasets.
function(title, sample.names, ...) {
compare.estimates <- read.electrophoresis(system.file("extdata", ..., package = "bioanalyzeR"))
data <- which(data$peaks$sample.index == which(data$samples$well.number == data$samples$ladder.well)[1] & ! data$peaks$peak.observations %in% c("Lower Marker", "Upper Marker"))
which.ladder.peaks <- which(data$samples$sample.name %in% sample.names & data$samples$well.number != data$samples$ladder.well)
which.reruns <- which.reruns[sapply(which.reruns, function(sample.index) sum(
which.reruns <-$peaks$sample.index == sample.index &
data ! data$peaks$peak.observations %in% c("Lower Marker", "Upper Marker")
== length(which.ladder.peaks))] # exclude reruns with wrong number of peaks
) which.reruns[seq(min(length(which.reruns), 5))] # don't use more than 5 replicates
which.reruns <- data.frame(
ladder.comparison <-replicate = factor(rep(paste("replicate", seq_along(which.reruns)), each = length(which.ladder.peaks))),
source = factor(rep(c("Agilent software", "bioanalyzeR"), each = length(which.reruns) * length(which.ladder.peaks))),
length = data$peaks$length[which.ladder.peaks],
what = factor(rep(c("length", "concentration", "molarity"), each = 2 * length(which.reruns) * length(which.ladder.peaks)), levels = c("length", "concentration", "molarity"))
)$true.value <- c(
ladder.comparisonrep(data$peaks$length[which.ladder.peaks], 2 * length(which.reruns)),
rep(data$peaks$concentration[which.ladder.peaks], 2 * length(which.reruns)),
rep(data$peaks$molarity[which.ladder.peaks], 2 * length(which.reruns))
) which(
which.rerun.peaks <-$peaks$sample.index %in% which.reruns &
data ! data$peaks$peak.observations %in% c("Lower Marker", "Upper Marker") # excluded markers as their results are hardcoded
) get.x.name(data)
x.name <-$estimate <- c(
ladder.comparison$peaks$length[which.rerun.peaks],
data$calibration[[1]][[1]]$mobility.function(data$peaks[[x.name]][which.rerun.peaks]),
data$peaks$concentration[which.rerun.peaks],
dataintegrate.peak(data, which.rerun.peaks, "concentration"),
$peaks$molarity[which.rerun.peaks],
dataintegrate.peak(data, which.rerun.peaks, "molarity")
)$error <- (ladder.comparison$estimate - ladder.comparison$true.value) / ladder.comparison$estimate
ladder.comparisonggplot(ladder.comparison, aes(x = length, y = error, color = replicate)) +
geom_hline(yintercept = 0, color = "darkgray") +
geom_point() +
facet_grid(what ~ source) +
scale_x_log10() +
scale_y_continuous(label = label_percent(accuracy = 1)) +
xlab("length (bp)") +
ggtitle(paste(title, "accuracy comparison"))
}compare.estimates("Bioanalyzer DNA 1000", "DNA 1000 ladder", "bioanalyzer", "Demo DNA 1000 Series II.xml.gz")
compare.estimates("Bioanalyzer DNA 7500", "ladder DNA 7500", "bioanalyzer", "Demo DNA 7500 Series II.xml.gz")
compare.estimates("TapeStation DNA 1000", "ladder", "tapestation", "D1000-Tubes-16-D1000.xml.gz")
compare.estimates("TapeStation DNA 5000", "D5000 Ladder", "tapestation", "D5000-Plate-96-D5000.xml.gz")
compare.estimates("TapeStation DNA 5000", "Ladder", "tapestation", "D5000-Tubes-16-D5000.xml.gz")
compare.estimates("TapeStation Genomic DNA", "Ladder", "tapestation", "gDNA-Tubes-16-gDNA.xml.gz")
#> Warning: Removed 1 rows containing missing values (geom_point).
compare.estimates("TapeStation High Sensitivity D1000", "Ladder", "tapestation", "High Sensitivity D1000-Tubes-16-HSD1000.xml.gz")
compare.estimates("TapeStation High Sensitivity D5000", "Ladder", "tapestation", "High Sensitivity D5000-Tubes-16-HSD5000.xml.gz")
compare.estimates("Fragment Analyzer NGS", "Ladder", "fragmentanalyzer", "DNF-473 NGS Fragment kit (1-6000)", "NGS Libraries SS NGS kit 19-10-40.zip")
compare.estimates("Fragment Analyzer HS Large Fragment", "DNF-365 Ladder", "fragmentanalyzer", "DNF-464 HS Large Fragment", "Long read library final QC 18-12-47.zip")
compare.estimates("Fragment Analyzer Genomic DNA kb", "Ladder", "fragmentanalyzer", "DNF-467 Genomic 50 kb", "21-54-28.zip")
#> Warning: Removed 2 rows containing missing values (geom_point).
compare.estimates("Fragment Analyzer Small RNA", "Ladder", "fragmentanalyzer", "DNF-470 Small RNA", "17-44-37.zip")
compare.estimates("ZAG DNA Analyzer 1-500", "Ladder", "zag", "ZAG-105 dsDNA (1-500)", "10-23-09.zip")
#> H2 H4 H10 H12
#> Warning in read.prosize(unz(prosize.zip, filenames$electropherogram),
#> unz(prosize.zip, : multiple ladders found in wells ; using only H12
compare.estimates("ZAG DNA Analyzer 1-1500", "Ladder", "zag", "ZAG-135 dsDNA (1-1500)", "09-45-24.zip")
compare.estimates("ZAG DNA Analyzer 35-5000", "Ladder", "zag", "ZAG-110 dsDNA (35-5000)", "10-25-24.zip")
compare.estimates("ZAG DNA Analyzer 75-20000", "1kb Plus DNA Ladder", "zag", "ZAG-130 dsDNA (75-20000)", "ZAG-130 GP 20-53-59.zip")
compare.estimates("Femto Pulse NGS", "Ladder", "femtopulse", "FP-1101 US NGS", "300 bp dilution series 21-30-31.zip")
compare.estimates("Femto Pulse RNA", "Ladder", "femtopulse", "FP-1201 US RNA", "FP-1201 Mouse mRNA Dilution Series.zip")
So we see that the length estimates are fairly accurate, as expected because the Agilent software’s linear interpolation and this software’s spline functions are by definition most accurate near the ladder peaks, but there is a wide range of error in the concentration and molarity estimates from both approaches.
Sometimes you may just want to do a quick analysis or graph your data, so it’s not worth the trouble of opening up an interactive R session or scripting repetitive tasks. To automate simple tasks, the command-line script bioanalyze.R
is provided. You can find it under the bin/
subdirectory after installation of the package:
system.file("bin", "bioanalyze.R", package = "bioanalyzeR")
#> [1] "/tmp/Rtmp3ed50Y/temp_libpath402f76dfb8b2/bioanalyzeR/bin/bioanalyze.R"
You may wish to copy it to somewhere in your system path for easy access.
When you run this script on an XML file, it will print a tab-delimited sample table to stdout (or you can put it in a file with the -o
argument). If you provide a filename with the -p
argument, it will save the results of qplot.electrophoresis
as a PDF image.
$ bioanalyze.R -p D1000.pdf D1000-Tubes-16.xml.gz | column -ents $'\t'
writing D1000.pdf
done
batch well.number sample.name sample.observations reagent.id ladder.well
D1000-Tubes-16 A1 D1000 Ladder Ladder 01-S029-160125-05-000010 A1
D1000-Tubes-16 B1 Smear sample (1:2) 01-S029-160125-05-000010 A1
D1000-Tubes-16 C1 Smear sample (1:4) 01-S029-160125-05-000010 A1
D1000-Tubes-16 D1 Smear sample (1:8) 01-S029-160125-05-000010 A1
D1000-Tubes-16 E1 Smear sample (1:32) 01-S029-160125-05-000010 A1
D1000-Tubes-16 F1 Ladder 1 01-S029-160125-05-000010 A1
D1000-Tubes-16 G1 Fragment mix 01-S029-160125-05-000010 A1
D1000-Tubes-16 H1 Ladder 2 01-S029-160125-05-000010 A1
D1000-Tubes-16 A2 300 bp fragment 01-S029-160125-05-000010 A1
D1000-Tubes-16 B2 300 bp fragment (1:2) 01-S029-160125-05-000010 A1
D1000-Tubes-16 C2 300 bp fragment (1:10) 01-S029-160125-05-000010 A1
D1000-Tubes-16 D2 300 bp fragment (1:70) 01-S029-160125-05-000010 A1
D1000-Tubes-16 E2 Ladder 3 01-S029-160125-05-000010 A1
D1000-Tubes-16 F2 150 bp fragment (1:10) 01-S029-160125-05-000010 A1
D1000-Tubes-16 G2 150 bp fragment (1:30) 01-S029-160125-05-000010 A1
D1000-Tubes-16 H2 150 bp fragment (1:60) 01-S029-160125-05-000010 A1
Of course you may want to customize the figure, so command-line arguments are provided for all the arguments to qplot.electrophoresis
. Additionally, you can perform integration analysis from the command line; the results are included as additional columns in the tab-delimited sample table:
$ bioanalyze.R --integrate_region 200-500 --illumina HSD1000-HaloplexHS-4.xml.gz | column -ents $'\t'
batch well.number sample.name sample.observations reagent.id ladder.well concentration in 200-500 Illumina library ratio
HSD1000-HaloplexHS-4 A1 Ladder Ladder 01-S030-170109-01-000059 A1 903.315682531863 1.61037620118153
HSD1000-HaloplexHS-4 B1 Haloplex HS amplified library 1 1/50 dilution 01-S030-170109-01-000059 A1 246.168295764346 8.7614775552482
HSD1000-HaloplexHS-4 C1 Haloplex HS amplified library 2 1/30 dilution 01-S030-170109-01-000059 A1 1079.82552722961 9.41677787478742
HSD1000-HaloplexHS-4 D1 Haloplex HS amplified library 3 1/10 dilution 01-S030-170109-01-000059 A1 1891.94257995375 14.5793009002519
HSD1000-HaloplexHS-4 E1 Haloplex HS amplified library 4 1/40 dilution 01-S030-170109-01-000059 A1 924.678493957236 10.5888841668172
You can see instructions for all the other command-line arguments with bioanalyze.R --help
.
Some of these features would not be difficult to implement but haven’t been because I don’t need them. Please contact me if they would be useful to you and you are willing to help test them.
.xad
files directly rather than require exporting them to XML. The original files are also XML but store their main data in a compressed base64 block and I have not attempted to decode it.rbind.electrophoresis
because of mismatched metadata.qplot.electrophoresis
function should preferably be an S3 method treating ggplot2::qplot
as the generic, so qplot
will simply dispatch to it. But the inheritance is complicated because qplot
is in another namespace and its first argument is a variable name for an aesthetic. It might be possible to fix this with a closer look at how method dispatching works, or maybe qplot.electrophoresis
should simply be renamed because it isn’t completely analogous to qplot
.