3 Data processing

In our WHICEAS case study example, we are interested in estimating density/abundance for 2017 and 2020 only, but we want to use surveys from previous years to help model species detection functions. We will therefore be using a dataset of NOAA Fisheries surveys in the Central North Pacific from 1986 to 2020.

# Local path to DAS file
das_file <- 'data/surveys/CenPac1986-2020_Final_alb.das'

To follow along, this data file can be downloaded here.

You can process your survey data using a single function, process_surveys(), which takes two primary arguments: the filepath(s) to your DAS survey data, and your settings object. For example:

cruz <- process_surveys(das_file, 
                        settings)

That single command will convert your raw DAS data to a “cruz” object: a list of polished datasets ready to be passed to subsequent analyses.

In our case we will use a third argument to apply edits to the DAS data before processing (see previous page for details on those edits):

edits <- readRDS('cnp_1986_2020_edits.RData')
cruz <- process_surveys(das_file, 
                        settings,
                        edits)

Behind the scenes

The process_surveys() function is a wrapper for several discrete stages of data formatting/processing. Behind the scenes, each of those stages is carried out using a specific LTabundR function. The remainder of this page is a detailed step-by-step explanation of the data processing that occurs when you call process_surveys().

Edit cruise data

If the edits input argument is supplied to process_surveys(), temporary copies of the DAS file(s) are made and edited before processing. This step is discussed on the previous page and implemented with the function das_editor_tofile():

das_file <-
    das_editor_tofile(das_file,
                      edits,
                      suffix = '_edited_example',
                      verbose = TRUE)

This function produces new DAS files with the edits applied, and returns an updated set of DAS filenames you can pass to the next stage of data processing.

das_file
[1] "data/surveys/CenPac1986-2020_Final_alb_edited_example.das"

Bring in cruise data

Read in and process your DAS file using functions from Sam Woodman’s swfscDAS package. We built a wrapper function, das_load(), that makes this step quick and easy:

das <- das_load(das_file, 
                perform_checks = TRUE,
                print_glimpse = TRUE)

Interpolate cruise data

Run the following function if you wish to interpolate the DAS data to a more frequent time interval, which can be useful for stratum assignment and effort calculations in some circumstances. This is only done in process_surveys() if the settings object passed to it has the interpolate input for load_survey_settings() set to a number (indicating the desired interval in seconds). In this example the interval is set to 120 seconds. Interpolation will only occur for On-Effort rows.

das <- das_interpolate(das, new_interval = 120)
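
If you would rather have process_surveys() trigger this step automatically, set the interpolate argument when building your survey settings. A minimal sketch (other load_survey_settings() arguments are omitted here and assumed to keep their defaults):

# Enable 120-second interpolation via the survey settings (sketch)
survey <- load_survey_settings(interpolate = 120)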

Process strata

Run the following function to add strata and study-area information to each row of DAS data:

das_strata <- process_strata(das, settings)

This function loops through each stratum data.frame you have provided in settings$strata, formats the stratum, and asks whether each DAS row occurs within it. For each stratum, a column named stratum_<StratumName> is added to the das object; each row in this column is TRUE (the position falls within the stratum) or FALSE (it does not).
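
As a quick sanity check, you can tally the rows falling within a given geostratum. For example, using the stratum_WHICEAS column produced for this dataset (column name per the stratum_<StratumName> convention):

# Count DAS rows located within the WHICEAS geostratum
sum(das_strata$stratum_WHICEAS, na.rm = TRUE)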

Format DAS data into a cruz object

The function das_format() takes care of some final formatting and initiates the cruz object data structure.

cruz <- das_format(das_strata, verbose=TRUE)

This function (1) removes rows with invalid Cruise numbers, times, or locations; (2) calculates the distance, in km, between consecutive rows of data; (3) adds a ship column to the dataset, with initials for the ship corresponding to each cruise; (4) creates a new list, cohorts, which copies the cruise data for each cohort specified in your settings; and (5) adds a stratum column to the data in each cohort. That column specifies a single stratum assignment for each row of DAS data in the event of overlapping strata, based upon the cohort setting stratum_overlap_handling.

The cruz object

The function das_format() returns a list, which we have saved in an object named cruz, with several slots:

cruz %>% names
[1] "settings" "strata"   "cohorts" 

The strata slot provides the area, in square km, of each polygon being used:

cruz$strata
           stratum                area
1           HI_EEZ  2474595.762 [km^2]
2         OtherCNP 33817779.065 [km^2]
3              MHI   212033.063 [km^2]
4          WHICEAS   402948.734 [km^2]
5       Spotted_OU     5102.666 [km^2]
6       Spotted_FI    10509.869 [km^2]
7       Spotted_BI    39454.720 [km^2]
8  Bottlenose_KaNi     2755.024 [km^2]
9  Bottlenose_OUFI    14417.027 [km^2]
10   Bottlenose_BI     4668.072 [km^2]
11            NWHI   449375.569 [km^2]

The slot cohorts is itself a list with one slot for each cohort. The slots are named using each cohort's id setting.

cruz$cohorts %>% names
[1] "all"        "bottlenose" "spotted"   

Each cohort slot has a copy of the DAS data with a new stratum column, which contains a stratum assignment tailored to its cohort-specific settings. For instance, the all cohort, whose stratum_overlap_handling is set to "smallest", assigns the smallest stratum in the event of overlapping or nested strata:

cruz$cohorts$all$stratum %>% table(useNA='ifany')
.
  HI_EEZ OtherCNP  WHICEAS 
  117715   126252    85671 

Since the bottlenose cohort uses a different subset of geostrata, its distribution of stratum assignments will also differ:

cruz$cohorts$bottlenose$stratum %>% table(useNA='ifany')
.
  Bottlenose_BI Bottlenose_KaNi Bottlenose_OUFI          HI_EEZ        OtherCNP 
           3415            1495            6862          117715          126252 
        WHICEAS 
          73899 

This list, with these three primary slots, will be referred to from here on as a cruz object.

Segmentize the data

To allocate survey data into discrete ‘effort segments’, which are used in variance estimation in subsequent steps, run the function segmentize(). This process is controlled by both survey-wide and cohort-specific settings, which are now carried in a slot within the cruz object. The process is outlined in detail in the Appendix on Segmentizing.

cruz <- segmentize(cruz, verbose=FALSE)

This function does not change the high-level structure of the cruz object …

cruz %>% names
[1] "settings" "strata"   "cohorts" 

… or the cohort names in the cohorts slot:

cruz$cohorts %>% names
[1] "all"        "bottlenose" "spotted"   

Within each slot of cohorts, the list structure is the same:

cruz$cohorts$all %>% names
[1] "segments" "das"     

cruz$cohorts$bottlenose %>% names
[1] "segments" "das"     

cruz$cohorts$spotted %>% names
[1] "segments" "das"     

The segments slot contains summary data for each effort segment, including start/mid/end coordinates, average conditions, and segment distance:

cruz$cohorts$all$segments %>% glimpse
Rows: 1,455
Columns: 39
$ seg_id       <dbl> 2051, 2062, 2055, 2063, 2064, 2056, 2065, 2066, 2067, 205…
$ Cruise       <dbl> 901, 901, 901, 901, 901, 901, 901, 901, 901, 901, 901, 90…
$ ship         <chr> "OES", "OES", "OES", "OES", "OES", "OES", "OES", "OES", "…
$ stratum      <chr> "WHICEAS", "WHICEAS", "WHICEAS", "WHICEAS", "WHICEAS", "W…
$ use          <lgl> FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR…
$ Mode         <chr> "C-P", "C", "C", "C-P", "P-C", "C", "C", "C", "C", "C", "…
$ EffType      <chr> "S", "S", "N", "S", "S", "N", "S", "S", "S", "N", "N", "N…
$ OnEffort     <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU…
$ ESWsides     <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ dist         <dbl> 0.1224221, 145.5464171, 38.3219637, 148.8739186, 151.6705
$ minutes      <dbl> 0.400, 515.450, 136.800, 540.017, 558.767, 40.900, 507.20
$ n_rows       <int> 82, 188, 75, 178, 169, 31, 197, 146, 189, 51, 54, 219, 16…
$ min_line     <int> 424496, 424498, 424734, 424892, 425505, 425680, 425789, 4…
$ max_line     <int> 427047, 424891, 424831, 425504, 425788, 425735, 426303, 4…
$ year         <dbl> 2009, 2009, 2009, 2009, 2009, 2009, 2009, 2009, 2009, 200…
$ month        <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ day          <int> 6, 6, 6, 7, 9, 10, 10, 11, 12, 12, 14, 14, 16, 16, 17, 18…
$ lat1         <dbl> 21.998167, 21.998167, 21.669333, 20.541000, 19.628833, 21…
$ lon1         <dbl> -159.1487, -159.1487, -158.0867, -157.5555, -156.2075, -1…
$ DateTime1    <dttm> 2009-02-06 07:33:52, 2009-02-06 07:33:52, 2009-02-06 15:
$ timestamp1   <dbl> 1233905632, 1233905632, 1233932684, 1233999718, 123417162…
$ yday1        <dbl> 37, 37, 37, 38, 40, 41, 41, 42, 43, 43, 45, 45, 47, 47, 4…
$ lat2         <dbl> 19.193833, 20.541000, 21.591500, 19.621833, 21.143500, 21…
$ lon2         <dbl> -156.0575, -157.5555, -157.8283, -156.1827, -158.0830, -1…
$ DateTime2    <dttm> 2009-02-14 08:21:51, 2009-02-07 09:41:58, 2009-02-06 17:
$ timestamp2   <dbl> 1234599711, 1233999718, 1233942729, 1234171624, 123426333…
$ yday2        <dbl> 45, 38, 37, 40, 41, 41, 42, 43, 45, 43, 45, 47, 47, 49, 4…
$ mlat         <dbl> 21.426500, 21.788500, 21.737833, 19.607500, 19.771167, 21…
$ mlon         <dbl> -158.9077, -158.5185, -157.9085, -156.1373, -156.8483, -1…
$ mDateTime    <dttm> 2009-02-10 17:18:39, 2009-02-06 12:30:29, 2009-02-06 16:
$ mtimestamp   <dbl> 1234286319, 1233923429, 1233938494, 1234086957, 123418822…
$ avgBft       <dbl> NaN, 3.620949, 4.431633, 3.496502, 3.708160, 2.000000, 1.
$ avgSwellHght <dbl> NaN, 3.740710, 4.000000, 2.663336, 3.062949, 2.000000, 2.
$ avgHorizSun  <dbl> NaN, 4.758324, 4.073185, 4.842962, 8.780615, 9.000000, 5.
$ avgVertSun   <dbl> NaN, 2.062058, 1.349072, 1.577954, 1.426446, 2.688592, 1.
$ avgGlare     <dbl> NaN, 0.54538502, 0.00000000, 0.11825495, 0.15454666, 0.00
$ avgVis       <dbl> NaN, 6.750312, 6.176895, 6.483022, 5.771184, 6.500000, 6.
$ avgCourse    <dbl> NaN, 106.2314, 113.4356, 208.8357, 288.5192, 205.4784, 21…
$ avgSpdKt     <dbl> NaN, 9.216194, 8.912595, 9.110399, 8.827567, 9.181559, 9.

# Number of segments
cruz$cohorts$all$segments %>% nrow
[1] 1455

# Segment length distribution
x <- cruz$cohorts$all$segments$dist
hist(x,
     breaks = seq(0,ceiling(max(x, na.rm=TRUE)),by=1),
     xlab='Segment lengths (km)',
     main=paste0('Target km: ',settings$survey$segment_target_km))
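
A quick numeric summary can complement that histogram; for example, restricting to segments flagged for use in the analysis:

# Summarize lengths (km) of segments flagged for use
segs <- cruz$cohorts$all$segments
summary(segs$dist[segs$use])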

And the das slot holds the original data.frame of DAS data with slight modifications: the column OnEffort has been updated according to Beaufort range conditions, and a new column, seg_id, indicates which segment each event occurs within.

cruz$cohorts$all$das %>% names
 [1] "Event"            "DateTime"         "Lat"              "Lon"             
 [5] "OnEffort"         "Cruise"           "Mode"             "OffsetGMT"       
 [9] "EffType"          "ESWsides"         "Course"           "SpdKt"           
[13] "Bft"              "SwellHght"        "WindSpdKt"        "RainFog"         
[17] "HorizSun"         "VertSun"          "Glare"            "Vis"             
[21] "ObsL"             "Rec"              "ObsR"             "ObsInd"          
[25] "Data1"            "Data2"            "Data3"            "Data4"           
[29] "Data5"            "Data6"            "Data7"            "Data8"           
[33] "Data9"            "Data10"           "Data11"           "Data12"          
[37] "EffortDot"        "EventNum"         "file_das"         "line_num"        
[41] "stratum_HI_EEZ"   "stratum_OtherCNP" "stratum_WHICEAS"  "year"            
[45] "month"            "day"              "yday"             "km_valid"        
[49] "km_int"           "km_cum"           "ship"             "stratum"         
[53] "use"              "eff_bloc"         "seg_id"          

The segmentize() function and its associated settings were designed to give researchers full control over how data are segmented, whether for design-based density analyses (which tend to use long segments of 100 km or more and allow non-contiguous effort to be included in the same segment) or for habitat modeling (which tends to use short segments of 5 - 10 km and disallows pooling non-contiguous effort into the same segment). To demonstrate that versatility, check out the appendix on segmentizing.
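
As a rough sketch of that contrast (segment_target_km is the only setting name here taken from this tutorial, via the histogram code above; the target values are purely illustrative -- see the appendix for working examples):

# Illustrative only: two survey-setting configurations
settings_design <- settings
settings_design$survey$segment_target_km <- 150   # design-based: long segments
settings_habitat <- settings
settings_habitat$survey$segment_target_km <- 10   # habitat modeling: short segments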

Process sightings

To process sightings for each cohort of species, use the function process_sightings(). For each cohort, this function takes three basic steps: (1) it prepares a sightings table using the function das_sight() from swfscDAS; (2) it filters those sightings to the species codes specified for the cohort in your settings input; and (3) it evaluates each of those sightings, asking whether each should be included in the analysis according to your settings.

cruz <- process_sightings(cruz)

The function produces a formatted dataset and adds it to a new sightings slot.

cruz$cohorts$all %>% names
[1] "segments"  "das"       "sightings"

cruz$cohorts$bottlenose %>% names
[1] "segments"  "das"       "sightings"

cruz$cohorts$spotted %>% names
[1] "segments"  "das"       "sightings"

Note that the sightings table has a column named included (TRUE = yes, use it in the analysis). Any sightings that do not meet the inclusion criteria specified in your settings will have included = FALSE, but they won't be removed from the data.

The sightings table also has a new column, ss_valid, indicating whether or not the group size estimate for this sighting is valid and appropriate for use in abundance estimation and detection function fitting when group size is used as a covariate.
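
Together, the included and ss_valid flags make it easy to subset the sightings you would actually analyze. A minimal example:

# Sightings that meet inclusion criteria and have valid group sizes
sits <- cruz$cohorts$all$sightings
nrow(sits[sits$included & sits$ss_valid, ])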

Since the sightings in each cohort are processed slightly differently according to the cohort’s specific settings – most importantly the species that will be included – you should expect different numbers of included/excluded sightings in each cohort dataset:

cruz$cohorts$all$sightings$included %>% table
.
FALSE  TRUE 
  806  3128 

cruz$cohorts$bottlenose$sightings$included %>% table
.
FALSE  TRUE 
  113   410 

When this function’s verbose argument is TRUE (the default), a message is printed each time a sighting does not meet the inclusion criteria.

Sightings data structure

The sightings table has many other variables:

cruz$cohorts$all$sightings %>% names
 [1] "Event"            "DateTime"         "Lat"              "Lon"             
 [5] "OnEffort"         "Cruise"           "Mode"             "OffsetGMT"       
 [9] "EffType"          "ESWsides"         "Course"           "SpdKt"           
[13] "Bft"              "SwellHght"        "WindSpdKt"        "RainFog"         
[17] "HorizSun"         "VertSun"          "Glare"            "Vis"             
[21] "ObsL"             "Rec"              "ObsR"             "ObsInd"          
[25] "EffortDot"        "EventNum"         "file_das"         "line_num"        
[29] "stratum_HI_EEZ"   "stratum_OtherCNP" "stratum_WHICEAS"  "year"            
[33] "month"            "day"              "yday"             "km_valid"        
[37] "km_int"           "km_cum"           "ship"             "stratum"         
[41] "use"              "eff_bloc"         "seg_id"           "SightNo"         
[45] "Subgroup"         "SightNoDaily"     "Obs"              "ObsStd"          
[49] "Bearing"          "Reticle"          "DistNm"           "Cue"             
[53] "Method"           "Photos"           "Birds"            "CalibSchool"     
[57] "PhotosAerial"     "Biopsy"           "CourseSchool"     "TurtleSp"        
[61] "TurtleGs"         "TurtleJFR"        "TurtleAge"        "TurtleCapt"      
[65] "PinnipedSp"       "PinnipedGs"       "BoatType"         "BoatGs"          
[69] "PerpDistKm"       "species"          "best"             "low"             
[73] "high"             "prob"             "mixed"            "ss_tot"          
[77] "lnsstot"          "ss_percent"       "n_sp"             "n_obs"           
[81] "n_best"           "n_low"            "n_high"           "calibr"          
[85] "ss_valid"         "mixed_max"        "spp_max"          "included"        

The columns from SightNo onward (column 44) contain sighting-specific information. Columns of note:

  • species contains the species code. There is only one species code per row (i.e., multi-species sightings have been expanded to multiple rows).

  • best, low, and high contain the refined group size estimates, averaged across observers and calibrated according to the cohort's settings specifications. For multi-species sightings, these numbers represent the number of individuals for the single species represented in the row (i.e., the original group size estimate has been scaled by the percentage attributed to this species).

  • The columns following those group size estimates (prob through spp_max) detail how group sizes were estimated: prob indicates whether probable species codes were accepted; mixed indicates whether this species' sighting is part of a mixed-species sighting; n_sp provides the number of species occurring in this sighting; n_obs gives the number of observers who contributed group size estimates; n_best through n_high give the number of valid group size estimates provided; calibr indicates whether or not calibration was attempted for this sighting based on the settings (see next section); mixed_max indicates whether this species was the most abundant in the sighting (if multi-species); and spp_max indicates the species code for the most abundant species in the sighting (if multi-species). A short example follows this list.

  • As explained above, the final column, included, indicates whether this species should be included in the analysis.
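
For instance, to see how total group sizes were apportioned among species within mixed-species sightings (all column names as listed above):

# Per-species share of group size in mixed-species sightings
cruz$cohorts$all$sightings %>%
  filter(mixed) %>%
  select(species, best, ss_tot, ss_percent, spp_max) %>%
  head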

Here is a glimpse of the data:

cruz$cohorts$all$sightings %>% glimpse
Rows: 3,934
Columns: 88
$ Event            <chr> "S", "S", "S", "S", "S", "S", "S", "S", "S", "S", "S"
$ DateTime         <dttm> 2009-02-06 09:50:14, 2009-02-06 11:11:17, 2009-02-06…
$ Lat              <dbl> 21.93600, 21.83917, 21.78333, 21.70433, 21.69733, 21.
$ Lon              <dbl> -158.8153, -158.6568, -158.4722, -158.2835, -158.2485
$ OnEffort         <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
$ Cruise           <dbl> 901, 901, 901, 901, 901, 901, 901, 901, 901, 901, 901…
$ Mode             <chr> "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C"
$ OffsetGMT        <int> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 1…
$ EffType          <chr> "S", "S", "S", "S", "S", "S", "S", "N", "N", "N", "N"
$ ESWsides         <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
$ Course           <dbl> 167, 108, 111, 106, 106, 102, 102, 39, 39, 39, 39, 94…
$ SpdKt            <dbl> 9.4, 8.6, 9.2, 10.4, 10.4, 10.2, 10.2, 9.1, 9.1, 9.1,…
$ Bft              <dbl> 4, 4, 3, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5, 5, 5, 5, 4, 2,…
$ SwellHght        <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 1,…
$ WindSpdKt        <dbl> 15, 11, 6, 3, 3, 4, 4, 9, 9, 9, 15, 15, 15, 17, 17, 1…
$ RainFog          <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 5, 5, 5, 5, 3, 1, 5,…
$ HorizSun         <dbl> 1, 2, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5, 3, 3, 12, N…
$ VertSun          <dbl> 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3, NA…
$ Glare            <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ Vis              <dbl> 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0
$ ObsL             <chr> "197", "308", "086", "197", "197", "307", "307", "308…
$ Rec              <chr> "309", "307", "238", "309", "309", "197", "197", "307…
$ ObsR             <chr> "086", "197", "308", "086", "086", "309", "309", "197…
$ ObsInd           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ EffortDot        <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
$ EventNum         <chr> "027", "061", "129", "168", "176", "188", "191", "204…
$ file_das         <chr> "CenPac1986-2020_Final_alb_edited_example.das", "CenP…
$ line_num         <int> 424551, 424585, 424655, 424696, 424704, 424716, 42471…
$ stratum_HI_EEZ   <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
$ stratum_OtherCNP <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
$ stratum_WHICEAS  <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
$ year             <dbl> 2009, 2009, 2009, 2009, 2009, 2009, 2009, 2009, 2009,…
$ month            <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
$ day              <int> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 8,…
$ yday             <dbl> 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 3…
$ km_valid         <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
$ km_int           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ km_cum           <dbl> 142323.0, 142344.5, 142366.8, 142388.5, 142392.2, 142…
$ ship             <chr> "OES", "OES", "OES", "OES", "OES", "OES", "OES", "OES…
$ stratum          <chr> "WHICEAS", "WHICEAS", "WHICEAS", "WHICEAS", "WHICEAS"
$ use              <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
$ eff_bloc         <chr> "291-0", "291-0", "291-0", "291-0", "291-0", "291-0",…
$ seg_id           <dbl> 2062, 2062, 2062, 2062, 2062, 2062, 2062, 2055, 2055,…
$ SightNo          <chr> "001", "002", "003", "004", "005", "006", "007", "008…
$ Subgroup         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ SightNoDaily     <chr> "20090206_1", "20090206_2", "20090206_3", "20090206_4…
$ Obs              <chr> "086", "197", "308", "086", "086", "309", "307", "308…
$ ObsStd           <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
$ Bearing          <dbl> 78, 3, 23, 5, 57, 49, 331, 354, 358, 7, 260, 42, 12, …
$ Reticle          <dbl> 1.4, 0.3, 2.0, 1.4, 1.0, 2.0, 0.8, 8.0, 0.1, NA, NA, …
$ DistNm           <dbl> 2.17, 4.10, 1.76, 2.17, 2.58, 1.76, 2.87, 0.31, 5.27,…
$ Cue              <dbl> 2, 6, 6, 3, 6, 6, 6, 6, 6, 6, 3, 6, 3, 6, 6, 6, 6, 6,…
$ Method           <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 1, 1, 4, 4, 4, 2, 4, 4, 4,…
$ Photos           <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N"…
$ Birds            <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N"…
$ CalibSchool      <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N"…
$ PhotosAerial     <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N"…
$ Biopsy           <chr> "Y", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N"…
$ CourseSchool     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ TurtleSp         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ TurtleGs         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ TurtleJFR        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ TurtleAge        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ TurtleCapt       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ PinnipedSp       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ PinnipedGs       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ BoatType         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ BoatGs           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ PerpDistKm       <dbl> 3.9310187037330472925589, 0.3973973829439210736503, 1…
$ species          <chr> "076", "076", "076", "076", "076", "076", "076", "076…
$ best             <dbl> 1.000000, 2.057008, 1.030011, 1.986117, 1.000000, 1.1
$ low              <dbl> 1.000000, 2.000000, 1.000000, 2.000000, 1.000000, 1.0
$ high             <dbl> 1.000000, 3.000000, 1.246355, 2.000000, 2.000000, 1.0
$ prob             <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ mixed            <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ ss_tot           <dbl> 1.000000, 2.057008, 1.030011, 1.986117, 1.000000, 1.1
$ lnsstot          <dbl> 0.00000000, 0.72125271, 0.02956944, 0.68618171, 0.000
$ ss_percent       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ n_sp             <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ n_obs            <int> 1, 3, 3, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 2, 2,…
$ n_best           <int> 1, 3, 3, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 2, 2,…
$ n_low            <int> 1, 3, 3, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 2, 2,…
$ n_high           <int> 1, 3, 3, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 2, 2,…
$ calibr           <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
$ ss_valid         <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
$ mixed_max        <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
$ spp_max          <chr> "076", "076", "076", "076", "076", "076", "076", "076…
$ included         <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…

Note that the process_sightings() function draws upon cruz$settings for inclusion criteria, but some of those settings can be overridden with the function’s manual inputs if you want to explore your options (see below).

Group size estimates

In the settings we are using in this tutorial, group size estimates are adjusted using the calibration models from Barlow et al. (1998) (their analysis is refined slightly and further explained in Gerrodette et al. (2002)). These calibration corrections are observer-specific. Most observers tend to underestimate group size and their estimates are adjusted up; others tend to overestimate and their estimates are adjusted down. Some observers do not have calibration coefficients, and for them a generic adjustment (upward, by dividing estimates by 0.8625) is used. In LTabundR, each observer's estimate is calibrated, then all observer estimates are averaged. To do that averaging, our settings specify a weighted geometric mean rather than an arithmetic mean, weighting school size estimates from multiple observers according to the variance of their calibration coefficients.
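
As a quick arithmetic illustration of that generic adjustment (note that several of the calibrated estimates printed below, e.g. 1.159420 and 2.318841, are consistent with raw estimates of 1 and 2 adjusted this way):

# Generic calibration for observers without coefficients:
# raw estimates are adjusted upward by dividing by 0.8625
raw <- c(1, 2, 8)
raw / 0.8625
[1] 1.159420 2.318841 9.275362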

Here are our current best estimates of group size:

cruz$cohorts$all$sightings$best %>% head(20)
 [1]   1.000000   2.057008   1.030011   1.986117   1.000000   1.159420
 [7]   2.318841   2.318841   3.666409   7.513902   2.783748   9.275362
[13]   3.478261   1.159420   1.986117   1.159420   1.000000   3.279336
[19]   8.115942 393.320601

Let’s compare those estimates to unadjusted ones, in which calibration (and therefore the weighted geometric mean) is turned off:

cruz_demo <- process_sightings(cruz, 
                               calibrate = FALSE,
                               verbose = FALSE)
cruz_demo$cohorts$all$sightings$best %>% head(20)
 [1]   1.000000   2.000000   1.000000   2.000000   1.000000   1.000000
 [7]   2.000000   2.000000   3.162278   6.480741   3.000000   8.000000
[13]   3.000000   1.000000   2.000000   1.000000   1.000000   2.828427
[19]   7.000000 234.627970

You can also carry out calibration corrections without using a weighted geometric mean (the arithmetic mean will be used instead):

cruz_demo <- process_sightings(cruz, 
                               calibrate = TRUE,
                               geometric_mean = FALSE,
                               verbose = FALSE)
cruz_demo$cohorts$all$sightings$best %>% head(20)
 [1]   1.000000   2.166746   1.053140   1.986117   1.000000   1.159420
 [7]   2.318841   2.318841   4.057971   7.536232   2.783748   9.275362
[13]   3.478261   1.159420   1.986117   1.159420   1.000000   3.478261
[19]   8.115942 465.411665

Note that when geometric_mean = TRUE but calibration is not carried out, the simple geometric mean is calculated instead of the weighted geometric mean, since the weights are the variance estimates from the calibration routine.
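
For reference, a simple (unweighted) geometric mean of several observer estimates can be computed as:

# Simple geometric mean of two observer estimates, 2 and 4
ests <- c(2, 4)
exp(mean(log(ests)))
[1] 2.828427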

Also note that group size calibration is only carried out if settings$group_size_coefficients is not NULL. However, even when calibration coefficients are provided, it is possible to specify that calibration should only be carried out for raw estimates at or above a minimum threshold (see the cohort setting calibration_floor, whose default is 0), since observers may be unlikely to mis-estimate the group size of a lone whale or a pair. For observers who have calibration coefficients in the settings$group_size_coefficients table, that minimum is specified for each observer individually. For observers not in that table, calibration will only be applied to raw group size estimates at or above settings$cohorts[[i]]$calibration_floor.

Process subgroups

After sightings data are processed, the process_surveys() function calls the subroutine process_subgroups() to find and calculate subgroup group size estimates for false killer whales (or other species that may have been recorded using the subgroup functionality in WinCruz), if any occur in the DAS data (Event code “G”).

cruz <- process_subgroups(cruz) 

If subgroups are found, a subgroups slot is added to the analysis list for that cohort.

cruz$cohorts$all %>% names
[1] "segments"  "das"       "sightings" "subgroups"

This subgroups slot holds a list with three data.frames (a brief example follows the list):

cruz$cohorts$all$subgroups %>% names
[1] "sightings" "subgroups" "events"   

  • $events: each row is a group size estimate for a single subgroup during a single phase of the false killer whale protocol (if applicable) within a single sighting; this is effectively the raw data. Internally, LTabundR uses the function subgroup_events() to produce this data.frame.

  • $subgroups: each row is a single subgroup for a single protocol phase, with all group size estimates averaged together (both arithmetically and geometrically). Internally, LTabundR uses the function subgroup_subgroups() to produce this data.frame.

  • $sightings: each row is a “group” size estimate for a single sighting during a single protocol phase, with all subgroup group sizes summed together. Note that for false killer whales this “group” size estimate is not likely to represent actual group size, because groups can be spread out over tens of kilometers and it is not expected that every subgroup is detected during each protocol phase. Internally, LTabundR uses the function subgroup_sightings() to produce this data.frame.
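
Each of these slots is an ordinary data.frame, so it can be inspected directly; for example:

# Peek at the subgroup-level summary table
cruz$cohorts$all$subgroups$subgroups %>% head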

For a detailed example of how subgroup data are analyzed, see the vignette page on subgroup analysis using data on Central North Pacific false killer whales.

Subgroup phase assignments

Currently, the protocol phase for each event is determined automatically from the column OnEffort: if TRUE, the column Phase is 1; if FALSE, it is 2.
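
Expressed in code, the default rule is simply this (a sketch of the logic described above, not the internal implementation):

# Default phase assignment: 1 when on effort, 2 otherwise
OnEffort <- c(TRUE, FALSE, TRUE)
ifelse(OnEffort, 1, 2)
[1] 1 2 1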

Since this assignment routine is simplistic, users may need to manually revise some of these assignments. To facilitate that work, LTabundR offers the function subgroup_phases(), which launches an interactive Shiny app that allows users to review phase assignments and manually stage a set of revisions. Here is a step-by-step example of what this process looks like:

First, process your DAS data as usual using the process_surveys() function. The resulting cruz object will contain a subgroups slot for cohorts with G events in the DAS data:

cruz$cohorts$all$subgroups %>% names
[1] "sightings" "subgroups" "events"   

Second, manually review the phase assignments for subgroup events in your cohort of interest using the function subgroup_phases():

edits <- subgroup_phases(cruz, cohort = 'all')

This function will launch an interactive Shiny app.

The app provides instructions at the top of its screen. In short, you select the rows of data whose Phase column needs to be revised. Below the table, you specify what the revised Phase should be, and a button will appear to add this revision to a list of staged edits. You can repeat this process as many times as needed as you review the data. If you make a mistake, you can select the staged edit in question and a button will appear allowing you to delete it.

By the end of this process, you will have a list of staged edits that you will want to save. A button in the top right of the screen allows you to close the app and return the staged edits as a data.frame. We capture that output in the object edits in our code above.

Those staged edits will look something like this:

# Example of `edits` output
edits
  cohort Cruise Species   Line SubGrp Event old_phase new_phase
1    all   1108     033 480021      A     1         1         2
2    all   1108     033 480034      B     1         1         2

If you need to repeat step 2, just save your edits into new objects (e.g., edits2, edits3, etc.), then bind them together with rbind() as you would any set of data.frames with identical column names.
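
For instance (edits2 here stands for a hypothetical second set of staged edits returned by another pass through subgroup_phases()):

# Combine two rounds of staged edits into a single data.frame
edits_all <- rbind(edits, edits2)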

Third, update your cruz object with these edits. To do so, run the LTabundR function process_subgroups() and include your edits as an input:

cruz <- process_subgroups(cruz, 
                          phase_edits = edits)

This will essentially re-do the subgroup processing stage of process_surveys(), this time taking your phase edits into account.

Review

By the end of this process, you have a single data object, cruz, with all the data you need to move forward into the next stages of mapping and analysis.

The LTabundR function cruz_structure() provides a synopsis of the data structure:

cruz_structure(cruz)
"cruz" list structure ========================

$settings
         $strata --- with 11 polygon coordinate sets
         $survey --- with 15 input arguments
         $cohorts --- with 3 cohorts specified, each with 19 input arguments

$strata
       ... containing a summary dataframe of 11 geostrata and their spatial areas
       ... geostratum names:
           HI_EEZ, OtherCNP, MHI, WHICEAS, Spotted_OU, Spotted_FI, Spotted_BI, Bottlenose_KaNi, Bottlenose_OUFI, Bottlenose_BI, NWHI

$cohorts

        $all
            geostrata: WHICEAS, HI_EEZ, OtherCNP
            $segments  --- with 1457 segments (median = 149.5 km)
            $das       --- with 329638 data rows
            $sightings --- with 3934 detections
            $subgroups --- with 255 subgroups, 61 sightings, and 389 events

        $bottlenose
            geostrata: WHICEAS, HI_EEZ, OtherCNP, Bottlenose_BI, Bottlenose_OUFI, Bottlenose_KaNi
            $segments  --- with 1538 segments (median = 149.2 km)
            $das       --- with 329638 data rows
            $sightings --- with 523 detections

        $spotted
            geostrata: WHICEAS, HI_EEZ, OtherCNP, Spotted_OU, Spotted_FI, Spotted_BI
            $segments  --- with 1540 segments (median = 149.1 km)
            $das       --- with 329638 data rows
            $sightings --- with 527 detections

Each species-specific cohort has its own list under cruz$cohorts, and each of these cohorts has the same list structure:

  • segments is a summary table of segments.

  • das is the raw DAS data, modified with seg_id to associate each row with a segment.

  • sightings is a dataframe of sightings processed according to this cohort’s settings.

  • subgroups (if any subgroup data exist in your survey) is a list with subgroup details.

In each cohort data.frame, there are three critically important columns to keep in mind:

  • seg_id: this column is used to indicate the segment ID that a row of data belongs to.

  • use: this column indicates whether a row of effort should be used in the line-transect analysis. Every row of data within a single segment will have the same use value.

  • included: this column occurs in the sightings dataframe only. It indicates whether a sighting should be included in the line-transect analysis based on the specified settings. Any sighting with use == FALSE will also have included == FALSE, but it is possible for sightings to have use == TRUE and included == FALSE. For example, if the setting abeam_sightings is set to FALSE, a sighting with a bearing angle beyond the ship's beam can be excluded from the analysis (included == FALSE) even though the effort segment it occurs within is still used (use == TRUE). A quick cross-tabulation example follows this list.
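
You can confirm that relationship directly; for example:

# Cross-tabulate effort 'use' against sighting 'included'
with(cruz$cohorts$all$sightings, table(use, included))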

Finally, let’s save this cruz object locally, to use in downstream scripts:

save(cruz, file='whiceas_cruz.RData')