4 Chapter 3 - Final optimization of presence data

In this chapter, we will load the presence data we worked on in Chapter 1 and the raster variables we processed in Chapter 2. We will optimize the presence data by considering the spatial resolution of the variables.

This ensures that no presence is repeated in the same pixel (also referred to as a cell). Multiple presences in the same pixel would give the models the same set of environmental conditions repeated once per presence, which can artificially bias the models towards those values. By setting a resolution and removing duplicate presences, we also reduce bias towards areas with more sampling effort (an area with more effort tends to accumulate more duplicates, especially at coarser spatial resolutions).
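
To make the effect of resolution concrete, here is a minimal sketch in base R. The coordinates are made up and count_duplicates is a hypothetical helper written only for this illustration; it bins the points into square cells of a given size and counts how many records share a cell with an earlier record.

```r
# made-up coordinates: four nearby records and one isolated record
x <- c(2.05, 2.15, 2.25, 2.75, 3.45)
y <- c(42.05, 42.15, 42.25, 42.75, 41.25)

# hypothetical helper: count records that fall in a grid cell of size `res`
# already occupied by an earlier record
count_duplicates <- function(x, y, res) {
    key <- paste(floor(x / res), floor(y / res))
    sum(duplicated(key))
}

dup_fine   <- count_duplicates(x, y, res = 0.1)  # fine grid
dup_coarse <- count_duplicates(x, y, res = 1)    # coarse grid
c(fine = dup_fine, coarse = dup_coarse)
```

With these toy points there are no duplicates on the fine grid and three on the coarse one, all of them in the densely sampled area.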

We will only need the terra library for this chapter.

library(terra)

We can now open the relevant data for this chapter. Since we fully aligned the raster variables in the last chapter, they all share the same grid, so opening a single one provides enough information for this process. The presence data set is the one produced at the end of Chapter 1.

evi <- rast("data/rasters/evi.tif")
pres <- read.table("data/species/speciesPresence_v1.csv", sep="\t", header=TRUE)

Now we will extract the raster values at each presence location. For that, we will use the extract function provided by the terra package. This function detects in which pixel/cell each presence is located and extracts the value of that pixel. If a presence is located in a No Data pixel, we will be able to identify it and remove it. However, here we are also interested in identifying whether two presences share the same pixel. We cannot rely on EVI values for that, as different pixels might have the same value.

We can set the cells=TRUE parameter in the extract function. For each presence, this will provide a unique pixel/cell identifier. Thus, if two presences are in the same pixel, they will have the same identifier, and we can keep only one, removing the duplicates.
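
As an illustration of what such an identifier represents, the sketch below computes a cell number by hand in base R for a toy regular grid. It assumes cells are numbered row by row from the top-left corner, starting at 1, which is the convention terra documents for its cell numbers. The cell_id helper and the toy grid are hypothetical and ignore edge cases such as points lying exactly on the outer border; in practice terra does this work for us.

```r
# hypothetical helper: cell number of a point on a regular grid,
# numbered row by row from the top-left corner, starting at 1
cell_id <- function(x, y, xmin, ymax, res, ncol) {
    col <- floor((x - xmin) / res) + 1
    row <- floor((ymax - y) / res) + 1
    (row - 1) * ncol + col
}

# toy grid: 10 x 10 cells of 1 degree, covering x from 0 to 10 and y from 40 to 50
pts <- data.frame(x = c(2.2, 2.7, 3.1), y = c(42.4, 42.9, 42.4))
pts$cell <- cell_id(pts$x, pts$y, xmin = 0, ymax = 50, res = 1, ncol = 10)
pts

# the second point shares its cell with the first one
duplicated(pts$cell)
```

The first two points receive the same cell number even though their coordinates differ, which is exactly the situation we want to detect.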

dt <- extract(evi, pres[,c("x", "y")], cells=TRUE)
head(dt)

##   ID       evi  cell
## 1  1 0.5606292 70845
## 2  2 0.5706229 69470
## 3  3 0.5267091 71757
## 4  4 0.6034736 68556
## 5  5 0.5350373 70382
## 6  6 0.4996728 68569

Now we remove those presences that fall in No Data by checking the evi column of the extracted data table:

mask <- is.na(dt$evi)

# how many presences fall in missing data?
sum(mask)

## [1] 10

dt <- dt[!mask,]
pres <- pres[!mask,]
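
Note that the same mask is applied to both tables, so that row i of the extracted table still describes row i of the presence table. The sketch below shows this on made-up toy tables (dt_toy and pres_toy are hypothetical names) and adds a simple sanity check that could also be run on the real data.

```r
# made-up toy tables with the same row order
dt_toy <- data.frame(ID = 1:4,
                     evi = c(0.56, NA, 0.52, NA),
                     cell = c(10, 11, 12, 13))
pres_toy <- data.frame(species = c("Vaspis", "Vaspis", "Vseoanei", "Vseoanei"),
                       x = c(2.70, 2.09, -6.10, -6.20),
                       y = c(42.05, 42.31, 43.10, 43.20))

# one mask, applied to both tables
mask_toy <- is.na(dt_toy$evi)
dt_toy <- dt_toy[!mask_toy, ]
pres_toy <- pres_toy[!mask_toy, ]

# sanity checks: same number of rows, and no missing values left
stopifnot(nrow(dt_toy) == nrow(pres_toy))
stopifnot(!anyNA(dt_toy$evi))
```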

Now that both the presence and extracted data sets are free of missing data, we need to check for pixel duplicates. However, there is a detail that adds a bit of complexity to the process. Since we are working with three species, we have to check for duplicates independently for each species. This is because the three species can coexist in the same pixel (sympatry), and we do not want to mistake this situation for pixel duplicates.

The easiest way to do this is to create a loop over the species so that we can detect duplicates for each species independently. The code is organized as follows:

1. Check the names of the species to loop over (three different species in this case).
2. Create a column in the presence data set that will store TRUE if the presence is duplicated in the pixel (i.e., if another presence of the same species was already recorded in that pixel) or FALSE if the presence is unique or the first record for a given pixel.
3. Loop over the species:
    1. Identify the rows of the presence table referring to the current species in the loop.
    2. Detect duplicates only for the current species.
    3. Update the duplicated column created in step 2 with the relevant information for the species.
    4. Print in the console how many duplicates were found for the species.

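# names of the species to loop over
sps <- unique(pres$species)

# column flagging duplicates, filled in species by species
pres$duplicated <- NA

for (sp in sps) {
    # rows of the current species (pres and dt share the same row order)
    rows <- which(pres$species == sp)
    # TRUE where the cell was already seen for this species
    sp.dup <- duplicated(dt$cell[rows])
    pres$duplicated[rows] <- sp.dup
    print(paste("Species", sp, "has", sum(sp.dup), "duplicates"))
}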

## [1] "Species Vaspis has 9349 duplicates"
## [1] "Species Vlatastei has 959 duplicates"
## [1] "Species Vseoanei has 750 duplicates"
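
As a cross-check of the loop logic, the same flags can be obtained without a loop by asking base R for duplicated combinations of species and cell. The sketch below compares both approaches on a made-up toy table (the species labels and cell numbers are hypothetical); it is an alternative way to express the idea, not a replacement for the code above.

```r
# made-up toy table: species spA and spB share cell 10 (sympatry)
toy <- data.frame(
    species = c("spA", "spA", "spB", "spA", "spB", "spB"),
    cell    = c(10, 10, 10, 11, 10, 12)
)

# loop version, as in the chapter
dup_loop <- rep(NA, nrow(toy))
for (sp in unique(toy$species)) {
    rows <- which(toy$species == sp)
    dup_loop[rows] <- duplicated(toy$cell[rows])
}

# vectorised version: a row is a duplicate if the same
# species/cell combination appeared in an earlier row
dup_vec <- duplicated(toy[, c("species", "cell")])

# both approaches flag the same rows
all(dup_loop == dup_vec)
```

In this toy table the first record of spB in cell 10 is not flagged, even though spA was already recorded there, because duplicates are only counted within a species.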

We can inspect the first rows of the table and count how many duplicates were identified in total (each TRUE counts as 1 and each FALSE as 0, so the sum of the column is the number of duplicates).

head(pres)

##   species    x     y duplicated
## 1  Vaspis 2.70 42.05      FALSE
## 2  Vaspis 2.09 42.31      FALSE
## 3  Vaspis 2.70 41.87      FALSE
## 4  Vaspis 1.97 42.49      FALSE
## 5  Vaspis 2.09 42.13      FALSE
## 6  Vaspis 3.06 42.50      FALSE

sum(pres$duplicated)

## [1] 11058

Everything seems coherent, so we proceed to remove the duplicated rows. Subsetting with [ ] only keeps the rows set as TRUE, but we want to keep the rows that are not duplicated (set as FALSE). We therefore invert the logical values by placing an exclamation point (!) before them, which turns TRUE into FALSE and FALSE into TRUE.
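
A minimal sketch of this inversion on a made-up logical vector (the names dup and records are hypothetical):

```r
# made-up flags: the second and fourth records are duplicates
dup <- c(FALSE, TRUE, FALSE, TRUE)
records <- c("a", "b", "c", "d")

# inverting the flags marks the records we want to keep
keep <- !dup
keep

# subsetting keeps only the positions set as TRUE
kept <- records[keep]
kept
```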

final_pres <- pres[!pres$duplicated, 1:3]

We only kept the first 3 columns (species name, longitude and latitude) as they hold the information we need for modelling.

We can get a summary of how many presence data points we kept at the end of the presence processing.

dim(final_pres)

## [1] 6796    3

table(final_pres$species)

## 
##    Vaspis Vlatastei  Vseoanei 
##      4969      1227       600

Finally, we end this chapter by saving a new file with the optimized data set that will be used for modelling.

# write to file
filename <- "data/species/speciesPresence_v2.csv"
write.table(final_pres, filename, sep="\t", row.names=FALSE, col.names=TRUE)
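
If we want to confirm that a table written this way can be read back unchanged, a round trip through a temporary file is a simple check. The sketch below uses a made-up toy table and a temporary file (both hypothetical, so no project file is touched), with the same separator and header settings used in this chapter.

```r
# toy table with made-up values, only to illustrate the round trip
toy <- data.frame(
    species = c("Vaspis", "Vlatastei"),
    x = c(2.70, 2.09),
    y = c(42.05, 42.31),
    stringsAsFactors = FALSE
)

# temporary file, so that no project file is touched
tmp <- tempfile(fileext = ".csv")
write.table(toy, tmp, sep="\t", row.names=FALSE, col.names=TRUE)

# read it back with the same separator and header settings
back <- read.table(tmp, sep="\t", header=TRUE, stringsAsFactors=FALSE)

# TRUE if the table read back matches the one we wrote
same <- isTRUE(all.equal(toy, back))
same
```

Note that, despite the .csv extension, the file is tab-separated, so it has to be read with sep="\t" as in the read.table call at the start of this chapter.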