--- title: "Dealing with the GESLA dataset in R" output: rmarkdown::html_vignette author: Fernando Mayer, Niamh Cahill vignette: > %\VignetteIndexEntry{Dealing with the GESLA dataset in R} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- The **geslaR** package was designed to fetch the [GESLA][] dataset into R, as a full dataset or as a subset of the data. There a few ways that this can be achieved: - Downloading the entire dataset locally, then loading it into R and working from there using the [Apache Arrow][] framework (`download_gesla()`) - Querying the hosted dataset directly from R, specifying the desired subset (`query_gesla()`) - Using the GESLA Shiny app (thereafter called only geslaR-app), to filter, visualise and possibly download subsets of the dataset (`run_gesla_app()`) In the sections below, we will discuss each of these possibilities. ## Loading the package When you first load the **geslaR** package into an R session, two more packages will automatically be loaded: **arrow** and **dplyr**. ```r library(geslaR) #> Loading required package: arrow #> #> Attaching package: 'arrow' #> The following object is masked from 'package:utils': #> #> timestamp #> Loading required package: dplyr #> #> Attaching package: 'dplyr' #> The following objects are masked from 'package:stats': #> #> filter, lag #> The following objects are masked from 'package:base': #> #> intersect, setdiff, setequal, union ``` These two packages are strictly necessary for dealing with a huge dataset like GESLA, to alleviate the need to load it in memory all at once. The full GESLA dataset has more than 1.7 billion lines, and is nearly impossible to load it in R (or any other RAM memory language, like e.g. Python) all at once. To avoid this problem, there are two possibilities: 1. Load (smaller) subsets of the full dataset 2. Use an "in-memory" analytics framework In the best case scenario, these two approaches can be used together for maximum efficiency, and the **geslaR** package was designed to achieve this. Regardless the way the GESLA data is loaded into R, the best way to deal with it is using the [Apache Arrow][] framework. For a more detailed explanation and examples, please see the `vignette("intro-to-arrow")`. ## Downloading the full GESLA dataset If you want to explore the full GESLA dataset, you can download it with the `download_gesla()` function as follows ```r download_gesla() ``` This will create a directory called `gesla_dataset` in the current working directory (as defined by `getwd()`) and download the full dataset locally. This download may take some time (expect around 5 to 10 minutes), as it depends on internet connection. Note that this full dataset will need at least 7GB of (hard drive) storage, so make sure this is feasible. However, once downloaded, you will have access to the full dataset, and you will only need to do this once. You will notice that the full dataset is composed of 5119 [Apache Parquet][] files, ending in `.parquet` ```r ## Number of downloaded files length(list.files("gesla_dataset")) #> [1] 5119 ## Check the first files head(list.files("gesla_dataset")) #> [1] "a121-a12-nld-cmems.parquet" "a2-a2-bel-cmems.parquet" #> [3] "aalesund-aal-nor-cmems.parquet" "aarhus-aar-dnk-cmems.parquet" #> [5] "aasiaat-aas-grl-gloss.parquet" "abashiri-347a-jpn-uhslc.parquet" ``` These files are the same as those originally distributed in the GESLA dataset, so that each one refers to a site from where the data comes from. 
To load this full dataset in R, use the `arrow::open_dataset()` function,
specifying the location of the `.parquet` files. Although there are many
files, this function recognises them as a single dataset, because they all
share the same structure (or "Schema").

```r
## Open dataset
da <- open_dataset("gesla_dataset")

## Check the object
da
#> FileSystemDataset with 5119 Parquet files
#> date_time: timestamp[us]
#> year: int64
#> month: int64
#> day: int64
#> hour: int64
#> country: string
#> site_name: string
#> lat: double
#> lon: double
#> sea_level: double
#> qc_flag: int64
#> use_flag: int64
#> file_name: string
#>
#> See $metadata for additional Schema metadata

## Verify class
class(da)
#> [1] "FileSystemDataset" "Dataset"           "ArrowObject"
#> [4] "R6"
```

Since this is an `ArrowObject`, the full dataset is not actually loaded into
memory (as it would be if it were a standard R object, such as a `tibble` or
`data.frame`). Note, however, that some basic information, such as `dim()`
and `names()`, can still be retrieved simply with

```r
dim(da)
#> [1] 1172435674         13

names(da)
#>  [1] "date_time" "year"      "month"     "day"       "hour"      "country"
#>  [7] "site_name" "lat"       "lon"       "sea_level" "qc_flag"   "use_flag"
#> [13] "file_name"
```

Any other manipulation of the dataset must be made using **dplyr** verbs.
For example, to count the number of observations by country, one could use

```r
da |>
  count(country) |>
  collect()
#> # A tibble: 113 × 2
#>    country         n
#>    <chr>       <int>
#>  1 BEL       1263467
#>  2 JPN      74580447
#>  3 USA     243838504
#>  4 DNK      36726648
#>  5 SLV        812230
#>  6 GBR      76038844
#>  7 YEM         63362
#>  8 CAN      67019312
#>  9 AUS      72441932
#> 10 EGY        220872
#> # ℹ 103 more rows
```

## Querying subsets directly into R

The `query_gesla()` function can be used to load subsets of the full GESLA
dataset into R. This function connects to the GESLA dataset hosted on an
Amazon AWS S3 bucket and automatically filters the required piece of data.
Note that the time a query takes may vary, as it depends on the size of the
requested subset and on internet connection speed.

The selection is made by specifying one or more countries, years and site
names. The only mandatory argument is `country`, which should be a character
vector of three-letter
[ISO 3166-1 alpha-3](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3) codes.
If the argument `year` is empty, then all available years for those countries
will be selected. The same applies to the `site_name` argument.

For example, to select all the available data for Ireland (IRL), simply use

```r
da <- query_gesla(country = "IRL")
#> ℹ This process can take some time, as it depends on the size of the final dataset, and on
#> internet connection.
#> ℹ Connecting to the data server... ✔ Connecting to the data server... [3.5s]
#> ℹ Filtering data. This can take some time... ✔ Filtering data. This can take some time... [2m 43.4s]
#> ℹ Query finished. ✔ Query finished. [13ms]

da
#> Table
#> 16321218 rows x 13 columns
#> $date_time <timestamp[us]>
#> $year <int64>
#> $month <int64>
#> $day <int64>
#> $hour <int64>
#> $country <string>
#> $site_name <string>
#> $lat <double>
#> $lon <double>
#> $sea_level <double>
#> $qc_flag <int64>
#> $use_flag <int64>
#> $file_name <string>
#>
#> See $metadata for additional Schema metadata

dim(da)
#> [1] 16321218       13

names(da)
#>  [1] "date_time" "year"      "month"     "day"       "hour"      "country"
#>  [7] "site_name" "lat"       "lon"       "sea_level" "qc_flag"   "use_flag"
#> [13] "file_name"

class(da)
#> [1] "Table"        "ArrowTabular" "ArrowObject"  "R6"
```

This is also an `ArrowObject`, and it works the same way as any other Arrow
object, i.e., it can be manipulated with **dplyr** verbs.
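As an illustration of such manipulations (output not shown here, since it
depends on the query above having been run), the number of observations per
Irish site can be computed directly on this Arrow table, with only the small
summary being collected into R:

```r
## Count observations per site in the Irish subset; the heavy lifting is
## done by Arrow, and only the summary table is brought into R
da |>
  count(site_name) |>
  arrange(desc(n)) |>
  collect()
```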
Further examples of this kind are shown at the end of this section. Many
other queries can be made, for example:

```r
## Select one specific year
da <- query_gesla(country = "IRL", year = 2015)

## Multiple years
da <- query_gesla(country = "IRL", year = c(2015, 2017))
da <- query_gesla(country = "IRL", year = 2010:2017)
da <- query_gesla(country = "IRL", year = c(2010, 2012, 2015))

## Multiple countries
da <- query_gesla(country = c("IRL", "ATA"), year = 2015)
da <- query_gesla(country = c("IRL", "ATA"), year = 2010:2017)
```

As a last example, suppose we are interested only in data for the site
`Dublin_Port` in Ireland, for the years 2015 and 2017. This query would be
done as follows:

```r
da <- query_gesla(country = "IRL", year = c(2015, 2017),
  site_name = "Dublin_Port")
#> ℹ This process can take some time, as it depends on the size of the final dataset, and on
#> internet connection.
#> ℹ Connecting to the data server... ✔ Connecting to the data server... [3.4s]
#> ℹ Filtering data. This can take some time... ✔ Filtering data. This can take some time... [2m 34s]
#> ℹ Query finished. ✔ Query finished. [21ms]
```

Now we can explore this subset using **dplyr** verbs as usual:

```r
## Verifying the number of observations per year
da |>
  count(year) |>
  collect()
#> # A tibble: 2 × 2
#>    year      n
#>   <int>  <int>
#> 1  2015  93790
#> 2  2017 104835

## Calculating summary statistics per year
da |>
  group_by(year) |>
  summarise(
    min = min(sea_level),
    mean = mean(sea_level),
    max = max(sea_level)) |>
  collect()
#> # A tibble: 2 × 4
#>    year   min    mean   max
#>   <int> <dbl>   <dbl> <dbl>
#> 1  2015 -2.62 -0.0104  2.38
#> 2  2017 -2.43  0.0762  2.53
```

Note the use of the `dplyr::collect()` function at the end of each call.
This is necessary to convert the final result (which is processed by Arrow)
into a regular R object (usually a `tibble` or similar).

As stated before, the queries made by `query_gesla()` may take some time to
process. To avoid repeating a query in every new session, you can save the
resulting object in a local file, so it will be available instantly the next
time you need it. To achieve this, you can use the `write_gesla()` function:

```r
write_gesla(da)
```

This call will create a file named `gesla-data.parquet`, which will be ready
for reading at any time, without the need to query again. Just use

```r
da <- read_gesla("gesla-data.parquet")
```

to make it available again. Please do not try to save this object as a
standard R file, such as `.rds` or `.rda` (using `saveRDS()` or `save()`, for
example). The saved object would not be the same as the original, because an
`ArrowObject` is not a standard R object (it is actually just a pointer to an
object managed by Arrow).

The `query_gesla()` function has two more arguments. The default
`use_flag = 1` means that only the data suggested by the GESLA team as being
useful for analysis is used. Please see the [GESLA format][] page for more
information. The other argument, `as_data_frame = FALSE`, specifies whether
the resulting object should be an `ArrowObject` (the default) or a
`data.frame`. We highly recommend keeping the default, as the resulting
`data.frame` can be larger than memory, leading to a memory overflow or even
an R session crash. Keeping the object as an `ArrowObject` guarantees that
the result of the query can always be handled. In any case, an `ArrowObject`
can always be converted to a `data.frame` afterwards, if desired, without the
need to repeat the query with `as_data_frame = TRUE`.
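If you prefer to choose the file name yourself, a similar round trip can be
done with the underlying **arrow** functions, as in the following sketch (the
file name here is just an illustrative example):

```r
## Write the Arrow Table to a Parquet file with a custom name
write_parquet(da, "dublin-port-2015-2017.parquet")

## In a later session, read it back as an Arrow Table (not a data.frame)
da <- read_parquet("dublin-port-2015-2017.parquet", as_data_frame = FALSE)
```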
To convert the object `da` created above (which is an `ArrowObject`) to a
`data.frame`, simply use `collect()`:

```r
class(da)
#> [1] "Table"        "ArrowTabular" "ArrowObject"  "R6"

db <- da |>
  collect()

class(db)
#> [1] "tbl_df"     "tbl"        "data.frame"
```

## Using the GESLA Shiny app

The GESLA Shiny app (geslaR-app) was developed as a user-friendly interface
to explore the whole GESLA dataset. The app can be launched simply with

```r
run_gesla_app()
```

This will create a directory called `gesla-app` in the current working
directory, containing the necessary files for the app itself, and a
sub-directory called `gesla_dataset` containing all the GESLA dataset files
(as `.parquet` files, downloaded with the `download_gesla()` function as
explained above). Note that the first time this function is called, it will
create these directories and download the full GESLA dataset. Once this is
done, the next time the function is called from this working directory, only
the app will open in the default browser (no further downloads will be
necessary).

This web interface allows the user to apply some basic filters, by choosing
countries, years and sites. The app will then filter the data and show
information about the selected subset. The first tab contains a preview of
the selected subset. The two subsequent tabs provide plots of the number of
observations by year, and summary plots for the sea level data itself. The
last tab shows a map with the selected locations; clicking on a location
opens a pop-up with information about that specific station.

The user also has the option to download the selected subset in CSV or
Parquet file format. We highly recommend exporting it as a Parquet file, as
the file size will be much smaller, and it can simply be loaded into R with
the `read_gesla()` function, creating an `ArrowObject`.

[GESLA]: https://gesla787883612.wordpress.com
[Apache Arrow]: https://arrow.apache.org
[Apache Parquet]: https://parquet.apache.org
[GESLA format]: https://gesla787883612.wordpress.com/format/
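Once a subset has been exported from the app, it can be read back and
explored just like the objects created earlier. The sketch below assumes a
hypothetical file name for the exported Parquet file:

```r
## Hypothetical file name for a subset exported from the geslaR-app
da_app <- read_gesla("gesla-subset.parquet")

## The result is an ArrowObject, so dplyr verbs work as before
da_app |>
  count(site_name) |>
  collect()
```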