Importing Data in R
Notes on ways to get data into R
By Chi Kit Yeung in R Data Analysis Notes
August 26, 2022
Importing Multiple CSVs from a Directory
A lot of times I find myself with data sources that are split across multiple separate CSV files. This is the best method I've found so far to import them all into a single data frame.
# Importing the data
library(dplyr)  # %>%, full_join(), slice()
library(purrr)  # reduce()

# First define the path where the data is stored
path <- "../data/"
filenames <- list.files(path = path)

## Below is to import a specific range of files in the directory
# files <- list.files(path = path)
# filenames <- files[c((length(files)-7):length(files))]

## Read each file into a list of data frames
all_df <- lapply(filenames, function(i) {
  i <- paste(path, i, sep = "")
  read.csv(i, header = FALSE)
})

## Name each list element after its source file
filenames <- gsub("-", "_", filenames)
names(all_df) <- gsub(".csv", "", filenames)

## Joining all the data frames
joined_df <- all_df %>%
  reduce(full_join)

## Because header = FALSE, the real column names end up in the first row;
## promote them and drop that row
names(joined_df) <- joined_df[1, ]
df <- joined_df %>%
  slice(-1)
The disadvantage of this method is that every column is read in as character, so an additional step is needed to convert the columns to the right types.
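For example, one quick way to fix this afterwards is base R's type.convert() (or readr::type_convert()); the sketch below assumes the df built above:

# Re-infer each column's type after the all-character import above
df <- type.convert(df, as.is = TRUE)
# Or, with the tidyverse:
# df <- readr::type_convert(df)
str(df)  # check the resulting column types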
Importing Data from REST API
Recently at work I've found the need to access data via REST API servers. Our organization has some client Python wrappers readily available, but none for R, so I had to create my own to access the data. It was a great learning experience.
To tackle this I decided to create a helper R script that stores the API functions, which my RMD reports can call using source("../api.R").
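As a rough sketch of that layout (the base URL, endpoint, and function name below are hypothetical placeholders, not the actual API), api.R can hold one thin wrapper per endpoint:

# api.R -- hypothetical helper sourced by the RMD reports
library(httr)
library(jsonlite)

base_url <- "https://example.com/api"  # placeholder base URL

# Fetch an endpoint and return the parsed JSON payload
get_results <- function(endpoint, query = list()) {
  res <- GET(url = paste0(base_url, endpoint), query = query)
  stop_for_status(res)
  fromJSON(content(res, as = "text", encoding = "UTF-8"))
}

A report chunk then only needs source("../api.R") followed by something like df <- get_results("/results").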
First we'll have to get to know a couple of R packages that make this process relatively painless.
library(jsonlite)
library(httr)
Get to know your data source
The REST API you are trying to access will (hopefully) have some documentation that guides you on how to access the data you want. In my case I'm accessing data from tests being run in an off-shore server lab.
Typically a REST API will have a base URL plus a number of sub-paths (endpoints) that serve different purposes.
As an example I’ll use GitHub’s API. Documentation here: https://docs.github.com/en/rest
# Define the base URL
base <- "https://api.github.com"
url <- paste(base, "/repos/chikity/r-notebook", sep = "")
res <- GET(url = url, config = list())
status_code <- httr::status_code(res)
print(paste("Status code:", status_code))
## [1] "Status code: 200"
GET
GET(url = NULL, config = list(), ..., handle = NULL)
GET(url = url, config = list())
## Response [https://api.github.com/repos/chikity/r-notebook]
## Date: 2023-01-13 08:08
## Status: 200
## Content-Type: application/json; charset=utf-8
## Size: 5.41 kB
## {
## "id": 515826878,
## "node_id": "R_kgDOHr7kvg",
## "name": "r-notebook",
## "full_name": "chikity/r-notebook",
## "private": false,
## "owner": {
## "login": "chikity",
## "id": 24665013,
## "node_id": "MDQ6VXNlcjI0NjY1MDEz",
## ...
POST
POST(
url = NULL,
config = list(),
...,
body = NULL,
encode = c("multipart", "form", "json", "raw"),
handle = NULL
)
A request body is needed for POST calls. The specifics of the contents, structure, and data type depend on your use case and can be found in the API's documentation, but more often than not the contents have to be sent as a JSON object. This is where the jsonlite package comes in handy.
Building the Body
# First make a nested list (depends on your POST's specifics)
body <- list()
subBody <- list()
subBody2 <- list()
page <- 1
subBody$contentX <- "a"
subBody$contentY <- "b"
subBody2$foo <- "bar"
body$subBody <- subBody
body$subBody2 <- subBody2
body$page <- page
b <- jsonlite::toJSON(body, pretty = TRUE, auto_unbox = TRUE)
print("Here's what it looks like as a JSON")
## [1] "Here's what it looks like as a JSON"
b
## {
## "subBody": {
## "contentX": "a",
## "contentY": "b"
## },
## "subBody2": {
## "foo": "bar"
## },
## "page": 1
## }
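With the body built, the call itself is short. Here is a minimal sketch, assuming a placeholder endpoint URL (the real one comes from your API's documentation):

# Hypothetical endpoint; replace with the URL from the API's documentation
post_url <- "https://example.com/api/results"

# Option 1: hand httr the R list and let encode = "json" serialize it
res <- POST(url = post_url, body = body, encode = "json")

# Option 2: send the pre-built JSON string from jsonlite
# res <- POST(url = post_url, body = b, content_type_json())

httr::status_code(res)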
Handling Multi-page Results
- Deconstruct the problem
- Get the total number of available pages
- Loop the call over each individual page and aggregate the results
In my particular use case, the POST API I was calling returns only 100 results per page, so to get all the results I had to iterate through all the available pages and combine the returned results. If you look at the request body example above, there is an item for page that specifies the result page number, and it can be used to step through all the available results.
# The first response reports the total number of available pages
totalPage <- result$totalPage
while (page < totalPage) {
  # Update the request body to go through the pages
  page <- page + 1
  body$page <- page
}
Getting results into a DataFrame
Basic
result <- httr::content(res, "text")
df <- fromJSON(result)
Multi-page Result
# Make a dataframe list, starting with the first page parsed above
dfs <- list(df)
while (page < totalPage) {
  # Ask the API for the next page (url is your POST endpoint)
  page <- page + 1
  body$page <- page
  res <- httr::POST(url, body = body, encode = "json")
  # Parse it and add it to the list
  result <- httr::content(res, "text")
  df <- fromJSON(result)
  dfs <- append(dfs, list(df))
}
# Combine every page into a single data frame
data <- dfs %>% reduce(full_join)
Chunking Requests
Some APIs limit the number of items they can take in a single request, so the items have to be chunked to stay within that limit.
# Get the total number of items you need to request
nRequest <- length(requestList)
chunkSize <- 100
# Make a template data frame to store the request results
dfs <- setNames(data.frame(matrix(ncol = 3, nrow = 0)), c("a", "b", "c"))
# Iterate through the requests one chunk at a time
for (i in 1:ceiling(nRequest / chunkSize)) {
  # Get the items belonging to this chunk
  request <- requestList[((i - 1) * chunkSize + 1):min(nRequest, i * chunkSize)]
  # Convert that list into a comma delimited string (depends on your API)
  input <- paste(as.character(request), collapse = ",")
  # Make the API request and append the parsed result
  response <- httr::GET(url, query = list(variable = input))
  result <- httr::content(response, "text")
  df <- fromJSON(result)
  dfs <- rbind(dfs, df)
}
How to deal with nested lists in a dataframe
Use tidyr's unnest() function.
library(tidyr)

df_new <- df %>%
  unnest(listedColumn)
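As a small illustration with toy data (not from any API above), unnest() expands a list-column so that each element gets its own row:

library(dplyr)
library(tidyr)
library(tibble)

# Toy data frame with a list-column, similar in shape to what fromJSON can return
df <- tibble(
  id = c(1, 2),
  scores = list(c(10, 20), c(30, 40, 50))
)

df %>% unnest(scores)
# 5 rows: each score on its own row, with id repeated accordingly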