Reading an xlsb workbook

read_xlsb attempts to import a region from a binary format Excel workbook (xlsb)

sheet

Either a name, or the index of the sheet to read. Index of the first sheet is 1. If the sheet name is embedded in the range argument, or implied if range is a named range, then this argument is ignored

range

range can be specified as

A named range. Named ranges are not case sensitive
In Sheet!A1 notation
In Sheet!R1C1 notation
As a cellranger::cell_limits object

col_names

TRUE: The first row is used for column names. Empty cells result in a column name of the form ‘column.i’
FALSE: Column names will be ‘column.i’
Character vector: vector containing column names.

col_types

Can be implied from the spreadsheet or specified in advanced. When specifying types, options are

“logical” (or “boolean”), “numeric” (or “double”), “integer”, “date” and “string” (or “character”)
Use “skip” (or “ignore”) to skip a column

When implying types from the underlying spreadsheet data, the resultant type is the regarded as the ‘least fragile’.

Effectively the order is logical – datetime – integer – double – string

If 99 rows are of type ‘integer’ and 1 row is of type ‘double’, then all cells are regarded as ‘double’ in that column.
If 99 rows are of type ‘date’ and 1 row is of type ‘string’, then all cells are promoted to ‘string’

Currently ‘date’ is implied from cell formatting. It is either one of the built-in Excel datetime formats or a custom format where the format string contains only the characters Y, M, D, H, S, y, m, d, h, s and - (dash), : (colon), (space), . (dot). That should be good enough to identify any dates.

If ‘date’ is specified as the column type, then any strings are converted to dates. The format, I’m afraid, isn’t flexible at this stage - it’s assumed to be of the type “%Y-%m-%dT%H:%M:%S”.

I think there’s a 1900 leap year bug in Excel. I haven’t corrected for that.

If any of the cells in a ‘date’ column contain time, then a POSIXct object is returned, otherwise a Date object is returned. The timezone for POSIXct is set as UTC. Seems like the sensible thing to do.

na

A character string that is interpret as NA. This does not effect the implied data type for a column.

trim_ws

Should leading and trailing whitespaces be trimmed from character strings?

skip

The number of rows to skip before reading data.

…

Additional options. At present just debug = TRUE is supported. This returns a list with fields ‘result’ set to the resulting data.frame and ‘env’ set to an internal environment that may be useful for debugging.

res = read_xlsb(path = system.file("extdata", "TestBook.xlsb", package = "readxlsb"), range = "PORTFOLIO", debug = TRUE)

ls(res$env)
#> [1] "content"      "named_ranges" "sheets"       "stream"

res$env$named_ranges
#>             name                     range sheet_idx first_column first_row
#> 1   INFO_RELEASE          FirstSheet!$A$11         0            1        11
#> 2        OUTLOOK 'My SecondTab'!$A$1:$C$13         1            1         1
#> 3      PORTFOLIO      FirstSheet!$A$3:$C$9         0            1         3
#> 4 SAVED_DATETIME          FirstSheet!$C$14         0            3        14
#> 5          TITLE           FirstSheet!$A$1         0            1         1
#> 6    UNICODE_STR          FirstSheet!$A$12         0            1        12
#>   last_column last_row
#> 1           1       11
#> 2           3       13
#> 3           3        9
#> 4           3       14
#> 5           1        1
#> 6           1       12