The file_schema_dsv() function returns a data frame with the schema
of a DSV file reading only the first max_lines of a delimiter
separated values (DSV) text file to infer column names and data types
(it does not read the full dataset into memory). Then it converts them
to the candidate data frame columns' names and data types.
Usage
file_schema_dsv(
input_file,
header = TRUE,
sep = ",",
dec = ".",
grp = "",
id_quote_method = "DB_NAMES",
max_lines = 2000,
null_columns = FALSE,
force_num_cols = TRUE,
...
)Arguments
- input_file
character, file name (including path) to be read.
- header
logical, if
TRUEthe first line contains the fields' names. IfFALSE, the column names will be formed sing a "V" followed by the column number (as specified inutils::read.table()).- sep
character, field delimiter (e.g., "," for CSV, "\t" for TSV) in the input file. Defaults to ",".
- dec
character, decimal separator (e.g., "." or "," depending on locale) in the input file. Defaults to ".".
- grp
character, character used for digit grouping. It defaults to
""(i.e. no grouping).- id_quote_method
character, used to specify how to build the SQLite columns' names using the fields' identifiers read from the input file. For details see the description of the
quote_methodparameter of theformat_column_names()function. Defautls toDB_NAMES.- max_lines
integer, number of lines (excluding the header) to be read to infer columns' data types. Defaults to 2000.
- null_columns
logical, if
TRUEthe col_type of columuns consisting only of NAs or zero-length strings will be marked as"NULL", otherwise they will be marked ascharacter. Defaults toFALSE- force_num_cols
logical, if
TRUEthe returned schema will have all rows withn_colscolumns (i.e. the guessed number of columns determined inspecting the firstmax_lineslines of the input file), even if there are rows in the input file with fewer or greater columns thann_cols. IfFALSEand any of the tested lines has a number of columns not equal ton_cols, the function will return a list without theschemaelement. It defaults toTRUE.- ...
Additional arguments for quoting and data interpretation as described in the
base::scan()function. The parameters used byfile_schema_dsvare:quote, character, the set of quoting characters. Defaults to""(i.e., no quoting).comment.char, character, the comment character. Defaults to""(i.e., no comments).skip, integer, the number of lines to skip before reading data. Defaults to0.fileEncoding, character, the name of the encoding of the input file. Defaults to"".na.strings, character vector, the strings to be interpreted as NAs. Defaults toc("NA").
Value
a list with the following named elements:
schema, a data frame with these columns:col_names: columns' names, after applying the selected quote method;col_names_unquoted: columns' names, unquoted; ifid_quote_methodis set toDB_NAMESthey will be the same ascol_names; for other quote methods they will be the unquoted versions ofcol_names,that is generally the same assrc_namesunlesssrc_namescontain the quoting characters;col_types: columns' R data types;sql_types: columns' SQLite data types;src_names: columns' names as they appear in the input file.src_types: defaults totextfor all columns.src_is_quoted: logical vector indicating if each column has at least one value enclosed in quotes.
col_counts, a data frame with these columns:num_col: number of columns,Freq: number of rows (withinmax_lines) that have the number of colums shown innum_col.
n_cols, integer, the number of columns selected for the file.num_col, a vector of integers of lengthmax_lineswith the number of detected columns in each row tested.col_fill, logical, it is set toTRUEif there are lines with less columns thann_cols.col_flush, logical, it is set toTRUEif there are lines with more columns thann_cols.
Examples
# Inspect CSV file schema without loading full dataset
data_path <- system.file("extdata", package = "RSQLite.toolkit")
# Get schema information for abalone CSV
schema_info <- file_schema_dsv(
input_file = file.path(data_path, "abalone.csv"),
header = TRUE,
sep = ",",
dec = ".",
max_lines = 50
)
# Display schema information
print(schema_info$schema[, c("col_names", "col_types", "sql_types")])
#> col_names col_types sql_types
#> 1 Sex character TEXT
#> 2 Length numeric REAL
#> 3 Diam numeric REAL
#> 4 Height numeric REAL
#> 5 Whole numeric REAL
#> 6 Shucked numeric REAL
#> 7 Viscera numeric REAL
#> 8 Shell numeric REAL
#> 9 Rings integer INTEGER
# Check column consistency
print(schema_info$col_counts)
#> num_col Freq
#> 1 9 50
print(paste("Guessed columns:", schema_info$n_cols))
#> [1] "Guessed columns: 9"
# Example with different parameters
schema_custom <- file_schema_dsv(
input_file = file.path(data_path, "abalone.csv"),
header = TRUE,
sep = ",",
dec = ".",
max_lines = 50,
id_quote_method = "SQL_SERVER"
)
print(schema_custom$schema[, c("col_names", "col_types", "src_names")])
#> col_names col_types src_names
#> 1 [Sex] character Sex
#> 2 [Length] numeric Length
#> 3 [Diam] numeric Diam
#> 4 [Height] numeric Height
#> 5 [Whole] numeric Whole
#> 6 [Shucked] numeric Shucked
#> 7 [Viscera] numeric Viscera
#> 8 [Shell] numeric Shell
#> 9 [Rings] integer Rings
