annoying "feature" (or bugs?) for RODBC
Asked Answered
I

1

5

RODBC is the main library in R to import data from a database into R. RODBC seems to have the ability of "guess" the datatype of the column which I find it particularly annoying.

I have uploaded a file test.xls here, or you may create a xls file yourself:

  1. create 2 columns, first column named col_a and the second column named col_b.
  2. type whatever you like in col_a, I typed letters on this column for 92 rows
  3. at the 92th row of col_b, type a number there, I typed "1923" without changing the data type (i.e. not using ')
  4. try to import the xls file into R using the following script:

library(RODBC)

setwd("C:/Users/hke775/Documents/Enoch/MISC/R_problems/RODBC")
channel <- odbcConnectExcel("test.xls",readOnly=TRUE)
dummy.df <- sqlFetch(channel,"Sheet1")
odbcClose(channel)

You will see that in dummy.df, col_b is all NA, the 1923 in this column is gone.

If you want to see the 1923 again, you can change the 1st row of col_b to a number, and it is back again.

This is very annoying as I don't prefer modifying data manually. I need to use other package to do the xls importing, but I can't find other packages do as smooth as RODBC (I tried gdata and xlsReadWrite).

Did I missing anything in the sqlFetch command, and cause the trouble? Thanks.

Immune answered 30/1, 2012 at 10:56 Comment(5)
Are you sure you tried gdata? When I tried your test file with the gdata package using DF <- read.xls("test.xls") it worked. The 1923 was present.Crumple
Gdata needs to have perl, which I can't install it in my officeImmune
If the powers that be can install R then they should also install the Rtools distribution too since that is required for R development and that has perl in it: cran.r-project.org/bin/windows/RtoolsCrumple
so my issue is considered to be a bug for RODBC?Immune
It might be a problem with the ODBC driver rather than RODBC itself.Crumple
D
8

Please don't blame R or RODBC for Microsoft's bugs... ;)

However, due to a bug in the ODBC driver, specifying the Rows to Scan (MaxScanRows) setting currently has no effect. In other words, the Excel ODBC driver (MDAC 2.1 and later) always scans the first 8 rows in the specified data source in order to determine each column's datatype.

For additional information about the Rows to Scan bug, including a simple workaround, click the article number below to view the article in the Microsoft Knowledge Base:

189897 XL97: Data Truncated to 255 Characters with Excel ODBC Driver

I tried the fix in KB189897 by setting the TypeGuessRows value to 0 and look what happens!

> library(RODBC)
> channel <- odbcConnectExcel("test.xls",readOnly=TRUE)
> tail(dummy.df <- sqlFetch(channel,"Sheet1"))
   col_a col_b
87     c    NA
88     d    NA
89     e    NA
90     f    NA
91     g    NA
92     h  1923
> odbcClose(channel)

Please, no up-votes or check marks... just send cash. :)

Diagonal answered 30/1, 2012 at 23:41 Comment(3)
Great. But then should we conclude that a user without admin privileges has no solution ?Droppings
better set TypeGuessRows to 24 or any deep enough setting. But 0 will do an unlimited scan and will will affect peformance. (I'm reading thousands of non structured .xls/.xlsx files)Hairspring
@Langeleppel: Yes, setting to 0 will affect performance, but it will guarantee correctness. Setting to 24 (or any other non-zero value) will only be correct if it is "deep enough" for all future xls/xlsx files read on the affected system.Diagonal

© 2022 - 2024 — McMap. All rights reserved.