A billion rows is pushing the limits for R on a personal computer, in my experience; if you're open to alternatives, you could process the data out of memory using AWK:
cat full.csv
"id","item","itemformat"
"1","a1","AA"
"1","a2","AB"
"2","a1","BB"
"2","a2","AB"
awk 'BEGIN{ FS=OFS="," }
NR==FNR {
    # first pass: collect the set of item names
    if ( NR > 1 ) {
        items[$2]
    }
    next
}
FNR==1 {
    # second pass, header line: print the id column plus one column per item
    printf "%s", $1
    for (item in items) {
        printf "%s%s", OFS, item
    }
    print ""
    next
}
$1 != prev[1] {
    # the id changed: flush the previous group before starting a new one
    if ( FNR > 2 ) {
        prt()
    }
    split($0,prev)
}
{
    # remember this item/value pair for the current id
    values[$2] = $3
}
END { prt() }
function prt(item, value) {
    # print one wide row: the id, then the stored value for each item
    printf "%s", prev[1]
    for (item in items) {
        value = values[item]
        printf "%s%s", OFS, value
    }
    print ""
    delete values
}' full.csv full.csv > full2.csv
cat full2.csv
"id","a1","a2"
"1","AA","AB"
"2","BB","AB"
Edit
I think @zephryl's answer is the best solution to your problem (+1), but it may be worth benchmarking it to make sure it doesn't require more resources than you have available, or an unreasonable amount of time. For my AWK answer, with an example dataset of a billion rows (~14 GB), this method used at most ~500 MB of RAM and took ~45 minutes to do the pivot on my laptop (MacBook Pro 2017, 2.3 GHz dual-core i5, 8 GB RAM):
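If you want to reproduce those measurements on your own machine, a sketch using /usr/bin/time works (assuming the AWK program above has been saved to a file, here called pivot.awk; -l is the macOS/BSD flag that reports maximum resident set size, GNU time uses -v instead):

/usr/bin/time -l awk -f pivot.awk full.csv full.csv > full2.csv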
# Create a large example dataset (~1 billion rows)
awk 'BEGIN{print "id,item,itemformat"; for (i=800000; i<1000000000; i++) {printf "%d,%s%d,%s\n", i/800000, "a", i%800000+1, (rand()>0.5?"AA":"AB")}}' > full.csv
head full.csv
id,item,itemformat
1,a1,AA
1,a2,AA
1,a3,AB
1,a4,AA
1,a5,AA
1,a6,AA
1,a7,AB
1,a8,AB
1,a9,AA
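The generator above produces ids 1 through 1249 with 800,000 items each, so a quick row-count check (just for illustration) should show ~999.2 million data rows plus the header:

wc -l full.csv    # expect 999200001 lines (999,200,000 data rows + 1 header)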
## "Items" printed in any order
awk 'BEGIN{ FS=OFS="," }
NR==FNR {
    if ( NR > 1 ) {
        items[$2]
    }
    next
}
FNR==1 {
    printf "%s", $1
    for (item in items) {
        printf "%s%s", OFS, item
    }
    print ""
    next
}
$1 != prev[1] {
    if ( FNR > 2 ) {
        prt()
    }
    split($0,prev)
}
{
    values[$2] = $3
}
END { prt() }
function prt(item, value) {
    printf "%s", prev[1]
    for (item in items) {
        value = values[item]
        printf "%s%s", OFS, value
    }
    print ""
    delete values
}' full.csv full.csv > full2.csv
###############################
# "Items" printed in the 'original' order
awk '
BEGIN{
    FS=OFS=","
    # requires GNU awk: iterate arrays in ascending numeric order of their values
    PROCINFO["sorted_in"] = "@val_num_asc"
}
NR==FNR {
    if (NR > 1 && !($2 in items)) {
        # remember the position at which each item was first seen
        items[$2] = NR
    }
    next
}
FNR==1 {
    printf "%s", $1
    for (item in items) {
        printf "%s%s", OFS, item
    }
    print ""
    next
}
$1 != prev[1] {
    if ( FNR > 2 ) {
        prt()
    }
    split($0,prev)
}
{
    values[$2] = $3
}
END { prt() }
function prt(item, value) {
    printf "%s", prev[1]
    for (item in items) {
        value = values[item]
        printf "%s%s", OFS, value
    }
    print ""
    delete values
}' full.csv full.csv > full2.csv
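PROCINFO["sorted_in"] is specific to GNU awk. If you need the ordered output with a POSIX-ish awk, a sketch of the same idea records the first-seen order in a numbered array instead:

awk 'BEGIN{ FS=OFS="," }
NR==FNR {
    # first pass: record each item and the order it was first seen
    if (NR > 1 && !($2 in items)) {
        items[$2]
        order[++nitems] = $2
    }
    next
}
FNR==1 {
    printf "%s", $1
    for (i = 1; i <= nitems; i++) {
        printf "%s%s", OFS, order[i]
    }
    print ""
    next
}
$1 != prev[1] {
    if ( FNR > 2 ) {
        prt()
    }
    split($0,prev)
}
{
    values[$2] = $3
}
END { prt() }
function prt(i, value) {
    printf "%s", prev[1]
    for (i = 1; i <= nitems; i++) {
        value = values[order[i]]
        printf "%s%s", OFS, value
    }
    print ""
    split("", values)   # portable way to empty the array
}' full.csv full.csv > full2.csv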
dcast is the data.table equivalent, as shown at one of the canonical questions for reshaping: https://mcmap.net/q/86974/-how-to-reshape-data-from-long-to-wide-format , with the full documentation here: cran.r-project.org/web/packages/data.table/vignettes/… – Bisutun
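For completeness, the data.table approach that comment refers to looks roughly like this (a sketch, not a drop-in solution; unlike the AWK pipeline it loads the whole long table into memory):

library(data.table)
dt <- fread("full.csv")                                 # read the long table
wide <- dcast(dt, id ~ item, value.var = "itemformat")  # one column per item
fwrite(wide, "full2.csv")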