I have a large table of contract information (several hundred contracts) all collected in a single column of a table. Each contract occupies 6 sequential rows. I have been able to add another column (CAT) indicating the content of each row in the column: Company, Address, CitySTZip, Contact, Contract, Title. [Using R]
A reproducible version of the data after adding the second column to represent column names, the data looks like this:
textFile <- "col1|col2
XYZCo|Company
123 Main Street|Address
Yourtown, MA 12345|CityStZip
Joe Smith|Contact
20-234-56/3|Contract
Process for Work|Title
ZZTop Co|Company
123 Jefferson Street|Address
Chicago, IL 60636|CityStZip
Jane Doe|Contact
23-274-11/3|Contract
Yet Another One|Title"
data <- read.csv(text=textFile,header = TRUE,sep="|")
data
col1 col2
1 XYZCo Company
2 123 Main Street Address
3 Yourtown, MA 12345 CityStZip
4 Joe Smith Contact
5 20-234-56/3 Contract
6 Process for Work Title
7 ZZTop Co Company
8 123 Jefferson Street Address
9 Chicago, IL 60636 CityStZip
10 Jane Doe Contact
11 23-274-11/3 Contract
12 Yet Another One Title
I would like to reorder all the data so that each contract occupies a single row in the data table with the CAT values as the column headers.
Per comments on the answer posted below, an attempt at reformatting the file using for()
loops was unsuccessful.
for(i in 1:nrow(data)){
for(j in 1:6){
# got stuck here...
}
}
The desired output would look like this:
Company Address CitySTZip Contact Contract Title
XYZCo 123 Main Street Yourtown, MA 12345 Joe Smith 20-234-56/3 Process for Work
After posting my original answer I realized that the data may be different than my assumptions due to the fact that the content in the original post referenced a column 1 and a column 2 in the raw data. If the data looks like the following, there is a relatively straightforward answer that combines dplyr
with tidyr::pivot_wider()
.
First, we'll read the data and print the resulting data frame, 2 columns including data values and column names.
textFile <- "col1|col2
XYZCo|Company
123 Main Street|Address
Yourtown, MA 12345|CityStZip
Joe Smith|Contact
20-234-56/3|Contract
Process for Work|Title
ZZTop Co|Company
123 Jefferson Street|Address
Chicago, IL 60636|CityStZip
Jane Doe|Contact
23-274-11/3|Contract
Yet Another One|Title"
data <- read.csv(text = textFile,header = TRUE, sep="|")
The data frame looks like this:
> data
col1 col2
1 XYZCo Company
2 123 Main Street Address
3 Yourtown, MA 12345 CityStZip
4 Joe Smith Contact
5 20-234-56/3 Contract
6 Process for Work Title
7 ZZTop Co Company
8 123 Jefferson Street Address
9 Chicago, IL 60636 CityStZip
10 Jane Doe Contact
11 23-274-11/3 Contract
12 Yet Another One Title
In order to pivot the data frame to wide format tidy data, we'll need to add an ID column to distinguish one observation from other observations. We can use dplyr::mutate()
for this, along with the ceiling()
function. The ceiling()
function is needed because we want the ID value to be constant for each 6 rows of input data. When we divide the result of seq_along()
by 6, it generates the desired vector.
Once we've added the ID column, pivoting to wide format is relatively straightforward.
library(dplyr)
library(tidyr)
data %>% mutate(id = ceiling(seq_along(col1)/6)) %>%
pivot_wider(.,id,names_from=col2,values_from=col1)
...and the output:
# A tibble: 2 x 7
id Company Address CityStZip Contact Contract Title
<dbl> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 XYZCo 123 Main Street Yourtown, MA 1… Joe Smi… 20-234-56… Process for…
2 2 ZZTop Co 123 Jefferson S… Chicago, IL 60… Jane Doe 23-274-11… Yet Another…
The interesting challenge with this question is that an observation spans 6 rows of data, yet is not in a fixed record layout so we can't use read.fwf()
or read.fortran()
to read the file.
Instead, we'll use readLines()
to read the data into a vector, then we'll write it to a temporary file, combining every 6 rows into a single output record. Finally, we'll read the reshaped data with read.csv()
.
The original post wasn't clear whether the column names were distinguishable from the rest of the data in the raw data file, so this solution assumes that we need to parse them out of the desired result data frame.
textFile <- "XYZCo Company
123 Main Street Address
Yourtown, MA 12345 CityStZip
Joe Smith Contact
20-234-56/3 Contract
Process for Work Title
ZZTop Co Company
123 Jefferson Street Address
Chicago, IL 60636 CityStZip
Jane Doe Contact
23-274-11/3 Contract
Yet Another One Title"
First we read the data into a character vector with readLines()
.
dataVector <- readLines(textConnection(textFile))
Next, we strip out the column name data from the vector. Since there are only 6 column names, I was lazy and just used sub()
repeatedly. The original question indicates that the column names were added after the data was loaded into R, so this code may be unnecessary.
# clean column names from raw data
dataVector <- sub("Company","",dataVector)
dataVector <- sub("Address","",dataVector)
dataVector <- sub("CityStZip","",dataVector)
dataVector <- sub("Contact","",dataVector)
dataVector <- sub("Contract","",dataVector)
dataVector <- sub("Title","",dataVector)
Next, we loop through the vector, and combine every 6 rows into a single output record, using the pipe |
as a separator because the data includes commas in the CityStZip
field.
# write to tempfile as pipe separated values
tmpFile <- "./data/tmpfile.csv"
counter <- 0
outLine <- NULL
for(i in 1:length(dataVector)){
counter <- counter + 1
if(counter == 1 ) outLine <- dataVector[i]
else outLine <- paste(outLine,dataVector[i],sep="|")
if(counter == 6) {
cat(outLine,file = "./data/tmpfile.csv",sep="\n",append=TRUE)
counter <- 0
outLine <- NULL
}
}
Finally, we read the file we just created, and specify sep = '|'
as the separator between columns. We also use the col.names
argument to set the column names.
colNames <- c("Company","Address","CityStZip","Contact","Contract","Title")
data <- read.csv("./data/tmpfile.csv",header = FALSE,sep = "|",
col.names = colNames)
...and the output:
> data
Company Address CityStZip Contact Contract
1 XYZCo 123 Main Street Yourtown, MA 12345 Joe Smith 20-234-56/3
2 ZZTop Co 123 Jefferson Street Chicago, IL 60636 Jane Doe 23-274-11/3
Title
1 Process for Work
2 Yet Another One