rreformat

convert a columnar table of contract info into rows


I have a large table of contract information (several hundred contracts) all collected in a single column of a table. Each contract occupies 6 sequential rows. I have been able to add another column (CAT) indicating the content of each row in the column: Company, Address, CitySTZip, Contact, Contract, Title. [Using R]

A reproducible version of the data after adding the second column to represent column names, the data looks like this:

textFile <- "col1|col2
XYZCo|Company
123 Main Street|Address
Yourtown, MA 12345|CityStZip
Joe Smith|Contact
20-234-56/3|Contract
Process for Work|Title
ZZTop Co|Company
123 Jefferson Street|Address
Chicago, IL 60636|CityStZip
Jane Doe|Contact
23-274-11/3|Contract
Yet Another One|Title"

data <- read.csv(text=textFile,header = TRUE,sep="|")
data


                   col1      col2
1                 XYZCo   Company
2       123 Main Street   Address
3    Yourtown, MA 12345 CityStZip
4             Joe Smith   Contact
5           20-234-56/3  Contract
6      Process for Work     Title
7              ZZTop Co   Company
8  123 Jefferson Street   Address
9     Chicago, IL 60636 CityStZip
10             Jane Doe   Contact
11          23-274-11/3  Contract
12      Yet Another One     Title

I would like to reorder all the data so that each contract occupies a single row in the data table with the CAT values as the column headers.

Per comments on the answer posted below, an attempt at reformatting the file using for() loops was unsuccessful.

for(i in 1:nrow(data)){
   for(j in 1:6){
      # got stuck here... 
   }
} 

The desired output would look like this:

Company Address           CitySTZip            Contact     Contract      Title

XYZCo 123 Main Street   Yourtown, MA 12345   Joe Smith   20-234-56/3   Process for Work

Solution

  • After posting my original answer I realized that the data may be different than my assumptions due to the fact that the content in the original post referenced a column 1 and a column 2 in the raw data. If the data looks like the following, there is a relatively straightforward answer that combines dplyr with tidyr::pivot_wider().

    First, we'll read the data and print the resulting data frame, 2 columns including data values and column names.

    textFile <- "col1|col2
    XYZCo|Company
    123 Main Street|Address
    Yourtown, MA 12345|CityStZip
    Joe Smith|Contact
    20-234-56/3|Contract
    Process for Work|Title
    ZZTop Co|Company
    123 Jefferson Street|Address
    Chicago, IL 60636|CityStZip
    Jane Doe|Contact
    23-274-11/3|Contract
    Yet Another One|Title"
    data <- read.csv(text = textFile,header = TRUE, sep="|")
    

    The data frame looks like this:

    > data
                       col1      col2
    1                 XYZCo   Company
    2       123 Main Street   Address
    3    Yourtown, MA 12345 CityStZip
    4             Joe Smith   Contact
    5           20-234-56/3  Contract
    6      Process for Work     Title
    7              ZZTop Co   Company
    8  123 Jefferson Street   Address
    9     Chicago, IL 60636 CityStZip
    10             Jane Doe   Contact
    11          23-274-11/3  Contract
    12      Yet Another One     Title
    

    In order to pivot the data frame to wide format tidy data, we'll need to add an ID column to distinguish one observation from other observations. We can use dplyr::mutate() for this, along with the ceiling() function. The ceiling() function is needed because we want the ID value to be constant for each 6 rows of input data. When we divide the result of seq_along() by 6, it generates the desired vector.

    Once we've added the ID column, pivoting to wide format is relatively straightforward.

    library(dplyr)
    library(tidyr)
    data %>% mutate(id = ceiling(seq_along(col1)/6)) %>%
        pivot_wider(.,id,names_from=col2,values_from=col1)
    

    ...and the output:

    # A tibble: 2 x 7
         id Company  Address          CityStZip       Contact  Contract   Title       
      <dbl> <chr>    <chr>            <chr>           <chr>    <chr>      <chr>       
    1     1 XYZCo    123 Main Street  Yourtown, MA 1… Joe Smi… 20-234-56… Process for…
    2     2 ZZTop Co 123 Jefferson S… Chicago, IL 60… Jane Doe 23-274-11… Yet Another…
    

    Original Answer

    The interesting challenge with this question is that an observation spans 6 rows of data, yet is not in a fixed record layout so we can't use read.fwf() or read.fortran() to read the file.

    Instead, we'll use readLines() to read the data into a vector, then we'll write it to a temporary file, combining every 6 rows into a single output record. Finally, we'll read the reshaped data with read.csv().

    The original post wasn't clear whether the column names were distinguishable from the rest of the data in the raw data file, so this solution assumes that we need to parse them out of the desired result data frame.

    textFile <- "XYZCo Company
    123 Main Street Address
    Yourtown, MA 12345 CityStZip
    Joe Smith Contact
    20-234-56/3 Contract
    Process for Work Title
    ZZTop Co Company
    123 Jefferson Street Address
    Chicago, IL 60636 CityStZip
    Jane Doe Contact
    23-274-11/3 Contract
    Yet Another One Title"
        
    

    First we read the data into a character vector with readLines().

    dataVector <- readLines(textConnection(textFile))
    

    Next, we strip out the column name data from the vector. Since there are only 6 column names, I was lazy and just used sub() repeatedly. The original question indicates that the column names were added after the data was loaded into R, so this code may be unnecessary.

    # clean column names from raw data 
    dataVector <- sub("Company","",dataVector)
    dataVector <- sub("Address","",dataVector)
    dataVector <- sub("CityStZip","",dataVector)
    dataVector <- sub("Contact","",dataVector)
    dataVector <- sub("Contract","",dataVector)
    dataVector <- sub("Title","",dataVector)
    

    Next, we loop through the vector, and combine every 6 rows into a single output record, using the pipe | as a separator because the data includes commas in the CityStZip field.

    # write to tempfile as pipe separated values
    tmpFile <- "./data/tmpfile.csv"
    counter <- 0
    outLine <- NULL
    for(i in 1:length(dataVector)){
        counter <- counter + 1
        if(counter == 1 ) outLine <- dataVector[i]
        else outLine <- paste(outLine,dataVector[i],sep="|")
        if(counter == 6) {
             cat(outLine,file = "./data/tmpfile.csv",sep="\n",append=TRUE)
             counter <- 0
             outLine <- NULL
        }
    }
    

    Finally, we read the file we just created, and specify sep = '|' as the separator between columns. We also use the col.names argument to set the column names.

    colNames <- c("Company","Address","CityStZip","Contact","Contract","Title")
    data <- read.csv("./data/tmpfile.csv",header = FALSE,sep = "|",
                     col.names = colNames)
    

    ...and the output:

    > data
        Company               Address           CityStZip    Contact     Contract
    1    XYZCo       123 Main Street  Yourtown, MA 12345  Joe Smith  20-234-56/3 
    2 ZZTop Co  123 Jefferson Street   Chicago, IL 60636   Jane Doe  23-274-11/3 
                  Title
    1 Process for Work 
    2  Yet Another One