batch-file text-manipulation

How can I remove a list of substrings, then find and delete the first TAB and after in a CSV with multiple lines?


Example of my dataset, "doc name with spaces.csv" (with anonymized data). The file has multiple lines, and its length varies from day to day because it is produced by an export.

Patient Full Name   Order Date Of Service   Order Accession Number  Day of Patient Birth Date   Procedure Description   Facility Name   
AAAAA, Ms Joan  10/11/2022  xx.1111111  1 November 2000 Ultrasound Obstetric 22+ Weeks  Facility 1  
BBBBB, Mr John  10/11/2022  xx.2222222  2 July 2000 Ultrasound Left Calf    Facility 2  
CCCCC, Mrs Anne 10/11/2022  xx.3333333  3 July 2000 X-ray Chest Facility 3  
DDDDD, Master Jack  10/11/2022  xx.4444444  4 July 2000 Ultrasound Left Ankle   Facility 4
....

I am trying to create a batch script to:

  1. Read each line of "doc name with spaces.csv"
  2. Delete all occurrences of strings matching lines found in "titles.txt" (located in the same directory)
  3. Delete the first TAB (\t) found on each line, and everything after it on the same line
  4. Copy the results to the Windows clipboard

Example:

AAAAA, Ms Joan  10/11/2022  xx.1111111  1 November 2000 Ultrasound Obstetric 22+ Weeks  Facility 1
BBBBB, Mr John  10/11/2022  xx.2222222  2 July 2000 Ultrasound Left Calf    Facility 2  

to

AAAAA, Joan
BBBBB, John

NB: The title is always followed by a white space, so there is no risk of removing Dr or Mr etc. from within a name, provided the white space is accounted for in the find/delete. Content of "titles.txt" below:

Mrs 
Mr 
Miss 
Ms 
Dr 
Prof 
A/Prof 

I have taken a look at other scripts online, but none quite match what I'm doing. They are also a bit advanced for where I am currently at, but the need for this has arisen regardless.


Solution

    @ECHO OFF
    SETLOCAL
    rem The following settings for the directories and filename are names
    rem that I use for testing and deliberately include names which include spaces to make sure
    rem that the process works using such names. These will need to be changed to suit your situation.
    
    SET "sourcedir=u:\your files"
    SET "filename1=%sourcedir%\q74397743.txt"
    SET "filename2=%sourcedir%\q74397743_2.txt"
    SET "destdir=u:\your results"
    SET "outfile=%destdir%\outfile.txt"
    
    (
    FOR /f "usebackqskip=1delims=" %%e IN ("%filename1%") DO @CALL :process %%e
    )>"%outfile%"
    TYPE "%outfile%"|clip
    
    GOTO :EOF
    
    :process
    :: first parameter = patient_id
    SET "patient_id=%1"
    SET "patient_name="
    SHIFT
    :: Second parameter = Title
    :: skip if on titles list
    FINDSTR /i /x "%1" "%filename2%">NUL
    IF NOT ERRORLEVEL 1 SHIFT
    :: Build name until %1 begins with a numeric
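    :nameloop
    :: Stop if the parameters run out before a numeric field is found (avoids an endless loop)
    IF "%~1"=="" ECHO %patient_id%,%patient_name%&GOTO :eof
    SET "nextpart=%1"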
    SET "firstchar=%nextpart:~0,1%"
    FOR /L %%z IN (0,1,9) DO IF "%firstchar%"=="%%z" ECHO %patient_id%,%patient_name%&GOTO :eof
    SET "patient_name=%patient_name% %nextpart%"
    SHIFT
    GOTO nameloop
    GOTO :eof
    

    Always verify against a test directory before applying to real data.

    Note that if the filename does not contain separators like spaces, then both usebackq and the quotes around %filename1% can be omitted.
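
    To illustrate the note on usebackq, here is a minimal sketch of the two forms side by side. The paths are hypothetical and only there for illustration; each loop simply echoes the data lines after the header.

    ```batch
    @ECHO OFF
    SETLOCAL
    rem Hypothetical file names, for illustration only.
    SET "plain=u:\data\export.csv"
    SET "spaced=u:\your files\doc name with spaces.csv"

    rem No separators in the path: neither quotes nor usebackq are needed.
    FOR /f "skip=1 delims=" %%e IN (%plain%) DO ECHO(%%e

    rem Path contains spaces: quote it, and add usebackq so that the quoted
    rem text is read as a file name rather than as a literal string.
    FOR /f "usebackq skip=1 delims=" %%e IN ("%spaced%") DO ECHO(%%e
    ```

    Without usebackq, the second loop would treat the quoted path as a string to parse, not as a file to read.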

    You don't indicate where the tabs are. "Master" is missing from the titles file. I removed the trailing spaces from each line of my copy of the titles file, because FINDSTR /x only matches whole lines.

    I assumed that, since the date follows the name, a field that starts with a digit is sufficient to mark the end of the name.

    Surnames missing despite column name "full name"

    The approach: read each line and extract the first token, skip the second if it is on the titles list, then join the following tokens until one starting with a numeric character is found. CALL splits the line into subroutine parameters using commas, spaces and tabs as separators.
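
    As an alternative, here is a rough sketch that follows the question's steps more literally: cut each line at the first TAB, then strip the titles by string substitution. It is not the method used above. The paths and the variable names (csvfile, titles, outfile, TAB, name) are hypothetical. It assumes titles.txt holds one title per line with no trailing spaces, that the fields really are TAB-separated, and that the data contains no exclamation marks or double quotes (delayed expansion and the quoted FOR string would mangle them).

    ```batch
    @ECHO OFF
    SETLOCAL ENABLEDELAYEDEXPANSION
    rem Hypothetical paths: adjust to your own files.
    SET "csvfile=u:\your files\doc name with spaces.csv"
    SET "titles=u:\your files\titles.txt"
    SET "outfile=%TEMP%\names.txt"

    rem Capture a literal TAB in %TAB% (FORFILES expands 0x09 to a TAB character).
    FOR /f "delims=" %%T IN ('forfiles /p "%~dp0." /m "%~nx0" /c "cmd /c echo(0x09"') DO SET "TAB=%%T"

    (
    FOR /f "usebackq skip=1 delims=" %%L IN ("%csvfile%") DO (
     rem Keep only the text before the first TAB.
     FOR /f "delims=%TAB%" %%N IN ("%%L") DO SET "name=%%N"
     rem Remove each title where it appears as a whole word between spaces.
     FOR /f "usebackq delims=" %%t IN ("%titles%") DO SET "name=!name: %%t = !"
     ECHO(!name!
    )
    )>"%outfile%"
    rem Send the result to the Windows clipboard.
    CLIP <"%outfile%"
    ```

    Matching each title with a space on both sides means "Mr" cannot match inside "Mrs". The substitution is case-insensitive, so a name part that is itself identical to a title would also be removed.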