CSV, RegExp, VBScript

Removing commas between double quotes in a CSV file

CSV (Comma Separated Values) data files, as the term implies, contains plain text records with column (field) values separated or delimited by commas.  One example of using CSV files includes data exchange between applications and/or computing platforms.  Writing a script to extract CSV data appears to be a straight forward process – simply loading an array with data from each comma delimited column.

However, this process may not be adequate when a comma exists inside string values (values enclosed by double quotes).

For example, consider the following CSV file with three records:

03/03/2010,John Doe,"Thanks for the message.","Signed: Harvey",10,14,-30,"123.45"
03/04/2010,Sally John,"I'm sorry, but I need to leave.","Sally",23,21,5,"100.33"
03/10/2010,Johnie Lang,"Thank You!  Juan, I will call you later today.","Johnie",0,9,2,"-10,324.56"

Some records are processed incorrectly due to improper detection of the comma in a string value.

Record 1: Correct Record 2: Incorrect Record 3: Incorrect
Extracted 8 values correctly Extracted values 3 and 4 as two separate values due to embedded commas Extracted value 3, 4, 9 and 10 as two separate values due to embedded commas
Value 1: 03/03/2010 Value 1: 03/04/2010 Value 1: 03/10/2010
Value 2: John Doe Value 2: Sally John Value 2: Johnie Lang
Value 3: “Thanks for the message.” Value 3: “I’m sorry Value 3: “Thank You! Juan
Value 4: “Signed: Harvey” Value 4: but I need to leave.” Value 4: I will call you later today.”
Value 5: 10 Value 5: Sally Value 5: “Johnie”
Value 6: 14 Value 6: 23 Value 6: 0
Value 7: -30 Value 7: 21 Value 7: 9
Value 8: “123.45” Value 8: 5 Value 8: 2
  Value 9 : “100.33” Value 9: “-10
    Value 10: 324.56″

As shown above, additional inspection (parsing) of record data needs to be incorporated to yield correct CSV extracting. In this case, resulting in 8 values per record.

A Proposed Solution…

To maintain value extraction complete, a regular expression can be used to decode content enclosed between two double quotes and change any commas to a special character(s) before CSV processing is performed on the data.

Today, most object-oriented languages such as (but not limited to) Java, Javascript, VBScript, C++, Perl, and Phython include the Regular Expression build-in function.

Regular Expression is a powerful search function using search patterns. Many articles on the topic of regular expression can be found on the internet via various search engines.

The proposed process in comprised of the following steps until no more CSV records exists:

 1. Read a record from CSV file
 2. Use Regular Expression to find-and-replace commas in string values with ‘@@’ in CSV record read
 3. Extract CSV data from modified record into array
 4. Replace ‘@@’ with a comma in each array element to reinstate the original value
 5. Log intermediate and resulting values of processed CSV record

For example, record 1
03/03/2010,John Doe,”Thanks for the message.”,”Signed: Harvey”,10,14,-30,”123.45″
intermediate array data would change to (remains unchanged)
03/03/2010,John Doe,”Thanks for the message.”,”Signed: Harvey”,10,14,-30,”123.45″
final array data reinstates to (remains unchanged)
03/03/2010,John Doe,”Thanks for the message.”,”Signed: Harvey”,10,14,-30,”123.45″

Record 2
03/04/2010,Sally John,”I’m sorry, but I need to leave.”,”Sally”,23,21,5,”100.33″
intermediate array data would change to
03/04/2010,Sally John,”I’m sorry@@ but I need to leave.”,”Sally”,23,21,5,”100.33″
final array data reinstates to
03/04/2010,Sally John,”I’m sorry, but I need to leave.”,”Sally”,23,21,5,”100.33″

Record 3
03/10/2010,Johnie Lang,”Thank You! Juan, I will call you later today.”,”Johnie”,0,9,2,”-10,324.56″
intermediate array data would change to
03/10/2010,Johnie Lang,”Thank You! Juan@@ I will call you later today.”,”Johnie”,0,9,2,”-10@@324.56″
final array data reinstates to
03/10/2010,Johnie Lang,”Thank You! Juan, I will call you later today.”,”Johnie”,0,9,2,”-10,324.56″

Take for a test drive…

A Visual Basic script was written to demonstrate the proposed solution on a Windows platform. Follow the below steps to execute the script:

  1. Download VBScript demo zip file from here
  2. Unzip above file
  3. Create a directory on your local drive C:\CSV-Demo
  4. Save the data file (CSV-comma-data.txt) to directory in step 3
  5. Save the VBScript file (CSV-RegExp.vbs) to directory in step 3
  6. Start a DOS Command Prompt and point to directory in step 3
  7. Run VBScript in DOS Command Prompt as CSV-RegExp.vbs
  8. A log file (CSV-comma-log.txt) will be created in directory from step 3
  9. You can now review / study the script, data and log file using Notepad or any other text editor.

Enjoy!


CSV-comma-dat.txt

03/03/2010,John Doe,"Thanks for the message.","Signed: Harvey",10,14,-30,"123.45"
03/04/2010,Sally John,"I'm sorry, but I need to leave.","Sally",23,21,5,"100.33"
03/10/2010,Johnie Lang,"Thank You!  Juan, I will call you later today.","Johnie",0,9,2,"-10,324.56"


CSV-RegExp.vbs

'***********************************************************
'
' Read and Print a CSV file
'
' Author     : Larry Belmontes, Jr.
' Description: Script to read CSV formatted file, decode 
'              delimited columns via Regular Expressions,
'              and print each delimited data element to a
'              text file.
' Languages  : VBScript
' File Name  : CSV-RegExp.vbs
' Disclaimer: This software is intended for educational learning
'             and illustrative purposes.
'
'             This software is provided "AS IS" and without any
'             expressed or implied warranties, including, without
'             limitation, the implied warranties of merchantability
'             and fitness for a particular purpose.
'
'             No guarantee; No warranty; Install / Use at your own risk.
'
'
' Copyright (C) 2015 Larry Belmontes, Jr. 
'
'
'
'***********************************************************
'
' All variables must be defined
Option Explicit
'
' Define variables in script
Dim objFSO, objTextFile, objLOGout, objReg
Dim arrStr, rcdCnt, CSVfileIN, LOGfileOUT
Dim logContent, sLine, x
'
' Instantiate Regular Expression object
'   Ignore case and inspect entire string
'   using this pattern ,(?=[^"]*"[^"]*(?:"[^"]*"[^"]*)*$)
Set objReg=CreateObject("vbscript.regexp")
objReg.IgnoreCase = True
objReg.Global = True
objReg.Pattern=",(?=[^" & chr(34) & "]*" & chr(34) & "[^" & chr(34) & "]*(?:" & chr(34) & "[^" & chr(34) & "]*" & chr(34) & "[^" & chr(34) & "]*)*$)"
'
' Declare filenames
CSVfileIN = "C:\CSV-Demo\CSV-comma-data.txt"
LOGfileOUT = "C:\CSV-Demo\CSV-comma-log.txt"
'
' Instantiate File System, CSV input and Log output objects
Set objFSO = CreateObject("Scripting.FileSystemObject")
Set objTextFile = objFSO.OpenTextFile(CSVfileIN)	
Set objLOGout = objFSO.CreateTextFile(LOGfileOUT,2)
'
' Capture LOG data into string logContent
logContent = ""
logContent = logContent + "--------------------------------------------------------------------------" + vbCrLf
'
' Read each record in CSV input file until end of file
Do While NOT objTextFile.AtEndOfStream
	rcdCnt = rcdCnt + 1
	sLine = objTextFile.ReadLine
	logContent = logContent + "** Record #" + Cstr(rcdCnt) + "    (raw data):" + sLine + vbCrLf
	'change comma to unique set of characters within double quotes
	sLine = objReg.Replace(sLine,"@@")
	logContent = logContent + "** Record #" + Cstr(rcdCnt) + " (after Regex):" + sLine + vbCrLf
	'extract all values delimited by comma into array arrStr
	arrStr = split(sLine,",")
	'print each element in array arrStr
	For x = 0 to Ubound(arrStr)
		logContent = logContent + "** Value " + Right("0" & Cstr(x+1), 2) + ": " + arrStr(x)  + vbCrLf
		'restore unique set of characters to comma
		arrStr(x) = Replace(arrStr(x),"@@",",")
		logContent = logContent + "**    after: " + arrStr(x) + vbCrLf
	Next
	logContent = logContent + vbCrLf
Loop
'
' Last LOG entry
logContent = logContent + "--------------------------------------------------------------------------" + vbCrLf
logContent = logContent + vbCrLf
'
' Write data to LOG file
objLOGout.Write logContent 
'
' Close files
objLOGout.Close
objTextFile.Close
'
' Release Objects
Set objTextFile = Nothing
Set objLOGout = Nothing
Set objFSO = Nothing
Set objReg = Nothing


CSV-comma-log.txt (sample)

Sample of log file (CSV-comma-log.txt) generated after CSA-RegExp.vbs is executed. 

--------------------------------------------------------------------------
** Record #1    (raw data):03/03/2010,John Doe,"Thanks for the message.","Signed: Harvey",10,14,-30,"123.45"
** Record #1 (after Regex):03/03/2010,John Doe,"Thanks for the message.","Signed: Harvey",10,14,-30,"123.45"
** Value 01: 03/03/2010
**    after: 03/03/2010
** Value 02: John Doe
**    after: John Doe
** Value 03: "Thanks for the message."
**    after: "Thanks for the message."
** Value 04: "Signed: Harvey"
**    after: "Signed: Harvey"
** Value 05: 10
**    after: 10
** Value 06: 14
**    after: 14
** Value 07: -30
**    after: -30
** Value 08: "123.45"
**    after: "123.45"

** Record #2    (raw data):03/04/2010,Sally John,"I'm sorry, but I need to leave.","Sally",23,21,5,"100.33"
** Record #2 (after Regex):03/04/2010,Sally John,"I'm sorry@@ but I need to leave.","Sally",23,21,5,"100.33"
** Value 01: 03/04/2010
**    after: 03/04/2010
** Value 02: Sally John
**    after: Sally John
** Value 03: "I'm sorry@@ but I need to leave."
**    after: "I'm sorry, but I need to leave."
** Value 04: "Sally"
**    after: "Sally"
** Value 05: 23
**    after: 23
** Value 06: 21
**    after: 21
** Value 07: 5
**    after: 5
** Value 08: "100.33"
**    after: "100.33"

** Record #3    (raw data):03/10/2010,Johnie Lang,"Thank You!  Juan, I will call you later today.","Johnie",0,9,2,"-10,324.56"
** Record #3 (after Regex):03/10/2010,Johnie Lang,"Thank You!  Juan@@ I will call you later today.","Johnie",0,9,2,"-10@@324.56"
** Value 01: 03/10/2010
**    after: 03/10/2010
** Value 02: Johnie Lang
**    after: Johnie Lang
** Value 03: "Thank You!  Juan@@ I will call you later today."
**    after: "Thank You!  Juan, I will call you later today."
** Value 04: "Johnie"
**    after: "Johnie"
** Value 05: 0
**    after: 0
** Value 06: 9
**    after: 9
** Value 07: 2
**    after: 2
** Value 08: "-10@@324.56"
**    after: "-10,324.56"

--------------------------------------------------------------------------
Tagged

Leave a Reply

Your email address will not be published. Required fields are marked *


CAPTCHA Image
Reload Image

This site uses Akismet to reduce spam. Learn how your comment data is processed.