CSV (Comma Separated Values) data files, as the term implies, contains plain text records with column (field) values separated or delimited by commas. One example of using CSV files includes data exchange between applications and/or computing platforms. Writing a script to extract CSV data appears to be a straight forward process – simply loading an array with data from each comma delimited column.
However, this process may not be adequate when a comma exists inside string values (values enclosed by double quotes).
For example, consider the following CSV file with three records:
03/03/2010,John Doe,"Thanks for the message.","Signed: Harvey",10,14,-30,"123.45" 03/04/2010,Sally John,"I'm sorry, but I need to leave.","Sally",23,21,5,"100.33" 03/10/2010,Johnie Lang,"Thank You! Juan, I will call you later today.","Johnie",0,9,2,"-10,324.56"
Some records are processed incorrectly due to improper detection of the comma in a string value.
Record 1: Correct | Record 2: Incorrect | Record 3: Incorrect |
Extracted 8 values correctly | Extracted values 3 and 4 as two separate values due to embedded commas | Extracted value 3, 4, 9 and 10 as two separate values due to embedded commas |
Value 1: 03/03/2010 | Value 1: 03/04/2010 | Value 1: 03/10/2010 |
Value 2: John Doe | Value 2: Sally John | Value 2: Johnie Lang |
Value 3: “Thanks for the message.” | Value 3: “I’m sorry | Value 3: “Thank You! Juan |
Value 4: “Signed: Harvey” | Value 4: but I need to leave.” | Value 4: I will call you later today.” |
Value 5: 10 | Value 5: Sally | Value 5: “Johnie” |
Value 6: 14 | Value 6: 23 | Value 6: 0 |
Value 7: -30 | Value 7: 21 | Value 7: 9 |
Value 8: “123.45” | Value 8: 5 | Value 8: 2 |
Value 9 : “100.33” | Value 9: “-10 | |
Value 10: 324.56″ |
As shown above, additional inspection (parsing) of record data needs to be incorporated to yield correct CSV extracting. In this case, resulting in 8 values per record.
A Proposed Solution…
To maintain value extraction complete, a regular expression can be used to decode content enclosed between two double quotes and change any commas to a special character(s) before CSV processing is performed on the data.
Today, most object-oriented languages such as (but not limited to) Java, Javascript, VBScript, C++, Perl, and Phython include the Regular Expression build-in function.
Regular Expression is a powerful search function using search patterns. Many articles on the topic of regular expression can be found on the internet via various search engines.
The proposed process in comprised of the following steps until no more CSV records exists:
1. | Read a record from CSV file |
2. | Use Regular Expression to find-and-replace commas in string values with ‘@@’ in CSV record read |
3. | Extract CSV data from modified record into array |
4. | Replace ‘@@’ with a comma in each array element to reinstate the original value |
5. | Log intermediate and resulting values of processed CSV record |
For example, record 1
03/03/2010,John Doe,”Thanks for the message.”,”Signed: Harvey”,10,14,-30,”123.45″
intermediate array data would change to (remains unchanged)
03/03/2010,John Doe,”Thanks for the message.”,”Signed: Harvey”,10,14,-30,”123.45″
final array data reinstates to (remains unchanged)
03/03/2010,John Doe,”Thanks for the message.”,”Signed: Harvey”,10,14,-30,”123.45″
Record 2
03/04/2010,Sally John,”I’m sorry, but I need to leave.”,”Sally”,23,21,5,”100.33″
intermediate array data would change to
03/04/2010,Sally John,”I’m sorry@@ but I need to leave.”,”Sally”,23,21,5,”100.33″
final array data reinstates to
03/04/2010,Sally John,”I’m sorry, but I need to leave.”,”Sally”,23,21,5,”100.33″
Record 3
03/10/2010,Johnie Lang,”Thank You! Juan, I will call you later today.”,”Johnie”,0,9,2,”-10,324.56″
intermediate array data would change to
03/10/2010,Johnie Lang,”Thank You! Juan@@ I will call you later today.”,”Johnie”,0,9,2,”-10@@324.56″
final array data reinstates to
03/10/2010,Johnie Lang,”Thank You! Juan, I will call you later today.”,”Johnie”,0,9,2,”-10,324.56″
Take for a test drive…
A Visual Basic script was written to demonstrate the proposed solution on a Windows platform. Follow the below steps to execute the script:
- Download VBScript demo zip file from here
- Unzip above file
- Create a directory on your local drive C:\CSV-Demo
- Save the data file (CSV-comma-data.txt) to directory in step 3
- Save the VBScript file (CSV-RegExp.vbs) to directory in step 3
- Start a DOS Command Prompt and point to directory in step 3
- Run VBScript in DOS Command Prompt as CSV-RegExp.vbs
- A log file (CSV-comma-log.txt) will be created in directory from step 3
- You can now review / study the script, data and log file using Notepad or any other text editor.
Enjoy!
CSV-comma-dat.txt
03/03/2010,John Doe,"Thanks for the message.","Signed: Harvey",10,14,-30,"123.45" 03/04/2010,Sally John,"I'm sorry, but I need to leave.","Sally",23,21,5,"100.33" 03/10/2010,Johnie Lang,"Thank You! Juan, I will call you later today.","Johnie",0,9,2,"-10,324.56"
CSV-RegExp.vbs
'*********************************************************** ' ' Read and Print a CSV file ' ' Author : Larry Belmontes, Jr. ' Description: Script to read CSV formatted file, decode ' delimited columns via Regular Expressions, ' and print each delimited data element to a ' text file. ' Languages : VBScript ' File Name : CSV-RegExp.vbs ' Disclaimer: This software is intended for educational learning ' and illustrative purposes. ' ' This software is provided "AS IS" and without any ' expressed or implied warranties, including, without ' limitation, the implied warranties of merchantability ' and fitness for a particular purpose. ' ' No guarantee; No warranty; Install / Use at your own risk. ' ' ' Copyright (C) 2015 Larry Belmontes, Jr. ' ' ' '*********************************************************** ' ' All variables must be defined Option Explicit ' ' Define variables in script Dim objFSO, objTextFile, objLOGout, objReg Dim arrStr, rcdCnt, CSVfileIN, LOGfileOUT Dim logContent, sLine, x ' ' Instantiate Regular Expression object ' Ignore case and inspect entire string ' using this pattern ,(?=[^"]*"[^"]*(?:"[^"]*"[^"]*)*$) Set objReg=CreateObject("vbscript.regexp") objReg.IgnoreCase = True objReg.Global = True objReg.Pattern=",(?=[^" & chr(34) & "]*" & chr(34) & "[^" & chr(34) & "]*(?:" & chr(34) & "[^" & chr(34) & "]*" & chr(34) & "[^" & chr(34) & "]*)*$)" ' ' Declare filenames CSVfileIN = "C:\CSV-Demo\CSV-comma-data.txt" LOGfileOUT = "C:\CSV-Demo\CSV-comma-log.txt" ' ' Instantiate File System, CSV input and Log output objects Set objFSO = CreateObject("Scripting.FileSystemObject") Set objTextFile = objFSO.OpenTextFile(CSVfileIN) Set objLOGout = objFSO.CreateTextFile(LOGfileOUT,2) ' ' Capture LOG data into string logContent logContent = "" logContent = logContent + "--------------------------------------------------------------------------" + vbCrLf ' ' Read each record in CSV input file until end of file Do While NOT objTextFile.AtEndOfStream rcdCnt = rcdCnt + 1 sLine = objTextFile.ReadLine logContent = logContent + "** Record #" + Cstr(rcdCnt) + " (raw data):" + sLine + vbCrLf 'change comma to unique set of characters within double quotes sLine = objReg.Replace(sLine,"@@") logContent = logContent + "** Record #" + Cstr(rcdCnt) + " (after Regex):" + sLine + vbCrLf 'extract all values delimited by comma into array arrStr arrStr = split(sLine,",") 'print each element in array arrStr For x = 0 to Ubound(arrStr) logContent = logContent + "** Value " + Right("0" & Cstr(x+1), 2) + ": " + arrStr(x) + vbCrLf 'restore unique set of characters to comma arrStr(x) = Replace(arrStr(x),"@@",",") logContent = logContent + "** after: " + arrStr(x) + vbCrLf Next logContent = logContent + vbCrLf Loop ' ' Last LOG entry logContent = logContent + "--------------------------------------------------------------------------" + vbCrLf logContent = logContent + vbCrLf ' ' Write data to LOG file objLOGout.Write logContent ' ' Close files objLOGout.Close objTextFile.Close ' ' Release Objects Set objTextFile = Nothing Set objLOGout = Nothing Set objFSO = Nothing Set objReg = Nothing
CSV-comma-log.txt (sample)
Sample of log file (CSV-comma-log.txt) generated after CSA-RegExp.vbs is executed.
-------------------------------------------------------------------------- ** Record #1 (raw data):03/03/2010,John Doe,"Thanks for the message.","Signed: Harvey",10,14,-30,"123.45" ** Record #1 (after Regex):03/03/2010,John Doe,"Thanks for the message.","Signed: Harvey",10,14,-30,"123.45" ** Value 01: 03/03/2010 ** after: 03/03/2010 ** Value 02: John Doe ** after: John Doe ** Value 03: "Thanks for the message." ** after: "Thanks for the message." ** Value 04: "Signed: Harvey" ** after: "Signed: Harvey" ** Value 05: 10 ** after: 10 ** Value 06: 14 ** after: 14 ** Value 07: -30 ** after: -30 ** Value 08: "123.45" ** after: "123.45" ** Record #2 (raw data):03/04/2010,Sally John,"I'm sorry, but I need to leave.","Sally",23,21,5,"100.33" ** Record #2 (after Regex):03/04/2010,Sally John,"I'm sorry@@ but I need to leave.","Sally",23,21,5,"100.33" ** Value 01: 03/04/2010 ** after: 03/04/2010 ** Value 02: Sally John ** after: Sally John ** Value 03: "I'm sorry@@ but I need to leave." ** after: "I'm sorry, but I need to leave." ** Value 04: "Sally" ** after: "Sally" ** Value 05: 23 ** after: 23 ** Value 06: 21 ** after: 21 ** Value 07: 5 ** after: 5 ** Value 08: "100.33" ** after: "100.33" ** Record #3 (raw data):03/10/2010,Johnie Lang,"Thank You! Juan, I will call you later today.","Johnie",0,9,2,"-10,324.56" ** Record #3 (after Regex):03/10/2010,Johnie Lang,"Thank You! Juan@@ I will call you later today.","Johnie",0,9,2,"-10@@324.56" ** Value 01: 03/10/2010 ** after: 03/10/2010 ** Value 02: Johnie Lang ** after: Johnie Lang ** Value 03: "Thank You! Juan@@ I will call you later today." ** after: "Thank You! Juan, I will call you later today." ** Value 04: "Johnie" ** after: "Johnie" ** Value 05: 0 ** after: 0 ** Value 06: 9 ** after: 9 ** Value 07: 2 ** after: 2 ** Value 08: "-10@@324.56" ** after: "-10,324.56" --------------------------------------------------------------------------