CSV (Comma-separated values) remains an extremely popular file format, and, in the context of programming, it is particularly useful for importing and exporting data between different software applications.
One of the issues with CSV, or any other delimiter-separated file format, is that often the values that need to be parsed into separate ‘parts’ will contain the delimiter.
This is especially true when dealing with CSV files since commas are frequently used in text.
While there are plenty of third-party libraries available that are great for working with CSV and other file formats, sometimes you just need something simple for a one-off tool or a small application for which you want to limit dependencies.
In this post, I will walk through an approach for parsing CSV files that have values containing commas and transform the data into a JSON document by way of example.
Sample data
To start with, let’s imagine we are dealing with a CSV file that has the following contents.
Product Code,Product Name,Product Description,Group Code,Group Name
WA123,Widget A,"A versatile widget, suitable for various tasks.",W01,Widgets
WB456,Widget B,"Robust and durable widget, ideal for heavy-duty use.",W01,Widgets
GX789,Gadget X,"Compact and efficient gadget, perfect for modern needs.",G01,Gadgets
GY101,Gadget Y,"Innovative gadget, combines technology and ease of use.",G01,Gadgets
T112,Tool 1,"High-quality tool, reliable and user-friendly.",T01,Tools
The first line in the file contains the header values that describe what each column/part should contain. The remainder of the file contains the data regarding 5 product records that have a code, name, and description. Additionally, the products are assigned a group code which has its own description.
Note how the parts that contain commas are surrounded by double quotes (more on this shortly).
Let’s say our requirements are to export this data into a JSON structure as follows.
{ "groups": [ { "code": "W01", "name": "Widgets" }, { "code": "G01", "name": "Gadgets" },
{
"code": "T01",
"name": "Tools"
} ], "products": [ { "code": "WA123", "name": "Widget A", "description": "A versatile widget, suitable for various tasks.",
"groupCode": "W01" }, { "code": "WB456", "name": "Widget B", "description": "Robust and durable widget, ideal for heavy-duty use.",
"groupCode": "W01" }, { "code": "GX789", "name": "Gadget X", "description": "Compact and efficient gadget, perfect for modern needs.",
"groupCode": "G01" }, { "code": "GY101", "name": "Gadget Y", "description": "Innovative gadget, combines technology and ease of use.",
"groupCode": "G01" }, { "code": "T112", "name": "Tool 1", "description": "High-quality tool, reliable and user-friendly.",
"groupCode": "T01" } ] }
The JSON document provides a better structure compared to the CSV file with the separation of the groups from the products which reference a group via the group code.
Dealing with commas
As mentioned previously, you will notice in the CSV data that any values that contain commas are surrounded by double quotes, such as the emboldened text within the line shown below.
WA123,Widget A,"A versatile widget, suitable for various tasks.",W01,Widgets
This is intentional and provides us with a way of recognising where the true delimiters are in the file.
Although not an absolute standard, this approach is fairly universal. If you save a spreadsheet in Microsoft Excel or Google Sheets as a CSV file, for example, you will find that this formatting is applied.
If you have a bunch of data in a spreadsheet to work with, your best starting point is to save the data as a CSV file from your spreadsheet program so that the correct formatting is applied and then you can read this file into your program.
Sample program
Now for the coding, let’s create a sample C# program and see how we can implement the requirements.
If you want to follow along, I recommend that you download and install the latest version of Visual Studio (if you haven’t already) or you can use another IDE or editor such as Rider or Visual Studio Code if you prefer.
We’re going to keep things as simple as possible, so let’s start by creating a C# Console App, using .NET 5 or greater so that we can benefit from top-level statements.
Next, let’s add the following code to the ‘Program.cs’ file of the project.
using System.Text.Encodings.Web; using System.Text.Json; // Read the lines from the CSV file, skipping the header line. IEnumerable<string> lines = File .ReadAllLines(@"C:\Users\Jonathan\Desktop\sample-products.csv") .Skip(1); // Set up lists to hold the groups and products. var groups = new List<Group>(); var products = new List<Product>(); foreach (string line in lines) { // Split the CSV line. IEnumerable<string> parts = line.SplitCsv(); // Extract the values. string productCode = parts.ElementAt(0); string productName = parts.ElementAt(1); string productDescription = parts.ElementAt(2); string groupCode = parts.ElementAt(3); string groupName = parts.ElementAt(4); // Check if the group already exists. Group? group = groups.FirstOrDefault(g => g.Code == groupCode); if (group is null) { // Add the group. groups.Add(new Group { Code = groupCode, Name = groupName }); } // Add the product. products.Add(new Product { Code = productCode, Name = productName, Description = productDescription, GroupCode = groupCode }); } // Create the JSON structure. var catalog = new { groups, products }; string json = JsonSerializer.Serialize(catalog, new JsonSerializerOptions { Encoder = JavaScriptEncoder.UnsafeRelaxedJsonEscaping, PropertyNamingPolicy = JsonNamingPolicy.CamelCase, WriteIndented = true }); // Write the serialized object to a JSON file. File.WriteAllText(@"C:\Users\Jonathan\Desktop\sample-products.json", json); // Inform the user that the program has completed. Console.WriteLine("Program complete, press any key to exit..."); Console.ReadKey(true); class Product { public string? Code { get; set; } public string? Name { get; set; } public string? Description { get; set; } public string? GroupCode { get; set; } } class Group { public string? Code { get; set; } public string? Name { get; set; } }
Note that the above code is intentionally very procedural in nature. The assumption here is that we are developing a simple tool that will carry out the parsing and transformation of the CSV data into the desired JSON structure.
The code first reads all of the lines from a CSV file at a specific location and skips the first line i.e. the header line so that only the data lines are included in the collection for processing.
Note that you should adjust the location where the CSV file is being loaded accordingly.
Lists are set up to hold the group and product objects. The code then iterates through each line that was read from the CSV file.
For each loop iteration, the current line is split on the delimiter via the SplitCsv
extension method and the available values are extracted. If you wish, you can add a check after splitting to ensure that the number of values/parts is equal to the number of parts you are expecting for extra safety when processing the data.
After extracting the values, the code checks if we have already added the group object to the group list by looking up the group code and adding the group to the list if required. A product object is added to the product list with the appropriate values.
When all of the lines have been processed, the final JSON structure is created and serialized to a string via the System.Text.Json.JsonSerializer
. Feel free to adjust the serializer options according to your requirements. The Encoder
property in this example has been set to JavaScriptEncoder.UnsafeRelaxedJsonEscaping
so that special characters such as á and é are not encoded as \u00E1 and \u00E9, for example.
Lastly, the JSON string is written to a JSON file at a specific location. The Product
and Group
classes have been included in the Program.cs file to help keep the demo as simple as possible.
Note that you should adjust the location where the JSON file is being saved accordingly.
Splitting the CSV
Now that we’ve walked through the high-level code, let’s look at how the CSV splitting is implemented.
The SplitCsv
method that is used in the Program.cs file we just covered is an extension method defined within a static CsvStringExtensions
class, as shown below.
using System.Text; /// <summary> /// Contains CSV string extensions. /// </summary> public static class CsvStringExtensions { /// <summary> /// Splits the specified CSV line string into its constituent parts. /// </summary> /// <param name="csvLine">The CSV line to split</param> /// <returns>A collection containing the split parts</returns> public static IEnumerable<string> SplitCsv(this string csvLine) { var builder = new StringBuilder(); bool escaped = false; bool inQuotes = false; foreach (char character in csvLine) { if (character == ',' && !inQuotes) { yield return builder.ToString(); builder.Clear(); } else if (character == '\\' && !escaped) { escaped = true; } else if (character == '"' && !escaped) { inQuotes = !inQuotes; } else { escaped = false; builder.Append(character); } } yield return builder.ToString(); } }
The above extension method is an iterator block containing logic which inspects each character in the CSV string that has been passed in and builds the individual line parts.
The escaped
and inQuotes
variables are used to track if the current character is escaped and if the current position is inside quotes respectively.
When iterating through each character the following logic is applied.
- If the character is a comma and we are not inside quotes, this signifies the end of a value. In this case, the value is ‘yielded’ to the caller and the
StringBuilder
is cleared. - If the character is a backslash and is not already escaped, the
escaped
flag is set. This is used for escaping characters like quotes in the value. - If the character is a double quote and is not escaped, the
inQuotes
flag is toggled. This is used to handle values that are enclosed in quotes, where commas inside quotes are not considered delimiters. - In all other cases, the
escaped
flag is reset to false and the character is appended to the current value that is being built.
After the character loop has been completed, the last value is yielded to the caller. This handles the case where the line does not end with a comma.
Running the program
Provided that the SplitCsv
extension method has been defined somewhere in your project that is accessible to the code in your Program.cs file, you should now be in a position to run the program. Just make sure that you have updated the file paths used in the code to suit your environment and check that the CSV file exists and contains content in the required format.
If you run the program you should find that the CSV file has been parsed and the JSON file has been successfully generated!
Formatting CSV values
As a side note, if you want to format your CSV according to a commonly accepted standard, you may find the code below useful.
/// <summary> /// Formats/escapes the specified string value for exporting to a CSV file. /// </summary> /// <param name="value">The string value to format</param> /// <returns>A formatted string value which can be exported to a CSV file</returns> public static string FormatAsCsvValue(this string value) { bool mustQuote = value.Contains(",") || value.Contains("\"") || value.Contains("\r") || value.Contains("\n"); if (mustQuote) { var builder = new StringBuilder(); builder.Append("\""); foreach (char character in value) { builder.Append(character); if (character == '"') { builder.Append("\""); } } builder.Append("\""); return builder.ToString(); } else { return value; } }
This code is similar to the code from the previous subsection, in that it iterates through each character in the specified string. However, this time, it is the individual CSV line part that should be passed to the method, not the full CSV line.
The code checks if the specified value contains one or more characters that need to be quoted and if so it creates a StringBuilder
instance and appends a double quote at the start and end of the string. Each character within the specified value is inspected and double quotes are escaped if required. After the character loop completes, the final string is built and returned to the caller.
The result is that the specified string will be returned to the caller in a format that is suitable for exporting safely to a CSV file so that it can be read back into the program again later without any parsing issues.
Summary
In this post, I have walked through a basic example of how to parse a CSV file that has lines where the individual values contain commas and demonstrated how to transform these lines into a JSON structure.
I started by showing you the sample data that we are starting with and what we wanted to end up with as the output.
I then provided the code that can fulfil the requirements by reading the CSV file, parsing it, transforming it into the appropriate structure, and then exporting this to a JSON file.
I also covered how you can format string values appropriately when exporting to a CSV file so that you can implement the same standards that spreadsheet programs like Microsoft Excel and Google Sheets adhere to.
Perhaps you will find the code useful for a tool or as a simple means of automating a data transformation task!
Comments