How to replace accented characters with plain characters using C#

Many software applications need to work with strings that contain accented characters such as à and é, since many human languages use these to indicate variations in pronunciation and other subtleties.

Even English borrows some French words that contain accents and other glyphs, such as café, déjà vu, and façade. These words are frequently used in day-to-day conversations and written text.

A fairly common software requirement is the need to replace or remove accented characters within such words. The use case could be to allow the storage of text in a legacy system that does not support Unicode characters, or perhaps to generate a user-friendly alphanumeric code that is based on a portion of text but must contain only plain characters and numbers.

In this post, I will explain how you can remove accents from characters and effectively replace accented characters with the equivalent ‘plain’ characters using C#.

Diacritics

The ‘accents’ that form part of characters such as à and é are examples of glyphs that are formally known as diacritics.

Diacritics are small marks or symbols added to letters in various alphabets to indicate specific phonetic or linguistic features, such as pronunciation, stress, tone, or other linguistic distinctions. Diacritics can alter the sound, meaning, or pronunciation of a letter or word.

It’s important to understand that diacritics are the marks on the characters and not the characters themselves. It’s also worth being aware that accents are not the only marks that are referred to as diacritics.

Below is a list of various diacritics, along with examples.

  1. Acute Accent: ´ (e.g., á, é, í, ó, ú)
  2. Grave Accent: ` (e.g., à, è, ì, ò, ù)
  3. Circumflex: ˆ (e.g., â, ê, î, ô, û)
  4. Tilde: ~ (e.g., ã, ñ, õ, ñ)
  5. Cedilla: ¸ (e.g., ç)
  6. Diaeresis (Umlaut): ¨ (e.g., ä, ë, ï, ö, ü)
  7. Breve: ˘ (e.g., ă, ĕ, ĭ, ŏ, ŭ)
  8. Macron: ¯ (e.g., ā, ē, ī, ō, ū)
  9. Caron (Háček): ˇ (e.g., č, š, ž, ě)
  10. Underdot: ̣ (e.g., ṇ, ḥ, ḷ)
  11. Dot Above: ˙ (e.g., ẏ, ȧ)
  12. Ring Above: ˚ (e.g., å, ů)
  13. Hook Above: ̉ (e.g., ả, ấ, ặ)
  14. Double Acute Accent: ̋ (e.g., ő, ű)
  15. Double Grave Accent: ̏ (e.g., ȁ, ȑ)
  16. Double Dot Above: ̈̈ (e.g., ẅ, ẗ)
  17. Horn: ̛ (e.g., ơ, ư)

As you can see from the above list, many different types of diacritics are used across different languages!

So how can our code cater to all of these possibilities and replace the characters that use diacritics with the equivalent plain character version? We’ll look at the code solution that can handle this requirement in the next section.

Code solution

Now for the code solution. I have included the definition of a C# string extension method below named RemoveDiacritics that can remove diacritics from any character in a specified string.

using System.Globalization;
using System.Text;

/// <summary>
/// Contains <see cref="string"/> extension methods.
/// </summary>
public static class StringExtensions
{
    /// <summary>
    /// Removes diacritics from the specified text.
    /// </summary>
    /// <param name="text">The text to remove diacritics from</param>
    /// <returns>A new version of the text with diacritics removed</returns>
    public static string RemoveDiacritics(this string text)
    {
        string normalizedString = text.Normalize(NormalizationForm.FormD);
 
        var stringBuilder = new StringBuilder(normalizedString.Length);
 
        foreach (char c in normalizedString)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                stringBuilder.Append(c);
            }
        }
 
        return stringBuilder.ToString();
    }
}

Okay, let’s break down what the above method is doing so that we have a full understanding of what is happening.

At a high level, the method takes a string as the input, normalises it, and returns a new string instance to the caller with the ‘non-spacing mark’ characters removed (more on this shortly).

The System.String class contains an instance method named Normalize which, as per the documentation, “returns a new string whose textual value is the same as this string, but whose binary representation is the normalization form specified by the normalizationForm parameter.”

By calling the Normalize method with NormalizationForm.FormD (canonical decomposition), each accented character is split into its plain base character followed by one or more separate combining mark characters. This allows us to iterate through each character in the string and exclude the diacritics from the result.
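To see the split for a single character, here is a minimal sketch, assuming a console app that uses top-level statements. The variable names are only for illustration.

```csharp
using System;
using System.Linq;
using System.Text;

string composed = "\u00E9"; // é as a single precomposed character
string decomposed = composed.Normalize(NormalizationForm.FormD);

// Format each character as U+XXXX so that the split is visible.
string codePoints = string.Join(" ", decomposed.Select(c => $"U+{(int)c:X4}"));

Console.WriteLine(composed.Length);   // 1
Console.WriteLine(decomposed.Length); // 2
Console.WriteLine(codePoints);        // U+0065 U+0301
```

U+0065 is the plain letter e and U+0301 is the combining acute accent that follows it.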

After normalising the string, a StringBuilder instance is created and configured with an initial capacity that matches the length of the normalised string. Setting the initial capacity provides a small performance optimisation for most cases.

Next, the code iterates through each character of the normalised string.

Consider a scenario where the following text has been passed to the RemoveDiacritics method: áèîõü

In this case, the characters from the normalised string will be processed by the foreach loop in the order shown below.

a
´
e
`
i
^
o
~
u
¨

If we didn’t normalise the string first, and the input used precomposed characters as in this example, there would only have been 5 loop iterations, as the characters wouldn’t have been separated out as above.

Within the loop, the diacritics are identified and filtered out by calling the CharUnicodeInfo.GetUnicodeCategory method and checking for UnicodeCategory.NonSpacingMark characters. The diacritics will all be classed as non-spacing marks.
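To illustrate the check in isolation, the following minimal sketch prints the Unicode category of a plain letter and of the combining acute accent. The two sample characters are my own choice for illustration.

```csharp
using System;
using System.Globalization;

// A plain base letter and the combining acute accent (U+0301).
UnicodeCategory letter = CharUnicodeInfo.GetUnicodeCategory('e');
UnicodeCategory mark = CharUnicodeInfo.GetUnicodeCategory('\u0301');

Console.WriteLine(letter); // LowercaseLetter
Console.WriteLine(mark);   // NonSpacingMark
```

Only the second character would be skipped by the loop.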

After the loop has been completed, the ToString method is called on the StringBuilder instance and the result is returned.
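One optional refinement, which the method above does not include, is to recompose the result. The returned string stays in decomposed form, so text such as Korean Hangul, which FormD splits into several letters that are not non-spacing marks, comes back longer than it went in. Below is a minimal sketch, assuming you want composed output. The name RemoveDiacriticsAndRecompose is hypothetical.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical variant of RemoveDiacritics that recomposes the result.
string RemoveDiacriticsAndRecompose(string text)
{
    string normalizedString = text.Normalize(NormalizationForm.FormD);
    var stringBuilder = new StringBuilder(normalizedString.Length);

    foreach (char c in normalizedString)
    {
        if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
        {
            stringBuilder.Append(c);
        }
    }

    // Recompose whatever remains, e.g. Hangul jamo back into syllables.
    return stringBuilder.ToString().Normalize(NormalizationForm.FormC);
}

string korean = "\uD55C"; // 한, a single precomposed Hangul syllable

Console.WriteLine(korean.Normalize(NormalizationForm.FormD).Length); // 3
Console.WriteLine(RemoveDiacriticsAndRecompose(korean).Length);      // 1
Console.WriteLine(RemoveDiacriticsAndRecompose("café"));             // cafe
```

Whether the extra Normalize call is worthwhile depends on the kind of text you expect to process.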

Using the extension method

Below is an example of calling the RemoveDiacritics extension method and outputting the result to the Console.

string diacriticCharacterText = "áèîõü";
string plainCharacterText = diacriticCharacterText.RemoveDiacritics();
 
Console.WriteLine(plainCharacterText);
 
// Output: aeiou

In the above case, the diacritics are all removed as expected.

Testing inputs

While the basic example shown in the previous subsection looks promising, how can we be sure that the RemoveDiacritics method is correctly handling all of the scenarios that we need to cater for?

Below is a set of xUnit test cases that can provide us with more confidence in the solution.

public class StringExtensionsTests
{
    [Theory]
    [InlineData("aeiou", "áéíóú")]
    [InlineData("aeiou", "àèìòù")]
    [InlineData("aeiou", "âêîôû")]
    [InlineData("anon", "ãñõñ")]
    [InlineData("c", "ç")]
    [InlineData("aeiou", "äëïöü")]
    [InlineData("aeiou", "ăĕĭŏŭ")]
    [InlineData("aeiou", "āēīōū")]
    [InlineData("csze", "čšžě")]
    [InlineData("nhy", "ṇḥẏ")]
    [InlineData("la", "ḷȧ")]
    [InlineData("au", "åů")]
    [InlineData("aaa", "ảấặ")]
    [InlineData("ou", "őű")]
    [InlineData("ar", "ȁȑ")]
    [InlineData("wt", "ẅẗ")]
    [InlineData("ou", "ơư")]
    [InlineData("", "")]
    public void RemoveDiacritics_ShouldReturnResult(string expected, string input)
    {
        // Arrange/Act.
        var result = input.RemoveDiacritics();
 
        // Assert.
        Assert.Equal(expected, result);
    }
}

If you run these tests, you should find that they all pass.

In the above code, the xUnit Theory attribute is used along with InlineData attributes to define multiple test cases that are based on the list of diacritic examples shown earlier in the post.

Note that you can of course use your preferred unit test library (e.g. NUnit) and replace the attributes etc. with the equivalents (e.g. Test, TestCase attributes).

When diacritics are not enough

Sometimes you will need to concern yourself with more than just diacritics.

What if you’re dealing with a language that uses characters such as the following?

  • Æ
  • ß
  • ð
  • Ð
  • Œ
  • Ø
  • œ

Unfortunately, the above characters are distinct letters or ligatures rather than base letters with diacritics. Normalisation does not decompose them, so the RemoveDiacritics method leaves them unchanged, which may not be what you expect.

In these cases, you will need to write some mapping code to obtain reliable results. Usually, this would consist of a mapping dictionary where you decide what plain text character you want to use in place of the unsupported character.
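As a rough idea of what such mapping code could look like, here is a minimal sketch. The replacement values and the ReplaceSpecialCharacters name are hypothetical, so choose mappings that suit your own use case.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical replacements for characters that have no diacritic to strip.
var replacements = new Dictionary<char, string>
{
    ['Æ'] = "AE",
    ['æ'] = "ae",
    ['ß'] = "ss",
    ['ð'] = "d",
    ['Ð'] = "D",
    ['Œ'] = "OE",
    ['œ'] = "oe",
    ['Ø'] = "O",
    ['ø'] = "o",
};

// Replaces each mapped character and copies everything else unchanged.
string ReplaceSpecialCharacters(string text)
{
    var stringBuilder = new StringBuilder(text.Length);

    foreach (char c in text)
    {
        if (replacements.TryGetValue(c, out var replacement))
        {
            stringBuilder.Append(replacement);
        }
        else
        {
            stringBuilder.Append(c);
        }
    }

    return stringBuilder.ToString();
}

Console.WriteLine(ReplaceSpecialCharacters("Æther, Straße, Øresund, œuvre"));

// Output: AEther, Strasse, Oresund, oeuvre
```

A string is used for each replacement value because some characters, such as Æ and ß, are usually replaced with two plain characters.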

Alternatively, depending on your use case, you may be able to simply remove the unsupported characters from your resulting string. This could be the case, for example, when generating an alphanumeric code.

Below is a modified version of the earlier example that removes non-ASCII characters after calling the RemoveDiacritics extension method.

string diacriticCharacterText = "ÆßðáèîõüЌ؜";
string plainCharacterText = new string(diacriticCharacterText
    .RemoveDiacritics()
    .Where(char.IsAscii)
    .ToArray());
 
Console.WriteLine(plainCharacterText);
 
// Output: aeiou

The above code results in the same output as the earlier example.

The LINQ Where extension method is used in combination with the Char.IsAscii method to filter out non-ASCII characters. The LINQ ToArray extension method is then used to convert the results to an array of characters that can be passed to the String constructor. Note that Char.IsAscii is available from .NET 6 onwards.

Summary

In this post, I covered how to remove accents (diacritics) from strings using C#.

I started by explaining what diacritics are and I provided examples of the many possible types of diacritics.

I then provided a code solution in the form of a string extension method that can strip diacritics from a specified string and return a plain text string as the result.

After showing examples of calling the extension method and testing it, I wrapped things up by considering the options you have when dealing with more complex characters that are used in some languages.


I hope you enjoyed this post! Comments are always welcome and I respond to all questions.
