
Many software applications need to work with strings that contain accented characters such as à and é, since many human languages use these to indicate variations in pronunciation and other subtleties.
Even English borrows some French words that contain accents and other glyphs, such as café, déjà vu, and façade. These words are frequently used in day-to-day conversations and written text.
A fairly common software requirement is the need to replace or remove accented characters within such words. The use case could be to allow the storage of text in a legacy system that does not support Unicode characters, or perhaps to generate a user-friendly alphanumeric code that is based on a portion of text but must contain only plain characters and numbers.
In this post, I will explain how you can remove accents from characters and effectively replace accented characters with the equivalent ‘plain’ characters using C#.
Diacritics
The ‘accents’ that form part of characters such as à and é are examples of glyphs that are formally known as diacritics.
Diacritics are small marks or symbols added to letters in various alphabets to indicate specific phonetic or linguistic features, such as pronunciation, stress, tone, or other linguistic distinctions. Diacritics can alter the sound, meaning, or pronunciation of a letter or word.
As well as understanding that diacritics are the glyphs/marks on the characters and not the characters themselves, it’s also important to be aware that it’s not just accents that are referred to as diacritics.
Below is a list of various diacritics, along with examples.
- Acute Accent: ´ (e.g., á, é, í, ó, ú)
- Grave Accent: ` (e.g., à, è, ì, ò, ù)
- Circumflex: ˆ (e.g., â, ê, î, ô, û)
- Tilde: ~ (e.g., ã, ñ, õ, ñ)
- Cedilla: ¸ (e.g., ç)
- Diaeresis (Umlaut): ¨ (e.g., ä, ë, ï, ö, ü)
- Breve: ˘ (e.g., ă, ĕ, ĭ, ŏ, ŭ)
- Macron: ¯ (e.g., ā, ē, ī, ō, ū)
- Caron (Háček): ˇ (e.g., č, š, ž, ě)
- Underdot: ̣ (e.g., ṇ, ḥ, ḷ)
- Dot Above: ˙ (e.g., ȧ, ẏ)
- Ring Above: ˚ (e.g., å, ů)
- Hook Above: ̉ (e.g., ả, ẻ, ỏ)
- Double Acute Accent: ̋ (e.g., ő, ű)
- Double Grave Accent: ̏ (e.g., ȁ, ȑ)
- Double Dot Above: ̈̈ (e.g., ẅ, ẗ)
- Horn: ̛ (e.g., ơ, ư)
As you can see from the above list, many different types of diacritics are used across different languages!
So how can our code cater to all of these possibilities and replace the characters that use diacritics with the equivalent plain character version? We’ll look at the code solution that can handle this requirement in the next section.
Code solution
Now for the code solution. Below is the definition of a C# string extension method named RemoveDiacritics that can remove diacritics from any character in a specified string.
    using System.Globalization;
    using System.Text;

    /// <summary>
    /// Contains <see cref="string"/> extension methods.
    /// </summary>
    public static class StringExtensions
    {
        /// <summary>
        /// Removes diacritics from the specified text.
        /// </summary>
        /// <param name="text">The text to remove diacritics from</param>
        /// <returns>A new version of the text with diacritics removed</returns>
        public static string RemoveDiacritics(this string text)
        {
            string normalizedString = text.Normalize(NormalizationForm.FormD);
            var stringBuilder = new StringBuilder(normalizedString.Length);

            foreach (char c in normalizedString)
            {
                if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                {
                    stringBuilder.Append(c);
                }
            }

            return stringBuilder.ToString();
        }
    }
Okay, let’s break down what the above method is doing so that we have a full understanding of what is happening.
At a high level, the method takes a string as the input, normalises it, and returns a new string instance to the caller with the ‘non-spacing mark’ characters removed (more on this shortly).
The System.String class contains an instance method named Normalize which, as per the documentation, “returns a new string whose textual value is the same as this string, but whose binary representation is the normalization form specified by the normalizationForm parameter.”
By calling the Normalize method with NormalizationForm.FormD, each accented character is decomposed into its plain base character followed by one or more separate diacritic characters, allowing us to iterate through each character in the string and exclude the diacritics from the result.
After normalising the string, a StringBuilder instance is created and configured with an initial capacity that matches the length of the normalised string. Setting the initial capacity provides a small performance optimisation for most cases.
Next, the code iterates through each character of the normalised string.
Consider a scenario where the following text has been passed to the RemoveDiacritics method: áèîõü
In this case, the characters from the normalised string will be processed by the foreach loop in the order shown below (each combining mark is shown here as its standalone equivalent).
a ´ e ` i ^ o ~ u ¨
If we didn’t normalise the string first, there would only have been 5 loop iterations, as the characters wouldn’t have been separated out as above.
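To make the effect of normalisation concrete, here is a minimal sketch that compares the length of the example text before and after calling Normalize and prints the code points of the decomposed string. It is written as top-level statements, so it assumes C# 9 or later, and the variable names are my own. The input is written with escape sequences on the assumption that it arrives in precomposed form.

```csharp
using System;
using System.Linq;
using System.Text;

// Escape sequences keep the input precomposed regardless of how this file is saved.
string composed = "\u00E1\u00E8\u00EE\u00F5\u00FC"; // áèîõü
string decomposed = composed.Normalize(NormalizationForm.FormD);

Console.WriteLine(composed.Length);   // 5
Console.WriteLine(decomposed.Length); // 10

// Print every UTF-16 code unit of the decomposed string as U+XXXX.
Console.WriteLine(string.Join(" ", decomposed.Select(c => $"U+{(int)c:X4}")));
// U+0061 U+0301 U+0065 U+0300 U+0069 U+0302 U+006F U+0303 U+0075 U+0308
```

Each base letter (U+0061 is a, U+0065 is e, and so on) is followed by its combining mark, which is why the loop sees ten characters rather than five.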
Within the loop, the diacritics are identified and filtered out by calling the CharUnicodeInfo.GetUnicodeCategory method and checking for UnicodeCategory.NonSpacingMark characters. The diacritics will all be classed as non-spacing marks.
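The category check can be seen in isolation with a short sketch like the one below, which decomposes a single é and prints the Unicode category of each resulting character. Again this assumes top-level statements (C# 9 or later), and the variable names are illustrative only.

```csharp
using System;
using System.Globalization;
using System.Text;

// "é" as an escape sequence so the input is precomposed regardless of file encoding.
string decomposed = "\u00E9".Normalize(NormalizationForm.FormD);

foreach (char c in decomposed)
{
    UnicodeCategory category = CharUnicodeInfo.GetUnicodeCategory(c);
    Console.WriteLine($"U+{(int)c:X4}: {category}");
}
// U+0065: LowercaseLetter
// U+0301: NonSpacingMark
```

Only the second character is a non-spacing mark, so it is the only one that RemoveDiacritics would skip.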
After the loop has been completed, the ToString method is called on the StringBuilder instance and the result is returned.
Using the extension method
Below is an example of calling the RemoveDiacritics extension method and outputting the result to the Console.
    string diacriticCharacterText = "áèîõü";
    string plainCharacterText = diacriticCharacterText.RemoveDiacritics();

    Console.WriteLine(plainCharacterText); // Output: aeiou
In the above case, the diacritics are all removed as expected.
Testing inputs
While the basic example shown in the previous subsection looks promising, how can we be sure that the RemoveDiacritics method is correctly handling all of the scenarios that we need to cater for?
Below is a set of xUnit test cases that can provide us with more confidence in the solution.
    using Xunit;

    public class StringExtensionsTests
    {
        [Theory]
        [InlineData("aeiou", "áéíóú")]
        [InlineData("aeiou", "àèìòù")]
        [InlineData("aeiou", "âêîôû")]
        [InlineData("anon", "ãñõñ")]
        [InlineData("c", "ç")]
        [InlineData("aeiou", "äëïöü")]
        [InlineData("aeiou", "ăĕĭŏŭ")]
        [InlineData("aeiou", "āēīōū")]
        [InlineData("csze", "čšžě")]
        [InlineData("nhy", "ṇḥẏ")]
        [InlineData("la", "ḷȧ")]
        [InlineData("au", "åů")]
        [InlineData("aaa", "ảấặ")]
        [InlineData("ou", "őű")]
        [InlineData("ar", "ȁȑ")]
        [InlineData("wt", "ẅẗ")]
        [InlineData("ou", "ơư")]
        [InlineData("", "")]
        public void RemoveDiacritics_ShouldReturnResult(string expected, string input)
        {
            // Arrange/Act.
            var result = input.RemoveDiacritics();

            // Assert.
            Assert.Equal(expected, result);
        }
    }
If you run these tests, you should find that they all pass.
In the above code, the xUnit Theory attribute is used along with InlineData attributes to define multiple test cases that are based on the list of diacritic examples shown earlier in the post.
Note that you can of course use your preferred unit test library (e.g. NUnit) and replace the attributes with the equivalents (e.g. the Test and TestCase attributes).
When diacritics are not enough
Sometimes you will need to concern yourself with more than just diacritics.
What if you’re dealing with a language that uses characters such as the following?
- Æ
- ß
- ð
- Ð
- Œ
- Ø
- œ
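Unfortunately, the above characters are not base letters with diacritics attached. They are distinct letters or ligatures that normalisation does not decompose, so the RemoveDiacritics method leaves them unchanged rather than replacing them with a plain equivalent as you might expect.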
In these cases, you will need to write some mapping code to obtain reliable results. Usually, this would consist of a mapping dictionary where you decide what plain text character you want to use in place of the unsupported character.
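As a rough sketch of what such mapping code could look like, the example below uses a dictionary keyed by character. The helper name ReplaceSpecialCharacters and the replacement values are my own illustrative choices rather than any standard (you may prefer "oe" for ø, for instance), and the code assumes top-level statements (C# 9 or later).

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical mapping table; the replacements are illustrative choices, not a standard.
var replacements = new Dictionary<char, string>
{
    ['Æ'] = "AE",
    ['æ'] = "ae",
    ['ß'] = "ss",
    ['Ð'] = "D",
    ['ð'] = "d",
    ['Œ'] = "OE",
    ['œ'] = "oe",
    ['Ø'] = "O",
    ['ø'] = "o",
};

// Hypothetical helper: swaps each mapped character for its replacement and
// copies every other character through unchanged.
string ReplaceSpecialCharacters(string text)
{
    var builder = new StringBuilder(text.Length);

    foreach (char c in text)
    {
        if (replacements.TryGetValue(c, out var replacement))
        {
            builder.Append(replacement);
        }
        else
        {
            builder.Append(c);
        }
    }

    return builder.ToString();
}

Console.WriteLine(ReplaceSpecialCharacters("Æsop Straße Œuvre Øre"));
// AEsop Strasse OEuvre Ore
```

A helper like this could be called before or after RemoveDiacritics, since the mapped characters are unaffected by the diacritic removal.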
Alternatively, depending on your use case, you may be able to simply remove the unsupported characters from your resulting string. This could be the case, for example, when generating an alphanumeric code.
Below is a modified version of the earlier example that removes non-ASCII characters after calling the RemoveDiacritics extension method.
    string diacriticCharacterText = "ÆßðáèîõüЌ؜";

    string plainCharacterText = new string(diacriticCharacterText
        .RemoveDiacritics()
        .Where(char.IsAscii)
        .ToArray());

    Console.WriteLine(plainCharacterText); // Output: aeiou
The above code results in the same output as the earlier example.
The LINQ Where extension method is used in combination with the Char.IsAscii method (available in .NET 6 and later) to filter out non-ASCII characters. The LINQ ToArray extension method is then used to convert the results to an array of characters that can be passed to the String constructor.
Summary
In this post, I covered how to remove accents (diacritics) from strings using C#.
I started by explaining what diacritics are and I provided examples of the many possible types of diacritics.
I then provided a code solution in the form of a string extension method that can strip diacritics from a specified string and return a plain text string as the result.
After showing examples of calling the extension method and testing it, I wrapped things up by considering the options you have when dealing with more complex characters that are used in some languages.