Listing Unicode (UTF-8) Filenames – Michael's Software Support Forum

Tagged: languages, Unicode, UTF-8

This topic has 4 replies, 2 voices, and was last updated 11 years, 10 months ago by Michael Gilkes.

Viewing 5 posts - 1 through 5 (of 5 total)

Author

Posts
August 10, 2011 at 3:21 pm #145
Michael Gilkes
Keymaster
Hi Everyone,

There is an ongoing issue with both my listing extensions (Easy Folder Listing module and Easy Folder Listing Pro plugin). Technically, the problem is not with the extensions themselves, but rather with how PHP handles multi-byte Unicode characters in filenames read from the file system. I decided to start this topic in an attempt to summarize and explain the situation, and hopefully solicit some feedback from everyone. My goal is to make my listing extensions the best that they can be so that they will help people accomplish what they need.

Here’s the situation:

So far, people have contacted me saying that they are trying to display filenames that use characters from the following languages: Turkish, Swedish, Greek, Chinese and German (to name a few). When these files are listed, they see a � character on their webpage. Often there are 2 or 3 of these symbols where one character is supposed to be. Other than �, some people have seen 1 or 2 extra (accented) characters where there were none before. For example, instead of seeing ä, you would see aÌˆ.

How I initially handled it:

When people told me this, I first tried to replicate the issue. I asked a few of them to email me blank files with filenames that include these accented characters. I also got some files with Chinese filenames. In my first attempt, the listing showed perfectly on my Mac running MAMP and using Firefox/Safari/Opera/Chrome. However, if I removed the UTF-8 charset meta setting, it would show with the errors above. So, my first solution, which is in the current version of the plugin, is to force the UTF-8 charset in either the header or the meta tag. After I did this, people still told me there was an issue, which made sense since Joomla typically sets the charset to UTF-8 anyway. So, I decided to do some more experiments and some research online.

Here’s what’s actually happening:

After the research and experiments with PHP, I realized that the issue had to do with PHP and the character encoding of the file system. Here is a list of some of the sites I visited:
I also wrote up some PHP tests on my own to verify and attempt to find a solution to the problem. What I found was that the file system is storing the filenames in a multi-byte format. It could be UTF-8 or UTF-16 (ISO-8859-1, by contrast, is a single-byte encoding). I attempted to use the PHP function utf8_decode() and also a host of mb_ and iconv_ functions to deal with the situation, but I haven’t been successful yet. However, I have learned quite a bit. The main issue is that, natively, PHP treats strings as sequences of single bytes, and multi-lingual/Unicode text contains characters that take 2, 3 or even 4 bytes in UTF-8. I’ll share 2 examples with you:

Example 1:

Let’s say we are trying to show the following file name: åäö.txt. What you find is that the listing will show: aÌŠaÌˆoÌˆ.txt instead. This happens because of how it is encoded. To explain, if you were to run strlen on åäö.txt, it would give you a length of 13. When I saw that, I was like, “What the?!?”. Shouldn’t that be only 7 characters?! But then I realized that strlen counts bytes, and each accent (ring or umlaut) takes up 2 extra bytes. Since there are 3 of them, that would be 7 + 6 = 13 bytes. So, then I looked up how to handle multi-byte strings in PHP, and I tried mb_strlen (as well as iconv_strlen) instead, specifying ‘UTF-8’ as the encoding, but the string length that was returned was 10. That told me that each accent is now being counted as a single character of its own, although it is 2 bytes long.

So, what I did next was to convert the string to HEX, and what I found was that åäö.txt is actually 61cc8a61cc886fcc882e747874, where:
- 61 = a – lower case a
- cc8a = ̊ – combining ring above
- 61 = a – lower case a
- cc88 = ̈ – combining diaeresis
- 6f = o – lower case o
- cc88 = ̈ – combining diaeresis
- 2e = . – full stop
- 74 = t – lower case t
- 78 = x – lower case x
- 74 = t – lower case t
Now, if you count the bytes, you will see that it is indeed 13 bytes used to make 10 characters, where the umlauts are counted as their own characters.
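
The numbers above can be reproduced with a few lines of PHP. This is only a minimal sketch: it assumes PHP 7 or later (for the "\u{...}" escapes) and builds the decomposed filename by hand instead of reading it from a real file system. The variable name $name is just for illustration.

```php
<?php
// The decomposed filename from Example 1, written with explicit code point
// escapes so that the encoding of this source file does not matter.
// a + combining ring, a + combining diaeresis, o + combining diaeresis.
$name = "a\u{030A}a\u{0308}o\u{0308}.txt";

// strlen counts bytes, not characters.
echo strlen($name), "\n";                  // 13

// mb_strlen counts code points; it needs the mbstring extension.
if (function_exists('mb_strlen')) {
    echo mb_strlen($name, 'UTF-8'), "\n";  // 10
}

// bin2hex shows the raw bytes that are actually stored.
echo bin2hex($name), "\n";                 // 61cc8a61cc886fcc882e747874
```

The gap between 13 and 10 is exactly the three combining marks, each of which is one code point stored in two bytes.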

So, now we have isolated two issues: one is the fact that it is a multi-byte string, and the second is that, although we can figure that out, there doesn’t seem to be an obvious way to combine the umlauts with the preceding characters.

Possible Solution?

After finding out about combining umlauts, I did some more research, and realized that Unicode can be represented in different ways: Normalization Form D (NFD) – Canonical decomposition, Normalization Form C (NFC) – Canonical decomposition, followed by Canonical composition, and two other ways (NFKD and NFKC). Basically what this means is that in NFD, you have the ä represented as a HEX 61 (U+0061) followed by ̈ HEX cc88 (U+0308), whereas in NFC, it is represented as a single 2-byte character ä with HEX c3a4 (U+00E4).

So, I wondered how I could accomplish this in PHP. I did a little more research and learned about PHP’s Normalizer class. Its normalize function converts a string between these forms, so it can combine a letter with a following umlaut and give you the single-character equivalent. The only problem is that the Normalizer class is not available in every PHP installation, so I can’t rely on it. It is part of the intl extension (available from PECL, and bundled with PHP since version 5.3), which needs to be installed and enabled on the server.
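
As a rough sketch of how that could look, assuming the intl extension is loaded: the helper name to_nfc is hypothetical and is not part of the extensions. It falls back to returning the name unchanged when Normalizer is missing, which is the situation described above.

```php
<?php
// Hypothetical helper: return the NFC (composed) form of a filename when the
// intl extension is available, otherwise return the name unchanged.
function to_nfc(string $name): string
{
    if (!class_exists('Normalizer')) {
        return $name; // intl is not loaded, so no normalization is possible
    }
    // normalize() returns false on failure, e.g. for invalid UTF-8 input.
    $nfc = Normalizer::normalize($name, Normalizer::FORM_C);
    return $nfc === false ? $name : $nfc;
}

$nfd = "a\u{030A}a\u{0308}o\u{0308}.txt"; // decomposed name from Example 1
$nfc = to_nfc($nfd);

echo bin2hex($nfd), "\n"; // 61cc8a61cc886fcc882e747874
echo bin2hex($nfc), "\n"; // with intl loaded: c3a5c3a4c3b62e747874
```

With intl loaded, the 13-byte decomposed name becomes a 10-byte composed name, where each accented letter is a single 2-byte character.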

Example 2:

The second example deals with trying to display some Chinese characters. The text I used was 汉语.txt. strlen on it produced a length of 10, and mb_strlen, with the encoding set to UTF-8, produced a length of 6. At first it seemed that the multi-byte functions were wrong this time, but both counts are consistent with the HEX of the string, which is e6b189e8afad2e747874 and can be separated as follows:
- e6b189 = 汉
- e8afad = 语
- 2e = . – full stop
- 74 = t – lower case t
- 78 = x – lower case x
- 74 = t – lower case t
As you can see, the 2 Chinese characters are 3 bytes each, so that is 6 + 4 = 10 bytes for 6 characters. Even so, they weren’t being converted or shown properly, and in my tests using the multi-byte PHP functions seemed to make it worse.
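
The breakdown above can be produced with a short script. This is a sketch that assumes PHP 7 or later; the helper name hex_per_character is hypothetical. It uses preg_split with the "u" modifier, which is part of core PHP, so it does not depend on the mbstring extension.

```php
<?php
// Hypothetical helper: split a UTF-8 string into characters (code points)
// and return the hex bytes of each one.
function hex_per_character(string $utf8): array
{
    // The "u" modifier makes PCRE treat the subject as UTF-8, so the empty
    // pattern splits between code points rather than between bytes.
    $chars = preg_split('//u', $utf8, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return []; // the input was not valid UTF-8
    }
    return array_map('bin2hex', $chars);
}

$name = "\u{6C49}\u{8BED}.txt"; // the filename from Example 2

echo strlen($name), "\n";                           // 10
echo implode(' ', hex_per_character($name)), "\n";  // e6b189 e8afad 2e 74 78 74
```

Six entries come back for ten bytes, which matches the 6 reported by mb_strlen.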

Conclusion:

So far, I believe I have isolated the problem. However, the solution is unclear. From what I have researched so far, it seems to be a PHP issue. There really isn’t anything that my extensions are doing to the filenames that is causing it. However, I will be looking out for a workaround or an eventual solution. The Normalizer functions seem to be a possible solution, but they are not available in every PHP installation, so I can’t really rely on them. The representation of Unicode on the server file system seems to be the key. PHP seems not to be totally Unicode-friendly, but I will still keep my eyes open for a solution.

So, I hope this helps everyone to understand the situation. If you learn something new or some possible solution that you think will or has fixed this, please post it here. Thanks.
August 10, 2011 at 7:13 pm #1227
Michael Gilkes
Keymaster
As a note to my explanation above, this problem only seems to exist where the file system encoding is not UTF-8. If the internal encoding of the filename is UTF-8, then everything will show fine. For example, my MAMP server uses ISO-8859-1, and once the file was created in my file system it showed fine. I think that is because ISO-8859-1 uses precomposed characters. It would actually be good if persons with this problem were able to share what the encoding of their file system is. You can get a hint by running mb_detect_encoding on a filename; note that mb_internal_encoding only reports PHP’s own internal encoding setting, not that of the file system.
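
For anyone who wants to report what their server is storing, here is a small sketch of a check. The function name describe_filename and the sample names are hypothetical, and the normalization part only works if the intl extension is loaded. It does not detect the file system encoding as such; it only tells you whether a given name is valid UTF-8 and whether it is in composed (NFC) form.

```php
<?php
// Hypothetical helper: classify how a filename is stored.
function describe_filename(string $name): string
{
    // preg_match with the "u" modifier returns false for invalid UTF-8.
    if (preg_match('//u', $name) !== 1) {
        return 'not valid UTF-8';
    }
    if (!class_exists('Normalizer')) {
        return 'valid UTF-8, normalization form unknown (intl not loaded)';
    }
    return Normalizer::isNormalized($name, Normalizer::FORM_C)
        ? 'valid UTF-8, NFC'
        : 'valid UTF-8, not NFC';
}

// Sample names standing in for entries returned by scandir() or readdir().
$samples = [
    "\xE4.txt",       // ä as a single ISO-8859-1 byte
    "a\u{0308}.txt",  // a + combining diaeresis (decomposed)
    "\u{00E4}.txt",   // precomposed ä in UTF-8
];

foreach ($samples as $sample) {
    echo bin2hex($sample), ' => ', describe_filename($sample), "\n";
}
```

Posting the hex output along with the result for a problem file would make it much easier to see which of the cases above applies.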

For more information on this, check out these links:
September 1, 2011 at 3:47 pm #1228
Michael Gilkes
Keymaster
Update:

Hi everyone!

I am still working on this issue. Here are two of the most recent links that I am looking at:
I invite comments and suggestions from others who have been working with this.
November 5, 2012 at 10:41 am #1229

Germansailor
Participant

Hi Michael,

For a few days now I have been using your plugin – a very, very nice tool! 🙂

But now I have a problem with German umlauts. The files are shown correctly, but the download fails.

The last entry in this thread is 1 year old, so could you please tell us more about a solution to the problem? Or is this problem unresolvable?

November 5, 2012 at 1:06 pm #1230

Michael Gilkes
Keymaster

@Germansailor. Thanks for posting. I have actually been aware of this issue for a few weeks. One of the things I noticed during my investigation is that the servers with this issue also give an error when you copy the exact link and paste it directly into the address bar. I *believe* this has to do with how the web server itself is configured to handle UTF-8 filenames. However, as I am not an expert web server admin, I cannot say definitively.

Please know that I am working on it, and if anyone can help with the troubleshooting aspect of it, I welcome your input. The problem for me is that the code works as expected on my servers, even when I use the files that my customers send to me. So, it is difficult for me to ascertain the exact problem when I have not been able to replicate the same environment in which the problem is seen.
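
One thing that may be worth checking while troubleshooting is what the download link looks like once it is percent-encoded. The sketch below is not the plugin's code and not a confirmed cause of the download errors; the helper name encode_path and the files/ path are hypothetical. It only shows that the composed and decomposed forms of the same visible name produce different URLs, so a link built from one form will not match a file stored under the other.

```php
<?php
// Hypothetical helper: percent-encode each path segment, keeping the slashes.
function encode_path(string $path): string
{
    return implode('/', array_map('rawurlencode', explode('/', $path)));
}

// Precomposed ä (NFC): one 2-byte character.
echo encode_path("files/\u{00E4}.txt"), "\n";   // files/%C3%A4.txt

// a + combining diaeresis (NFD): looks the same on screen, different bytes.
echo encode_path("files/a\u{0308}.txt"), "\n";  // files/a%CC%88.txt
```

Comparing the encoded link from the listing with the bytes of the real filename on the server would show whether the two agree.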