Encoding problems (cyrillic) for importing data with CSV Import

adrian · August 4, 2016, 11:49am

The Exploratory interface does not show Russian (Cyrillic) characters:

both in headers and in columns
in summary view, in table view, in command editor

The Exploratory script works fine in RStudio and in plain R with the same locale settings.
The RStudio data viewer shows the column names just fine.

The data frame was loaded from an Excel file.

But the same problem remains when the data frame is loaded from CSV via a script with the correct encoding (“WINDOWS-1251”):

read_delim("E:\\_other\\Playground\\Data Samples\\abon.txt" , "\t", quote = "\"", skip = 0 , col_names = TRUE , na = c("","NA"), n_max=-1 , locale=locale(encoding = "WINDOWS-1251", decimal_mark = ".") , progress = FALSE)

When you export data via “save as csv”, the file is saved in UTF-8 with correct headers and content.
After reimporting that CSV with “UTF-8” encoding, the same problem appears in the interface.

So it looks like the problem is mainly related to interface, not to R or its settings.

The same problem exists in the Mac version.

Another problem: Cyrillic file names do not work in the import dialog in the Windows version:

as well as for custom script import:

#Data Analysis Steps
xx <- read_delim("E:\\_other\\Playground\\Data Samples\\Абоненты.txt" , "\t", quote = "\"", skip = 0 , col_names = TRUE , na = c("","NA"), n_max=-1 , locale=locale(encoding = "WINDOWS-1251", decimal_mark = ".") , progress = FALSE) %>%
exploratory::clean_data_frame()

while the same script works fine in RStudio.

In the Mac version, import from files with Cyrillic names works fine.

“R version 3.3.1 (2016-06-21)”,
“Platform: x86_64-w64-mingw32/x64 (64-bit)”,
“Running under: Windows 7 x64 (build 7601) Service Pack 1”,
“”,
“locale:”,
"[1] LC_COLLATE=Russian_Russia.1251 LC_CTYPE=Russian_Russia.1251 ",
"[3] LC_MONETARY=Russian_Russia.1251 LC_NUMERIC=C ",
"[5] LC_TIME=Russian_Russia.1251 ",

Kan_Nishida · August 4, 2016, 4:17pm

Hi Adrian, can you share a sample data file with cyrillic characters for us?

Kan_Nishida · August 4, 2016, 5:27pm

I’ve just tested a sample file with Cyrillic letters in IBM866 encoding, and this seems to work. Can you try this with a custom R script?

df <- read_csv("/Users/kannishida/Dropbox/Data/DT_DemoDialogue.csv", locale = locale(encoding = "IBM866"))

Once confirmed, we’ll add additional encoding types to the UI.

Kan_Nishida · August 4, 2016, 5:28pm

For our testing, it would be still great if you can give us sample data with Cyrillic letters. Thanks!

adrian · August 4, 2016, 7:33pm

Here it is

I will try your file tomorrow.

adrian · August 4, 2016, 8:14pm

I just tested Абоненты.txt on my mac with custom import script:
# Data Analysis Steps
read_delim("/Dropbox/Playgrounds/Test Data/Data Samples/Абоненты.txt" , "\t", quote = "\"", skip = 0 , col_names = TRUE , na = c("","NA"), n_max=-1 , locale=locale(encoding = "WINDOWS-1251", decimal_mark = ".") , progress = FALSE)

and it works perfectly: the Cyrillic-named file Абоненты.txt is imported, all Cyrillic headers are readable, and transformation steps work well with Cyrillic headers.

So the problem is only with the Windows version / Russian locale.

sorry for wrong initial info

my mac diagnostic specs
“R version 3.3.0 (2016-05-03)”,
“Platform: x86_64-apple-darwin13.4.0 (64-bit)”,
“”,
“locale:”,
“[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8”:

Kan_Nishida · August 5, 2016, 1:12am

Thanks for confirming, Adrian!

We have added additional encoding types in the data import dialogs, so you should be able to select the one you like. This will be in the next release, we’ll be releasing early next week.

For the Windows issue, we had very similar issue with Japanese characters as well. This seems to be some issue with R, it’s also an issue in RStudio as well. We haven’t found a concrete solution for addressing this for Windows yet, but will keep looking into it, stay tuned!

Kei_Saito · August 5, 2016, 4:14am

Hi Adrian. I’m Kei from Exploratory team.

As Kan mentioned, we added a lot of new encoding options including Cyrillic ones, and we also confirmed we could read your data. It will be available in the next release.

Thanks!
–Kei

adrian · August 8, 2016, 11:21am

Thanks, Kei

Will it work only on Mac, or will the Windows version work with Cyrillic characters as well?

Kei_Saito · August 8, 2016, 4:30pm

Adrian.

The encoding option itself works on windows too. But we have another issue on viewing Cyrillic characters as you mentioned and we are working on it.

Thanks
-Kei

adrian · August 9, 2016, 3:12pm

Ok, Kei, thank you for clarification.

This issue is a real dealbreaker for me: almost all my data has Cyrillic characters somewhere, and I use Windows for analysis.
Maybe my ‘experiments’ below will help to resolve the issue, or at least to find some workaround.

It does not look like an issue with R itself (at least in 3.3); R/Rgui and RStudio work fine in my environment:
Rgui:

RStudio:

but Exploratory headers:

Exploratory data:

And please note that export to CSV from Exploratory works well: the file is re-encoded to UTF-8 and can be opened without problems in any other program that supports UTF-8 (e.g. Notepad, Notepad++), with all characters readable.

Both RStudio and Exploratory are portable installations that communicate with the same portable R instance, with the same system locale.

I think the issue lies in the way R communicates with Exploratory versus RStudio / RGui: the system locale used by the frontend, and the problems with UTF-8 encoding on Windows.

As far as I understand, modern R uses UTF-8, but the system locale Russian_Russia.1251 is ANSI-based.

There is no way you can set UTF-8 locale for Windows R - Sys.setlocale("LC_ALL", 'en_US.UTF-8') or Sys.setlocale("LC_ALL", 'ru_RU.UTF-8') does not work.

UTF-8 is codepage 65001
https://msdn.microsoft.com/en-us/library/windows/desktop/dd317756(v=vs.85).aspx

But Sys.setlocale("LC_ALL", 'Russian_Russia.65001') and everything else with 65001 does not work in R.

Only ANSI-based code pages, e.g. 1251 or 1252, can be set in R on Windows:
Sys.setlocale("LC_ALL", 'Russian_Russia.1251') works
Sys.setlocale("LC_ALL", 'Russian_Russia.1252') works
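
For anyone who wants to repeat the locale checks above on their own machine, here is a minimal base R sketch. The helper name `try_locale` is made up for illustration; it relies only on the documented behaviour that `Sys.setlocale` returns an empty string (with a warning) when the request cannot be honoured. Which locale names succeed depends on the operating system and R version, so no particular result is implied.

```r
# Hypothetical helper: report whether a locale name is accepted for LC_CTYPE,
# restoring the previous setting afterwards.
try_locale <- function(loc) {
  old <- Sys.getlocale("LC_CTYPE")
  on.exit(Sys.setlocale("LC_CTYPE", old), add = TRUE)
  # Sys.setlocale() warns and returns "" when the locale cannot be set.
  res <- suppressWarnings(Sys.setlocale("LC_CTYPE", loc))
  nzchar(res)
}

# Candidate names taken from this thread; results vary by platform.
candidates <- c("Russian_Russia.1251", "Russian_Russia.65001", "ru_RU.UTF-8")
accepted <- vapply(candidates, try_locale, logical(1))
print(accepted)
```

`l10n_info()` can also be used to see whether the current session considers itself to be running in a UTF-8 locale.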

But somehow using iconv I was able to see data properly:

colnames() %>% iconv(to = "cp65001", from = 'UTF-8') %>% data.frame()

It is strange to me, but somehow it works.
iconv(to = 'cp65001', from = 'cp65001') also gives the correct result,
iconv(to = 'UTF-8', from = 'UTF-8') - unreadable symbols
iconv(to = 'UTF-8') - unreadable symbols
iconv(to = 'UTF-8' , from = 'cp1251') - unreadable symbols

However, when the result of colnames() %>% iconv(to = "cp65001", from = 'UTF-8') %>% data.frame()
is exported to CSV, that CSV is not readable at all (the file opens in Notepad++, but its content is unreadable).

Maybe I can use an iconv step to re-encode the whole data frame, with headers and data, and then re-encode it back if I need to export? (At least as a short-term workaround.)
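
A minimal sketch of that re-encoding idea in base R, assuming the data is UTF-8 and the target is Windows-1251. The helper name `reencode_df` and the sample data are hypothetical, and this only shows the round trip; whether it changes what Exploratory displays is untested.

```r
# Hypothetical helper: convert column names and character columns of a
# data frame from one encoding to another using base R's iconv().
reencode_df <- function(df, from, to) {
  for (i in seq_along(df)) {
    if (is.character(df[[i]])) {
      df[[i]] <- iconv(df[[i]], from = from, to = to)
    }
  }
  names(df) <- iconv(names(df), from = from, to = to)
  df
}

# Sample data written with \u escapes so the script itself stays ASCII.
df <- data.frame(x = c("\u0410\u0431\u043e\u043d\u0435\u043d\u0442", "abc"),
                 stringsAsFactors = FALSE)
names(df) <- "\u041a\u043e\u0434"

cp   <- reencode_df(df, from = "UTF-8", to = "CP1251")   # for viewing
back <- reencode_df(cp, from = "CP1251", to = "UTF-8")   # before export
```

Non-character columns (numbers, dates) are left alone; factor levels would need the same treatment via `levels()`.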

Thank you,
Adrian

Kan_Nishida · August 10, 2016, 10:56pm

Hi Adrian,

We’re having similar issue with Japanese as well, and trying to figure out how to support this on Windows. We’ll continue looking into it, and keep you updated. Sorry for the inconvenience!!

khanhdinh · August 22, 2016, 10:07am

Hello,

I am having a problem with Vietnamese characters as well. As in Adrian's case, the Mac version works like a charm; only the Windows version has difficulties.

Thanks for your support!

Kan_Nishida · August 22, 2016, 2:54pm

We think we have figured out the root cause, still need some time for testing, but will keep you guys updated, stay tuned!

Hideaki_Hayashi · August 23, 2016, 6:04am

Hi,
Here is a new build of Exploratory for windows with some fixes on the communication between R and Exploratory’s frontend. (Yes, you were right, Adrian!)

https://download2.exploratory.io/windows/Exploratory_1.9.0.2_WIN.zip

We can read multibyte or non-latin1 character data with this build. (So far I tried Japanese csv data in UTF-8 and Cyrillic csv data in Windows-1251.)
But we are still seeing some problems in two areas, and working on them.

Commands that have multibyte/non-latin1 literals do not work in most cases, e.g. commands like mutate(col1 = `Код`) or filter(col2 == "Код") do not give the expected results.
Column name or column value suggestions in the command input sometimes turn into unreadable characters.
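
As a side note on the first item: in plain R, a string literal can be written with `\u` escapes so that it does not depend on how the script text is encoded on its way to R. The sketch below uses base R only, with a made-up data frame and the column name `col2` borrowed from the example above. It is an assumption that this would help inside Exploratory; it has not been tested there.

```r
# "\u041a\u043e\u0434" spells the same three letters as the literal in the
# filter() example above, but the source text is pure ASCII.
kod <- "\u041a\u043e\u0434"

# Hypothetical sample data with one matching row.
df <- data.frame(col2 = c(kod, "abc"), n = c(1, 2), stringsAsFactors = FALSE)

# Base R equivalent of filter(col2 == <literal>).
matches <- df[df$col2 == kod, , drop = FALSE]
print(nrow(matches))
```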

Let us know if this fix works for you.
We will keep you updated as we make more progress.
Thank you!

khanhdinh · August 23, 2016, 7:49am

Many thanks Hideaki!!! The new version is working extremely well on Vietnamese characters.
Thanks thanks thanks

Hideaki_Hayashi · August 23, 2016, 8:34am

Happy to hear that! Thank you!

khanhdinh · August 23, 2016, 9:37am

Hello,

Unfortunately, Exploratory does not recognize the headers in function calls after importing them successfully:
Not recognizing headers

Thanks a lot,
Khanh

Hideaki_Hayashi · August 23, 2016, 7:18pm

Hi Khanh,
Is it possible to share with us a data file with which this issue happens?
Thank you,

khanhdinh · August 24, 2016, 8:58am

Here you go: