The thorn characters (Þ, þ) are largely unknown outside of the modern Icelandic alphabet. While unusual, they do serve an interesting purpose. In the CSV format, comma-separated (,) values are the standard. Unfortunately, commas pose a variety of challenges, especially when large numbers use the comma as a thousands separator.
name,balance
john,10,000
david,1,000,000
jane,100
In this example, it is not obvious where the columns should be split. This is where the thorn character comes into play: because it so rarely appears in ordinary data, it ensures maximum differentiation between columns.
nameþbalance
johnþ10,000
davidþ1,000,000
janeþ100
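To make the ambiguity concrete, here is a minimal Python sketch that counts the fields produced by a naive split on each delimiter. The variable names are purely illustrative, and the rows are the two examples above typed in as string literals.

```python
# Rows from the two examples above, as plain strings.
comma_rows = ["name,balance", "john,10,000", "david,1,000,000", "jane,100"]
thorn_rows = ["nameþbalance", "johnþ10,000", "davidþ1,000,000", "janeþ100"]

# Count how many fields a naive split produces for each row.
comma_counts = [len(row.split(",")) for row in comma_rows]
thorn_counts = [len(row.split("þ")) for row in thorn_rows]

print(comma_counts)  # [2, 3, 4, 2] -> inconsistent column counts
print(thorn_counts)  # [2, 2, 2, 2] -> every row has two columns
```

With the comma, each row claims a different number of columns. With the thorn, every row splits cleanly into two.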
While this change is great for data integrity, it comes with a few quirks. For one, thorn-delimited files typically come from the ISO-8859-1 (Latin-1) encoding, and Latin-1 is not byte-compatible with UTF-8 (the modern standard): the two only agree on the first 128 characters. This becomes obvious once you try to parse a CSV file with a thorn delimiter.
The underlying problem is byte size. The thorn's code point is above 127 (it is 254, or 0xFE), so it cannot be represented as a single-byte UTF-8 character; UTF-8 needs two bytes for it. Because of this, when you try to parse a thorn-delimited CSV you'll often get the entire CSV in one dataframe column, which is not the intended goal.
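The byte-size difference is easy to check directly. The following is a small Python sketch (variable names are illustrative) that encodes the lowercase thorn in both encodings and shows what happens when the UTF-8 bytes are misread as Latin-1.

```python
thorn = "þ"

code_point = ord(thorn)                 # 254, i.e. 0xFE, above the 127 ASCII limit
latin1_bytes = thorn.encode("latin-1")  # one byte: b'\xfe'
utf8_bytes = thorn.encode("utf-8")      # two bytes: b'\xc3\xbe'

# Reading the UTF-8 bytes with the wrong encoding yields two unrelated characters.
misread = utf8_bytes.decode("latin-1")  # 'Ã¾'

print(code_point, latin1_bytes, utf8_bytes, misread)
```

A parser that assumes one encoding while the file is stored in the other will therefore never see a single, matching delimiter byte.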
The following data is what I used - thorn_test.txt.
first_nameþlast_nameþsalary
jamesþbondþ100000000000
ronaldþmcdonaldþ1010202
lebronþjamesþ203030303
helmutþnewtonþ2374843
Parsing Thorn Delimiter in R
Sadly, I haven’t uncovered a straightforward way to accomplish this task in R. My hope is that putting this out will elicit a more elegant solution, or worst case, it will continue to bug me until I uncover one. Either way, the following attempt does work.
library(tibble)
library(tidyr)

# Stripping character on entry
# Replacing with comma
file <- gsub('þ', ',', readLines('~/thorn_test.txt'))
# first_name,last_name,salary
# james,bond,100000000000
# ronald,mcdonald,1010202
# lebron,james,203030303
# helmut,newton,2374843

# Convert to tibble
df <- tibble::as_tibble(file)
# A tibble: 5 x 1
# value
# <chr>
# first_name,last_name,salary
# james,bond,100000000000
# ronald,mcdonald,1010202
# lebron,james,203030303
# helmut,newton,2374843

# Use tidyr::separate() -> to create new columns
df_final <- df %>%
  separate(value, c('first_name', 'last_name', 'salary'), sep=',')
# A tibble: 5 x 3
# first_name last_name salary
# <chr>      <chr>     <chr>
# first_name last_name salary
# james      bond      100000000000
# ronald     mcdonald  1010202
# lebron     james     203030303
# helmut     newton    2374843
The final version is certainly manageable. The two most obvious pain points are: (1) the column data types, all characters - no bueno, and (2) the fact that our column headers end up as the first row. Neither of these problems is massive, but they are annoying enough that a better solution would be welcome. In this pursuit I also tried using readr's read_csv() function with the encoding switched to ISO-8859-1, however that yielded a nasty ¾ character in the thorn's place. That leads to gsub() being mapped over every column, thus running into many of the same issues.
Parsing Thorn Delimiter in Python
In Python, the solution is far simpler.
import pandas
pandas.read_csv('thorn_test.txt', delimiter='\xfe', engine='python')
The first notable detail is the delimiter value: '\xfe' is the escape for the Unicode code point of the lowercase thorn character. Surprisingly, I haven't been able to get this value to work in R, and I'm not sure why. Secondly, we must specify the parsing engine as Python. This decision is driven by the fact, as mentioned before, that the thorn character is larger than one byte in UTF-8. The C parsing engine, the default for pandas, does not support this type of separator. Thus, if you try it without the engine specified as Python you will get a warning, and the underlying engine will be switched anyway.
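For comparison, the same data can also be read with only Python's standard library, since the csv module accepts any single character as a delimiter. This is a sketch under assumptions rather than a recommendation: it assumes the file is stored as Latin-1, and it writes the sample rows to a temporary file (a stand-in for thorn_test.txt) so that it is self-contained.

```python
import csv
import os
import tempfile

# Sample rows copied from thorn_test.txt above.
sample = (
    "first_nameþlast_nameþsalary\n"
    "jamesþbondþ100000000000\n"
    "ronaldþmcdonaldþ1010202\n"
    "lebronþjamesþ203030303\n"
    "helmutþnewtonþ2374843\n"
)

# Assumption: the file is Latin-1 on disk, so each thorn is the single byte 0xFE.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", encoding="latin-1", newline="", delete=False
) as handle:
    handle.write(sample)
    path = handle.name

try:
    # Decode with the matching encoding and split on the thorn character.
    with open(path, encoding="latin-1", newline="") as source:
        reader = csv.DictReader(source, delimiter="þ")
        # The header row becomes the dict keys; salary is converted to an integer.
        rows = [{**row, "salary": int(row["salary"])} for row in reader]
finally:
    os.remove(path)

print(rows[0])
```

Because the header row is consumed as field names and the salary is converted explicitly, this avoids both pain points from the R attempt, at the cost of not producing a dataframe.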