Parsing Thorn Delimiters In R & Python

Kade Killary · 2018.03.23 · 3 minutes until it's over

The thorn characters - Þ, þ - are largely unknown outside of the modern Icelandic alphabet. While unusual, they do serve an interesting purpose. Within the CSV format, comma - , - separated values are the standard. Unfortunately, commas pose a variety of challenges, especially when large numbers use commas as thousands separators.

name,balance
john,10,000
david,1,000,000
jane,100

In this example, it is not obvious where the columns should be split. This is where the thorn character comes into play: since it rarely shows up in real data, it can ensure maximum differentiation between columns.

nameþbalance
johnþ10,000
davidþ1,000,000
janeþ100
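To make the ambiguity concrete, here is a minimal sketch that splits the two versions of the sample rows above and counts the resulting fields. The variable names are only for illustration.

```python
# The same rows, once comma-delimited and once thorn-delimited.
comma_rows = ["name,balance", "john,10,000", "david,1,000,000", "jane,100"]
thorn_rows = ["nameþbalance", "johnþ10,000", "davidþ1,000,000", "janeþ100"]

# Count how many fields each row yields under a naive split.
comma_counts = [len(row.split(",")) for row in comma_rows]
thorn_counts = [len(row.split("þ")) for row in thorn_rows]

print(comma_counts)  # [2, 3, 4, 2] - inconsistent column counts
print(thorn_counts)  # [2, 2, 2, 2] - every row has two columns
```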

While this change is great for data integrity, it comes with a few quirks. For one, the thorn character comes from the ISO-8859-1 (Latin-1) encoding, where it is stored as a single byte. Latin-1 bytes above 127 are not valid on their own in UTF-8 (the modern standard), so the two encodings disagree on how to store it. This becomes obvious once you try to parse a CSV file with a thorn delimiter.

The underlying problem is byte size. The thorn character’s code point is above 127 (þ is 254, or 0xFE), so UTF-8 cannot represent it in a single byte and uses two instead. Due to this, when you try to parse a thorn-delimited CSV you’ll often get the entire CSV in one dataframe column - not the intended goal.
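The byte-size difference is easy to check in Python. This is a small sketch using only built-in string methods; it shows the lowercase thorn as one byte in Latin-1 and two bytes in UTF-8, and that the lone Latin-1 byte is not valid UTF-8.

```python
thorn = "þ"  # U+00FE, LATIN SMALL LETTER THORN

print(ord(thorn))  # 254, i.e. above the 0-127 ASCII range

latin1_bytes = thorn.encode("latin-1")
utf8_bytes = thorn.encode("utf-8")
print(latin1_bytes)  # b'\xfe'     - one byte in Latin-1
print(utf8_bytes)    # b'\xc3\xbe' - two bytes in UTF-8

# The single Latin-1 byte cannot be decoded as UTF-8.
try:
    latin1_bytes.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False
print(decodes_as_utf8)  # False
```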

Example Data

The following data is what I used - thorn_test.txt.

first_nameþlast_nameþsalary
jamesþbondþ100000000000
ronaldþmcdonaldþ1010202
lebronþjamesþ203030303
helmutþnewtonþ2374843

Parsing Thorn Delimiter in R

Sadly, I haven’t uncovered a straightforward way to accomplish this task in R. My hope is that putting this out will elicit a more elegant solution, or worst case, it will continue to bug me until I uncover one. Either way, the following attempt does work.

library(tibble)
library(tidyr)

# Stripping character on entry
# Replacing with comma
file <- gsub('þ', ',', readLines('~/thorn_test.txt'))

# first_name,last_name,salary
# james,bond,100000000000
# ronald,mcdonald,1010202
# lebron,james,203030303
# helmut,newton,2374843

# Convert to tibble
df <- tibble::as_tibble(file)

# A tibble: 5 x 1
# value
# <chr>
# first_name,last_name,salary
# james,bond,100000000000
# ronald,mcdonald,1010202
# lebron,james,203030303
# helmut,newton,2374843

# Use tidyr::separate() -> to create new columns
df_final <- df %>%
    separate(value, c('first_name', 'last_name', 'salary'), sep=',')

# A tibble: 5 x 3
# first_name last_name salary
# <chr>      <chr>     <chr>
# first_name last_name salary
# james      bond      100000000000
# ronald     mcdonald  1010202
# lebron     james     203030303
# helmut     newton    2374843

The final version is certainly manageable. The two most obvious pain points are: (1) the column data types, all characters - no bueno, and (2) the fact that our column headers are the first row. Neither one of these problems is massive, but they are annoying enough that a better solution would be welcome. In this pursuit I also tried using readr’s read_csv() function with the encoding switched to ISO-8859-1, however that yielded a nasty ISO-8859-1 3/4 (¾) character in its place. Working around that means mapping gsub() over every column with apply(), thus running into many of the same issues.
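One plausible explanation for the stray 3/4 character, assuming the file on disk was actually saved as UTF-8 (the post does not say), is that the two UTF-8 bytes of the thorn were each read as a separate Latin-1 character. A short Python sketch reproduces that effect:

```python
# Assumption: the file is UTF-8 on disk but gets read as ISO-8859-1.
utf8_bytes = "jamesþbond".encode("utf-8")

# Decoding the UTF-8 bytes as Latin-1 turns the two-byte thorn
# (0xC3 0xBE) into two separate characters: Ã and ¾.
misread = utf8_bytes.decode("latin-1")
print(misread)  # jamesÃ¾bond
```

If that is what happened, reading the file as UTF-8 rather than ISO-8859-1 would be the thing to try first.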

Parsing Thorn Delimiter in Python

In Python, the solution is far simpler.

import pandas

pandas.read_csv('thorn_test.txt', delimiter='\xfe', engine='python')

The first notable item is the delimiter value - '\xfe' is the escape for U+00FE, the Unicode code point of the lowercase thorn character. Surprisingly, for whatever reason, I haven’t been able to get this value to work in R - not sure why. Secondly, we must specify the parsing engine as Python. This decision is driven by the fact, as mentioned before, that the thorn character is larger than one byte in UTF-8. The C parsing engine, the default for pandas, does not support these types of separators. Thus if you try it without the engine specified as Python you will get a warning, and the underlying engine will be switched anyway.
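If pandas is not available, the standard library's csv module is another option, since in Python 3 it accepts any one-character string as a delimiter. The sketch below is not the approach used above; it reads the sample data from an in-memory string (a stand-in for thorn_test.txt) and converts the salary column to integers by hand.

```python
import csv
import io

# In-memory stand-in for thorn_test.txt.
data = (
    "first_nameþlast_nameþsalary\n"
    "jamesþbondþ100000000000\n"
    "ronaldþmcdonaldþ1010202\n"
    "lebronþjamesþ203030303\n"
    "helmutþnewtonþ2374843\n"
)

# DictReader uses the first row as the header.
reader = csv.DictReader(io.StringIO(data), delimiter="þ")

# csv yields every field as a string, so cast salary explicitly.
rows = [dict(row, salary=int(row["salary"])) for row in reader]

print(len(rows))  # 4
print(rows[0])    # {'first_name': 'james', 'last_name': 'bond', 'salary': 100000000000}
```

When reading from a real file, pass the file's actual encoding to open(), for example encoding="latin-1" or encoding="utf-8", so the delimiter is decoded to a single þ character before csv sees it.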