Last year the New York Times published an anonymous piece, which led to a controversy over who could possibly have written it. As the controversy grew, there was a lot of interest in the piece itself and many efforts were made to discern who wrote it by analysing its content. Studying content in this way is known in forensic science as Questioned Document Examination. The whole debacle, however, brought to light many subjects I find fascinating, notably Linguistics, Steganography, and to a lesser extent Cryptography.
For my dissertation as part of my degree I created an encryption algorithm. I've had an interest in encryption for many years, not because of the goal of keeping things secure, but rather the opposite. I love puzzles, and I love anything that makes me stop and think. Cryptography is the act of encrypting or decrypting information in general, and its sister subject Cryptanalysis is the act of analysing encrypted information with the ultimate goal of decrypting it. Both fields heavily incorporate linguistics. Simple ciphers can be broken with basic linguistic analysis. Take a substitution cipher, where every letter of the alphabet is swapped for a different unique letter, allowing you to encrypt a message and produce a cipher text that looks like a jumbled mess. In English, the letters E, T, A, O, I, and N are the most frequently used, in that order. Because of this, most simple substitution ciphers can be broken by taking the most common letter in the cipher text and mapping it to the letter E, then the next most common to T, and so on. This isn't an infallible method. There will always be exceptions, short messages in particular may not follow the typical distribution, and anyone who knows even the basics of Cryptography will know this method of encryption offers little security.
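The frequency trick described above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function names `letter_ranking`, `guess_mapping` and `apply_guess` are made up for this example, and the reference order is limited to the six letters named above. On a short cipher text the guess will usually be wrong for several letters, so treat the output as a starting point for the puzzle rather than a solution.

```python
from collections import Counter
import string

# Assumption: only the six most frequent English letters named in the post.
ENGLISH_ORDER = "ETAOIN"

def letter_ranking(text):
    """Return the letters found in text, most frequent first (ties alphabetical)."""
    counts = Counter(ch for ch in text.upper() if ch in string.ascii_uppercase)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [letter for letter, _ in ranked]

def guess_mapping(cipher_text, reference=ENGLISH_ORDER):
    """Pair the most common cipher letters with the reference letters, rank by rank."""
    return dict(zip(letter_ranking(cipher_text), reference))

def apply_guess(cipher_text, mapping):
    """Replace guessed letters, show unguessed letters as '_', keep punctuation."""
    out = []
    for ch in cipher_text.upper():
        if ch in string.ascii_uppercase:
            out.append(mapping.get(ch, "_"))
        else:
            out.append(ch)
    return "".join(out)

if __name__ == "__main__":
    cipher = "XXXYYZ QXY"  # hypothetical cipher text, purely for illustration
    mapping = guess_mapping(cipher)
    print(mapping)
    print(apply_guess(cipher, mapping))
```

Letters that never get a guess are shown as underscores, which makes it easy to see how much of the message is still unsolved.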
What this linguistic analysis serves to prove, however, is that a piece of writing contains a lot more information than the words alone and the meaning they convey. There's a lot of metadata that can be extracted: average sentence length, word count, character count, characters per word, letter frequency distribution, unique word count, and unique word frequency distribution. There are also sentiment indicators, and cumulative indicators that emerge when you have more than one sample, such as cliché counts, recurring phrases, recurring words, and consistently incorrect word usage. All of this information can give you a lot of clues about who wrote something and whether or not multiple examples were written by the same person. Even when posts are edited quite a bit, some of this information will be preserved unless the whole thing is rewritten every time.
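To give a feel for how little code is needed to pull out some of this metadata, here is a rough Python sketch. The function name `style_profile` and the dictionary keys are my own invention, and the sentence and word splitting is deliberately naive (it splits on full stops, question marks and exclamation marks, so abbreviations will throw it off). A real analysis would need far more care than this.

```python
import re
from collections import Counter

def style_profile(text):
    """Collect a few simple style measurements from a piece of writing."""
    # Naive sentence split: any run of ., ! or ? ends a sentence.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Words are runs of letters and apostrophes, lower-cased.
    words = re.findall(r"[A-Za-z']+", text.lower())
    letters = [ch for ch in text.lower() if ch.isalpha()]
    word_counts = Counter(words)
    return {
        "word_count": len(words),
        "character_count": len(text),
        "average_sentence_length": len(words) / len(sentences) if sentences else 0.0,
        "characters_per_word": sum(len(w) for w in words) / len(words) if words else 0.0,
        "unique_word_count": len(word_counts),
        "word_frequency": word_counts,
        "letter_frequency": Counter(letters),
    }

if __name__ == "__main__":
    sample = "Ultimately it works. Ultimately it fails!"  # made-up sample text
    profile = style_profile(sample)
    print(profile["word_count"], profile["unique_word_count"])
    print(profile["word_frequency"].most_common(3))
```

Running the same function over two pieces of writing and comparing the results side by side is the simplest version of the comparison described above.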
Steganography is the act of hiding information in a passage of text without the reader being aware that it is there. This can be done through fixed points within the passage, for example the first letter of every word or every third letter, or through some more complicated means of locating the information. Once the reader has the means to find this information, they can go back and find it hidden in plain sight. Perhaps you noticed it, perhaps you didn't, but this paragraph for instance contains the word "STOP" hidden within it, made up of the first letter of the first word of each sentence.
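The fixed-point schemes mentioned above are simple enough to check with a short Python sketch. The helper names `sentence_initials`, `word_initials` and `every_nth_letter` are hypothetical, the sentence splitting is naive, and the sample text is a shortened stand-in I made up rather than the paragraph itself.

```python
import re

def sentence_initials(text):
    """Join the first letter of the first word of each sentence."""
    # Naive split: a sentence ends at ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    initials = []
    for sentence in sentences:
        match = re.search(r"[A-Za-z]", sentence)  # skip quotes or brackets
        if match:
            initials.append(match.group().upper())
    return "".join(initials)

def word_initials(text):
    """Join the first letter of every word."""
    return "".join(word[0] for word in re.findall(r"[A-Za-z]+", text))

def every_nth_letter(text, n):
    """Join every nth letter, counting letters only (n=3 gives the 3rd, 6th, ...)."""
    letters = [ch for ch in text if ch.isalpha()]
    return "".join(letters[n - 1::n])

if __name__ == "__main__":
    sample = (
        "Steganography hides information in plain sight. "
        "This can use fixed points in a passage. "
        "Once you know the rule you can read it. "
        "Perhaps you noticed it."
    )
    print(sentence_initials(sample))
```

The point of the sketch is that once you know the rule, extraction is trivial. The hard part is guessing which rule, if any, was used.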
Understandably, had these techniques been used in the anonymous New York Times piece, they would have been spotted fairly quickly. Indeed, within the first few hours of its publication there were already analyses of the article and a number of indicators as to who may have written it. Most speculation revolved around words that are uncommon in English but were known to have been used many times by certain individuals within the US Administration.
Regardless, who wrote the piece is now irrelevant. What it made me realise, however, is that there are a number of turns of phrase in these posts that give away the fact I wrote every one of them. I tend to use the word "ultimately" quite a bit, and I don't always use it right. Whilst the meaning and sentiment are conveyed, there are times when it isn't being used in the right context. Whenever things are brought to my attention I usually amend them. At this point, however, there are well over 150 posts on this blog and I just don't have the time to go back and edit every one of them again. This is part of the reason why I have said in other posts that we often see our own mistakes but don't recognise them in the moment; we gloss over them because they don't stand out. We see what we think we see rather than what is actually there. Again, this is why it is advisable to have someone else proofread your publications, or to leave enough time to forget the content so that you can proof it yourself. Alas, I am only one man. This is a hobby, it isn't my job.
What do you think would give you away? If you were to write an anonymous article, what word or phrase or other linguistic feature of your writing style would be the key indicator that you were the one that wrote it?