Difference between revisions of "UTF8 insert Byte Order Mark"

From n0r1sk software solutions

Revision as of 12:28, 4 July 2011

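UTF-8 generally does not need a BOM (Byte Order Mark), but some libraries that read and write files require one. In UTF-16 and UTF-32 the BOM signals the byte order, so it matters far more there, although it is not strictly mandatory either (the byte order can also be fixed by the encoding name, as in UTF-16LE or UTF-16BE).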

More information can be found on Wikipedia.[1][2]

Show the difference

The following command shows which encoding a file uses.

Command:

file mytest.file

Output without BOM:

mytest.file: UTF-8 Unicode text

Output with BOM:

mytest.file: UTF-8 Unicode (with BOM) text
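
The result reported by file can also be cross-checked against the raw bytes: a UTF-8 BOM is the three-byte sequence EF BB BF at the very start of the file. A minimal sketch, assuming the standard tools head, od and tr are available; the function name has_bom and the sample file names are hypothetical and not part of the original article:

```shell
#!/bin/bash

# has_bom FILE: succeed (exit status 0) if FILE starts with the UTF-8 BOM
# bytes EF BB BF. The function name is hypothetical.
has_bom() {
    local first3
    # Read the first three bytes and print them as plain hex, e.g. "efbbbf"
    first3=$(head -c 3 -- "$1" | od -An -tx1 | tr -d ' \n')
    [ "$first3" = "efbbbf" ]
}

# Example: create one file without and one with a BOM, then check both
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT            # remove the sample files on exit
printf 'hello\n' > "$tmp/plain.txt"
printf '\357\273\277hello\n' > "$tmp/bom.txt"   # octal for EF BB BF

for f in "$tmp/plain.txt" "$tmp/bom.txt"; do
    if has_bom "$f"; then
        echo "${f##*/}: BOM present"
    else
        echo "${f##*/}: no BOM"
    fi
done
```

Run as written, this prints "plain.txt: no BOM" followed by "bom.txt: BOM present". Unlike the text description printed by file, the byte comparison does not depend on the wording used by a particular version of file.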

The script to add a BOM (Byte Order Mark)

Prerequisite

The following software package installs the user-space utility "uconv".[3]

apt-get install libicu-dev
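
Before running the script it may help to confirm that uconv is actually on the PATH, since the package that ships it can differ between distribution releases. A small sketch; the helper name require_tool is hypothetical and not part of the original script:

```shell
#!/bin/bash

# require_tool NAME: succeed if NAME is found on the PATH, otherwise print
# a hint on stderr and fail. The helper name is hypothetical.
require_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        return 0
    fi
    echo "$1 not found; please install it first" >&2
    return 1
}

if require_tool uconv; then
    echo "uconv is available at: $(command -v uconv)"
fi
```

In the conversion script the same check could be placed at the top as "require_tool uconv || exit 1", so that the target directory is not touched when the tool is missing.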

Script

If you like, you can turn the BASEDIR and TARGETDIR variables into parameters passed to the script. The script duplicates the filesystem tree below BASEDIR into the target directory. The source remains unchanged.

Be aware: The script deletes the target directory on each run!

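#!/bin/bash

BASEDIR=/root/messages
TARGETDIR=/tmp/messages

# Be aware: the target directory is deleted on each run
rm -Rf "$TARGETDIR"
mkdir -p "$TARGETDIR"

# Converts the current directory; $1 is its path relative to BASEDIR
# (empty for BASEDIR itself, otherwise ending in a slash)
function RecursiveConvert()
{
        local rel="$1" f OUTPUT
        for f in *
        do
                # In an empty directory the glob stays a literal "*"
                [ -e "$f" ] || continue
                if [ -d "$f" ]; then
                        echo "Directory: $f"
                        mkdir -v "$TARGETDIR/$rel$f"
                        (cd "$f" && RecursiveConvert "$rel$f/")
                else
                        echo "File: $f"
                        # -b (brief) prints the description without the file name
                        OUTPUT=$(file -b "$f")
                        echo "$OUTPUT"
                        if [ "$OUTPUT" = "UTF-8 Unicode text" ]; then
                                echo "UNICODE WITHOUT BOM"
                                echo "Converting....."
                                # Name both encodings so the result does not depend on the locale
                                uconv -f utf-8 -t utf-8 --add-signature "$f" > "$TARGETDIR/$rel$f"
                                echo "......done!"
                        else
                                echo "Other file encoding $OUTPUT"
                                echo "Copying....."
                                cp -v "$f" "$TARGETDIR/$rel$f"
                                echo ".....done"
                        fi
                fi
        done
}

(cd "$BASEDIR" && RecursiveConvert "")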
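
The parameter passing suggested above could be sketched as follows. This is only one possible approach under stated assumptions: the function name parse_dirs and the script name add-bom.sh are hypothetical, and the checks on the target path are an added precaution because the script deletes that directory on each run.

```shell
#!/bin/bash

# parse_dirs SOURCE TARGET: validate both arguments and set the variables
# BASEDIR and TARGETDIR used by the script. The name is hypothetical.
parse_dirs() {
    if [ "$#" -ne 2 ]; then
        echo "Usage: add-bom.sh SOURCE_DIR TARGET_DIR" >&2
        return 1
    fi
    if [ ! -d "$1" ]; then
        echo "Source directory not found: $1" >&2
        return 1
    fi
    case "$2" in
        /)  echo "Refusing to use / as target" >&2; return 1 ;;
        /*) ;;  # absolute path: accepted
        *)  # the script changes directory, so a relative target would move with it
            echo "Target must be an absolute path: $2" >&2; return 1 ;;
    esac
    BASEDIR=$1
    TARGETDIR=$2
}

# Example call with a temporary source directory
src=$(mktemp -d)
trap 'rm -rf "$src"' EXIT
if parse_dirs "$src" "${src}-bom"; then
    echo "Source: $BASEDIR"
    echo "Target: $TARGETDIR"
fi
```

In the script itself, the two fixed assignments at the top would be replaced by a call such as: parse_dirs "$@" || exit 1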