Notes - Remove Broken Files (broken content, broken file name) in Ubuntu

I once wrote a Python program that creates text files, which should contain only ASCII and UTF-8 Unicode text. Before a line is written to a file, any bytes in it that are not valid UTF-8 are dropped using this line of code (line is a byte string here):

line = line.decode('utf-8', 'ignore').encode("utf-8")

I noticed many suggestions on Stack Overflow to use Python’s string.printable or the strings command-line tool to sanitize the text. However, by default both approaches remove all non-ASCII characters, including the many valid UTF-8 characters outside the ASCII range, because ASCII is only a subset of UTF-8 (reference).
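
To make the difference concrete, here is a minimal Python 3 sketch, assuming the input is a bytes object. The helper names sanitize_utf8 and keep_printable_ascii and the sample bytes are illustrative assumptions, not part of my original program.

```python
import string


def sanitize_utf8(raw: bytes) -> str:
    """Decode bytes as UTF-8, silently dropping bytes that are not valid UTF-8."""
    return raw.decode("utf-8", errors="ignore")


def keep_printable_ascii(text: str) -> str:
    """Keep only characters listed in string.printable, which is ASCII-only."""
    return "".join(ch for ch in text if ch in string.printable)


# Valid UTF-8 for "café", then one invalid byte (0xff), then " ok".
raw = b"caf\xc3\xa9\xff ok"
clean = sanitize_utf8(raw)
print(clean)                        # the invalid byte is gone, "é" survives
print(keep_printable_ascii(clean))  # "é" is dropped as well
```

The first helper loses only the invalid byte, while the string.printable filter also throws away the perfectly valid "é".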

My program was accidentally interrupted, which left many files broken. They contain non-ASCII/non-UTF-8 bytes (e.g., binary and non-printable characters) in both the file name and the file content. This makes these files hard to read, or even to delete.

Some files contain escape sequences such as \x.. in their names, which prevents some programs from accessing or reading them normally. Other files contain bad characters in their content, which causes errors when trying to read them. For example, some broken file names contain strings like the following, which crash most programs that try to read these files.

h\241\212_pap\024i\033ular\001neee~\015\004)5resolv-retry!\271nfinite\241\212nobind

ӝ_BF-CBE\xwffw\qw\ff

鍺_995\005F\xAC\x\ff

After spending hours trying to write scripts to detect, fix, or delete these files, I came up with the following two steps to remove corrupted files from my directories.

Step I: remove files with broken file name

There would be no problem if I only had a couple of broken files with known names. I could just use rm manually with tab completion or a wildcard pattern (e.g., rm broken_file_name_*weird_char*) to delete them. However, in my case there are thousands of broken files, whose names contain unpredictable non-printable characters. Luckily, ls has the -q flag, which prints each non-graphic character in a file name as the character ?.
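
The same kind of detection can be sketched in Python by asking the operating system for the raw bytes of each name. This is only an approximation of what ls -q flags (the exact behaviour of ls depends on the locale), and the function names is_broken_name and list_broken_names are hypothetical helpers introduced for this example.

```python
import os
from typing import List


def is_broken_name(name: bytes) -> bool:
    """Return True if a raw file name is not valid UTF-8 or has non-printable characters."""
    try:
        text = name.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return not text.isprintable()


def list_broken_names(directory: str) -> List[bytes]:
    """List the raw (bytes) names in `directory` that look broken."""
    # A bytes path makes os.listdir return the names as undecoded bytes.
    names = os.listdir(os.fsencode(directory))
    return sorted(name for name in names if is_broken_name(name))
```

Calling list_broken_names(".") returns the suspicious names as bytes, so nothing is lost or replaced by ? along the way.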

Method 1: Remove broken files the hard way

  1. Get all broken file names:

        ls -q | grep '?'
    
  2. Copy the output to a text file temp.txt, then use the Selection/Split into Lines tool of Sublime Text to select all lines and add rm in front of each line. Save the file as a bash script and run it from within the directory where the broken files are stored. Unfortunately, this removes most of the broken files but not all of them (at least in my case). I then ran the command from step 1 again and noticed that the list of broken files was shorter. I copied this shorter list into a new temp.txt file.

  3. Then I use the following Python script to replace every run of special characters with the wildcard * and generate a bash script.

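    import re
    
    patterns = []
    with open("temp.txt", "r") as original:
        for line in original:
            # Drop the trailing newline, then turn every run of
            # non-alphanumeric characters into a single wildcard.
            pattern = re.sub('[^A-Za-z0-9]+', '*', line.strip())
            # Skip blank lines and a bare "*", which would match every file.
            if pattern and pattern != '*':
                patterns.append(pattern)
    
    with open("rm_bad_files.sh", "w") as script:
        for pattern in patterns:
            script.write("rm " + pattern + ";\n")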
    
  4. The bash script then looks similar to the following. Be careful: if your script contains lines such as rm *, or lines where only a very short pattern follows rm, running it may accidentally remove all your files, including the good ones. Remove such lines from the script before running it.

    rm 22222*F0amoUy*DFtnMMUm2PiiirJaaaWlJcaaaaaiWlJMjin*DFtlB;
    rm *173*198*117*infinite*nobi*d*18;
    rm 190*16*223*udp*paqdi*ular*neee*Hu*resolv*retry*nfinite*nobi;
    rm *18*188*q*p*h*pap*i*ular*neee*5resolv*retry*nfinite*nobind*;
    rm 206*199*196*udp*h*M*RTIFIwyTEIQ*IQ*MIIF2DCCAwIBAgIQTKr5yt7*b*;
    rm *114*36*proto*tcp*M*RTIFIwyTEIQ*IQ*MIIF2DgIQTKr5yt7*b*Af907pn*;
    
  5. Finally, run bash rm_bad_files.sh from within the directory where the broken files are stored to remove them.
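
The caution in step 4 can be partly automated. The sketch below is one possible safeguard, not what I actually ran: it builds the same kind of wildcard patterns but sets aside any pattern that keeps too few literal characters. The names to_glob, is_safe_pattern, and build_rm_lines, and the threshold of 6 characters, are arbitrary assumptions.

```python
import re


def to_glob(listed_name: str) -> str:
    """Collapse every run of non-alphanumeric characters into a single '*'."""
    return re.sub(r"[^A-Za-z0-9]+", "*", listed_name.strip())


def is_safe_pattern(pattern: str, min_literal_chars: int = 6) -> bool:
    """Accept a pattern only if it keeps enough literal characters to be specific."""
    literal = pattern.replace("*", "")
    return len(literal) >= min_literal_chars


def build_rm_lines(listed_names):
    """Return (rm_lines, skipped) so that risky patterns can be reviewed by hand."""
    rm_lines, skipped = [], []
    for name in listed_names:
        pattern = to_glob(name)
        if is_safe_pattern(pattern):
            rm_lines.append("rm -- " + pattern)
        else:
            skipped.append(pattern)
    return rm_lines, skipped
```

Patterns that end up in skipped, such as a bare *, still need to be handled manually, but they can no longer wipe the whole directory by accident.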

Method 2: Remove broken files with one-line command

Several days after working out Method 1, I realized that files with broken names can also be removed with this one-line command:

rm $(ls -q /current_working_directory | grep '?' | awk -F'?' '{print $1"*"$2"*"}')

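Note that this command assumes that your file names contain neither a literal ? character nor whitespace, and that it is run from within that directory. It also keeps only the first two ?-separated parts of each name, so a name that starts with two non-printable characters yields the pattern **, which matches every file. Check the output of the ls | grep | awk pipeline before passing it to rm.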
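
To see which patterns the awk step produces, it can be mimicked in a few lines of Python. This is only an illustration: awk_pattern and matched_by are hypothetical helpers, and fnmatch stands in for the shell's wildcard matching, which differs in details (for example, the shell's * skips hidden files).

```python
import fnmatch


def awk_pattern(listed_name: str) -> str:
    """Mimic the awk step: keep the first two '?'-separated fields, each followed by '*'."""
    fields = listed_name.split("?")
    first = fields[0]
    second = fields[1] if len(fields) > 1 else ""  # awk yields "" for a missing field
    return first + "*" + second + "*"


def matched_by(pattern: str, names):
    """Names that a wildcard pattern would match, using fnmatch as a stand-in for the shell."""
    return [name for name in names if fnmatch.fnmatchcase(name, pattern)]


print(awk_pattern("190?16?223?udp"))  # only the first two fields are kept
print(awk_pattern("??nobind"))        # both fields are empty
```

Feeding a candidate pattern and a list of known-good names to matched_by is a cheap way to check that the pattern would not catch any of them.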

Step II: remove files with broken content

Note that the script in this step won’t work properly if files with non-printable names have not been removed in Step I.

Unix has a command, file, that determines the type of a file. Run it in a for loop from the directory where all the files are stored (good and broken ones alike, but *not* files with non-printable names).

  1. A basic file command tells you the type of each file, which helps identify files with broken content:

    for i in *; do file "$i"; done
    #output
    file_name1: UTF-8 Unicode text, with CRLF, LF line terminators
    file_name2: ASCII text, with CRLF, LF line terminators
    file_name3: data
    file_name4: empty
    file_name5: cannot open <weird_file_name_may_contain_non_printable_characters> (No such file or directory) # This line should not show up if you have gone through Step I. If a similar line shows up in your output, make sure you remove such `cannot open` files before moving on.
    ...
    
  2. The following command lists broken files by filtering out the good ones with grep -a -v (see man grep). In my case, only the first two types (UTF-8 Unicode text and ASCII text) indicate proper files, while the last three indicate that a file is broken (I verified this by manually checking some files of those types). Based on this assumption, I can list the broken files with this command:

    for i in *; do file "$i" | grep -a -v ': UTF-8 Unicode text' | grep -a -v ': ASCII text'; done
    
  3. The following script prints the names of broken files to stdout and removes them, using if [[ condition ]]; then rm; fi nested inside the for loop. Note that you need to run it with /bin/bash, not /bin/sh. The echo tells me which file was removed.

    #!/bin/bash
    for i in "$(pwd)"/*; do
        # Run file once per path; quoting keeps names with spaces intact
        type=$(file "$i")
        if [[ $type != *": UTF-8 Unicode text"* ]] && [[ $type != *": ASCII text"* ]]; then
            echo "$i"
            rm -- "$i"
        fi
    done
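
For comparison, a similar check can be sketched in pure Python, without depending on the exact wording of file's descriptions. This is a stand-in, not a reimplementation of file, and the two can disagree. It assumes, as above, that empty files count as broken. The names has_broken_content, find_broken_files, and ALLOWED_CONTROLS are hypothetical, and each file is read fully into memory, so it only suits reasonably small files.

```python
import os
import unicodedata

# Control characters that are normal in text files (assumption for this sketch).
ALLOWED_CONTROLS = {"\t", "\n", "\r", "\f"}


def has_broken_content(path: str) -> bool:
    """Return True if the file is empty, not valid UTF-8, or holds other control characters."""
    with open(path, "rb") as handle:
        raw = handle.read()
    if not raw:
        return True  # empty files are treated as broken here
    try:
        text = raw.decode("utf-8")  # pure ASCII is valid UTF-8 too
    except UnicodeDecodeError:
        return True
    return any(
        unicodedata.category(ch) == "Cc" and ch not in ALLOWED_CONTROLS
        for ch in text
    )


def find_broken_files(directory: str):
    """Return sorted paths of regular files in `directory` whose content looks broken."""
    broken = []
    for entry in os.scandir(directory):
        if entry.is_file(follow_symlinks=False) and has_broken_content(entry.path):
            broken.append(entry.path)
    return sorted(broken)
```

Printing the result of find_broken_files(".") first, and only then calling os.remove on each path, keeps the deletion step under manual control.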
    
Nguyen Phong Hoang
Postdoctoral Researcher
