Notes - Remove Broken Files (broken content, broken file name) in Ubuntu
I once wrote a Python program to create text files that should contain only ASCII and UTF-8 Unicode text. Before being written to a file, every non-UTF-8 line of text is sanitized with this line of (Python 2) code:

```python
line = line.decode('utf-8', 'ignore').encode('utf-8')
```
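In Python 3, where `str` and `bytes` are separate types, a minimal sketch of the same sanitization looks like this (the function name is mine, not from the original program):

```python
def sanitize(raw: bytes) -> bytes:
    # Decode with errors='ignore' to silently drop invalid UTF-8 byte
    # sequences, then re-encode so the result is guaranteed valid UTF-8.
    return raw.decode("utf-8", "ignore").encode("utf-8")

# The invalid byte 0xff is dropped; the valid multi-byte character survives.
print(sanitize(b"caf\xc3\xa9 \xffbroken"))  # b'caf\xc3\xa9 broken'
```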
I noticed many suggestions on Stack Overflow to use Python's `string.printable` or Bash's `strings` command to sanitize the text. However, these approaches remove all non-ASCII characters, which discards many, many valid UTF-8 characters, because ASCII is only a subset of UTF-8 (reference).
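A quick illustration of why an ASCII-only filter is too aggressive (the sample string is my own): filtering through `string.printable` keeps the text readable but silently drops every accented or non-Latin character, even though they are valid UTF-8.

```python
import string

text = "héllo wörld"  # perfectly valid UTF-8 text

# Keep only characters in string.printable (digits, ASCII letters,
# punctuation, whitespace) -- a commonly suggested "sanitizer".
ascii_only = "".join(c for c in text if c in string.printable)

print(ascii_only)  # 'hllo wrld' -- the é and ö were silently removed
```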
Unfortunately, my program was interrupted unexpectedly, which left many files broken. They contain non-ASCII/non-UTF-8 characters (e.g., binary and non-printable characters) in both their file names and their content. This makes it hard to read, or even delete, these files.
Some files contain escape characters (`\x..`) in their names, which prevents some programs from accessing or reading them normally. Other files contain bad characters in their content, which causes errors when those files are read. For example, some broken file names contain strings like these, which crash most programs that try to read the files:

```
h\241\212_pap\024i\033ular\001neee~\015\004)5resolv-retry!\271nfinite\241\212nobind
ӝ_BF-CBE\xwffw\qw\ff
鍺_995\005F\xAC\x\ff
```
After spending hours trying to write scripts to detect, fix, or delete these files, I came up with the following two steps to remove corrupted files from my directories.
Step I: remove files with broken file names
There would be no problem if I only had a couple of broken files with known names. I could just manually use `rm` with tab completion or a glob pattern (e.g., `rm broken_file_name_*weird_char*`) to delete them. However, in my case, there were thousands of broken files containing unpredictable non-printable characters. Luckily, `ls` has a `-q` flag that forces non-graphic characters in file names to be printed as the character `?`.
Method 1: Remove broken files the hard way
- Get all broken file names:

  ```
  ls -q | grep '?'
  ```

- Copy the output to a text file `temp.txt`, then use the Selection/Split into Lines tool of Sublime to select all lines and add `rm` in front of each line. Save the file as a bash script and run it within the directory where the broken files are stored. Unfortunately, this step removed most of the broken files but not all of them (at least in my case). There were still some broken files that this script could not remove. I then ran the `ls -q | grep '?'` command again and noticed that the list of broken files was now shorter. One more time, I copied this list into a new `temp.txt` file.

- Then I used the following Python script to replace all special characters with the wildcard `*` and generate a bash script:

  ```python
  import re

  modified = []
  with open("temp.txt", "r") as original:
      for line in original.readlines():
          line = re.sub('[^A-Za-z0-9]+', '*', line)
          modified.append(line)

  with open("rm_bad_files.sh", "w") as script:
      for line in modified:
          script.write("rm " + line + ";\n")
  ```
- The bash script then looks similar to the following. Be cautious! If the script contains a line like `rm *`, i.e., a very short pattern following `rm`, it may accidentally remove all your files, including good ones. In that case, delete such lines from the script for safety.

  ```
  rm 22222*F0amoUy*DFtnMMUm2PiiirJaaaWlJcaaaaaiWlJMjin*DFtlB;
  rm *173*198*117*infinite*nobi*d*18;
  rm 190*16*223*udp*paqdi*ular*neee*Hu*resolv*retry*nfinite*nobi;
  rm *18*188*q*p*h*pap*i*ular*neee*5resolv*retry*nfinite*nobind*;
  rm 206*199*196*udp*h*M*RTIFIwyTEIQ*IQ*MIIF2DCCAwIBAgIQTKr5yt7*b*;
  rm *114*36*proto*tcp*M*RTIFIwyTEIQ*IQ*MIIF2DgIQTKr5yt7*b*Af907pn*;
  ```
- Finally, you can run `bash rm_bad_files.sh` from within the directory where the broken files are stored to remove them.
Method 2: Remove broken files with one-line command
Several days after working out Method 1, I realized that files with broken names could also be removed with this one-line command:

```
rm $(ls -q /current_working_directory | grep '?' | awk -F'?' '{printf $1"*"$2"*\n"}');
```

The `awk` command splits each mangled name on `?` and rebuilds it as a glob pattern, e.g. `foo?bar?` becomes `foo*bar*`. Note that this command assumes that your good file names do not contain the character `?`.
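If guessing glob patterns from the `?`-mangled `ls -q` output feels fragile, here is an alternative sketch (my own approach, not from the steps above): on Linux, passing a `bytes` path to `os.listdir` returns raw byte file names, so names that are not valid UTF-8 can be deleted directly, without any pattern matching.

```python
import os

def remove_non_utf8_names(directory: bytes = b".") -> list:
    """Delete files whose names are not valid UTF-8; return the deleted names."""
    removed = []
    for name in os.listdir(directory):  # bytes path in -> raw bytes names out
        try:
            name.decode("utf-8")        # name is valid UTF-8: keep the file
        except UnicodeDecodeError:
            os.remove(os.path.join(directory, name))
            removed.append(name)
    return removed
```

Unlike an ASCII-only filter, this keeps files whose names contain valid non-ASCII UTF-8 characters.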
Step II: remove files with broken content
Note that the scripts in this step won’t work properly if files with non-printable names have not been removed in Step I.
Unix has a command-line tool, `file`, that determines the type of a file. We can run a `for ...` loop with `file` over the directory where all the files are stored (good and broken alike, but not files with non-printable names, which were removed in Step I).
- A basic `file` command tells you the type of each file, which helps to identify files with broken content:

  ```
  for i in *; do file $i; done
  # output
  file_name1: UTF-8 Unicode text, with CRLF, LF line terminators
  file_name2: ASCII text, with CRLF, LF line terminators
  file_name3: data
  file_name4: empty
  file_name5: cannot open <weird_file_name_may_contain_non_printable_characters> (No such file or directory)
  # This line should not show up if you have gone through Step I. If a similar
  # line shows up in your output, please make sure you remove such `cannot open`
  # files before moving on.
  ...
  ```
- This command finds broken files with `grep -a -v` (see `man grep`). In my case, only the first two cases (i.e., `UTF-8 Unicode text` and `ASCII text`) indicate proper files, while the last three cases indicate that a file is broken (I verified this by manually checking some files of those types). Based on this assumption, I could then find the list of broken files with:

  ```
  for i in *; do file $i | grep -a -v ': UTF-8 Unicode text' | grep -a -v ': ASCII text'; done
  ```
- The following script prints the names of the broken files to stdout and removes them, using `if [[ condition ]]; then rm; fi` nested inside the `for ...` loop. Note that you will need to run it with `/bin/bash`, not `/bin/sh`. The `echo` notifies me which file was removed.

  ```bash
  #!/bin/bash
  for i in $(pwd)/*; do
      if [[ $(file $i) != *": UTF-8 Unicode text"* ]] && [[ $(file $i) != *": ASCII text"* ]]; then
          echo $i
          rm -- $i
      fi
  done
  ```
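The same content check can also be done without parsing `file`'s output: attempt a strict UTF-8 decode of each file's bytes and flag anything that fails. This is a sketch under the assumption that "broken" simply means "not valid UTF-8"; since ASCII is a subset of UTF-8, ASCII files pass too.

```python
import os

def find_broken_files(directory: str = ".") -> list:
    """Return names of regular files whose content is not valid UTF-8."""
    broken = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        with open(path, "rb") as f:
            data = f.read()
        try:
            data.decode("utf-8")   # strict decode: raises on any bad byte
        except UnicodeDecodeError:
            broken.append(name)    # candidate for removal
    return broken
```

Review the returned list before deleting anything; note that empty files decode successfully, so unlike the `file`-based script above, this sketch does not flag them.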