Given a simple address book (ab.txt) with poor formatting:

john doe john@example.com 555-111-1212
sally@example.co.uk 555-555-1212 sally doe
515-1212 joe blow joe@example.info
jane doe jane@example.com
bob doe bob@example.com
etc...

How can just the email addresses be extracted to a new file using awk?

$ awk ' { for(i=1;i<=NF;i++){ if($i ~ /@/){ print $i } } } ' ab.txt
john@example.com
sally@example.co.uk
joe@example.info
jane@example.com
bob@example.com

While simply extracting fields containing the 'at' symbol suited the data set perfectly, Patrick Mylund Nielsen's EmailsFromFile is far more comprehensive. "It follows a regular expression pattern based on the RFC 2822 standard and should thus return all valid email addresses regardless of how they appear in the file." Which means that even a jumbled mess like this:
john@example.com,sally@example.co.uk|8135551212/some random info joe@example.info;sue@example.museum 42!42!42!

is parsed perfectly:
$ emailsfromfile.py sloppy_file.txt
joe@example.info
sue@example.museum
john@example.com
sally@example.co.uk
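To see why the field-based awk filter would struggle with this second file, here is a rough Python sketch of the same idea: split each line on whitespace and keep any field containing '@'. The function name fields_with_at is made up for illustration, and the sample line is the jumbled one shown above.

```python
# The jumbled line from the post, copied verbatim.
sloppy = ("john@example.com,sally@example.co.uk|8135551212/some random info "
          "joe@example.info;sue@example.museum 42!42!42!")


def fields_with_at(line):
    """Mimic the awk loop: split on whitespace, keep fields containing '@'."""
    return [field for field in line.split() if "@" in field]


# Each surviving field is printed on its own line, as awk's print would do.
for field in fields_with_at(sloppy):
    print(field)
```

Because the addresses here are separated by commas, pipes and semicolons rather than whitespace, the two fields that come back still have neighbouring addresses and other debris glued together, which is where a pattern-based extractor earns its keep.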
EmailsFromFile is licensed under the WTFPL, and is reproduced below for posterity. (Be sure not to miss Patrick's other tools; Windows admins will especially appreciate Failsafe MSI, a "shell script that enables and starts the Windows Installer service in safe mode".)
UPDATE: emailregex.com has loads of other examples, including this surprisingly accurate grep one-liner:

$ grep -E -o "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b" filename.txt
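For comparison, the same pattern can be tried from Python's re module. This is only a sketch: the names EMAIL_RE and extract_emails are invented here, and the pattern is the simplified one from the grep one-liner, not the full RFC 2822 expression used by EmailsFromFile.

```python
import re

# Same pattern as the grep one-liner (simplified, not RFC 2822 complete).
EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b")


def extract_emails(text):
    """Return the unique addresses found in text, in order of first appearance."""
    matches = EMAIL_RE.findall(text)
    # dict.fromkeys drops duplicates while keeping insertion order.
    return list(dict.fromkeys(matches))


# The jumbled sample line from earlier in the post.
sloppy = ("john@example.com,sally@example.co.uk|8135551212/some random info "
          "joe@example.info;sue@example.museum 42!42!42!")

print("\n".join(extract_emails(sloppy)))
```

Unlike grep -o, this version removes duplicates, which is closer to what EmailsFromFile does with its set, except that it keeps the addresses in the order they were found.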
#!/usr/bin/env python
'''
emailsfromfile.py -- Get all unique email addresses from a file
by Patrick Mylund Nielsen
http://patrickmylund.com/projects/emailsfromfile/
License: WTFPL (http://sam.zoy.org/wtfpl/)
'''
__version__ = '1.1'
import sys
import os
import re
import codecs
# Regular expression matching according to RFC 2822 (http://tools.ietf.org/html/rfc2822)
rfc2822_re = r"""(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])"""
email_prog = re.compile(rfc2822_re, re.IGNORECASE)

def isEmailAddress(string):
    return email_prog.match(string)

def main(filename, separator='\n', encoding=None):
    separator_replace = {
        'space': ' ',
        'newline': '\n',
    }
    if not os.path.isfile(filename):
        raise IOError("%s is not a file." % filename)
    results = set()
    with codecs.open(filename, 'rb', encoding) as f:
        for line in f:
            results.update(email_prog.findall(line))
    for k, v in separator_replace.iteritems():
        separator = separator.replace(k, v)
    print(separator.join(results))

if __name__ == '__main__':
    args = len(sys.argv) - 1
    if 0 < args < 4:
        main(*sys.argv[1:])
    else:
        print("Usage: python %s <filename> [separator] [encoding]" % sys.argv[0])
        print("The default separator is a newline. To separate by space, literally enter 'space' as the separator.")
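
The separator handling is easy to miss: the words 'space' and 'newline' are swapped for the characters they name before the results are joined. A small sketch of that step in isolation follows. The helper name expand_separator is hypothetical, and it uses items() so it runs on current Python, whereas the script above uses the Python 2 iteritems().

```python
def expand_separator(separator):
    """Replace the keywords 'space' and 'newline' with the characters they name."""
    replacements = {'space': ' ', 'newline': '\n'}
    for keyword, char in replacements.items():
        separator = separator.replace(keyword, char)
    return separator


# A list is used here so the order is predictable; the script itself joins a set.
emails = ['joe@example.info', 'sue@example.museum']

print(expand_separator('space').join(emails))
print(expand_separator(',space').join(emails))
```

Since the keywords are replaced wherever they occur, they can be combined with other characters, as in the second call, which produces a comma followed by a space.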
/nix | Jul 06, 2011