Extracting email addresses from a file

Back in 2002, bookface-ga kindly replied to my question about using awk to extract email addresses from a file:

Given a simple address book (ab.txt) with poor formatting:

john doe john@example.com 555-111-1212
sally@example.co.uk 555-555-1212 sally doe
515-1212   joe blow    joe@example.info
jane doe     jane@example.com       bob doe  bob@example.com
etc...

How can just the email addresses be extracted to a new file using awk?

$ awk '
{
    # check every whitespace-separated field on the line
    for (i = 1; i <= NF; i++) {
        if ($i ~ /@/) {
            print $i
        }
    }
}
' ab.txt
john@example.com
sally@example.co.uk
joe@example.info
jane@example.com
bob@example.com
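
For comparison, the same field-by-field idea can be sketched in Python. This is only an illustration of the technique, not part of bookface-ga's answer; the function name fields_with_at and the sample list are made up for the example.

```python
def fields_with_at(lines):
    """Yield every whitespace-separated field that contains '@'."""
    for line in lines:
        # str.split() with no arguments splits on runs of whitespace,
        # much like awk's default field splitting.
        for field in line.split():
            if "@" in field:
                yield field


# Two lines borrowed from the ab.txt sample above.
sample = [
    "john doe john@example.com 555-111-1212",
    "jane doe     jane@example.com       bob doe  bob@example.com",
]

print(list(fields_with_at(sample)))
# ['john@example.com', 'jane@example.com', 'bob@example.com']
```

Like the awk version, this trusts that each address is surrounded by whitespace, which happens to be true for this address book.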

While simply extracting fields containing the 'at' symbol suited that data set perfectly, Patrick Mylund Nielsen's EmailsFromFile is far more comprehensive: "It follows a regular expression pattern based on the RFC 2822 standard and should thus return all valid email addresses regardless of how they appear in the file." That means even a jumbled mess like this:

john@example.com,sally@example.co.uk|8135551212/some random info
joe@example.info;sue@example.museum 42!42!42!

is parsed perfectly:

$ emailsfromfile.py sloppy_file.txt
joe@example.info
sue@example.museum
john@example.com
sally@example.co.uk
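
To see why whitespace splitting is not enough for a file like that, here is a small Python sketch comparing the two approaches on the first sloppy line. The SIMPLE_EMAIL pattern is a deliberately simplified stand-in of my own for illustration; it is not the RFC 2822 expression that EmailsFromFile uses.

```python
import re

# Simplified, illustrative pattern (an assumption for this example),
# NOT the RFC 2822 expression from emailsfromfile.py.
SIMPLE_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")

line = "john@example.com,sally@example.co.uk|8135551212/some random info"

# Field approach: split on whitespace, keep fields containing '@'.
by_field = [f for f in line.split() if "@" in f]

# Regex approach: pull each address out of the surrounding junk.
by_regex = SIMPLE_EMAIL.findall(line)

print(by_field)
# ['john@example.com,sally@example.co.uk|8135551212/some']
print(by_regex)
# ['john@example.com', 'sally@example.co.uk']
```

The field approach returns one glued-together blob because the addresses are separated by a comma and a pipe rather than whitespace, while the pattern match recovers both addresses.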

EmailsFromFile is licensed under the WTFPL, and is reproduced below for posterity. (Be sure not to miss Patrick's other tools; Windows admins will especially appreciate Failsafe MSI, a "shell script that enables and starts the Windows Installer service in safe mode".)

UPDATE: emailregex.com has loads of other examples, including this surprisingly accurate grep one-liner:

$ grep -E -o "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b" filename.txt
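
The same pattern drops straight into Python's re module. The sketch below is my own rough equivalent of the one-liner, with duplicates removed in first-seen order; the names GREP_PATTERN and unique_emails are invented for the example. One caveat: the {2,6} limit means a top-level domain longer than six letters will not match.

```python
import re

# Same expression as the grep one-liner above.
GREP_PATTERN = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b")


def unique_emails(text):
    """Return matched addresses in first-seen order, without duplicates."""
    # dict.fromkeys keeps insertion order, so it works as an ordered set.
    return list(dict.fromkeys(GREP_PATTERN.findall(text)))


text = "joe@example.info;sue@example.museum 42!42!42! joe@example.info"
print(unique_emails(text))
# ['joe@example.info', 'sue@example.museum']
```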

#!/usr/bin/env python
'''
emailsfromfile.py -- Get all unique email addresses from a file

by Patrick Mylund Nielsen
http://patrickmylund.com/projects/emailsfromfile/
License: WTFPL (http://sam.zoy.org/wtfpl/)
'''

__version__ = '1.1'

import sys
import os
import re
import codecs

# Regular expression matching according to RFC 2822 (http://tools.ietf.org/html/rfc2822)
rfc2822_re = r"""(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])"""
email_prog = re.compile(rfc2822_re, re.IGNORECASE)

def isEmailAddress(string):
    return email_prog.match(string)

def main(filename, separator='\n', encoding=None):
    separator_replace = {
        'space': ' ',
        'newline': '\n',
    }

    if not os.path.isfile(filename):
        raise IOError("%s is not a file." % filename)

    results = set()
    with codecs.open(filename, 'rb', encoding) as f:
        for line in f:
            results.update(email_prog.findall(line))

    for k, v in separator_replace.iteritems():
        separator = separator.replace(k, v)

    print(separator.join(results))

if __name__ == '__main__':
    args = len(sys.argv) - 1
    if 0 < args < 4:
        main(*sys.argv[1:])
    else:
        print("Usage: python %s <filename> [separator] [encoding]" % sys.argv[0])
        print("The default separator is a newline. To separate by space, literally enter 'space' as the separator.")

/nix | Jul 06, 2011
