tinyapps.org / blog


Extracting email addresses from a file

Back in 2002, bookface-ga kindly replied to my question about using awk to extract email addresses from a file:
Given a simple address book (ab.txt) with poor formatting:

john doe john@example.com 555-111-1212
sally@example.co.uk 555-555-1212 sally doe
515-1212   joe blow    joe@example.info
jane doe     jane@example.com       bob doe  bob@example.com
etc...

How can just the email addresses be extracted to a new file using awk?

$ awk '
    # Print every whitespace-separated field that contains an "@"
    {
        for (i = 1; i <= NF; i++) {
            if ($i ~ /@/) {
                print $i
            }
        }
    }
' ab.txt
john@example.com
sally@example.co.uk
joe@example.info
jane@example.com
bob@example.com
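
For comparison, here is a rough Python sketch of the same idea: split each line on whitespace and keep any field containing an "@". The function name fields_with_at and the inline sample list are my own placeholders, not part of the original answer.

```python
def fields_with_at(lines):
    """Yield every whitespace-separated field that contains '@'."""
    for line in lines:
        # str.split() with no arguments splits on runs of whitespace,
        # much like awk's default field splitting.
        for field in line.split():
            if "@" in field:
                yield field


# Sample lines copied from ab.txt above (hypothetical in-memory stand-in).
sample = [
    "john doe john@example.com 555-111-1212",
    "sally@example.co.uk 555-555-1212 sally doe",
    "515-1212   joe blow    joe@example.info",
    "jane doe     jane@example.com       bob doe  bob@example.com",
]

emails = list(fields_with_at(sample))
print("\n".join(emails))
```

Like the awk version, this only works when every address is already set off by whitespace.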

While simply extracting fields containing the "at" symbol suited that data set perfectly, Patrick Mylund Nielsen's EmailsFromFile is far more comprehensive: "It follows a regular expression pattern based on the RFC 2822 standard and should thus return all valid email addresses regardless of how they appear in the file." That means even a jumbled mess like this:

john@example.com,sally@example.co.uk|8135551212/some random info
joe@example.info;sue@example.museum 42!42!42!

is parsed perfectly:

$ emailsfromfile.py sloppy_file.txt
joe@example.info
sue@example.museum
john@example.com
sally@example.co.uk
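
To see why the field-based approach falls short on this kind of input, the short sketch below applies the same "split on whitespace, keep fields with an @" test to the two jumbled lines. The variable names are placeholders of mine. Because the addresses are joined by commas, pipes, and semicolons rather than spaces, each line comes back as one fused token instead of separate addresses.

```python
# The two lines of the jumbled sample file, held in memory for illustration.
sloppy = [
    "john@example.com,sally@example.co.uk|8135551212/some random info",
    "joe@example.info;sue@example.museum 42!42!42!",
]

# Same test as the awk one-liner: whitespace-separated fields containing "@".
fused = [field for line in sloppy for field in line.split() if "@" in field]

for field in fused:
    print(field)
```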

EmailsFromFile is licensed under the WTFPL, and is reproduced below for posterity. (Be sure not to miss Patrick's other tools; Windows admins will especially appreciate Failsafe MSI, a "shell script that enables and starts the Windows Installer service in safe mode".)

UPDATE: emailregex.com has loads of other examples, including this surprisingly accurate grep one-liner:
$ grep -E -o "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b" filename.txt
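
The same pattern can be tried from Python with re.findall, which, like grep -o, returns only the matching parts. This is a minimal sketch, separate from the EmailsFromFile script reproduced below; the names SIMPLE_EMAIL_RE and find_emails are my own.

```python
import re

# The pattern from the grep one-liner above, unchanged.
SIMPLE_EMAIL_RE = re.compile(
    r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b"
)


def find_emails(text):
    """Return every substring of text that matches the simple pattern."""
    return SIMPLE_EMAIL_RE.findall(text)


text = (
    "john@example.com,sally@example.co.uk|8135551212/some random info\n"
    "joe@example.info;sue@example.museum 42!42!42!\n"
)
found = find_emails(text)
print("\n".join(found))
```

One caveat with this pattern: the {2,6} limit means addresses whose final domain label is longer than six letters are not matched at all.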

#!/usr/bin/env python
'''
  emailsfromfile.py -- Get all unique email addresses from a file

  by Patrick Mylund Nielsen
  http://patrickmylund.com/projects/emailsfromfile/

  License: WTFPL (http://sam.zoy.org/wtfpl/)
'''

__version__ = '1.1'

import sys
import os
import re
import codecs

# Regular expression matching according to RFC 2822 (http://tools.ietf.org/html/rfc2822)
rfc2822_re = r"""(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])"""
email_prog = re.compile(rfc2822_re, re.IGNORECASE)

def isEmailAddress(string):
    return email_prog.match(string)

def main(filename, separator='\n', encoding=None):
    separator_replace = {
        'space': ' ',
        'newline': '\n',
    }
    if not os.path.isfile(filename):
        raise IOError("%s is not a file." % filename)
    results = set()
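    # Fall back to UTF-8 when no encoding is given (an assumption; the original
    # passed None, which yields bytes on Python 3 and breaks the str regex).
    with codecs.open(filename, 'r', encoding or 'utf-8') as f:
        for line in f:
            results.update(email_prog.findall(line))
    # items() works on both Python 2 and 3; iteritems() is Python 2 only.
    for k, v in separator_replace.items():
        separator = separator.replace(k, v)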
    print(separator.join(results))

if __name__ == '__main__':
    args = len(sys.argv) - 1
    if 0 < args < 4:
        main(*sys.argv[1:])
    else:
        print("Usage: python %s <filename> [separator] [encoding]" % sys.argv[0])
        print("The default separator is a newline. To separate by space, literally enter 'space' as the separator.")

/nix | Jul 06, 2011

