A.I, Data and Software Engineering

Dumping Emails using JavaMail and jsoup

This post demonstrates an ETL process by dumping emails to text files for later processing, such as NLP or other ML models. We use the JavaMail API to fetch the emails and jsoup to extract the text from an email body when it is in HTML format.

FileUtils

This helper class contains a single method that saves a string to a file.

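import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;

public class FileUtils {
	// Writes content to the given path, creating parent directories if needed.
	public static void saveToFile(String file, String content) throws IOException {
		File f = new File(file);
		File parent = f.getParentFile();
		if (parent != null) { // null when the path has no directory component
			parent.mkdirs();
		}
		// PrintWriter creates (or truncates) the file; UTF-8 keeps non-ASCII text intact.
		// Failures propagate to the caller as IOException instead of being swallowed.
		try (PrintWriter out = new PrintWriter(f, "UTF-8")) {
			out.println(content);
		}
	}
}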

IMAP config

This class stores the credentials used to log in to the mailbox over IMAP (Internet Message Access Protocol). fromDate holds the date from which we want to fetch emails.

public class EMAIL_SERVER_SETUP {
	public static final String fromDate = "2019/11/01";
	public static final String USERNAME = "youremail@mail.com";
	public static final String PASSWORD = "******";//your email pwd
	public static final String HOST = "imap.googlemail.com";
	public static final String FOLDER = "inbox";
}
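
Hard-coded passwords are easy to leak when the code is shared. As a minimal sketch of an alternative, the class below reads the same settings from environment variables and falls back to the defaults used in this post. The class name and the variable names (MAIL_HOST, MAIL_USERNAME, MAIL_FOLDER) are assumptions made for illustration, not part of the original code.

```java
import java.util.Map;

// Hypothetical alternative to EMAIL_SERVER_SETUP: looks settings up in a map
// (normally System.getenv()) and falls back to a default when a key is unset.
class EmailServerEnv {
    static String get(Map<String, String> env, String key, String fallback) {
        String value = env.get(key);
        return (value == null || value.isEmpty()) ? fallback : value;
    }

    public static void main(String[] args) {
        Map<String, String> env = System.getenv();
        // Variable names are illustrative; pick whatever fits your setup.
        String host = get(env, "MAIL_HOST", "imap.googlemail.com");
        String user = get(env, "MAIL_USERNAME", "youremail@mail.com");
        String folder = get(env, "MAIL_FOLDER", "inbox");
        // The password would be read the same way; it is deliberately not printed.
        System.out.println(user + " @ " + host + " / " + folder);
    }
}
```

Passing the map as a parameter keeps the lookup easy to exercise without touching the real environment.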

Email Dumper class

You will need to add the JavaMail API and jsoup libraries to your project using the accompanying links.

import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Properties;
import javax.mail.Folder;
import javax.mail.Message;
import javax.mail.MessagingException;
import javax.mail.Session;
import javax.mail.Store;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import com.sun.mail.imap.IMAPFolder;

The EmailDumper class contains a single static method, check, which takes one parameter, fromDate. To keep only the emails sent after this date, we use this condition:

if (msg.getSentDate().after(fromDate)) {
    //dump this email
}
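
One detail worth keeping in mind: Date.after is a strict comparison, and a message without a Date header has no sent date to compare at all. The sketch below isolates the comparison in a small helper so both cases are explicit; the class and method names are assumptions made for illustration.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

class SentDateFilter {
    // True only when sentDate is present and strictly later than fromDate.
    static boolean isAfter(Date sentDate, Date fromDate) {
        return sentDate != null && sentDate.after(fromDate);
    }

    public static void main(String[] args) throws ParseException {
        SimpleDateFormat format = new SimpleDateFormat("yyyy/MM/dd");
        Date fromDate = format.parse("2019/11/01");
        System.out.println(isAfter(format.parse("2019/11/02"), fromDate)); // true
        System.out.println(isAfter(format.parse("2019/11/01"), fromDate)); // false (same instant)
        System.out.println(isAfter(null, fromDate));                        // false (no sent date)
    }
}
```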

As some emails may be in HTML format, we use jsoup to extract the body as text before dumping to files.

// isMimeType compares the MIME type case-insensitively, so "text/html" and "TEXT/HTML" both match
if (msg.isMimeType("text/html")) {
   Document document = Jsoup.parse(body);
   body = document.body().text();
}
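
Content-Type values are case-insensitive, so the same kind of message may be reported as TEXT/HTML by one server and text/html by another. Below is a minimal sketch, using only string handling, of a check that accepts both. The class and method names are illustrative; JavaMail's own Part.isMimeType offers a comparable case-insensitive match.

```java
import java.util.Locale;

class ContentTypes {
    // contentType is the raw header value, e.g. "text/html; charset=UTF-8".
    static boolean isHtml(String contentType) {
        if (contentType == null) {
            return false;
        }
        return contentType.trim().toLowerCase(Locale.ROOT).startsWith("text/html");
    }

    public static void main(String[] args) {
        System.out.println(isHtml("TEXT/HTML; charset=utf-8"));   // true
        System.out.println(isHtml("text/html"));                  // true
        System.out.println(isHtml("text/plain; charset=UTF-8"));  // false
    }
}
```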

The full code is as follows:

public class EmailDumper {
   /**
    * Dumps every email sent after the given date to a text file.
    * @param fromDate only emails sent after this date are dumped
    * @throws MessagingException if the mailbox cannot be read
    * @throws IOException if a message body or an output file cannot be read or written
    */
   public static void check(Date fromDate) throws MessagingException, IOException {
      SimpleDateFormat dateFormat = new SimpleDateFormat("yyyyMMdd");
      IMAPFolder folder = null;
      Store store = null;
      String subject = null;
      String body = null;
      String receivedDate = null;
      try {
         Properties props = System.getProperties();
         props.setProperty("mail.store.protocol", "imaps");
         Session session = Session.getDefaultInstance(props, null);
         store = session.getStore("imaps");
         store.connect(EMAIL_SERVER_SETUP.HOST, EMAIL_SERVER_SETUP.USERNAME, EMAIL_SERVER_SETUP.PASSWORD);
         folder = (IMAPFolder) store.getFolder(EMAIL_SERVER_SETUP.FOLDER);
         if (!folder.isOpen())
            folder.open(Folder.READ_ONLY);
         Message[] messages = folder.getMessages();
         System.out.println("No of Messages : " + folder.getMessageCount());
         System.out.println("No of Unread Messages : " + folder.getUnreadMessageCount());
         System.out.println(messages.length);
         // Walk from the newest message (last index) back to the oldest (index 0)
         for (int i = messages.length - 1; i >= 0; i--) {
            Message msg = messages[i];
            Date sentDate = msg.getSentDate();
            if (sentDate == null) {
               continue; // no Date header: skip this message instead of failing
            }
            if (sentDate.after(fromDate)) {
               System.out.print("Processing msg: " + i + ": ");
               receivedDate = dateFormat.format(msg.getReceivedDate());
               subject = msg.getSubject();
               // Note: for multipart messages getContent() returns a Multipart object,
               // so toString() yields the body text only for single-part messages
               body = msg.getContent().toString();
               if (msg.isMimeType("text/html")) {
                  // Strip the markup and keep only the visible text
                  Document document = Jsoup.parse(body);
                  body = document.body().text();
               }
               FileUtils.saveToFile("./emails/" + receivedDate + subject + ".txt", body);
               System.out.println("Done!");
            } else {
               // Messages are visited newest first, so stop at the first one that is too old
               break;
            }
         }
      } finally {
         if (folder != null && folder.isOpen()) {
            folder.close(false); // false: do not expunge deleted messages
         }
         if (store != null) {
            store.close();
         }
      }
   }
}
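
The file name above is built directly from the email subject, which can be missing or can contain characters such as "/" or ":" that are not valid in file names on some systems. A minimal sketch of a sanitizing step follows; the class name, the replacement character, the "no_subject" fallback, and the 100-character limit are all assumptions made for illustration.

```java
class FileNames {
    // Replaces characters that are illegal or awkward in file names with '_'
    // and caps the length; a missing or blank subject becomes "no_subject".
    static String sanitize(String subject) {
        if (subject == null || subject.trim().isEmpty()) {
            return "no_subject";
        }
        String cleaned = subject.trim().replaceAll("[\\\\/:*?\"<>|\\p{Cntrl}]", "_");
        return cleaned.length() > 100 ? cleaned.substring(0, 100) : cleaned;
    }

    public static void main(String[] args) {
        System.out.println(sanitize("Re: Q3/Q4 report?")); // Re_ Q3_Q4 report_
        System.out.println(sanitize(null));                // no_subject
    }
}
```

With such a helper, the subject would be passed through sanitize before it is concatenated into the path given to FileUtils.saveToFile.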

Check email results

To run the program, add the code to the main method:

public static void main(String[] args) throws MessagingException, IOException, ParseException {
   EmailDumper.check(new SimpleDateFormat("yyyy/MM/dd").parse(EMAIL_SERVER_SETUP.fromDate));
}

Sample output:

No of Messages : 5613
No of Unread Messages : 1675
5613
Processing msg: 5612: Done!
Processing msg: 5611: Done!
Processing msg: 5610: Done!
Processing msg: 5609: Done!
...

To sum up

Dumping 5k+ emails with this code is not really efficient, as it takes more than 2 hours. The process can be accelerated by using threads. However, you may need to pay attention to the request limits of some servers, such as googlemail. Feel free to try it yourself. 🙂
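
As a rough sketch of the threading idea, the code below spreads per-message work over a small fixed thread pool. The dumpOne method is a stand-in that only returns a status line; a real version would fetch and save one message there. Whether this actually speeds up IMAP access depends on the server and on how connections are shared between threads, which is not shown here. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelDump {
    // Stand-in for fetching and saving one message; returns a status line.
    static String dumpOne(int index) {
        return "Processing msg: " + index + ": Done!";
    }

    // Runs dumpOne for indexes [0, count) on a fixed pool and returns
    // the results in index order.
    static List<String> dumpAll(int count, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (int i = 0; i < count; i++) {
                final int index = i;
                Callable<String> task = () -> dumpOne(index);
                futures.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                results.add(future.get()); // blocks until that task has finished
            }
            return results;
        } finally {
            pool.shutdown(); // stop accepting work; submitted tasks still complete
        }
    }

    public static void main(String[] args) throws Exception {
        for (String line : dumpAll(4, 2)) {
            System.out.println(line);
        }
    }
}
```

A small pool size is a deliberate choice here, since opening many parallel requests is what tends to trigger server-side limits.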
