A.I, Data and Software Engineering

Dumping Emails using JavaMail and jsoup


This post demonstrates a simple ETL process: dumping emails to text files for later processing, such as NLP or other ML models. We use the JavaMail API to fetch the emails and jsoup to extract the text from an email body when it is in HTML format.

FileUtils

This helper class contains a method that saves a string to a file.
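A minimal sketch of such a helper is shown below. The method name saveToFile, its parameters, and the choice of UTF-8 encoding are assumptions for illustration, not necessarily what the original class used.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: the name and signature are illustrative only.
final class FileUtils {
    private FileUtils() {
        // Utility class, not meant to be instantiated.
    }

    /**
     * Writes content to filePath as UTF-8. Missing parent directories are
     * created, and an existing file is overwritten.
     */
    static void saveToFile(String filePath, String content) {
        Path path = Paths.get(filePath);
        try {
            Path parent = path.toAbsolutePath().getParent();
            if (parent != null) {
                Files.createDirectories(parent);
            }
            Files.write(path, content.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            // Rethrow unchecked so callers do not have to declare IOException.
            throw new UncheckedIOException("Could not write " + filePath, e);
        }
    }
}
```

A call such as FileUtils.saveToFile("emails/0001.txt", bodyText) would then write one email per file, where bodyText is whatever text was extracted from the message.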

IMAP config

This class stores the credentials needed to log in to a mailbox over IMAP (Internet Message Access Protocol). The fromDate field stores the date from which we want to fetch emails.
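One way such a config class could look is sketched here. Apart from fromDate, every field and method name (host, port, username, password, toMailProperties) is an assumption. The property keys are the standard JavaMail ones for the imaps protocol.

```java
import java.util.Date;
import java.util.Properties;

// Hypothetical config holder: only fromDate is named in the post.
final class ImapConfig {
    private final String host;
    private final int port;
    private final String username;
    private final String password;
    private final Date fromDate;

    ImapConfig(String host, int port, String username, String password, Date fromDate) {
        this.host = host;
        this.port = port;
        this.username = username;
        this.password = password;
        // Defensive copy, because java.util.Date is mutable.
        this.fromDate = new Date(fromDate.getTime());
    }

    String getHost() { return host; }
    int getPort() { return port; }
    String getUsername() { return username; }
    String getPassword() { return password; }
    Date getFromDate() { return new Date(fromDate.getTime()); }

    /** Builds the connection properties for IMAP over SSL ("imaps"). */
    Properties toMailProperties() {
        Properties props = new Properties();
        props.setProperty("mail.store.protocol", "imaps");
        props.setProperty("mail.imaps.host", host);
        props.setProperty("mail.imaps.port", String.valueOf(port));
        return props;
    }
}
```

With JavaMail on the classpath, these properties would typically be passed to Session.getInstance(props), and the host, username, and password to Store.connect.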

Email Dumper class

You will need to import the JavaMail API and jsoup libraries using the accompanying links.

The EmailDumper class contains a single static method, check, which takes one input parameter, fromDate. To get only emails later than this date, we use this condition:
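The original condition is not reproduced here. A plausible form of the check is sketched below using plain java.util.Date values. The class and method names (DateFilter, isLaterThan) are hypothetical. In JavaMail, the first argument would come from Message.getReceivedDate(), which can return null when the date cannot be determined.

```java
import java.util.Date;

// Hypothetical helper illustrating the date condition.
final class DateFilter {
    private DateFilter() {
    }

    /**
     * Returns true only when receivedDate is known and strictly later
     * than fromDate. A null receivedDate is treated as "do not dump".
     */
    static boolean isLaterThan(Date receivedDate, Date fromDate) {
        return receivedDate != null && receivedDate.after(fromDate);
    }
}
```

Note that Date.after is strict, so a message received at exactly fromDate is skipped in this sketch.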

As some emails may be in HTML format, we use jsoup to extract the body as text before dumping to files.

The full code is as follows:

check email results

To run the program, add the code to the main method:

To sum up

Dumping 5,000+ emails with this code is not very efficient: it takes more than 2 hours. The process can be accelerated by using threads. However, you may need to pay attention to the request limits of some servers, such as 'googlemail'. Feel free to try it yourself. 🙂
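As a starting point for the threaded variant, here is a minimal sketch using a fixed-size thread pool from java.util.concurrent. The class name ParallelDumper and the method runAll are hypothetical, and each task stands in for fetching and saving one message. Keeping maxThreads small is one way to stay under a server's request limits.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper: runs independent dump tasks on a bounded pool.
final class ParallelDumper {
    private ParallelDumper() {
    }

    /**
     * Runs all tasks with at most maxThreads running concurrently and
     * returns their results in the same order as the input list.
     */
    static <T> List<T> runAll(List<Callable<T>> tasks, int maxThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
        try {
            List<T> results = new ArrayList<>();
            // invokeAll blocks until every task has completed.
            for (Future<T> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while dumping", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A dump task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Whether this helps in practice depends on how the mail server and the IMAP connection handle concurrent requests, so it is worth measuring with a small pool first.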
