[Python] Script to extract text from emails [Archive]

psimem

06-02-2017, 08:32

As the title says, I need a hand creating a script that can:
1- access a given label in a Gmail mailbox
2- list its contents (mail subjects)
3- select only the new mails
4- extract a portion of text from the body of the new mails
5- append that portion of text to a text file
6- mark the opened mails as read

I should say up front that I don't know Python, but it was recommended to me as the best solution.
Following a few examples I managed to put together the code below, which covers points 1-2-3-6.

#!/usr/bin/env python

import datetime
import email
import imaplib
import mailbox

print '----------------------------'
print 'Login...'
mail = imaplib.IMAP4_SSL('imap.gmail.com')
(retcode, capabilities) = mail.login('emailaddress', 'password')
mail.list()
mail.select('inbox')
print 'Success.'
print '----------------------------'

n=0
(retcode, messages) = mail.search(None, '(UNSEEN)')
if retcode == 'OK':

    for num in messages[0].split() :
        #print 'Processing...'
        n=n+1
        typ, data = mail.fetch(num,'(RFC822)')

        for response_part in data:
            if isinstance(response_part, tuple):
                original = email.message_from_string(response_part[1])
                print original['From']
                print original['Subject']
                typ, data = mail.store(num,'+FLAGS','\\Seen')
        print '----------------------------'

print "Unread: "
print n
print '----------------------------'

The next step would be to extract the body of the mail as well and process it... :help:

psimem

09-02-2017, 10:37

Ok, after a few attempts I managed to take another step forward.
Right now I have a file.txt containing a lot of text that I don't need.
What I need is to extract the links, as in the sample below:

Tested just uploaded a video
Halftime Drones - This is Only a Test 386 - 2/9/17
http://www.youtube.com/watch?v=zDb_rotMvho&feature=em-uploademail
You can unsubscribe from notifications for this user by visiting the
subscription center:
http://www.youtube.com/subscription_manager
Tested just uploaded a video
This is only an example
http://www.youtube.com/watch?v=hdte5qlc9rb&feature=em-uploademail
You can unsubscribe from notifications for this user by visiting the
subscription center:
http://www.youtube.com/subscription_manager

So as to end up with a text file containing only:

http://www.youtube.com/watch?v=zDb_rotMvho
http://www.youtube.com/watch?v=hdte5qlc9rb

What do you think, can it be done?

I forgot the updated script :stordita: :

#!/usr/bin/env python

import datetime
import email
import imaplib
import mailbox

print '----------------------------'
print 'Login...'
mail = imaplib.IMAP4_SSL('imap.gmail.com')
try:
    mail.login('youremail', 'yourpassword')
    print "Success."
except imaplib.IMAP4.error:
    print "LOGIN FAILED!!! "
    # ... exit or deal with failure...
mail.list() # Lists all labels in GMail
mail.select('YTsubs') # Select YTsubs folder
print '----------------------------'

n=0
(retcode, messages) = mail.search(None, '(UNSEEN)')
if retcode == 'OK':

    for num in messages[0].split() :
        #print 'Processing...'
        n=n+1
        typ, data = mail.fetch(num,'(RFC822)')
        print "Processing %s." % n

        for response_part in data:
            if isinstance(response_part, tuple):
                original = email.message_from_string(response_part[1])
                #print original
                print original['From']
                print original['Subject']
                myfile = open("Extractedemail.txt", 'a')
                #myfile.write("original\n")
                myfile.write("---------- Indentation line ---------- \n %s \n" % original)
                myfile.close()
                typ, data = mail.store(num,'+FLAGS','\\Seen')
        print '----------------------------'

print "Total processed: %s." % n
print '----------------------------'

psimem

09-02-2017, 16:20

Ok, thanks for the suggestion. I did in fact find a couple of cases on Google similar to mine, but different enough that I couldn't adapt them to my needs :rolleyes:
This (http://stackoverflow.com/questions/9222106/how-to-extract-information-between-two-unique-words-in-a-large-text-file) is the example that comes closest to my case.

psimem

10-02-2017, 07:34

Here (http://stackoverflow.com/questions/9222106/how-to-extract-information-between-two-unique-words-in-a-large-text-file)

Xfree

10-02-2017, 10:23

Well, it seems clear that you need to adapt the regular expression so that it captures what you are interested in from your text.
Looking at the code you pasted, and assuming the YouTube URLs you capture always look like that, you can see that the structure is fixed:
first there is the http part (I suppose there may optionally be an s as well),
then www. followed by youtube, then watch?v= followed by the video id, and finally a parameter section that starts with &.
There are several examples on Stack Overflow that you can adapt more easily. ;)
Or, if you don't want to get into regex, you can take a more brute-force approach with the startswith method, since the URL begins with http://www.youtube.com/watch?v=, and then split the string using & as the delimiter and take the first part.
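
A minimal sketch of the two approaches just described, written for Python 3 (the scripts in this thread are Python 2). The function names `extract_with_regex` and `extract_with_split` are made up for illustration, and the pattern assumes the links always look like the ones in the sample mail:

```python
import re

# Assumed URL shape, taken from the sample mail: "http" with an optional "s",
# then www.youtube.com/watch?v= and the video id (everything up to "&" or whitespace).
WATCH_RE = re.compile(r'https?://www\.youtube\.com/watch\?v=[^&\s]+')


def extract_with_regex(text):
    """Return every watch URL found in text, without the trailing &parameters."""
    return WATCH_RE.findall(text)


def extract_with_split(text):
    """Same result without regex: startswith() plus a split on '&'."""
    prefixes = ('http://www.youtube.com/watch?v=',
                'https://www.youtube.com/watch?v=')
    links = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(prefixes):
            # Keep only the part before the first "&"
            links.append(line.split('&')[0])
    return links


sample = """Tested just uploaded a video
Halftime Drones - This is Only a Test 386 - 2/9/17
http://www.youtube.com/watch?v=zDb_rotMvho&feature=em-uploademail
You can unsubscribe from notifications for this user by visiting the
subscription center:
http://www.youtube.com/subscription_manager
"""

print(extract_with_regex(sample))
print(extract_with_split(sample))
```

Both functions skip the subscription_manager link because it does not contain watch?v=. The regex version also finds links that are not at the start of a line, which the startswith version does not.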

psimem

10-02-2017, 16:34

Thanks Xfree for the tip!
In fact the simplest approach (from what I can see regex is very versatile but just as complicated) is to go through the file line by line:

with open('Extractedemail.txt') as inf:
    for line in inf:
        if 'youtube.com/watch?v=' in line:
            print "Got one!!!"
            with open("results.txt", "a") as myfile:
                myfile.write(line)

But I've hit another difficulty: I don't understand why, but some of the addresses I need to extract are formatted differently already in the source mail :mad: I'll have to get to the bottom of this.

Xfree

10-02-2017, 17:01

The advantage of a well-made regex is precisely that it is more versatile, especially for this kind of problem, at the cost of some initial difficulty in writing it. :D
Formatted differently in what sense? Is the URL format itself different, or does it contain spaces you don't expect while the structure stays the same?

psimem

10-02-2017, 18:38

The advantage of a well-made regex is precisely that it is more versatile, especially for this kind of problem, at the cost of some initial difficulty in writing it. :D

Yeah, I know, but given my skills/time I'll settle for this for now :fagiano:

Formatted differently in what sense? Is the URL format itself different, or does it contain spaces you don't expect while the structure stays the same?

For now I've found out that the search key "youtube.com/watch?v=" doesn't always work...

...in fact, comparing two raw mails I noticed that in the first one the link text is readable:

Content-Type: text/plain; charset=UTF-8; format=flowed; delsp=yes

Global Cycling Network just uploaded a video
Am I Overtraining? | Ask GCN Anything About Cycling
http://www.youtube.com/watch?v=rTbWMK42xiM&feature=em-uploademail
You can unsubscribe from notifications for this user by visiting the
subscription center:
http://www.youtube.com/subscription_manager

while in the other one, instead of the readable link, this is what I see:

Content-Type: text/plain; charset=UTF-8; format=flowed; delsp=yes
Content-Transfer-Encoding: base64

WmV2aWsganVzdCB1cGxvYWRlZCBhIHZpZGVvDQoiQXNzYXNzaW7igJlzIENyZWVkOiBVbml0eSIg
U29sbyBXYWxrdGhyb3VnaCwgQ28tb3AgTWlzc2lvbiAjODogRGFudG9uJ3MgIA0KU2FjcmlmaWNl
ICsgQWxsIFN5bmMgUG9pbnRzDQpodHRwOi8vd3d3LnlvdXR1YmUuY29tL3dhdGNoP3Y9N1Y1Nml0
bTBxblEmZmVhdHVyZT1lbS11cGxvYWRlbWFpbA0KWW91IGNhbiB1bnN1YnNjcmliZSBmcm9tIG5v
dGlmaWNhdGlvbnMgZm9yIHRoaXMgdXNlciBieSB2aXNpdGluZyB0aGUgIA0Kc3Vic2NyaXB0aW9u
IGNlbnRlcjoNCmh0dHA6Ly93d3cueW91dHViZS5jb20vc3Vic2NyaXB0aW9uX21hbmFnZXINCg==
:confused:

Digging into it I found out that the text was encoded with base64; in fact, pasting it into a decoding site (https://www.base64decode.org/) turns it into readable text.
Now there is just one question: how do I decode it in Python?
Thinking it over, well, make that two: why does Google send two different types of mail?

psimem

10-02-2017, 18:51

Ok, for base64 decoding:

import base64

coded_string = '''WmV2aWsganVzdCB1cGxvYWRlZCBhIHZpZGVvDQoiQXNzYXNzaW7igJlzIENyZWVkOiBVbml0eSIg
U29sbyBXYWxrdGhyb3VnaCwgQ28tb3AgTWlzc2lvbiAjODogRGFudG9uJ3MgIA0KU2FjcmlmaWNl
ICsgQWxsIFN5bmMgUG9pbnRzDQpodHRwOi8vd3d3LnlvdXR1YmUuY29tL3dhdGNoP3Y9N1Y1Nml0
bTBxblEmZmVhdHVyZT1lbS11cGxvYWRlbWFpbA0KWW91IGNhbiB1bnN1YnNjcmliZSBmcm9tIG5v
dGlmaWNhdGlvbnMgZm9yIHRoaXMgdXNlciBieSB2aXNpdGluZyB0aGUgIA0Kc3Vic2NyaXB0aW9u
IGNlbnRlcjoNCmh0dHA6Ly93d3cueW91dHViZS5jb20vc3Vic2NyaXB0aW9uX21hbmFnZXINCg=='''

print base64.b64decode(coded_string)

Zevik just uploaded a video
"Assassin’s Creed: Unity" Solo Walkthrough, Co-op Mission #8: Danton's
Sacrifice + All Sync Points
http://www.youtube.com/watch?v=7V56itm0qnQ&feature=em-uploademail
You can unsubscribe from notifications for this user by visiting the
subscription center:
http://www.youtube.com/subscription_manager

All of this, however, complicates my life (or the script, if you prefer) quite a bit.
I find this difference in formatting between the two mails strange :confused:

psimem

11-02-2017, 17:19

From what I understand, the base64 mails have to do with the non-ASCII characters they contain.
Now let's see if I can get myself out of this :rolleyes:
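
One possible way out, sketched for Python 3 and not taken from the scripts in this thread: the standard `email` package can undo the transfer encoding itself, so a plain mail and a base64 mail can go through the same code path. `get_payload(decode=True)` returns the decoded bytes of a part whatever its Content-Transfer-Encoding is. The helper name `body_text` and the small test message are made up for illustration:

```python
import base64
import email
import re

WATCH_RE = re.compile(r'https?://www\.youtube\.com/watch\?v=[^&\s]+')


def body_text(msg):
    """Return the decoded text/plain content of an email message object."""
    chunks = []
    for part in msg.walk():
        if part.get_content_type() != 'text/plain':
            continue
        # decode=True undoes base64 / quoted-printable and returns bytes
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or 'utf-8'
        chunks.append(payload.decode(charset, errors='replace'))
    return '\n'.join(chunks)


# Build a small base64-encoded message to try the helper on.
plain = ('Zevik just uploaded a video\r\n'
         'http://www.youtube.com/watch?v=7V56itm0qnQ&feature=em-uploademail\r\n')
raw = ('Subject: test\r\n'
       'Content-Type: text/plain; charset=UTF-8\r\n'
       'Content-Transfer-Encoding: base64\r\n'
       '\r\n'
       + base64.b64encode(plain.encode('utf-8')).decode('ascii') + '\r\n')

msg = email.message_from_string(raw)
print(WATCH_RE.findall(body_text(msg)))
```

With `imaplib` on Python 3 the fetched data is bytes, so the message would be built with `email.message_from_bytes(response_part[1])` instead of `message_from_string`.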

psimem

11-02-2017, 19:39

Ok, after some assorted tinkering here is the script with 98% of the features I'm interested in (only a few finishing touches are missing):
#!/usr/bin/env python

# this script may not work in windows
# this script may require GNU-gcp if you want so (still working on it)

import datetime
import email
import imaplib
import mailbox
import base64
import os
import subprocess
import statvfs
import shutil
import glob
import sys

myusbdrive='path_to_usb_drive'
mailaddress='############@gmail.com'
mailpass='password'

print '----------------------------'
print 'Python script pytytd.'
print '----------------------------'
print 'Login...'
mail = imaplib.IMAP4_SSL('imap.gmail.com')
try:
    mail.login(mailaddress, mailpass) # login gmail
    print "Success."
except imaplib.IMAP4.error: # exit or deal with failure
    print "LOGIN FAILED!!! "
mail.list() # Lists all labels in GMail
mail.select('YTsubs') # Select YTsubs folder
print '----------------------------'
nmail=0
(retcode, messages) = mail.search(None, '(UNSEEN)')
if retcode == 'OK':
    for num in messages[0].split() :
        nmail=nmail+1
        typ, data = mail.fetch(num,'(RFC822)')
        print "Processing %s." % nmail
        for response_part in data:
            if isinstance(response_part, tuple):
                original = email.message_from_string(response_part[1])
                #print original
                print original['From']
                print original['Subject']
                myfile = open("extractedemail.txt", 'a')
                myfile.write("%s \n" % original)
                myfile.close()
                typ, data = mail.store(num,'+FLAGS','\\Seen')
        print '----------------------------'
print "Total mail processed: %s." % nmail
print '----------------------------'
if nmail != 0: # if there are no new mails just copy mp4 files
    nlink=0 # how many direct links have been found
    nbase64=0 # internal counter to select encoded lines
    nbase64tot=0 # how many encoded links have been found
    with open('extractedemail.txt') as extrmail:
        for line in extrmail:
            if nbase64 != 0:
                nbase64=nbase64+1
            if 'Subject:' in line: # find uploader name
                #nlink=nlink+1
                print "Got a name!"
                with open("results.txt", "a") as myfile:
                    myfile.write(line)
            if 'http://www.youtube.com/watch' in line: # find direct video link
                nlink=nlink+1
                print "Got a link!"
                with open("results.txt", "a") as myfile:
                    myfile.write(line)
            if line.startswith('Content-Transfer-Encoding: base64'): # find encoded video link
                nbase64=nbase64+1
                print "Got a base64!"
                #print nbase64
                nbase64tot=nbase64tot+1 # keeps track of how many encoded links have been found
            if 3 <= nbase64 <= 8: # selecting the 6 lines which are part of the encoded text
                print "Got a new line base64!"
                with open("ebase64.txt", "a") as myfile: # writes every line of the encoded text to a file
                    myfile.write(line)
            if nbase64 == 9:
                with open('ebase64.txt', 'r') as myfile:
                    ebase64=myfile.read().replace('\n', '')
                dbase64 = base64.b64decode(ebase64) # decodes the encoded text
                #print dbase64
                os.remove('ebase64.txt') # removes the encoded file
                with open("dbase64.txt", "a") as myfile: # writes decoded text to a file
                    myfile.write(dbase64)
                with open("dbase64.txt", "r") as myfile: # reads every line of the decoded text...
                    for line in myfile:
                        if 'http://www.youtube.com/watch' in line: # ...to find the video link
                            with open("results.txt", "a") as myfile: # writes the video link to the results file
                                myfile.write(line)
                os.remove('dbase64.txt') # removes the decoded text
                nbase64=0 # zeroes the nbase64 lines counter
    os.remove('extractedemail.txt') # removes the extracted mail text
    print '----------------------------'
    print "Total mail processed: %s." % nmail
    print "Total direct link processed: %s." % nlink
    print "Total encoded link processed: %s." % nbase64tot
    print '----------------------------'
    with open("results.txt", "r") as res: # creates a clean list file
        for line in res:
            if line.startswith('http://www.youtube.com/watch'):
                with open("list", "a") as myfile:
                    myfile.write(line)
    print 'Checking if enough free space on disk...'
    f = os.statvfs("/home")
    #print "preferred block size", "=>", f[statvfs.F_BSIZE]
    #print "fundamental block size", "=>", f[statvfs.F_FRSIZE]
    #print "total blocks", "=>", f[statvfs.F_BLOCKS]
    #print "total free blocks", "=>", f[statvfs.F_BFREE]
    #print "available blocks", "=>", f[statvfs.F_BAVAIL]
    #print "total file nodes", "=>", f[statvfs.F_FILES]
    #print "total free nodes", "=>", f[statvfs.F_FFREE]
    #print "available nodes", "=>", f[statvfs.F_FAVAIL]
    #print "max file name length", "=>", f[statvfs.F_NAMEMAX]
    if f[statvfs.F_BAVAIL] <= 200000: # if there is not enough space exit (2GB)
        print 'Not enough free space on disk.'
        raise SystemExit # and keep results and list files
    else:
        print 'Ok!'
    print '----------------------------'
    print 'Downloading videos:'
    subprocess.call(['/bin/bash', '-i', '-c', 'ytd-ln']) # calls the bash alias of youtube-dl for a list of files
    os.remove('results.txt') # removes the results file
    os.remove('list') # removes the list file
    print '----------------------------'
nvideofile=0
for fname in os.listdir('.'): # just looking for video files to decide to proceed or not with the script
    if fname.endswith('.mp4'): # takes care of video files
        nvideofile=nvideofile+1
if nvideofile != 0:
    print 'Checking if enough free space on usb disk...'
    if not os.path.exists(myusbdrive):
        print 'No usb disk present. Deal with it!'
        print '----------------------------'
        raise SystemExit
    f = os.statvfs(myusbdrive)
    if f[statvfs.F_BAVAIL] <= 200000: # if there is not enough space exit (2GB)
        print 'Not enough free space on usbdisk.'
        raise SystemExit # and keep results and list files
    else:
        print 'Ok!'
    #mydir = os.path.join(os.getcwd(), datetime.datetime.now().strftime('%Y-%m-%d_%H-%M-%S'))
    myusbdir = os.path.join(datetime.datetime.now().strftime('%Y-%m-%d'))
    completeusbpath=myusbdrive+'/video/'+myusbdir
    if not os.path.exists(completeusbpath):
        print 'Making new directory %s.' % myusbdir
        os.makedirs(completeusbpath)
    else:
        print 'Directory %s already present.' % myusbdir
    print 'Looking for video files to copy...'
    nvideofile=0
    for fname in os.listdir('.'):
        if fname.endswith('.mp4'): # takes care of video files
            nvideofile=nvideofile+1
            for mp4_file in glob.iglob('*.mp4'): # copy mp4 files to usbdir
                shutil.copy2(mp4_file, completeusbpath)
            for mp4_file in glob.iglob('*.mp4'): # remove mp4 files
                os.remove(mp4_file)
            #print 'Done!'
        else:
            print 'No more video files found.\r',
        if fname.endswith('.vtt'): # takes care of subtitles files
            for vtt_file in glob.iglob('*.vtt'): # copy subtitles files to usbdir
                shutil.copy2(vtt_file, completeusbpath)
            for vtt_file in glob.iglob('*.vtt'): # remove subtitles files
                os.remove(vtt_file)
    print '\nMoved %s files.' %nvideofile
    print 'Done!'
    #print 'Done!'
print '----------------------------'
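
A note on the free-space check, with a hedged sketch for Python 3: `F_BAVAIL` is a count of blocks, so how many bytes 200000 blocks represent depends on the filesystem's block size. The `statvfs` module used above also no longer exists in Python 3. One alternative there is `shutil.disk_usage`, which reports bytes directly. The helper name `enough_free_space` is made up, and the threshold is simply taken from the "(2GB)" comment in the script:

```python
import shutil


def enough_free_space(path, required_bytes):
    """Return True if the filesystem holding path has at least required_bytes free."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (all in bytes)
    return usage.free >= required_bytes


TWO_GB = 2 * 1024 ** 3  # threshold from the "(2GB)" comment in the script

if enough_free_space('.', TWO_GB):
    print('Ok!')
else:
    print('Not enough free space on disk.')
```

The same helper could be called with the USB drive path for the second check.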

Forgive any ugliness, but given that until 5 days ago I didn't even know which way up Python went, it doesn't seem too bad :ciapet:
The only thing that still bugs me is the plaintext password inside the script used to log in to Gmail: any ideas?

psimem

15-02-2017, 11:39

For the password I found out that python-keyring can be used:

import keyring

#keyring.set_password("systemname", "username", "yourpasswordhere") # uncomment this line the first time you run this script, to store the password
mailpass = keyring.get_password("systemname", "username")

psimem

18-02-2017, 15:22

I'm stuck:
with the following command I have no problems:

subprocess.call(['/bin/bash', '-i', '-c', 'youtube-dl --default-search auto -f 18 --write-sub --sub-lang en -o "/home/user/Downloads/%(uploader)s - %(upload_date)s - %(title)s -- %(id)s.%(ext)s" -R 20 -ica /home/user/Downloads/list && exit'])

whereas if I substitute the variable, as in the following command:

mypath = "/home/user/Downloads/"
subprocess.call(['/bin/bash', '-i', '-c', 'youtube-dl --default-search auto -f 18 --write-sub --sub-lang en -o "%s%(uploader)s - %(upload_date)s - %(title)s -- %(id)s.%(ext)s" -R 20 -ica %slist && exit' % mypath])

there is probably something wrong with the syntax of the command, since the error returned is:

Traceback (most recent call last):
  File "pytytd_v07", line 140, in <module>
    subprocess.call(['/bin/bash', '-i', '-c', 'youtube-dl --default-search auto -f 18 --write-sub --sub-lang en -o "%s%(uploader)s - %(upload_date)s - %(title)s -- %(id)s.%(ext)s" -R 20 -ica %slist && exit' % mypath])
TypeError: format requires a mapping

How can I do this? :D
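
A likely explanation, given as a sketch and not as a fix tested against the real script: with the `%` operator Python reads `%(uploader)s` as one of its own named placeholders and therefore expects a dictionary on the right-hand side, hence "format requires a mapping". The string also contains two `%s`, so it needs two values. Two possible ways around it are shown below on a shortened version of the command (several youtube-dl options are left out for brevity); the variable names `cmd_percent`, `cmd_join`, `template` and `list_file` are made up:

```python
import os

mypath = "/home/user/Downloads/"

# Option 1: double every percent sign that belongs to youtube-dl, so that
# Python's % operator only fills in the two %s placeholders.
cmd_percent = ('youtube-dl -o "%s%%(uploader)s - %%(title)s -- %%(id)s.%%(ext)s" '
               '-ica %slist' % (mypath, mypath))

# Option 2: leave the youtube-dl template untouched and join the pieces,
# so no Python formatting is applied to it at all.
template = os.path.join(mypath, '%(uploader)s - %(title)s -- %(id)s.%(ext)s')
list_file = os.path.join(mypath, 'list')
cmd_join = 'youtube-dl -o "' + template + '" -ica ' + list_file

print(cmd_percent)
print(cmd_join)
```

Both variables end up holding the same command string, with `%(uploader)s` and the other fields passed through literally for youtube-dl to expand.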