
Re: Python API

Hi All,

I hope that I'm not arriving too late to this particular party, but I wanted to throw my opinion out as well. I think that having a central duplicity API would be fantastic, though I also appreciate that creating anything to meet the needs of all of the different projects that use it would likely prove impossible.

I also appreciate that there are practical difficulties. As Ken notes:

<<  It's not really modular at this point.  There are a few classes, but
most activity is via straight function calls.  An API would be handy,
move the contents of duplicity-bin into duplicity module, make classes
as needed, and leave a thin shell as the executable.>>

When I was researching backends for Time Drive, I was originally drawn to duplicity for a number of reasons. One of those was its very large number of backends, but an even larger reason was that it is written in Python. While I am a rather haphazard (and largely incompetent) programmer, I've always had good luck with Python. But having tried to build a pure Python GUI directly on the duplicity classes, I found that getting things to work can involve some interesting acrobatics.

And it doesn't always make sense. While putting Time Drive's interface together (mostly just copying functions from duplicity-bin and fiddling with the outputs), I found only a few instances in which interacting directly with the duplicity classes was more effective than passing the calls through the command line. Those were:

  1. Retrieving the files in the archive and parsing the directory
     structure.
  2. Retrieving the list of snapshot times.
  3. Deleting backup sets and returning the operation results.

In other cases, such as backing up or restoring a file, using the command-line interface was much faster and more convenient. It also has the additional benefit that you can run multiple jobs at the same time without worrying about crashing the program or about keeping track of (and resetting) the global variables.
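
For illustration, here is roughly what that command-line route looks like from a GUI: build the duplicity command, hand it to a subprocess, and collect the output. This is only a sketch; the helper name and the arguments are placeholders, and the one duplicity-specific assumption is that the passphrase can be passed through the PASSPHRASE environment variable (which is also what the attached module does).

import os
import subprocess

def run_duplicity(args, passphrase=None):
    """Run the duplicity command-line tool in a child process.

    args is a list such as ["/home/rob", "file:///media/backup"].
    Returns (exit_code, combined stdout/stderr as a string)."""
    env = dict(os.environ)
    if passphrase is not None:
        env["PASSPHRASE"] = passphrase
    pipe = subprocess.Popen(["duplicity"] + args, env=env,
                            stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    output = pipe.communicate()[0]
    return pipe.returncode, output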

This would change, though, if some of the non-modular parts could be reorganized. It would be brilliant to have a relatively high-level API that the command-line utility and a GUI could leverage in a uniform way.

<<  Logging should be integral with the pure module approach, possibly even
opened up as a resource in the API.  We may not actually use it for user
output, but for debugging, its vital.>>

Agreed. I would love to have a way of interacting with the log in a more cohesive fashion. I'm currently trying to implement a few simple logging methods to better catch duplicity errors and notify the user of problems, and if there were a way to cache or review a central duplicity log, that would be wonderful.
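
As a rough illustration of the sort of hook I have in mind: assuming duplicity.log stays layered on the standard logging module under a logger named "duplicity" (which is how I read the current code, so treat the name as an assumption), a GUI could attach its own handler and keep a buffer of messages for later review.

import logging

from duplicity import log

class LogBuffer(logging.Handler):
    """Collects duplicity log records so a GUI can display or review them."""
    def __init__(self):
        logging.Handler.__init__(self)
        self.records = []

    def emit(self, record):
        self.records.append(self.format(record))

log.setup()                                    # duplicity's usual log setup
log_buffer = LogBuffer()
log_buffer.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
logging.getLogger("duplicity").addHandler(log_buffer)  # assumed logger name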

<<  Yes, a formal API would be good, as I said above.  I'm already doing
Epydoc on a regular basis and would like to continue.  We can use that
to help formalize an API.>>

In designing such an API, I would love to see a combination of both high-level and low-level methods. Considering the needs of a GUI client, the ability to start a backup operation, restore a file, delete a backup set, or combine (coalesce) a series of incremental snapshots would be highly convenient.
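
To make that concrete, the kind of high-level surface I am picturing might look something like the sketch below. The class and method names are purely illustrative rather than a proposal for real signatures; the point is that a GUI or the command-line shell would hold one object per archive instead of touching module-level globals.

class DuplicityArchive(object):
    """Hypothetical high-level handle on a single backup archive."""

    def __init__(self, archive_url, passphrase=None):
        self.archive_url = archive_url
        self.passphrase = passphrase

    def backup(self, source_path, full=False, excludes=None):
        """Start a full or incremental backup of source_path."""
        raise NotImplementedError

    def restore(self, relative_path, destination, restore_time=None):
        """Restore a single file or directory to destination."""
        raise NotImplementedError

    def snapshots(self):
        """Return the snapshot times available in the archive."""
        raise NotImplementedError

    def delete_older_than(self, backup_time):
        """Remove backup sets older than backup_time."""
        raise NotImplementedError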

As a potential place to start discussions about what an API might look like and what needs it might fulfill, I've attached the Python module that I use for Time Drive. From trial and error (the way I tend to do all development), there seem to be six methods that get used most consistently (a short usage sketch follows the list):

   * get_snapshot_list : Retrieves a list of the snapshots that are
     currently available in a given archive.
   * delete_backup : Deletes the backup sets specified by archive_url
     and backup_time.
   * get_file_list : Retrieves a file iterator associated with a
     specified archive URL.
   * test_connection_url (which I need to overhaul) : Tests the
     connection to an archive to ensure that it is valid.
   * restore_file : Restores a file specified by a relative_path to its
     destination.
   * start_backup : Begins a new backup operation.

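For context, this is roughly how Time Drive calls into the attached module today. The import name duplicity_backend, the archive URL, and the passphrase are placeholders; only the function names and argument orders come from the attachment.

import duplicity_backend  # hypothetical name for the attached module

archive = "file:///media/backup/laptop"    # placeholder archive URL
passphrase = "not-my-real-passphrase"      # placeholder passphrase

# 1. Make sure the archive is reachable before doing anything else.
if duplicity_backend.test_connection_url(archive):
    # 2. List the available snapshots and the files currently in the archive.
    snapshots = duplicity_backend.get_snapshot_list(archive, passphrase)
    files = duplicity_backend.get_file_list(archive, passphrase)

    # 3. Restore a single file to a scratch directory.
    duplicity_backend.restore_file("Documents/notes.txt", archive,
        passphrase, "/tmp/restored")

    # 4. Kick off a new backup of the home directory.
    duplicity_backend.start_backup("/home/rob", archive, passphrase)
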
As noted above, additional methods to interact with the log and post updates to the user would also be nice. But even though those six methods cover Time Drive's needs fairly comprehensively, Time Drive is still a very young program. I'd be interested to hear what other things Michael and the others have used for their respective programs.

Cheers,

Rob Oakes
#!/usr/bin/env python
#
# Time Drive - based on duplicity - Encrypted bandwidth efficient backup.
#
# Copyright 2009 Rob Oakes	<LyX-Devel@oak-tree>
#
# Backend Methods that communicate with duplicity.  Because duplicity is
# a command line utility, it assumes that the entire environment will be 
# reset between each run.  The methods here have been slightly refactored so
# that multiple tasks can be performed within the same session.

import os
import os.path
import subprocess
import sys

from duplicity import backend
from duplicity import collections
from duplicity import commandline
from duplicity import diffdir
from duplicity import dup_time
from duplicity import file_naming
from duplicity import globals
from duplicity import gpg
from duplicity import log
from duplicity import path
import duplicity.backends
import duplicity.errors


def get_snapshot_list(archive_url, gnu_passphrase, sync_remote = True,
	S3_AccessId = None, S3_SecretKey = None, S3_BucketType = None):
	"""Retrieves a list of the snapshots that are currently available in a
	given archive."""
	
	if S3_AccessId != None and S3_SecretKey != None:
		set_AmazonS3(S3_AccessId, S3_SecretKey)
	
	# Set Global Variables for European Amazon S3 Buckets
	if S3_BucketType == "European":
		globals.s3_european_buckets = True
		globals.s3_use_new_style = True
	
	# Setup Log and Retrieve Appropriate Backend
	log.setup()
	duplicity.backend.import_backends()
	
	filelist_backend = backend.get_backend(archive_url)
	filelist_archive_cache = retrieve_archive_dir(archive_url)
	
	# Set global variables to appropriate values
	globals.archive_dir = filelist_archive_cache
	globals.gpg_profile = gpg.GPGProfile()
	globals.gpg_profile.passphrase = gnu_passphrase
	
	# Check to See if Local List Needs to Be Synced With Remote
	if sync_remote:
		sync_archive(filelist_backend, gnu_passphrase)
	
	col_stats = collections.CollectionsStatus(filelist_backend, 
		filelist_archive_cache).set_values()
		
	try:
		backup_list = col_stats.matched_chain_pair[1].to_log_info('')
	except:
		# No matched backup chains (e.g. an empty archive); clean up and bail out
		cleanup_globals()
		filelist_backend.close()
		log.shutdown()
		if S3_AccessId != None and S3_SecretKey != None:
			unset_AmazonS3()
		return None
	
	cleanup_globals()
	filelist_backend.close()
	log.shutdown()
	
	if S3_AccessId != None and S3_SecretKey != None:
		unset_AmazonS3()
	
	return backup_list


def delete_backup(archive_url, gnu_passphrase, backup_time, S3_AccessId = None,
	S3_SecretKey = None, S3_BucketType = None):
	"""Deletes the backup sets specified by archive_url and backup_time."""
	
	log.setup()
	duplicity.backend.import_backends()
	
	if S3_AccessId != None and S3_SecretKey != None:
		set_AmazonS3(S3_AccessId, S3_SecretKey)
	
	# Set Global Variables for European Amazon S3 Buckets
	if S3_BucketType == "European":
		globals.s3_european_buckets = True
		globals.s3_use_new_style = True
	
	filelist_backend = backend.get_backend(archive_url)
	filelist_archive_cache = retrieve_archive_dir(archive_url)
	
	# Set global variables to appropriate values
	globals.archive_dir = filelist_archive_cache
	globals.gpg_profile = gpg.GPGProfile()
	globals.gpg_profile.passphrase = gnu_passphrase
	
	col_stats = collections.CollectionsStatus(filelist_backend,
		filelist_archive_cache).set_values()
	result_sets, backup_chain = retrieve_backup_sets(col_stats, backup_time)
	
	try:
		for backup_set in result_sets:
			backup_set.delete()
		col_stats.set_values(sig_chain_warning = None).cleanup_signatures()
		operation_success = True
	except:
		operation_success = False
	
	# Restore Values to Default, Unset Amazon S3 and Close Log
	cleanup_globals()
	if S3_AccessId != None and S3_SecretKey != None:
		unset_AmazonS3()
	filelist_backend.close()
	log.shutdown()
	
	return operation_success


def retrieve_backup_sets(col_stats, backup_time):
	"""Retrieves the backup sets specified by archive_url and backup_time."""
	
	if not col_stats.all_backup_chains:
		backup_chain = None
		result_sets = []
	else:
		backup_chain = col_stats.get_backup_chain_at_time(backup_time)
		assert backup_chain, col_stats.all_backup_chains
		result_sets = backup_chain.get_all_sets()
	
	# Return the result sets along with the chain they came from
	return result_sets, backup_chain


def get_file_list(archive_url, gnu_passphrase, sync_remote = True, time = None,
	S3_AccessId = None, S3_SecretKey = None, S3_BucketType = None):
	"""
	Retrieve the file_list from the specified archive url.
	@param archive_url (string): valid duplicity archive url
	@param gnu_passphrase (string): the passphrase for the specified url
	@param sync_remote (bool): Default = True: Will first make sure that
		the local file list is in sync with the remote file list.
	@param time (string): Default = None: Specify the time from which to list
		files.  Date should be provided in the format year-month-date.
		Ex: 2009-04-10
	@param S3AccessId (string): Default = None: Amazon S3 Access Id.
	@param S3SecretKey (string): Default = None: Amazon S3 secret key.
	@rtype: path_iter
	@return: list of files contained in the archive.  Can be parsed
		with utils.ParseDirectoryStructure.
	"""
	
	log.setup()
	duplicity.backend.import_backends()
	
	# Set Amazon S3 AccessId and Secret Key (if applicable)
	if S3_AccessId != None and S3_SecretKey != None:
		set_AmazonS3(S3_AccessId, S3_SecretKey)
	
	# Set Global Variables for European Amazon S3 Buckets
	if S3_BucketType == "European":
		globals.s3_european_buckets = True
		globals.s3_use_new_style = True
	
	# Create a Duplicity Backend Object and Link to the Cache Directory
	filelist_backend = backend.get_backend(archive_url)
	filelist_archive_cache = retrieve_archive_dir(archive_url)
	
	# Set Appropriate Global Options
	globals.archive_dir = filelist_archive_cache
	globals.gpg_profile = gpg.GPGProfile()
	globals.gpg_profile.passphrase = gnu_passphrase
	
	# Check archive sync with remote, set snapshot from which to
	# list files, if applicable set Amazon S3 Access Key and Id
	
	if sync_remote:
		sync_archive(filelist_backend, gnu_passphrase)
	if time != None:
		globals.restore_time = dup_time.genstrtotime(time)
	
	# Retrieve Current Collection Status
	col_stats = collections.CollectionsStatus(filelist_backend, 
		filelist_archive_cache).set_values()
	
	# Fetch the Filelist and File Listing
	path_iter = list_current(col_stats)
	
	# Restore Values to Default and Close Log
	cleanup_globals()
	if S3_AccessId != None and S3_SecretKey != None:
		unset_AmazonS3()
		
	filelist_backend.close()
	log.shutdown()
	
	return path_iter


def test_connection_url(archive_url, S3_AccessId = None, 
	S3_SecretKey = None, S3_BucketType = None):
	"""Test the remote conneciton to ensure that it is valid"""
	
	log.setup()
	duplicity.backend.import_backends()
	
	# Set Appropriate Environment Variables for Amazon S3
	if S3_AccessId != None and S3_SecretKey != None:
		set_AmazonS3(S3_AccessId, S3_SecretKey)
	
	if S3_BucketType == "European":
		globals.s3_european_buckets = True
		globals.s3_use_new_style = True
	
	try:
		filelist_backend = backend.get_backend(str(archive_url))
		filelist_backend.list()
		tstResult = True
	except:
		tstResult = False
	
	# Unset Amazon S3 Environment Variables
	if S3_AccessId != None and S3_SecretKey != None:
		unset_AmazonS3()
		
	log.shutdown()
	return tstResult

def restore_file(relative_path, archive_url, gnu_passphrase, destination, 
	restore_time = None, S3_AccessId = None, S3_SecretKey = None, S3_BucketType = None):
	"""Restores the file specified by relative_path to destination"""
	
	# Set the Amazon AccessId and Secret Key (if applicable)
	if S3_AccessId != None and S3_SecretKey != None:
		set_AmazonS3(S3_AccessId, S3_SecretKey)
	
	set_Passphrase(gnu_passphrase)
	
	# Set Global Variables for European Amazon S3 Buckets
	if S3_BucketType == "European":
		S3_Bucket = "--s3-european-buckets --s3-use-new-style "
	else:
		S3_Bucket = ""
	
	# Build the restore command; the two cases differ only by the time flag
	if restore_time != None:
		timeCmd = "--restore-time '" + restore_time + "' "
	else:
		timeCmd = ""
	backupCmd = ("duplicity " + S3_Bucket + timeCmd
		+ "--file-to-restore '" + relative_path + "' "
		+ "'" + archive_url + "' "
		+ "'" + destination + '/' + os.path.basename(str(relative_path)) + "'")
	
	result = _execute_output(str(backupCmd))

	if S3_AccessId != None and S3_SecretKey != None:
		unset_AmazonS3()
	unset_Passphrase()

	return result


def start_backup(path, archive_url, gnu_passphrase, ExcludeList = None, 
	FullBackupInterval = None, S3_AccessId = None, S3_SecretKey = None, 
	S3_BucketType = None):
	"""Begins a new backup operation for the specified path and archive url.
	Supports exclusions (ExcludeList, Default = None) and setting how often 
	a new full backup should be made (FullBackupInterval, Default = None)"""

	# Build the exclusion flags; default to an empty string so the command
	# below still works when no exclusions are supplied
	exclude = ""
	if ExcludeList != None:
		for item, itemtype in ExcludeList:
			if itemtype == 1:
				exclude = exclude + "--exclude-regexp " + "'" + item + "' "
			if itemtype == 2 or itemtype == 3:
				exclude = exclude + "--exclude " + "'" + item + "' "
	
	if FullBackupInterval != None and FullBackupInterval > 0:
		backupInterval = "--full-if-older-than " + str(FullBackupInterval) + "D "
	else:
		backupInterval = ""
		
	if S3_BucketType == "European":
		S3_Bucket = "--s3-european-buckets --s3-use-new-style "
	else:
		S3_Bucket = ""

	if gnu_passphrase is None:
		encryption = "--no-encryption "
	else:
		set_Passphrase(gnu_passphrase)
		encryption = ""
	backupCmd = ("duplicity " + backupInterval + encryption + exclude
		+ "'" + path.rstrip() + "'" + " '" + archive_url + "'")
	
	if S3_AccessId != None and S3_SecretKey != None:
		set_AmazonS3(S3_AccessId, S3_SecretKey)

	result = _execute_output(str(backupCmd))
	
	if S3_AccessId != None and S3_SecretKey != None:
		unset_AmazonS3()
	
	if gnu_passphrase != None:
		unset_Passphrase()

	return result


def cleanup_globals():
	"""Returns global variables to default states after action has finished."""
	
	globals.gpg_profile = None
	globals.backend = None
	globals.archive_dir = os.path.expandvars("$XDG_CACHE_HOME/duplicity")
	globals.restore_time = None
	globals.s3_use_new_style = False
	globals.s3_european_buckets = False


def retrieve_archive_dir(archive_url):
	"""
	@param archive_url: URL to archive
	@returns: Path to the directory where the local archive cache is stored.
	"""
	base_url = os.path.expandvars("$XDG_CACHE_HOME/duplicity")
	backup_name = commandline.generate_default_backup_name(archive_url)
	archive_expanded_url = commandline.expand_fn(os.path.join(base_url, backup_name))
	
	if not os.path.exists(archive_expanded_url):
		# Create the local cache directory if it does not already exist
		try:
			os.makedirs(archive_expanded_url)
		except OSError:
			pass
	
	archive_dir = path.Path(archive_expanded_url)
	
	if not archive_dir.isdir():
		log.FatalError(_("Specified archive directory '%s' does not exist, "
			"or is not a directory") % (archive_dir.name,),
			log.ErrorCode.bad_archive_dir)
	
	return archive_dir


def list_current(col_stats):
	"""
	Returns a list of the files that are currently in the archive.
	The list is generated by examining the signature only.
	@type col_stats: CollectionStatus object
	@param col_stats: collection status
	
	@rtype: path_iter (duplicity class)
	@return: path_iter which contains the contents of the specified file collection
	"""
	sig_chain = check_sig_chain(col_stats)
	if not sig_chain:
		log.Notice(_("No signature data found, unable to list files."))
		return None
	time = globals.restore_time # if None, will use latest
	path_iter = diffdir.get_combined_path_iter(sig_chain.get_fileobjs(time))
	
	return path_iter


def check_sig_chain(col_stats):
	"""
	Get the last signature chain for an incremental backup, or None if none is available.
	
	@type col_stats: CollectionStatus object
	@param col_stats: collection status
	"""
	if not col_stats.matched_chain_pair:
		if globals.incremental:
			log.FatalError(_("Fatal Error: Unable to start incremental backup.  "
				"Old signatures not found and incremental specified"),
				log.ErrorCode.inc_without_sigs)
	
		else:
			log.Warn(_("No signatures found, switching to full backup."))
		return None
	return col_stats.matched_chain_pair[0]
	

def sync_archive(backend, gnu_passphrase):
	"""
	Synchronize the local archive manifest files and signature chains with the
	remote archive.  Copy missing files from remote to local as needed so that
	the local cache matches remote storage.
	
	@type backend: backend class
	@param backend: duplicity backend for the remote archive
	
	@type gnu_passphrase: string
	@param gnu_passphrase: the encryption passphrase for the archive
	
	@rtype: void
	@return: void
	"""
	suffixes = [".g", ".gpg", ".z", ".gz"]
	
	def get_metafiles(filelist):
		"""
		Return metafiles of interest from the file list.
		Files of interest are:
		  sigtar - signature files
		  manifest - manifest files
		Files excluded are:
		  non-duplicity files
		  duplicity partial files
		  
		@rtype: list
		@return: list of duplicity metadata files
		"""
		metafiles = {}
		need_passphrase = False
		for fn in filelist:
			pr = file_naming.parse(fn)
			if not pr:
				continue
			if pr.partial:
				continue
			if pr.encrypted:
				need_passphrase = True
			if pr.type in ["full-sig", "new-sig"] or pr.manifest:
				base, ext = os.path.splitext(fn)
				if ext in suffixes:
					metafiles[base] = fn
				else:
					metafiles[fn] = fn
		return metafiles, need_passphrase
		
	def copy_raw(src_iter, filename):
		"""
		Copy data from src_iter to the file at filename.
		"""
		block_size = 128 * 1024
		target = open(filename, "wb")
		while True:
			try:
				data = src_iter.next(block_size).data
			except StopIteration:
				break
			target.write(data)
		target.close()
	
	def resolve_basename(fn, backend):
		"""
		@return: (parsedresult, local_name, remote_name)
		"""
		pr = file_naming.parse(fn)
		if pr.manifest:
			suffix = file_naming.get_suffix(globals.encryption, False)
		else:
			suffix = file_naming.get_suffix(globals.encryption, not globals.encryption)
		rem_name = fn + suffix
		
		if pr.manifest:
			suffix = file_naming.get_suffix(False, False)
		else:
			suffix = file_naming.get_suffix(False, True)
		loc_name = fn + suffix
		
		return (pr, loc_name, rem_name)
		
	def remove_local(fn):
		pr, loc_name, rem_name = resolve_basename(fn, backend)
		
		del_name = globals.archive_dir.append(loc_name).name
		log.Notice(_("Deleting local %s (not authoritative at backend).") % del_name)
		os.unlink(del_name)
	
	def copy_to_local(fn, backend):
		"""
		Copy remote file fn to local cache.
		"""
		class Block:
			"""
			Data block to return from SrcIter
			"""
			def __init__(self, data):
				self.data = data
		
		class SrcIter:
			"""
			Iterate over source and return Block of data.
			"""
			def __init__(self, fileobj):
				self.fileobj = fileobj
				
			def next(self, size):
				try:
					res = Block(self.fileobj.read(size))
				except:
					log.FatalError(_("Failed to read %s: %s") %
						(self.fileobj.name, sys.exc_info()),
						log.ErrorCode.generic)
				
				if not res.data:
					self.fileobj.close()
					raise StopIteration
				return res
			
			def get_footer(self):
				return ""
			
		log.Notice(_("Copying %s to local cache.") % fn)
		
		pr, loc_name, rem_name = resolve_basename(fn, backend)
			
		fileobj = backend.get_fileobj_read(rem_name)
		src_iter = SrcIter(fileobj)
		if pr.manifest:
			copy_raw(src_iter, globals.archive_dir.append(loc_name).name)
		else:
			gpg.GzipWriteFile(src_iter, globals.archive_dir.append(loc_name).name, 
				size = sys.maxint)
	
	# get remote metafile list
	remlist = backend.list()
	remote_metafiles, rem_needpass = get_metafiles(remlist)
	
	# get local metafile list
	loclist = globals.archive_dir.listdir()
	local_metafiles, loc_needpass = get_metafiles(loclist)
	
	if rem_needpass or loc_needpass:
		globals.gpg_profile.passphrase = gnu_passphrase
	
	# we have the list of metafiles on both sides. remote is always
	# authoritative. figure out which are local spurious (should not
	# be there) and missing (should be there but are not).
	local_keys = local_metafiles.keys()
	remote_keys = remote_metafiles.keys()
	
	local_missing = []
	local_spurious = []
	
	for key in remote_keys:
		if not key in local_keys:
			local_missing.append(key)
	
	for key in local_keys:
		if not key in remote_keys:
			local_spurious.append(key)
			
	# finally finish the process
	if not local_missing and not local_spurious:
		log.Notice(_("Local and Remote metadata are synchronized, no sync needed."))
	else:
		local_missing.sort()
		local_spurious.sort()
		if not globals.dry_run:
			log.Notice(_("Synchronizing remote metadata to local cache..."))
			for fn in local_spurious:
				remove_local(fn)
			for fn in local_missing:
				copy_to_local(fn, backend)
		else:
			if local_missing:
				log.Notice(_("Sync would copy the following from remote to local:")
					+ "\n" + "\n".join(local_missing))
			if local_spurious:
				log.Notice(_("Sync would remove the following spurious local files:")
					+ "\n" + "\n".join(local_spurious))

def _execute(cmd, callback = None, user_data = None):
	"""Run cmd in a shell.  With no callback, fall back to os.system();
	otherwise stream each line of stdout to callback(line, user_data).
	Returns the command's exit status."""
	if callback is None:
		ret_val = os.system(cmd)
	else:
		pipe = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE)
		
		while True:
			line = pipe.stdout.readline()
			if len(line) == 0:
				break
			callback(line.strip(), user_data)
		
		pipe.stdout.close()
		ret_val = pipe.wait()
	
	return ret_val

def _execute_output(cmd, callback = None, user_data = None):
	"""Run cmd in a shell and return its stdout as a single string,
	optionally streaming each line to callback(line, user_data)."""
	output = ''
	pipe = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE)
	
	while True:
		line = pipe.stdout.readline()
		if len(line) == 0:
			break
		output = output + line
		if callback is not None:
			callback(line.strip(), user_data)
	
	pipe.stdout.close()
	pipe.wait()
	
	return output


def set_AmazonS3(S3_AccessId, S3_SecretKey):
	os.environ["AWS_ACCESS_KEY_ID"] = str(S3_AccessId)
	os.environ["AWS_SECRET_ACCESS_KEY"] = str(S3_SecretKey)

def unset_AmazonS3():
	# Deleting from os.environ also unsets the variable in the real
	# environment; os.unsetenv() alone would leave os.environ stale.
	for key in ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"):
		if key in os.environ:
			del os.environ[key]

def set_Passphrase(passphrase):
	os.environ["PASSPHRASE"] = str(passphrase)

def unset_Passphrase():
	if "PASSPHRASE" in os.environ:
		del os.environ["PASSPHRASE"]
	
