
zeitgeist team mailing list archive

[Bug 778140] Re: Time warp problem in MOVE_EVENT handling

 

** Description changed:

  MOVE EVENTS
  ============================================
  
  PRESENTATION
  
  By definition, Zeitgeist's events are immutable, and the subject meta-data
  they contain is a snapshot of how a given resource was back when the event
  happened.
  
  To be useful, some way of linking event subjects to their physical
  representation is needed. The primary identifier for doing this is the
  subject's URI.
  
  However, URIs, especially local ones, are transient and may change. To solve
  this problem, a new field was added to subjects, and it is special in that
  it isn't considered to be immutable. This is the `current_uri' field.
  
  INITIAL IDEA
  
  When a subject is inserted, its `current_uri' field is initially set to the
  same value as its `uri' field. When Zeitgeist receives a MOVE_EVENT for that
  file (with a coherent timestamp), the value of `current_uri' is updated to
  its new file name.
  
  The idea here is that this is done in a way that, if we deleted the
  `current_uri' of all subjects and restored them looking at all MOVE_EVENTs
  in the database, the result would be the same as before.
  
  CURRENT IMPLEMENTATION
  
  As of now, `current_uri' is initially set to the same value as `uri'.
  Once a MOVE_EVENT is inserted, all events with a timestamp before that of the
  move are updated.
  
  However, after the point the MOVE_EVENT has been inserted, it is never
  considered again. This is so for performance reasons, since the initial plan
  would require pretty much "rebuilding the database".
  
  PROBLEMS
  
  There are numerous problems with this implementation, at least in theoretical
  situations.
  
  One problem is that of events coming in after the MOVE_EVENT (maybe because
  the application is batching them). In this case they won't be updated.
  
  We also have the opposite problem, a MOVE_EVENT coming in late after another
  conflicting MOVE_EVENT happened. For instance, we have the following events:
-  > T5 a.txt, T10 a.txt, T15 a.txt
+  > T5 a.txt, T10 a.txt, T15 a.txt
  We receive a first MOVE_EVENT from a.txt to b.txt with timestamp T7. Now we
  have (time / current_uri):
-  > T5 a.txt, T10 b.txt, T15 b.txt
+  > T5 a.txt, T10 b.txt, T15 b.txt
  Finally, we receive a further MOVE_EVENT from a.txt to c.txt with timestamp T0.
  The result is:
-  > T5 c.txt, T10 b.txt, T15 b.txt
+  > T5 c.txt, T10 b.txt, T15 b.txt
  This is totally inconsistent; the correct result would have been:
-  > T5 c.txt, T10 c.txt, T15 b.txt
+  > T5 c.txt, T10 c.txt, T15 b.txt
  
  Further, even if implemented as described in the "initial idea" section, the
  concept is flawed in that it may happen that events are inserted
  retrospectively using already their updated URI. This could give rise to
  further inconsistencies.
  
  PROPOSAL
  
  No clear way to avoid this problem is evident. Maybe the best idea is to
  formalize the current behavior by documenting it and requesting that MOVE
  and DELETE events be inserted near real time (for local files).
  
- ADDITIONAL PROPOSAL
+ OUTSTANDING ISSUES
  
- So far we haven't taken resource deletions into account at all. However,
- those also affect the URI of a resource, in that it ceases to exist (and
- may be subsequently reused for an unrelated resource).
+ a) Deletion of MOVE_EVENT
+ What happens upon deletion of a MOVE_EVENT? Should the current_uri changes be reverted?
  
- For this reason, I propose that DELETE_EVENTs also update `current_uri'. In
- particular, they should change said URI to "" (empty).
+ b) Insertion of other events
+ When inserting an event, should Zeitgeist check whether a MOVE_EVENT happened for that URI after the event's timestamp, and update it accordingly?
+ 
+ c) Directory
+ Should the insertion of a MOVE_EVENT with the renaming from "file:///home/user/dir1" to "file:///home/user/dir2" also update all events with uri "file:///home/user/dir1/*" to "file:///home/user/dir2/*"? I think so.
+ 
+ SEE ALSO
+ 
+ Related to this, please also check my proposal for improved DELETE_EVENT
+ handling in bug #954206.

** Description changed:

  MOVE EVENTS
  ============================================
  
  PRESENTATION
  
  By definition, Zeitgeist's events are immutable, and the subject meta-data
  they contain is a snapshot of how a given resource was back when the event
  happened.
  
  To be useful, some way of linking event subjects to their physical
  representation is needed. The primary identifier for doing this is the
  subject's URI.
  
  However, URIs, especially local ones, are transient and may change. To solve
  this problem, a new field was added to subjects, and it is special in that
  it isn't considered to be immutable. This is the `current_uri' field.
  
  INITIAL IDEA
  
  When a subject is inserted, its `current_uri' field is initially set to the
  same value as its `uri' field. When Zeitgeist receives a MOVE_EVENT for that
  file (with a coherent timestamp), the value of `current_uri' is updated to
  its new file name.
  
  The idea here is that this is done in a way that, if we deleted the
  `current_uri' of all subjects and restored them looking at all MOVE_EVENTs
  in the database, the result would be the same as before.
  
  CURRENT IMPLEMENTATION
  
  As of now, `current_uri' is initially set to the same value as `uri'.
  Once a MOVE_EVENT is inserted, all events with a timestamp before that of the
  move are updated.
  
  However, after the point the MOVE_EVENT has been inserted, it is never
  considered again. This is so for performance reasons, since the initial plan
  would require pretty much "rebuilding the database".
  
  PROBLEMS
  
  There are numerous problems with this implementation, at least in theoretical
  situations.
  
  One problem is that of events coming in after the MOVE_EVENT (maybe because
  the application is batching them). In this case they won't be updated.
  
  We also have the opposite problem, a MOVE_EVENT coming in late after another
  conflicting MOVE_EVENT happened. For instance, we have the following events:
   > T5 a.txt, T10 a.txt, T15 a.txt
  We receive a first MOVE_EVENT from a.txt to b.txt with timestamp T7. Now we
  have (time / current_uri):
   > T5 a.txt, T10 b.txt, T15 b.txt
  Finally, we receive a further MOVE_EVENT from a.txt to c.txt with timestamp T0.
  The result is:
   > T5 c.txt, T10 b.txt, T15 b.txt
  This is totally inconsistent; the correct result would have been:
   > T5 c.txt, T10 c.txt, T15 b.txt
  
  Further, even if implemented as described in the "initial idea" section, the
  concept is flawed in that it may happen that events are inserted
  retrospectively using already their updated URI. This could give rise to
  further inconsistencies.
  
  PROPOSAL
  
  No clear way to avoid this problem is evident. Maybe the best idea is to
  formalize the current behavior by documenting it and requesting that MOVE
  and DELETE events be inserted near real time (for local files).
  
  OUTSTANDING ISSUES
  
  a) Deletion of MOVE_EVENT
  What happens upon deletion of a MOVE_EVENT? Should the current_uri changes be reverted?
  
  b) Insertion of other events
  When inserting an event, should Zeitgeist check whether a MOVE_EVENT happened for that URI after the event's timestamp, and update it accordingly?
  
- c) Directory
+ c) Directories
  Should the insertion of a MOVE_EVENT with the renaming from "file:///home/user/dir1" to "file:///home/user/dir2" also update all events with uri "file:///home/user/dir1/*" to "file:///home/user/dir2/*"? I think so.
  
  SEE ALSO
  
  Related to this, please also check my proposal for improved DELETE_EVENT
  handling in bug #954206.

** Summary changed:

- Time warp problem in MOVE_EVENT handling
+ Improved MOVE_EVENT handling (was: Time warp problem)

-- 
You received this bug notification because you are a member of Zeitgeist
Framework Team, which is subscribed to Zeitgeist Framework.
https://bugs.launchpad.net/bugs/778140

Title:
  Improved MOVE_EVENT handling (was: Time warp problem)

Status in Zeitgeist Framework:
  Triaged

Bug description:
  MOVE EVENTS
  ============================================

  PRESENTATION

  By definition, Zeitgeist's events are immutable, and the subject meta-data
  they contain is a snapshot of how a given resource was back when the event
  happened.

  To be useful, some way of linking event subjects to their physical
  representation is needed. The primary identifier for doing this is the
  subject's URI.

  However, URIs, especially local ones, are transient and may change. To solve
  this problem, a new field was added to subjects, and it is special in that
  it isn't considered to be immutable. This is the `current_uri' field.

  INITIAL IDEA

  When a subject is inserted, its `current_uri' field is initially set to the
  same value as its `uri' field. When Zeitgeist receives a MOVE_EVENT for that
  file (with a coherent timestamp), the value of `current_uri' is updated to
  its new file name.

  The idea here is that this is done in a way that, if we deleted the
  `current_uri' of all subjects and restored them looking at all MOVE_EVENTs
  in the database, the result would be the same as before.

  CURRENT IMPLEMENTATION

  As of now, `current_uri' is initially set to the same value as `uri'.
  Once a MOVE_EVENT is inserted, all events with a timestamp before that of the
  move are updated.

  However, once the MOVE_EVENT has been inserted, it is never considered
  again. This is for performance reasons, since the initial plan would
  require pretty much "rebuilding the database".

  PROBLEMS

  There are numerous problems with this implementation, at least in theoretical
  situations.

  One problem is that of events coming in after the MOVE_EVENT (maybe because
  the application is batching them). In this case they won't be updated.

  We also have the opposite problem: a MOVE_EVENT arriving late, after another
  conflicting MOVE_EVENT has happened. For instance, we have the following events:
   > T5 a.txt, T10 a.txt, T15 a.txt
  We receive a first MOVE_EVENT from a.txt to b.txt with timestamp T7. Now we
  have (time / current_uri):
   > T5 a.txt, T10 b.txt, T15 b.txt
  Finally, we receive a further MOVE_EVENT from a.txt to c.txt with timestamp T0.
  The result is:
   > T5 c.txt, T10 b.txt, T15 b.txt
  This is totally inconsistent; the correct result would have been:
   > T5 c.txt, T10 c.txt, T15 b.txt
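
  The trace above can be reproduced with a minimal sketch. This is not
  Zeitgeist's actual code: `Event' and `apply_move' are hypothetical names,
  and the sketch assumes, as the trace implies, that inserting a MOVE_EVENT
  rewrites once the events timestamped after it whose `current_uri' still
  equals the old name, and that the move is never looked at again.

```python
from dataclasses import dataclass


@dataclass
class Event:
    timestamp: int
    uri: str          # immutable snapshot taken when the event happened
    current_uri: str  # mutable; starts out equal to uri


def apply_move(events, move_timestamp, old_uri, new_uri):
    """One-shot rewrite done when a MOVE_EVENT is inserted (hypothetical).

    Assumption: the comparison direction is taken from the worked trace,
    i.e. events later than the move that still carry old_uri are updated.
    """
    for event in events:
        if event.timestamp > move_timestamp and event.current_uri == old_uri:
            event.current_uri = new_uri


events = [Event(t, "a.txt", "a.txt") for t in (5, 10, 15)]

# First MOVE_EVENT: a.txt -> b.txt at T7.
apply_move(events, 7, "a.txt", "b.txt")
after_first = [e.current_uri for e in events]

# Late MOVE_EVENT: a.txt -> c.txt at T0. Only the T5 event still says a.txt.
apply_move(events, 0, "a.txt", "c.txt")
after_second = [e.current_uri for e in events]
```

  Because each move is applied once against whatever `current_uri' values
  happen to be stored at insertion time, the outcome depends on the order in
  which the moves arrive, not only on their timestamps.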

  Further, even if implemented as described in the "initial idea" section, the
  concept is flawed in that events may be inserted retrospectively, already
  using their updated URI. This could give rise to further inconsistencies.

  PROPOSAL

  No clear way to avoid this problem is evident. Maybe the best idea is to
  formalize the current behavior by documenting it and requesting that MOVE
  and DELETE events be inserted near real time (for local files).

  OUTSTANDING ISSUES

  a) Deletion of MOVE_EVENT
  What happens upon deletion of a MOVE_EVENT? Should the current_uri changes be reverted?

  b) Insertion of other events
  When inserting an event, should Zeitgeist check whether a MOVE_EVENT happened for that URI after the event's timestamp, and update it accordingly?

  c) Directories
  Should inserting a MOVE_EVENT that renames "file:///home/user/dir1" to "file:///home/user/dir2" also update all events with URI "file:///home/user/dir1/*" to "file:///home/user/dir2/*"? I think so.
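
  If directories were handled this way, the rewrite for a single URI could
  look roughly like the sketch below. `rewrite_under_directory' and the file
  name `notes.txt' are hypothetical, and the sketch only covers the string
  rewrite, not the timestamp conditions discussed above.

```python
def rewrite_under_directory(current_uri, old_dir, new_dir):
    """Map current_uri from old_dir to new_dir if it lies at or below old_dir.

    Hypothetical helper: matching is done on whole path components, so a
    sibling such as ".../dir10" is not mistaken for a child of ".../dir1".
    """
    old_dir = old_dir.rstrip("/")
    new_dir = new_dir.rstrip("/")
    if current_uri == old_dir:
        return new_dir
    if current_uri.startswith(old_dir + "/"):
        return new_dir + current_uri[len(old_dir):]
    return current_uri


moved = rewrite_under_directory(
    "file:///home/user/dir1/notes.txt",
    "file:///home/user/dir1",
    "file:///home/user/dir2",
)
untouched = rewrite_under_directory(
    "file:///home/user/dir10/notes.txt",
    "file:///home/user/dir1",
    "file:///home/user/dir2",
)
```

  Comparing against `old_dir + "/"' rather than the bare prefix is what keeps
  unrelated directories with a similar name from being rewritten.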

  SEE ALSO

  Related to this, please also check my proposal for improved
  DELETE_EVENT handling in bug #954206.

To manage notifications about this bug go to:
https://bugs.launchpad.net/zeitgeist/+bug/778140/+subscriptions

