irods_capability_automated_ingest 0.5.0
iRODS Automated Ingest Framework
The automated ingest framework gives iRODS an enterprise solution that solves two major use cases: getting existing data under management and ingesting incoming data hitting a landing zone.
Based on the Python iRODS Client and Celery, this framework can scale up to match the demands of data coming off instruments, satellites, or parallel filesystems.
The example diagrams below show a filesystem scanner and a landing zone.
Usage options
Redis options
| option | effect | default |
| --- | --- | --- |
| redis_host | Domain or IP address of Redis host | localhost |
| redis_port | Port number for Redis | 6379 |
| redis_db | Redis DB number to use for ingest | 0 |
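If Redis is not running on the local machine, these options can be added to the start command described under "Using the sync script" below. A hypothetical example (hostname, source path, and collection are placeholders):
python -m irods_capability_automated_ingest.irods_sync start /data/src /tempZone/home/rods/data --redis_host redis.example.org --redis_port 6379 --redis_db 0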
S3 options
To scan an S3 bucket, at minimum provide --s3_keypair and a source path of the form /bucket_name/path/to/root/folder (see the example after the table below).
| option | effect | default |
| --- | --- | --- |
| s3_keypair | path to S3 keypair file | None |
| s3_endpoint_domain | S3 endpoint domain | s3.amazonaws.com |
| s3_region_name | S3 region name | us-east-1 |
| s3_proxy_url | URL to proxy for S3 access | None |
| s3_insecure_connection | Do not use SSL when connecting to S3 endpoint | False |
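For example, a hypothetical bucket scan through a non-default endpoint (the keypair path, bucket name, endpoint, and destination collection are placeholders):
python -m irods_capability_automated_ingest.irods_sync start /my-bucket/path/to/root/folder /tempZone/home/rods/s3_data --s3_keypair /path/to/keypair_file --s3_endpoint_domain s3.example.org --s3_region_name us-east-1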
Logging/Profiling options
| option | effect | default |
| --- | --- | --- |
| log_filename | Path to output file for logs | None |
| log_level | Minimum level of message to log | None |
| log_interval | Time interval with which to rollover ingest log file | None |
| log_when | Type/units of log_interval (see TimedRotatingFileHandler) | None |
The --profile option allows you to use vis to visualize a profile of the Celery workers over the course of an ingest job (see the example after the table below).
| option | effect | default |
| --- | --- | --- |
| profile_filename | Specify name of profile filename (JSON output) | None |
| profile_level | Minimum level of message to log for profiling | None |
| profile_interval | Time interval with which to rollover ingest profile file | None |
| profile_when | Type/units of profile_interval (see TimedRotatingFileHandler) | None |
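A hypothetical invocation that writes both a rotating log file and a profile (file paths are placeholders; log_level is assumed here to accept standard Python logging level names such as INFO):
python -m irods_capability_automated_ingest.irods_sync start /data/src /tempZone/home/rods/data --log_filename /var/log/ingest.log --log_level INFO --profile --profile_filename /tmp/ingest_profile.json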
Ingest start options
These options are used at the "start" of an ingest job.
| option | effect | default |
| --- | --- | --- |
| job_name | Reference name for ingest job | a generated uuid |
| interval | Restart interval (in seconds). If absent, will only sync once. | None |
| file_queue | Name for the file queue. | file |
| path_queue | Name for the path queue. | path |
| restart_queue | Name for the restart queue. | restart |
| event_handler | Path to event handler file | None (see "event_handler methods" below) |
| synchronous | Block until sync job is completed | False |
| progress | Show progress bar and task counts (must have --synchronous flag) | False |
| ignore_cache | Ignore last sync time in cache - like starting a new sync | False |
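For example, a hypothetical hourly sync that loads an event handler and blocks while showing progress (the paths, job name, and event handler file are placeholders):
python -m irods_capability_automated_ingest.irods_sync start /data/src /tempZone/home/rods/data --job_name hourly_scan --interval 3600 --event_handler /path/to/event_handler.py --synchronous --progress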
Optimization options
| option | effect | default |
| --- | --- | --- |
| exclude_file_type | types of files to exclude: regular, directory, character, block, socket, pipe, link | None |
| exclude_file_name | a list of space-separated python regular expressions defining the file names to exclude, such as "(\S+)exclude" "(\S+).hidden" | None |
| exclude_directory_name | a list of space-separated python regular expressions defining the directory names to exclude, such as "(\S+)exclude" "(\S+).hidden" | None |
| files_per_task | Number of paths to process in a given task on the queue | 50 |
| initial_ingest | Use this flag on initial ingest to avoid check for data object paths already in iRODS | False |
| irods_idle_disconnect_seconds | Seconds to hold open iRODS connection while idle | 60 |
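A hypothetical scan that skips symbolic links, temporary files, and hidden files, and batches more paths per Celery task (the regular expressions and paths are placeholders):
python -m irods_capability_automated_ingest.irods_sync start /data/src /tempZone/home/rods/data --exclude_file_type link --exclude_file_name "(\S+)\.tmp" "\.(\S+)" --files_per_task 200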
available --event_handler methods
| method | effect | default |
| --- | --- | --- |
| pre_data_obj_create | user-defined python | none |
| post_data_obj_create | user-defined python | none |
| pre_data_obj_modify | user-defined python | none |
| post_data_obj_modify | user-defined python | none |
| pre_coll_create | user-defined python | none |
| post_coll_create | user-defined python | none |
| pre_coll_modify | user-defined python | none |
| post_coll_modify | user-defined python | none |
| character_map | user-defined python | none |
| as_user | takes action as this iRODS user | authenticated user |
| target_path | set mount path on the irods server which can be different from client mount path | client mount path |
| to_resource | defines target resource request of operation | as provided by client environment |
| operation | defines the mode of operation | Operation.REGISTER_SYNC |
| max_retries | defines max number of retries on failure | 0 |
| timeout | defines seconds until job times out | 3600 |
| delay | defines seconds between retries | 0 |
Event handlers can use logger to write logs. See structlog for available logging methods and signatures.
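As a minimal sketch (the import paths and method signatures follow the pattern used by the examples in this repository, but should be verified against the installed version), an event handler combining several of these methods might look like:

from irods_capability_automated_ingest.core import Core
from irods_capability_automated_ingest.utils import Operation

class event_handler(Core):
    @staticmethod
    def operation(session, meta, **options):
        # copy new files into the vault and keep them in sync
        # (see the Operation mode table below)
        return Operation.PUT_SYNC

    @staticmethod
    def to_resource(session, meta, **options):
        # "ingestResc" is a placeholder resource name
        return "ingestResc"

    @staticmethod
    def post_data_obj_create(hdlr_mod, logger, session, meta, *args, **options):
        # structlog-style keyword arguments, as described above;
        # assumes meta carries the logical path under the "target" key
        logger.info("created data object", target=meta["target"])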
Operation mode
| operation | new files | updated files |
| --- | --- | --- |
| Operation.REGISTER_SYNC (default) | registers in catalog | updates size in catalog |
| Operation.REGISTER_AS_REPLICA_SYNC | registers first or additional replica | updates size in catalog |
| Operation.PUT | copies file to target vault, and registers in catalog | no action |
| Operation.PUT_SYNC | copies file to target vault, and registers in catalog | copies entire file again, and updates catalog |
| Operation.PUT_APPEND | copies file to target vault, and registers in catalog | copies only appended part of file, and updates catalog |
| Operation.NO_OP | no action | no action |
--event_handler usage examples can be found in the examples directory.
Character Mapping option
If an application requires that iRODS logical paths produced by the ingest process exclude certain Unicode characters, a character_map method returning a dict object can be added to the event handler. For example:
import re

from irods_capability_automated_ingest.core import Core

class event_handler(Core):
    @staticmethod
    def character_map():
        return {
            re.compile('[^a-zA-Z0-9]'): '_'
        }
    # ...
The returned dictionary, in this case, indicates that the ingest process should replace all non-alphanumeric (as
well as non-ASCII) characters with an underscore wherever they may occur in an otherwise normally generated logical path.
The substitution process also applies to the intermediate (i.e., collection name) elements in a logical path, and a suffix is
appended to affected path elements to avoid potential collisions with other remapped object names.
Each key of the returned dictionary indicates a character or set of characters needing substitution.
Possible key types include:
character
# substitute backslashes with underscores
'\\': '_'
tuple of characters
# any character of the tuple is replaced by a Unicode small script x
('\\','#','-'): '\u2093'
regular expression
# any character with a code point of 256 (U+0100) or above becomes an underscore
re.compile('[\u0100-\U0010ffff]'): '_'
callable accepting a character argument and returning a boolean
# ASCII codes above 'z' become ':'
(lambda c: ord(c) in range(ord('z')+1,128)): ':'
In the event that the order-of-substitution is significant, the method may instead return a list of key-value tuples.
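For example, a sketch of an ordered mapping (the patterns themselves are illustrative only), defined inside the same event_handler(Core) class shown above:

    @staticmethod
    def character_map():
        # a list of (key, value) tuples enforces the order of substitution:
        # first replace '#' and '%', then map any remaining non-ASCII character
        return [
            (('#', '%'), '_'),
            ((lambda c: not c.isascii()), '.'),
        ]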
UnicodeEncodeError
Any file whose filesystem path causes a UnicodeEncodeError exception to be raised during ingest (e.g. by the inclusion of an
unencodable UTF8 sequence) will be automatically renamed using a base64 sequence that represents the original,
unmodified vault path.
Additionally, data objects that have had their names remapped, whether pro forma or via a UnicodeEncodeError, will be
annotated with an AVU of the form
Attribute: "irods::automated_ingest::" + ANNOTATION_REASON
Value: A PREFIX plus the base64-converted "bad filepath"
Units: "python3.base64.b64encode(full_path_of_source_file)"
Where:
ANNOTATION_REASON is either "UnicodeEncodeError" or "character_map", depending on why the remapping occurred.
PREFIX is either "irods_UnicodeEncodeError_" or blank (""), again depending on the remapping cause.
Note that the UnicodeEncodeError type of remapping is unconditional, whereas the character remapping is contingent on
an event handler's character_map method being defined. Also, if a UnicodeEncodeError-style ingest is performed on a
given object, this precludes character mapping being done for the object.
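Objects annotated in this way can be located by querying on the attribute name with the python-irodsclient. A minimal sketch, assuming an existing iRODSSession named session (see the session example under "Manual Deployment" below) and the character_map annotation reason:

from irods.models import Collection, DataObject, DataObjectMeta

attr = "irods::automated_ingest::character_map"
query = session.query(Collection.name, DataObject.name, DataObjectMeta.value).filter(
    DataObjectMeta.name == attr)
for row in query:
    # the AVU value holds the prefix plus the base64-converted original name
    print(row[Collection.name] + "/" + row[DataObject.name], row[DataObjectMeta.value])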
Manual Deployment
Configure python-irodsclient environment
python-irodsclient (PRC) is used by the Automated Ingest tool to interact with iRODS. The configuration and client environment files used for a PRC application apply here as well.
If you are using PAM authentication, remember to use the Client Settings File.
iRODSSessions are instantiated using an iRODS client environment file. The client environment file used can be controlled with the IRODS_ENVIRONMENT_FILE environment variable. If no such environment variable is set, the file is expected to be found at ${HOME}/.irods/irods_environment.json. A secure connection can be made by making the appropriate configurations in the client environment file.
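For reference, a typical python-irodsclient session created from the client environment file looks like the following (standard PRC usage, not specific to this framework):

import os
from irods.session import iRODSSession

env_file = os.environ.get(
    "IRODS_ENVIRONMENT_FILE",
    os.path.expanduser("~/.irods/irods_environment.json"),
)
with iRODSSession(irods_env_file=env_file) as session:
    print(session.server_version)  # simple connectivity check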
Starting Redis Server
Install Redis server: https://redis.io/docs/latest/get-started
Starting the Redis server with package installation:
redis-server
Or, daemonized:
sudo service redis-server start
sudo systemctl start redis
The Redis GitHub page also describes how to build and run Redis: https://github.com/redis/redis?tab=readme-ov-file#running-redis
The Redis documentation also recommends an additional step:
Make sure to set the Linux kernel overcommit memory setting to 1. Add vm.overcommit_memory = 1 to /etc/sysctl.conf and then reboot or run the command sysctl vm.overcommit_memory=1 for this to take effect immediately.
This allows the Linux kernel to overcommit virtual memory even if this exceeds the physical memory on the host machine. See kernel.org documentation for more information.
Note: If running in a distributed environment, make sure Redis server accepts connections by editing the bind line in /etc/redis/redis.conf or /etc/redis.conf.
Setting up virtual environment
You may need to upgrade pip:
pip install --upgrade pip
Install virtualenv:
pip install virtualenv
Create a virtualenv with python3:
virtualenv -p python3 rodssync
Activate virtual environment:
source rodssync/bin/activate
Install this package
pip install irods_capability_automated_ingest
Set up environment for Celery:
export CELERY_BROKER_URL=redis://<redis host>:<redis port>/<redis db> # e.g. redis://127.0.0.1:6379/0
export PYTHONPATH=`pwd`
Start celery worker(s):
celery -A irods_capability_automated_ingest.sync_task worker -l error -Q restart,path,file -c <num workers>
Note: Make sure queue names match those of the ingest job (default queue names shown here).
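For instance, if a job is started with custom queue names, the worker must listen on those same queues (the queue names, paths, and worker count here are placeholders):
python -m irods_capability_automated_ingest.irods_sync start /data/src /tempZone/home/rods/data --restart_queue restart_a --path_queue path_a --file_queue file_a
celery -A irods_capability_automated_ingest.sync_task worker -l error -Q restart_a,path_a,file_a -c 4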
Using the sync script
Start sync job
python -m irods_capability_automated_ingest.irods_sync start <source dir> <destination collection>
List jobs
python -m irods_capability_automated_ingest.irods_sync list
Stop jobs
python -m irods_capability_automated_ingest.irods_sync stop <job name>
Watch jobs (same as using --progress)
python -m irods_capability_automated_ingest.irods_sync watch <job name>
Run tests
Note: The tests start and stop their own Celery workers, and they assume a clean Redis database.
python -m irods_capability_automated_ingest.test.test_irods_sync
See docker/ingest-test/README.md for how to run tests with Docker Compose.