Peter Dietz

Friday, May 22, 2015

Changing DSpace domain name

At Longsight, we've been hosting a DSpace site at saylor.longsight.com for over a year. The .longsight.com domain is an internal placeholder location for spinning up a new site, since we control this domain space. When it comes time for a DSpace site to have a proper domain name, such as library.saylor.org, there are a few steps.

1) Set up SSL
Get the SSL .crt and .key, add them to your nginx server, and configure your /etc/nginx/conf.d/.conf

You can test that the server is listening for the proper hostname on ports 80 and 443 by editing your local development computer's /etc/hosts:
IP.OF.WEB.SERVER library.saylor.org

Ask Saylor's sysadmin to add a CNAME record pointing library.saylor.org to saylor.longsight.com

2) Change all mentions of saylor.longsight.com to library.saylor.org in the config directory for this instance.
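For step 2, here is a minimal sketch of how the search-and-replace over the config directory could be scripted. The function name and the sample file are my own illustration, not part of DSpace; it assumes the config files are UTF-8 text and skips anything that isn't. Run it against a copy or a version-controlled directory first.

```python
import tempfile
from pathlib import Path

OLD_HOST = "saylor.longsight.com"
NEW_HOST = "library.saylor.org"


def replace_hostname(config_dir, old=OLD_HOST, new=NEW_HOST):
    """Rewrite every text file under config_dir that mentions `old`.

    Returns the sorted list of paths that were changed.
    """
    changed = []
    for path in sorted(Path(config_dir).rglob("*")):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except UnicodeDecodeError:
            continue  # assumption: non-UTF-8 files are binary, leave them alone
        if old in text:
            path.write_text(text.replace(old, new), encoding="utf-8")
            changed.append(path)
    return changed


if __name__ == "__main__":
    # Demonstrate on a throwaway directory instead of a real DSpace config.
    with tempfile.TemporaryDirectory() as tmp:
        cfg = Path(tmp) / "dspace.cfg"
        cfg.write_text("dspace.hostname = saylor.longsight.com\n", encoding="utf-8")
        print(replace_hostname(tmp))
        print(cfg.read_text(encoding="utf-8"))
```

The returned list doubles as a checklist of files to review before restarting the instance.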

3) Write a SQL query to change the site url in all the handle metadata.

select * from metadatavalue where text_value like '%saylor.longsight.com%';

14000+ results

select * from metadatavalue where text_value like '%library.saylor.org%';

0 results

BEGIN;

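-- Only touch rows that actually mention the old hostname.
update metadatavalue set text_value = replace(text_value, 'https://saylor.longsight.com', 'https://library.saylor.org') where text_value like '%https://saylor.longsight.com%';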

COMMIT;

select * from metadatavalue where text_value like '%saylor.longsight.com%';

0 results

select * from metadatavalue where text_value like '%library.saylor.org%';

14000+ results
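The count, update, count again pattern in step 3 can also be wrapped in a script that refuses to commit if any mention of the old hostname survives. This is only a sketch: it uses Python's built-in sqlite3 with a one-column stand-in for metadatavalue so it runs anywhere, whereas the real table lives in PostgreSQL and has more columns. The helper names and the sample handle URL are made up for illustration.

```python
import sqlite3

OLD = "https://saylor.longsight.com"
NEW = "https://library.saylor.org"


def count_mentions(conn, needle):
    """Number of rows whose text_value contains needle."""
    row = conn.execute(
        "SELECT count(*) FROM metadatavalue WHERE text_value LIKE ?",
        (f"%{needle}%",),
    ).fetchone()
    return row[0]


def migrate(conn, old=OLD, new=NEW):
    """Replace old with new; roll back if the old hostname is still present."""
    old_host = old.split("://", 1)[-1]
    try:
        conn.execute(
            "UPDATE metadatavalue SET text_value = replace(text_value, ?, ?) "
            "WHERE text_value LIKE ?",
            (old, new, f"%{old}%"),
        )
        if count_mentions(conn, old_host) != 0:
            raise RuntimeError("old hostname still present, rolling back")
        conn.commit()
    except Exception:
        conn.rollback()
        raise


def demo_connection():
    """In-memory stand-in for the real database (assumption, for illustration)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE metadatavalue (text_value TEXT)")
    conn.executemany(
        "INSERT INTO metadatavalue VALUES (?)",
        [("https://saylor.longsight.com/handle/1/2",), ("An unrelated title",)],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    conn = demo_connection()
    migrate(conn)
    print(count_mentions(conn, "saylor.longsight.com"))
    print(count_mentions(conn, "library.saylor.org"))
```

The rollback check matters because the replace only matches the https:// form, so a stray http:// link would otherwise be left behind silently.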

4) Reindex Discovery
Because you've changed metadata outside of the system, no Events were fired, so you'll have to manually force DSpace to refresh its metadata index.

bin/dspace index-discovery -b

5) Regenerate Sitemaps
Your sitemaps (for search engines) are outdated and still link to your old domain. Re-run the sitemap generator so that search engines crawl your site with the updated URLs.

bin/dspace generate-sitemaps

6) Measure success of Google picking up the redirect

Search: site:saylor.longsight.com, there are 65,900 results

https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=site%3Asaylor.longsight.com

Search: site:library.saylor.org, there is 1 result.

https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=site%3Alibrary.saylor.org&qscrl=1

It will take several weeks for the search engines to recrawl the site and update their search index.

I add Google Analytics and Google Webmaster Tools, and re-upload the sitemap, to ensure the robots pick this up ASAP.

Friday, November 21, 2014

How to replace all tab and newline characters in a Google Docs Spreadsheet

I've gotten a spreadsheet that is riddled with tab characters and newlines. It's so bad that the reader of this spreadsheet can't process it. So, if I could just remove all tab characters and newline characters from the spreadsheet, I'd be golden.

My first thought was: how do I paste in a tab character or a newline character? What's the five-sequence command to do that? Well, Google Sheets has a much easier way: regex.

"\t" is regex for tab, and "\n" for newline.

Find: \n (to find newlines). Replace with: a space. Make sure regular expressions is checked.

Find: \t (to find tabs). Replace with: a space. Make sure regular expressions is checked.
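If the data has already left the spreadsheet, the same two replacements can be done in a few lines of Python. This is just a sketch of the equivalent regex, with function names of my own choosing; like the find-and-replace above, it swaps each tab or newline for a single space.

```python
import re


def flatten_whitespace(cell):
    """Replace each tab or newline character in a cell with a single space."""
    return re.sub(r"[\t\n]", " ", cell)


def clean_rows(rows):
    """Apply flatten_whitespace to every cell of a list of rows."""
    return [[flatten_whitespace(cell) for cell in row] for row in rows]


if __name__ == "__main__":
    print(clean_rows([["first\tline", "second\nline"]]))
```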

I found a full list of regular expression characters at: https://help.libreoffice.org/Common/List_of_Regular_Expressions

Thursday, October 09, 2014

Play Framework: IntelliJ IDEA cannot find declaration to go to

I'm doing a Play! Framework project, and all of a sudden, IntelliJ IDEA forgets about everything, doesn't provide any autocomplete / intellisense, doesn't check syntax, doesn't check my imports, nada, kinda just a text-editor like Sublime at that point.

What I'm using:
- play! framework 2.3.5
- IntelliJ IDEA 13.5.1
- SBT 0.13.5
- Mix of Java and Scala
- Activator UI
- OS X Mavericks

Basically, my issue/symptom is: IntelliJ cannot find declaration to go to in Play Framework, and provides no autocomplete / syntax check support.

Stack Overflow had me Invalidate Caches and Restart. Ehh, that wasn't enough.

Solution: Specify JDK for Scala to JDK 1.7

I'm on OSX, so Linux/Windows users will have to ad-lib.
IntelliJ -> Preferences -> IDE Settings -> Scala -> JVM SDK
Mine was oddly set to , thus nothing worked, so I flipped it to JDK 1.7, and then re-ran Invalidate Caches and Restart. A few minutes later, and I'm back in business.

Friday, October 03, 2014

Using PDFBox to create a PDF and PdfLayoutManager

I'm looking for documentation for creating a PDF with Apache PDFBox, and I'm hitting some limits. Either I'm not figuring out how to use this tool, or it doesn't have APIs for drawing what should be basic things.

So, there's a project from Glen Peterson to add PdfLayoutManager, which should be contributed upstream to PDFBox. Anyways, I was testing out his additions to the project, and here's the PDF it generates (I've removed the image it adds to the PDF, since I didn't want to include the resource bundle):

Here is a series of screenshots of the output of this. It can wrap text, make tables, draw shapes, insert an image, and specify colors.

Wednesday, September 24, 2014

DSpace: Harvesting an external collection using OAI-ORE

Would you like to create a collection in DSpace that automatically mirrors content from some other source? Well, if that external source supports OAI, you're in luck. First, a quick primer: OAI has two modes, OAI-PMH (metadata only) and OAI-ORE (which also gets the bitstreams / content files). If you only need metadata-only records, you'll be fine with OAI-PMH. If you also want to reference the bitstreams, or store them in DSpace, then hopefully your data provider supports OAI-ORE. DSpace, by the way, supports both OAI-PMH and OAI-ORE, so you can harvest a DSpace collection and get both metadata and files.

First: Create a new Collection in DSpace.

Second: Edit Collection - Content Source - OAI Provider
In DSpace, go to Edit Collection, then click the tab for Content Source.
Then choose the option "This collection harvests its content from an external source".
Once you save, you can enter the OAI provider base URL and the set ID. There is also an option to choose "Harvest Metadata Only" or, if the data source supports ORE, to either keep a reference to the files or have DSpace download the files and store them in DSpace.

Once you Save, then you get the option to "Import Now".

Import Now will import this right now. Reset and Reimport will delete the previously harvested contents, and reimport.
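Before pointing a collection at a provider, it can help to open the equivalent OAI-PMH request in a browser and confirm that the base URL and set ID return records. Here is a small sketch that builds a standard ListRecords URL. The base URL and set ID below are hypothetical placeholders, and oai_dc is used because every OAI-PMH provider is required to support it.

```python
from urllib.parse import urlencode


def list_records_url(base_url, set_spec, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request for one set of a provider."""
    query = urlencode(
        {
            "verb": "ListRecords",
            "metadataPrefix": metadata_prefix,
            "set": set_spec,
        }
    )
    return f"{base_url}?{query}"


if __name__ == "__main__":
    # Hypothetical provider and set ID, for illustration only.
    print(list_records_url("https://demo.example.org/oai/request", "col_123456789_2"))
```

If the response comes back empty or with an error, the harvest in DSpace will not fare any better, so this is a cheap first check.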

You can also see all of the collections that have OAI Harvesting enabled from your Control Panel:

Lastly, if you find yourself harvesting from a source that regularly updates its contents, and you want to pick up those changes regularly, then set up a cron task to have the DSpace command line harvest the collection each day.

peterdietz:dspace peterdietz$ /dspace/bin/dspace harvest --start
Starting harvest loop... running.

Thursday, September 11, 2014

DSpace Additions: Author page and Altmetric statistics badge

It's a mixture of big things and little things that can add additional value to your DSpace site. Two interesting additions that I've recently stumbled upon are: Researcher Pages, and Altmetric statistics badge.

Researcher Pages

A project between @Mire and The World Bank's Open Knowledge Repository is adding author pages to DSpace. Thus far, it appears that a page shows the author's name, a photo of the author, their biography, and a list of the items in DSpace that they authored.

This is in use at: https://openknowledge.worldbank.org/author-page?author=Abras%2C+Ana+Luisa

Altmetric Statistics Badge

For articles that have a DOI, you can integrate with the Altmetric statistics service to display a badge of alternative usage of that article. Altmetrics are things like people citing the paper, mentioning it on a social network or blog, or adding it to their Mendeley library. I've seen this integrated into DSpace by Longsight's Sam Ottenhoff for the Marine Biological Laboratory / Woods Hole Oceanographic Institution Open Access Server.

See DSpace and Altmetric in use at: https://darchive.mblwhoilibrary.org/handle/1912/6598

Monday, August 25, 2014

DSpace OAI profiles

By default, OAI-PMH in DSpace will share all of your publicly accessible Items. If you want to restrict or modify the set of results that get shared, you have to customize the output. Luckily, recent versions of DSpace have an easily modifiable configuration that essentially gives you "profiles" in OAI.

The default profile is called "request". It doesn't filter the results, and it allows harvesting in many different metadata formats. Note: only publicly accessible items/objects can be disseminated through OAI.

The other profiles in DSpace are OpenAIRE (Open Access Infrastructure for Research in Europe) and DRIVER (Digital Repository Infrastructure Vision for European Research). By default your repository won't disseminate any objects in OpenAIRE or DRIVER format because the filters in place require some specific metadata to be collected for those profiles/guidelines.

https://github.com/DSpace/DSpace/blob/dspace-4_x/dspace/config/crosswalks/oai/xoai.xml#L33

The DRIVER profile declares a number of filters, which restrict the items that disseminate under that profile, to match the requirements of DRIVER. In this case the filters will require: that there is a title (dc.title), that there is an author (dc.contributor.author), that the document type (dc.type) is one of article, thesis, book, etc, also that dc.rights is equal to "open access", and lastly that there is a publicly accessible bitstream, hopefully that means that the full text is available.
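To make the combined effect of those filters concrete, here is a rough sketch of the same checks as a single Python predicate. This is not how XOAI implements its filters. The item dictionary layout, the partial list of document types, and the case-insensitive comparison are all my own assumptions for illustration.

```python
# Partial list: the profile accepts more document types than these three.
DRIVER_TYPES = {"article", "thesis", "book"}


def passes_driver_filters(item):
    """Return True if a (hypothetical) item dict meets the DRIVER conditions."""
    metadata = item.get("metadata", {})
    has_title = bool(metadata.get("dc.title"))
    has_author = bool(metadata.get("dc.contributor.author"))
    # Assumption: compare types and rights case-insensitively.
    type_ok = any(t.lower() in DRIVER_TYPES for t in metadata.get("dc.type", []))
    open_access = any(
        r.lower() == "open access" for r in metadata.get("dc.rights", [])
    )
    has_public_bitstream = bool(item.get("public_bitstreams"))
    return all([has_title, has_author, type_ok, open_access, has_public_bitstream])


if __name__ == "__main__":
    example = {
        "metadata": {
            "dc.title": ["A paper"],
            "dc.contributor.author": ["Dietz, Peter"],
            "dc.type": ["Article"],
            "dc.rights": ["open access"],
        },
        "public_bitstreams": ["paper.pdf"],
    }
    print(passes_driver_filters(example))
```

An item missing any one of the five conditions is simply left out of the DRIVER results, which is why a fresh repository typically shows nothing under that profile.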

So, in case you wanted to customize your default "request" profile to restrict the output to all items in the repository that also had full-text available, you would customize:

 <context baseurl="request">  
 To add:  
 <filter refid="bitstreamaccessFilter"/>
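If you maintain several instances, the same change could be scripted. The sketch below adds a filter reference to a context using Python's ElementTree. The sample XML is a heavily simplified stand-in that I made up: it uses the lowercase element names shown above and has no XML namespace, so check it against your actual xoai.xml before relying on it.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical stand-in for xoai.xml (no namespace, one context).
SAMPLE = """<configuration>
  <contexts>
    <context baseurl="request">
      <format refid="oaidc"/>
    </context>
  </contexts>
</configuration>"""


def add_filter(xml_text, baseurl, filter_id):
    """Add <filter refid=filter_id/> to the matching context, if not present."""
    root = ET.fromstring(xml_text)
    for context in root.iter("context"):
        if context.get("baseurl") != baseurl:
            continue
        already = any(
            f.get("refid") == filter_id for f in context.findall("filter")
        )
        if not already:
            context.insert(0, ET.Element("filter", {"refid": filter_id}))
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(add_filter(SAMPLE, "request", "bitstreamaccessFilter"))
```

Calling it twice leaves a single filter element, so it is safe to re-run.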

In addition to this information about DSpace OAI profiles, I did run into some bugs or potential issues in the DSpace XOAI code base. For one, there are two modes to run DSpace XOAI in: database mode, where the database responds to all OAI queries, and a performance-optimized mode, where Solr indexes your repository. One of the bugs was that the Solr mode had a slightly different interpretation of "bitstreamaccessFilter": the database mode required that there was a bitstream in the original bundle, while the Solr mode only required that the item was public. To correct this, I've patched our code at Longsight and have contacted the XOAI author to confirm and test the issue.